Pavel Kovanic
1 Introduction
Mathematical Gnostics is a non-statistical approach to the efficient treatment of small samples of strongly uncertain data. It consists of:
- The axiomatic theory of uncertainty of individual uncertain data and small data samples,
- Numerical characteristics of data uncertainty resulting from this theory,
- Algorithms and programs estimating true data values along with characteristics of their uncertainty.
This approach resulted from long-term research at the Institute of Information Theory and Automation of the Czechoslovak Academy of Sciences in Prague. The theory was published by P. Kovanic (1984, 1986). Pragmatic readers who are not interested in the theoretical foundations of the approach may skip the following two abstract sections and go directly to the algorithms and applications.
2 The Gnostic Theory of Individual Data
The theoretical model of the uncertainty of individual data uses the axioms of measurement theory formulated by von Helmholtz (1887) as a system of conditions necessary and sufficient for consistent quantification (counting or measuring) of real quantities. Measurement theory considers only uncertainty-free quantification and leaves possible errors to statistical treatment. In contrast, the Gnostic model of uncertain quantification is two-dimensional: it applies these conditions to both components (true and uncertain) of the measured quantity. The fundamental assumption of the theory is thus that the real data to be treated were obtained by orderly (consistent) measurement or counting. No assumptions of a statistical nature are used.
The consistency conditions, accepted as the first axiom of the theory, lead to the representation of a data item as an element of a commutative bi-algebra, which induces a Minkowskian metric on the data space. A surprising consequence of this metric can be interpreted as maximization of the “damage” that Nature’s (uncertain) contribution causes to the quantified data value, by moving the data item along a special (extreme) path in the Minkowskian space. To counter this damage, the analyst looks for a way of minimizing the uncertainty, that is, for the best possible path back to the hidden true value. The form of the former (“quantifying”, Minkowskian) path, together with the latter (“estimating”, Euclidean) opposite path, justifies the introduction of two pairs of non-linear uncertainty measures: the quantifying and the estimating data weight and data irrelevance. Using a plausible Gedankenexperiment, close connections to the classical (Clausius', pre-statistical) concept of thermodynamic entropy are shown. These connections allow the entropy change caused by the uncertainty of an individual data item to be evaluated and the corresponding information of the data item to be estimated.
Both the quantifying and the estimating versions of these quantities reach extreme values when the quantifying and estimating processes follow the paths mentioned above, which together form the closed Gnostic Cycle. When this Cycle is traversed, the entropy change causes the information change and vice versa. An important but natural principle is proved: the effect of a finite contribution of uncertainty to a data item value can be minimized by optimum estimation, but it can never be completely removed. This result can be interpreted as an information complement to the Second Law of Thermodynamics.
Gnostic data weights and irrelevances exhibit natural robustness of two kinds: the estimating characteristics are robust with respect to outliers (peripheral data), while the quantifying ones are robust with respect to inliers (central data and noise) of the data sample. This feature makes them suitable for strongly uncertain data. However, it can be shown that all these characteristics converge to statistical ones when the uncertainty is very weak: data irrelevances converge to ordinary (Euclidean) errors and data weights converge to squared errors. This presents classical (non-robust) statistics as a tool for handling “relatively good” data and Gnostics as its robust extension, with its own, different theoretical foundation that justifies its application to small data samples.
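The two kinds of behaviour described above can be illustrated numerically. The following is a minimal sketch, assuming the hyperbolic forms commonly given for the four characteristics (they are not stated in this text): with Ω = ln(Z/Z0)/S for a positive datum Z, true value Z0 and scale S, the quantifying weight and irrelevance are cosh 2Ω and sinh 2Ω, and the estimating weight and irrelevance are 1/cosh 2Ω and tanh 2Ω. The function name and the choice S = 1 are illustrative.

```python
import math

def gnostic_characteristics(z, z0, scale=1.0):
    """Return (fj, hj, fi, hi) for a positive datum z and true value z0.

    Assumed forms (not taken from the text above): with
    omega = ln(z/z0)/scale, fj = cosh(2*omega), hj = sinh(2*omega),
    fi = 1/fj and hi = tanh(2*omega).
    """
    omega = math.log(z / z0) / scale
    fj = math.cosh(2 * omega)   # quantifying weight
    hj = math.sinh(2 * omega)   # quantifying irrelevance
    fi = 1.0 / fj               # estimating weight
    hi = math.tanh(2 * omega)   # estimating irrelevance
    return fj, hj, fi, hi

# A nearly exact datum, a moderately wrong one and a gross outlier.
for z in (1.001, 1.5, 100.0):
    fj, hj, fi, hi = gnostic_characteristics(z, z0=1.0)
    print(f"z={z:8.3f}  fj={fj:10.4f}  hj={hj:10.4f}  fi={fi:.6f}  hi={hi:.6f}")
```

Under these assumptions the quantifying pair satisfies fj² − hj² = 1 (Minkowskian) and the estimating pair satisfies fi² + hi² = 1 (Euclidean). The estimating pair stays bounded however far Z lies from Z0, while the quantifying pair grows without bound. For Z close to Z0 the sketch gives hi ≈ 2Ω and 1 − fi ≈ 2Ω².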
3 The Gnostic Theory of Small Data Samples
In statistics, samples of about 30 data items are conventionally considered large enough for the Central Limit Theorem to apply at least approximately. Unfortunately, some applications do not provide the analyst with samples of this size or larger. Moreover, many tasks solved by statistics rest on a priori assumptions about the statistical model of the data, which can be neither tested nor justified with small data samples.
In contrast, the Gnostic model of individual data uncertainty is based only on the assumption that the underlying data are real, i.e. obtained by a consistent measuring process. To pass from the uncertainty of an individual data item to the uncertainty of a data sample, one needs a well-justified Aggregation Law for uncertain data and their characteristics. Fortunately, the Minkowskian metric proved for the space of the quantification process has a surprising consequence: there exists a Lorentz-invariant, linear isomorphism between each quantifying pair of data weight and data irrelevance and the corresponding energy-momentum pair of a charge-free relativistic particle. The relativistic composition law is well justified by the relativistic Energy and Momentum Conservation Law, which is additive. The requirement of consistency between quantification and estimation motivates accepting the additive Composition Law for both the quantifying and the estimating data weight and data irrelevance (as the second axiom of the theory).
Data weights and irrelevances are parameterized by the ratio of the observed to the true data value. However, the true data value is unknown and must be estimated using all items of the data sample. It is estimated by maximizing the information of the sample's aggregated data items.
Accepting the data weights and irrelevances (which depend non-linearly on the data values) as uncertainty measures, one uses a geometry of the Riemannian type, valid in a curved space. However, the curvature of the space and its geometry are determined not subjectively by the analyst, but objectively by the observed data. This is how Gnostics satisfies the requirement “Let the data speak for themselves”.
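The estimation of the unknown true value from an aggregated sample can be sketched as follows. This is only an illustration under stated assumptions: the estimating weight of each datum is taken as 1/cosh(2 ln(Z/Z0)/S), the weights are aggregated additively (as an arithmetic mean), and the location is chosen as the Z0 with the largest mean weight. That criterion is a simple stand-in for the information-maximizing criterion, which this text does not spell out. The function names, the scale S = 1 and the sample values are hypothetical.

```python
import math

def mean_estimating_weight(z0, data, scale=1.0):
    """Additively aggregated estimating weight of the sample at a trial z0.

    Assumed form of the individual weight: 1/cosh(2*ln(z/z0)/scale).
    """
    total = sum(1.0 / math.cosh(2 * math.log(z / z0) / scale) for z in data)
    return total / len(data)

def locate(data, scale=1.0, steps=2000):
    """Grid search on a logarithmic grid for the z0 with the largest mean weight.

    Stand-in criterion (assumption), not the criterion of the Gnostic programs.
    """
    lo, hi = math.log(min(data)), math.log(max(data))
    grid = (math.exp(lo + (hi - lo) * i / steps) for i in range(steps + 1))
    return max(grid, key=lambda z0: mean_estimating_weight(z0, data, scale))

# Four consistent readings and one gross outlier (made-up values).
sample = [9.8, 10.1, 10.0, 9.9, 50.0]
estimate = locate(sample)
arithmetic_mean = sum(sample) / len(sample)
print("estimate:", round(estimate, 3), " arithmetic mean:", round(arithmetic_mean, 3))
```

Because the weight of a distant datum is small at any trial Z0 near the main body of the data, the outlier barely shifts the estimate, whereas it pulls the arithmetic mean far from the consistent readings.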
4 Gnostic Algorithms
The theory has always been developed in close interaction with the verification, in applications, of algorithms based on the Gnostic formulae. There are two classes of analytic tasks: marginal (one-dimensional) and multi-dimensional analysis. Solving them with the Gnostic algorithms requires only data; no statistical model assumptions are used.
The Gnostic one-dimensional analysis is based on four non-standard types of probability distribution functions and their densities: the ELDF (Estimating Local), EGDF (Estimating Global), QLDF (Quantifying Local) and QGDF (Quantifying Global) distribution functions. The estimating functions are robust with respect to outliers, the quantifying ones with respect to inliers. The flexibility of the ELDF can be controlled by the scale parameter to reveal details of the structure of a non-homogeneous data sample. In contrast, the EGDF is relatively rigid and provides an overall view of a homogeneous data sample. Its rigidity allows not only the “ordinary” tasks (estimation of probabilities and quantiles) but also some special tasks to be solved robustly:
- Estimation of the bounds of the data domain (data support),
- Objective estimation of the membership in a sample,
- Estimation of the left- and right-censored and interval data,
- Reliable testing of data homogeneity,
- Reliable one- and two-sample hypothesis testing,
- Estimation of covariances and correlations,
- Robust filtering of data,
- Probabilistic prediction and monitoring of processes,
- Automatic data exploration and classification robustly providing the detailed information on data features and on their quality.
The quantifying distribution functions can be applied to advantage to data contaminated by noise that partly masks the larger “signals”, where they perform analogous tasks in a “noise-robust” manner.
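The contrast between the flexible local and the rigid global estimating distribution function can be sketched in a few lines. The forms used here are assumptions, not taken from this text: the ELDF at a point Z0 is the mean of (1 − hi)/2 over the sample, and the EGDF is (1 − h̄/√(f̄² + h̄²))/2, where hi = tanh(2 ln(Z/Z0)/S), fi = 1/cosh(2 ln(Z/Z0)/S) and the bars denote sample means. The data and scale values are made up.

```python
import math

def _omegas(z0, data, scale):
    """Log-ratio uncertainties of all data relative to the point z0."""
    return [math.log(z / z0) / scale for z in data]

def eldf(z0, data, scale):
    """Local estimating distribution function (assumed form): mean of (1 - hi)/2."""
    oms = _omegas(z0, data, scale)
    return sum((1.0 - math.tanh(2 * om)) / 2.0 for om in oms) / len(oms)

def egdf(z0, data, scale):
    """Global estimating distribution function (assumed form).

    The mean irrelevance is normalized by the modulus of the mean
    weight and mean irrelevance before being mapped to a probability.
    """
    oms = _omegas(z0, data, scale)
    f_mean = sum(1.0 / math.cosh(2 * om) for om in oms) / len(oms)
    h_mean = sum(math.tanh(2 * om) for om in oms) / len(oms)
    return (1.0 - h_mean / math.hypot(f_mean, h_mean)) / 2.0

# Two clusters: a small scale lets the ELDF show them as two steps,
# while the EGDF with a larger scale gives one overall curve.
clustered = [0.9, 1.0, 1.1, 4.8, 5.0, 5.2]
for z0 in (0.5, 1.0, 2.2, 5.0, 10.0):
    print(z0, round(eldf(z0, clustered, 0.2), 3), round(egdf(z0, clustered, 1.0), 3))
```

With the small scale the ELDF is almost flat at one half between the two clusters, which is how a local function exposes the structure of a non-homogeneous sample. In this sketch the scale is simply passed in by hand; how the scale parameter is chosen in practice is outside its scope.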
The Gnostic multi-dimensional analysis is mainly based on robust identification of several types of regression models. All of them use a Gnostic “influence” function to maximize the information of the results within the method of Iterated Weighted Least Squares. Both explicit (ordinary) and implicit forms of the regression model are solved, along with regression in probabilities, which expresses the interdependence of the data probabilities instead of the data themselves. Each type offers some advantages and has reasonable applications.
Supported by a coherent theory, these algorithms are applicable not only under special assumptions but more universally and objectively, because everything needed for data processing is determined by the data. The optimality of the results consists in maximizing the information extracted from the data.
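The reweighting loop behind the robust regression mentioned above can be sketched for a straight line. This is a generic iterated weighted least squares example, not the Gnostic algorithm itself: the weight function sech²(2r/S) of the residual r is a hypothetical bounded choice that borrows the shape of an estimating weight, and it is not claimed to be the influence function used by the Gnostic programs. The data, the scale S = 1 and the iteration count are made up.

```python
import math

def weighted_line_fit(xs, ys, ws):
    """Closed-form weighted least squares for the line y = a + b*x."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def irls_line(xs, ys, scale=1.0, iterations=50):
    """Refit the line repeatedly, down-weighting points with large residuals."""
    a, b = weighted_line_fit(xs, ys, [1.0] * len(xs))  # start from ordinary LS
    for _ in range(iterations):
        residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
        # Hypothetical bounded weight (assumption): sech^2(2*r/scale).
        ws = [1.0 / math.cosh(2 * r / scale) ** 2 for r in residuals]
        a, b = weighted_line_fit(xs, ys, ws)
    return a, b

# Points near y = 2 + 0.5*x with small alternating errors and one gross outlier.
xs = list(range(10))
ys = [2.0 + 0.5 * x + (0.05 if x % 2 == 0 else -0.05) for x in xs]
ys[9] += 20.0

ols_a, ols_b = weighted_line_fit(xs, ys, [1.0] * len(xs))
rob_a, rob_b = irls_line(xs, ys)
print("ordinary LS:", round(ols_a, 3), round(ols_b, 3))
print("reweighted :", round(rob_a, 3), round(rob_b, 3))
```

The ordinary fit is dragged towards the outlier, while the reweighted fit assigns it a negligible weight after the first pass and settles close to the line followed by the remaining points.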
5 Applications
Gnostic programs have been applied successfully in many fields of science and technology:
- Environmental control: analysis of pollutants in water, air and human organisms, and of toxicity.
- Economics: financial statement analysis, market predictions, marketing, objective rating.
- Medicine: objective diagnostic limits, decision making support, hypotheses testing.
- Quality assessment and control in the chemical industry and mechanical engineering.
- Survival models of parts of heavy trucks.
- “Cleanroom” control for production of highly integrated chips.
- Identification of acoustic signals.
- Image treatment (simultaneous noise suppression and contour enhancement).
- Estimation of particle size distribution in atmospheric aerosols (hroch486.icpf.cas.cz/wagner/gnostics).
- Archaeology: analysis of economic impacts on historical coins.
6 Examples
Examples of many applications are described in items 7, 8 and 10 of the Literature list below.
Literature
1. Kovanic P.: Gnostical Theory of Individual Data, Problems of Control and Information Theory 13 (1984), 4, 259–274
2. Kovanic P.: Gnostical Theory of Small Samples of Real Data, Problems of Control and Information Theory 13 (1984), 5, 303–319
3. Kovanic P.: On Relations between Information and Physics, Problems of Control and Information Theory 13 (1984), 6, 383–399
4. Kovanic P.: A New Theoretical and Algorithmical Basis for Estimation, Identification and Information, The IX-th World Congress IFAC '84, Preprints, IFAC Budapest (1984), Vol. XI, 122–131
5. Kovanic P.: A New Theoretical and Algorithmical Basis for Estimation, Identification and Control, Automatica IFAC 22 (1986), 6, 657–674
6. Helmholtz H. von: Zaehlen und Messen erkenntniss-theoretisch betrachtet, in Philosophische Aufsaetze Eduard Zeller gewidmet, Leipzig (1887), 17–52
7. Kovanic P., Humber M.B.: The Economics of Information (Mathematical Gnostics for Data Analysis), (2003), 707 pp.
8. Wagner Z., Kovanic P.: Advanced Data Analysis for Industrial Applications, Modelling Smart Grids 2015, Prague, September 10–11, 2015
9. Kovanic P.: The Mathematical Gnostics (Advanced Data Analysis), Proceedings IPMU 2016, Communications in Computer and Information Science 610, Part I, p. 177, Springer Verlag (2016)
10. Kovanic P.: Advanced Data Analysis by Mathematical Gnostics, presentation at IPMU 2016, June 21, 2016, Eindhoven