Japanese Version

Algebraic structure of non-identifiable learning machines :

E-mail: swatanab@pi.titech.ac.jp

(1) A Problem:

Let us consider a learning machine represented by a parametric probability distribution p(x,w), where x is an input and w is a parameter. Here p(x,w) estimates the true distribution q(x), using a set of empirical samples {xi;i=1,2,3,...,n } independetly taken from q(x). Let p( w | {xi} ) be the Bayesian a posteriori distribution of the parameter w under the condititon that samples {xi} are given. The estimated probability density p( x | {xi}) is defined by

p( x | {xi} ) = \int p( x | w) p ( w | {xi} ) dw.

Our main purpose is to clarify how fast the estimated learner p( x | {xi} ) attains the true distribution q(x). We define the average Kullback distance or the average relative entropy,

K(n) = E { \int q(x) \log [ q(x) / p(x | {xi} ) ] dx },

where E represents the expectation value over all sets of empirical samples. We want to know the asymptotic expansion of $K(n)$ for the case n tends to infinity.

Remarks:

(1) Note that K(n)=0 if and only if p(x | {xi} ) = q(x).
(2) In statistical learning theory, K(n) is often called Learning Curve' or Generalization Error'.
(3) I heard that physists often omit E by chanting a spell "Self-Averaging !".

(2) Stochastic Complexity and Kullback Distance :

The stochastic complexity or the Free energy of a learning machine is defined by

F(n) = - E { log \int exp( - H_{n}(w) ) r(w)dw },

where H_{n}(w) is the empirical Kullback information or the random Hamiltonian,

H_{n}(w)=\sum_{i=1}^{n} \log [ q(xi) / p(xi,w) ].

The stochastic complexity F(n) plays a central role in information theory, Bayesian statistics, statistical physics, and statistical learning theory. For example, Amari and Murata [1] showed that the average Kullback distance is equal to the increase of the stochastic complexity,

K(n)=F(n+1)-F(n),

and proved that, if the learning model p(x,w) satisfies the regularity condition for the asymptotic normality of the maximum likelihood estimator, then

Regular Models: K(n)=d/2n + o(1/n),

where d is the number of parameters and n is the number of training samples. However, it has been left unclear how fast K(n) goes to zero for the case when p(x,w) is not a regular model. This problem was left unknown even in mathematical statistics and mathematical physics. Dr. Hagiwara and Dr. Fukumizu proved that a lot of complex learning machines are not regular models, in general. For example, multi-layer perceptrons, radial basis functions, and mixtuies of normal distributions do not satisfy the regularity condition for the asymptotic normality of the maximum likelihood estimator. The sets of true paramters in such models are analytic sets or anlgebraic varieties with singularities. It should be emphasized that H_{n}(w) can not be approximated by any quadratic form.

(3) Our recent results:

Assume that p(x,w) is analytic for w. We obtained the complete asymptotic form of F(n)

F(n)= A log n - (B-1) log log n + const.,

where A is a positive rational number and B is a natural number. (A is not larger than d/2, and B is not larger than d, where d is the number of parameters.) Our result claims that, if K(n) can be asymptotically expanded, then

K(n)=A/n - (B-1)/ ( n log n ) .

Let H(w) be the Kullback information for a parameter w,

H(w)= \int q(x) log [ q(x)/ p(x,w)] dx.

In order to prove the asymptotic expansion of F(n), we need the Sato-Bernstein's b-function [M2]. The b-function is the polynomial b(z) that satisfies the relation

P(z,w,dw)H(w)^{z+1}=b(z)H(w)^{z},

where P(z,w,dw) is a finite order differential operator for w with coefficients of holomorphic functions and a polynoimial for z. The existence of b-function was proved by Sato, Bernstein, Bjork, and Kashiwara. An algorithm to calculate b-function was developed by Oaku [M3].

The constants -A and B are important in real world computing. They are characterized as the maximum pole and its multiplicity of the complex function

J(z) = \int H(w)^{z} r(w)dw.

The algorithm to calculte A and B is also established by employing resolution of singularities in algebraic geometry, the famous theorem by Hironaka [M1]. In fact, based on the resolution theorem, we can find a d-dimensional real manifold U and a resolution map g: U -> W, such that locally near each point in U,

H(g(u))=a(u)u_{1}^{k1}u_{2}^{k2} *** u_{d}^{kd},

where a(u) is an invertible function, and k1,k2,...,kd are non-negative even integers. The resolution of singularites enables us to calculate A and B. The resolution map g(u) can be found by using the finite times blowing ups of the non-singular manifolds in singularities. I heard that Hironaka's theorem not only proves the existence of the desingularization map but also gives an algorithm to compute it constructively. Also I heard that we can find the Grobner basis firstly in his proof, which is referred to as Hironaka's standard basis.

Here, we consider a learning machine with high complexity, and its learning can be clarified based on some results in modern mathematics. We expect that this result is the mathematical foundation in the research of computational intelligence.

(4) Problem for the future:

Our result shows that the Bayesian estimation case is clarified essentially. However, the maximum likelihood case is not yet clarified. Dr. Hagiwara in Mie University proved that the average expectation value is larger than (log n)/n, if the input distrubution is given as the finite sum of delta functions [3]. Dr. Fukumizu in Institute of Mathematical Statistics proved that the average expectation value of the reduced rank approximation is characterized as the average of the eigen values of random matrices subject to Wishart distribution [2]. The maximum likelihood case for the general complex learning machines is the important problem for the future.

Related Papers:

[1] Amari, S. and Murata, N., (1993) Statistical Theory of learning curves under entropic loss criterion. Neural Computation, 5, 140-153.
[2] Fukumizu, K., (1999) Generalization error of linear neural networks in unidentifiable cases. Lecture notes on computer sciences, 1720, 51-62.
[3] Hagiwara, K. Kuno, K, and Usui, S., (1998) Degenerate Fisher information matrix and average expectation error of neural networks, Symposium on statistical estimation and information theory, 95-102.
[4] Watanabe, S., (1998) On the generalization error of layered statistical models with Bayesian estimation. IEICE Trans. J-81-A, 1442-1552. The English version is to appear in Electronics and Communications in Japan, John Wiley and Sons.
[5] Watanabe, S., (1999) Algebraic analysis for singular statistical estimation. Lecture Notes in Computer Sciences, Vol.1720, pp.39-50.
[6] Watanabe,S., (2000) Algebraic analysis for non-regular learning machines. Neural Information Processing Systems. Vol.12, pp.356-362, MIT Press.
Article, Postscript file, gzipped
Correction of mistake, Postscript file, gzipped
[7] Watanabe, S., (2000) Algebraic analysis for non-idetifiable learning machines. to appear in Neural Computation.
Article, Postscript file, gzipped
[8] Watanabe, S.,(2001) Algebraic Information Geometry for Learning Machines with Singularities. to appear in Advances in Neural Information Processing Systems.
Article, Postscript file, gzipped

Related Pure Mathematics:

[M1] Hironaka, H., (1964) Resolution of singularities of an algebraic variety over a field of characteristic zero. Ann. of Math. 79, 109-326.
[M2] Kashiwara, M., (1976) B-functions and holonomic systems. Inventions Math., 38, 33-53.
[M3] Oaku, T., (1997) Algorithms for the b-function and D-modules associated with a polynomial. J. Pure and Appl. Alg., 117, 495-518.