Singularities in Learning Machines








Algebraic Geometry and Statistical Learning Theory

Sumio Watanabe, "Algebraic Geometry and Statistical Learning Theory," Cambridge University Press, August 2009.


In 1998, we discovered a mathematical relation between algebraic geometry and statistical learning theory.

Statistical estimation of a hidden structure is mathematically equivalent to the discovery of singularities in the manifold of statistical models.



The Kullback-Leibler information of a learning machine is a function from the set of parameters to the set of real numbers. In singular learning machines such as neural networks, normal mixtures, Bayesian networks, Boltzmann machines, reduced rank regressions, hidden Markov models, and stochastic context-free grammars, the KL information has singularities. In neighborhoods of singularities, the log likelihood cannot be approximated by any quadratic form of the parameters, hence the conventional statistical asymptotic theory does not hold.
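
(Example) A standard toy illustration: consider the regression model y = abx + (standard Gaussian noise), with parameters (a,b) and true parameter satisfying ab = 0. The KL information is

K(a,b) = (1/2) (ab)^2 E[x^2],

and the set {K = 0} is the union of the two lines {a = 0} and {b = 0}, which cross at the origin. On this set the Fisher information matrix is degenerate, and near the origin K is quartic, not quadratic, in (a,b), so the usual quadratic approximation of the log likelihood fails.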

Hironaka's desingularization theorem ensures that there exists a real analytic map from a manifold to the parameter set of a learning machine such that the KL information has only normal crossing singularities. Based on this theorem, the likelihood function can be written in a standard form, with the following results:

(1) the poles of the zeta function of a learning machine can be derived,

(2) the asymptotic forms of the stochastic complexity and the generalization error are clarified,

(3) the Bayes and Gibbs generalization errors can be estimated from the Bayes and Gibbs training errors without any knowledge of the singularities, and

(4) the symmetry of the generalization and training errors in the maximum likelihood method is proved.

The coefficient of the main term of the stochastic complexity is mathematically related to the roots of the Bernstein-Sato b-function, to the log canonical threshold, and to the complex singularity exponent, which is given by the largest pole of the zeta function of a learning machine.
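
In each local coordinate of the resolution map w = g(u), the standard form is

K(g(u)) = u_1^{2k_1} u_2^{2k_2} ... u_d^{2k_d}, |g'(u)| \phi(g(u)) = b(u) u_1^{h_1} ... u_d^{h_d}, b(u) > 0,

with nonnegative integers k_j, h_j. Consequently the zeta function \zeta(z) = \int K(w)^z \phi(w) dw has only rational poles, and the learning coefficient is

\lambda = min over local coordinates of min_j (h_j + 1)/(2 k_j),

with multiplicity m equal to the maximal number of indices j attaining this minimum.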

Sumio Watanabe, "Algebraic analysis for singular statistical estimation," Lecture Notes in Computer Science (Springer), Vol.1720, pp.39-50, 1999.

Sumio Watanabe,"Algebraic analysis for non-regular learning machines," Advances in Neural Information Processing Systems, Vol.12, 356-362,2000.

Sumio Watanabe,"Algebraic Analysis for Nonidentifiable Learning Machines," Neural Computation, (MIT Press), Vol.13, No.4, pp.899-933, 2001. This paper was communicated by Professor David Mumford.

Sumio Watanabe, "Algebraic geometrical methods for hiearchical learning machines," International Journal of Neural Networks, Vol.14, No.8,pp.1049-1060, 2001.

Sumio Watanabe,"Algebraic geometry of singular learning machines and symmetry of generalization and training errors," Neurocomputing, Vol.67, pp.198-213,2005.

Sumio Watanabe, "Almost all learning machines are singular," Invited Paper in IEEE International Symposium FOCI 2007.

Sumio Watanabe, "Equations of states in singular statistical estimation," arXiv:0712.0653, 2007.

The formula based on algebraic geometry has been applied to neural networks, Gaussian mixtures, Bayesian networks, hidden Markov models, Boltzmann machines, reduced rank regressions, and stochastic context-free grammars. In fact, Dr. Rusakov, Professor Geiger, Professor Aoyagi, and Professor Yamazaki are now building a new research field in machine learning. For example, Professor Aoyagi found the complete resolution map of reduced rank regression:


M. Aoyagi, S. Watanabe, "Stochastic complexities of reduced rank regression in Bayesian estimation," Neural Networks, Vol.18, No.7, pp.924-933, 2005.


As you know very well, pure mathematics is far from the real world. However, we found a bridge:

algebraic geometry = singularity theory = zeta function = hyperfunction = empirical process = likelihood function = Statistics

By using this bridge, we can predict the behavior of any learning machine based on resolution of singularities. It should be emphasized that the statistical properties of singular learning machines cannot be clarified without algebraic geometry or algebraic analysis.




(Remark) You can easily find the problems caused by singularities in almost all learning machines. In singular learning machines, the maximum likelihood method gives neither efficient estimation nor efficient hypothesis testing.

(Remark) Research on singularities was reported in the following early papers. In these papers, however, I had not yet arrived at the relation between algebraic geometry and learning theory.

* Sumio Watanabe, "A generalized Bayesian framework for neural networks with singular Fisher information matrices," Proc. of International Symposium on Nonlinear Theory and Its Applications (Las Vegas), pp.207-210, 1995.

* Sumio Watanabe,"On the essential difference between neural networks and regular statistical models," Proc. of Int. Conf. on Computational Intelligence and Neuroscience, Vol.2, pp.149-152, 1997.
*
Sumio Watanabe,"Inequalities of Generalization Errors for Layered Neural Networks in Bayesian Learning," Proc. of Int. Conf. on Neural Information Processing, pp.59-62, 1998.


After these studies, in 1998 I found that singular learning machines have a mathematical relation to algebraic analysis and algebraic geometry.




Zeta function
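
The zeta function of a learning machine is defined by

\zeta(z) = \int K(w)^z \phi(w) dw, z \in C,

where K(w) is the Kullback-Leibler information and \phi(w) is the a priori distribution on the parameter set.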




(Remark) By using the resolution of singularities, we can prove that the zeta function of a learning machine can be analytically continued to a meromorphic function on the entire complex plane. The largest pole of the zeta function determines the learning coefficient.




Zeta and Partition function
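
For n training samples X_1, ..., X_n, the partition function and the free energy (stochastic complexity) are

Z_n = \int \prod_{i=1}^{n} p(X_i|w) \phi(w) dw, F(n) = - log Z_n.

If -\lambda is the largest pole of the zeta function and m is its order, then

E[F(n)] = nS + \lambda log n - (m-1) log log n + O(1),

where S is the entropy of the true distribution.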




(Remark) It is well known that the average generalization error E[G] is given by

E[G] = E[F(n+1)] - E[F(n)] - S,

where S is the entropy of the true distribution. Asymptotically, the average generalization error is

E[G] = \lambda/n + o(1/n).

Here the constant \lambda is called the learning coefficient, which is determined algebraic-geometrically by the pair (statistical model, true distribution). The average generalization error, as a function of the number of training samples, is called the learning curve. Hence the learning curve is determined by the singularities.
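
(Example) For the toy model above with K(a,b) = (ab)^2/2 (taking E[x^2] = 1) and the uniform prior on [-1,1]^2, the zeta function can be computed by hand:

\zeta(z) = \int_{-1}^{1} \int_{-1}^{1} ((ab)^2/2)^z da db = 2^{-z} (2/(2z+1))^2.

The largest pole is z = -1/2 with order 2, hence \lambda = 1/2 and m = 2, and the learning curve is E[G] = 1/(2n) + o(1/n), smaller than the regular value d/(2n) = 1/n for d = 2 parameters. The free energy asymptotics can also be checked numerically; the following is a minimal sketch using SciPy (the helper name F0 is ours), which evaluates the noiseless free energy -log \int exp(-nK) da db and compares it with \lambda log n - (m-1) log log n:

    import numpy as np
    from scipy.integrate import dblquad

    def F0(n):
        # Noiseless free energy of K(a,b) = (a*b)^2 / 2 on [-1,1]^2
        # (the constant from the prior normalization is omitted).
        val, _ = dblquad(lambda b, a: np.exp(-n * (a * b) ** 2 / 2.0),
                         -1.0, 1.0, -1.0, 1.0)
        return -np.log(val)

    for n in [10**2, 10**3, 10**4, 10**5]:
        # Predicted growth: lambda*log(n) - (m-1)*log(log(n)) with lambda=1/2, m=2.
        pred = 0.5 * np.log(n) - np.log(np.log(n))
        print(n, round(F0(n), 4), round(pred, 4))

The difference between the two printed values converges to a constant as n grows, consistent with \lambda = 1/2 and m = 2.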




Singularities and Statistics, Introduction
Singularities and Statistics, More Detail