## Singular Learning Machines

In machine learning and statistical inference, we need to evaluate how precise a probabilistic model is. Let W be a set of probability distributions (or, equivalently, a set of statistical models or learning machines). The set W can often be identified with a manifold of parameters. The most fundamental question in statistics and machine learning is to clarify the probabilistic phenomena that arise when sample data are taken from the unknown probability distribution W_{0}. W_{0} is sometimes called an information source.

In regular statistical models, W_{0} consists of a single point and the Fisher information matrix is positive definite at W_{0}. However, in non-regular statistical models and learning machines, W_{0} is an analytic set with singularities or an algebraic variety. This is the main reason why algebraic geometry and algebraic analysis of the pair (W,W_{0}) are needed.

The following figure illustrates an example of the pair. In this figure, W={(a,b,c)} is 3-dimensional Euclidean space and W_{0}={(a,b,c)\in W; p(x|a,b,c)=q(x)}. If we introduce the Kullback-Leibler information K(a,b,c)=\sum_{x} q(x)(log q(x)-log p(x|a,b,c)), the set W_{0} is equal to the set of zeros of K(a,b,c), which is an analytic set with singularities.

If a learning machine has a hidden variable or a hierarchical structure, then it is singular in general. Singular learning machines are not special statistical models: almost all learning machines used in computational intelligence and information engineering are singular. For example, artificial neural networks, hidden Markov models, Bayesian networks, reduced rank regressions, stochastic context-free grammars, Boltzmann machines, normal mixtures, mixtures of binomial distributions, and so on are all singular learning machines. In a singular learning machine, the Kullback-Leibler information cannot be represented by any quadratic form, and the Fisher information matrix is singular.
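
To make the degeneracy concrete, here is a minimal sketch on a hypothetical toy model that does not appear in the text above: a regression y = a tanh(b x) with unit-variance Gaussian noise, inputs x on a small fixed grid, and a true regression function that is identically zero. Under these assumptions K(a,b) vanishes exactly on the two crossing lines a = 0 and b = 0, and the determinant of the Fisher information matrix is zero everywhere on that set. The names `XS`, `kl`, `fisher`, and `det2` are illustrative only.

```python
import math

# Hypothetical toy model (an assumption for illustration, not from the text):
# y = a * tanh(b * x) + unit-variance Gaussian noise, with inputs x taken
# uniformly from a small fixed grid.
XS = [-1.0, -0.5, 0.5, 1.0]


def kl(a, b):
    """Kullback-Leibler information K(a, b) when the true function is zero.

    For unit-variance Gaussian noise this is the mean of f(x)^2 / 2,
    where f(x) = a * tanh(b * x).
    """
    return sum(0.5 * (a * math.tanh(b * x)) ** 2 for x in XS) / len(XS)


def fisher(a, b):
    """2x2 Fisher information matrix at (a, b), averaged over the grid."""
    i_aa = i_ab = i_bb = 0.0
    for x in XS:
        t = math.tanh(b * x)
        g_a = t                      # derivative of a * tanh(b * x) w.r.t. a
        g_b = a * x * (1.0 - t * t)  # derivative of a * tanh(b * x) w.r.t. b
        i_aa += g_a * g_a
        i_ab += g_a * g_b
        i_bb += g_b * g_b
    n = len(XS)
    return [[i_aa / n, i_ab / n], [i_ab / n, i_bb / n]]


def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


if __name__ == "__main__":
    # Three points on the zero set of K, then one point off it.
    for a, b in [(0.0, 2.0), (3.0, 0.0), (0.0, 0.0), (1.0, 1.0)]:
        print(f"(a, b) = ({a}, {b}): "
              f"K = {kl(a, b):.4f}, det I = {det2(fisher(a, b)):.4f}")
```

In this sketch the zero set of K is a union of two lines rather than a single point, and at the crossing point (0, 0) the whole Fisher matrix is zero, so no quadratic form in (a, b) can describe K near that point.
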
As a consequence of this singularity, AIC is not equal to the average generalization error, BIC is not equal to the Bayes marginal likelihood, and MDL is not equal to the minimum description length. The log likelihood ratio in hypothesis testing does not converge to the \chi^{2}-distribution. In order to establish a statistical learning theory of singular learning machines, we need a new mathematical foundation. We proposed that algebraic geometry and algebraic analysis play a central role in these issues.

In a singular learning machine, the log likelihood ratio is equal to zero on an analytic set or an algebraic variety. In order to treat the log likelihood ratio function, we need empirical process theory on the analytic set or algebraic variety. In a singular learning machine, the a posteriori distribution converges to a singular distribution, which is quite different from the normal distribution. The zeta function of a learning machine is a powerful tool for analyzing such a probability distribution. Here is the fundamental theorem which enables us to observe the phenomenon of the learning process in singular machines: Algebraic analysis for nonidentifiable learning machines.

(Remark to mathematicians) The learning coefficient has a close relation to the real log canonical threshold, the real jumping numbers, and the zeros of the b-function. They are determined by the pair (W,W_{0}). The proper analytic map g can be found by recursive blow-ups or toric modifications. We found that singular learning theory is closely related to many concepts in mathematics.

SINGULAR LEARNING THEORY: Discovery of the hidden true probability distribution W_{0} in the family W is mathematically equivalent to the algebraic study of the pair (W,W_{0}). In 1998, we discovered Algebraic Geometrical Methods.
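
As a small worked illustration of the zeta function and the learning coefficient mentioned above, assume the usual definition \zeta(z)=\int K(w)^{z}\varphi(w)dw, where \varphi is a prior on W. Take the hypothetical example K(a,b)=a^{2}b^{2} with a uniform prior on [0,1]^{2}. Neither the example nor the symbols \varphi, \lambda, and m come from the text above; they are assumptions made for this sketch.

```latex
\zeta(z) = \int_{0}^{1}\!\!\int_{0}^{1} \left(a^{2} b^{2}\right)^{z} \, da \, db
         = \left( \int_{0}^{1} a^{2z} \, da \right)^{2}
         = \frac{1}{(2z+1)^{2}},
\qquad \operatorname{Re} z > -\tfrac{1}{2}.
% The right-hand side continues meromorphically to the whole complex plane.
% Its only pole is at z = -1/2, with order 2, so in this example
% the learning coefficient is \lambda = 1/2 with multiplicity m = 2.
```

For comparison, a regular model with two parameters has \lambda = d/2 = 1, so in this sketch the singular zero set {a = 0} \cup {b = 0} gives a strictly smaller learning coefficient.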