Learning Curves in Singular Learning Machines


The average generalization error as a function of the number of examples is called a learning curve. In many learning machines, the learning curve goes to zero as the number of examples tends to infinity. Two questions then arise: how fast does the generalization error go to zero, and what determines its speed?
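To fix notation (this is the standard definition used in the papers cited below): let q(x) be the true distribution, and let p(x | X_1, ..., X_n) be the predictive distribution constructed from n examples by the estimation method under study. The learning curve is the average Kullback-Leibler divergence

    G(n) = E[ ∫ q(x) log( q(x) / p(x | X_1, ..., X_n) ) dx ],

where E[ . ] denotes the expectation over the n training examples X_1, ..., X_n.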

Case 1 (figure)

(Remark) In some singular learning machines, such as normal mixtures, the maximum likelihood estimator always diverges, so its generalization error is not even defined. Thus the maximum likelihood estimator is not appropriate.




In singular learning machines, the generalization error of Bayes estimation is far smaller than that of maximum likelihood. The Bayes learning curve is determined by the singularities of the learning machine, whereas the maximum likelihood learning curve is determined by the maximum of a Gaussian stochastic process.
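To make the contrast concrete, the results proved in the papers cited at the end can be summarized as follows. For Bayes estimation,

    E[ G(n) ] = λ/n + o(1/n),

where λ > 0 is the real log canonical threshold discussed below. In a regular model λ = d/2, where d is the number of parameters, while in a singular model λ <= d/2, so singularities make the Bayes generalization error smaller, not larger. For maximum likelihood, by contrast, the coefficient of 1/n is not an algebraic invariant but the expectation of a functional of the maximum of the Gaussian process mentioned above, and it can be much larger than d/2.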

Case 2 (figure)


(Remark) It should be emphasized that, in singular learning machines, the log likelihood ratio converges to the Gaussian stochastic process in the topology of L. Schwartz's distributions or of M. Sato's hyperfunctions. It does not converge to the Gaussian stochastic process in the uniform norm or the supremum norm. This is the main reason why the maximum likelihood estimator is not appropriate in singular learning machines.


(Remark) One might consider that statistical inference should be done in a coordinate-free manner. If one insists on a coordinate-free method, then Bayes estimation with Jeffreys' prior is recommended. The maximum likelihood method is not recommended.

In general situations, however, coordinate-free statistical inference is not appropriate for obtaining better generalization performance. We propose that the problem of statistical inference be studied as the relation between the probability model p(x|w) and the true distribution q(x). In algebraic geometry, such a problem is studied as the pair of algebraic varieties (W, W_{0}), where W is the parameter space and W_{0} is the set of parameters that realize the true distribution. It is mathematically important that the learning curve is determined by the real log canonical threshold of this pair.
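As a sketch of the definition (following the cited papers): let

    K(w) = ∫ q(x) log( q(x) / p(x|w) ) dx

be the average log density ratio, so that W_{0} = { w : K(w) = 0 }, and let φ(w) be the prior. The zeta function

    ζ(z) = ∫ K(w)^z φ(w) dw

is holomorphic for Re(z) > 0 and has a meromorphic continuation to the complex plane whose poles are negative rational numbers. If its largest pole is z = -λ, then λ is the real log canonical threshold of the pair. For example, K(w) = w_1^2 w_2^2 on [0,1]^2 with the uniform prior gives ζ(z) = 1/(2z+1)^2, so λ = 1/2 with multiplicity m = 2; a regular d-parameter model behaves like K(w) = |w|^2 near the true parameter and gives λ = d/2.

Equivalently, λ controls how fast the prior volume of the set { w : K(w) < t } shrinks as t → 0, namely like t^λ up to a power of log(1/t) when the multiplicity m > 1. The following Python sketch checks this numerically for the toy example above; it is an illustration only, and the sample size and thresholds are arbitrary choices.

    # Illustrative check: the prior volume of {w : K(w) < t} for
    # K(w) = w1^2 * w2^2 on [0,1]^2, where lambda = 1/2 and m = 2.
    # The exact volume is V(t) = sqrt(t) * (1 + (1/2) * log(1/t)).
    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.uniform(size=(2_000_000, 2))   # samples from the uniform prior
    K = (w[:, 0] * w[:, 1]) ** 2           # K(w) = w1^2 * w2^2

    for t in [1e-2, 1e-4, 1e-6, 1e-8]:
        mc = (K < t).mean()                # Monte Carlo estimate of V(t)
        exact = np.sqrt(t) * (1 + 0.5 * np.log(1 / t))
        print(f"t={t:.0e}  V(t): MC estimate={mc:.3e}  exact={exact:.3e}")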




The Bayes case is studied in the following paper.

S. Watanabe, "Algebraic geometrical methods for hierarchical learning machines," Neural Networks, Vol. 14, No. 8, pp. 1049-1060, 2001.

The maximum likelihood case is studied in the following paper.

S. Watanabe, "Algebraic geometry of singular learning machines and symmetry of generalization and training errors," Neurocomputing, Vol. 67, pp. 198-213, 2005.