Sumio Watanabe, Ph.D.




(Postal Mail)


Sumio Watanabe, Ph.D.

Professor


Department of Computational Intelligence and Systems Science,
Tokyo Institute of Technology,
Mailbox: G5-19
4259 Nagatsuta, Midori-ku, Yokohama,
226-8502 Japan.


(E-mail)

swatanab (AT) dis . titech . ac . jp


(Remark) The old address, swatanab (AT) pi . titech . ac . jp, is no longer available.


Japanese Homepage

DBLP: Computer Science Bibliography

Paper Information

Return to Watanabe Lab.

Algebraic Geometry and Learning Theory

In 1998, we found a bridge between algebraic geometry and learning theory.



Beyond Laplace and Fisher

WAIC and WBIC




Both WAIC and WBIC are supported by

Algebraic Geometry and Statistical Learning Theory



Sumio Watanabe,
Algebraic Geometry and Statistical Learning Theory,
Cambridge University Press, 2009.


A new statistical theory has been established that holds even for non-regular models such as normal mixtures, neural networks, and hidden Markov models. The resolution theorem in algebraic geometry transforms the likelihood function into a new standard form in statistics. The asymptotic behavior of the log likelihood ratio function is given by the limit empirical process on an algebraic variety. This theory contains regular statistical theory as a very special case.

We can construct generalized concepts of AIC and BIC even if the true distribution is unrealizable by, or singular for, a statistical model. In fact, WAIC and WBIC are derived in this way, and they are easy to apply in practical applications.

Both WAIC and WBIC are based on this completely new statistical theory. Neither positive definiteness of the Fisher information matrix, asymptotic normality of the MLE, nor the Laplace approximation is necessary in the new theory. Thus, the theory holds for a wide range of statistical models.
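As a concrete illustration (not from the original page), both criteria can be evaluated from Monte Carlo samples of the posterior distribution. The following Python sketch assumes that a matrix loglik of pointwise log-likelihoods log p(X_i|w_s) has already been computed; for WAIC the parameters w_s are samples from the ordinary posterior, while for WBIC they are samples from the tempered posterior with inverse temperature 1/log n. The function names and the use of numpy are my own choices.

import numpy as np

def log_mean_exp(v, axis=0):
    # numerically stable log of the sample mean of exp(v)
    m = v.max(axis=axis)
    return m + np.log(np.mean(np.exp(v - m), axis=axis))

def waic(loglik):
    # loglik: (S, n) array, loglik[s, i] = log p(X_i | w_s),
    # with w_1, ..., w_S drawn from the posterior distribution.
    n = loglik.shape[1]
    t_n = -np.mean(log_mean_exp(loglik, axis=0))   # Bayes training loss
    v_n = np.sum(np.var(loglik, axis=0))           # functional variance
    return t_n + v_n / n                           # estimates the generalization loss

def wbic(loglik_tempered):
    # loglik_tempered: (S, n) array of log p(X_i | w_s), where the w_s are
    # drawn from the tempered posterior with inverse temperature 1/log n.
    # Returns an estimate of the free energy F = -log(marginal likelihood).
    return np.mean(-np.sum(loglik_tempered, axis=1))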

Let's see the true likelihood function.
Algebraic geometry is essential to new statistics



Let's compare WAIC with DIC.
Widely Applicable Information Criterion



Let's compare CV with WAIC.
CV and WAIC



Let's compare PSISCV with WAIC.
PSISCV and WAIC



Let's try WBIC.
Widely Applicable Bayesian Information Criterion







Let's compare CV and PSISCV with WAIC
as Estimators of the Generalization Error



(1) CV and WAIC should be compared against the generalization error.


(2) There are cases in which WAIC is a better statistical estimator than PSISCV.


The cross validation (CV), the Pareto smoothing importance sampling cross validation (PSISCV), and WAIC are asymptotically equivalent under the condition that the data are independent. However, the purpose of WAIC is not to approximate CV but to estimate the generalization error. Thus they should be compared as estimators of the generalization error.


It is easy for you to conduct the same experiment.


I recommend that you see the actual experimental results with your own eyes.


Experiment Description (PDF)


We study a simple regression problem, Y = aX^2 + Noise, with fixed inputs,

X=0.1, 0.2, ..., 1.0,

the following shows the experimental results of 10000 random trials (sample size n = 10), where


|WAIC-GE| : absolute value of the difference between WAIC and the generalization error,

|ISCV-GE| : absolute value of the difference between the importance sampling cross validation and the generalization error,

|PSIS-GE| : absolute value of the difference between the Pareto smoothing importance sampling cross validation and the generalization error.


matlab program
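The linked MATLAB program is not reproduced here. As a rough guide only, the following self-contained Python sketch performs the same kind of comparison; the true coefficient, the noise level, and the prior are my own assumptions because they are not stated on this page, so the numbers it produces will not reproduce the table below exactly. The Pareto smoothing of the importance weights (PSIS) is omitted for brevity.

import numpy as np

rng = np.random.default_rng(0)

a_true = 1.0                      # true coefficient in Y = a X^2 + noise (assumption)
sigma  = 0.3                      # known noise standard deviation (assumption)
tau    = 10.0                     # prior a ~ N(0, tau^2) (assumption)
S      = 2000                     # posterior samples per trial
x      = np.arange(1, 11) / 10.0  # fixed inputs 0.1, 0.2, ..., 1.0
phi    = x ** 2
n      = len(x)

def log_mean_exp(v, axis=0):
    # numerically stable log of the sample mean of exp(v)
    m = v.max(axis=axis)
    return m + np.log(np.mean(np.exp(v - m), axis=axis))

def one_trial():
    y = a_true * phi + sigma * rng.normal(size=n)

    # exact Gaussian posterior of a (the model is linear in a, so it is conjugate)
    prec = 1.0 / tau ** 2 + np.sum(phi ** 2) / sigma ** 2
    mean = (np.sum(phi * y) / sigma ** 2) / prec
    var  = 1.0 / prec
    a_s  = mean + np.sqrt(var) * rng.normal(size=S)

    # pointwise log-likelihoods, shape (S, n)
    resid  = y[None, :] - a_s[:, None] * phi[None, :]
    loglik = -0.5 * np.log(2 * np.pi * sigma ** 2) - resid ** 2 / (2 * sigma ** 2)

    # WAIC = Bayes training loss + functional variance / n
    waic = -np.mean(log_mean_exp(loglik)) + np.sum(np.var(loglik, axis=0)) / n

    # importance sampling cross validation (ISCV)
    iscv = np.mean(log_mean_exp(-loglik))

    # generalization loss of the Bayes predictive distribution,
    # computed analytically because every distribution here is Gaussian
    s2_pred = sigma ** 2 + var * phi ** 2
    ge = np.mean(0.5 * np.log(2 * np.pi * s2_pred)
                 + (sigma ** 2 + (a_true - mean) ** 2 * phi ** 2) / (2 * s2_pred))
    return waic, iscv, ge

results = np.array([one_trial() for _ in range(1000)])
print("mean |WAIC-GE| =", np.abs(results[:, 0] - results[:, 2]).mean())
print("mean |ISCV-GE| =", np.abs(results[:, 1] - results[:, 2]).mean())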


Statistics of 10000 trials:

WAIC(mean,std) = 0.098, 0.123
ISCV(mean,std) = 0.116, 0.131
PSIS(mean,std) = 0.112, 0.128
GEN (mean,std) = 0.097, 0.115
mean( |WAIC-GE| ) = 0.164
mean( |ISCV-GE| ) = 0.173
mean( |PSIS-GE| ) = 0.170


WAIC is a better approximator of the generalization error than the importance sampling cross validation.


The following shows the histogram of |ISCV-GE| - |WAIC-GE|.


CV and WAIC


The following shows the histogram of |PSIS-GE| - |WAIC-GE|.


PSIS and WAIC


WAIC is a better approximator of the generalization error than the Pareto smoothing importance sampling cross validation.


The Pareto smoothing importance sampling cross validation may be a better approximator of the cross validation than WAIC; however, it is not a better approximator of the generalization error.











For another case, a leverage sample point is included:

X=0.1, 0.2, ..., 0.9, 10,

where the last one is a leverage sample point.
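(In the Python sketch above, this case corresponds to setting the last entry of the input vector x to 10.)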

Statistics of 10000 trials:

WAIC(mean,std) = 0.075, 0.120
ISCV(mean,std) = 0.113, 0.130
PSIS(mean,std) = 0.102, 0.123
GEN (mean,std) = 0.082, 0.100
mean( |WAIC-GE| ) = 0.148
mean( |ISCV-GE| ) = 0.165
mean( |PSIS-GE| ) = 0.158


The following shows the histogram of |ISCV-GE| - |WAIC-GE|.


CV and WAIC


The following shows the histogram of |PSIS-GE| - |WAIC-GE|.


PSIS and WAIC














WAIC is also a better estimator of the generalization error than the cross validation for the Poisson distribution.














The following shows another experiment.

If a leverage sample point is contained, then the importance sampling cross validation and the Pareto smoothing importance sampling cross validation have larger variance than WAIC.




Widely Applicable Information Criterion








Singular Learning Theory

A learning machine or a statistical model is called singular if its Fisher information matrix is singular. (A matrix A is singular if det(A) = 0.) Almost all learning machines which have hidden variables or hierarchical structure are singular. In singular learning machines, asymptotic normality of the maximum likelihood estimator does not hold. We are establishing a new statistical theory based on algebraic geometry and algebraic analysis, in which the asymptotic theory of regular models is generalized to singular statistical models. The theory is mathematically beautiful and statistically useful. Singular Learning Theory.
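For a concrete illustration (my own example, in the spirit of the standard small neural network): consider the regression model y = a tanh(bx) + noise with unit Gaussian noise. The Fisher information matrix is I(a,b) = E_x[ g(x) g(x)^T ] with g(x) = (tanh(bx), a x / cosh(bx)^2). Whenever a = 0 or b = 0, the two components of g are linearly dependent, so det I(a,b) = 0 and the model is singular at such parameters. A small numerical check in Python (the input distribution is assumed to be standard normal):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100000)   # inputs, assumed standard normal

def fisher(a, b):
    # Fisher information of y = a*tanh(b*x) + N(0,1),
    # approximated by a Monte Carlo average of g(x) g(x)^T over the inputs
    g = np.stack([np.tanh(b * x), a * x / np.cosh(b * x) ** 2])
    return g @ g.T / x.size

print(np.linalg.det(fisher(1.0, 1.0)))   # regular point: determinant > 0
print(np.linalg.det(fisher(0.0, 1.0)))   # singular point: the d/db row vanishes, determinant = 0
print(np.linalg.det(fisher(1.0, 0.0)))   # singular point: the d/da row vanishes, determinant = 0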

New Information


Can we optimize hyperparameters by cross validation, WAIC, DIC, and the marginal likelihood?

Our answer is given in arXiv:1503.07970 (2015/March/27).




New Research Results



Links

The Annals of Mathematics
Inventiones Mathematicae
Communications in Mathematical Physics
The Annals of Statistics
Journal of the Royal Statistical Society
Mathematical Journals
AMS
MathSciNet
Applied Mathematics Research Express (AMRX)
Neural Networks
Journal of Machine Learning Research
Geometry and Statistics in Neural Network Learning Theory
Mathematical Calendar
ACM Calendar
Scirus
AMS Meeting
Math. and Phys.
Open Problems
Research Institute for Mathematical Sciences
Institute for Advanced Study, Princeton , Math Link
Isaac Newton Institute for Mathematical Sciences
Institut de Recherche Mathematique Avancee
The Clay Mathematics Institute
Mathematical Sciences Research Institute
American Institute of Mathematics
Statistical and Applied Mathematical Sciences Institute
Institute for Mathematics and Its Applications
Minimum Description Length
Neural Nets
ISAAC




People



Professor Huzihiro Araki (Mathematical Science): Professor Araki, a Poincaré medalist, is famous for his contributions to operator algebra and mathematical physics.
Hal Tasaki (Theoretical Physics)
Takashi Hara (Mathematical Physics)
Tadayuki Takahashi (Astro Physics)
David Mumford (Algebraic geometry, Pattern Theory)
Stephen E. Fienberg (Mathematical Statistics)
Angelika van der Linde (Mathematical Statistics)
Andrew Gelman (Statistics and Political Science)
Aki Vehtari (Statistics and Brain Science)
Bernd Sturmfels (Algebraic Statistics)
Nobuki Takayama (Algebraic Analysis)
Akimichi Takemura (Mathematical Statistics)
Giovanni Pistone (Algebraic Statistics)
Lior Pachter (Mathematical Biology)
Seth Sullivant (Algebraic Statistics)
Russell Steele (Mathematical Statistics)
Mathias Drton (Algebraic Statistics)
Ruriko Yoshida (Algebraic Statistics)
Luis David Garcia Puente (Algebraic Statistics)
Jason Morton (Algebraic Statistics)
Shaowei Lin (Algebraic Statistics)
Piotr Zwiernik (Algebraic Statistics)
Caroline Uhler (Algebraic Statistics)
Dan Geiger (Computer Science)
Dmitry Rusakov (Computer Science)
Miki Aoyagi (Complex Manifold Theory)
Keisuke Yamazaki (Statistical Learning Theory)
Shinichi Nakajima (Statistical Learning Theory)
Kazuho Watanabe (Statistical Learning Theory)
Kenji Nagata (Statistical Learning Theory)




********** Sumio Watanabe Homepage **********

Research Field:

Probability Theory, Mathematical Statistics, and Learning Theory.

Research Purpose:

(1) To establish a mathematical foundation of statistical learning.
(2) To construct a new research field between mathematics and neuroscience.

Recently Published Books:

(1) S. Watanabe, "Neural Networks for Robotic Control - Theory and Applications," Prentice Hall, 1996.
(2) S.Watanabe and K.Fukumizu, "Algorithms and Architectures," Academic Press, 1998.
(3) S. Watanabe, "Algebraic Geometry and Statistical Learning Theory," Cambridge University Press (UK, US), August 2009.

Watanabe's Main Formulas

If you are a mathematician or a statistician, you can understand the importance of the following formulas. From the mathematical point of view, these formulas clarify the relation between algebraic geometry and statistics. From the statistical point of view, they are generalized BIC and AIC for singular statistical models. These two main formulas are mathematically very beautiful and statistically very useful.

Main Formula 1

Let X1, X2, ..., Xn be random variables which are independently subject to the probability distribution q(x)dx. Even if the Fisher information matrix of a statistical model p(x|w) is degenerate, the following formula holds. The stochastic complexity, or the Bayes marginal likelihood,

F = -log \int p(X1|w) p(X2|w) ... p(Xn|w) \phi(w) dw

has asymptotic expansion

F= nS + A log n - (B-1) log log n + R

where nS is equal to n times the empirical entropy of the true distribution, A is a positive rational number, B is a natural number, and R is a random variable of constant order. Here (-A) is the largest pole of the zeta function of the statistical model and B is its order,

J(z)= \int H(w)^{z} \phi(w)dw

which can be analytically continued as a meromorphic function to the entire complex plane. Here H(w) is the Kullback distance from the true distribution q(x)dx to the parametric model p(x|w)dx. The zeta function J(z) is the mathematical bridge between statistics and algebraic geometry. The expectation value of the Bayes generalization error is asymptotically equal to

E[Bg] = A/n - (B-1)/(n log n) + o(1/(n log n)),

where E[ ] is the expectation value over all sets of random samples.

We can also algorithmically calculate A and B by applying Hironaka's resolution of singularities to the Kullback information, and obtain that A is not larger than D/2 and that B is not larger than D, where D is the number of parameters. The constants A and B are determined by the algebraic geometrical structure of the learning machine. If a model p(x|w) is regular, then A = D/2 and B = 1; hence this formula is a generalized version of BIC and MDL.
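For example (a standard illustration, not reproduced from this page): for a regular model with one parameter, H(w) behaves like w^2 near the true parameter, and taking \phi(w) = 1 on [0,1] for simplicity,

J(z) = \int_0^1 w^{2z} dw = 1/(2z+1),

whose largest pole is z = -1/2 with order 1, so A = 1/2 = D/2 and B = 1, which recovers BIC. For the singular two-parameter example H(a,b) = a^2 b^2 with \phi = 1 on [0,1]^2,

J(z) = \int_0^1 \int_0^1 (a^2 b^2)^{z} da db = 1/(2z+1)^2,

whose largest pole is again z = -1/2 but now with order 2, so A = 1/2 < D/2 = 1 and B = 2; the Bayes generalization error of this singular model is asymptotically smaller than the regular value D/(2n).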

Main Formula 2

Let Bg, Bt, Gg, Gt be the Bayes generalization error, the Bayes training error, the Gibbs generalization error, and the Gibbs training error, respectively. Then the following formulas hold for an arbitrary true distribution, an arbitrary parametric model, an arbitrary a priori distribution, and arbitrary singularities.

E[Bg]=E[Bt]+2bE[Gt-Bt],
E[Gg]=E[Gt]+2bE[Gt-Bt],

where b is the inverse temperature of the a posteriori distribution. By using these formulas, we can estimate the Bayes and Gibbs generalization errors from the Bayes and Gibbs training errors. If a model is regular, then E[Gt-Bt] = D/(2n) asymptotically, where D is the dimension of the parameter space. Hence these formulas contain AIC as a very special case.
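As an illustration of how these formulas are used in practice (a sketch under my own conventions, not code from this page): with the standard posterior (b = 1), Bt and Gt can be computed from the same matrix of pointwise log-likelihoods over posterior samples, and the Bayes generalization error is then estimated by Bt + 2(Gt - Bt).

import numpy as np

def bayes_generalization_estimate(loglik):
    # loglik: (S, n) array, loglik[s, i] = log p(X_i | w_s) over posterior samples w_s.
    # Bayes training error: Bt = -(1/n) sum_i log( (1/S) sum_s p(X_i | w_s) )
    m = loglik.max(axis=0)
    bt = -np.mean(m + np.log(np.mean(np.exp(loglik - m), axis=0)))
    # Gibbs training error: Gt = -(1/n) sum_i (1/S) sum_s log p(X_i | w_s)
    gt = -np.mean(loglik)
    # equation of states with b = 1: E[Bg] = E[Bt] + 2 E[Gt - Bt]
    return bt + 2.0 * (gt - bt)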

For references,
S. Watanabe, ``Algebraic analysis for nonidentifiable learning machines," Neural Computation, Vol.13, No.4, pp.899-933, 2001.
S. Watanabe, ``A formula of equations of states in singular learning machines," Proceedings of WCCI, Hong Kong, 2008.

We would like to emphasize that these results were new discoveries, unknown even in statistics, information theory, and learning theory. We also expect that these results mathematically clarify the essential difference between neural networks and regular statistical models. In other words, the reason why neural networks in Bayesian estimation are more useful than regular statistical models was proven mathematically for the first time. Algebraic geometry and algebraic analysis play an important role in the theory of complicated learning machines. For more detail, see Singular learning theory.

Mathematical Structure

Related Topic: Zeta functions

(1) As is well known, the Riemann zeta function is defined by

f(z)=\sum_{n=1}^{\infty} 1/n^{z}

which can be analytically continued to the entire complex plane except for a simple pole at z = 1. You may know the Riemann Hypothesis, which, if proved, would clarify the distribution of prime numbers.

(2) The zeta function of the Kullback information H(w) and the prior \phi(w) is defined by

J(z)=\int H(w)^{z}\phi(w)dw

which can be analytically continued as a meromorphic function to the entire complex plane. Professor Gel'fand conjectured in 1954 that this function is meromorphic, and Professor Bernstein and Professor Atiyah proved this fact. In Atiyah's method, Hironaka's resolution of singularities plays a central role. This function clarifies Bayesian statistics, as shown in the foregoing sections.

(3) The zeta function of the Replica method will be defined by

L(z)=E[ Z(X1,X2,...,Xn)^{z}]

where X1, X2, ..., Xn are training samples taken from the true distribution and E denotes the expectation value over all sets of training samples. Z(X1, X2, ..., Xn) is the partition function, or the evidence, of the learning. It is strongly expected that this function plays an important role in clarifying the mathematical structure of the replica method in mathematical physics.

Published Papers with Comments:


Papers




Sumio Watanabe, Ph.D.

Curriculum Vitae

Sumio Watanabe was born in Japan on March 31, 1959. He received the B.S. degree in Physics from the University of Tokyo, Japan, the M.S. degree in Mathematics from the Research Institute for Mathematical Sciences (RIMS), Kyoto University, Japan, and the Ph.D. degree in Applied Electronics from Tokyo Institute of Technology, Japan, in 1993.

Dr. Watanabe is currently a professor in the Department of Computational Intelligence and Systems Science at Tokyo Institute of Technology.

His research interests include probability theory, applied algebraic geometry, and Bayesian statistics. He was the first to discover the algebraic geometrical structure in statistical learning theory, and showed that a standard form of the likelihood function can be derived from resolution of singularities, on which a new mathematical statistics can be established. He proposed WAIC and WBIC, which can be used for the evaluation of both regular and singular statistical models.