Sumio Watanabe


(Recent Paper)

If the posterior can be approximated by some normal distribution, then CV and WAIC are asymptotically equivalent to each other up to the second order.

Sumio Watanabe, Higher Order Equivalence of Bayes Cross Validation and WAIC, Springer Proceedings in Mathematics and Statistics, Information Geometry and Its Applications (4), pp.47-73, 2018.

Thus, minimizing CV and minimizing WAIC are asymptotically equivalent in prior optimization. In experiments, the variance of WAIC is smaller than that of CV.
Prior optimization by CV and WAIC (mp4)

Hyperparameter Optimization

In hyperparameter optimization, WAIC is better than ISCV (importance sampling cross validation).

The reason why WAIC is preferable is that the variance of WAIC in MCMC is smaller than that of ISCV.

Hyperparameter optimization

The above figure shows that, in a simple regression problem Y = aX + N(0, 1/s), the MCMC fluctuation is not small even when the number of posterior samples in MCMC is 10000 (the data sample size is n = 30).

sample data, sample program (MATLAB).

The following figure shows the case in which the MCMC sample size is 100000 for the same problem.

Hyperparameter optimization

Remark: In simple artificial simulations, WAIC and ISCV take almost the same values; however, in practical applications, the fluctuation of WAIC in MCMC is often smaller than that of ISCV. This phenomenon is often observed in hyperparameter optimization problems. In practical cases, I recommend that you calculate both WAIC and ISCV and compare their fluctuations in MCMC. The computational costs of WAIC and ISCV are equal to each other.
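
The following is a minimal Python sketch (not the linked MATLAB program) showing how both WAIC and ISCV can be computed from the same MCMC output. It assumes a matrix loglik whose entry (s, i) is log p(X_i | w_s) for the s-th posterior draw w_s; the function name and the use of NumPy/SciPy are my own choices.

import numpy as np
from scipy.special import logsumexp

def waic_and_iscv(loglik):
    # loglik[s, i] = log p(X_i | w_s); rows = posterior draws, columns = data points
    S, n = loglik.shape
    # WAIC (generalization-loss form): training loss + functional variance / n
    lppd = logsumexp(loglik, axis=0) - np.log(S)      # log E_w[ p(X_i|w) ]
    fv = loglik.var(axis=0)                           # V_w[ log p(X_i|w) ]
    waic = -lppd.mean() + fv.mean()
    # ISCV: (1/n) sum_i log E_w[ 1 / p(X_i|w) ]
    iscv = (logsumexp(-loglik, axis=0) - np.log(S)).mean()
    return waic, iscv

Because both criteria are computed from the same posterior sample, their computational costs are indeed equal; running this function over repeated MCMC chains makes the difference in their fluctuations visible.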

Summary of Singular Learning Theory

Bayes Theory Essential. You can learn Bayes theory in 5 minutes.

Neural Networks and Singular Learning Theory.

Singular Learning Theory and Information Criterion.


The mathematical theory of deep learning has already been discovered. A neural network in Bayes (mp4).

Sumio Watanabe, Mathematical Theory of Bayesian Statistics, CRC Press, 2018.

Sumio Watanabe, Algebraic Geometry and Statistical Learning Theory, Cambridge University Press, 2009.

Watanabe, S. Algebraic geometrical methods for hierarchical learning machines. Neural Networks, Vol.14, No.8, pp.1049-1060, 2001. DOI: 10.1016/S0893-6080(01)00069-7

Watanabe, S. Algebraic analysis for nonidentifiable learning machines. Neural Computation, Vol.13, No.4, pp.899-933, 2001. DOI: 10.1162/089976601300014402

Applications to

Beyond Laplace and Fisher


WAIC (2010) is a generalized version of AIC.
WBIC (2013) is a generalized version of BIC.


WAIC and WBIC can be used even if the posterior distribution is far from any normal distribution.
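
For reference, the definitions are as follows, where E_w[ ] and V_w[ ] denote the mean and the variance over the posterior distribution of w:

WAIC = - (1/n) \sum_{i=1}^{n} log E_w[ p(X_i|w) ] + (1/n) \sum_{i=1}^{n} V_w[ log p(X_i|w) ],

WBIC = E_w[ - \sum_{i=1}^{n} log p(X_i|w) ],

where, in WBIC, the posterior distribution is defined with the inverse temperature 1/log n. WAIC is an estimator of the generalization loss, and WBIC is an estimator of the free energy F (the minus log marginal likelihood) defined below.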

Both WAIC and WBIC are supported by

Algebraic Geometry and Statistical Learning Theory

Algebraic Geometry and Statistical Learning Theory

Sumio Watanabe,
Algebraic Geometry and Statistical Learning Theory,
Cambridge University Press, 2009.

A new statistical theory is established that holds even for non-regular models such as normal mixtures, neural networks, and hidden Markov models. The resolution theorem in algebraic geometry transforms the likelihood function into a new standard form in statistics. The asymptotic behavior of the log likelihood ratio function is given by the limit empirical process on an algebraic variety. This theory contains regular statistical theory as a very special case.

We can construct generalized versions of AIC and BIC even if the true distribution is unrealizable by, or singular for, a statistical model. In fact, WAIC and WBIC are derived in this way. They are very easy to apply in practical applications.

Both WAIC and WBIC are based on this completely new statistical theory. Neither positive definiteness of the Fisher information matrix, asymptotic normality of the MLE, nor the Laplace approximation is necessary in our theory. Thus, our theory holds for a wide range of statistical models.

Let's see the true likelihood function.
Algebraic geometry is essential to new statistics

Let's compare WAIC with DIC.
Widely Applicable Information Criterion

Let's compare CV with WAIC.

Let's compare PSISCV with WAIC.

Let's try WBIC
Widely Applicable Bayesian Information Criterion

Let's compare CV, PSIS, and WAIC as Estimators of the Generalization Error.

(1) CV and WAIC should each be compared with the generalization error. CV is not always the best estimator of the generalization error.

(2) WAIC is not an approximation of CV but an estimator of the generalization error. In fact, there exist cases in which WAIC can estimate the generalization error even when a sample consists of dependent random variables.

(3) Even when n is small, WAIC is a better estimator of the generalization error than CV.

Cross validation (CV), Pareto smoothing importance sampling cross validation (PSISCV), and WAIC are asymptotically equivalent under the condition that a sample consists of independent random variables. However, the purpose of WAIC is not to approximate CV but to estimate the generalization error. Thus we should compare them as estimators of the generalization error.

It is easy for you to conduct the same experiment.

I recommend that you see the true experimental results with your own eyes.

Experiment Description (PDF)

We study a simple regression problem, Y = aX^2 + noise, with fixed inputs

X = 0.1, 0.2, ..., 1.0.

The following shows the experimental result of 10000 random trials (sample size n = 10), where

|WAIC-GE| : absolute value of the difference between WAIC and the generalization error,

|ISCV-GE| : absolute value of the difference between the importance sampling cross validation and the generalization error,

|PSIS-GE| : absolute value of the difference between the Pareto smoothing importance sampling cross validation and the generalization error.

MATLAB program
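
The linked program is written in MATLAB; for readers who prefer Python, the following self-contained sketch runs one trial of the same kind of experiment. The N(0,1) prior on a, the noise level, the true value a = 1, and the grid approximation of the posterior are my own illustrative choices and may differ from the linked program; PSIS is omitted because it requires the Pareto smoothing step.

import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(1)
X = np.arange(1, 11) / 10.0                  # fixed inputs 0.1, ..., 1.0 (n = 10)
a_true, sd = 1.0, 1.0                        # assumed true parameter and noise level
Y = a_true * X**2 + rng.normal(0.0, sd, X.size)

a = np.linspace(-4.0, 6.0, 2001)             # grid approximation of the posterior over a
loglik = -0.5*np.log(2*np.pi*sd**2) - 0.5*((Y - np.outer(a, X**2)) / sd)**2   # (grid, n)
logpost = -0.5*a**2 + loglik.sum(axis=1)     # N(0,1) prior on a (assumption)
w = np.exp(logpost - logsumexp(logpost))     # normalized posterior weights on the grid

# WAIC and ISCV from the grid posterior (the weights w play the role of MCMC draws)
lppd = logsumexp(loglik, axis=0, b=w[:, None])                    # log E_w[ p(Y_i|a) ]
fv = (w[:, None]*loglik**2).sum(0) - ((w[:, None]*loglik).sum(0))**2
waic = -lppd.mean() + fv.mean()
iscv = logsumexp(-loglik, axis=0, b=w[:, None]).mean()

# Generalization loss -E_x E_y[ log p(y|x) ] by quadrature, x uniform on the design points
y = np.linspace(-8.0, 10.0, 2001); dy = y[1] - y[0]
ge = 0.0
for x in X:
    q = np.exp(-0.5*((y - a_true*x**2)/sd)**2) / np.sqrt(2*np.pi*sd**2)      # true density
    logp = logsumexp(-0.5*np.log(2*np.pi*sd**2)
                     - 0.5*((y[None, :] - a[:, None]*x**2)/sd)**2,
                     axis=0, b=w[:, None])                                   # log predictive
    ge += -(q * logp).sum() * dy / X.size

print(waic, iscv, ge)    # repeat over many trials to compare |WAIC-GE| and |ISCV-GE|

Here WAIC, ISCV, and GE are all written as losses; the entropy of the true distribution is common to all of them, so the comparison of |WAIC-GE| and |ISCV-GE| is unaffected.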

Statistics of 10000 trials:

WAIC(mean,std) = 0.098, 0.123
ISCV(mean,std) = 0.116, 0.131
PSIS(mean,std) = 0.112, 0.128
GEN (mean,std) = 0.097, 0.115
mean( |WAIC-GE| ) = 0.164
mean( |ISCV-GE| ) = 0.173
mean( |PSIS-GE| ) = 0.170

WAIC is a better approximator of the generalization error than the importance sampling cross validation.

The following shows the histogram of |ISCV-GE| - |WAIC-GE|.


The following shows the histogram of |PSIS-GE| - |WAIC-GE|.


WAIC is a better approximator of the generalization error than the Pareto smoothing importance sampling cross validation.

The Pareto smoothing cross validation may be a better estimator of the cross validation than WAIC; however, it is not a better estimator of the generalization error.

Consider another case in which a leverage sample point is contained:

X = 0.1, 0.2, ..., 0.9, 10,

where the last one is a leverage sample point.

Statistics of 10000 trials:

WAIC(mean,std) = 0.075, 0.120
ISCV(mean,std) = 0.113, 0.130
PSIS(mean,std) = 0.102, 0.123
GEN (mean,std) = 0.082, 0.100
mean( |WAIC-GE| ) = 0.148
mean( |ISCV-GE| ) = 0.165
mean( |PSIS-GE| ) = 0.158

The following shows the histogram of |ISCV-GE| - |WAIC-GE|.


The following shows the histogram of |PSIS-GE| - |WAIC-GE|.


The following figure shows the difference between LOOCV and WAIC as estimators of the generalization loss in the case of a linear regression problem on a 5-dimensional space.

"Vertical axis >0" is equivalent to "WAIC is better than LOOCV".



WAIC is a better estimator of the generalization error than cross validation also in the case of the Poisson distribution.

The following shows another experiment.

If a leverage sample point is contained, then the importance sampling cross validation and the Pareto smoothing cross validation have larger variance than WAIC.

Widely Applicable Information Criterion

Singular Learning Theory

A learning machine or a statistical model is called singular if its Fisher information matrix is singular. (A matrix A is singular if det(A) = 0; a concrete numerical illustration is given below.) Almost all learning machines which have hidden variables or hierarchical structure are singular. In singular learning machines, asymptotic normality of the maximum likelihood estimator does not hold. We are now establishing a new statistics based on algebraic geometry and algebraic analysis. The asymptotic statistical theory of regular models is being generalized to singular statistical models. The theory is mathematically beautiful and statistically useful. Singular Learning Theory.
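
As a concrete numerical illustration (my own minimal example, not taken from the papers above), consider the regression model y = a tanh(bx) + N(0,1) with a one-dimensional input and parameter w = (a, b). Whenever a = 0 the output does not depend on b, so the Fisher information matrix is degenerate there:

import numpy as np

# Fisher information of y = a*tanh(b*x) + N(0,1) for a fixed design {x}:
# I(a, b) = sum_x grad m(x) grad m(x)^T, where m(x) = a*tanh(b*x).
def fisher_info(a, b, xs):
    I = np.zeros((2, 2))
    for x in xs:
        g = np.array([np.tanh(b * x),                  # d m / d a
                      a * x / np.cosh(b * x) ** 2])    # d m / d b
        I += np.outer(g, g)
    return I

xs = np.linspace(-1.0, 1.0, 21)
print(np.linalg.det(fisher_info(0.5, 1.0, xs)))   # positive: regular point
print(np.linalg.det(fisher_info(0.0, 1.0, xs)))   # 0: singular point (a = 0)

The set {a = 0} of singular parameters is exactly the set where this one-hidden-unit machine degenerates to a smaller model, which is the typical situation for models with hidden or hierarchical structure.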

New Information

Can we optimize hyperparameters by cross validation, WAIC, DIC, and the marginal likelihood?

Our answer is arXiv:1503.07970 (2015/March/27).

New Research Results


The Annals of Mathematics
Inventiones Mathematicae
Communications in Mathematical Physics
The Annals of Statistics
Journal of the Royal Statistical Society
Mathematical Journals
Applied Mathematics Research Express (AMRX)
Neural Networks
Journal of Machine Learning Research
Geometry and Statistics in Neural Network Learning Theory
Mathematical Calendar
ACM Calendar
AMS Meeting
Math. and Phys.
Open Problems
Research Institute for Mathematical Sciences
Institute for Advanced Study, Princeton , Math Link
Isaac Newton Institute for Mathematical Sciences
Institut de Recherche Mathematique Avancee
The Clay Mathematics Institute
Mathematical Sciences Research Institute
American Institute of Mathematics
Statistical and Applied Mathematical Sciences Institute
Institute for Mathematics and Its Applications
Minimum Description Length
Neural Nets


Professor Huzihiro Araki (Mathematical Science). Professor Araki, a Poincaré Prize laureate, is famous for his contributions to operator algebra and mathematical physics.
Hal Tasaki (Theoretical Physics)
Takashi Hara (Mathematical Physics)
Tadayuki Takahashi (Astro Physics)
David Mumford (Algebraic geometry, Pattern Theory)
Stephen E. Fienberg (Mathematical Statistics)
Angelika van der Linde (Mathematical Statistics)
Andrew Gelman (Statistics and Political Science)
Aki Vehtari (Statistics and Brain Science)
Bernd Sturmfels (Algebraic Statistics)
Nobuki Takayama (Algebraic Analysis)
Akimichi Takemura (Mathematical Statistics)
Giovanni Pistone (Algebraic Statistics)
Lior Pachter (Mathematical Biology)
Seth Sullivant (Algebraic Statistics)
Russell Steele (Mathematical Statistics)
Mathias Drton (Algebraic Statistics)
Ruriko Yoshida (Algebraic Statistics)
Luis David Garcia Puente (Algebraic Statistics)
Jason Morton (Algebraic Statistics)
Shaowei Lin (Algebraic Statistics)
Piotr Zwiernik (Algebraic Statistics)
Caroline Uhler (Algebraic Statistics)
Dan Geiger (Computer Science)
Dmitry Rusakov (Computer Science)
Miki Aoyagi (Complex Manifold Theory)
Keisuke Yamazaki (Statistical Learning Theory)
Shinichi Nakajima (Statistical Learning Theory)
Kazuho Watanabe (Statistical Learning Theory)
Kenji Nagata (Statistical Learning Theory)

********** Sumio Watanabe Homepage **********

Research Field:

Probability Theory, Mathematical Statistics, and Learning Theory.

Research Purpose:

(1) To establish the mathematical foundation of statistical learning.
(2) To construct a new research field between mathematics and neuroscience.

Recently Published Books:

(1) S. Watanabe, "Neural Networks for Robotic Control - Theory and Applications," Prentice Hall, 1996.
(2) S.Watanabe and K.Fukumizu, "Algorithms and Architectures," Academic Press, 1998.
(3) S.Watanabe, Algebraic Geometry and Statistical Learning Theory UK, US (2009/August).

Watanabe's Main Formulas

If you are a mathematician or a statistician, you can understand the importance of the following formulas. From the mathematical point of view, these formulas clarify the relation between algebraic geometry and statistics. From the statistical point of view, they are generalized versions of BIC and AIC for singular statistical models. These two main formulas are mathematically very beautiful and statistically very useful.

Main Formula 1

Let X1, X2, ..., Xn be random variables which are independently subject to the probability distribution q(x)dx. Even if the Fisher information matrix of a statistical model p(x|w) is degenerate, the following formula holds. The stochastic complexity, or the minus log Bayes marginal likelihood,

F = -log \int p(X1|w) p(X2|w) ... p(Xn|w) \phi(w) dw

has asymptotic expansion

F= nS + A log n - (B-1) log log n + R

where nS is equal to n times the empirical entropy of the true distribution, A is a positive rational number, B is a natural number, and R is a random variable of constant order. Here (-A) and B are determined as the largest pole and its order of the zeta function of the statistical model,

J(z)= \int H(w)^{z} \phi(w)dw

which can be analytically continued to a meromorphic function on the entire complex plane. Here H(w) is the Kullback distance from the true distribution q(x)dx to the parametric model p(x|w)dx. The zeta function J(z) is the mathematical bridge between statistics and algebraic geometry. The expectation value of the Bayes generalization error is asymptotically equal to

E[Bg] = A/n - (B-1)/(n log n) + o(1/(n log n)),

where E[ ] is the expectation value over all sets of random samples.

Also, we can algorithmically calculate A and B by applying Hironaka's resolution of singularities to the Kullback information, and obtain that A is not larger than D/2 and that B is not larger than D, where D is the number of parameters. The constants A and B are determined by the algebraic geometrical structure of the learning machine (a worked example is given below). If a model p(x|w) is regular, then A = D/2 and B = 1; hence this formula is a generalized version of BIC and MDL.
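
As a worked example with D = 2 and the uniform prior on [-1,1]^2, take H(w1,w2) = w1^2 w2^2, a typical singular case. Then

J(z) = \int (w1^2 w2^2)^{z} \phi(w) dw = 1/(2z+1)^2,

so the largest pole is z = -1/2 with order 2, hence A = 1/2 and B = 2: the coefficient of log n is smaller than the regular value D/2 = 1, and the log log n term appears. For the regular case H(w1,w2) = w1^2 + w2^2, the same calculation gives the largest pole z = -1 with order 1, that is, A = D/2 = 1 and B = 1.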

Main Formula 2

Let Bg, Bt, Gg, Gt be the Bayes generalization error, the Bayes training error, the Gibbs generalization error, and the Gibbs training error, respectively. Then the following formulas hold for an arbitrary true distribution, an arbitrary parametric model, an arbitrary a priori distribution, and arbitrary singularities:

E[Bg] = E[Bt] + 2b ( E[Gt] - E[Bt] ),
E[Gg] = E[Gt] + 2b ( E[Gt] - E[Bt] ),

where b is the inverse temperature of the a posteriori distribution. By using these formulas, we can estimate the Bayes and Gibbs generalization errors from the Bayes and Gibbs training errors. If a model is regular and b = 1, then n E[Gt - Bt] converges to D/2, where D is the dimension of the parameter space. Hence these formulas contain AIC as a very special case.

References:
S. Watanabe, "Algebraic analysis for nonidentifiable learning machines," Neural Computation, Vol.13, No.4, pp.899-933, 2001.
S. Watanabe, "A formula of equations of states in singular learning machines," Proceedings of WCCI, Hong Kong, 2008.

We would like to emphasize that these results were discovered here for the first time; they were unknown even in statistics, information theory, and learning theory. We also expect that these results mathematically clarify the essential difference between neural networks and regular statistical models. In other words, the reason why neural networks in Bayesian estimation are more useful than regular statistical models is proven mathematically for the first time. Algebraic geometry and algebraic analysis play an important role in the theory of complicated learning machines. For more detail, see Singular learning theory.

Mathematical Structure

Related Topic: Zeta functions

(1) As is well known, the Riemann zeta function is defined by

f(z) = \sum_{n=1}^{\infty} 1/n^{z}

which can be analytically continued to the entire complex plane except for a simple pole at z = 1. You may know the Riemann Hypothesis, which concerns the distribution of prime numbers.

(2) The zeta function of the Kullback information H(w) and the prior \phi(w) is defined by

J(z) = \int H(w)^{z} \phi(w) dw

which can be analytically continued to a meromorphic function on the entire complex plane. Professor Gel'fand conjectured in 1954 that this function is meromorphic, and Professor Bernstein and Professor Atiyah proved this fact. In Atiyah's method, Hironaka's resolution of singularities plays a central role. This function clarifies Bayesian statistics, as I have shown in the foregoing sections.

(3) The zeta function of the Replica method will be defined by

L(z)=E[ Z(X1,X2,...,Xn)^{z}]

where X1, X2, ..., Xn are training samples taken from the true distribution and E denotes the expectation value over all sets of training samples. Z(X1, X2, ..., Xn) is the partition function or the evidence of the learning. It is strongly expected that this function plays an important role in clarifying the mathematical structure of the replica method in mathematical physics.

Published Papers with Comments:



Sumio Watanabe, Ph.D.

Curriculum Vitae

Sumio Watanabe was born in Japan on March 31, 1959. He received the B.S. degree in Physics from the University of Tokyo, Japan, the M.S. degree in Mathematics from the Research Institute for Mathematical Sciences (RIMS), Kyoto University, Japan, and the Ph.D. degree in Applied Electronics in 1993 from the Tokyo Institute of Technology, Japan.

Dr. Watanabe is currently a professor in the Department of Mathematical and Computing Science at the Tokyo Institute of Technology.

His research interests include probability theory, applied algebraic geometry, and Bayesian statistics. He was the first to discover the algebraic geometrical structure in statistical learning theory, and proposed that a standard form of the likelihood function can be derived from resolution of singularities, on which a new mathematical statistics can be established. He found WAIC and WBIC, which can be used in both regular and singular statistical model evaluation.

Did you know these old works?

Sumio Watanabe, 1993, Sparse learning of neural networks
Sumio Watanabe, 1993, Solvable model of neural networks
S. Watanabe. Algebraic analysis for non-regular learning machines. Advances in Neural Information Processing Systems, Vol.12, pp.356-362, 2000.

Our pioneering achievements

Neural networks and many learning machines have singular Fisher information matrices. Almost all learning machines are singular. Algebraic geometry gives a concrete learning theory using birational invariants.

(1) S. Watanabe, Algebraic geometrical methods for hierarchical learning machines. Neural Networks, Vol.14, No.8, pp.1049-1060, 2001.
(2) K. Yamazaki, S. Watanabe. Singularities in mixture models and upper bounds of stochastic complexity. Neural Networks, Vol.16, No.7, pp.1029-1038, 2003.
(3) K. Watanabe, S. Watanabe. Stochastic complexities of Gaussian mixtures in variational Bayesian approximation. Journal of Machine Learning Research, Vol.7, pp.625-644, 2006.
(4) T. Iwagaki, S. Watanabe. Generalization Error by Langevin Equation in Singular Learning Machines. 2009 International Symposium on Nonlinear Theory and its Applications (NOLTA'09), Sapporo, Japan, October 18-21, 2009.
(5) Sumio Watanabe, Equations of states in singular statistical estimation. Neural Networks, Vol.23, No.1, pp.20-34, 2010.
(6) Sumio Watanabe, Asymptotic equivalence of Bayes cross validation and widely applicable information criterion. JMLR, 2010, pp.3571-3594.
(7) Sumio Watanabe, A widely applicable Bayesian information criterion. JMLR, 2013, pp.867-897.