## Sumio Watanabe

(Recent Paper) If the posterior can be approximated by some normal distribution, then CV and WAIC are asymptotically equivalent to each other up to the second order.

Sumio Watanabe, Higher Order Equivalence of Bayes Cross Validation and WAIC, Springer Proceedings in Mathematics and Statistics, Information Geometry and Its Applications (4), pp.47-73, 2018.

Thus, minimizations of CV and WAIC are asymptotically equivalent in prior optimization. In experiments, the variance of WAIC is smaller than that of CV.

Prior optimization by CV and WAIC (mp4)

Hyperparameter Optimization

In hyperparameter optimization, WAIC is preferable to ISCV (importance sampling cross validation) because the variance of WAIC in MCMC is smaller than that of ISCV.

The above figure shows that, in a simple regression problem Y=aX+N(0,1/s), the MCMC fluctuation is not small even if the posterior sample size of MCMC is 10000 (the sample size of the data is n=30). sample data, sample program (MATLAB).

The following figure shows the case in which the MCMC sample size is 100000 for the same problem.

Remark: In simple artificial simulations, WAIC and ISCV have almost the same values; in practical applications, however, the fluctuation of WAIC in MCMC is often smaller than that of ISCV. This phenomenon is often observed in hyperparameter optimization problems. In practical cases, I recommend that you calculate both WAIC and ISCV and compare their fluctuations in MCMC. The computational costs of WAIC and ISCV are equal.

Summary of Singular Learning Theory

Bayes Theory Essential. You can learn Bayes theory in 5 minutes.

Neural Networks and Singular Learning Theory.

Singular Learning Theory and Information Criterion.

(NEW BOOK) Mathematical theory of deep learning was already discovered. A neural network in Bayes (mp4).
Sumio Watanabe, Mathematical Theory of Bayesian Statistics, CRC Press, 2018.

Sumio Watanabe, Algebraic Geometry and Statistical Learning Theory, Cambridge University Press, 2009.

Watanabe, S. Algebraic geometrical methods for hierarchical learning machines. Neural Networks, Vol.14, No.8, pp.1049-1060, 2001. DOI: 10.1016/S0893-6080(01)00069-7

Watanabe, S. Algebraic analysis for nonidentifiable learning machines. Neural Computation, Vol.13, No.4, pp.899-933, 2001. DOI: 10.1162/089976601300014402

Applications to WAIC and WBIC
Beyond Laplace and Fisher

WAIC and WBIC

WAIC (2010) is the generalized version of AIC. WBIC (2013) is the generalized version of BIC. WAIC and WBIC can be used even if the posterior distribution is far from any normal distribution.

Both WAIC and WBIC are supported by

### Algebraic Geometry and Statistical Learning Theory

Sumio Watanabe, Algebraic Geometry and Statistical Learning Theory, Cambridge University Press, 2009.

A new statistical theory is established that holds even for non-regular models such as normal mixtures, neural networks, and hidden Markov models. The resolution theorem in algebraic geometry transforms the likelihood function into a new standard form in statistics. The asymptotic behavior of the log likelihood ratio function is given by the limit empirical process on an algebraic variety. This theory contains regular statistical theory as a very special part.

We can construct generalized concepts of AIC and BIC, even if a true distribution is unrealizable by or singular for a statistical model. In fact, WAIC and WBIC are derived in this way, and they are very easy to apply in practice. Both WAIC and WBIC are based on this completely new statistical theory. Neither positive definiteness of the Fisher information matrix, asymptotic normality of the MLE, nor the Laplace approximation is necessary in our new theory. Thus, our theory holds for a wide range of statistical models.

• Let's see the true likelihood function.
• Let's compare WAIC with DIC.
• Let's compare CV with WAIC.
• Let's compare PSISCV with WAIC.
• Let's try WBIC.

Let's compare CV and PSIS with WAIC as estimators of the generalization error.

(1) CV and WAIC should be compared with the generalization error. CV is not always the best estimator of the generalization error.

(2) WAIC is not an approximation of CV but an estimator of the generalization error. In fact, there exist cases in which WAIC can estimate the generalization error even if a sample consists of dependent variables.

(3) When n is small, WAIC is also a better estimator of the generalization error than CV.

Cross validation (CV), Pareto smoothing importance sampling cross validation (PSISCV), and WAIC are asymptotically equivalent under the condition that a sample consists of independent random variables. However, the purpose of WAIC is not to approximate CV but to estimate the generalization error. Thus we should compare them as estimators of the generalization error.

It is easy for you to conduct the same experiment.

I recommend that you see the true experimental result with your own eyes.

Experiment Description (PDF)

We study a simple regression problem, Y = aX^2 + Noise, with fixed inputs

X=0.1, 0.2, ..., 1.0.

The following shows the experimental result of 10000 random trials (sample size n=10), where

|WAIC-GE| : absolute value of the difference between WAIC and the generalization error.

|ISCV-GE| : absolute value of the difference between the importance sampling cross validation and the generalization error,

|PSIS-GE| : absolute value of the difference between the Pareto smoothing importance sampling cross validation and the generalization error.
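
As a rough illustration of the quantities compared here, the following sketch computes the training loss, WAIC, and ISCV from a matrix of pointwise log likelihoods evaluated at posterior draws. It is not the MATLAB program linked on this page: the prior a ~ N(0,1), the unit noise variance, the true value a=1, and all function and variable names are assumptions made only for this example. Because this toy model is conjugate, posterior draws are taken exactly rather than by MCMC.

```python
import numpy as np

def log_mean_exp(a):
    """Stable log of the column-wise mean of exp(a); a has shape (draws, n)."""
    m = a.max(axis=0)
    return m + np.log(np.mean(np.exp(a - m), axis=0))

def train_waic_iscv(loglik):
    """loglik[k, i] = log p(x_i | w_k) for posterior draws w_k.

    Returns (training loss, WAIC, ISCV), all as per-sample averages.
    """
    train = -np.mean(log_mean_exp(loglik))        # -(1/n) sum_i log E_w[p(x_i|w)]
    func_var = np.mean(np.var(loglik, axis=0))    # (1/n) sum_i V_w[log p(x_i|w)]
    waic = train + func_var
    iscv = np.mean(log_mean_exp(-loglik))         # (1/n) sum_i log E_w[1/p(x_i|w)]
    return train, waic, iscv

# Assumed toy setting: Y = a X^2 + N(0,1), prior a ~ N(0,1), true a = 1.
rng = np.random.default_rng(0)
x = np.arange(1, 11) / 10.0                       # X = 0.1, 0.2, ..., 1.0
y = 1.0 * x**2 + rng.normal(size=x.size)

# Exact Gaussian posterior of a (conjugate model), used here in place of MCMC.
post_prec = 1.0 + np.sum(x**4)
post_mean = np.sum(x**2 * y) / post_prec
a_draws = rng.normal(post_mean, post_prec**-0.5, size=10000)

# Pointwise log likelihood matrix, shape (draws, n).
resid = y[None, :] - a_draws[:, None] * x[None, :]**2
loglik = -0.5 * np.log(2.0 * np.pi) - 0.5 * resid**2

train, waic, iscv = train_waic_iscv(loglik)
print(f"train={train:.3f}  WAIC={waic:.3f}  ISCV={iscv:.3f}")
```

Keeping the data fixed and repeating the posterior sampling with different seeds gives a simple way to look at the fluctuation of WAIC and ISCV caused by the sampler alone.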

matlab program

Statistics of 10000 trials:

WAIC(mean,std) = 0.098, 0.123
ISCV(mean,std) = 0.116, 0.131
PSIS(mean,std) = 0.112, 0.128
GEN (mean,std) = 0.097, 0.115
mean( |WAIC-GE| ) = 0.164
mean( |ISCV-GE| ) = 0.173
mean( |PSIS-GE| ) = 0.170

WAIC is a better approximator of the generalization error than importance sampling cross validation.

The following shows histogram of |ISCV-GE|-|WAIC-GE|.

The following shows histogram of |PSIS-GE|-|WAIC-GE|.

WAIC is a better approximator of the generalization error than Pareto smoothing importance sampling cross validation.

Pareto smoothing cross validation may be a better estimator of the cross validation than WAIC; however, it is not a better estimator of the generalization error.

Another case, in which a leverage sample point is contained:

X=0.1, 0.2, ..., 0.9, 10,

where the last one is a leverage sample point.

Statistics of 10000 trials:

WAIC(mean,std) = 0.075, 0.120
ISCV(mean,std) = 0.113, 0.130
PSIS(mean,std) = 0.102, 0.123
GEN (mean,std) = 0.082, 0.100
mean( |WAIC-GE| ) = 0.148
mean( |ISCV-GE| ) = 0.165
mean( |PSIS-GE| ) = 0.158

The following shows histogram of |ISCV-GE|-|WAIC-GE|.

The following shows histogram of |PSIS-GE|-|WAIC-GE|.

The following figure shows the difference of LOOCV and WAIC as estimators of the generalization loss in the case of a linear regression problem on 5 dimensional space.

"Vertical axis >0" is equivalent to "WAIC is better than LOOCV".

WAIC is a better estimator of the generalization error than cross validation for the Poisson distribution as well.

The following shows another experiment.

If a leverage sample point is contained, then importance sampling cross validation and Pareto smoothing cross validation have larger variance than WAIC.

### Singular Learning Theory

A learning machine or a statistical model is called singular if its Fisher information matrix is singular. (A matrix A is singular if det(A)=0.) Almost all learning machines that have hidden variables or hierarchical structure are singular. In singular learning machines, asymptotic normality of the maximum likelihood estimator does not hold. We are now establishing a new statistical theory based on algebraic geometry and algebraic analysis. The asymptotic statistical theory of regular models is being generalized to singular statistical models. The theory is mathematically beautiful and statistically useful. Singular Learning Theory.
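
A small numerical check can make the definition concrete. The sketch below uses a hypothetical one-hidden-unit regression model f(x; a, b) = a tanh(bx) with unit Gaussian noise, chosen only for illustration (it is not taken from this page). For such a model the Fisher information matrix is the average over inputs of the outer product of the gradient of f with respect to (a, b), and it becomes singular wherever a = 0, because b then has no effect on the output.

```python
import numpy as np

def fisher_information(a, b, xs):
    """Fisher information of y = a*tanh(b*x) + N(0,1), averaged over inputs xs."""
    grad_a = np.tanh(b * xs)                      # d f / d a
    grad_b = a * xs / np.cosh(b * xs) ** 2        # d f / d b
    grads = np.stack([grad_a, grad_b], axis=1)    # shape (len(xs), 2)
    return grads.T @ grads / len(xs)

xs = np.linspace(-2.0, 2.0, 201)                  # assumed input design

det_regular = np.linalg.det(fisher_information(1.0, 1.0, xs))
det_singular = np.linalg.det(fisher_information(0.0, 1.0, xs))

print("det at (a, b) = (1, 1):", det_regular)     # strictly positive
print("det at (a, b) = (0, 1):", det_singular)    # zero: a singular point
```

The same degeneracy appears along the line b = 0, so the set of parameters that represent the zero function is not a single point, which is the typical situation for models with hidden units.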

### New Information

Can we optimize hyperparameters by cross validation, WAIC, DIC, and the marginal likelihood ?

Our answer is arXiv:1503.07970 (2015/March/27).

### New Research Results

• Information Geometry and its Applications IV, June 13-17, 2016, Liblice, Czech Republic.
• Prof. Vehtari, Prof. Gelman, and Prof. Gabry reported a new interesting research result, Paper by Vehtari, Gelman, and Gabry (arXiv). They recommend ISLOOCV using the Pareto Smoothed ISCV (PSISCV) rather than WAIC. My comments are as follows.
(1) In numerical calculation of importance sampling cross validation, it is known that the integration of the importance weight 1/p(x_i|w) over the posterior distribution has a large variance. The authors propose PSISCV, in which the largest 20% of the importance weights are replaced using the estimated Pareto distribution.
(2) The purpose of ISLOOCV and WAIC is to estimate the generalization error (GE), or Kullback Leibler distance between the true and estimated distributions. In the above paper, the authors only compared PSIS-ISLOOCV and WAIC as the numerical approximators of CV. It is trivial that there may exist a better numerical approximator of CV than WAIC, because the exact CV is different from the exact WAIC. However, PSISCV and WAIC should be compared with GE from the viewpoint of bias and variance. They are random variables. Please see a comparison of PSISCV with WAIC .
(3) Even if the exact CV and WAIC are obtained (without posterior sampling), CV is not always better than WAIC as the estimator of the generalization error. Please see CV and WAIC.
(4) A statistical estimation is said to involve an influential observation if a leverage sample is contained in the training set. A leverage sample affects statistical estimation very strongly. Its effect may be good or bad, depending on randomness. Thus a statistician considers whether the leverage sample should be kept or removed. On the other hand, when an influential observation is present, CV and WAIC differ from each other, and ISLOOCV has infinite variance. Therefore, in practical applications, if ISLOOCV and WAIC are different, then a statistician can recognize that a leverage sample is contained. I would like to recommend comparing ISLOOCV with WAIC as a useful way to find a leverage sample.
(5) Theoretically speaking, WAIC is not an approximator of CV. WAIC was first derived by the partial integration of the generalization error with respect to the empirical process. After this mathematical theorem was found, the asymptotic equivalence of CV and WAIC was proved.
(6) We showed that the exact CV has the larger variance than the exact WAIC in hyperparameter evaluation. arXiv:1503.07970 .
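
For reference, the importance weight 1/p(x_i|w) in comment (1) comes from a standard identity for independent samples. Writing E_w[ ] for the posterior average using all n samples and E_w^{(-i)}[ ] for the posterior average with x_i left out (notation introduced here only for illustration), the leave-one-out posterior is proportional to the full posterior divided by p(x_i|w), so that

```latex
\mathbb{E}_w^{(-i)}\bigl[p(x_i \mid w)\bigr]
  = \frac{\mathbb{E}_w\bigl[p(x_i \mid w) / p(x_i \mid w)\bigr]}
         {\mathbb{E}_w\bigl[1 / p(x_i \mid w)\bigr]}
  = \frac{1}{\mathbb{E}_w\bigl[1 / p(x_i \mid w)\bigr]},
\qquad
\mathrm{ISCV} = \frac{1}{n} \sum_{i=1}^{n}
  \log \mathbb{E}_w\!\left[\frac{1}{p(x_i \mid w)}\right].
```

When the posterior average of 1/p(x_i|w) is estimated from MCMC draws, a few draws with very small p(x_i|w) can dominate the average, which is where the large variance comes from.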

• In the prior design problem, we proved the relation between Bayesian cross validation, WAIC, and the generalization error, in regular asymptotic theory.
Sumio Watanabe, "Bayesian Cross Validation and WAIC for Predictive Prior Design in Regular Asymptotic Theory", arXiv:1503.07970 (2015/March/27).
• Dr. Tokuda, Dr. Nagata, and Prof. Okada published a new research result on the RLCT of RBF networks. Tokuda, Nagata, Okada, A numerical analysis of learning coefficient in radial basis function network, IPSJ Transactions on Mathematical Modeling and Its Applications.
• Prof. Aki Vehtari and Prof. Andrew Gelman published a very important new research result, Aki Vehtari and Andrew Gelman, WAIC and cross-validation in Stan.
• Prof. M.B.Hooten and Prof. N.T.Hobbs published a guide monograph, M.B.Hooten and N.T.Hobbs, A Guide to Bayesian Model Selection for Ecologists. This paper contains several mistakes. I would like to add several comments.
(1) WAIC, a widely applicable information criterion, is a statistic which needs neither the Laplace approximation nor Fisher asymptotic theory. "Widely applicable" means that it is available even when the posterior is far from any Gaussian function. On the other hand, DIC is a criterion which requires posterior normality. The proposal of WAIC and its mathematics were first given in S.Watanabe, Equations of states in singular statistical estimation. Neural Networks, Vol.23, No.1, 2010, pp.20-34. No earlier paper gave an information criterion that can be used even for a singular posterior distribution.
(2) In the other 2010 paper, the asymptotic equivalence of cross validation and WAIC was proved: S. Watanabe, Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory. Journal of Machine Learning Research, Vol.11, pp.3571-3594, 2010.
(3) If AIC, DIC, or cross validation is applicable, then WAIC is also applicable. In other words, if WAIC cannot be applied to a statistical model, then no information criterion is applicable.
(4) In the 2013 paper, WAIC was NOT studied: S. Watanabe, A Widely Applicable Bayesian Information Criterion, Journal of Machine Learning Research, Vol.14, pp.867-897, 2013. In the 2013 paper, a new method, WBIC, was introduced, which enables us to compute the marginal likelihood at quite low computational cost. WBIC can be used even when the posterior is singular.
(5) WAIC needs the assumption that samples are taken from a probability distribution and that the expectation value over all sample sets is statistically well-defined. If this assumption cannot be adopted, then WAIC is NOT available. In such cases, there is no information criterion that can be used.
• Dr. Longhai Li, Dr. Shi Qiu, Dr. Bei Zhang, and Dr. Cindy X. Feng propose integrated IS and WAIC for statistical models with latent variables, Longhai Li, Shi Qiu, Bei Zhang, Cindy X. Feng, Approximating Cross-validatory Predictive Evaluation in Bayesian Latent Variables Models with Integrated IS and WAIC, arXiv:1404.2918.
• The great statisticians who created DIC (deviance information criterion) published a new important article, David J. Spiegelhalter, Nicola G. Best, Bradley P. Carlin and Angelika van der Linde, "The deviance information criterion: 12 years on", Journal of the Royal Statistical Society: Series B (Statistical Methodology) April 8 2014, DOI: 10.1111/rssb.12062. All statisticians will be interested in this article. I'm very glad to build a bridge, DIC === WAIC === Cross Validation.
• Prof. Akimichi Takemura, Prof. Caroline Uhler, and Prof. Ruriko Yoshida organize a conference on Algebraic Statistics , July 14-17, 2014, NIMS, South Korea.
• Professor Miki Aoyagi published a new paper. Consideration on Singularities in Learning Theory and the Learning Coefficient. Entropy 2013, 15, 3714-3733; doi:10.3390/e15093714.
• Professor Mathias Drton and Professor Martyn Plummer proposed a new method to estimate BIC in singular models. A Bayesian information criterion for singular models, arXiv:1309.0911. In the Drton-Plummer method, real log canonical thresholds (RLCTs) are used but MCMC processes are not.
• I would like to thank Professor Kenji Yamanishi and Professor Jun'ichi Takeuchi for WITMISE2013, August 26-29 .
• I would like to thank Professor Russell Steele for the invited session of singular statistical model selection, in Joint Statistical Meeting 2013, August 3-8 .
• I would like to thank Professor Mathias Drton for the special session of singular learning theory, in SIAM Applied Algebraic Geometry 2013 , SIAM Applied Algebraic Geometry 2013, August 1-4, 2013.
• A.Gelman, J.B.Carlin, H.S.Stern, D.B. Dunson, A.Vehtari, and D.B.Rubin, Bayesian Data Analysis, 3rd Edition, Chapman and Hall/CRC was published, which is the most popular and excellent textbook of Bayesian statistics.
• The great Bayes statisticians, Prof. A. Gelman, Prof. J.Hwang, and Prof. A. Vehtari, gave an evaluation of AIC, DIC, WAIC, and CV (cross-validation) as predictive information criteria in the paper, Andrew Gelman, Jessica Hwang, Aki Vehtari, "Understanding predictive information criteria for Bayesian models," Statistics and Computing, DOI 10.1007/s11222-013-9416-2, 2013.
(Remark 1) If we want to find the true model among several candidate models, then WBIC is better than WAIC. On the other hand, if we want to estimate the generalization error, then WAIC is useful. Both WAIC and WBIC can be employed under any circumstance, for example, even if the posterior distribution is far from any normal distribution.
(Remark 2) From the theoretical point of view, WAIC is important because the nontrivial variance of CV can be derived from WAIC theory. Let S and Sn be respectively the average and empirical entropies of the true distribution. Also let BgL be the Bayes generalization loss of the Bayes predictive distribution. We can derive the fact that the variance of (BgL-S) is asymptotically equal to that of (WAIC-Sn), even in singular case. Therefore, by the asymptotic equivalence of WAIC and CV, the same relation is derived for cross-validation, which is not trivial. Theorem 2 in (JMLR,2010) shows such relations.
• Aki Vehtari and Janne Ojanen, A survey of Bayesian predictive methods for model assessment, selection and comparison, Statistics Surveys, Vol.6, pp.142-228, 2012.
• Singularities in Geometry and Topology
• STAN is the package for obtaining Bayesian inference using the No-U-Turn sampler, a variant of Hamiltonian Monte Carlo. The method of constructing the posterior distribution is very important for Bayes statistics.
• JAGS is a program for analysis of Bayesian hierarchical models using Markov Chain Monte Carlo (MCMC) simulation.
• BUGS project is concerned with flexible software for the Bayesian analysis of complex statistical models using Markov chain Monte Carlo (MCMC) methods.
• Dr. Jarno Vanhatalo, Dr. Jaakko Riihimäki, Dr. Jouni Hartikainen, Dr. Pasi Jylänki, Dr. Ville Tolvanen, and Dr. Aki Vehtari at Aalto University reported the project GPstuff toolbox, GPstuff: Bayesian Modeling with Gaussian Processes. The software is on Machine Learning Open Source Software.
• We established a widely applicable Bayesian information criterion (WBIC).
It is well known that the computational cost in numerical calculation of the Bayes marginal likelihood is quite high. Its asymptotic expansion is given by the formula using the real log canonical threshold (RLCT), but we do not know the true distribution in practical applications, hence RLCT is also unknown. To overcome such difficulty, we defined WBIC and proved by an algebraic geometrical method that WBIC has the same asymptotic expansion as the Bayes marginal likelihood, even if a true distribution is singular for or unrealizable by a statistical model. By using WBIC, we can estimate both the Bayes marginal and the real log canonical threshold with a low computational cost without any information about the true distribution.
A widely applicable Bayesian information criterion, arXiv:1208.6338 (2012/8/31). The revised paper was published in Journal of Machine Learning Research. A widely applicable Bayesian information criterion .
(Remark 1) WAIC and WBIC are respectively generalized concepts of AIC and BIC for singular statistical models; they are asymptotically unbiased estimators of the generalization loss and the marginal likelihood, respectively. Recall the difference between AIC and BIC: AIC is an asymptotically unbiased estimator of the generalization error but it does not have consistency in model selection, whereas BIC is not an asymptotically unbiased estimator of the generalization error, but it has consistency in model selection. The difference between WAIC and WBIC is conjectured to be the same as that between AIC and BIC. In Bayes estimation, the generalization errors of singular models are smaller than those of regular models, even if a statistical model is redundant compared to a true distribution. Therefore WAICs are also smaller. If one wants to select the true model, WBIC may be better than WAIC under the condition that the true model is contained in the candidate models.
(Remark 2) If you are a Bayes statistician, you may have software for MCMC. It is very easy to calculate WBIC with your software at low computational cost. First, the inverse temperature is set to 1/log n; second, one MCMC procedure is conducted. That's all. Let's try WBIC.
(Remark 3) If you are a mathematician, please enjoy the fact that there exists a random variable which converges to the real log canonical threshold in probability, even if we do not know a true distribution. A birational invariant has naturalness in both mathematics and statistics. It is well known in algebraic geometry that the log canonical threshold indicates the relative singular complexity of a pair of algebraic varieties. In statistics, it shows the relative statistical complexity of a pair consisting of the set of true parameters and that of all parameters.
• Recently, tensor networks, graphical models, and deep learning are being well studied. Professor Jason Morton gave a very good presentation, J. Morton, An algebraic perspective on deep learning . As he pointed out, almost all learning machines that have deep layers have a lot of singularities, hence statistical evaluation of them needs singular learning theory. The typical singular learning machine is a layered neural network. I would like to claim that both their generalization performance and Bayes free energy can be estimated by using WAIC and WBIC.
• Recently, the very important advances in algebraic statistics and algebraic machine learning are reported. New mathematics and statistics will be created from these studies.
• The workshop "Algebraic geometry and model selection" was successfully held at AIM. I would like to thank Prof. Russell Steele, Prof. Bernd Sturmfels, and all participants of this workshop. I would also like to thank the American Institute of Mathematics.
Singular Learning Wiki
Singular Learning in UC Berkeley Math
• K. Yamada and I introduced a new statistical concept, "Quasi-Regular Cases", which are not regular cases but have the same properties as the regular cases. For quasi-regular cases, we can derive both the real log canonical threshold and the singular fluctuation, hence they are useful to study singular statistical estimation. arXiv:1111.1832 (2011/Nov/10).
• E. Riegler, V. Morgenshtern, G. Durisi, S. Lin, B. Sturmfels, and H. Bolcskei found a new lower bound on the noncoherent capacity pre-log of SIMO channel, by algebraic geometrical method. http://arxiv.org/abs/1105.6009 . (2011/June/02).
• Professor I. Ojima and K. Okamura in Research Institute for Mathematical Sciences (RIMS) propose large deviation strategy in mathematical quantum estimation theory. arxiv:1101.3690v1 (2011/Jan/20).
• We proved that Bayes cross validation (BCV) is asymptotically equivalent to WAIC even for singular models.
Also we showed that the sum of BCV and Bayes generalization error is asymptotically equal to 2*lambda/n as a random variable, where lambda is the log canonical threshold and n is the number of training samples. Note that the expectation value of BCV is equal to that of the generalization error by definition. However, the definition alone does not give the variance of BCV. Since the variance of WAIC is equal to that of the generalization error, it is also proved that the variance of BCV is equal to that of the generalization error. It should be emphasized that the variance of WAIC cannot be derived without algebraic geometry. arXiv:1004.2316 (2010/April/15). This paper was improved: the theoretical and experimental comparison of DIC, BCV, and WAIC was included. In singular statistical models, DIC is different from BCV and WAIC, even asymptotically (2010/Oct/15). This paper was published in Journal of Machine Learning Research, Vol.11 (DEC), pp.3571-3594, 2010 (2010/DEC/06).
• Dr. M. Aoyagi in Nihon University gave the resolution of singularities for a restricted Boltzmann machine and derived its log canonical threshold. JMLR, Vol.11 . (2010/April)
• Dr. Shaowei Lin in UC Berkeley and Dr. Piotr Zwiernik in Warwick Univ. found new research results in algebraic geometry, algebraic statistics, and symbolic computation. S. Lin's paper and P. Zwiernik's paper, new paper. (2010/March/30), (2010/Dec/07).
• The universal learning curve is proved under the renormalizable condition, even if a true distribution is unrealizable and singular for a learning machine. Sumio Watanabe, "Asymptotic Learning Curve and Renormalizable Condition in Statistical Learning Theory," arXiv:1001.2957, 2010. (2010/Jan/18), Journal of Physics Conference Series, Vol.233, No.1, 2010, 012014. doi: 10.1088/1742-6596/233/1/012014, Journal PDF
• International Conference on Mathematical Quantum Field Theory and Renormalization Theory, Chairs: Professor Takashi Hara, Professor Taku Matsui, and Professor Fumio Hiroshima. , A Singular Limit Theorem in Statistical Learning Theory (Kyushu, Japan, 26-29/Nov/2009).
• The fifth Symposium on Singularities (IRMA Institute, France, 24-28/Aug/2009). The PDF file .
• Professor Sturmfels visited Japan by Professor Takemura's invitation. (2009/July/6-10).
• Equations of states hold even if the true distribution is not contained in a parametric model.
Sumio Watanabe, Equations of States in Statistical Learning for a Nonparametrizable and Regular Case, arXiv:0906.0211 (1/June/2009). This paper was published under the title "Equations of states in statistical learning for an unrealizable and regular case," IEICE Transactions, Vol.E93-A, No.3, pp.617-626, 2010.
• AIC in a regression problem is generalized so that it can be used in singular statistical models.
Sumio Watanabe, A limit theorem in singular regression problem, arXiv:0901.2376v1 (16/Jan/2009). This paper was accepted for publication in Advanced Studies of Pure Mathematics , Also see here (9/Aug/2009).
• Algebraic Statistics (15-18/Dec/2008)
• 2008-09 Program on Algebraic Methods in Systems Biology and Statistics (14-17/September/2008)
• Probabilistic Approach to Geometry (July 28 - August 8 /2008).
• We proved that exchange probability in Monte Carlo is determined by the real log canonical threshold.
Kenji Nagata, Sumio Watanabe, "Asymptotic Behavior of Exchange Ratio in Exchange Monte Carlo Method," International Journal of Neural Networks, Vol. 21, No. 7, pp. 980-988, 2008.
• In World Congress on Computational Intelligence, the following paper was published, S.Watanabe, "A formula of equations of states in statistical estimation," Proc. of WCCI, Hong Kong, 2008, June, 1-6. (Jun/2008).
• The widely applicable information criterion (WAIC) was discovered.
We found the Equations of states in singular statistical estimation , by which we can predict both Bayes and Gibbs generalization errors from Bayes and Gibbs training errors without any knowledge of the true distribution (5/Dec/2007). The most important fact is that these equations hold even if the true distribution is singular for a statistical model. This paper was submitted (5/Dec/2007), and published in International Journal of Neural Networks, Vol.23, No.1, pp.20-34, 2010. The review process was one year and eight months.
Misprint. In table.1, E[WAIC1] shows the expectation value of E[WAIC-Sn].
• Niigata Workshop on Complex Geometry and Singularities (26/July/2007).
• Professor Morihiko Saito in Research Institute for Mathematical Sciences in Kyoto University proved the relation between the real log canonical threshold and roots of b-function. On real log canonical thresholds . (17/July/2007).
• S. Watanabe, "Almost all learning machines are singular," invited paper in IEEE FOCI 2007 (2007/4/5).
• At the Clay Mathematics Institute, a workshop on Algebraic Statistics and Computational Biology (2005/Nov/11-15).
• The following paper has been published in Neurocomputing. Geometrical methods in neural network and learning. Sumio Watanabe, "Algebraic geometry of singular learning machines and symmetry of generalization and training errors," Neurocomputing, Vol.67, pp.198-213, 2005. PDF file. (2005)
• We derived the real log canonical threshold for reduced rank regression in arbitrary case. (2005)
M.Aoyagi, S. Watanabe, "Stochastic complexities of reduced rank regression in Bayesian estimation," PDF file, Neural Networks, Vol.18,No.7,pp.924-933,2005.
• In 2003, we clarified the effect of singularities in a normal mixture.
K.Yamazaki, S.Watanabe, "Singularities in mixture models and upper bounds of stochastic complexity." International Journal of Neural Networks, Vol.16, No.7, pp.1029-1038, 2003.
• AMS Meeting : Biological Computation and Learning in Intelligent Systems AMS 2002 Fall Central Section Meeting was held in Madison, Wisconsin, October 12-13, 2002 University of Wisconsin. (2002).
• Frontier of Non-Commutative Analysis and Mathematical Quantum Theory (2002).
• We discovered that Bayes marginal is given by algebraic analysis and algebraic geometry.
Sumio Watanabe,"Algebraic analysis for nonidentifiable learning machines", Neural Computation, Vol.13, No.4, pp.899-933, 2001.
The communicator of this paper was Professor David Mumford. (2001)
• In 2001, we proved that the Bayesian stochastic complexity using Jeffreys' prior is asymptotically equal to that of regular statistical model.
S.Watanabe,"Algebraic information geometry for learning machines with singularities", Advances in Neural Information Processing Systems, (Denver, USA), pp.329-336. 2001.
• In 2000, algebraic theory in statistical learning was clarified. (2000)
S.Watanabe,"Algebraic analysis for non-regular learning machines," Advances in Neural Information Processing Systems, Vol.12, 2000, 356-362.
• In 1999, we reported the algebraic analysis of learning theory.
S.Watanabe,"Algebraic analysis for singular statistical estimation," Lecture Notes in Computer Sciences, Vol.1720, pp.39-50, 1999.
• In 1998, we found a bridge between pure mathematics and statistical learning theory. B-function and learning theory
• In 1995, we pointed out that the Bayesian posterior distribution is singular.
S.Watanabe, "A generalized Bayesian framework for neural networks with singular Fisher information matrices," Proc. of International Symposium on Nonlinear Theory and Its Applications, (Las Vegas), pp.207-210, 1995.
• In 1994, a method to find the optimal structure of a neural network was proposed. (1994)
S.Watanabe, "An optimization method of artificial neural networks based on the modified information criterion," Advances in Neural Information Processing Systems, Morgan Kaufmann, New York, Vol.6, pp.293-300, 1994.

The Annals of Mathematics
Inventiones Mathematicae
Communications in Mathematical Physics
The Annals of Statistics
Journal of the Royal Statistical Society
Mathematical Journals
AMS
MathSciNet
Applied Mathematics Research Express (AMRX)
Neural Networks
Journal of Machine Learning Research
Geometry and Statistics in Neural Network Learning Theory
Mathematical Calendar
ACM Calendar
Scirus
AMS Meeting
Math. and Phys.
Open Problems
Research Institute for Mathematical Sciences
Institute for Advanced Study, Princeton , Math Link
Isaac Newton Institute for Mathematical Sciences
Institut de Recherche Mathematique Avancee
The Clay Mathematics Institute
Mathematical Sciences Research Institute
American Institute of Mathematics
Statistical and Applied Mathematical Sciences Institute
Institute for Mathematics and Its Applications
Minimum Description Length
Neural Nets
ISAAC

### People

Professor Huzihiro Araki (Mathematical Science) Professor Araki, a Poincare medalist, is famous for contribution to operator algebra and mathematical physics.
Hal Tasaki (Theoretical Physics)
Takashi Hara (Mathematical Physics)
Tadayuki Takahashi (Astro Physics)
David Mumford (Algebraic geometry, Pattern Theory)
Stephen E. Fienberg (Mathematical Statistics)
Angelika van der Linde (Mathematical Statistics)
Andrew Gelman (Statistics and Political Science)
Aki Vehtari (Statistics and Brain Science)
Bernd Sturmfels (Algebraic Statistics)
Nobuki Takayama (Algebraic Analysis)
Akimichi Takemura (Mathematical Statistics)
Giovanni Pistone (Algebraic Statistics)
Lior Pachter (Mathematical Biology)
Seth Sullivant (Algebraic Statistics)
Russell Steele (Mathematical Statistics)
Mathias Drton (Algebraic Statistics)
Ruriko Yoshida (Algebraic Statistics)
Luis David Garcia Puente (Algebraic Statistics)
Jason Morton (Algebraic Statistics)
Shaowei Lin (Algebraic Statistics)
Piotr Zwiernik (Algebraic Statistics)
Caroline Uhler (Algebraic Statistics)
Dan Geiger (Computer Science)
Dmitry Rusakov (Computer Science)
Miki Aoyagi (Complex Manifold Theory)
Keisuke Yamazaki (Statistical Learning Theory)
Shinichi Nakajima (Statistical Learning Theory)
Kazuho Watanabe (Statistical Learning Theory)
Kenji Nagata (Statistical Learning Theory)

********** Sumio Watanabe Homepage **********

## Research Field:

Probability Theory, Mathematical Statistics, and Learning Theory.

## Research Purpose:

(1) To establish a mathematical foundation for statistical learning.
(2) To construct a new research field between mathematics and neuroscience.

## Recently Published Books:

(1) S. Watanabe, "Neural Networks for Robotic Control - Theory and Applications," Prentice Hall, 1996.
(2) S. Watanabe and K. Fukumizu, "Algorithms and Architectures," Academic Press, 1998.
(3) S. Watanabe, "Algebraic Geometry and Statistical Learning Theory," Cambridge University Press, August 2009.

## Watanabe's Main Formulas

If you are a mathematician or a statistician, you can understand the importance of the following formulas. From the mathematical point of view, these formulas clarify the relation between algebraic geometry and statistics. From the statistical point of view, they are generalizations of BIC and AIC to singular statistical models. These two main formulas are mathematically very beautiful and statistically very useful.

Main Formula 1

Let X1, X2, ..., Xn be random variables that are independently drawn from the probability distribution q(x)dx. Even if the Fisher information matrix of a statistical model p(x|w) is degenerate, the following formula holds. The stochastic complexity, that is, the negative logarithm of the Bayes marginal likelihood,

F = -log \int p(X1|w) p(X2|w) ... p(Xn|w) \phi(w) dw

has the asymptotic expansion

F= nS + A log n - (B-1) log log n + R

where nS is equal to n times the empirical entropy of the true distribution, A is a positive rational number, B is a natural number, and R is a random variable of constant order. Here (-A) is the largest pole of the zeta function of the statistical model and B is the order of that pole,

J(z)= \int H(w)^{z} \phi(w)dw

which can be analytically continued to a meromorphic function on the entire complex plane. Here H(w) is the Kullback distance from the true distribution q(x)dx to the parametric model p(x|w)dx. The zeta function J(z) is the mathematical bridge between statistics and algebraic geometry. The expectation value of the Bayes generalization error is asymptotically equal to

E[Bg] = A/n - (B-1)/(n log n) + o(1/(nlog n)),

where E[ ] is the expectation value over all sets of random samples.

We can also calculate A and B algorithmically by applying Hironaka's resolution of singularities to the Kullback information, and we obtain that A is at most D/2 and B is at most D, where D is the number of parameters. The constants A and B are determined by the algebraic geometrical structure of the learning machine. If a model p(x|w) is regular, then A=D/2 and B=1, hence this formula is a generalization of BIC and MDL.
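As a rough illustration of how the expansion in Main Formula 1 can be used, the following sketch evaluates the leading terms nS + A log n - (B-1) log log n and compares a singular case with the regular (BIC) case A=D/2, B=1. The function names and the example values (n=1000, S=1.2, A=1/2, B=2, D=2) are assumptions chosen for illustration, and the constant-order random term R is dropped.

```python
import math


def free_energy_leading_terms(n, S, A, B):
    """Leading terms of Main Formula 1: n*S + A*log(n) - (B-1)*log(log(n)).

    The constant-order random variable R is omitted (illustrative sketch).
    """
    if n < 2:
        raise ValueError("n must be at least 2 so that log(log(n)) is defined")
    return n * S + A * math.log(n) - (B - 1) * math.log(math.log(n))


def bic_leading_terms(n, S, D):
    """Regular case of the same formula: A = D/2 and B = 1."""
    return free_energy_leading_terms(n, S, D / 2.0, 1)


if __name__ == "__main__":
    n, S = 1000, 1.2  # hypothetical sample size and entropy term
    singular = free_energy_leading_terms(n, S, A=0.5, B=2)
    regular = bic_leading_terms(n, S, D=2)
    print("singular leading terms:", singular)
    print("regular (BIC) leading terms:", regular)
```

Because A is at most D/2, the singular penalty A log n never exceeds the regular penalty (D/2) log n, and the log log n term reduces it further when B is larger than 1.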

Main Formula 2

Let Bg, Bt, Gg, Gt be Bayes generalization error, Bayes training error, Gibbs generalization error, and Gibbs training error, respectively. Then the following formulas hold for arbitrary true distribution, arbitrary parametric model, arbitrary a priori distribution, and arbitrary singularities.

E[Bg]=E[Bt]+2bE[Gt-Bt],
E[Gg]=E[Gt]+2bE[Gt-Bt],

where b is the inverse temperature of the a posteriori distribution. By using these formulas, we can estimate Bayes and Gibbs generalization errors from Bayes and Gibbs training errors. If a model is regular then E[Gt-Bt]=D/2, where D is the dimension of the parameter space. Hence these formulas contain AIC as a very special case.

For references,
S. Watanabe, "Algebraic analysis for nonidentifiable learning machines," Neural Computation, Vol.13, No.4, pp.899-933, 2001.
S. Watanabe, "A formula of equations of states in singular learning machines," Proceedings of WCCI, Hong Kong, 2008.
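The first equation of Main Formula 2 can be sketched numerically as follows. This is a minimal illustration, not the author's program: it replaces the expectations by empirical values, works with losses (negative log likelihoods) rather than errors, and assumes that an array of log likelihoods log p(X_i|w_s) for posterior draws w_s at inverse temperature b is already available. The function name, the toy Gaussian model, and the flat prior are all assumptions.

```python
import numpy as np


def equation_of_state_estimate(loglik, beta=1.0):
    """Estimate the Bayes generalization loss from training quantities.

    loglik[s, i] = log p(X_i | w_s), where w_s are draws from the posterior
    at inverse temperature beta (array of shape S x n).
    """
    loglik = np.asarray(loglik, dtype=float)
    n_draws = loglik.shape[0]
    # Bayes training loss: -(1/n) sum_i log( (1/S) sum_s p(X_i | w_s) ),
    # computed with a log-sum-exp shift for numerical stability.
    shift = loglik.max(axis=0)
    log_pred = shift + np.log(np.exp(loglik - shift).sum(axis=0)) - np.log(n_draws)
    bayes_train = -log_pred.mean()
    # Gibbs training loss: -(1/n) sum_i (1/S) sum_s log p(X_i | w_s).
    gibbs_train = -loglik.mean()
    # Bg = Bt + 2 b (Gt - Bt), with expectations replaced by empirical values.
    bayes_gen = bayes_train + 2.0 * beta * (gibbs_train - bayes_train)
    return bayes_train, gibbs_train, bayes_gen


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 50
    x = rng.normal(size=n)  # toy data from N(0, 1)
    # Toy model N(w, 1) with a flat prior: the posterior of w is N(mean(x), 1/n).
    w = rng.normal(loc=x.mean(), scale=1.0 / np.sqrt(n), size=2000)
    loglik = -0.5 * np.log(2.0 * np.pi) - 0.5 * (x[None, :] - w[:, None]) ** 2
    print(equation_of_state_estimate(loglik, beta=1.0))
```

By Jensen's inequality the Gibbs training loss is never smaller than the Bayes training loss, so the correction term 2b(Gt-Bt) is nonnegative.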

We claim that these results were discovered here for the first time; they were previously unknown even in statistics, information theory, and learning theory. We also expect that these results mathematically clarify the essential difference between neural networks and regular statistical models. In other words, the reason why neural networks in Bayesian estimation are more useful than regular statistical models has been proven mathematically for the first time. Algebraic geometry and algebraic analysis play an important role in the theory of complicated learning machines. For more detail, see Singular learning theory.

Mathematical Structure

## Related Topic: Zeta functions

(1) As is well known, the Riemann zeta function is defined by

f(z)=\sum_{n=1}^{\infty} 1/n^{z}

which can be analytically continued to a meromorphic function on the entire complex plane. You probably know the Riemann hypothesis, which, if proven, would clarify the distribution of prime numbers.

(2) The zeta function of the Kullback information H(w) and the prior \phi(w) is defined by

J(z)=\int H(w)^{z}\phi(w)dw

which can be analytically continued to a meromorphic function on the entire complex plane. Professor Gel'fand conjectured in 1954 that this function is meromorphic, and both Professor Bernstein and Professor Atiyah proved it. In Atiyah's method, Hironaka's resolution of singularities plays a central role. This function clarifies Bayesian statistics, as shown in the foregoing sections.

(3) The zeta function of the Replica method will be defined by

L(z)=E[ Z(X1,X2,...,Xn)^{z}]

where X1,X2,...,Xn are training samples taken from the true distribution and E denotes the expectation value over all sets of training samples. Z(X1,X2,...,Xn) is the partition function or the evidence of the learning. It is strongly expected that this function plays an important role in clarifying the mathematical structure of the Replica method in mathematical physics.
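As a small worked example of the zeta function in (2), take D=2, H(w)=w1^2 w2^2, and the uniform prior \phi(w)=1 on the unit square. These choices are assumptions made only for illustration and are not tied to a particular learning machine. For Re(z) > -1/2 the integral factorizes:

```latex
J(z) = \int_0^1 \int_0^1 w_1^{2z} w_2^{2z} \, dw_1 \, dw_2
     = \left( \int_0^1 w^{2z} \, dw \right)^{2}
     = \frac{1}{(2z+1)^{2}} .
```

The right-hand side continues J(z) to the whole complex plane with a single pole at z=-1/2 of order 2, so A=1/2 and B=2 in Main Formula 1. A regular model with D=2 would instead give A=D/2=1 and B=1, which is consistent with the bounds that A is at most D/2 and B is at most D.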

Papers

## Sumio Watanabe, Ph. D.

Curriculum Vitae

Sumio Watanabe was born in Japan, March 31, 1959. He received the B.S. degree in Physics from University of Tokyo, Japan, the M.S. degree in Mathematics from Research Institute for Mathematical Sciences (RIMS), Kyoto University, Japan, and the Ph. D. degree in applied electronics in 1993 from Tokyo Institute of Technology, Japan.

Dr. Watanabe is currently a professor of Mathematical and Computing Science at Tokyo Institute of Technology.

His research interests include probability theory, applied algebraic geometry, and Bayesian statistics. He was the first to discover the algebraic geometrical structure in statistical learning theory, and he proposed that the standard form of the likelihood function can be derived from resolution of singularities, on which a new mathematical statistics can be established. He found WAIC and WBIC, which can be used to evaluate both regular and singular statistical models.

Did you know these old works?

Sumio Watanabe, 1993, Sparse learning of neural networks
Sumio Watanabe, 1993, Solvable model of neural networks
S. Watanabe. Algebraic analysis for nonregular learning machines. Vol.12, 2000, 356-362.

Our pioneering achievements

Neural networks and many learning machines have singular Fisher information matrices. Almost all learning machines are singular. Algebraic geometry gives a concrete learning theory using birational invariants.

(1) S. Watanabe, Algebraic geometrical methods for hierarchical learning machines. Neural Networks, Vol.14, No.8, pp.1049-1060, 2001.
(2) K. Yamazaki, S. Watanabe. Singularities in mixture models and upper bounds of stochastic complexity. Neural Networks, Vol.16, No.7, pp.1029-1038, 2003.
(3) K. Watanabe, S. Watanabe. Stochastic complexities of Gaussian mixtures in variational Bayesian approximation. Journal of Machine Learning Research, Vol.7, pp.625-644, 2006.
(4) T. Iwagaki, S. Watanabe. Generalization Error by Langevin Equation in Singular Learning Machines. 2009 International Symposium on Nonlinear Theory and its Applications (NOLTA'09), Sapporo, Japan, October 18-21, 2009.
(5) S. Watanabe, Equations of states in singular statistical estimation. Neural Networks, Vol.23, No.1, pp.20-34, 2010.
(6) S. Watanabe, Asymptotic equivalence of Bayes cross validation and widely applicable information criterion. JMLR, 2010, pp.3571-3594.
(7) S. Watanabe, A widely applicable Bayesian information criterion. JMLR, 2013, pp.867-897.