Sumio Watanabe




(Postal Mail)


Sumio Watanabe, Ph.D.

Professor,
Department of Mathematical and Computing Science,
Tokyo Institute of Technology,
Mail-Box W8-42, 2-12-1, Ookayama, Meguro-ku, Tokyo,
152-8552, Japan


(E-mail)

swatanab (AT) c . titech . ac . jp
Japanese Homepage
Google Scholar
DBLP: Computer Science Bibliography
Paper Information

Return to Watanabe Lab.

Algebraic Geometry and Learning Theory

In 1998, we found a bridge between algebraic geometry and learning theory.



Please search for WAIC in studies of COVID-19.

WAIC in Practical Problems




Seasonal Institute of Mathematics: Deepening and Evolution of Applied Singularity Theory, November 20-25, 2022, Yokohama, Japan. Singularity Theory in Statistical Science

Statistical problems caused by singularities can be mathematically resolved based on algebraic geometry.



Have you seen the true shape of the posterior distribution?
Enjoy Mathematics




Sumio Watanabe, Recent Advances in Algebraic Geometry and Bayesian Statistics
to appear in Half a Century of Information Geometry

This article reviews theoretical advances in the research field of algebraic geometry and Bayesian statistics over the last two decades. Many statistical models and learning machines that contain hierarchical structures or latent variables are called nonidentifiable, because the map from a parameter to a statistical model is not one-to-one. In nonidentifiable models, both the likelihood function and the posterior distribution have singularities in general, hence it was difficult to analyze their statistical properties. However, from the end of the 20th century, a new theory and methodology based on algebraic geometry have been established which enable us to investigate such models and machines in the real world. In this article, the following recent advances are reported. First, we explain the framework of Bayesian statistics and introduce a new perspective from birational geometry. Second, two mathematical solutions are derived based on algebraic geometry: an appropriate parameter space can be found by a resolution map, which makes the posterior distribution normal crossing and the log likelihood ratio function well-defined. Third, three applications to statistics are introduced: the posterior distribution is represented in the renormalized form, the asymptotic free energy is derived, and the universal formula among the generalization loss, the cross validation, and the information criterion is established. The two mathematical solutions and three statistical applications based on algebraic geometry reported in this article are now being used in many practical fields in data science and artificial intelligence.

arXiv: 2211.10049
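
For orientation, here is a compact summary in the standard notation of singular learning theory (my notation, not quoted from the article). By Hironaka's resolution of singularities, there exists a birational map w = g(u) in each local chart such that the Kullback-Leibler divergence K(w) and the prior \varphi(w) take normal crossing form,

\[ K(g(u)) = u_1^{2k_1} \cdots u_d^{2k_d}, \qquad \varphi(g(u))\,|g'(u)| = b(u)\, u_1^{h_1} \cdots u_d^{h_d}, \quad b(u) > 0, \]

and the real log canonical threshold is \lambda = \min_j (h_j + 1)/(2 k_j), minimized over all local charts. From this, the asymptotic free energy in the realizable case is F_n = n S_n + \lambda \log n + o_p(\log n), where S_n is the empirical entropy, and the universal relation

\[ \mathbb{E}[G_n] = \mathbb{E}[C_n] + o(1/n) = \mathbb{E}[W_n] + o(1/n) \]

holds among the generalization loss G_n, the cross-validation loss C_n, and WAIC W_n.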



Sumio Watanabe, Mathematical Theory of Bayesian Statistics for Unknown Information Source,
to appear in Philosophical Transactions of the Royal Society A.

In statistical inference, uncertainty is unknown and all models are wrong. That is to say, a person who makes a statistical model and a prior distribution is simultaneously aware that both are fictional candidates. To study such cases, statistical measures have been constructed, such as cross validation, information criteria, and marginal likelihood; however, their mathematical properties have not yet been completely clarified when statistical models are under- or over-parametrized. We introduce a framework of mathematical theory of Bayesian statistics for unknown uncertainty, which clarifies general properties of cross validation, information criteria, and marginal likelihood, even if an unknown data-generating process is unrealizable by a model or even if the posterior distribution cannot be approximated by any normal distribution. Hence it gives a helpful standpoint for a person who cannot believe in any specific model and prior. This paper consists of three parts. The first is a new result, whereas the second and third are well-known previous results with new experiments. We show that there exists a more precise estimator of the generalization loss than leave-one-out cross validation, that there exists a more accurate approximation of the marginal likelihood than BIC, and that the optimal hyperparameters for the generalization loss and the marginal likelihood are different.
https://arxiv.org/abs/2206.05630, GitHub
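
For reference, the quantities compared in this line of work can be written as follows (standard definitions, in my notation rather than the paper's). Given data X_1, ..., X_n, a model p(x|w), and a prior \varphi(w),

\[ p^*(x) = \int p(x \mid w)\, p(w \mid X_1, \dots, X_n)\, dw, \qquad G_n = -\,\mathbb{E}_X[\log p^*(X)], \]

\[ C_n = -\frac{1}{n} \sum_{i=1}^{n} \log p(X_i \mid \{X_j\}_{j \ne i}), \qquad F_n = -\log \int \prod_{i=1}^{n} p(X_i \mid w)\, \varphi(w)\, dw, \]

where p^* is the Bayesian predictive distribution, G_n the generalization loss, C_n the leave-one-out cross-validation loss, and F_n the free energy, that is, minus the log marginal likelihood.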





Sumio Watanabe, Mathematical theory of Bayesian statistics where all models are wrong. Advancements in Bayesian Methods and Implementations, Handbook of Statistics, vol. 47, Elsevier, September 2022.

Nowadays, we know all models are wrong, both subjectively and objectively. We are aware that, if a model is wrong, coherent inference takes us to a wrong prediction and wrong science. Statisticians should construct new theories and methods which can be employed even when a person is aware that all models are wrong.

In this chapter of the book, we show that there are mathematical theorems in Bayesian statistics which hold even if statistical models are wrong or overparametrized. The chapter introduces such mathematical theorems, which enable us to find useful models and prior distributions in the unknown large world.





Lecture Notes on Statistical Learning Theory


Singular learning theory is introduced for students in mathematics and computer science. Mathematical learning theory was established based on algebraic geometry, which is truly useful in the unknown real world.

Overparametrized statistical models and learning machines can be studied by algebraic geometry. The real world is connected to algebraic geometry via the zeta function.
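
Concretely (standard notation from singular learning theory, not quoted from the notes): the connection runs through the zeta function of the Kullback-Leibler divergence K(w) from the data-generating distribution to the model, and the prior \varphi(w),

\[ \zeta(z) = \int K(w)^{z}\, \varphi(w)\, dw, \]

which is holomorphic in Re(z) > 0 and extends meromorphically to the whole complex plane. Its largest pole z = -\lambda defines the real log canonical threshold \lambda, the birational invariant that controls Bayesian generalization in singular models.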

Algebraic Geometry and Learning Theory







Sumio Watanabe (2021) Information criteria and cross validation for Bayesian inference in regular and singular cases.
Japanese Journal of Statistics and Data Science.
https://doi.org/10.1007/s42081-021-00121-3


In data science, an unknown information source is estimated by a predictive distribution defined from a statistical model and a prior. In an older Bayesian framework, it was explained that the Bayesian predictive distribution should be the best on the assumption that a statistical model is believed to be correct and a prior is given by a subjective belief in a small world. However, such a restricted treatment of Bayesian inference cannot be applied to highly complicated statistical models and learning machines in a large world. In 1980, a new scientific paradigm of Bayesian inference was proposed by Akaike, in which both a model and a prior are candidate systems that had better be designed by mathematical procedures so that the predictive distribution is a better approximation of the unknown information source. Nowadays, Akaike's proposal is widely accepted in statistics, data science, and machine learning. In this paper, in order to establish a mathematical foundation for developing a measure of a statistical model and a prior, we show the relation among the generalization loss, the information criteria, and the cross-validation loss, then compare them from three different points of view. First, their performances are compared in singular problems where the posterior distribution is far from any normal distribution. Second, they are studied in the case when a leverage sample point is contained in the data. Last, their stochastic properties are clarified when they are used for the prior optimization problem. The mathematical and experimental comparison shows the equivalence and the difference among them, which we expect to be useful in practical applications.

(Note) The difference between cross validation and information criteria is shown both theoretically and experimentally. Cross validation requires the simultaneous independence of {(Xi,Yi)}, whereas information criteria require only the conditional independence of {Yi|Xi}. That is to say, information criteria can be used even if {Xi} is not independent.

For example, cross validation cannot be used for autoregressive models, whereas WAIC can be.
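
As a minimal sketch of how WAIC is computed in practice (in Python; this is not code from the paper, and the function name and array layout are illustrative), given pointwise log-likelihoods evaluated at posterior draws:

    import numpy as np

    def waic(log_lik):
        # log_lik: array of shape (S, n); entry (s, i) is log p(x_i | w_s)
        # for S posterior draws w_1, ..., w_S and n observations.
        S, n = log_lik.shape
        # Bayes training loss T_n: minus the log of the posterior predictive
        # density, averaged over observations (log-sum-exp for stability).
        lppd = np.logaddexp.reduce(log_lik, axis=0) - np.log(S)
        T_n = -np.mean(lppd)
        # Functional variance V_n: posterior variance of the log-likelihood,
        # summed over observations.
        V_n = np.sum(np.var(log_lik, axis=0, ddof=1))
        # WAIC estimates the generalization loss per observation.
        return T_n + V_n / n

Here WAIC = T_n + V_n/n, the per-observation form used in Watanabe's papers; multiplying by 2n gives the deviance-scale convention reported by some software.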




Sumio Watanabe (2021) WAIC and WBIC for mixture models. Behaviormetrika,
https://doi.org/10.1007/s41237-021-00133-z


In this paper, we introduce the mathematical foundation and computing methods of WAIC and WBIC in a normal mixture, which is a typical singular statistical model, and discuss their properties in statistical inference. We also study the case in which samples are not independently and identically distributed, for example, when they are conditionally independent or exchangeable. Furthermore, a simple calculation method of WBIC in mixture models is proposed.

Note: It used to be difficult to calculate WBIC with a Gibbs sampler. However, the above paper gives a simple calculation method of WBIC. The following is a Matlab file.

Matlab file: WAIC and WBIC for a normal mixture
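
For orientation, here is a minimal Python sketch of the WBIC computation itself (this is not the Matlab file above; the function name and array layout are illustrative). WBIC replaces thermodynamic integration of the free energy by a single MCMC run at inverse temperature beta = 1/log(n):

    import numpy as np

    def wbic(log_lik_beta):
        # log_lik_beta: array of shape (S, n); entry (s, i) is log p(x_i | w_s),
        # where the draws w_s come from the *tempered* posterior proportional
        # to prior(w) * prod_i p(x_i | w)**beta with beta = 1 / log(n).
        # WBIC is the tempered-posterior mean of the minus log-likelihood of
        # the whole sample; it approximates the free energy
        # F_n = -log(marginal likelihood) up to O_p(sqrt(log n)).
        return np.mean(-np.sum(log_lik_beta, axis=1))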



Mathematical Theory of Bayesian Statistics

This book may be useful for those who want a scientific Bayesian framework.

Nowadays, we know that all models are wrong, both subjectively and objectively.

In statistics and machine learning, a data-generating distribution (DGP) is estimated by a predictive distribution defined from a statistical model and a prior. In an older Bayesian framework in the 20th century, it was explained that the Bayesian predictive distribution should be the best decision, on the assumption that a statistical model is believed to be correct and a prior is given by a subjective belief in a small world. However, such a restricted formalism of Bayesian inference cannot be applied to highly complicated statistical models and learning machines in a large world.

In this book, in order to establish the mathematical foundation for Bayesian statistics in the large world, it is shown that mathematical theorems universally hold for an arbitrary triple (a data-generating distribution, a statistical model, a prior).

This book may be useful for readers who are interested in the following.

(1) In the real world, all models are wrong. If a model is wrong, coherent inference takes us to a wrong prediction.

(2) If we assume that 'uncertainty' is represented by a model and 'subjective belief' is represented by a prior in the older Bayesian framework, we automatically accept the wrong assumption that the model and the prior are the DGP.

(3) If we think that our own model and prior are the DGP, our decision is not scientific, because we know that our model is not the DGP in the real world.

(4) Thus, we need to check or test both a model and a prior, even in Bayesian data analysis. A pair of a model and a prior is only a candidate system, which is not the DGP. We should improve both the model and the prior.

(4.1) The older Bayesian formalism prohibited improving a statistical model and a prior, which is not appropriate in science.

(4.2) The old Bayesian formalism said that a probability distribution can be assigned even to a non-probabilistic phenomenon by subjective belief. In science, such a claim may lead to misconducted research. In science, we should verify both a model and a prior based on experiments.

(5) There are new mathematical theorems in Bayesian statistics. They hold even if a model is too small or too redundant. We had better know them in the unknown real world. All models are wrong, but we may find some useful models based on mathematical theory. Which models are useful depends on the sample size in general.

(6) Even if the posterior distribution cannot be approximated by any normal distribution, its generalization performance is determined by a birational invariant, the real log canonical threshold (see the formula after this list).
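
For orientation (a standard result of singular learning theory, in my notation): if \lambda denotes the real log canonical threshold of the model-prior pair at the data-generating distribution, the expected generalization loss satisfies

\[ \mathbb{E}[G_n] = L_0 + \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right), \]

where L_0 is the smallest achievable loss. For a regular d-dimensional model \lambda = d/2, recovering the classical asymptotics, whereas for singular models \lambda \le d/2 when the prior is positive on the optimal parameter set.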

You can learn the new mathematical theory in Bayesian statistics: Statistical Learning Theory







You can find a phase transition in neural network learning: Sumio Watanabe, Algebraic geometrical methods for hierarchical learning machines. Neural Networks, 14(8), 2001, pp. 1049-1060.


Neural networks and normal mixtures are singular. You can find the phase transition phenomenon in the Bayesian posterior distribution as the sample size increases. For details, see Phase transition.

(Figures: NN is singular)






Why singular learning theory is necessary in deep learning.

A small neural network was trained so that it classifies O and X.
(Figure: NN is singular)

Eigenvalues of its Fisher information matrix are almost equal to zero.
(Figure: NN is singular)


Singular learning theory can be applied to neural networks, because it holds even if the Fisher information matrix is highly singular. The generalization loss and the log marginal likelihood are given by the real log canonical threshold, which is the volume dimension of the singularities. WAIC and WBIC can be used in neural networks. AIC is far larger than the generalization loss.
WAIC, LOO, AIC, and Generalization Loss in a Neural Network (mp4).

LOO requires the independence of (Xi,Yi), whereas information criteria require only the conditional independence of (Yi|Xi). Thus information criteria can be used more widely than LOO; for example, AIC and WAIC can be applied to AR models in time series analysis.
(See: S. Watanabe, Mathematical theory of Bayesian statistics, 2018, CRC Press).
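
For reference (a standard characterization in singular learning theory, in my notation): \lambda is a 'volume dimension' in the sense that the prior volume of nearly optimal parameters behaves like

\[ V(t) = \int_{K(w) \le t} \varphi(w)\, dw \;\sim\; c\, t^{\lambda} (-\log t)^{m-1} \qquad (t \to +0), \]

where K(w) is the Kullback-Leibler divergence to the data-generating distribution and m is the multiplicity of the largest pole of the associated zeta function. For regular models \lambda = d/2 and m = 1.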






Sumio Watanabe, homepage continued