## Algebraic Geometry and Statistical Learning Theory

Welcome to the Author's Page

Sumio Watanabe, 2009, Cambridge University Press

Sumio Watanabe, "Algebraic geometry and statistical learning theory," Cambridge University Press, Cambridge, UK, 2009.

Cambridge University Press in Cambridge Monographs on Applied and Computational Mathematics

ISBN: 9780521864671

Thank you very much for the book reviews. This book tries to build a new bridge between mathematics, statistics, and artificial intelligence.

List of Misprints

Sumio Watanabe Homepage

Purpose

A new statistical learning theory is established by algebraic geometrical methodology.

A Short Introduction

You can get a PDF file of a short introduction, "An introduction to algebraic geometry and statistical learning theory".

Why algebraic geometry ?

Main Points

(1) Almost all statistical learning machines are singular: they are nonidentifiable and their Fisher information matrices are singular. None of AIC, BIC, DIC, or MDL holds for such machines.
(2) We need a new mathematical theory for statistical learning machines. Algebraic geometry is the essential basis for the new theory.
(3) This book gives concrete, useful, and nontrivial results in statistical learning theory. AIC, BIC, DIC, and MDL are completely generalized for statistical learning machines.

If you study statistics or mathematics, you will find new theory and methodology in the following theorems.

(A) Algebraic geometry of likelihood function, Short Version (PDF)
(B) Algebraic geometry of Bayes integrals, Short Version (PDF)
(C) Equation of state in statistical learning, Short Version (PDF)
(D) Algebraic geometry of maximum likelihood, Short Version (PDF)

The book, "Algebraic Geometry and Statistical Learning Theory", proves these theorems. A new mathematical basis is established, on which statistical learning theory is studied. Algebraic geometry is explained for non-specialists and non-mathematicians.

Special Remark

Please see the true likelihood function or the posterior distribution. The likelihood function or the posterior distribution can not be approximated by the normal distribution.

The Bayes integral Z is defined by the log likelihood function L(w) and the prior distribution p(w),

$$Z = \int \exp( n L(w) )\, p(w)\, dw.$$

Note that L(w) is not a deterministic function but a random process, hence it is not trivial to derive the asymptotic expansion of Z even if we apply algebraic geometry to L(w). Our method was the first to derive this asymptotic expansion.

Let E[L(w)] be the function defined by the expectation value of L(w). Then

L(w)=E[L(w)]+(L(w)-E[L(w)])

Although the variance of $L(w)-E[L(w)]$ is proportional to 1/n, we have to investigate this function. The asymptotic expansion of exp(nL(w)) is not equal to that of exp(nE[L(w)]). Therefore, our result is not a simple application of any previous research in algebraic geometry and algebraic analysis. If the variance of L(w) is larger than its average, there are cases in which such an asymptotic expansion does not hold. Please see Here.

Abstract

A lot of statistical models and learning machines are singular, hence the conventional statistical theory of regular models can not be applied. To construct a new mathematical foundation for singular models and machines, we establish a bridge among algebraic geometry, zeta function, Schwartz distribution, empirical process, and statistical learning theory. Based on the bridge, four main formulas are proved.

Firstly, any log likelihood function can be written in a common standard form even if the Fisher information matrix is singular. In regular statistical models, the log likelihood ratio function can be approximated by a quadratic form of the parameter, whereas in singular statistical models, it is written as a normal crossing function based on Hironaka's resolution of singularities.

Secondly, the asymptotic expansion of Bayes marginal likelihood, stochastic complexity, or evidence is derived, which contains regular models' BIC and MDL as special cases. It is clarified that such asymptotic expansion is mathematically determined by the zeta function of a statistical model whose largest pole is the real log canonical threshold.
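To make the role of the zeta function concrete, the following worked example may help. It is only a sketch under stated assumptions: the symbols K(w) (the average log density ratio), \lambda (the real log canonical threshold), m (the order of the pole), and w_0 (an optimal parameter) follow common usage in singular learning theory and are not defined on this page, the two functions K below are hypothetical toy cases with a uniform prior, and the expansion is quoted under the conditions assumed in the book, with the sign convention that the largest pole sits at z = -\lambda.

```latex
% Zeta function of a statistical model (K and p are assumed notation)
\zeta(z) = \int K(w)^{z}\, p(w)\, dw ,
\qquad \text{largest pole at } z = -\lambda \text{ with order } m .

% Singular toy case: K(a,b) = a^{2} b^{2}, uniform prior on [0,1]^{2}
\zeta(z) = \int_{0}^{1}\!\!\int_{0}^{1} a^{2z} b^{2z}\, da\, db
         = \frac{1}{(2z+1)^{2}}
\;\Longrightarrow\; \lambda = \tfrac{1}{2},\quad m = 2 .

% Regular toy case: K(a,b) = a^{2} + b^{2}; in polar coordinates near the origin
\int r^{2z}\, r\, dr \;\propto\; \frac{1}{2z+2}
\;\Longrightarrow\; \lambda = 1 = \tfrac{d}{2} \ (d = 2),\quad m = 1 .

% Asymptotic expansion of the Bayes integral Z = \int \exp(n L(w))\, p(w)\, dw
-\log Z = -\,n L(w_{0}) + \lambda \log n - (m-1) \log\log n + O_{p}(1) .
```

In the regular case \lambda = d/2 and m = 1, so the second term reduces to the familiar (d/2) log n of BIC, whereas the singular toy case gives a smaller coefficient even though both have two parameters.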

Thirdly, the Bayes and Gibbs generalization errors are estimated by Bayes and Gibbs training errors without any knowledge of the true distribution. This is the generalized version of AIC. The difference between the generalization and training errors is equal to the singular fluctuation of the posterior distribution.

And lastly, symmetry of the generalization and training errors in the maximum likelihood and maximum a posteriori estimators is proved. In singular learning machines the generalization error is equal to the maximum value of a Gaussian process, which is far larger than in regular models.

In this book, regular statistical theory is generalized to the case in which the Fisher information matrix is singular. The main formulas are not only mathematically beautiful but also statistically useful.

Even if one can not understand the mathematical proofs of the theorems, the obtained results are important for practical applications. We have to use algebraic geometrical methods to evaluate how appropriate learning machines are.

Basic Point

One may think that the set of parameters at which the Fisher information matrix is degenerate is a measure-zero subset of the parameter set, so that it can be neglected in generic statistical estimation. From the statistical point of view, such reasoning is wrong. Suppose, for example, that a statistical model made of five normal distributions is employed in estimating a normal mixture. We should evaluate whether a model of five components is appropriate for the true probability distribution under the fluctuation of empirical samples, and therefore we should analyze the effect of singularities. In practical applications, it seems that models with four, five, and six components are all almost appropriate. Then the posterior distribution contains singularities. See Why algebraic geometry ?. Such a problem is called statistical model selection or hypothesis testing. Hence we need singular learning theory in almost all generic statistical problems. It should be emphasized that, even if the true distribution is made of five components and we employ the same model, we need singular learning theory in order to evaluate whether each component is appropriately estimated or not. The conventional statistical theory can not be applied to such a basic problem.

Regular and Singular

From the mathematical point of view, singular learning theory is the generalized version of the regular statistical theory.

| (1) Statistics | regular | singular |
| --- | --- | --- |
| (2) Fisher metric | positive definite | semi-positive definite |
| (3) Basic algebra | linear algebra | ring and ideal |
| (4) Basic geometry | differential geometry | algebraic geometry |
| (5) Basic analysis | real-valued function | function-valued function |
| (6) Basic probability theory | central limit theorem | empirical process theory |
| (7) Estimator is defined as a function | from samples to parameter | from samples to probability distribution |
| (8) Theory is invariant under | diffeomorphism | birational transform |
| (9) True distribution | single point | analytic set with singularities |
| (10) Log likelihood ratio function | quadratic form | normal crossing |
| (11) Maximum likelihood estimator | asymptotically efficient | does not exist or not efficient |
| (12) Posterior distribution | asymptotically normal | asymptotically singular |
| (13) Cramer-Rao inequality | holds | has no meaning |
| (14) Log Bayes marginal | (dimension/2) log n | (log canonical threshold) log n |
| (15) (Generalization) - (Training) | (dimension) / n | 2 (singular fluctuation) / n |
| (16) Information criterion | AIC | WAIC, or equation of state |
| (17) Bayes information criterion | BIC | WBIC |

Just as Fisher's asymptotic theory was the statistical basis for regular statistical estimation, the new theory is the basis for both regular and singular estimation. A lot of statistical methods are now being re-studied from the algebraic geometrical point of view. For example, it was clarified why the constant (d/2) appears in AIC and BIC in regular models, where d is the dimension of the parameter space.

The following are recent advances.

(1) We proved that Bayes cross validation (CV) is asymptotically equivalent to WAIC even for singular models. We also showed that the sum of CV and the Bayes generalization error is equal to the constant 2*lambda/n as a random variable, where lambda is the real log canonical threshold and n is the number of random samples. arXiv:1004.2316 (2010/April/15). The average of WAIC is equal to that of the generalization error, and the variance of WAIC is equal to that of the generalization error, even if the true parameter is in the neighborhood of singularities. It was also clarified that DIC is different from CV and WAIC, even asymptotically (2010/Oct/14). You can read the paper in Journal of Machine Learning Research, Vol.11, (DEC), pp.3571-3594, 2010.

(2) The universal learning curve was proved under the renormalizable condition, even if a true distribution is unrealizable and singular for a learning machine.
Sumio Watanabe, "Asymptotic Learning Curve and Renormalizable Condition in Statistical Learning Theory," arXiv:1001.2957, 2010, which was published in Journal of Physics Conference Series (2010/Jan/18).

(3) Equations of states hold even if the true distribution is not contained in a parametric model.
Sumio Watanabe, "Equations of States in Statistical Learning for a Nonparametrizable and Regular Case," arXiv:0906.0211 (1/June/2009). This paper was published under the title "Equations of states in statistical learning for an unrealizable and regular case," IEICE Transactions, Vol.E93-A, No.3, pp.617-626, 2010.

(4) We found a new criterion, WBIC, which enables us to estimate the Bayes free energy with a small computational cost. See WBIC, Journal of Machine Learning Research.

Q.1. Why do we need singular learning theory ?

A.1. Because almost all learning machines are singular. For example, artificial neural networks, radial basis functions, normal mixtures, binomial mixtures, Boltzmann machines, hidden Markov models, reduced rank regression, and stochastic context-free grammars are singular. If a learning machine or a statistical model has a hierarchical structure or a hidden variable, then it is not regular but singular. A regular statistical model is a very special example of singular models.

Q.2. We are satisfied with regular statistical theory.

A.2. In singular statistical models, the Fisher information matrix is singular. The Cramer-Rao inequality has no meaning. The asymptotic normality of the maximum likelihood estimator (MLE) does not hold. MLE is not the asymptotically best estimator. The Bayes a posteriori distribution does not converge to any normal distribution. The Laplace approximation can not be employed. Neither AIC nor DIC is asymptotically equal to the average generalization error. Neither BIC nor MDL is asymptotically equal to the minus log Bayes marginal. The log likelihood ratio does not converge to the chi-square distribution. We need a completely new statistical theory for singular statistical models.
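The first claim in this answer can be checked numerically on a small example. The sketch below is illustrative only: the two-parameter mixture p(x|a,b) = (1-a) N(x;0,1) + a N(x;b,1) is a hypothetical model chosen for this example (the page does not specify one), and the function names and the integration range are assumptions. It computes the Fisher information matrix by a simple Riemann sum and compares its determinant at a regular point with that at a point where the two components coincide.

```python
import math

def normal_pdf(x, mean):
    """Density of a unit-variance normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def fisher_information(a, b, lo=-12.0, hi=13.0, step=0.01):
    """Fisher information of the hypothetical mixture
    p(x|a,b) = (1-a) N(x;0,1) + a N(x;b,1), by a Riemann sum over x."""
    i_aa = i_ab = i_bb = 0.0
    steps = int(round((hi - lo) / step))
    for k in range(steps + 1):
        x = lo + k * step
        f0 = normal_pdf(x, 0.0)
        fb = normal_pdf(x, b)
        p = (1.0 - a) * f0 + a * fb
        score_a = (fb - f0) / p          # d/da log p(x|a,b)
        score_b = a * (x - b) * fb / p   # d/db log p(x|a,b)
        i_aa += score_a * score_a * p * step
        i_ab += score_a * score_b * p * step
        i_bb += score_b * score_b * p * step
    return [[i_aa, i_ab], [i_ab, i_bb]]

def determinant(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

regular_point = fisher_information(0.5, 1.0)   # components are distinct
singular_point = fisher_information(0.5, 0.0)  # components coincide
print("det at (a, b) = (0.5, 1.0):", determinant(regular_point))
print("det at (a, b) = (0.5, 0.0):", determinant(singular_point))
```

At b = 0 the score with respect to a vanishes identically, so the matrix has a zero eigenvalue and the quadratic approximation behind regular asymptotic theory is not available there.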

Q.3. Is algebraic geometry essential to statistical learning theory ?

A.3. Yes. The set of statistical models or learning machines is an analytic set with singularities. Singularities should be studied appropriately, because they determine the statistical estimation process. Algebraic geometry and singularity theory provide the mathematical foundation on which a new statistical learning theory is constructed. For example, resolution of singularities is a powerful method which transforms the log likelihood function into a common standard form. Algebraic geometry is definitely important, because there is no alternative method.

Q.4. Does algebraic geometry enable us to construct any concrete methodology ?

A.4. Yes. Four main formulas are introduced based on algebraic geometry. Firstly, the log density ratio function of any statistical model can be written in a common standard form based on resolution of singularities. Secondly, the asymptotic form of the log Bayes marginal is given by the log canonical threshold. This is the generalized version of BIC and MDL, which contains the BIC and MDL of regular models as special cases. Thirdly, the difference between the generalization and training errors is given by the singular fluctuation. This is the generalized version of AIC, which contains the AIC of regular models as a special case and is called the Widely Applicable Information Criterion (WAIC). Although AIC can not be applied to singular models, WAIC can be applied regardless of whether the model is regular or singular. And lastly, the symmetry of the generalization and training errors in the maximum likelihood method is proved. In singular models, the training error of ML is made very small, whereas the generalization error is very large. Therefore, the maximum likelihood method is not appropriate for singular statistical models. These four formulas open a new perspective and methodology in statistical learning theory.

Q.5. Does this book contain any new information ?

A.5. Yes. Singular statistics is first introduced in this book. Regular statistical theory was constructed by R.A. Fisher in 1925. However, singular statistical theory has been left unknown, because it contained a lot of mathematical difficulties. Recently, the author of this book established singular statistical theory based on algebraic geometry, zeta function theory, singular integrals, and empirical processes. This is the first book which introduces such advances to non-specialists.

Q.6. It seems to be difficult for non-mathematicians to study algebraic geometry.

A.6. Sometimes it is difficult for non-mathematicians to understand mathematical books. However, in this book, we introduce algebraic geometry for non-specialists. Basic concepts in algebraic geometry are explained gently, including the correspondence between algebra and geometry, blow-ups, resolution of singularities, zeta functions, and the log canonical threshold. Such concepts are important not only in mathematics but also in statistics and machine learning. In many mathematical books, resolution of singularities is explained using abstract concepts. In this book, the resolution theorem is introduced using a concrete and useful description. The resolution theorem is a fundamental one in algebraic geometry, and it is also important in probability theory and statistics.

Q.7. I am an algebraic geometer. Does this book give any pure mathematical perspective to algebraic geometry ?

A.7. In this book, algebraic geometry is introduced for non-mathematicians. The main purpose of this book is to construct a mathematical bridge between algebraic geometry and the real world,

Algebraic Geometry === Singularity Theory === Zeta Function === Empirical Process === Real World

Based on this bridge, using algebraic geometrical concepts such as the log canonical threshold, you can predict the behavior of an information system in the real world. Yes, algebraic geometry is connected to the real world.

However, this book may be important even in pure mathematics. It is well known that Professor V.F.R. Jones found the famous knot invariant, the Jones polynomial, in research on mathematically rigorous statistical physics and operator algebras. From the point of view of theoretical physics, statistical learning theory is equivalent to the statistical physics of a random Hamiltonian on algebraic varieties, and observables are naturally birational invariants. In this book, it is shown that the real log canonical threshold, the well-known birational invariant, is equal to the average of the generalization error, and that the singular fluctuation, a new birational invariant, is equal to the difference between the generalization error and the training error. The former plays an important role in higher dimensional algebraic geometry. It is still unknown what the singular fluctuation is in algebraic geometry. If you are an algebraic geometer, please enjoy the fact that algebraic geometrical structure plays an important role in statistical learning theory. Singular learning theory can not be constructed without algebraic geometry. The set of learning machines which discover unknown structure from random samples is equal to algebraic varieties, and their learning process is determined by birational invariants. The knowledge to be discovered is often equal to the singularity of the learning machines. We expect that new mathematical concepts will be found in statistical learning theory.

Q.8. I am a theoretical physicist. Does this book have any connection to theoretical physics ?

A.8. Yes. This book clarifies how to study the partition function of a random Hamiltonian in a mathematically rigorous way. It was clarified by Professors Araki and Haag that quantum field theory and statistical mechanics are characterized by the algebraic structure of operators. This book shows that, even for a random Hamiltonian, algebraic structure determines the physical phenomenon. This book studies statistical mechanics defined on algebraic varieties. Although the dimension of an algebraic variety is finite, the phase transition phenomenon occurs because of the existence of singularities. Also, the mathematical relation between the partition function, the state density function, and the zeta function is clarified. If you are a physicist, you will find a new methodology for theoretical physics in this book. Please see Journal of Physics Conference Series.

Q.9. I am an information engineer. Do I need this book ?

A.9. If you are not interested in the question of whether statistical estimation or machine learning works appropriately in your applications, you may not need this book. However, if you want to evaluate an information system that uses statistical estimation or machine learning, you need this book. Note that almost all statistical learning machines are not regular, hence you can not examine them using AIC, BIC, or MDL. Even if you are too busy to understand the mathematical proofs in this book, the theorems are very simple, so it is easy for everyone to apply them to practical applications.

Q.10. I am a lecturer of statistics and information science in a university. Should I teach the results of this book to students ?

A.10. One of the most important points is that Fisher's asymptotic theory holds only when the statistical model is identifiable and the Fisher information matrix is positive definite. In many statistical problems in practical applications, such as hypothesis testing or model selection, the function from the parameter to the probability distribution is not one-to-one or the Fisher information matrix is not positive definite. If the Fisher information matrix has a zero eigenvalue, then the asymptotic normality of the maximum likelihood estimator (MLE) does not hold. In such cases, MLE is not the asymptotically best estimator. Your students will use AIC, BIC, or MDL in practical applications; however, such information criteria can not be used in these cases. Therefore, you had better teach your students that Fisher's asymptotic theory holds only for regular statistical models.

Q.11. I am a statistician. What is the main difference between WAIC and the conventional information criteria ?

A.11. WAIC has theoretical support for an arbitrary triple (true distribution, statistical model, prior), whereas AIC, TIC, BIC, DIC, and MDL have no such support. Note that AIC, TIC, BIC, and MDL give the true asymptotic values only when the statistical model is regular, that is to say, the set of the optimal parameters consists of a single point and the Hessian matrix is positive definite. The widely applicable information criterion, WAIC, is equal to the asymptotic generalization error even if the set of optimal parameters is an analytic set with singularities and even if the Hessian matrix is not positive definite. WAIC can be used in artificial neural networks, normal mixtures, hidden Markov models, reduced rank regressions, and Bayes networks. In such statistical models, the posterior distribution is nonlocally distributed and far from the normal distribution. The deviance information criterion, DIC, is asymptotically equal to the effective number of parameters only when the posterior distribution can be approximated by the normal distribution. However, WAIC can be used for an arbitrary posterior distribution. WAIC can be used even if the true distribution is outside of the parametric model. Moreover, in singular statistical models, the distribution of the log likelihood under the posterior distribution is not subject to the chi-square distribution or any other simple distribution, because of singularities. Hence DIC can not be generalized by any simple considerations in singular models. WAIC always gives the generalization error and is a natural criterion from the algebraic geometrical and functional points of view. For the theoretical and experimental analysis of DIC, CV, and WAIC, please see arXiv:1004.2316. In fact, DIC is not invariant under a transform of the parameter, whereas WAIC is invariant under any birational transform of the parameter. In other words, in WAIC, one can replace the parameter w by using w = g(u), where g is any analytic function that does not necessarily have the inverse function g^{-1}. In this case, the model p(x|w) and the prior p(w) are replaced by p(x|g(u)) and p(g(u))|g'(u)|, respectively.

BIC and MDL give the true asymptotic log marginal likelihood only when the statistical model is regular. We generalized them for arbitrary statistical models, and proved that (the number of parameters / 2) is replaced by the real log canonical threshold. It should be emphasized that the results of this book are truly new in statistics. Our main proposal is that an algebraic geometrical foundation is necessary for the statistical problem:

"Can we estimate the true structure of the information source from random samples ?"

If you are a statistician, you can understand the importance and advance of this book.

Q.12. I am a statistician. In regular models, the effective number of parameters in AIC is equal to that in BIC. Does this relation hold even in singular models ?

A.12. No. In singular models, the effective number of parameters that is defined by the difference between the generalization error and the training error is equal to the singular fluctuation, whereas that by the asymptotic Bayes marginal likelihood is equal to the real log canonical threshold. Both of them are birational invariants, however, they are different in general. It happens that they coincide with each other in regular models. The real log canonical threshold and the singular fluctuation depend on the triple, (true distribution, statistical model, prior distribution), however, they can be numerically estimated from random samples without any information about the true distribution.

Q.13. I am a statistician. In practical applications, it seems that we can replace the real log canonical threshold and the singular fluctuation by (d/2), where d is the number of parameters. Also it seems that we can use AIC, where AIC is defined by (Bayes training error) + d/n.

A.13. In practical applications, it is easy to calculate WAIC. The real log canonical threshold (RLCT) and the singular fluctuation (SF) are different from (d/2). If you use AIC, then its average is not equal to that of the generalization error. Moreover, the variance of AIC is larger than that of WAIC. The variance of WAIC is equal to that of the generalization error. If you use WAIC in the case that the true distribution is realizable by and regular for the statistical model, then WAIC is equivalent to AIC as a random variable. Therefore, I recommend that you use WAIC.
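As an illustration of how simple the calculation is, here is a minimal sketch. It assumes the definition from the JMLR paper cited on this page, WAIC = (Bayes training loss) + (functional variance) / n, and it assumes that posterior draws have already been obtained by some sampler that is not shown. The function names and the input layout `log_lik[s][i]` are hypothetical choices made for this example.

```python
import math

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))

def waic(log_lik):
    """WAIC on the per-sample loss scale.

    log_lik[s][i] is assumed to hold log p(x_i | w_s), where w_s is the
    s-th posterior draw and x_i is the i-th training sample.
    """
    num_draws = len(log_lik)
    n = len(log_lik[0])
    training_loss = 0.0
    functional_variance = 0.0
    for i in range(n):
        column = [log_lik[s][i] for s in range(num_draws)]
        # Bayes training loss: minus log of the posterior predictive density.
        training_loss -= log_mean_exp(column) / n
        # Posterior variance of the log likelihood of sample i.
        mean = sum(column) / num_draws
        functional_variance += sum((v - mean) ** 2 for v in column) / num_draws
    return training_loss + functional_variance / n

# Tiny made-up input: two posterior draws, two training samples.
example = [[math.log(0.2), math.log(0.5)],
           [math.log(0.4), math.log(0.5)]]
print("WAIC:", waic(example))
```

Nothing in the calculation refers to the number of parameters or to the true distribution; only the log likelihood values under the posterior are needed.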

Q.14. I am a statistician. Do you study very special and pathological cases ?

A.14. No. Singular theory studies the most ordinary and generic cases in statistical model selection and hypothesis testing. In the statistical model evaluation process for hierarchical statistical models such as normal mixtures, neural networks, hidden Markov models, reduced rank regressions, and so on, we have to determine whether a statistical model is redundant compared to the true distribution or not. In such cases, the posterior distribution is far from the normal distribution. Moreover, it contains singularities in general. Even if the posterior distribution can be approximated by the normal distribution, we can not know it before estimating the posterior distribution. If you are a true statistician, I recommend that you see the true likelihood function with your own eyes, for example, in a simple normal mixture model. In ordinary and generic cases in model selection or hypothesis testing, the shape of the likelihood function is far from the normal distribution, so that we can not apply the regular asymptotic theory. Neither asymptotic normality of MLE nor asymptotic normality of the posterior distribution holds in many practical applications. Hence singular statistical theory is essential to statistics. It is neither special nor pathological. From the mathematical point of view, singular statistical theory contains regular statistical theory as a very special part. Therefore, statistics in the near future will be constructed based on singular statistical theory. The old Fisher's asymptotic theory was built on "quadratic approximation + central limit theorem", whereas the modern asymptotic theory will be built on "algebraic geometry + empirical process theory." The modern theory is not a special one but a general and natural one. It should be emphasized that there is no alternative to algebraic geometry and algebraic analysis. Without them, the generalized versions of AIC, BIC, MDL, and DIC can not be found. In the near future, statistics will be reborn with algebraic geometry and algebraic analysis.

Q.15. I am a statistician. In general, it is not easy to find a resolution map. Therefore, your theory is beautiful but can not be used in practical problems.

A.15. Resolution maps are now being found for several practical models, for example, reduced rank regressions, Bayes networks with hidden variables, artificial neural networks, and normal mixtures. Even if a concrete resolution map is not found, the existence of desingularization enables us to construct statistical theory for non-regular probabilistic models. Based on such theory, both information criteria, WAIC and WBIC, are derived. We strongly expect that more general statistical theory will be constructed based on the existence of a resolution map. In the near future, statistical theory will need neither positive definiteness of the Fisher information matrix nor asymptotic normality of the maximum likelihood estimator.

Q.16. I am a Bayes statistician. As the statistician Dr. I. J. Good proposed, the Bayes marginal likelihood is a very important observable in statistics, but its numerical calculation needs quite a heavy computational cost. Does singular learning theory enable us to make a new method to estimate it with a small computational cost, even if the true distribution is unknown ?

A.16. Yes. Based on singular learning theory, we obtained a new method to estimate the Bayes marginal likelihood with a small computational cost. See a widely applicable Bayesian information criterion (WBIC), Journal of Machine Learning Research. Moreover, by using WBIC, we can estimate the real log canonical threshold, even if we do not know the true distribution.
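The idea can be sketched on a toy problem. The code below assumes the definition given in the WBIC paper: the average of the total negative log likelihood over the posterior tempered at inverse temperature 1/log n. Everything else is an assumption made for illustration: the one-parameter model N(x; w, 1), the uniform prior on [-5, 5], the made-up data, and the use of a parameter grid in place of MCMC. It is not the MATLAB program mentioned on this page.

```python
import math

def log_normal(x, w):
    """Log density of the hypothetical model p(x|w) = N(x; w, 1)."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (x - w) ** 2

def total_neg_log_lik(data, w):
    """Total negative log likelihood of the data at parameter w."""
    return -sum(log_normal(x, w) for x in data)

def wbic_on_grid(data, grid, log_prior):
    """Average of the total negative log likelihood over the posterior
    tempered at inverse temperature 1 / log(n), on a uniform grid."""
    beta = 1.0 / math.log(len(data))
    energies = [total_neg_log_lik(data, w) for w in grid]
    log_weights = [-beta * e + log_prior(w) for e, w in zip(energies, grid)]
    top = max(log_weights)
    weights = [math.exp(v - top) for v in log_weights]
    return sum(wt * e for wt, e in zip(weights, energies)) / sum(weights)

def free_energy_on_grid(data, grid, log_prior):
    """Minus log marginal likelihood by direct quadrature, for comparison."""
    step = grid[1] - grid[0]
    log_terms = [-total_neg_log_lik(data, w) + log_prior(w) for w in grid]
    top = max(log_terms)
    return -(top + math.log(step * sum(math.exp(v - top) for v in log_terms)))

def log_uniform_prior(w):
    """Uniform prior density 1/10 on [-5, 5] (an assumption of this sketch)."""
    return math.log(0.1)

data = [((i % 7) - 3) * 0.5 for i in range(50)]   # made-up deterministic data
grid = [-5.0 + 0.01 * k for k in range(1001)]     # uniform grid on [-5, 5]
wbic_value = wbic_on_grid(data, grid, log_uniform_prior)
free_energy = free_energy_on_grid(data, grid, log_uniform_prior)
print("WBIC:", wbic_value)
print("free energy by quadrature:", free_energy)
```

In this regular toy case the two numbers differ only by a constant that does not depend on n. The practical appeal is that WBIC needs a single tempered posterior average, which in higher dimensions would come from one MCMC run rather than from a grid.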

Q.17. I am a researcher of mathematical statistics. Singular learning theory leads us to find universal laws in statistics. However, it requires understanding deep mathematical results, for example, Hironaka's resolution of singularities (1964) or the existence of the b-function (1972). Is there an alternative theory which does not use modern mathematics ?

A.17. At present, we do not know whether an alternative way will be found or not. However, it seems that, without algebraic geometry or algebraic analysis, statistical theory becomes more difficult and complicated. I would like to claim that our method is the most natural way to discover the true statistical laws.

Q.18. I am a student of Bayes statistics. I want to compare WAIC with DIC by experiments.

A.18. WAIC has not only theoretical support but also practical effectiveness. Let's compare WAIC with DIC.

Q.19. I am a student of Bayes statistics. I want to use WBIC.

A.19. In WBIC, you can find a MATLAB program.

Q.20. I am a student of statistics and machine learning theory. I want to see several examples of the likelihood functions.

A.20. It is important to see the true shape of the likelihood function. In singular statistical models, it is quite different from the normal distribution. However, unfortunately, it is difficult to show the likelihood function for higher dimensional cases. In higher dimensional cases, the likelihood functions become more complicated with higher dimensional singularities. Singularity theory is essential to modern statistics.

The following figure shows a case when the dimension of the parameter is 2 and the number of random samples is 20.

A case when the number of random samples is 250. The rough-edged shape was caused by representation on the two-dimensional lattice. Note that the Fisher information matrix at the true parameter is positive definite; however, the likelihood function can not be approximated by the normal distribution.

A case when the number of random samples is 1000. The rough-edged shape was caused by representation on the two-dimensional lattice. Even in this case, the likelihood function is far from the normal distribution.

A case when the true distribution is a mixture of separated normal distributions and the number of random samples is 20.

If you want to see more examples, please see Examples .
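To inspect such a likelihood surface yourself, a sketch along the following lines may be a starting point. The model behind the figures is not stated on this page, so the two-parameter mixture p(x|a,b) = (1-a) N(x;0,1) + a N(x;b,1), the true distribution N(0,1), the random seed, and the lattice ranges are all assumptions made for illustration. The code evaluates the log likelihood on a two-dimensional lattice and prints it relative to the value at (a, b) = (0, 0).

```python
import math
import random

def normal_pdf(x, mean):
    """Density of a unit-variance normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def log_likelihood(data, a, b):
    """Log likelihood of the hypothetical mixture (1-a) N(0,1) + a N(b,1)."""
    return sum(
        math.log((1.0 - a) * normal_pdf(x, 0.0) + a * normal_pdf(x, b))
        for x in data
    )

rng = random.Random(0)                              # fixed seed for repeatability
data = [rng.gauss(0.0, 1.0) for _ in range(20)]     # 20 samples from N(0,1)

a_values = [i / 10 for i in range(11)]              # mixing ratio in [0, 1]
b_values = [(j - 5) * 0.4 for j in range(11)]       # second mean in [-2, 2]
reference = log_likelihood(data, 0.0, 0.0)

# Relative log likelihood on the lattice: rows are a, columns are b.
for a in a_values:
    row = [log_likelihood(data, a, b) - reference for b in b_values]
    print(" ".join(f"{value:7.2f}" for value in row))
```

The row a = 0 and the column b = 0 are identically zero, because every parameter on those two lines represents the same distribution N(0,1). The set of optimal parameters is therefore a pair of crossing lines rather than a single point, which is why no quadratic (normal) approximation can describe the surface around it.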

Even for the same number of training samples, it strongly depends on the condition of the true distribution whether we can apply AIC, BIC, and DIC or not. In a generic case when we need model selection or hypothesis testing, we can not know whether we can apply the conventional criteria, because we see the true distribution in the probabilistic fluctuations.

Based on algebraic geometrical method, we found a new criterion which is applicable in both regular and singular cases. Therefore, we can estimate the generalization error independent of the true distribution. It is very simple, easily computable, and statistically natural. Journal of Machine Learning Research, Vol.11, (DEC), pp.3571-3594, 2010.

Author

Sumio Watanabe, Ph.D. is a professor at Tokyo Institute of Technology, Tokyo, Japan. He studied mathematical physics at the Research Institute for Mathematical Sciences (RIMS) in Kyoto University, Kyoto, Japan. His research interests include statistical learning, applied algebraic geometry, and information physics.

Questions or comments are welcome : [e-mail] swatanab (AT) pi.titech.ac.jp