Mathematical Theory of Bayesian Statistics

Sumio Watanabe

Sumio Watanabe, Mathematical Theory of Bayesian Statistics, CRC Press, 2018


Chapter 1. Definition of Bayesian Statistics
Chapter 2. Statistical Models
Chapter 3. Basic Formula of Bayesian Observables
Chapter 4. Regular Posterior Distribution
Chapter 5. Standard Posterior Distribution
Chapter 6. General Posterior Distribution
Chapter 7. Markov Chain Monte Carlo
Chapter 8. Information Criteria
Chapter 9. Topics in Bayesian Statistics
Chapter 10. Basic Probability Theory

Mathematical Theory of Bayesian Statistics

A new mathematical theory is introduced, and the formulas for the generalization loss and the marginal likelihood are clarified. They hold even if the posterior distribution cannot be approximated by any normal distribution.


(1) Example: Normal Mixture, p(x|a,b)= (1-a)N(x)+a N(x-b).

The true parameter is (a0,b0)=(0.5,0.3). Even though the Fisher information matrix is positive definite at the true parameter, the posterior is far from any normal distribution. See the true posteriors:
n=30 (mp4), n=300 (mp4), n=3000 (mp4), and n=30000 (mp4).
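To make the example concrete, the following is a minimal sketch (not the code behind the videos) that evaluates the posterior of (a,b) on a grid. The uniform prior on the grid, the grid ranges, the sample size, and all function names are assumptions made for illustration. The likelihood does not depend on b when a=0, nor on a when b=0, so the posterior is exactly flat along those two lines, which may help explain why it looks non-normal when the true parameter lies near them.

```python
import numpy as np

def normal_pdf(x):
    """Density of the standard normal distribution N(x)."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def mixture_pdf(x, a, b):
    """p(x|a,b) = (1-a) N(x) + a N(x-b), as in the example above."""
    return (1.0 - a) * normal_pdf(x) + a * normal_pdf(x - b)

def sample_mixture(n, a0, b0, rng):
    """Draw n points from the mixture with true parameter (a0, b0)."""
    from_shifted = rng.random(n) < a0          # component indicator
    return rng.standard_normal(n) + b0 * from_shifted

def grid_posterior(x, a_grid, b_grid):
    """Posterior on a grid under a uniform prior (an assumption)."""
    A, B = np.meshgrid(a_grid, b_grid, indexing="ij")
    loglik = np.zeros_like(A)
    for xi in x:                               # sum of log-likelihoods
        loglik += np.log(mixture_pdf(xi, A, B))
    post = np.exp(loglik - loglik.max())       # subtract max for stability
    return post / post.sum()

rng = np.random.default_rng(0)
x = sample_mixture(300, 0.5, 0.3, rng)         # n=300, true (a0,b0)=(0.5,0.3)
a_grid = np.linspace(0.0, 1.0, 101)
b_grid = np.linspace(-1.0, 1.0, 101)           # hypothetical range for b
post = grid_posterior(x, a_grid, b_grid)
```

Plotting `post` as an image or contour map for several values of n gives a rough picture of how the posterior shape changes with the sample size.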

The new theory enables us to study such singular posterior distributions. It holds in all of the cases n=30, 300, 3000, and 30000.

Generalization loss, cross validation loss, and WAIC in Normal Mixture (mp4) .

Generalization loss, cross validation loss, and WAIC in Linear Regression (mp4) .

(2) Example: Neural network, Y=a tanh ( bx ) + Z. Bayes and Maximum Likelihood (mp4) .

Conditional density estimation: true q(y|x1,x2) by a neural network p(y|x1,x2).
You can see a neural network in the Bayesian posterior distribution: a Neural Network in Bayes.
Generalization loss, cross validation loss, WAIC, and AIC: CV and WAIC in Neural Networks (mp4).

We need algebraic geometry to study neural networks.

For more examples, see Cross Validation and WAIC in Layered Neural Networks .

(3) Comparison of LOO and WAIC: In regression problems, LOO requires the simultaneous independence of the pairs {(Xi,Yi)}, whereas WAIC needs only the conditional independence of {(Yi|Xi)}. In this book, we give a mathematical proof of this fact.

See the true values of |LOO - G| - |WAIC - G|:

(3A) Independent Case (mp4)

(3B) Influential Observation Case (mp4)

Since the variance of LOO is larger than that of WAIC, WAIC is the better criterion for choosing the optimal hyperparameter.

Optimization of Hyperparameter by minimizing LOO or WAIC (mp4)
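As an illustration of hyperparameter selection by WAIC, here is a minimal sketch under assumed conditions: a linear regression y_i ~ N(x_i·w, 1) with known noise variance and prior w ~ N(0, I/lam), where the prior precision `lam` is the hyperparameter. The model, the candidate values, the synthetic data, and the function names are hypothetical choices, not the setting of the video. WAIC is computed as the training loss plus the average pointwise functional variance.

```python
import numpy as np

def log_mean_exp(a, axis):
    """Numerically stable log of the mean of exp(a) along an axis."""
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(a - m), axis=axis))

def waic_for_prior_precision(X, y, lam, rng, draws=2000):
    """WAIC of the assumed model for one value of the hyperparameter lam."""
    d = X.shape[1]
    cov = np.linalg.inv(X.T @ X + lam * np.eye(d))   # conjugate posterior covariance
    cov = 0.5 * (cov + cov.T)                        # enforce exact symmetry
    mean = cov @ X.T @ y                             # conjugate posterior mean
    w = rng.multivariate_normal(mean, cov, size=draws)      # (draws, d)
    resid = y[None, :] - w @ X.T                            # (draws, n)
    loglik = -0.5 * resid**2 - 0.5 * np.log(2.0 * np.pi)    # log p(y_i|x_i,w_s)
    train_loss = -np.mean(log_mean_exp(loglik, axis=0))
    functional_var = np.var(loglik, axis=0)          # pointwise, over posterior draws
    return train_loss + np.mean(functional_var)

# Synthetic data (hypothetical): 50 points, 10 inputs, only 2 of them relevant.
data_rng = np.random.default_rng(2)
n, d = 50, 10
X = data_rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:2] = [1.0, -0.5]
y = X @ w_true + data_rng.standard_normal(n)

candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: waic_for_prior_precision(X, y, lam, np.random.default_rng(3))
          for lam in candidates}
best_lam = min(scores, key=scores.get)               # minimizer of WAIC
```

The same loop could use a LOO estimate in place of WAIC; each candidate is scored with an identically seeded generator so that Monte Carlo noise affects all candidates in the same way.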


LOO = Leave-one-out cross validation

WAIC = Widely Applicable Information Criterion

G = Generalization error = Kullback-Leibler distance between the true distribution and the estimated predictive distribution

Diagnosis: If LOO differs from WAIC, then the sample contains leverage sample points. A statistician should reconsider whether such points ought to be included in the statistical inference. Leverage sample points can be found because they make the pointwise functional variances larger.
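The diagnosis can be sketched in code. The following is a minimal example, assuming a matrix `loglik[s, i] = log p(x_i | w_s)` computed from posterior draws w_s. WAIC is the training loss plus the mean pointwise functional variance, and LOO is estimated here by the importance sampling formula rather than by refitting n times. The toy model (x_i ~ N(w, 1) with prior w ~ N(0, 10^2)), the artificial outlier, and all names are assumptions for illustration only.

```python
import numpy as np

def log_mean_exp(a, axis):
    """Numerically stable log of the mean of exp(a) along an axis."""
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(a - m), axis=axis))

def waic_and_loo(loglik):
    """Return (WAIC, importance sampling LOO, pointwise functional variances).

    loglik has shape (posterior draws, sample size); both criteria are on
    the per-observation loss scale.
    """
    train_loss = -np.mean(log_mean_exp(loglik, axis=0))
    func_var = np.var(loglik, axis=0)             # variance over posterior draws
    waic = train_loss + np.mean(func_var)
    loo = np.mean(log_mean_exp(-loglik, axis=0))  # log E_w[1/p(x_i|w)], averaged
    return waic, loo, func_var

rng = np.random.default_rng(1)
n = 100
x = rng.standard_normal(n)
x[0] = 6.0                                        # hypothetical outlying observation

# Toy conjugate model (assumption): x_i ~ N(w, 1), prior w ~ N(0, 10^2).
prior_var = 100.0
post_var = 1.0 / (n + 1.0 / prior_var)
post_mean = post_var * x.sum()
w = post_mean + np.sqrt(post_var) * rng.standard_normal(4000)

loglik = -0.5 * (x[None, :] - w[:, None]) ** 2 - 0.5 * np.log(2.0 * np.pi)
waic, loo, func_var = waic_and_loo(loglik)
suspect = int(np.argmax(func_var))                # point with the largest functional variance
```

Sorting the sample by `func_var` gives a simple ranking of candidate points to inspect; in a regression problem the same computation would use log p(y_i | x_i, w_s).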

More information about LOO and WAIC