Mathematical Theory of Bayesian Statistics

Sumio Watanabe, Mathematical Theory of Bayesian Statistics, CRC Press, 2018.

CONTENTS
Chapter 1. Definition of Bayesian Statistics
Chapter 2. Statistical Models
Chapter 3. Basic Formula of Bayesian Observables
Chapter 4. Regular Posterior Distribution
Chapter 5. Standard Posterior Distribution
Chapter 6. General Posterior Distribution
Chapter 7. Markov Chain Monte Carlo
Chapter 8. Information Criteria
Chapter 9. Topics in Bayesian Statistics
Chapter 10. Basic Probability Theory

A new mathematical theory is introduced, and the formulas for the generalization loss and the marginal likelihood are clarified even when the posterior distribution cannot be approximated by any normal distribution.

Examples

(1) Normal mixture: p(x|a,b) = (1-a) N(x) + a N(x-b), with true parameter (a0,b0) = (0.5,0.3). Even if the Fisher information matrix is positive definite at the true parameter, the posterior is far from any normal distribution. See the true posteriors for n=30 (mp4), n=300 (mp4), n=3000 (mp4), and n=30000 (mp4). The new theory enables us to study such singular posterior distributions, and it holds in all cases n=30, 300, 3000, and 30000.
Generalization loss, cross-validation loss, and WAIC in the normal mixture (mp4).
Generalization loss, cross-validation loss, and WAIC in linear regression (mp4).

(2) Neural network: Y = a tanh(bx) + Z. Bayes and maximum likelihood (mp4). Conditional density estimation: the true q(y|x1,x2) is estimated by a neural network p(y|x1,x2). You can see a neural network in the Bayesian posterior distribution: a Neural Network in Bayes. Generalization loss, cross-validation loss, WAIC, and AIC: CV and WAIC in Neural Networks (mp4). We need algebraic geometry to study neural networks. For more examples, see Cross Validation and WAIC in Layered Neural Networks.

(3) Comparison of LOO and WAIC: In regression problems, LOO requires simultaneous independence of {(Xi,Yi)}, whereas WAIC needs only conditional independence of {Yi|Xi}.
In this book, we give the mathematical proof of this fact. See the true values of |LOO - G| and |WAIC - G|: (3A) Independent Case (mp4), (3B) Influential Observation Case (mp4). Since the variance of LOO is larger than that of WAIC, WAIC is the better criterion for choosing the optimal hyperparameter: Optimization of Hyperparameter by minimizing LOO or WAIC (mp4), where
LOO = leave-one-out cross validation,
WAIC = widely applicable information criterion,
G = generalization error = Kullback-Leibler distance between the true and the estimated distribution.
Diagnosis: If LOO differs from WAIC, then leverage sample points are contained in the sample. A statistician should reconsider whether such points ought to be included in the statistical inference. Leverage sample points can be found because they make the pointwise functional variances larger.
More information about LOO and WAIC.
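The non-normal posterior of the normal mixture in example (1) can be explored numerically. The sketch below draws synthetic data from the true parameter (a0,b0) = (0.5,0.3) and runs a random-walk Metropolis sampler; the uniform prior on (a,b) and all sampler settings are illustrative assumptions, not choices made in the book:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_mixture(x, a, b):
    # log p(x | a, b) for the mixture (1-a) N(x) + a N(x-b), unit variances
    c = -0.5 * np.log(2 * np.pi)
    la = np.log1p(-a) + c - 0.5 * x**2        # component N(x)
    lb = np.log(a) + c - 0.5 * (x - b)**2     # component N(x-b)
    return np.logaddexp(la, lb)

def log_posterior(theta, data):
    a, b = theta
    # uniform prior on a in (0,1), b in (-5,5) -- an illustrative assumption
    if not (0.0 < a < 1.0 and -5.0 < b < 5.0):
        return -np.inf
    return log_mixture(data, a, b).sum()

# synthetic sample from the true parameter (a0, b0) = (0.5, 0.3)
n = 300
z = rng.random(n) < 0.5
data = rng.normal(np.where(z, 0.3, 0.0), 1.0)

# random-walk Metropolis over theta = (a, b)
theta = np.array([0.5, 0.0])
lp = log_posterior(theta, data)
samples = []
for t in range(20000):
    prop = theta + rng.normal(0.0, 0.05, size=2)
    lp_prop = log_posterior(prop, data)
    if np.log(rng.random()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    if t >= 5000:                 # discard burn-in
        samples.append(theta.copy())
samples = np.asarray(samples)
print(samples.mean(axis=0))       # posterior mean of (a, b)
```

Plotting the draws as a scatter in the (a,b) plane shows the elongated, curved shape that no normal approximation captures, in line with the posterior movies above.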
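LOO, WAIC, and the pointwise functional variance mentioned above can all be computed from one matrix of pointwise log-likelihoods evaluated at posterior draws. The sketch below uses a conjugate normal-mean toy model (the model, prior, and data are illustrative assumptions chosen so that exact posterior draws are available without MCMC):

```python
import numpy as np

rng = np.random.default_rng(1)

# Data from N(0.5, 1); model N(mu, 1) with prior mu ~ N(0, 10^2)
n = 100
x = rng.normal(0.5, 1.0, size=n)

# Exact conjugate posterior: mu | x ~ N(m, s2)
s2 = 1.0 / (n + 1.0 / 10.0**2)
m = s2 * x.sum()
mu = rng.normal(m, np.sqrt(s2), size=4000)      # posterior draws

# Pointwise log-likelihood matrix: L[k, i] = log p(x_i | mu_k)
L = -0.5 * np.log(2 * np.pi) - 0.5 * (x[None, :] - mu[:, None])**2

# WAIC = training loss + mean functional variance:
# lppd_i = log E_k[ p(x_i | mu_k) ],  V_i = Var_k[ log p(x_i | mu_k) ]
lppd = np.logaddexp.reduce(L, axis=0) - np.log(L.shape[0])
V = L.var(axis=0)                # pointwise functional variance
waic = -lppd.mean() + V.mean()

# Importance-sampling LOO loss: (1/n) sum_i log E_k[ 1 / p(x_i | mu_k) ]
loo = np.log(np.exp(-L).mean(axis=0)).mean()

print(waic, loo)
```

Without leverage points the two estimates agree closely, and a clear gap between them flags influential observations via large V_i, which is the diagnosis described above.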