Sumio Watanabe Homepage

WAIC and WBIC

Let us compare WAIC with DIC1 and DIC2 by experiments.

The main difference between them is that WAIC has theoretical support, whereas DIC1 and DIC2 do not.

To support WAIC, we need neither Fisher asymptotic theory nor the Laplace approximation. Beyond Fisher and Laplace, a completely new statistical theory was established. WAIC is not an ad hoc quantity but a universal concept.

(1) The expectation values of DIC1 and DIC2 are equal to that of the generalization error if the true distribution is regular for and realizable by the statistical model.

(2) The expectation value of WAIC is equal to that of the generalization error even if the true distribution is singular for or unrealizable by the statistical model.

(3) The expectation value of WAIC is equal to that of the generalization error even if the true distribution is in a delicate case.

(Remark) The average parameter E_w[w] used in the original DIC has no meaning in nonidentifiable statistical models, because the set of optimal parameters is an analytic set with singularities.
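As a concrete point of reference, all three criteria can be computed from posterior draws. The sketch below is my illustration, not the code of the experiments on this page: it assumes a matrix logp[s, i] = log p(x_i | w_s) over S posterior draws w_s. waic() follows Watanabe's definition (Bayes training loss plus functional variance over n), while dic() uses the standard Spiegelhalter p_D and Gelman p_V forms of the effective number of parameters, which may be scaled differently from the DIC1 and DIC2 compared here.

```python
import numpy as np

def waic(logp):
    """WAIC = T + V/n, from logp[s, i] = log p(x_i | w_s) over posterior draws.
    T is the Bayes training loss, V the functional variance (Watanabe)."""
    S, n = logp.shape
    # T: minus the mean log posterior-predictive density (log-sum-exp for stability)
    T = -np.mean(np.logaddexp.reduce(logp, axis=0) - np.log(S))
    V = logp.var(axis=0).sum()            # sum_i Var_w[ log p(x_i | w) ]
    return T + V / n

def dic(logp, logp_plugin, version=1):
    """DIC from the same matrix plus logp_plugin[i] = log p(x_i | E_w[w]).
    version=1 uses p_eff = mean deviance - plug-in deviance (Spiegelhalter);
    version=2 uses p_eff = Var(deviance) / 2 (Gelman)."""
    D = -2.0 * logp.sum(axis=1)           # deviance of each posterior draw
    Dhat = -2.0 * logp_plugin.sum()       # deviance at the posterior mean
    p_eff = D.mean() - Dhat if version == 1 else D.var() / 2.0
    return Dhat + 2.0 * p_eff
```

For a regular one-parameter model, both p_eff variants come out close to 1, matching the AIC-type penalty; the differences between the criteria appear in singular and delicate cases.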

(Regular Case) In this case, lambda = nu = 2, because the number of parameters is 4. The functional variance converges to a constant, 4, in probability. The probability distribution of GE is the chi-square distribution with 4 degrees of freedom. If a regular-case criterion RIC = T + 4/n, which has the same form as AIC, is applied, then DIC1, DIC2, and WAIC are asymptotically equivalent to RIC.
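The claim that the functional variance tends to the number of parameters in a regular, realizable case can be checked in a toy model. The sketch below uses a d = 4 normal-mean model with a flat prior, an assumed setting of my own and not the model of these experiments, chosen because its posterior N(xbar, I/n) can be sampled exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, S = 4, 500, 4000        # parameters, sample size, posterior draws

# Toy regular model: p(x|w) = N(x; w, I_d), flat prior, true parameter w = 0.
# The posterior is then exactly N(xbar, I_d / n).
X = rng.normal(size=(n, d))
xbar = X.mean(axis=0)
W = xbar + rng.normal(size=(S, d)) / np.sqrt(n)

# logp[s, i] = log p(x_i | w_s), expanded to avoid a big (S, n, d) array:
# ||x_i - w_s||^2 = ||x_i||^2 - 2 x_i . w_s + ||w_s||^2
cross = W @ X.T                               # (S, n)
sqX = (X ** 2).sum(axis=1)                    # (n,)
sqW = (W ** 2).sum(axis=1)                    # (S,)
logp = -0.5 * (sqX[None, :] - 2 * cross + sqW[:, None]) \
       - 0.5 * d * np.log(2 * np.pi)

# Functional variance V_n = sum_i Var_w[ log p(x_i | w) ]
V = logp.var(axis=0).sum()
print(V)   # close to 4, the number of parameters
```

In the singular and delicate cases this quantity no longer converges to a constant, which is exactly where DIC-type corrections lose their justification.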

(Singular Case) In this case, lambda = 1/2, but nu is unknown. The functional variance converges in law to some random variable. The probability distribution of GE is still unknown, but it appears to resemble the chi-square distribution with 1 degree of freedom. Hence GE very frequently takes a value smaller than 0.0025, but sometimes a value much larger than 0.0025. Note that the variance of WAIC is smaller than that of the training error. If an information criterion were defined as the sum of the training error and a constant, such as T + 0.005, then its variance would be equal to that of the training error.

(Delicate Case) In this case, both lambda and nu are unknown. The optimal parameter for the true distribution p(x|0.5,0.3,0.3,0.3) is a regular point of the statistical model, but such a delicate case cannot be treated by regular statistical theory, and the posterior distribution cannot be approximated by any normal distribution. If you are a statistician, you know that almost all practical statistical procedures, such as model selection and hypothesis testing, are conducted in such delicate cases. In other words, singular statistical theory is not a special theory but a general one that is necessary in generic practical problems. In all cases, the expectation value of WAIC is equal to that of the generalization error, and the variance of WAIC is smaller than those of DIC1 and DIC2. From the mathematical point of view, singular statistical theory contains regular theory as a very special part.

Theoretical basis of this page:

Algebraic geometry is necessary for theoretical support of WAIC.

Sumio Watanabe, Algebraic Geometry and Statistical Learning Theory, Cambridge University Press, 2009

Sumio Watanabe, "Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory," Journal of Machine Learning Research, 11, 3571-3594, 2010.

(Special Remark) On this page, we compared WAIC with DIC1 and DIC2. In Bayesian statistics, the log marginal likelihood, sometimes referred to as the free energy or the stochastic complexity, is a very important observable. From the model-selection point of view, the Bayes marginal likelihood is not an unbiased estimator of the generalization error, but it has consistency. On the other hand, WAIC is an unbiased estimator of the generalization error, but it does not have consistency. The author recommends both criteria, because each carries important but different information about the relation between the true distribution and the statistical model. In singular cases, Schwarz's BIC does not give the true asymptotics of the Bayes marginal likelihood; the theoretical value is determined by the real log canonical threshold. Please see,

Sumio Watanabe, "Algebraic analysis for nonidentifiable learning machines," Neural Computation, 13(4), 899-933, 2001.
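For reference, the contrast between BIC and the singular asymptotics mentioned in the special remark can be written as follows (a standard statement of the theory in the references above; F_n is the minus log marginal likelihood, S_n the empirical entropy of the true distribution, d the number of parameters, and lambda the real log canonical threshold):

```latex
% Regular case / Schwarz BIC:
F_n = n S_n + \frac{d}{2} \log n + O_p(1)

% Singular case (Watanabe): \lambda \le d/2, so BIC generally
% overestimates the penalty for singular models:
F_n = n S_n + \lambda \log n + O_p(\log \log n)
```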