Sumio Watanabe Homepage
WAIC and WBIC
Key words: Bayes statistics, model selection, information criterion
Let us compare WAIC with DIC1 and DIC2 by experiments.
The main difference is that WAIC has theoretical support, whereas DIC1 and DIC2 do not.
To support WAIC, we need neither Fisher's asymptotic theory nor the Laplace approximation.
Beyond Fisher and Laplace, a completely new statistical theory was established.
WAIC is not an ad hoc quantity but a universal concept.
(1) The expectation values of DIC1 and DIC2 are equal to those of the generalization error if
the true distribution is regular for and realizable by the statistical model.
(2) The expectation value of WAIC is equal to that of the generalization error even if
the true distribution is singular for or unrealizable by the statistical model.
(3) The expectation value of WAIC is equal to that of the generalization error even if
the true distribution is in a delicate case.
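For reference, WAIC is computed from posterior averages E_w[ . ], where T is the training loss of the Bayes predictive distribution and V is the functional variance (notation as in the references below):

```latex
\mathrm{WAIC} = T + \frac{V}{n}, \qquad
T = -\frac{1}{n}\sum_{i=1}^{n} \log \mathbb{E}_w\!\left[\, p(X_i \mid w) \,\right], \qquad
V = \sum_{i=1}^{n} \left\{ \mathbb{E}_w\!\left[ (\log p(X_i \mid w))^2 \right]
  - \mathbb{E}_w\!\left[ \log p(X_i \mid w) \right]^2 \right\}.
```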
MATLAB program (NEW)
(Remark) The average parameter E_w[w] that is used in the
original DIC has no meaning in nonidentifiable statistical models, because the set of
optimal parameters is an analytic set with singularities.
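A minimal toy sketch of this remark (all numbers hypothetical, not the experiment on this page): in a nonidentifiable model the posterior can have several exchangeable modes, e.g. by label switching, and the average parameter E_w[w] then falls between the modes, at a point that is not an optimal parameter at all.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy posterior for a symmetric two-component mixture: permuting the
# component labels gives two equivalent posterior modes, e.g.
# (a, b) = (-2, 2) and (a, b) = (2, -2).  (Hypothetical numbers.)
mode1 = rng.normal([-2.0, 2.0], 0.1, size=(5000, 2))
mode2 = rng.normal([2.0, -2.0], 0.1, size=(5000, 2))
posterior_samples = np.vstack([mode1, mode2])

# The "average parameter" E_w[w] used by the original DIC:
w_bar = posterior_samples.mean(axis=0)
print(w_bar)  # close to (0, 0), near neither optimal parameter
```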
(Regular Case) In this case, lambda=nu=2, because the
number of parameters is 4. The functional variance converges to a constant, 4, in probability.
The probability distribution of GE is the chi-square distribution with 4 degrees of freedom.
If a regular-case criterion, RIC = T + 4/n, which has the same form as AIC, is applied, then
DIC1, DIC2, and WAIC are asymptotically equivalent to RIC.
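The quantities above can be estimated directly from posterior samples. The following is a minimal Python sketch (an assumed toy setup, not the MATLAB experiment on this page) that computes the training error T, the functional variance V, WAIC = T + V/n, and the regular-case criterion T + d/n for a one-parameter normal model; in this regular, realizable case the two criteria are close, as the asymptotic equivalence suggests.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regular model (assumed): x ~ N(mu, 1) with a N(0, 100) prior on mu.
n = 200
x = rng.normal(0.5, 1.0, n)

# Conjugate posterior for mu is N(m, s2); draw posterior samples of w = mu.
s2 = 1.0 / (n + 0.01)
m = s2 * x.sum()
mu_samples = rng.normal(m, np.sqrt(s2), 4000)

# log p(x_i | w) for every posterior draw (rows) and data point (columns)
loglik = -0.5 * np.log(2 * np.pi) - 0.5 * (x[None, :] - mu_samples[:, None]) ** 2

# Training error T: minus the mean log predictive density
T = -np.mean(np.log(np.mean(np.exp(loglik), axis=0)))
# Functional variance V: sum over data of the posterior variance of log p(x_i|w)
V = np.sum(np.var(loglik, axis=0))

waic = T + V / n
ric = T + 1.0 / n  # regular-case criterion with d = 1 parameter
print(waic, ric)
```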
(Singular Case) In this case, lambda=1/2, but nu is unknown.
The functional variance converges in law to some random variable.
The probability distribution of GE is still unknown, but it seems similar to
the chi-square distribution with 1 degree of freedom. Hence GE very frequently takes a value
smaller than 0.0025, but sometimes a value much larger than 0.0025.
Note that the variance of WAIC is smaller than that of the training error.
If an information criterion is defined as the sum of the training error and
some constant, such as T + 0.005,
then its variance is equal to that of the training error.
(Delicate Case) In this case, both lambda and nu
are unknown. The optimal parameter for the true distribution p(x|0.5,0.3,0.3,0.3)
is a regular point of the statistical model, but such a delicate case cannot be treated by regular
statistical theory. The posterior distribution cannot be approximated by any normal distribution.
If you are a statistician, you know that almost all practical statistical procedures,
such as model selection and hypothesis testing, are conducted in such delicate cases.
In other words, singular statistical theory is not a special theory but a general one
that is necessary in generic practical problems. In all cases,
the expectation value of WAIC is equal to that of the generalization error,
and the variance of WAIC is smaller than those of DIC1 and DIC2.
From the mathematical point of view, singular statistical theory contains regular
theory as a very special part.
Theoretical basis of this page:
Algebraic geometry is necessary for the theoretical support of WAIC.
Sumio Watanabe, Algebraic Geometry and Statistical Learning Theory, Cambridge University Press, 2009
Sumio Watanabe, ``Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory,''
Journal of Machine Learning Research, 11, 3571-3594, 2010.
(Special Remark) On this page, we compared WAIC with DIC1 and DIC2. In Bayes statistics, the log marginal likelihood, which
is sometimes referred to as the free energy or the stochastic complexity, is a very important observable. From the model selection
point of view, the Bayes marginal likelihood is not an unbiased estimator of the generalization error, but it has consistency.
On the other hand, WAIC is an unbiased estimator of the generalization error, but it does not have consistency.
The author recommends both criteria as useful, because each of them carries important but different information about the
relation between the true distribution and the statistical model.
In singular cases, Schwarz's BIC does not give the true asymptotics of the Bayes marginal likelihood. Its theoretical value is determined by
the real log canonical threshold. Please see:
Sumio Watanabe, ``Algebraic analysis for nonidentifiable learning machines," Neural Computation, 13(4), pp.899-933, 2001.