## WAIC Experiments

Sumio Watanabe


Key words: Bayes statistics, model selection, information criterion

Let us compare WAIC with DIC1 and DIC2 by numerical experiments.

The main difference among them is that WAIC has theoretical support, whereas DIC1 and DIC2 do not.

To support WAIC, we need neither Fisher asymptotic theory nor the Laplace approximation. Beyond Fisher and Laplace, a completely new statistical theory was established. WAIC is not an ad hoc quantity but a universal concept.
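To make the quantities concrete, here is a minimal sketch of how WAIC is computed from posterior draws, using the definition WAIC = T + V/n, where T is the training loss and V is the functional variance. The toy normal-mean model, prior, and all names here are illustrative assumptions, not the program distributed on this page.

```python
# Hedged sketch (not the MATLAB program on this page): WAIC from posterior
# samples for a toy model N(mu, 1) with a N(0, 1) prior on mu, where the
# posterior of mu is available in closed form.
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(0.0, 1.0, size=n)              # data from the true N(0, 1)

# Conjugate posterior for mu: N(n*xbar/(n+1), 1/(n+1))
post_mean = n * x.mean() / (n + 1)
post_sd = np.sqrt(1.0 / (n + 1))
mu = rng.normal(post_mean, post_sd, size=5000)  # posterior draws

# log p(x_i | mu_k), shape (n, 5000)
logp = -0.5 * np.log(2 * np.pi) - 0.5 * (x[:, None] - mu[None, :]) ** 2

# Training loss: T = -(1/n) sum_i log E_w[ p(x_i | w) ]
T = -np.mean(np.log(np.mean(np.exp(logp), axis=1)))

# Functional variance: V = sum_i Var_w[ log p(x_i | w) ]
V = np.sum(np.var(logp, axis=1))

waic = T + V / n
print(T, V, waic)
```

For this regular one-parameter model, T comes out near the entropy of the true distribution (about 1.42 nats) and V near 1, so the correction V/n is small, as the theory predicts.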

(1) The expectation values of DIC1 and DIC2 are equal to that of the generalization error if the true distribution is regular for and realizable by the statistical model.

(2) The expectation value of WAIC is equal to that of the generalization error even if the true distribution is singular for or unrealizable by the statistical model.

(3) The expectation value of WAIC is equal to that of the generalization error even if the true distribution is in a delicate case.

MATLAB program (NEW)

(Remark) The average parameter E_w[w] used in the original DIC has no meaning in nonidentifiable statistical models, because the set of optimal parameters is an analytic set with singularities.

(Regular Case) In this case, lambda = nu = 2, because the number of parameters is 4. The functional variance converges to a constant, 4, in probability. The probability distribution of GE is the chi-square distribution with 4 degrees of freedom. If a regular-case criterion RIC = T + 4/n, which has the same form as AIC, is applied, then DIC1, DIC2, and WAIC are asymptotically equivalent to RIC.

(Singular Case) In this case, lambda = 1/2, but nu is unknown. The functional variance converges in law to some random variable. The probability distribution of GE is still unknown, but it seems similar to the chi-square distribution with 1 degree of freedom. Hence GE very frequently takes a value smaller than 0.0025, but sometimes a value much larger than 0.0025. Note that the variance of WAIC is smaller than that of the training error. If an information criterion is defined as the sum of the training error and some constant, such as T + 0.005, then its variance is equal to that of the training error.

(Delicate Case) In this case, both lambda and nu are unknown. The optimal parameter for the true distribution p(x|0.5,0.3,0.3,0.3) is a regular point of the statistical model, but such a delicate case cannot be treated by regular statistical theory. The posterior distribution cannot be approximated by any normal distribution. If you are a statistician, you know that almost all practical statistical procedures, such as model selection and hypothesis testing, are conducted in such delicate cases. In other words, singular statistical theory is not a special theory but a general one that is necessary in generic practical problems.

In all cases, the expectation value of WAIC is equal to that of the generalization error, and the variance of WAIC is smaller than those of DIC1 and DIC2.
From the mathematical point of view, singular statistical theory contains regular theory as a very special part. 
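For comparison, the DIC variants can be sketched as follows on the same per-datum log-loss scale as the training error T. This assumes the common textbook definitions (Spiegelhalter's plug-in form for DIC1 and a posterior-variance form for DIC2); the exact constants used in the experiments on this page may differ, so treat this as an illustrative assumption, again on a toy normal-mean model rather than the 4-parameter model of the experiments.

```python
# Hedged sketch: DIC1 (plug-in at the posterior mean) and DIC2 (variance-based
# effective number of parameters), rescaled to per-datum log loss. Toy model
# N(mu, 1) with a N(0, 1) prior; not the model used on this page.
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(0.0, 1.0, size=n)

# Conjugate posterior for mu and posterior draws
post_mean = n * x.mean() / (n + 1)
post_sd = np.sqrt(1.0 / (n + 1))
mu = rng.normal(post_mean, post_sd, size=5000)

# log p(x_i | mu_k), shape (n, 5000), and the plug-in log p(x_i | E_w[mu])
logp = -0.5 * np.log(2 * np.pi) - 0.5 * (x[:, None] - mu[None, :]) ** 2
logp_plugin = -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu.mean()) ** 2

D_bar = -np.mean(logp)          # posterior-averaged per-datum log loss
D_hat = -np.mean(logp_plugin)   # plug-in per-datum log loss

# DIC1: plug-in loss plus twice the optimism gap (Spiegelhalter-style)
dic1 = D_hat + 2.0 * (D_bar - D_hat)

# DIC2: averaged loss plus variance of the total log-likelihood, per datum
dic2 = D_bar + np.var(logp.sum(axis=0)) / n
print(dic1, dic2)
```

In this regular, realizable setting both criteria land near the same value as WAIC, which is consistent with statement (1) above; the experiments on this page concern exactly the singular and delicate cases where this agreement breaks down for DIC1 and DIC2.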