# Bayesian model checking: a problem with Gelman's chi-squared test

#### lostsoul

##### New Member
So, I foolishly tried to implement a test Gelman and his colleagues propose to determine whether observed data actually fits a model (Bayesian Data Analysis, Edition 1, p.172; or better: Gelman, Meng, Stern, 1996). If the test worked as advertised, it would be really useful--an omnibus way to determine whether any model fits the data. But, it doesn't seem to work--it doesn't catch really poor data / model fits I generate with Monte Carlo. What I'm wondering is whether anyone else has tried to implement this test and had any luck with it, and whether I should expect it to catch the kind of bad models I've tested it against.

If I understood the test correctly, the test statistic involves a comparison between the chi-square fit of y's replicated (y.rep) from the fitted model (across various MCMC draws from the parameters, theta) and the chi-square fit of observed y's (y.obs) estimated using the same theta draws. The fit is the sum, across the y's, of the squared difference between each y and the expected value of y given theta (XB in OLS model), divided by the variance of y given theta. Each draw of theta generated two chi-square estimates, one for y.rep and one for y.obs. After a number of thousand of draws of theta, if the chi-squareds for y.obs prove systematically larger (or, I suppose, smaller) than the chi-squareds for y.rep, then the model probably doesn't fit the data.

The problem is that I've tried one Monte Carlo model after another in which I deliberately try to make the data not fit the proposed model, yet chi-sq(y.rep) and chi-sq(y.obs) are not systematically either larger or smaller than the other (chi-sq(y.rep) proves larger in 48% of the cases give or take a percentage point). I've tried an OLS model with normal errors against data with a t, df=1 error. I've tried another OLS model taking a sixth of the y's and setting them to zero (graphically: cloud of points well above zero and a row of points at zero).

I've been over the R script I wrote with a fine-tooth comb and see no errors. I can even see, as I'm stepping through, that, say, the zeroed y values are showing elevated error rates. But, the OLS inflates the std error to accommodate the outlying zero values, with the apparent result that the non-zero points have compensatingly less error. Once the chi-square sums up, there's no difference between y.rep and y.obs. The kinds of models Gelman, Meng, and Stern use to illustrate the test are unlike anything I have seen in my social science, so maybe the test only works in odd non-OLS situations?

#### Dason

Would you mind posting what you actually have coded up?

#### lostsoul

##### New Member
Hi Dason: Whoops, should have looked back sooner. Here's my script (see below). codaResults is standard output from JAGS (mcmc object list). df.model is a data frame with the model y and x's. yPositions and xPositions just indicate where in df.model the y and x's are. rightBOrder indicates the correct order of coefficients in the codaResults that correspond to the x's entered. My code isn't entirely elegant, but hopefully it is correct.

Code:
df.test.chi=fn.getGelmanChiSq(df.model1, v.yPositions=c(1), v.xPositions=c(2:4), codaResults, rightBOrder=c(1:3), sigmaLocation=4)

fn.getGelmanChiSq<-function(df.model, v.yPositions, v.xPositions, codaResults, rightBOrder, sigmaLocation)
{
#The formula is sum[ (y.rep.i-E(y.i | theta))^2 / Var(y.i | theta) ] .
#Substitute y.obs for y.rep for the 2nd Chi^2 formula.  What the
#formula means is in OLS case is (I think):  E(y.i | theta) = XB and
#Var(y.i | theta) = sigma^2 (for linear model).
#thetaMat is draws X # of thetas, say 6000 X 4 including sigma.
#bMat removes the sigma and transposes so it's 3 X 6000.
#Each column of bMat is another random draw of parameters.
#xMat is #obs X #coefficients; say 200 X 3.
#expYMat=xMat%*%bMat.  expYMat is 200 X 6000.  Each column is
#E(y.i | theta) for a given drawing of theta.
#You end up comparing y.rep.i-E(y.i | theta) w/ y.obs.i-E(y.i | theta)

#Get xMat and y:
y=as.vector(as.matrix(df.model[v.yPositions]))
xMat=as.matrix(df.model[v.xPositions])

#collapse codaResults
thetaMat=fn.combineCodaResults(codaResults)

#get the linear coefficients B
#(also must be transposed so each column is a sample of the B params):
bMat=t(thetaMat[,rightBOrder])

#get sigma:
sigma.v=thetaMat[,sigmaLocation]

#get E(y.i | theta):
#Each column corresponds to y1...yn for a given theta draw
expYMat=xMat%*%bMat

#create a matrix of y.obs:
Nobs=dim(xMat)
NdrawsTheta=dim(bMat)
yobsMat=matrix(data = y, nrow = Nobs, ncol = NdrawsTheta, byrow = F)

#create a matrix of sigmas (same sigma for each column):
sigmaMat=matrix(data = sigma.v, nrow = Nobs, ncol = NdrawsTheta, byrow = T)

#generate a matrix of random errors drawn from sigmaMat:
errMat=matrix(data=rnorm(Nobs*NdrawsTheta, sd=sigmaMat), nrow=Nobs, ncol=NdrawsTheta, byrow=F)

#Get sum(y.rep.i-E(y.i | theta))^2 / Var(y.i | theta)
#Note that y.rep.i-E(y.i | theta) is just error! because
#y.rep.i = E(y.i | theta) + error
chirep= (errMat^2) / (sigmaMat^2)
chirep= rep.int(1,Nobs)%*%chirep

#Calculate chi for observed values:
chiobs= (yobsMat-expYMat)^2 / (sigmaMat^2)
chiobs= rep.int(1,Nobs)%*%chiobs

#combine into data matrix:
df.chi=as.data.frame( t(rbind(chiobs, chirep)) )
names(df.chi)=c("chiobs", "chirep")
df.chi
}

fn.combineCodaResults<-function(codaResults)
{
thetaMat=as.matrix(codaResults[])
for(i in 2:length(codaResults))
{
thetaMat=rbind(thetaMat, as.matrix(codaResults[[i]]))
}
thetaMat
}