Chi vs R for goodness of fit

MALDATA

New Member
As an undergraduate, I had a physics professor who strongly preferred chi^2 as a measure of a good curve fit as opposed to the correlation coefficient r. I don't remember what her argument was, but I have used chi^2 as a measure of the goodness of a curve fit to experimental data ever since.

However, I started looking into this again recently, and I'm getting mixed reviews. Some prefer r, some prefer chi^2, and I don't understand the difference well enough to know which is the right choice. I'll be a teaching assistant in the spring for a course that involves analyzing a lot of experimental data, and I want to be able to make a proper argument one way or the other when I explain the two.

Can anyone shed some light on this? Thanks!

toxikhan

New Member
The only regression I know of where chi-square can be used to assess goodness of fit is a generalized linear model of binomially distributed data. This was first done for probit analysis (fitting the cumulative normal curve), but can easily apply to the logistic and other sigmoid models. For other generalized linear models, about all that can be done is to divide the deviance by degrees of freedom. For a good fit, this number should be close to 1.
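A minimal sketch of that kind of chi-square check for grouped binomial data. All of the numbers here are made up for illustration, and the "fitted" probabilities are simply assumed to have come from some earlier sigmoid fit rather than being estimated in the code:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical dose-response data: trials and successes at each dose,
# with fitted probabilities assumed to come from a prior logistic fit
# (all values are invented for illustration).
n = np.array([50, 50, 50, 50, 50])                 # trials per dose group
y = np.array([4, 11, 24, 36, 46])                  # observed successes
p_hat = np.array([0.08, 0.22, 0.50, 0.74, 0.90])   # assumed fitted probabilities

# Pearson chi-square: sum of (observed - expected)^2 / binomial variance
expected = n * p_hat
variance = n * p_hat * (1 - p_hat)
x2 = np.sum((y - expected) ** 2 / variance)

# df = number of groups minus number of fitted parameters (2 for a logistic)
df = len(n) - 2
p_value = chi2.sf(x2, df)
print(x2, df, p_value)
```

A statistic close to its degrees of freedom (here 3) indicates an adequate fit; a large value relative to df signals lack of fit or overdispersion.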

For normally and independently distributed data, an F test is used. The residual sum of squares is partitioned into a within-group (random error) mean square (the summed squared differences between each observation and its group average) and a lack-of-fit mean square (the summed squared differences between the group averages and the regression), then the lack-of-fit MS is divided by the random-error MS, with k-p numerator and n-k denominator degrees of freedom (k distinct groups, n observations, p model parameters). Obviously, this requires a designed experiment where multiple observations are taken at distinct x's.
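The partitioning above can be sketched with a small invented designed experiment: three replicates at each of four distinct x's, a straight-line fit, and the resulting lack-of-fit F test (the y values are made up so that the group means curve away from the line):

```python
import numpy as np
from scipy.stats import f as f_dist

# Hypothetical designed experiment: 3 replicate y's at each of 4 distinct x's.
x = np.repeat([1.0, 2.0, 3.0, 4.0], 3)
y = np.array([2.1, 1.9, 2.0,   3.4, 3.6, 3.5,
              5.9, 6.1, 6.0,   8.4, 8.6, 8.5])

# Straight-line least-squares fit
b1, b0 = np.polyfit(x, y, 1)

groups = np.unique(x)
group_means = np.array([y[x == g].mean() for g in groups])
n_per = np.array([(x == g).sum() for g in groups])

# Pure-error SS: squared deviations of each y from its group mean
ss_pe = sum(((y[x == g] - y[x == g].mean()) ** 2).sum() for g in groups)

# Lack-of-fit SS: squared deviations of group means from the fitted line,
# weighted by the number of replicates in each group
ss_lof = (n_per * (group_means - (b0 + b1 * groups)) ** 2).sum()

k, n, p = len(groups), len(y), 2          # groups, observations, parameters
F = (ss_lof / (k - p)) / (ss_pe / (n - k))
p_value = f_dist.sf(F, k - p, n - k)
print(F, p_value)
```

For this fabricated data the group means bow away from the straight line, so the lack-of-fit mean square is large relative to the pure-error mean square and the F test rejects the straight-line model.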

r^2 does not assess the fit of a model; it is the proportion of variance the model explains. r is simply the correlation coefficient, and it is highest when the first-order (linear) association between x and y is strong. A regression that includes a squared term may fit the data very well yet have a low r, because the association deviates from a strictly first-order one. Just because a model explains a high proportion of variance does not mean it fits well, and a model may fit well but not explain a high proportion of variance.
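An extreme made-up example of that point: for data with a purely quadratic relationship on a symmetric range, the linear correlation r is essentially zero, yet a model containing the squared term fits essentially perfectly:

```python
import numpy as np

# Purely quadratic relationship on a symmetric range of x
x = np.linspace(-3, 3, 61)
y = x ** 2

# First-order (linear) correlation coefficient: near zero by symmetry
r = np.corrcoef(x, y)[0, 1]

# r^2 from a fit that includes the squared term: essentially 1
coeffs = np.polyfit(x, y, 2)
resid = y - np.polyval(coeffs, x)
r2_quad = 1 - resid.var() / y.var()
print(r, r2_quad)
```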

The best initial assessment of model fit is to scatterplot the studentized, or preferably the Pearson, residuals against the predicted values. The points should be evenly distributed about a horizontal line at zero across the range of prediction. If the distribution of the residuals expands going one way, usually to the right, you have an unexplained variance component. If the density of the residuals veers up or down, you need additional parameters in your model.
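A sketch of the "expanding residuals" case, using simulated data whose noise grows with x. Instead of plotting, it uses a crude numeric stand-in for eyeballing the fan shape: if the spread grows with the prediction, the absolute residuals correlate positively with the fitted values. (The data, seed, and this particular diagnostic shortcut are all inventions for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data whose error spread grows with x: a residual-vs-fitted
# plot would fan out to the right (an unexplained variance component)
x = np.linspace(1, 10, 200)
y = 2 + 3 * x + rng.normal(0, 0.5 * x)    # noise sd proportional to x

b1, b0 = np.polyfit(x, y, 1)
fitted = b0 + b1 * x
resid = y - fitted
std_resid = resid / resid.std()           # crude standardized residuals

# Numeric stand-in for eyeballing the plot: positive correlation between
# |residual| and fitted value indicates an expanding fan shape
fan = np.corrcoef(fitted, np.abs(std_resid))[0, 1]
print(fan)
```

A clearly positive `fan` here suggests heteroscedasticity; for a well-behaved fit it should hover near zero.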

You should also consider a variable selection procedure. There are a number of these, and there are arguments for and against each one, but any is better than none, and they are all based on the concept of testing the significance of each parameter using a Type I or Type III sum of squares. This will optimize the fit of the model using the fewest parameters.
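One such procedure, backward elimination with drop-one (Type III-style) partial F tests, can be sketched as below. The data are simulated, the 0.05 retention threshold is an arbitrary choice, and this is only one of the selection schemes mentioned above, not a recommendation of it over the others:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(1)

# Simulated data: y depends on the first two predictors; the third is noise
n = 100
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

def rss(cols):
    """Residual sum of squares for a model with intercept + given columns."""
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return r @ r

# Backward elimination: repeatedly drop the predictor whose drop-one
# (Type III-style) partial F test is least significant
cols = [0, 1, 2]
while cols:
    full = rss(cols)
    df_resid = n - len(cols) - 1
    Fs = [(rss([c for c in cols if c != j]) - full) / (full / df_resid)
          for j in cols]
    pvals = [f_dist.sf(F, 1, df_resid) for F in Fs]
    worst = int(np.argmax(pvals))
    if pvals[worst] < 0.05:
        break                    # everything remaining is significant
    cols.pop(worst)

print(cols)
```

With these simulated data the two genuine predictors should survive the elimination, which illustrates the "fewest parameters" goal above.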