Importance of regression assumptions


Fortran must die
Long ago I was brought up on the view that analysis of the residuals was critical for regression. But I am confused about the advice on it now. This is for data sets that have thousands (or tens of thousands) of points. That is, how important is residual analysis?

Normality: The sense I get is that few concern themselves with this anymore with large data sets. That suggests not reviewing it.
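One reason normality testing gets downplayed at this scale: with tens of thousands of points, a formal test will flag departures far too small to matter. A minimal sketch (Python/scipy rather than R, with simulated "residuals" as a stand-in) showing a nearly-normal sample failing the Jarque-Bera test anyway:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 20_000

# Stand-in residuals from a t(20) distribution: slightly heavier
# tails than normal, visually indistinguishable on a Q-Q plot
resid = rng.standard_t(df=20, size=n)

stat, p = stats.jarque_bera(resid)
print(f"Jarque-Bera statistic={stat:.1f}, p-value={p:.2g}")
```

The test rejects decisively even though the deviation is practically irrelevant, which is why a Q-Q plot tends to be more informative than a p-value here.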

Heteroskedasticity: I am unclear what its importance is anymore. Some suggest just using White standard errors and ignoring it.

Outliers (this includes the issue of leverage): It is not clear to me what the views on this are. Two points are particularly pertinent. In a large data set, can one or a few points move the regression line? And if you find an outlier (assuming it matters), what do you do? The most common advice I have seen recently is to leave it in, unless it is just a mistake.

Independence: This is seen as important. But I have never found a test for it outside time series, in the form of serial correlation. I understand that in some cases theory leads you to conclude that it has been violated, but I know of no formal way to test this. I have not seen a solution for it other than time series and multilevel models (which are special cases where theory suggests independence is likely to be violated).
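That matches my understanding: the standard tests (Durbin-Watson and friends) only detect dependence when the observations have a meaningful ordering. A sketch of what is and is not detectable, on simulated residuals:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(7)
n = 2_000

# Independent residuals: Durbin-Watson statistic sits near 2
indep = rng.normal(size=n)

# AR(1) residuals (rho = 0.6): DW drops toward 2 * (1 - rho)
ar1 = np.empty(n)
ar1[0] = rng.normal()
for t in range(1, n):
    ar1[t] = 0.6 * ar1[t - 1] + rng.normal()

print("DW independent:", durbin_watson(indep))
print("DW AR(1):      ", durbin_watson(ar1))
```

For cross-sectional data with no natural ordering there is nothing comparable to compute, which is why the practical advice falls back on design knowledge (clustering, repeated measures) rather than a test.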

I have grown lax about analyzing residuals because of the many comments that downplay their importance (and admittedly because I may have the population of interest - this is subject to dispute).


Active Member
Residuals may give a warning that the underlying model is wrong or can be improved, and they can certainly give information about prediction intervals.


Fortran must die
Yes, I use them. I am just less sure, given the recent literature, how important many of the classical assumptions are with a large data set.


Less is more. Stay pure. Stay poor.
To pour some fuel on the fire: I saw that Paul Allison said that when fitting an LPM via OLS, if the sample is large enough you may not need to use sandwich SEs.


Fortran must die
Paul Allison has come up with a new experimental approach, with some other statisticians, that supposedly deals with the problems of the LPM. But the research is very new.


Fortran must die
hlsmith, when you check your data for violations, what exactly do you check for, and how? You already explained how you check for nonlinearity.
Fox & Weisberg have a good example of why one might want to check outliers and either correct the entry or delete it if it's a clear typo: a single bad entry doubling or tripling the regression coefficients, SDs, and SEs.


Fortran must die
OK, this may seem like a dumb question, but: in their book on MLM, Bryk and Raudenbush argue that you should look at the univariate distribution of all variables and the bivariate relationships between the DV and the IVs. I have seen others suggest those tell you little.

I have a model with 42 predictors, only one of which I care about (the others are control variables I have to leave in the model). Do you need to do this type of analysis for variables that are essentially nuisance variables you don't care about?

I am not sure there is agreement that with large samples you need to do this at all (I have about 20,000 cases).
Wouldn't the bivariate analysis possibly be useful for checking for extreme non-linear relations, and for relations with no bivariate IV effect on the DV? Non-linear relations could then be transformed (if hypothesis testing with linear models). IVs with no effect could be omitted for model comparisons. The car package's scatterplotMatrix(~IV1+IV2...IVn, data=data) could be a good bivariate visualisation for a numeric DV with multiple numeric IVs.
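For anyone working outside R, a rough Python analogue of that car::scatterplotMatrix call is pandas' scatter_matrix (a sketch with made-up variable names; with 42 predictors you would plot subsets, not all of them at once):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(3)
# Hypothetical DV and two IVs standing in for real columns
df = pd.DataFrame({
    "dv":  rng.normal(size=500),
    "iv1": rng.normal(size=500),
    "iv2": rng.uniform(size=500),
})

# One panel per variable pair, histograms on the diagonal
axes = scatter_matrix(df, figsize=(6, 6), diagonal="hist")
print(axes.shape)
```

As with the R version, the value is in eyeballing gross non-linearity and flat DV-IV relations, not in any formal test.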