Diagnostic question

noetsi

Fortran must die
#1
An author I read says that you should always do this.

Plot Y vs X (check for linearity, outliers)

He is talking about bivariate Y and X scatterplots.

I was wondering what others think about this. Since I sometimes have 20 thousand or so points (and always thousands), I usually don't worry about individual outliers. I am also not sure it makes sense to look at bivariate relationships this way when you have many regressors.

Maybe you should instead look at partial regression plots?
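For anyone who wants to see what a partial regression (added-variable) plot actually computes, here is a minimal sketch in Python with NumPy. The data are simulated and the function name `partial_regression_residuals` is made up for illustration; it is not from any package. The idea is to regress both Y and the regressor of interest on all the other regressors and then plot one set of residuals against the other. The slope of that residual-on-residual fit equals the coefficient from the full multiple regression.

```python
import numpy as np

def partial_regression_residuals(y, X, j):
    """Return (x_resid, y_resid) for the added-variable plot of column j.

    X holds the regressors without an intercept column; one is added here.
    This helper is hypothetical, written only for this illustration.
    """
    n = X.shape[0]
    # Intercept plus every regressor except column j
    others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
    # Residuals of y after removing the linear effect of the other regressors
    beta_y, *_ = np.linalg.lstsq(others, y, rcond=None)
    y_resid = y - others @ beta_y
    # Residuals of regressor j after removing the same linear effect
    beta_x, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    x_resid = X[:, j] - others @ beta_x
    return x_resid, y_resid

# Simulated data (an assumption, not the tuition/wage data from this thread)
rng = np.random.default_rng(0)
n = 20_000
X = rng.normal(size=(n, 3))
X[:, 1] += 0.5 * X[:, 0]  # make two regressors correlated
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=n)

x_res, y_res = partial_regression_residuals(y, X, j=0)

# Slope of the residual-on-residual fit (both residual vectors have mean zero)
slope = (x_res @ y_res) / (x_res @ x_res)

# Coefficient on the same regressor from the full multiple regression
full = np.column_stack([np.ones(n), X])
beta_full, *_ = np.linalg.lstsq(full, y, rcond=None)

print(slope, beta_full[1])  # the two values agree up to rounding error
```

Plotting `y_res` against `x_res` gives the partial regression plot for that regressor. Curvature in that cloud is what would hint at non-linearity after adjusting for the other variables.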
 

noetsi

Fortran must die
#2
These are bivariate "tests" for non-linearity (and I understand the limitations of bivariate tests in a multivariate model). They don't look like the non-linearity I have seen before, but the pattern between sum of tuition and Q2wage (which is my dependent variable) does not look normal.

Sum of tuition is spending on tuition that ranges from 0 dollars, in most cases, to thousands.
 


noetsi

Fortran must die
#3
To me the diagnostics look OK (they do for proc genmod; they don't for proc reg, but I think this is because proc reg has no class statement). But the data is very much non-normal.

So, in interpreting the results (I have 20 thousand points), how serious a problem is it for the p-values that the data is non-normal? Do I need to do a transformation to address the non-normality?
 


noetsi

Fortran must die
#4
This is what I thought was true.

"It is widely but incorrectly believed that the t-test and linear regression are valid only for Normally distributed outcomes. The t-test and linear regression compare the mean of an outcome variable for different subjects. While these are valid even in very small samples if the outcome variable is Normally distributed, their major usefulness comes from the fact that in large samples they are valid for any distribution. We demonstrate this validity by simulation in extremely non-Normal data. We discuss situations in which in other methods such as the Wilcoxon rank sum test and ordinal logistic regression (proportional odds model) have been recommended, and conclude that the t-test and linear regression often provide a convenient and practical alternative. The major limitation on the t-test and linear regression for inference about associations is not a distributional one, but whether detecting and estimating a difference in the mean of the outcome answers the scientific question at hand."

https://pubmed.ncbi.nlm.nih.gov/11910059/
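The large-sample claim in that abstract can be checked with a quick simulation. This is only a rough sketch under my own assumptions, not the simulation from the paper: two groups are drawn from the same heavily skewed lognormal distribution, so the null hypothesis of equal means is true, and we count how often a two-sample (Welch) t statistic exceeds the usual 1.96 cutoff. The group size, number of replications, and choice of distribution are all arbitrary picks for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_group = 5_000   # assumed group size, in the range discussed in this thread
n_sims = 2_000        # number of simulated data sets
rejections = 0

for _ in range(n_sims):
    # Both groups come from the same skewed distribution, so the null is true
    a = rng.lognormal(mean=0.0, sigma=1.0, size=n_per_group)
    b = rng.lognormal(mean=0.0, sigma=1.0, size=n_per_group)
    # Welch t statistic; with samples this large the normal cutoff applies
    se = np.sqrt(a.var(ddof=1) / n_per_group + b.var(ddof=1) / n_per_group)
    t = (a.mean() - b.mean()) / se
    rejections += abs(t) > 1.96

rate = rejections / n_sims
print(rate)  # should land near the nominal 0.05 if the test is behaving
```

If the rejection rate stays close to 5% even though the outcome is far from normal, that is consistent with the abstract's point. It says nothing about whether a difference in means is the right question for the data.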
 

noetsi

Fortran must die
#5
OK, for non-linearity I am using partial regression plots, which seem to be better than bivariate scatterplots. But I have no idea whether what I see suggests non-linearity. Nothing in the literature says anything about this type of behavior :)
 
