Including polynomial terms when the adjusted R-squared does not change

MSem

New Member
#1
I am trying to fit a multiple regression model. Plotting the model residuals against each continuous variable in the model shows non-linearity in some cases. I use the poly() function in R to add a second-degree (quadratic) term for these variables, one at a time, and run anova() to compare the models. When I add one of the polynomial terms, the adjusted R-squared does not change, although the ANOVA shows that adding the polynomial improves model fit. Should I include the polynomial in this case, especially considering the reduced degrees of freedom and the more complicated interpretation? Also, I have six independent variables in my model, and two of them seem to be quadratic. Is that OK? I am absolutely new to quantitative analysis :( Thanks!
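A minimal sketch of this kind of comparison, with placeholder names (dat for the data frame, y for the outcome, x1..x6 for the predictors):

# Hypothetical names: dat is the data frame, y the outcome, x1..x6 the predictors
fit_lin  <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = dat)
fit_poly <- update(fit_lin, . ~ . - x1 + poly(x1, 2))   # replace x1 with a quadratic in x1

anova(fit_lin, fit_poly)          # partial F-test for the added quadratic term
summary(fit_lin)$adj.r.squared    # adjusted R-squared, linear model
summary(fit_poly)$adj.r.squared   # adjusted R-squared, quadratic model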
 
#2
You should decide whether extra predictors belong in a linear model using the ANOVA F-test and t-tests, not the adjusted R-squared. The adjusted R-squared is useful but, overall, less accurate. If there is heteroskedasticity in the data, interpret the p-values from the F-test and t-tests conservatively, i.e., treat p = 0.047 as not statistically significant but p = 0.003 as significant.
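One way to check for heteroskedasticity and to get robust t-tests in R is sketched below, using the lmtest and sandwich packages and placeholder names (not something the poster mentioned, just an illustration):

library(lmtest)    # bptest(), coeftest()
library(sandwich)  # vcovHC()

fit <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = dat)  # hypothetical names

bptest(fit)                                       # Breusch-Pagan test for heteroskedasticity
coeftest(fit, vcov = vcovHC(fit, type = "HC3"))   # t-tests with robust (HC3) standard errors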
 
#3
I forgot to mention that my previous remark about interpreting p-values conservatively applies only to small and medium sample sizes. If the sample size is large, then by the Central Limit Theorem the assumptions of the F-tests and t-tests hold approximately and you can use the p-values as they are.
 

hlsmith

Omega Contributor
#5
A follow-up: does your subject-matter knowledge support a non-linear relationship? Could transforming the dependent variable help as well?
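For example, a transformation of the dependent variable could be tried along these lines (placeholder names; a log transform assumes a strictly positive outcome):

# Hypothetical names; log requires y > 0
fit_log <- lm(log(y) ~ x1 + x2 + x3 + x4 + x5 + x6, data = dat)

plot(fitted(fit_log), resid(fit_log),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)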
 

MSem

New Member
#6
Thank you for your response. The problem is that the theoretical justification is contradictory, but it leans towards a linear relationship, so frankly I am not sure. On the one hand, I would like to keep it linear, since the model would be less complicated and the theoretical arguments in favour of linearity are stronger. On the other hand, the ANOVA output suggests otherwise. Can I simply say: although the residual plots show some non-linearity and the ANOVA shows that the model with the polynomial fits better, we will not include it because there is no theoretical basis for it? I do not know whether that is usually acceptable. (Just to clarify, I am a student writing coursework on quantitative analysis, nothing advanced.)
 

hlsmith

Omega Contributor
#7
Given that we have not actually seen your results, I am basing the following only on what you have told us.

You can probably use the model with linear terms, but I would also present the residual plots for both models as supplementary items. That way the reader/audience can judge for themselves, and the plots let readers see where the linear model, if you use it, may over- or under-predict the dependent variable.

P.S. Feel free to upload those images so we can provide feedback. P.P.S. What sample size are we talking about here? I could see sampling variability coming into play for small samples, whereas if the sample is very large you would expect the truth to show itself more clearly.
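For example, the two residual plots could be shown side by side along these lines (placeholder names, reusing the hypothetical fit_lin/fit_poly objects sketched earlier in the thread):

# Hypothetical model fits
fit_lin  <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = dat)
fit_poly <- update(fit_lin, . ~ . - x1 + poly(x1, 2))

par(mfrow = c(1, 2))
plot(fitted(fit_lin), resid(fit_lin),
     xlab = "Fitted values", ylab = "Residuals", main = "Linear term only")
abline(h = 0, lty = 2)
plot(fitted(fit_poly), resid(fit_poly),
     xlab = "Fitted values", ylab = "Residuals", main = "With quadratic term")
abline(h = 0, lty = 2)
par(mfrow = c(1, 1))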
 
#8
This is the question I was looking to ask the OP. With smaller sample sizes it can appear that there are "patterns" in the data, for example in a residual plot, when really the data are sparse: there are holes in the plot, and a larger sample size would make the apparent "pattern" disappear.
 

MSem

New Member
#9
My sample size is large (more than 1,000 observations). Attached are plots of my multiple regression model residuals against the independent continuous variables. I clearly see a pattern in the first one, but the rest seem less obvious...
[Attached: Rplot.png, Rplot01.png, Rplot02.png]
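For reference, residual-versus-predictor plots like these could be produced along the following lines (placeholder names; it assumes no missing values so the residuals line up with the data; a lowess smoother is added only to make any trend easier to see):

# Hypothetical names; dat assumed to have no missing values
fit_lin   <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = dat)
cont_vars <- c("x1", "x2", "x3")   # hypothetical names of the continuous predictors

par(mfrow = c(1, length(cont_vars)))
for (v in cont_vars) {
  plot(dat[[v]], resid(fit_lin), xlab = v, ylab = "Residuals")
  abline(h = 0, lty = 2)
  lines(lowess(dat[[v]], resid(fit_lin)), col = "red")  # smoother to show any trend
}
par(mfrow = c(1, 1))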
 
#12
Do you mean logging the dependent variable? No, I haven't.
Do you have any categorical variables that might be important to add to the model that generated these residuals? If so, try generating these plots separately for each categorical variable, using that variable as the plotting symbol. In other words, if you have an independent variable with two levels, A and B, create these plots but use a different symbol for the A cases and the B cases. Maybe group A has the upward-sloping part and group B the downward-sloping part of the residual plots.

Also, can you describe your variables and how they are measured in more detail?

If plotting with different symbols for the categorical variables doesn't reveal much, and theory says the relationship is linear, then you may want to "ignore" the first two plots. You can also try adding a curvature term, refitting the model, saving the new residuals and plotting them to see how much the plot changes (a rough sketch is given at the end of this post). Then look at your main quantities of interest in the model and see whether the conclusions change much. If not, it is probably fine to pick the linear model; the assumption can then be considered "reasonably" satisfied. If there is a big change, then you may want to consider something else.

The third plot may also be explained by two different groups, but that would still leave you with unequal error variance. Again, try to fix the problem, then re-examine the model diagnostics and the conclusions (from whatever tests you ran) after the fix. If the changes are immaterial, the strict assumption may be violated, but it may be satisfied in a looser sense.

You can also run this as a rank regression and see whether your qualitative conclusions change. Again, if the conclusions are largely the same, that suggests the violations are not so bad in practical terms.
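A rough sketch of these suggestions in R, with placeholder names (dat for the data frame, y and x1..x6 for the variables, group for a hypothetical two-level factor with levels "A" and "B"; it assumes no missing values so the residuals line up with the data):

# Hypothetical names; dat assumed complete (no NAs)
fit_lin <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = dat)

# Residuals vs x1, with a different symbol and colour per group
plot(dat$x1, resid(fit_lin),
     pch = ifelse(dat$group == "A", 1, 17),
     col = ifelse(dat$group == "A", "black", "red"),
     xlab = "x1", ylab = "Residuals")
legend("topright", legend = c("A", "B"), pch = c(1, 17), col = c("black", "red"))

# Refit with a curvature term and compare the residual plot
fit_curv <- lm(y ~ poly(x1, 2) + x2 + x3 + x4 + x5 + x6, data = dat)
plot(dat$x1, resid(fit_curv), xlab = "x1", ylab = "Residuals (quadratic in x1)")
abline(h = 0, lty = 2)

# Rank regression: rank-transform the response and see whether conclusions change
fit_rank <- lm(rank(y) ~ x1 + x2 + x3 + x4 + x5 + x6, data = dat)
summary(fit_rank)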
 
#13
Thank you very much! I'll try including an interaction and see how it looks graphically. Indeed, that could be the reason.
 
#14
It could be the case! I also had an example in a course I took that covered roughly this point: there may be a slight non-linearity in the data set (an artifact, chance), or the real function may be slightly curvilinear but not to a meaningful degree, so a linear approximation is still appropriate.