How serious are violations of regression assumptions?

noetsi

Fortran must die
#1
We are considering getting a tool that runs linear regression automatically: you put in data, and it fits the model and generates the report. The problem is that the tool does not check any regression assumptions (which is not surprising, given that it's not a Dason-level AI). :p

I raised concerns about that, but I am not sure myself. I know the theory behind violations of assumptions such as the Gauss-Markov assumptions, but how serious can they be in practice (especially if you have hundreds of cases)? And do any of the violations other than independence (which can't easily be tested for anyhow) actually bias the results? The only violations I know of that do that are some forms of time series problems and omitted variable bias.
 

hlsmith

Omega Contributor
#2
You input data, so you may catch the form, but a big question would be outliers and heteroskedasticity. Does it use cross-validation or a holdout sample during training? And what kind of results does it output? Perhaps it will kick out more if desired.
 

Dason

Ambassador to the humans
#6
I always just output the mean. That's what regression is right? I mean I'm always hearing about regression to the mean.
 
#7
Great example of how "big data" and "analytics" are watered-down statistics... I think you can see how violated assumptions screw with estimates and conclusions once you've worked on something that changed dramatically when the assumption violations were remedied or a more appropriate method was used.
 

noetsi

Fortran must die
#8
You input data, so you may catch the form, but a big question would be outliers and heteroskedasticity. Does it use cross-validation or a holdout sample during training? And what kind of results does it output? Perhaps it will kick out more if desired.
I think the answer (we have not gotten a response on this yet) is that it makes no tests of any type. It gets data, runs linear regression, and spits out results. There are no tests of anything.

I am not sure how serious a threat this is to the conclusions in practice, as opposed to in theory. Many of the assumptions only affect the p-values and confidence intervals, not the estimated effect size. And there is disagreement about whether a large sample size changes this: if you have a large sample, does it matter if there are serious violations of the assumptions?
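
A minimal simulation sketch of that distinction, using only the Python standard library. Everything in it is an assumption made for illustration: the true slope of 2, the sample size of 200, and an error standard deviation that grows with x squared (strong heteroskedasticity). It refits a simple regression 500 times and compares the average slope, the actual spread of the slope estimates, and the average standard error that a naive tool would report.

```python
import random
import statistics

def fit_line(x, y):
    """Return (slope, classical standard error of the slope) for y ~ a + b*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    # Classical formula: assumes one constant error variance for all cases.
    return slope, (rss / (n - 2) / sxx) ** 0.5

random.seed(1)
TRUE_SLOPE, N, REPS = 2.0, 200, 500  # hypothetical values
slopes, classical_ses = [], []
for _ in range(REPS):
    x = [random.uniform(0, 10) for _ in range(N)]
    # Assumed violation: the error SD grows with x squared.
    y = [1.0 + TRUE_SLOPE * xi + random.gauss(0, 0.2 * xi ** 2) for xi in x]
    b, se = fit_line(x, y)
    slopes.append(b)
    classical_ses.append(se)

mean_slope = statistics.mean(slopes)
empirical_sd = statistics.stdev(slopes)
mean_classical_se = statistics.mean(classical_ses)
print(f"mean slope {mean_slope:.3f} (true {TRUE_SLOPE})")
print(f"actual SD of slope {empirical_sd:.3f} vs mean reported SE {mean_classical_se:.3f}")
```

Under this setup the average slope should sit close to the true value, while the reported standard error understates how much the slope really varies from sample to sample. A larger sample shrinks both numbers but does not by itself remove the gap between them, so the p-values and intervals stay too optimistic.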

It is a good point that the results might be influenced by heavy-tailed data. I am not sure how bad that would be with hundreds of data points.
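
One way to picture the worry, as a rough sketch on made-up data rather than anything from the actual tool: 300 points that lie exactly on the line y = 1 + 2x, plus a single extreme point placed far from the rest of the x values. The location of that one point, (10, 0), is a hypothetical choice made to give it high leverage.

```python
def slope(x, y):
    """Least-squares slope for y ~ a + b*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / sxx

# 300 made-up points lying exactly on y = 1 + 2x, with x between 0 and 2.99.
x = [i / 100 for i in range(300)]
y = [1 + 2 * xi for xi in x]

clean = slope(x, y)
# Add one hypothetical extreme case far outside the range of the other x values.
contaminated = slope(x + [10.0], y + [0.0])

print(f"slope without the extreme point: {clean:.3f}")
print(f"slope with one extreme point:    {contaminated:.3f}")
```

Here one case out of 301 pulls the slope from 2 down to roughly 1.4. The same bad y value placed near the middle of the x range would move the slope far less, which is why leverage matters more than the raw number of cases.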
 

hlsmith

Omega Contributor
#9
The issue beyond the assumptions of parametric models is that the underlying data-generating process is never really known in these settings (real-life data), and all models are wrong unless you generated the data yourself. What if the process is best described by a generalized additive model or a kernel-based approach? Yeah, you may get estimates in the ballpark, but they are probably lousy.
 

noetsi

Fortran must die
#11
No I don't have access to the program. You post information to it and it spits back results.

I confirmed it does not test the regression assumptions.
 

j58

Active Member
#13
What is the point of this program? Any undergraduate who's had a single semester of linear algebra should be able to write a linear regression program with about 10 minutes of effort.
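
For a sense of scale, here is a bare-bones sketch of such a program in plain Python. It solves the normal equations (X'X) b = X'y by Gaussian elimination. The function names and the tiny data set are made up for illustration, and it deliberately does nothing else: no diagnostics, no standard errors, and no protection against collinear predictors (a singular X'X would simply raise a division error).

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols(rows, y):
    """Return [intercept, b1, b2, ...] for predictor rows and response y."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    xtx = [[sum(xi[p] * xi[q] for xi in X) for q in range(k)] for p in range(k)]
    xty = [sum(xi[p] * yi for xi, yi in zip(X, y)) for p in range(k)]
    return solve(xtx, xty)

# Made-up data generated exactly from y = 1 + 2*x1 - 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 3)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
coef = ols(rows, y)
print([round(c, 6) for c in coef])
```

Computing the coefficients really is the easy part. Everything the thread is worried about lives outside these few lines.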
 
#14
We are considering getting a tool which runs linear regression without checking the assumptions of the method (the tool is automated, it generates linear regression based on data you put in including generating the report). The problem is that the tool does not check any regression assumptions
As we have said before, if you want to forget the assumptions, then you can also forget about inferring anything from the data.

You input data into the computer and the computer shows you numbers. That is all you can conclude. So, instead of forgetting about the assumptions, forget about that program.
 
#15
One of the big problems that I've seen is the misunderstanding that these things are "just calculations" and boil down to a black-and-white matter (not saying this is you, noetsi, but the people pushing the program). A violation may affect one conclusion in a material way and another in an immaterial way. The only way to know is by checking the assumptions, applying reasonable remedies, and comparing results to see how serious a violation is for the question at hand.
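
A small sketch of what "apply a remedy and compare" can look like for one kind of violation. The data are simulated under an assumed heteroskedastic error structure (error SD growing with x squared), and the remedy shown is a heteroskedasticity-consistent (HC0, "sandwich") standard error for the slope, swapped in as one reasonable choice among several. The function name is made up.

```python
import random

def slope_with_ses(x, y):
    """Return (slope, classical SE, HC0 robust SE) for y ~ a + b*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    dx = [xi - xbar for xi in x]
    sxx = sum(d * d for d in dx)
    b = sum(d * (yi - ybar) for d, yi in zip(dx, y)) / sxx
    a = ybar - b * xbar
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    # Classical SE assumes a single constant error variance.
    classical = (sum(e * e for e in resid) / (n - 2) / sxx) ** 0.5
    # HC0 SE lets each case contribute its own squared residual.
    robust = sum((d * e) ** 2 for d, e in zip(dx, resid)) ** 0.5 / sxx
    return b, classical, robust

random.seed(7)
x = [random.uniform(0, 10) for _ in range(2000)]
# Assumed violation: the error SD grows with x squared.
y = [1.0 + 2.0 * xi + random.gauss(0, 0.2 * xi ** 2) for xi in x]

b, se_classical, se_robust = slope_with_ses(x, y)
for label, se in (("classical", se_classical), ("robust HC0", se_robust)):
    low, high = b - 1.96 * se, b + 1.96 * se
    print(f"{label:>10}: slope {b:.3f}, approx. 95% CI [{low:.3f}, {high:.3f}]")
print(f"robust / classical SE ratio: {se_robust / se_classical:.2f}")
```

The point estimate is the same in both rows; only the stated uncertainty changes. Whether the wider interval matters depends on the question being asked, which is exactly the comparison an automated report skips.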
 

hlsmith

Omega Contributor
#16
I would also wonder how the program would address collinearity, confounding, interactions, mediation, mediated-moderation, outliers, etc.