Why do experts continue to use stepwise regression?


Less is more. Stay pure. Stay poor.
They are all garbage, even LASSO. The first question should be the purpose of the modeling: causal inference or prediction. Two-stage causal inference can include both a prediction model and an outcome model, so it can have issues too. Regardless, content expertise should trump an automated process every time. There have been countless papers in the past 3 years on which variables should be included in a model, and it comes down to content knowledge: sometimes it is appropriate to include a mediator, depending on the estimand of interest, and other times you don't want common effects of the IVs and DVs (colliders) in the model, etc.
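
To make the "common effects of IVs and DVs" point concrete, here is a rough simulation sketch. Everything in it is made up for illustration (the variable names, the sample size, the data-generating process): x has no effect on y at all, and c is caused by both. Adjusting for c still makes x look like it matters.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # arbitrary sample size, chosen only for a stable estimate

# Hypothetical data: x and y are independent, so the true effect of x on y is 0.
x = rng.normal(size=n)
y = rng.normal(size=n)
# c is a common effect (collider) of x and y.
c = x + y + rng.normal(size=n)


def ols(columns, target):
    """Least-squares coefficients with an intercept in position 0."""
    design = np.column_stack([np.ones(len(target))] + list(columns))
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


b_without = ols([x], y)[1]   # slope on x, collider left out
b_with = ols([x, c], y)[1]   # slope on x, collider adjusted for

print(f"x slope without collider: {b_without:.3f}")
print(f"x slope with collider:    {b_with:.3f}")
```

In this setup the slope on x should sit near zero when c is left out and move clearly away from zero once c is added, even though x does nothing to y. An automated selector that only looks at fit has no way to know c does not belong.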

The model should be based on content knowledge and the estimand of interest. A stupid automated model isn't gonna know if you flip the IV and DV. There are some residual checks you can make to get an idea, but if I model rooster crowing as predicting the sun rising, the model is too stupid to know this isn't causal. And if it is too stupid to know this, it isn't gonna have a chance when it comes to instrumental variables, mediators, confounders, colliders, or extraneous variables.
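
A quick sketch of the rooster example, with simulated numbers that are purely hypothetical: the sun drives the crowing, but a simple regression fits equally well in either direction, so fit alone cannot tell you which variable is the cause.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000  # arbitrary sample size

# Hypothetical data-generating process: sunrise causes crowing, not the reverse.
sun = rng.normal(size=n)
rooster = 0.8 * sun + rng.normal(scale=0.5, size=n)


def r_squared(predictor, outcome):
    """In-sample R^2 from a simple linear regression with an intercept."""
    design = np.column_stack([np.ones(len(outcome)), predictor])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    resid = outcome - design @ coef
    return 1.0 - resid.var() / outcome.var()


r2_true = r_squared(sun, rooster)   # causal direction
r2_flip = r_squared(rooster, sun)   # flipped direction

print(f"R^2, sun -> rooster: {r2_true:.4f}")
print(f"R^2, rooster -> sun: {r2_flip:.4f}")
```

With one predictor, R^2 is just the squared correlation, which is symmetric, so the two numbers match up to floating-point error.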

In prediction there is work on finding the Markov blanket around the DV, but this fails at times too. This is why AI is known to have issues, including data leakage.

Most people using stepwise are relics or in fields that haven't caught up. Best subset can have similar issues if a person just dumps all of the variables in without using content knowledge.
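
For anyone who wants to see the "dump everything in" problem, here is a minimal sketch. It is a simplified stand-in for stepwise (greedy forward selection on R^2 gain, not the p-value rules real stepwise routines use), and the data are pure noise by construction. The sizes n, p, and k are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 100, 50, 10  # observations, candidate predictors, predictors to keep

# Hypothetical data: the outcome is unrelated to every candidate predictor.
X = rng.normal(size=(n, p))
y = rng.normal(size=n)


def r_squared(cols, X, y):
    """In-sample R^2 of y regressed on the listed columns plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X[:, cols]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - resid.var() / y.var()


# Greedy forward selection: at each step add the column that raises R^2 most.
selected = []
for _ in range(k):
    remaining = [j for j in range(p) if j not in selected]
    best = max(remaining, key=lambda j: r_squared(selected + [j], X, y))
    selected.append(best)

r2_selected = r_squared(selected, X, y)
print(f"selected columns: {selected}")
print(f"in-sample R^2 on pure noise: {r2_selected:.3f}")
```

The selected model reports a respectable-looking in-sample R^2 even though nothing in X is related to y, which is the kind of result content knowledge would have ruled out before the search started.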
The problem I have with expert knowledge is that people assume stuff is true without any real empirical evidence. I always thought the point of empirical methods was to test assumptions. I agree with the problem of causality.


Less is more. Stay pure. Stay poor.
Well, your empirical knowledge here is that the outcome comes after the exposures; that's the first step. Second step: what is the relationship between all of the interventions? They can't all be independent, can they? So you know which were administered and when, correct?
Not in my organization. It is not like medicine. There is often no obvious moment in time when an intervention occurs, and many interventions occur at the same time, many of which you are not likely to know about. :p

That is the problem with correlational research in the social sciences. Things are grey. And many things occur at the same time.


Less is more. Stay pure. Stay poor.
Yeah, that is why insanely big data is needed to tease out individual effects; otherwise interactions between interventions may be missed or confounded.