- R-squared values are biased too high.

- p-values are too low due to multiple comparisons.

- Parameter estimates are biased high.

Could someone please explain briefly how stepwise regression leads to these three problems?

- Thread starter: Phishypenguin


In general, if you are not using knowledge of the context to guide variable selection, an automated procedure like stepwise can produce all three problems above.

The issues behind your first and last critiques seem similar.

The last variable included/first excluded can be particularly wrong. Stepwise gets blasted by statisticians - I once read a chapter entitled "Death to Stepwise: Think for Yourself"

So you would conduct a full regression to get the VIFs, then run stepwise afterwards and drop or address potential collinearity concerns. That seems roundabout. Plus, what about the other assumptions for model fit or appropriateness? Stepwise does not tell you that you have leverage points, etc.; it is automated with only basic criteria to fulfill.

Imagine the following scenario. You have a variable, X1, that (in the population), actually has a moderate effect size. You also have some other variables, X2-X5, which you would consider as part of your model. (But we'll focus on X1).

Now imagine we conducted repeated studies, each time randomly drawing a sample from the population and estimating a regression model. The (estimated) sample coefficient for X1 would vary: Sometimes it will be smaller than the true parameter value, and sometimes larger. Make sense?

Importantly, the cases when the sample coefficient for X1 is smaller will also tend to be the cases when the coefficient is not statistically significant.

If, each time we collected a sample, we used stepwise regression to exclude non-significant predictors, we would tend to retain X1 only in the samples where its estimated coefficient happened to be large enough to reach significance.

Across stepwise models estimated on repeated samples, the average retained coefficient for X1 would therefore exceed the true parameter value: the parameter estimate is biased high, and the same select-for-significance mechanism inflates R-squared and deflates the p-values.

(NB: If using SPSS, this problem occurs regardless of whether you actually click "stepwise", "forward", or "backward" selection in SPSS - all are broadly stepwise methods. Further, it also applies if you do the selection by hand, dropping non-significant predictors one at a time.)

Side note: stepwise typically has inclusion/exclusion criteria to help catch those small-effect variables.
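CB's repeated-sampling argument can be sketched with a small simulation (a hypothetical setup, not from the thread: a single predictor with an assumed true slope of 0.3 and samples of n = 50). Keeping the coefficient only when it reaches significance, as stepwise effectively does, inflates its average:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_beta = 0.3            # moderate true effect of X1 (assumed value)
n, n_sims = 50, 2000

all_estimates = []         # slope estimate from every sample
kept = []                  # estimates that survived the p < .05 filter
for _ in range(n_sims):
    x = rng.normal(size=n)
    y = true_beta * x + rng.normal(size=n)
    fit = stats.linregress(x, y)
    all_estimates.append(fit.slope)
    if fit.pvalue < 0.05:  # mimic stepwise: keep X1 only when significant
        kept.append(fit.slope)

print(f"mean over all samples:  {np.mean(all_estimates):.3f}")  # close to 0.3
print(f"mean over kept samples: {np.mean(kept):.3f}")           # noticeably above 0.3
```

The unfiltered mean recovers the true slope; conditioning on significance discards exactly the samples where the estimate came out low, which is the bias described above.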

CB, I like your post, but I am unsure whether a person wouldn't exclude that variable even if they used a non-automated approach. I agree that, in the arena of publication bias or publishing in general, we are more likely to see the samples with greater effects.

I once built a predictive model for a large Swedish company that wanted to predict the probability that a customer would not place an order within one year. I had a couple of hundred variables to work with and could identify a dozen that would surely have an impact on the DV.

Within many fields, you don't care if you've included the 'correct' IVs; the only thing you care about is whether the model gives accurate predictions. And if you can ensure that your model does that - why not stepwise?

So what I did was use stepwise regression to find the 'best' model (or, in other words, a good model) based on the least out-of-sample validation error. I did this for three different time periods and then averaged the predictions of these three models. By building models on three different time periods, the variables that do not affect the DV are expected to cancel out when the predicted probabilities are averaged.

TL;DR - Stepwise regression can be useful if you don't care whether you accidentally include variables that do not affect the DV. Or in other words: if you only care about the model's predictive capability.
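The averaging step described above can be sketched like this (entirely hypothetical data and features - the original model and variables are not shown in the thread; sklearn's LogisticRegression stands in for each period's stepwise-selected model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical stand-in for one training period: 5 features, binary DV,
# where only the first two features actually drive the outcome
def make_period(n=500):
    X = rng.normal(size=(n, 5))
    p = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))
    y = rng.binomial(1, p)
    return X, y

# One model per time period, each fit on its own sample
models = []
for _ in range(3):
    X, y = make_period()
    models.append(LogisticRegression().fit(X, y))

# Average the predicted probabilities across the three period models
X_new = rng.normal(size=(10, 5))
avg_prob = np.mean([m.predict_proba(X_new)[:, 1] for m in models], axis=0)
print(avg_prob.round(3))
```

Averaging the three probability vectors damps the noise contributed by any spuriously included variable, which is the cancellation the poster relies on.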

> Within many fields, you don't care if you've included the 'correct' IVs; the only thing you care about is whether the model gives accurate predictions. And if you can assure that your model can do that - why not stepwise?

- Stepwise regression is not set up to maximise prediction accuracy - it's based purely on the significance of predictors
- Stepwise regression will give you an overly optimistic estimate of prediction accuracy

Using cross-validation is great, and that helps to deal with point 2. But it doesn't really deal with point 1. If you want to maximise out-of-sample prediction accuracy, stepwise regression isn't really the best tool - that's not what stepwise regression attempts to achieve. Some other options off the top of my head would be AIC, BIC, cross-validation (for selection, not just validation), or lasso.
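As a minimal sketch of the last suggestion (made-up data; sklearn's LassoCV, which chooses the penalty by cross-validated prediction error rather than by per-predictor significance tests):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [1.0, -0.8, 0.5]      # only the first three predictors matter
y = X @ beta + rng.normal(size=n)

# LassoCV tunes the penalty strength by 5-fold cross-validated prediction
# error, shrinking irrelevant coefficients toward (often exactly) zero
model = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)
print("selected predictor indices:", selected)
```

Unlike stepwise's keep-or-drop decisions driven by p-values, the lasso penalty is tuned directly against out-of-sample error, which matches the prediction-focused goal discussed above.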