Selecting Variables for Multiple Regression (Univariate Significance Levels)

Hi All! I really hope someone can answer my question.

I am building multiple linear regressions and I am testing salient variables one by one at the univariate level to determine whether I should include them.

However, what are the currently accepted p-value limits for including or excluding variables at the univariate level? I have gotten some very conflicting advice, which is why I turned to this site. At first I was using p < 0.05, but then I was told to use 0.20, and later I read recommendations ranging from 0.25 to 0.50. So I am very confused. If I continue to screen variables at the 0.05 level, is there any literature that can justify this?




No cake for spunky
Stepwise selection has significant problems despite its widespread use (likely because those who use it are unaware of the issues).

My advice is not to use bivariate relationships, period, to decide whether to enter a variable, because bivariate relationships commonly have little to do with the marginal relationships (the unique variance explained by a given predictor) in a multiple regression model. The best strategy is theory, or what makes sense substantively. An alternative is to run various models and choose the one with the lowest BIC (or, alternatively, enter all the variables and remove the ones that are not significant, although some disagree with that strategy as well).
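To make the lowest-BIC idea concrete, here is a minimal sketch in plain Python. The data are invented purely for illustration (x1 is a noisy copy of x2, and y depends only on x2), the helper names are hypothetical, and the BIC is the usual Gaussian form n*ln(RSS/n) + k*ln(n), which drops an additive constant. In practice you would let your statistics package fit the models and report BIC.

```python
import math
from itertools import combinations

def ols_rss(columns, y):
    """Residual sum of squares from OLS of y on an intercept plus the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Build the normal equations (X'X) b = X'y as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        for r in range(p):
            if r != k:
                f = A[r][k] / A[k][k]
                A[r] = [a - f * b for a, b in zip(A[r], A[k])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in range(n))

def bic(columns, y):
    """Gaussian BIC up to a constant; k counts the intercept and the slopes."""
    n = len(y)
    k = len(columns) + 1
    return n * math.log(ols_rss(columns, y) / n) + k * math.log(n)

# Invented toy data: y is driven by x2 only; x1 just tracks x2.
predictors = {
    "x1": [2, 1, 4, 3, 6, 5, 8, 7],
    "x2": [1, 2, 3, 4, 5, 6, 7, 8],
}
y = [3, 3, 5, 9, 11, 11, 13, 17]

# Score every non-empty subset of the candidate predictors.
scores = {}
for size in range(1, len(predictors) + 1):
    for names in combinations(predictors, size):
        scores[names] = bic([predictors[v] for v in names], y)

best = min(scores, key=scores.get)
for names, value in sorted(scores.items(), key=lambda kv: kv[1]):
    print(names, round(value, 2))
print("lowest BIC:", best)
```

With only a handful of theory-backed candidates, scoring every subset like this is feasible; with 20 candidates there are over a million subsets, so you would compare a short list of substantively motivated models instead.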

Another problem with using univariate criteria is that it ignores interaction effects.
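A deliberately extreme sketch of that point, using invented data where the outcome is purely the product of two predictors. Each predictor on its own has zero correlation with the outcome, so any univariate screen would drop both, yet their interaction explains everything. The `pearson` helper is just a hand-rolled correlation for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (z - mb) for x, z in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((z - mb) ** 2 for z in b)
    return sab / math.sqrt(saa * sbb)

# Invented toy data: a balanced 2x2 design repeated five times.
x1 = [-1, -1, 1, 1] * 5
x2 = [-1, 1, -1, 1] * 5
# The outcome depends only on the interaction x1 * x2.
y = [a * b for a, b in zip(x1, x2)]

r_x1 = pearson(x1, y)                   # univariate screen for x1
r_x2 = pearson(x2, y)                   # univariate screen for x2
interaction = [a * b for a, b in zip(x1, x2)]
r_int = pearson(interaction, y)         # the term the screen never looks at

print("r(x1, y)    =", r_x1)
print("r(x2, y)    =", r_x2)
print("r(x1*x2, y) =", r_int)
```

Real data will rarely be this clean, but the direction of the problem is the same: a one-at-a-time screen cannot see an effect that only shows up in combination.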


Less is more. Stay pure. Stay poor.
The cut-off you use depends on the situation. Often the inclusion level for candidate (entry) variables is set higher (e.g., 0.20), while the significance level in the final model drops back down (e.g., 0.05). This gives the covariates a chance to commingle for a moment before any are dropped.

But overall this is typically situation- and discipline-based.


No cake for spunky
Except, IMHO, a lot of disciplines know so little about statistics that their conventions are doubtful. I came from one of those disciplines myself (public management). :p It took me a long time to realize that just because something appeared in a journal did not mean the writer actually understood the method. I know now that I did tons of stuff that was just flat-out bad practice.

Would it help if I said I was testing for the main effect of each variable in the regression model and not the interactions between variables? In my final multiple regression models I use p < .05 for significance, but I just want to make sure you know I am asking about the step before that, when I am selecting the variables. I have a huge variable set, so I need to narrow it down. So please make your case for .05 or .2 at the univariate level. References would be appreciated!!! THANKS!!


No cake for spunky
You really should not be testing for main effects if there is a significant interaction effect, because the meaning of the main effects is then doubtful. Automatically throwing out interactions is not a good idea, even if it is admittedly commonly done.

But to repeat what I said before, I don't think it is valid to use the bivariate numbers to tell you which variables to include in the model, because the effects in a multiple regression model are commonly very different. Variables that are strong in a bivariate comparison may be very weak in the regression. The best way to select is theory or what the existing literature says on the topic. Or, in the worst case, what makes sense to you to include.
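A small sketch of how a strong bivariate result can vanish in the regression. The numbers are invented for illustration: x1 is a noisy copy of x2, and y is built from x2 alone. For two predictors, the standardized regression coefficients can be written directly from the three pairwise correlations, which is what the code uses; the `pearson` helper is hand-rolled.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (z - mb) for x, z in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((z - mb) ** 2 for z in b)
    return sab / math.sqrt(saa * sbb)

# Invented toy data: y depends on x2 only; x1 merely tracks x2.
x1 = [2, 1, 4, 3, 6, 5, 8, 7]
x2 = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3, 3, 5, 9, 11, 11, 13, 17]

r_y1 = pearson(y, x1)   # bivariate relationship of x1 with y
r_y2 = pearson(y, x2)   # bivariate relationship of x2 with y
r_12 = pearson(x1, x2)  # how strongly the two predictors overlap

# Standardized coefficients for the two-predictor regression of y on x1 and x2.
beta1 = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2)
beta2 = (r_y2 - r_y1 * r_12) / (1 - r_12 ** 2)

print("bivariate r(y, x1):", round(r_y1, 3))
print("bivariate r(y, x2):", round(r_y2, 3))
print("standardized beta for x1 in the joint model:", round(beta1, 3))
print("standardized beta for x2 in the joint model:", round(beta2, 3))
```

Here x1 correlates with y at roughly 0.88 on its own, so it would sail through any univariate cut-off, 0.05 or 0.20, yet its coefficient is essentially zero once x2 is in the model.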
Noetsi - I wish I could take your advice because it would be much easier; however, I have to make a choice and justify it for a research project. There are only two choices, .2 or .05 at the bivariate level, because the majority of the variables (20 variables) could all be included based on the literature. However, I have read that only 5-6 variables should be used in the model. The reason I am testing main effects is that I am looking at the relationship of one variable in the model but had to control for others, for the sake of doing it the way my teacher wanted it. Yeah... there's that!