# multiple regression with *many* variables

#### Levon

##### New Member
Hello

I am not a statistician, but I am interested in knowing how the following problem would be approached by those who are.

What would you do if you had to run a multiple regression with a very large(!) number of independent variables?

Suppose you had 200 variables. The number of possible models would be huge (2^200). Would stepwise regression even be a possibility? Or a smart approach? A "best subset analysis", i.e., an exhaustive search, would be computationally infeasible. Are there any other methods that might be used? At what point would they become infeasible?

I've come across PCA, which I think allows one to reduce the number of variables before(?) the regression, but I am interested in the case where you might be stuck with 200 variables.

I suppose best subset analysis becomes difficult beyond perhaps 15-20 variables?

Stepwise regression at ? variables?

Others at ?
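
To put rough numbers on the question, here is a small sketch that counts how many model fits each strategy needs. The throughput figure `FITS_PER_SECOND` is an assumption chosen only for illustration, and the forward-stepwise count is the upper bound reached when selection runs all the way to the full model.

```python
def best_subset_fits(p: int) -> int:
    """Candidate models in an exhaustive search: every subset of p predictors."""
    return 2 ** p

def forward_stepwise_fits(p: int) -> int:
    """Upper bound for forward selection run to the full model: p + (p-1) + ... + 1."""
    return p * (p + 1) // 2

FITS_PER_SECOND = 1_000_000          # assumed throughput, purely illustrative
SECONDS_PER_YEAR = 3600 * 24 * 365

for p in (15, 20, 40, 200):
    subsets = best_subset_fits(p)
    years = subsets / FITS_PER_SECOND / SECONDS_PER_YEAR
    print(f"p={p:>3}: best subset {subsets:.3e} fits (~{years:.2e} years), "
          f"forward stepwise <= {forward_stepwise_fits(p)} fits")
```

Under that assumed speed, exhaustive search is trivial at 20 variables and hopeless at 200, while the stepwise count grows only quadratically. Cost is therefore not what limits stepwise regression.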

Related to this problem, I also assume that standard/commercial software (Minitab, SAS, SPSS, R, etc.) has limitations on the number of variables that can be used for generalized least squares multiple regression. I was unable to find this information. Searching the web for "best subset analysis" and the software names yielded nothing, so if anyone has some knowledge about this that they could share, I would be grateful.

Any pointers, links, references would be welcome too. I hope the question as I asked makes sense/is clear, or at least the intention behind it.

Thank you.

#### ichbin

##### New Member
Why are you insistent on the stepwise procedure? It's heuristic and there is little statistical basis for it. Assuming you have enough data points, just do a single regression on all 200 variables. Any decent statistical software should be able to handle that.

#### Mean Joe

##### TS Contributor
Agreed, one thing to consider is how many data points you have. If you only have 50 data points, then your model should be closer to just 2 variables. Some people have a rule of thumb for the number of data points per variable.

But I've never run a model with 200 variables. The first thing I would ask is whether other researchers in the field have previously identified significant variables. Focus on those, and maybe a "couple" more. A model with 200 variables is probably too large for the human mind to make sense of.

You could run 200 regressions, each with one variable. Focus on the most significant ones, while trying to understand (here, more research on your part is involved) why these variables are significant to the outcome. Then you'd be putting together in your mind, the underlying things that your statistical analysis is/may be revealing.
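
A minimal sketch of that one-variable-at-a-time screening, assuming synthetic data in place of the real experiment. The sizes (220 rows, 200 columns) are borrowed from later in the thread, and the two "true" columns are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 220, 200                      # sizes from the thread; the data are synthetic
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 3] - 1.5 * X[:, 17] + rng.normal(size=n)   # only columns 3 and 17 matter

def univariate_t_stats(X, y):
    """t statistic of the slope in each one-predictor regression y ~ a + b * x_j."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson correlation of every column with y
    r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # For simple regression the slope's t statistic is a function of r alone
    return r * np.sqrt((len(y) - 2) / (1.0 - r ** 2))

t = univariate_t_stats(X, y)
ranked = np.argsort(-np.abs(t))      # strongest marginal association first
print("top 5 columns:", ranked[:5])
```

One caveat with this kind of screening: a variable that matters only in combination with others can look unimportant on its own, so the ranking is a starting point rather than a verdict.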

Although I understand wanting to take everything into account, coming out of the blue with a 200 variable model will be tough to get others to accept. Are you making the model for your own information/satisfaction only?

#### Levon

##### New Member
> Why are you insistent on the stepwise procedure? It's heuristic and there is little statistical basis for it. Assuming you have enough data points, just do a single regression on all 200 variables. Any decent statistical software should be able to handle that.

Hi there,

If it appears that I am insistent on stepwise regression, it is only due to my lack of knowledge about statistics :-|

Confronted with a bunch of variables that I am trying to use to explain the behaviour of another variable, I only really know about stepwise regression (and as I mention, I have heard about PCA).

Since it is a heuristic, is there a better, more systematic way to approach this? I thought exhaustive search would find the best subset, but of course that's not feasible with this many variables.

How do I eliminate the statistically irrelevant factors/interactions?

Thanks

#### Levon

##### New Member
Hello,

Thanks for taking the time to post.

> But I've never run a model with 200 variables. The first thing I would ask is whether other researchers in the field have previously identified significant variables. Focus on those, and maybe a "couple" more. A model with 200 variables is probably too large for the human mind to make sense of.

Agreed, so the goal is to come up with the smallest subset consisting of statistically significant variables (a parsimonious set?). I was under the impression that a form of stepwise regression, or best subset analysis, i.e., exhaustive search, was the way to go about this if the set of variables was small. What if you are stuck with many?

> Although I understand wanting to take everything into account, coming out of the blue with a 200 variable model will be tough to get others to accept. Are you making the model for your own information/satisfaction only?

This is from an experiment some friends of mine ran with 220 participants that yielded data for 21 variables. Expanding this data with simple interactions and squared terms resulted in a set of roughly 200 variables for analysis. As there is no preconceived notion regarding the relationship between the observed behavior and the very large set of independent variables, the goal is to find a small subset of statistically significant coefficients and to eliminate redundant variables.

So once all experimental variables, their squares and various interactions between them are taken into account the maximum model has about 200 coefficients. This is the motivation behind my question.
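
As a quick check on the size of such an expanded model, the sketch below counts main effects, squares, and two-way interactions for 21 variables. The names `x1` to `x21` are placeholders. The full expansion comes to 252 terms, so the "roughly 200" figure presumably reflects some terms being left out (that is an assumption, since the thread does not say which).

```python
from itertools import combinations

def expanded_terms(names):
    """Main effects, squared terms, and all two-way interactions for the given variables."""
    main = list(names)
    squares = [f"{v}^2" for v in names]
    interactions = [f"{a}*{b}" for a, b in combinations(names, 2)]
    return main + squares + interactions

names = [f"x{i}" for i in range(1, 22)]   # 21 placeholder variable names
terms = expanded_terms(names)

print(len(terms))            # 21 main + 21 squares + 210 interactions
print(220 / len(terms))      # observations available per candidate term
```

With 220 participants this leaves about one observation per candidate coefficient, which is why the replies keep coming back to the number of data points.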

If the stepwise regression approach is a heuristic, is it a deterministic one? I.e., if I were to run the procedure more than once, would I always get the same result, or is this dependent on the particular software/implementation used?

Also, would it be wrong to view this as an optimization process with a search space in 200 dimensions? And if the stepwise approach is considered a heuristic, would it be possible for the search to give a local rather than global optimum?
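
One way to make this mental model concrete is to code a stripped-down forward selection and compare it with an exhaustive search on a problem small enough to enumerate. This is only a sketch under simplifying assumptions: it picks a fixed number of variables `k` by residual sum of squares, whereas real packages use entry and exit criteria such as F tests or AIC, and the data are synthetic.

```python
from itertools import combinations
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of the least-squares fit of y on an intercept plus X[:, cols]."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_select(X, y, k):
    """Greedy forward selection: k times, add the single column that lowers RSS the most."""
    chosen = []
    for _ in range(k):
        remaining = [j for j in range(X.shape[1]) if j not in chosen]
        best = min(remaining, key=lambda j: rss(X, y, chosen + [j]))
        chosen.append(best)
    return sorted(chosen)

def best_subset(X, y, k):
    """Exhaustive search over every subset of size k (only feasible for small p)."""
    return sorted(min(combinations(range(X.shape[1]), k), key=lambda c: rss(X, y, list(c))))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
y = X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(size=100)

greedy = forward_select(X, y, 3)
exhaustive = best_subset(X, y, 3)
print(greedy, exhaustive)
print(rss(X, y, greedy) >= rss(X, y, exhaustive))   # greedy can never beat the exhaustive optimum
```

Nothing in `forward_select` is random, so the same data and the same rules always give the same subset. Each step only looks one variable ahead, though, so it can settle on a subset that exhaustive search would beat, which matches the local-versus-global picture in the question.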

Sorry, my knowledge of stats is really rather poor, but I am seeking a bit more understanding.

Thank you for your thoughts on this.

#### ichbin

##### New Member
The systematic way to approach the problem of determining which variables are relevant is to do a single multiple regression on all variables, and then look at the uncertainties / error bars / confidence intervals on the coefficient of each variable. If the confidence interval contains zero, the variable is not relevant at your required confidence.

For example, suppose I have three independent variables x1, x2, x3 and I am trying to use them to predict y. I use a model function y = a + b1 x1 + b2 x2 + b3 x3 and get best-fit values and uncertainties on a, b1, b2, and b3. If b1 = 1.0 +/- 2.0, b2 = 2.0 +/- 1.0, and b3 = 3.0 +/- 2.0, then b2 and b3 are relevant and b1 is not, at least to the sensitivity allowed by my data. At this point, you are welcome to re-run the regression without the x1 variable, but your values of b2 and b3 shouldn't change significantly. This approach extends straightforwardly if you have 200 variables instead of three.
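
The three-variable example can be sketched in a few lines. The data below are synthetic, with x1 deliberately given no effect, and the interval uses the normal quantile 1.96 as an approximation to the exact t quantile.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
X = rng.normal(size=(n, 3))                        # synthetic x1, x2, x3
y = 0.5 + 2.0 * X[:, 1] + 3.0 * X[:, 2] + rng.normal(size=n)   # x1 has no real effect

A = np.column_stack([np.ones(n), X])               # columns: intercept a, then b1, b2, b3
beta, *_ = np.linalg.lstsq(A, y, rcond=None)       # best-fit coefficients

resid = y - A @ beta
dof = n - A.shape[1]                               # residual degrees of freedom
sigma2 = resid @ resid / dof                       # estimate of the error variance
cov = sigma2 * np.linalg.inv(A.T @ A)              # covariance matrix of the estimates
se = np.sqrt(np.diag(cov))                         # standard error of each coefficient

lower, upper = beta - 1.96 * se, beta + 1.96 * se  # approximate 95% intervals
for name, b, lo, hi in zip(["a", "b1", "b2", "b3"], beta, lower, upper):
    verdict = "excludes 0" if lo > 0 or hi < 0 else "contains 0"
    print(f"{name}: {b:+.3f}  [{lo:+.3f}, {hi:+.3f}]  {verdict}")
```

The same code runs unchanged with 200 columns, provided there are comfortably more rows than columns and the columns are not close to linearly dependent. Otherwise `A.T @ A` becomes ill-conditioned and the intervals widen sharply.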

#### terzi

##### TS Contributor
Hi Levon,

First, I would never run a model with 200 predictors. Interpreting it would be awful, the adjusted R-squared may be terrible, and I imagine most assumptions will not be met.

One problem once you learn statistics is that you always want to jump straight into the main analysis. That's usually not the best way. Before even thinking about regression models, try some exploratory studies. In particular, scatterplots and correlations will shed light on which variables are better at explaining your response. You can easily drop the variables you find useless; this way, you can end up with only about 30 or 40. Then you can try ichbin's suggestion: testing individual variables in the model and tuning it yourself. I assume it will be easier to do that 40 times than 200 times. Just be careful with multicollinearity, since it easily becomes a problem when you have many predictors.

And just for the record, I also think stepwise regression is a bad joke.
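
For the multicollinearity check, one common diagnostic is the variance inflation factor (VIF). The sketch below is an illustration on synthetic data, with one column constructed to nearly duplicate another. The `vif` helper is written out by hand here and is not taken from any particular package.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column: 1 / (1 - R^2) from regressing it on the others."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        centered = target - target.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(7)
X = rng.normal(size=(220, 4))
# Column 4 is column 0 plus a little noise, so the two are almost collinear
X = np.column_stack([X, X[:, 0] + 0.1 * rng.normal(size=220)])

print(np.round(vif(X), 1))
```

Values near 1 mean a predictor is nearly unrelated to the rest, while large values flag predictors whose coefficients will be poorly determined. Squared terms and interactions built from the same 21 variables are natural candidates for this kind of overlap.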

#### Levon

##### New Member
Hi!

> Hi Levon,
>
> First, I would never run a model with 200 predictors. Interpreting it would be awful, the adjusted R-squared may be terrible, and I imagine most assumptions will not be met.

Agreed, even with my limited stats knowledge (and it *is* very limited) I realize this is a bad idea. However, I had no control over the setup (see my message two above re the background of the experimental data and variables). I am trying to explore options once this sort of situation is presented.

> One problem once you learn statistics is that you always want to jump straight into the main analysis. That's usually not the best way. Before even thinking about regression models, try some exploratory studies. .....
>
> And just for the record, I also think stepwise regression is a bad joke

That is good advice re exploratory studies - thanks.

I am curious to know why you (and it seems others) seem to think that stepwise regression is such a bad idea. I recall from my earlier studies that this is always presented as the tool to use if you are dealing with multiple regression, and especially if you have no clear idea about the relationship between variables.

Also, I would be very interested in your opinion (or others') regarding what I wrote two messages above:

> If the stepwise regression approach is a heuristic, is it a deterministic one? I.e., if I were to run the procedure more than once, would I always get the same result, or is this dependent on the particular software/implementation used? [i.e., on how the particular stepwise code is implemented?]
>
> Also, would it be wrong to view this as an optimization process with a search space in 200 dimensions? And if the stepwise approach is considered a heuristic, would it be possible for the search to give a local rather than global optimum? [i.e., the search would get trapped in a local optimum, rather than global optimum.]

Am I close, or is my mental model of this whole process off?

Thanks a lot for any insights and corrections to any misconceptions I may have.

#### terzi

##### TS Contributor
Hi again,

There are many ways of using stepwise regression. Some packages use different algorithms, and in certain procedures the order in which variables are selected may change the final results. The main concern regarding stepwise regression is that it tests too many hypotheses, so the error rate will be inflated:

http://www.informaworld.com/smpp/content~db=all~content=a780142433~frm=abslink

Also, you may end up including variables that violate the assumptions of the general linear model, eliminating variables that have a non-linear relationship with the response, or even ignoring interactions. Personally, I think most people use it as an easy exit to avoid the complex procedure of adjusting a model.

http://en.wikipedia.org/wiki/Stepwise_regression#Criticism
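
The error-inflation point can be illustrated with a small simulation. This is a sketch on purely synthetic data: the response is random noise, so every "significant" predictor it finds is a false positive. The sizes 220 and 200 are borrowed from the thread, and 1.96 is used as an approximate two-sided 5% cutoff.

```python
import numpy as np

rng = np.random.default_rng(123)
n, p = 220, 200
X = rng.normal(size=(n, p))
y = rng.normal(size=n)               # y is pure noise: no predictor is truly related to it

# t statistic of the slope in each single-predictor regression, computed from the correlation r
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
t = r * np.sqrt((n - 2) / (1.0 - r ** 2))

false_hits = int((np.abs(t) > 1.96).sum())   # tests that cross the nominal 5% threshold
print(f"{false_hits} of {p} noise variables look significant; about {0.05 * p:.0f} expected by chance")
```

A selection procedure that keeps whatever crosses the threshold will happily build a model out of those chance hits, and the p-values it reports afterwards do not account for the search that produced them.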

#### Levon

##### New Member
Thanks terzi,

The links help, though I am still not quite sure I have the right model in my mind re search and optima. I guess I will do a bit more research/reading and see what I find.