Does my predictor in my multiple regression have too many variables?

#1
So I am trying to work out what is the best predictor of a) awareness of environmental issues, b) concern over environmental issues and c) pro-environmental behaviour from a set of sociodemographics (e.g. age bracket, political standing, location etc.) measured using a survey.

I am using a backward stepwise linear regression to do this. I get clear results when I input all the sociodemographics apart from location. There are 7 predictors for model a, 4 predictors for model b and 1 predictor for model c. However, when I include location as a possible predictor, it excludes almost none of the other 12 possible predictors in the output for each of the 3 models.

I am wondering whether this is something to do with the high number of variables within the location category - there were 164 respondents from 82 different locations. However, with every other possible predictor, e.g. political standing, there were roughly 6 variable groups reported by participants in the survey.

If anyone had any indication of what might be going on and any advice on whether to leave location out of the analysis it would be much appreciated! (Also any advice on how I would justify excluding location in my methods would be amazing!)
 

hlsmith

Less is more. Stay pure. Stay poor.
#3
Usually if you have to ask, then likely yes.

Ignoring that you used stepwise, you never tell us the sample size and the distribution of the outcome.
 

Karabiner

TS Contributor
#5
I am wondering whether this is something to do with the high number of variables within the location category - there were 164 respondents from 82 different locations.
Do I understand this correctly, you predict 164 data points using a variable with 82 levels?
That would be a huge disproportion. But maybe I did miss something here.

With kind regards

Karabiner
 
#6
Hi Karabiner,
I have since changed this and grouped them by borough, so there are 37 levels instead of 82 levels. However, this still hasn't made much difference to the regression, so the problem remains.
 

Karabiner

TS Contributor
#7
Unfortunately, it simply does not make sense to predict 164 data points with a 37-level variable.

I do not know how you entered that variable, 36 dummy variables maybe? There are
some rules of thumb which would recommend that you provide between 8 and 20 cases
for each predictor. For 36 dummies, you'd need a sample size like 300 or 500.
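
The arithmetic behind that rule of thumb can be sketched as below. This is only an illustration using the numbers quoted in this thread; it counts the location dummies alone and ignores the other 12 candidate predictors, so the real requirement would be higher. All variable names are made up for the example.

```python
# Rule-of-thumb check: cases per predictor for the numbers quoted in this thread.
n_respondents = 164
n_levels = 37                      # boroughs after regrouping
n_dummies = n_levels - 1           # one level serves as the reference category

# How many respondents are available per location dummy
cases_per_dummy = n_respondents / n_dummies

# Sample sizes implied by 8 to 20 cases per predictor (location dummies only)
n_needed_low = 8 * n_dummies
n_needed_high = 20 * n_dummies

print(f"location dummies: {n_dummies}")
print(f"cases per dummy:  {cases_per_dummy:.1f}")
print(f"suggested n:      {n_needed_low} to {n_needed_high}")
```

With 164 respondents there are fewer than 5 cases per location dummy, well short of the 8 to 20 the rule asks for.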
 

hlsmith

Less is more. Stay pure. Stay poor.
#8
Can you post your model output so we can better understand what your model looks like? I still don't get how the dependent variable is formatted and its distribution!
 

noetsi

No cake for spunky
#9
In theory you should build models on theory. Most of us are not so lucky to have that.

LASSO is better than stepwise (the use of stepwise is wrong, if all too common). If you want to reduce the number of predictors, LASSO or adaptive LASSO is preferable, again assuming you have no theory to guide the selection.
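
For anyone who has not seen LASSO before, here is a minimal sketch of the idea on simulated data, written in plain Python so it runs without extra packages. It is not the poster's data or model: the sample size of 164 is borrowed from the thread, but the six predictors, the true coefficients and the penalty `lam=0.4` are assumptions chosen for illustration. In practice you would more likely use an established implementation (for example scikit-learn's `LassoCV`) and pick the penalty by cross-validation.

```python
import random

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def standardize(column):
    """Center a column and scale it to unit (population) standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / sd for v in column]

def lasso_coordinate_descent(columns, y, lam, n_iter=100):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by coordinate descent.

    `columns` must be standardized predictor columns and `y` must be centered.
    """
    n, p = len(y), len(columns)
    beta = [0.0] * p
    residual = list(y)  # y - X b, with b starting at zero
    for _ in range(n_iter):
        for j in range(p):
            col = columns[j]
            # Covariance of column j with the residual that excludes its own effect
            rho = sum(col[i] * (residual[i] + col[i] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam)  # column variance is 1, so no rescaling
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= col[i] * delta
                beta[j] = new_beta
    return beta

# Simulated example (hypothetical): only the first two of six predictors matter.
random.seed(0)
n = 164
raw = [[random.gauss(0, 1) for _ in range(n)] for _ in range(6)]
y_raw = [2.0 * raw[0][i] - 1.5 * raw[1][i] + random.gauss(0, 1) for i in range(n)]

columns = [standardize(c) for c in raw]
y_mean = sum(y_raw) / n
y = [v - y_mean for v in y_raw]

beta = lasso_coordinate_descent(columns, y, lam=0.4)
for j, b in enumerate(beta):
    print(f"predictor {j}: {b:+.3f}")
```

The point of the penalty is that coefficients of predictors with little signal are typically shrunk to exactly zero, so selection and estimation happen in one step rather than through a sequence of significance tests.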

There are rules of thumb on how much data you need for a specific number of predictors. Various authors use different rules, which depend in part on the method. If you have more predictors than data points, your results will be invalid even if your model runs (which it may or may not, depending on the software).
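
A small simulation can show why a model with nearly as many parameters as data points cannot be trusted even when it runs. The sketch below regresses pure noise on a categorical variable; with an intercept and dummy coding, the fitted value for each respondent is just the mean of their group, so R² can be computed from group means. The even split of 164 respondents over 82 levels is an assumption (the thread does not give the actual counts per location), and the function name is made up for the example.

```python
import random

def r_squared_group_means(groups, y):
    """R^2 of a regression of y on dummy variables for `groups`.

    With an intercept plus one dummy per non-reference level, the fitted
    value for each observation is the mean of its own group.
    """
    n = len(y)
    grand_mean = sum(y) / n
    sums, counts = {}, {}
    for g, v in zip(groups, y):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    ss_total = sum((v - grand_mean) ** 2 for v in y)
    ss_resid = sum((v - sums[g] / counts[g]) ** 2 for g, v in zip(groups, y))
    return 1.0 - ss_resid / ss_total

random.seed(1)
n = 164
y = [random.gauss(0, 1) for _ in range(n)]  # pure noise, unrelated to any grouping

groups_6 = [i % 6 for i in range(n)]      # a predictor with 6 levels
groups_82 = [i % 82 for i in range(n)]    # 82 levels, assumed 2 respondents each
groups_164 = list(range(n))               # one level per respondent (saturated)

r2_6 = r_squared_group_means(groups_6, y)
r2_82 = r_squared_group_means(groups_82, y)
r2_164 = r_squared_group_means(groups_164, y)

print(f"R^2 with 6 levels:   {r2_6:.3f}")
print(f"R^2 with 82 levels:  {r2_82:.3f}")
print(f"R^2 with 164 levels: {r2_164:.3f}")
```

For an outcome that is pure noise, the expected R² with k levels and n observations is (k - 1)/(n - 1), so 82 levels on 164 respondents would "explain" roughly half the variance by chance alone, and one level per respondent fits perfectly.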

One thing to consider: you want your model to tell a story. What story will it tell if you have a hundred variables? (Well, it will tell that you don't have a theory of what drives the dependent variable.) :p

There is a tension between having too many predictors to be useful (lack of parsimony) and omitted variable bias from leaving out an important variable that is correlated with something in the model. Personally I think the latter is nearly certain with any real-world data, but you have to figure out how to address it.
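
To make the omitted variable point concrete, here is a rough simulation sketch. Everything in it is hypothetical: `x` is the predictor kept in the model, `z` is the variable left out, and the true coefficients (1.0 for `x`, 2.0 for `z`) and their correlation are chosen only to make the bias easy to see.

```python
import random

def slope_simple(x, y):
    """OLS slope for y ~ x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def slopes_two_predictors(x1, x2, y):
    """OLS slopes for y ~ x1 + x2 (with intercept) via the 2x2 normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [a - m1 for a in x1]
    c2 = [a - m2 for a in x2]
    cy = [a - my for a in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(a * a for a in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * b for a, b in zip(c1, cy))
    s2y = sum(a * b for a, b in zip(c2, cy))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

random.seed(2)
n = 5000
z = [random.gauss(0, 1) for _ in range(n)]              # the variable that gets omitted
x = [0.8 * zi + 0.6 * random.gauss(0, 1) for zi in z]   # correlated with z
y = [1.0 * xi + 2.0 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]

slope_without_z = slope_simple(x, y)
slope_with_z, slope_z = slopes_two_predictors(x, z, y)

print(f"slope of x, z omitted:  {slope_without_z:.2f}")
print(f"slope of x, z included: {slope_with_z:.2f}")
```

In this setup the true effect of `x` is 1.0, but when `z` is left out the slope on `x` absorbs part of the effect of `z` and lands near 1.0 + 2.0 × 0.8 = 2.6. Including `z` brings the estimate back close to 1.0.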