# Linear Regression: Controlling for categorical independent variables, WITHOUT dummies?

#### khaidoba

##### New Member
Short version:
Does including a categorical nominal-level independent variable WITHOUT dummies still work as a simple control for other independent variables?
e.g. for P = b0 + b1*Grades + b2*WordCount + b3*Sex + b4*Major
With 6 possible categories for Major, I know b4 is meaningless since Major is categorical. But do the coefficients for b1, b2, b3 still remain meaningful? i.e. b1 is the change in P for each unit change in Grades, holding all else constant, controlling for WordCount, Sex, and Major; etc.?

Longer version:
I am running a multiple linear regression on plagiarism (as a percentage P) as dependent variable, and independent variables Grades, WordCount, Sex, Major:

P = b0 + b1*Grades + b2*WordCount + b3*Sex + b4*Major

As such, Sex is a simple 0/1 dummy variable.
My issue is with Major, which is a categorical nominal-level variable, currently labeled 1 through 6.

Alas, with my messy data, going the dummy variable route (five 0/1 dummy variables for 5 majors with an excluded baseline major, i.e. equivalent to `reg P Grades WordCount Sex i.Major` in Stata) makes the WordCount coefficient statistically insignificant. For some reason, just including the Major variable as-is (i.e. `reg P Grades WordCount Sex Major` in Stata) makes the WordCount coefficient statistically significant. Is this significant coefficient meaningful?

#### rogojel

##### TS Contributor
hi,
Maybe you could try dichotomizing the variable, e.g. Major <= 3 vs. Major > 3?

regards
rogojel

#### noetsi

##### No cake for spunky
> With 6 possible categories for Major, I know b4 is meaningless since Major is categorical.

This is true, which raises the question of why you don't turn it into dummies.

> But do the coefficients for b1, b2, b3 still remain meaningful?

I have been told by a professor that it may make the other slopes meaningless, and he was someone I have a lot of respect for (a Harvard-trained statistician who worked at ETS before entering academia). But I have not read this anywhere else. The common statement in textbooks is that the distribution of an IV does not matter, so under that interpretation it should still serve as a control.

#### CB

##### Super Moderator
> I have been told by a professor it may make the other slopes meaningless and he was an individual I have a lot of respect for (Harvard trained statistician who worked at ETS before entering academics).

Hmm. noetsi, you know more than enough about stats to be able to justify your claims with something better than an argument from authority.

> Alas with my messy data, going the dummy variable route (5 0/1 dummy variables for 5 majors with an excluded baseline major, i.e. equivalent to `reg P Grades WordCount Sex i.Major` in Stata) makes the WordCount coefficient statistically insignificant. For some reason, just including the Major variable as-is (i.e. `reg P Grades WordCount Sex Major` in Stata) makes the WordCount coefficient statistically significant.

It seems like here you are doing something that you know to be rather silly, because it gives you the result that you would like to see. Sorry if that sounds harsh, but look into your heart: am I wrong? And how do you think an examiner or peer reviewer might react to this control strategy?

To explicate the thing a bit: In the first version of your analysis, you aren't really "controlling" for the effects of Major in an effective way, because major is a nominal variable. There's no reason to expect that it would happen to have a linear relationship with plagiarism percentage that by miraculous coincidence lines up perfectly with the coding scheme you've happened to choose.
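
To make the "miraculous coincidence" concrete, here is a rough sketch of what the single numeric Major term assumes. The symbols $\gamma_k$ (the offset of major $k$ from a baseline major) are introduced here only for illustration and do not appear in the original model.

```latex
\begin{aligned}
\text{dummy coding:}\quad
  & E[P \mid \text{Major}=k] = b_0 + b_1\,\text{Grades} + b_2\,\text{WordCount} + b_3\,\text{Sex} + \gamma_k,
  \qquad \gamma_1 = 0,\ \ \gamma_2,\dots,\gamma_6 \text{ free} \\[4pt]
\text{numeric coding:}\quad
  & E[P \mid \text{Major}=k] = b_0 + b_1\,\text{Grades} + b_2\,\text{WordCount} + b_3\,\text{Sex} + b_4\,k \\
  & \Longrightarrow\ E[P \mid \text{Major}=k] - E[P \mid \text{Major}=k-1] = b_4 \quad \text{for every } k = 2,\dots,6
\end{aligned}
```

In other words, the numeric version forces every step from one code to the next to shift plagiarism by the same amount $b_4$, while the dummy version spends five free parameters on the same differences.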

#### khaidoba

##### New Member
Thanks for the responses!

> Maybe you could try dichotomizing the variable, e.g. Major <= 3 vs. Major > 3?

This doesn't distinguish among Majors 1-3, or among Majors 4-6, though.

> CowboyBear said:
> It seems like here you are doing something that you know to be rather silly, because it gives you the result that you would like to see. Sorry if that sounds harsh, but look into your heart: am I wrong? And how do you think an examiner or peer reviewer might react to this control strategy?
>
> To explicate the thing a bit: In the first version of your analysis, you aren't really "controlling" for the effects of Major in an effective way, because major is a nominal variable. There's no reason to expect that it would happen to have a linear relationship with plagiarism percentage that by miraculous coincidence lines up perfectly with the coding scheme you've happened to choose.

I completely acknowledge the "sketchiness" of applying this kind of control and am against it in principle. But I'm working with a few partners who are trying to do this, and I can't convincingly persuade them that this doesn't work, which is why I'm posting here.

@noetsi, @CowboyBear: So yes, all other sources clearly say to create the dummies, but they don't say you can't NOT do that (if that makes sense). Again, I'm already disregarding the coefficient for Major as meaningless; I'm trying to understand the logic behind whether and why the OTHER slopes would become meaningless if I don't use dummies. In theory, wouldn't the interpretation of the other coefficients remain the same? (i.e. holding constant a certain major, b1 is the P change per unit change in Grades, etc.) Or would the fact that I'm treating Major as a quantitative variable somehow invalidate the other coefficients?

#### noetsi

##### No cake for spunky
Why include it at all if it is meaningless? The rules governing parsimony, at the very least, are violated by adding a variable you accept ahead of time to have no value. Any reviewer worth their salt is going to hammer you for including a variable coded this way. Moreover, you are artificially inflating R squared by including it.

I think the answer to your question is that it will probably serve as a control. But you are violating common usage by including a nonsensical variable in your model for which apparently you have no theoretical basis to include.

#### CB

##### Super Moderator
> I'm trying to understand the logic behind whether and why the OTHER slopes would become meaningless if I don't use dummies?

I think this is an interesting question (though again from a practical point of view I think it's very clear that you should not use this strategy).

First of all, I don't think the slope becomes meaningless as such: Rather, it just takes on a meaning that is likely to be both confusing and not what you are really interested in finding out. To be honest I'm finding it pretty hard to word what the WordCount slope can be interpreted as when you're controlling for Major as a single, supposedly continuous, variable. I think this may be a good way to explain it:

First of all take the equation with dummy variables for major:
P = b0 + b1*Grades + b2*WordCount + b3*Sex + b4*Major2 + b5*Major3 + b6*Major4 + b7*Major5 + b8*Major6 + e

Here b2 is the expected change in plagiarism score for one extra word in the assignment, while controlling for participants' grades, sex, and mean differences in plagiarism scores across different study majors.

But if we take the equation with Major treated as a single continuous variable:
P = b0 + b1*Grades + b2*WordCount + b3*Sex + b4*Major + e

Then b2 is the expected change in plagiarism score for one extra word in the assignment, while controlling for participants' grades, sex, and the linear effect of Major, when Major is coded as a 1-6 nominal variable with whatever coding rule you happen to have used.

So b2 has "meaning", but I can't imagine a reason why that meaning is something you'd actually want to find out. For example, remember that in this case the slopes for both Major AND WordCount will depend critically on the scheme you've used to code majors into numbers (of which there are many alternative schemes, all of which could be perfectly valid). Really this is a good demonstration of why level of measurement does actually have some impact on which tests are appropriate: if the results of an analysis would differ substantively across different, equally valid, coding schemes, then the wrong analysis is being used.
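
The dependence on the coding scheme can be checked with a small simulation. The following is only a sketch on made-up data, not the poster's dataset: the effect sizes in `EFFECT`, the sample size, the assumption that WordCount is measured in thousands of words, and the helper names `ols` and `wordcount_slope` are all hypothetical. It fits the model with dummies and with Major as a single number, under the original labels, a shuffled relabelling, and a doubled relabelling.

```python
import random

def ols(rows, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [x - f * z for x, z in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]

random.seed(0)
# Made-up major effects, deliberately NOT monotone in the label 1..6.
EFFECT = {1: 0.0, 2: 8.0, 3: -3.0, 4: 5.0, 5: -6.0, 6: 2.0}
data = []
for _ in range(600):
    major = random.randint(1, 6)
    grades = random.uniform(2.0, 4.0)
    sex = random.randint(0, 1)
    # Word count (in thousands) varies by major, so Major confounds WordCount.
    words = 2.0 + 0.1 * EFFECT[major] + random.gauss(0, 0.3)
    p = 20 - 2 * grades + 2 * words + sex + EFFECT[major] + random.gauss(0, 2)
    data.append((p, grades, words, sex, major))

def wordcount_slope(code, dummies):
    """Fit the regression and return the WordCount coefficient.

    code maps each major to the number used for it in the regression.
    """
    rows = []
    for _, grades, words, sex, major in data:
        m = code[major]
        if dummies:
            # Five 0/1 columns; the major coded as 1 is the baseline.
            major_cols = [1.0 if m == level else 0.0 for level in range(2, 7)]
        else:
            # Major entered as a single "continuous" column.
            major_cols = [float(m)]
        rows.append([1.0, grades, words, float(sex)] + major_cols)
    return ols(rows, [d[0] for d in data])[2]

original = {m: m for m in range(1, 7)}
shuffled = {1: 4, 2: 1, 3: 6, 4: 2, 5: 3, 6: 5}   # an equally valid relabelling
doubled = {m: 2 * m for m in range(1, 7)}         # codes 2, 4, ..., 12

slope_dummy = wordcount_slope(original, dummies=True)
slope_dummy_shuffled = wordcount_slope(shuffled, dummies=True)
slope_numeric = wordcount_slope(original, dummies=False)
slope_numeric_shuffled = wordcount_slope(shuffled, dummies=False)
slope_numeric_doubled = wordcount_slope(doubled, dummies=False)

print(f"dummies, original labels : {slope_dummy:.4f}")
print(f"dummies, shuffled labels : {slope_dummy_shuffled:.4f}")
print(f"numeric, original labels : {slope_numeric:.4f}")
print(f"numeric, shuffled labels : {slope_numeric_shuffled:.4f}")
print(f"numeric, doubled labels  : {slope_numeric_doubled:.4f}")
```

With dummies the WordCount slope should not care how the majors are labelled, whereas the numeric version should give a different slope after shuffling the labels and the same slope after merely doubling them.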

Really you'd be better off just not bothering to control for Major at all. So those are really your options: Leave it out, or control for it properly (via dummy variables).

> In theory, wouldn't the interpretation of the other coefficients remain the same? (i.e. holding constant a certain major, b1 is the P change per unit change in Grades, etc.)

I don't think so. As I understand it, the idea that b1 would be the slope for grades while holding constant a participant's major relies on the assumption that you have correctly specified the effect of major in the model itself.
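
A rough way to write that assumption down, using notation that is not in the original posts: let $\gamma_k$ be the true offset of major $k$, and let $r_k$ be whatever part of it a straight line in the codes fails to capture.

```latex
\begin{aligned}
\gamma_k &= \alpha + b_4\,k + r_k \\
P &= (b_0 + \alpha) + b_1\,\text{Grades} + b_2\,\text{WordCount} + b_3\,\text{Sex} + b_4\,\text{Major}
     + \underbrace{\bigl(r_{\text{Major}} + e\bigr)}_{\text{effective error term}}
\end{aligned}
```

If the leftover $r_{\text{Major}}$ happens to be correlated with Grades (given the other regressors), it behaves like an omitted variable, and b1 no longer measures the Grades slope with major held constant.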

#### khaidoba

##### New Member
Thanks @noetsi and @CowboyBear for all the help. We finally figured out a hole in our data that messed up our analyses, so the debate about including a non-dummy Program variable is void.
Still, IMO this overall issue does bring up an interesting (theoretical) line of thought.

Messing around with the data a bit, I saw that for the non-dummy (wrong) method:
- Recoding the Programs to different values does mess up the other coefficients quite a bit.
- Meanwhile, multiplying the Program values by a fixed value (e.g. recode program 6=12 5=10 4=8 ... 1=2) does NOT change the other coefficients (or anything else whatsoever), JUST the coefficient for Program.
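
A possible explanation for the second observation, sketched with hypothetical constants $a$ and $c$: any recoding of the form $a + c\cdot\text{Major}$ can be absorbed entirely by the intercept and the Major coefficient.

```latex
b_0 + b_4\,\text{Major}
  \;=\; \Bigl(b_0 - \tfrac{a}{c}\,b_4\Bigr) + \tfrac{b_4}{c}\,\bigl(a + c\,\text{Major}\bigr),
  \qquad c \neq 0
```

So multiplying the codes by 2 (here $a = 0$, $c = 2$) just halves b4 and leaves the fitted values and b1, b2, b3 untouched. Shuffling which major gets which number is not of this form, so it changes what the Major column can explain and therefore shifts the other coefficients too.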

It appears that the interpretation for the other coefficients when NOT using dummies is that for
P = b0 + b1*Grades + b2*WordCount + b3*Sex + b4*Major + e
b1 is the slope for Grades while holding constant Major, given that Major is a continuous variable as defined by the coding rule.

In other words, you're absolutely right, @CowboyBear.