# Grouping educational level in different variables

#### Fëanor

##### New Member
Hi there!

I'm trying to apply a regression model based on the Human Capital Theory.

Besides other independent variables, I have grouped the educational level (in years) in 4 distinct variables, in the following way:

• E1: takes the value 1, 2, 3 or 4 when the worker has that many years of education, and 0 otherwise.
• E2: takes the value 5, 6, 7 or 8 when the worker has that many years of education, and 0 otherwise.
• E3: takes the value 9, 10 or 11 when the worker has that many years of education, and 0 otherwise.
• E4: takes the value 11 or above when the worker has that many years of education, and 0 otherwise.
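The coding described in the list above can be sketched as a small function. This is only an illustration: the name `split_education` is hypothetical, and because 11 appears under both E3 and E4 in the list, the sketch assumes E4 starts at 12.

```python
def split_education(years):
    """Return (E1, E2, E3, E4) for one worker's years of education.

    Assumed bands: 1-4, 5-8, 9-11, and 12 or more. The list above puts 11
    in both E3 and E4, so the 12+ cut-off for E4 is an assumption.
    """
    e1 = years if 1 <= years <= 4 else 0
    e2 = years if 5 <= years <= 8 else 0
    e3 = years if 9 <= years <= 11 else 0
    e4 = years if years >= 12 else 0
    return (e1, e2, e3, e4)


if __name__ == "__main__":
    # Print the four variables for a few hypothetical workers.
    for y in (0, 3, 6, 11, 15):
        print(y, split_education(y))
```

At most one of the four variables is non-zero for any worker, and a worker with 0 years gets zeros in all four.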

My question is: could grouping educational level this way cause any problems for my estimation?

Thanks!

#### noetsi

##### Fortran must die
Nice to see someone named Feanor humble enough to ask for advice.

Are you creating dummy variables where (using example E1) if someone had 1-4 years of education then they would be coded 1, otherwise 0? If this is the case then (unless having no education is a level) you are going to have a problem, because you have one dummy variable for every possible level in your categorical variable. That is, everyone will be in one of the dummies and there will be no reference level. This should cause your model not to run at all (no estimate can be made since you will have perfect collinearity), although it is always possible that the software might miss this with so many levels.

Regardless, you want to have a reference level if education is a categorical variable or you are treating it as such.
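The collinearity problem with a full set of 0/1 dummies can be checked numerically. The sketch below assumes the model includes an intercept, uses a made-up sample in which nobody has 0 years of education, and assumes the last band starts at 12 years.

```python
import numpy as np

# Hypothetical sample of years of education; nobody has 0 years.
years = np.array([1, 3, 5, 8, 9, 11, 12, 16])

# 0/1 indicators for the four bands (12+ assumed for the last band).
d1 = ((years >= 1) & (years <= 4)).astype(float)
d2 = ((years >= 5) & (years <= 8)).astype(float)
d3 = ((years >= 9) & (years <= 11)).astype(float)
d4 = (years >= 12).astype(float)
intercept = np.ones_like(d1)

# All four dummies plus an intercept: the dummies sum to the intercept column.
X_all = np.column_stack([intercept, d1, d2, d3, d4])
# Dropping d1 makes the first band the reference level.
X_ref = np.column_stack([intercept, d2, d3, d4])

rank_all = np.linalg.matrix_rank(X_all)
rank_ref = np.linalg.matrix_rank(X_ref)
print(X_all.shape[1], rank_all)  # number of columns vs. rank
print(X_ref.shape[1], rank_ref)
```

With all four dummies the design matrix has five columns but rank four, so the coefficients are not separately identified; with a reference level dropped, the rank equals the number of columns.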

#### Fëanor

##### New Member
> Nice to see someone named Feanor humble enough to ask for advice.

LOL!

> Are you creating dummy variables where (using example E1) if someone had 1-4 years of education then they would be coded 1, otherwise 0?

No, they are not dummies. It is just as in the description. Think of a single variable for the educational level, which I then split into four parts. Within each part, the variable takes the value of the individual's educational level (if it falls in that part), or zero.

For example, think of an individual with 6 years of education. Then E2 takes the value 6, and E1, E3 and E4 take the value 0. Does that make sense?

I know that this is unusual, but what I want to know is whether it may cause problems for my estimation.

#### noetsi

##### Fortran must die
Your original description said E1-E4 would be four separate predictors. If that is so, then if you code each person's education this way I would think you would have major multicollinearity issues (each person would have their education coded in each of the four variables). You would also have variables with only five levels. While this does not strictly violate the rules for regression, it will likely lead to significant non-normality (which is especially an issue if you use some maximum likelihood estimators, but will create problems for the confidence intervals and p-values generally).

I am curious why you would have four different education predictors - if I understood your original post correctly.

> Besides other independent variables, I have grouped the educational level (in years) in 4 distinct variables

#### Fëanor

##### New Member
> Your original description said E1-E4 would be four separate predictors. If that is so, then if you code each person's education this way I would think you would have major multicollinearity issues (each person would have their education coded in each of the four variables). You would also have variables with only five levels. While this does not strictly violate the rules for regression, it will likely lead to significant non-normality (which is especially an issue if you use some maximum likelihood estimators, but will create problems for the confidence intervals and p-values generally).
>
> I am curious why you would have four different education predictors - if I understood your original post correctly.

Actually, I have run a quantile regression model. The four regressors were significant, so multicollinearity doesn't appear to be a problem. Also, quantile regression doesn't have problems with non-normality, as far as I know.

Initially, I used this form because I misunderstood the model from another work =P

I thought that splitting the educational level into four variables would give more precise slope coefficients for each group. But I was questioned about it, and I don't know whether I have made an error by doing so.
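On the multicollinearity point, a more direct check than looking at significance is to compute variance inflation factors for E1-E4. The sketch below is an illustration only: the sample is invented (five workers at each of 1 to 16 years), the last band is assumed to start at 12, and `band` and `vif` are hypothetical helper names.

```python
import numpy as np

# Hypothetical sample: five workers at each of 1..16 years of education.
years = np.repeat(np.arange(1, 17), 5)


def band(lo, hi):
    """Years of education inside [lo, hi], zero outside."""
    return np.where((years >= lo) & (years <= hi), years, 0).astype(float)


# E4 is assumed to start at 12 years.
E = np.column_stack([band(1, 4), band(5, 8), band(9, 11), band(12, np.inf)])


def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the
    remaining columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centered = y - y.mean()
    r2 = 1.0 - (resid @ resid) / (centered @ centered)
    return 1.0 / (1.0 - r2)


vifs = [vif(E, j) for j in range(E.shape[1])]
print(vifs)
```

A VIF close to 1 means a regressor is nearly unrelated to the others; large values indicate that it is close to a linear combination of them. The numbers here depend entirely on the made-up sample.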

#### noetsi

##### Fortran must die
What was the basis of the questioning? That is, what did they think was the problem? I do not know quantile regression, so I don't know the assumptions behind it. I have never seen this approach used in a journal, which might be why they questioned it even if there is nothing wrong with it (that is, they might have seen other forms of regression used, such as linear regression, where this would be a problem).

#### Fëanor

##### New Member
> What was the basis of the questioning? That is, what did they think was the problem? I do not know quantile regression, so I don't know the assumptions behind it. I have never seen this approach used in a journal, which might be why they questioned it even if there is nothing wrong with it (that is, they might have seen other forms of regression used, such as linear regression, where this would be a problem).

The questioning was just: "Can you do that? I've never seen something like this." But I don't see any problems in principle. It is almost like the dummy variable case, but instead of my groups taking just 0-1 values, they can take other values. And I tried to avoid something like the dummy variable trap by leaving the 0 value for educational level out.
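Whether this coding runs into something like the dummy variable trap can be checked by looking at the rank of the design matrix. The sketch below assumes an intercept in the model, a made-up sample with one worker at each of 1 to 16 years, and a last band starting at 12; `band` is a hypothetical helper.

```python
import numpy as np

# Hypothetical sample: one worker at each of 1..16 years of education.
years = np.arange(1, 17)


def band(lo, hi):
    """Years of education inside [lo, hi], zero outside."""
    return np.where((years >= lo) & (years <= hi), years, 0).astype(float)


# E4 is assumed to start at 12 years.
E1, E2, E3, E4 = band(1, 4), band(5, 8), band(9, 11), band(12, np.inf)
X = np.column_stack([np.ones(len(years)), E1, E2, E3, E4])

rank = np.linalg.matrix_rank(X)
print(X.shape[1], rank)  # number of columns vs. rank

# The four pieces add back up to the original variable.
total = E1 + E2 + E3 + E4
```

In this sample the matrix has full column rank, so there is no perfect collinearity of the dummy-trap kind. Full rank does not by itself make the specification sensible, though: within band k the fitted value is the common intercept plus the band's coefficient times years, so every band's line is forced through the same intercept at zero years and the fit can jump at the band boundaries.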