# Logistic regression. How to combine classes in a multicategorical variable?

#### mattdoughty

##### New Member
Hi,

I’m preparing some data with a view to carrying out a multivariable logistic regression.

The dependent variable is LANDSLIDES, i.e. occurrence, and is dichotomous (YES/NO).

The independent variables are a collection of environmental factors, such as VEGETATION, SLOPE, ELEVATION, and ROCK TYPE, to name a few. Some of these are continuous (e.g. SLOPE) whilst some are categorical (e.g. VEGETATION).

Concentrating on the categorical variables and VEGETATION in particular, I have already carried out a Chi-square test that has indicated that it influences the occurrence of LANDSLIDES. From reading around, I understand that in multivariable logistic regression it is advisable that the dependent variables should be converted to dichotomous variables and my question is... how should I do this?
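To make the chi-square step concrete, here is a rough Python sketch of the computation (the counts are invented for illustration, not my real data; in SPSS this comes out of the Crosstabs procedure):

```python
import numpy as np

# Hypothetical 2xK contingency table: rows = LANDSLIDES (no, yes),
# columns = VEGETATION classes.  All counts are invented.
observed = np.array([
    [120,  80,  45,  60,  30],   # no landslide
    [ 10,  25,   5,   0,   0],   # landslide
])

# Pearson chi-square test of independence, from first principles:
# expected count = (row total * column total) / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()
expected = row_totals * col_totals / grand_total

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(f"chi-square = {chi2:.2f} on {dof} degrees of freedom")
```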

After looking at the data and running a few exploratory LR analyses I thought of a possible way of proceeding and wanted to ask people's opinions about it. My "methodology" is as follows.

Firstly, by examining the contingency table for LANDSLIDES vs. VEGETATION (please see attachment), I've noticed that for several classes of vegetation there are no positive observations, i.e. in these classes landslides do not occur. My first step would be to combine these categories and give them a new value (OTHERS).
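To illustrate this first step, a rough Python sketch (the class names and counts are invented, not my real data):

```python
# Hypothetical counts of (no-landslide, landslide) observations per
# vegetation class.
veg_counts = {
    "oak woodland":   (120, 10),
    "beech woodland": ( 80, 25),
    "scrub":          ( 45,  5),
    "pasture":        ( 60,  0),   # no landslides observed
    "urban":          ( 30,  0),   # no landslides observed
}

# Pool every class with zero positive (landslide) observations into OTHERS.
collapsed = {}
others_no, others_yes = 0, 0
for veg, (no, yes) in veg_counts.items():
    if yes == 0:
        others_no += no
        others_yes += yes
    else:
        collapsed[veg] = (no, yes)
if others_no or others_yes:
    collapsed["OTHERS"] = (others_no, others_yes)

print(collapsed)
```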

Secondly, using the reclassed VEGETATION variable, run a bivariate logistic regression analysis of LANDSLIDES vs. VEGETATION, using the OTHERS class as the reference class. In the resulting equation variables table, the coefficients (B) are either negative or positive. The negative values correspond to a reduced risk of LANDSLIDES whilst the positive values imply an increased risk. My idea was (1) to group the categories with negative coefficient values with the OTHERS category, and (2) group the categories with positive values, thereby creating a dichotomous variable. This variable would effectively have one class of vegetation types that "don't" cause landslides and one which "does".

I'm unsure whether this is sound and reliable, statistically speaking, and that's why I'm writing. If this is a good way of proceeding, I could then use the same method to reclass my other categorical variables before proceeding.

Matt

PS. I'm using SPSS.

#### noetsi

##### No cake for spunky
> From reading around, I understand that in multivariable logistic regression it is advisable that the dependent variables should be converted to dichotomous variables and my question is... how should I do this?
Your dependent variable already is dichotomous - it is yes/no. Did you mean your independent variables should be dichotomous (which seems to be what you are really talking about)? If so, categorical variables are normally made into dummy variables, which do have two levels. There are no formal rules on how you make a categorical variable into dummy variables; often you turn every level of the IV into a dummy except one (which is the reference level). However, if a dummy variable links to only one level of the DV (say all are yes) then your regression may not run, and even if it does the coefficient estimate will be unreliable (this is known as separation). So the combination you mention (often this is called collapsing categories) makes sense, although normally you would combine levels of an IV so that it makes theoretical sense, not based on running an empirical analysis.
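To make the dummy-coding idea concrete, here is a rough Python sketch (the vegetation levels are invented for illustration; SPSS does this for you when you declare a covariate as categorical):

```python
# Dummy-code a categorical predictor by hand: every level except the
# chosen reference level becomes its own 0/1 indicator.
vegetation = ["oak", "beech", "scrub", "oak", "scrub", "beech"]
reference = "scrub"   # the level all coefficients are compared against

levels = sorted(set(vegetation) - {reference})
dummies = {
    level: [1 if v == level else 0 for v in vegetation]
    for level in levels
}
# An observation is all-zero on the dummies iff it belongs to the reference.
print(dummies)
```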

> In the resulting equation variables table, the coefficients (B) are either negative or positive. The negative values correspond to a reduced risk of LANDSLIDES whilst the positive values imply an increased risk. My idea was (1) to group the categories with negative coefficient values with the OTHERS category, and (2) group the categories with positive values, thereby creating a dichotomous variable. This variable would effectively have one class of vegetation types that "don't" cause landslides and one which "does".
Essentially you are looking at your data and then modifying your predictor to get a specific result. I have never seen that done - it defeats the purpose of regression. You are supposed to be testing a theory not manipulating the data to get specific results.

> I'm unsure whether this is sound and reliable, statistically speaking, and that's why I'm writing. If this is a good way of proceeding, I could then use the same method to reclass my other categorical variables before proceeding.
While you should wait for other posters to comment, I do not believe it is valid. If you do combine levels of the IV, it should be based on what makes sense theoretically, not by using the data to pick levels that ensure a specific result on the DV.

#### mattdoughty

##### New Member

Firstly, yes, I meant to say the independent variable. Thanks for pointing it out!

I understand your point about the modification of the predictor variables and it does make more sense to collapse categories that are theoretically linked, for example combine oak woodland with beech woodland, rather than combine urban areas with pastures.

The reason I thought of collapsing classes was really to make the analysis a little more efficient. I'm currently looking at around 15 IVs, of which about half are categorical. Given your answer, and if nobody replies otherwise, I'll continue for now without collapsing categories and see what kind of results I get.

Cheers,

Matt

#### noetsi

##### No cake for spunky
It is not a bad idea to collapse categories. It is a bad idea to do so based on having one category be linked to one level of the DV and another category linked to the other. If you do this you would predetermine the results (there would be little reason to actually run the regression - in practice you already have in this case).

If you have 15 IVs, not 15 levels of the same variable, then in all probability you are going to have multicollinearity issues. You should run a VIF (this has to be done in linear regression; you ignore the linear regression results other than the VIF). If you encounter this, your model will be fine but your individual variables' significance tests won't be. You might consider in this case collapsing the IVs into larger factors, either based on theory or perhaps exploratory factor analysis.

Your research sounds interesting, if way beyond me. Good luck!

#### mattdoughty

##### New Member
OK, collinearity between the IVs was one of the things I was going to look at next!! I've found information about how to reduce confusion between variables through stratified analysis, although I'd never heard of VIF before. I'll do some research into it. Does it work with categorical IVs?

Thanks again!

#### noetsi

##### No cake for spunky
VIF is not a way of combining variables. It is a test of whether multicollinearity exists. It works with any type of variable. It stands, I believe, for variance inflation factor, but normally you call this VIF. Tolerance is a related concept. I don't know stratified analysis, but John Fox has a good treatment of multicollinearity and possible remedies in Regression Diagnostics, a Sage monograph. It is worth reading (logistic regression assumes there is no multicollinearity, although it does not make other regression assumptions such as normally distributed residuals or (I believe) equal error variance).
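To illustrate what VIF and tolerance measure, a rough Python sketch with simulated data (one predictor is deliberately built from the others, so the design is collinear; tolerance is simply the reciprocal of VIF):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated predictors; x2 is constructed from x0 and x1 plus a little
# noise, so it is nearly a linear combination of the other two.
n = 200
x0 = rng.normal(size=n)
x1 = rng.normal(size=n)
x2 = 0.9 * x0 + 0.4 * x1 + 0.1 * rng.normal(size=n)
X = np.column_stack([x0, x1, x2])

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j
# on all the others; equivalently, the diagonal of the inverted
# correlation matrix of the predictors.
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))
tolerance = 1.0 / vif

for j, v in enumerate(vif):
    print(f"x{j}: VIF = {v:.1f}, tolerance = {tolerance[j]:.3f}")
```

A common rule of thumb is that VIF values above about 10 (tolerance below 0.1) flag problematic multicollinearity.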