Skewed Continuous Predictor - dichotomize or treat as continuous?

#1
Hello All,

I have two scales with 7 binary items. The simple raw scores from these scales (summing the item responses of 0 or 1) are very skewed, with about 56% of the sample having a scale score of 0. This is not unexpected, as the scale measures behavioral problems, with higher scale scores indicating more problems. A scale score of 0 indicates no problems. I am interested in looking at the relationship of these scale scores to binary outcome variables. Given the nature of the scale and the positive skew, I am inclined to dichotomize the scale scores into those with a score of 0 and those with a non-zero score (i.e., 1-7), effectively creating a "no problems" / "problems" dichotomy. This way, when I use logistic regression, the coefficients will be easily interpretable. I know that there are issues with dichotomizing a continuous variable and was curious what you all think of this approach. I was also curious what other analytical paths you would consider.

Thank you very much for your thoughtful consideration!
 

noetsi

Fortran must die
#2
Generally you don't want to dichotomize continuous data because you lose information. I am not sure if your scales are continuous or not (Likert responses are usually seen as ordinal; combinations of multiple questions into scales may be continuous, depending on how many levels you end up with on the scale). The percentage of a variable that falls in a certain category does not by itself mean the variable is skewed. You should test for skewness before you take the drastic step of dichotomizing a continuous variable (if that is what it is). There are less drastic transformations for dealing with skewed data.

In honesty I have never seen skew raised with logistic regression. It is more of a concern for linear regression.
 
#3
noetsi,

Thank you for the quick reply. The scale scores range from 0 to 7. With 7 possible scores I thought this variable could be treated as continuous. Please advise if you think otherwise.

I was just giving the % for illustration and did check the skewness statistics, which fell around 1.3 (sd .022). There are many ways to analyze this data and I am trying to find the most meaningful and defensible approach. For example, I could dichotomize the scale scores, treat them as continuous, or, I suspect, treat them as categorical.
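For anyone who wants to reproduce a skewness check like this by hand, here is a minimal sketch. The per-score counts are invented for illustration (only the 56% at zero comes from this thread), and the formulas are the adjusted Fisher-Pearson coefficient and its usual standard error, which is derived under normality; whether a given package reports exactly these is an assumption to verify against its documentation.

```python
import math

# Hypothetical counts of people at each scale score 0..7 (invented for
# illustration; 56% of the 1000 cases score 0, as described in post #1).
score_counts = {0: 560, 1: 150, 2: 100, 3: 70, 4: 50, 5: 35, 6: 20, 7: 15}

n = sum(score_counts.values())
mean = sum(k * c for k, c in score_counts.items()) / n

# Second and third central moments (dividing by n).
m2 = sum(c * (k - mean) ** 2 for k, c in score_counts.items()) / n
m3 = sum(c * (k - mean) ** 3 for k, c in score_counts.items()) / n

# Moment coefficient of skewness, then the small-sample adjusted version.
g1 = m3 / m2 ** 1.5
adjusted_skew = g1 * math.sqrt(n * (n - 1)) / (n - 2)

# Standard error of the adjusted skewness; it depends only on n.
se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

print(f"n = {n}, skewness = {adjusted_skew:.3f}, SE = {se_skew:.3f}")
```

Because the standard error shrinks roughly like sqrt(6/n), even mild skew will look "significant" in a large sample, so the size of the statistic is usually more informative than a test.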

Appreciate any additional thoughts!
 

noetsi

Fortran must die
#4
There is no agreement among experts on how many levels you need to be continuous in practice (in theory of course you need to have infinite possible levels but there appears to be agreement that once you reach a certain number of levels you can treat it as continuous if the distance between each level is the same). The minimum that is seen as acceptable would appear to be 5 levels, some hold out for 9 and so on. That is not a great answer, but this is simply one of those areas there is no agreement on apparently.

There are different ways to define skewness (that is, different ways it is measured). I believe that, the way SPSS calculates it, a value of 3 or higher is considered to show skewness, for example. Other software may use different calculations; you would need to look online for your specific software to determine this.

Unless your data violates the skewness rule (say 3 for SPSS), I would treat a 7-level variable as continuous. I don't think normality is a critical issue with logistic regression (I am not even sure multivariate normality is assumed; certainly homoscedasticity is not).
 

Dason

Ambassador to the humans
#5
noetsi said: "There is no agreement among experts on how many levels you need to be continuous in practice (in theory of course you need to have infinite possible levels but there appears to be agreement that once you reach a certain number of levels you can treat it as continuous if the distance between each level is the same). The minimum that is seen as acceptable would appear to be 5 levels, some hold out for 9 and so on. That is not a great answer, but this is simply one of those areas there is no agreement on apparently."

Note that typically that distinction only really matters if we're talking about the response variable. It doesn't really matter what the distribution of the predictor is.

There is no normality assumption of any kind in logistic regression.
 

noetsi

Fortran must die
#6
Although I know what you say is true, it is also commonly believed that it does matter (I was taught this repeatedly in graduate programs), which is likely to be an issue in a class or even for publication. Also, a categorical predictor variable (not a dummy variable created from it, but a variable with, say, 4 categorical levels) is hard to interpret and makes the interpretation of other predictors more difficult when you are controlling for it.

Or so I was taught :p
 
#7
Thank you both for the thoughts.

One more question... The scale measures behavioral problems and is intended to help identify those who may be at risk for negative outcomes. First, I treated the scale score (ranging from 0 to 7) as continuous and predicted a binary outcome with logistic regression. The exp(B) for the scale is 1.5. Next, I treated the scale score as a dichotomy coded 0/1 (where those with a non-zero score get a 1). The odds ratio for the dichotomy is approximately 5. My question is which approach provides a more accurate representation of the data.
 

noetsi

Fortran must die
#8
That really depends on what you think is reasonable substantively. You should use the continuous result if you think the scale is interval-like, and not otherwise.
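It may help to see that the two odds ratios reported in #7 answer different questions. An exp(B) of 1.5 is the odds ratio per one-point increase, so under that model the odds ratio between two people k points apart is 1.5 to the power k. The dichotomy's odds ratio of about 5 instead compares a score of 0 with the whole mix of scores 1 through 7. A quick sketch of the arithmetic, using only the 1.5 quoted in the thread:

```python
# Odds ratio per one-point increase, as reported in post #7.
per_point_or = 1.5

# Under the linear-in-score logistic model, the odds ratio for a
# k-point difference is per_point_or ** k.
implied_or = {k: per_point_or ** k for k in range(1, 8)}

for k, odds_ratio in implied_or.items():
    print(f"{k}-point difference: odds ratio = {odds_ratio:.2f}")
```

If the per-point model held exactly, a 4-point gap would give 1.5^4 = 5.0625, so an odds ratio near 5 for "any problems" versus "none" is not necessarily in conflict with 1.5 per point. Which summary is more useful still depends on whether the equal-step assumption is believable.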

To be interval-like, there should be enough levels (no one agrees on what this means in practice, that is, how many levels are needed) and the distance between each level should be the same. So the result of level 4 minus level 3 should be the same as level 2 minus level 1. Personally I would guess that 7 levels meets the former requirement, but how you actually determine the latter is something I have never seen.
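One possible way to probe the equal-distance question, offered as a sketch rather than a settled method: fit the score once as a single linear term and once as a categorical variable with a separate probability per score, then compare the two fits with a likelihood-ratio statistic. If the categorical model does not fit noticeably better, treating the score as interval-like on the log-odds scale looks reasonable. The grouped counts below are invented for illustration, and the helper names (`log_lik`, `fit_linear`) are hypothetical.

```python
import math

# Hypothetical grouped data: score -> (number of people, number with outcome = 1).
# These counts are invented for illustration only.
counts = {0: (560, 28), 1: (150, 12), 2: (100, 12), 3: (70, 12),
          4: (50, 12), 5: (35, 11), 6: (20, 8), 7: (15, 8)}

def log_lik(p_by_score):
    """Binomial log-likelihood of the grouped data given a probability per score."""
    return sum(y * math.log(p_by_score[k]) + (n - y) * math.log(1 - p_by_score[k])
               for k, (n, y) in counts.items())

def fit_linear(iterations=50):
    """Newton-Raphson fit of logit(p) = a + b * score."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for k, (n, y) in counts.items():
            p = 1 / (1 + math.exp(-(a + b * k)))
            w = n * p * (1 - p)          # weight in the information matrix
            g0 += y - n * p              # gradient w.r.t. intercept
            g1 += k * (y - n * p)        # gradient w.r.t. slope
            h00 += w
            h01 += w * k
            h11 += w * k * k
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

a, b = fit_linear()
linear_probs = {k: 1 / (1 + math.exp(-(a + b * k))) for k in counts}

# The categorical model's maximum-likelihood fit is just the observed
# proportion at each score (requires 0 < y < n in every group).
categorical_probs = {k: y / n for k, (n, y) in counts.items()}

ll_linear = log_lik(linear_probs)
ll_categorical = log_lik(categorical_probs)

# Likelihood-ratio statistic; 8 parameters versus 2 gives 6 degrees of freedom.
lr_stat = 2 * (ll_categorical - ll_linear)
CHI2_CRIT_DF6_05 = 12.592  # 5% critical value of chi-square with 6 df

print(f"per-point odds ratio = {math.exp(b):.2f}")
print(f"LR statistic = {lr_stat:.2f} (compare with {CHI2_CRIT_DF6_05})")
```

A statistic well below the critical value would suggest the single linear term is adequate; a large one would suggest the steps between scores are not equal in their effect, in which case the categorical coding (or some collapsing of levels) may describe the data better.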