# Best way to run regression

#### noetsi

##### Loves R
I have done this many ways in the past. I have a series of variables measured on a four-point Likert scale, from strongly disagree (1) to strongly agree (4). Originally, since I think it is reasonable to assume the differences between the levels are equal, I ran linear regression using the raw levels (that is, the original 4-point levels for both the predictors and the dependent variable, DV).

Last time I ran logistic regression with the predictors on the original scale (which the software treats as linear) and a two-level DV. All I am trying to determine is which predictors are statistically significant (in a decade plus of looking at the literature, I have found no good way to determine which has the greatest relative impact). I have about 700 cases. Most will fall in one level of the DV because most respondents agree or strongly agree. Effectively the questions deal with satisfaction, although the wording does not use that term; it asks for agreement or disagreement with statements like "I am satisfied with pay".

I can see several ways to do this and would appreciate advice on which one to use.

• Linear regression with 4-point predictors and a 4-point DV.
• Linear regression with dummy predictors (2 levels, agree/disagree).
• Logistic regression with a two-level DV and the predictors either on the original 4-point scale or as dummy variables.

I don't want to run ordinal logistic regression given various concerns. One thing that worries me about logistic regression, given rules of thumb I have read such as Agresti's, is that I only have 82 of 639 usable cases at one of the two levels of the DV (if we collapse levels 1 and 2 into level 0, and levels 3 and 4 into level 1). I am not sure that is enough to make the regression work (it will run; I mean enough to interpret it correctly).
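The collapsing step described above can be sketched roughly as follows. This is only an illustration: the `dichotomize` helper and the `responses` list are made up for the example and are not the actual survey data.

```python
from collections import Counter


def dichotomize(likert_score):
    """Collapse a 4-point Likert score: 1-2 -> 0 (disagree), 3-4 -> 1 (agree)."""
    if likert_score not in (1, 2, 3, 4):
        raise ValueError(f"expected a score from 1 to 4, got {likert_score!r}")
    return 1 if likert_score >= 3 else 0


# Hypothetical responses for illustration only (not the real survey data).
responses = [4, 3, 3, 2, 4, 1, 3, 4, 4, 3]

binary = [dichotomize(r) for r in responses]
counts = Counter(binary)

# The rarest level is the one that limits how many predictors are supportable.
rarest_level, rarest_count = min(counts.items(), key=lambda kv: kv[1])
print(f"level counts: {dict(counts)}")
print(f"rarest level: {rarest_level} with {rarest_count} cases")
```

Checking the count at the rarest level before fitting anything is the point here; in the real data that count is the 82 mentioned above.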


#### Miner

##### TS Contributor
I don't know whether the following is technically correct, but I can tell you that it worked for me. We do an annual survey on a wide variety of measures, which use an ordinal scale of 1 to 10. I used multiple linear regression to determine the relationship between the possible IVs and an important DV for each product line. The purpose was NOT to create a predictive model, but simply to understand the influence of the various IVs. Once I developed the model, I interpreted it as follows:
• Coefficients were used to explain shifts in the mean (i.e., IVs with larger coefficients could be leveraged to raise the mean score)
• % Contribution was used to explain extreme lower values (i.e., IVs with high % contribution tended to be strong dissatisfiers since the DV distributions were left skewed)
This provided many key insights into customer perceptions by product line that enabled us to make many significant improvements.

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Why don't you want to use ordinal logistic? I haven't really had any use cases for it, but Frank Harrell is always going on about how it is the best. However, if you have sparsity issues in the DV for logistic, would you have them for ordinal or other models as well?

#### noetsi

##### Loves R
The SAS book I read on this topic, by Allison, suggests it is pretty hard to do correctly since it makes assumptions that regular logistic regression does not. I don't think I have the most recent edition with me (mine is from 1999). I did not find the specific concern I had in it, but one point it makes is:

"A word of caution about the score test for the proportional odds assumption: the SAS/STAT user's guide warns that the test may tend to reject the null hypothesis more often than is warranted. In my own experience if there are many independent variables and if the sample size is large this test will usually produce p values below .05"

I actually have a different concern. Out of about 442 usable cases, very few (about 42 in all) fall in two of the levels of the dependent variable. One of those levels has only about 10. I am not sure what the minimum number of cases per level is, but with 39 or so predictors that does not seem like a lot.

From memory, ordered regression makes an assumption that the other forms do not. I cannot find the reference to it on reviewing the Allison book, but I will keep looking.

Another SAS issue is that a lot of the diagnostics won't be generated if you run ordered logistic regression.


#### noetsi

##### Loves R
This is part of my reason not to use ordered logistic regression. I only have about 10 cases at one level of the DV. If I dichotomize the DV, I have about 40, which is still too few but is the best I can do with our data. I have 30-plus predictors.

Lemeshow and Hosmer suggest at least ten cases per variable. This is generous; that is, ten is the absolute minimum and more is preferable. Another perspective, from Agresti, is to take the count of the rarest outcome (either the 0 or the 1) and divide it by 10 to determine how many IVs you can have.
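As a rough check, the divide-the-rarest-outcome-by-10 rule can be worked through with the counts mentioned in this thread. The `max_predictors` helper is just an illustrative name, rounding down is my own assumption, and the 40-of-442 figures are approximate.

```python
def max_predictors(n_events, n_nonevents, per_predictor=10):
    """Rule-of-thumb ceiling on the number of IVs: rarest outcome count / 10.

    Rounding down is an assumption made here; the rule is only a rough guide.
    """
    rarest = min(n_events, n_nonevents)
    return rarest // per_predictor


# 82 of 639 usable cases at the rarer level of the collapsed DV.
print(max_predictors(82, 639 - 82))   # 8

# Roughly 40 of about 442 usable cases (approximate figures).
print(max_predictors(40, 442 - 40))   # 4
```

Either way the rule supports only a handful of predictors, far fewer than the 30-plus I have.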