Relative impact

noetsi

Fortran must die
#1
This was my first question 11 years ago, and I still cycle back to it every few years. I have a series of dummy predictor variables and a two-level DV, and I am running logistic regression. What I want to do is determine which variable has the greatest impact on the DV, because we want to know which predictor changes the DV the most, controlling for the others. The dependent variable is overall satisfaction; the predictors are satisfaction with things like pay. Answers to this question tend to say either that relative impact cannot be assessed with regression, or that you should standardize your predictors and see which is larger. But a lot of analysts disagree with standardizing dummy variables.
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
If everything is binary (i.e., the IVs and the DV), then given your scenario I would use LASSO logistic regression. You may have some sparsity, but I would use half the data in the LASSO and then run a model with the selected features (IVs) in the second half of the data to get estimates.
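
For what it's worth, a minimal sketch of that split-sample workflow in R with glmnet might look like the following. The data frame dat, the item names, and overall_sat are made-up placeholders, not anything from your data.

library(glmnet)

## simulated stand-in data: six 0/1 satisfaction items and a 0/1 overall DV
set.seed(1)
n   <- 500
dat <- as.data.frame(replicate(6, rbinom(n, 1, 0.5)))
names(dat) <- paste0("item", 1:6)
dat$overall_sat <- rbinom(n, 1, plogis(-1 + 2 * dat$item1 + dat$item2))

## half the data for selection, half held out for estimation
train <- sample(n, n / 2)
x <- as.matrix(dat[setdiff(names(dat), "overall_sat")])
y <- dat$overall_sat

## LASSO logistic regression with a cross-validated penalty on the first half
cv_fit <- cv.glmnet(x[train, ], y[train], family = "binomial", alpha = 1)
cf     <- as.matrix(coef(cv_fit, s = "lambda.min"))
keep   <- setdiff(rownames(cf)[cf[, 1] != 0], "(Intercept)")

## refit ordinary logistic regression on the holdout half with the selected
## items to get interpretable coefficients and standard errors
refit <- glm(overall_sat ~ ., data = dat[-train, c("overall_sat", keep)],
             family = binomial)
summary(refit)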
 

noetsi

Fortran must die
#3
OK, I know nothing about that, though you have mentioned it before. I don't care at all about the actual slopes, which we will never use. I just want to know which predictor has more impact.

Does LASSO work when the predictors are correlated, as mine will be? Correlation seems to be a major issue for relative importance, although I don't know whether it affects LASSO or not.

You might be interested in this approach, which I had never heard of before. I have to find out whether you can do it in SAS, or whether I can learn the R.

ORM341993 767..781 (researchgate.net)

It is called dominance analysis.
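
For reference, the basic general-dominance calculation is: for each predictor, average its incremental contribution to model fit (e.g., McFadden's pseudo R-squared for a logit model) over all subsets of the other predictors. Below is a brute-force sketch of that in R, reusing the made-up dat and overall_sat names from the earlier sketch; it fits on the order of 2^p models, so it is only practical for a modest number of predictors, and the CRAN dominance-analysis packages do this more carefully.

## McFadden pseudo R-squared for a logit model with the given predictors
mcfadden_r2 <- function(vars, dat, dv) {
  rhs  <- if (length(vars) == 0) "1" else vars
  fit  <- glm(reformulate(rhs, dv), data = dat, family = binomial)
  null <- glm(reformulate("1", dv), data = dat, family = binomial)
  1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))
}

## general dominance weight: the average gain in fit from adding a predictor,
## averaged within each subset size and then across sizes
general_dominance <- function(dat, dv = "overall_sat") {
  ivs <- setdiff(names(dat), dv)
  sapply(ivs, function(v) {
    others <- setdiff(ivs, v)
    size_means <- sapply(0:length(others), function(k) {
      subsets <- if (k == 0) list(character(0))
                 else combn(others, k, simplify = FALSE)
      mean(sapply(subsets, function(s)
        mcfadden_r2(c(s, v), dat, dv) - mcfadden_r2(s, dat, dv)))
    })
    mean(size_means)
  })
}

sort(general_dominance(dat), decreasing = TRUE)   # larger = more important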
 

hlsmith

Less is more. Stay pure. Stay poor.
#4
I have always done it in R previously, since it wasn't available in SAS, but via the link below it seems accessible in SAS now. I would do a data split and run the LASSO, then fit the selected model to the holdout set using ordinary logistic regression and look at the coefficients and SEs to make a decision. That can be a little tricky, since you can have a big coefficient with a big SE or a small coefficient with a small SE, etc., and it isn't obvious which is more accurate. Part of the LASSO's purpose is to get around collinearity. Another approach could be fitting a random forest and just looking at the variable importance list - but I think the latter is my choice.

https://www.mwsug.org/proceedings/2017/AA/MWSUG-2017-AA02.pdf
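
If it helps, a bare-bones version of the random-forest check in R (randomForest package, again on the made-up dat from the earlier sketch) could be:

library(randomForest)

rf_dat <- dat                                      # copy so the 0/1 DV stays numeric elsewhere
rf_dat$overall_sat <- factor(rf_dat$overall_sat)   # factor DV = classification forest
rf <- randomForest(overall_sat ~ ., data = rf_dat,
                   ntree = 1000, importance = TRUE)

importance(rf)    # mean decrease in accuracy / Gini for each predictor
varImpPlot(rf)    # the same information as a plot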
 

noetsi

Fortran must die
#5
I only want it to say which variables are more important, which it seems to do indirectly by getting rid of the less important ones. It still won't tell me which of the remaining variables are better, but there seems to be no agreed-upon way to do that with dummy variables, which cannot be standardized.

What do you think about the logic, for dummy predictors, of saying that the ones with the highest odds ratios are relatively more important after the LASSO?
 

noetsi

Fortran must die
#7
They are all formatted the same way and have the same scaling, since they are all coded 0 and 1.

In that case, if I understand correctly, a predictor with a higher odds ratio has a greater impact than one with a lower odds ratio. But I note that some disagree that you can ever measure relative impact with regression, particularly if the predictors are correlated, which they certainly will be in this case. Regression deals with this in the slopes by controlling for the other variables, which to me would seem to address that issue (a small sketch of the comparison follows at the end of this post). But it is clear that others disagree, and this is a topic I can find little on in the literature.

Probably because the people writing the articles don't really care much about which predictor has the greater impact. :p They are theorists, not practitioners.
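
Just to make that odds-ratio comparison concrete, here is a small sketch in R, again on the made-up dat from the earlier sketches (in practice fit would be the post-LASSO refit on the holdout half). Looking at the intervals alongside the point estimates also speaks to the big-coefficient/big-SE problem mentioned above.

fit <- glm(overall_sat ~ ., data = dat, family = binomial)

or <- exp(coef(fit))[-1]                        # odds ratios, intercept dropped
ci <- exp(confint(fit))[-1, , drop = FALSE]     # profile-likelihood intervals
ranked <- data.frame(odds_ratio = or, lower = ci[, 1], upper = ci[, 2])
ranked[order(-ranked$odds_ratio), ]             # largest odds ratio first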
 

noetsi

Fortran must die
#8
Here is where, in practical terms, logistic regression gets tricky. We ran a series of predictors coded 1 = satisfied, 0 = unsatisfied. The DV is coded the same way. We ran logistic regression.

The odds ratio for pay was about 19, meaning that on average people were far more likely to be satisfied overall if they were satisfied with pay than if they were not, controlling for about 30 other causes of satisfaction (a small numeric sketch of what that ratio does and does not say is at the end of this post). But does that mean pay satisfaction drove overall satisfaction, or had more influence than other factors with lower odds ratios?

I don't know. That is the problem with categorical dummies, to me. You know that one group is higher or lower than another group, but there is no reasonable way to know whether the predictor actually caused anything, particularly relative to the other factors that influence satisfaction.

Other than experimental design, which is not possible for our organization, does anyone know a way to address this?
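
One thing to keep in mind (made-up numbers below): an odds ratio of 19 multiplies the odds, so how much it moves the predicted probability depends entirely on where the other predictors put the baseline. The same ratio can correspond to a large or a fairly small shift in probability.

or <- 19
baseline_prob <- c(0.05, 0.30, 0.70)                   # hypothetical baseline probabilities
baseline_odds <- baseline_prob / (1 - baseline_prob)
new_prob <- (baseline_odds * or) / (1 + baseline_odds * or)
round(data.frame(baseline_prob, new_prob, change = new_prob - baseline_prob), 2)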
 

hlsmith

Less is more. Stay pure. Stay poor.
#9
Yeah, this goes back to "all models are wrong." You can't know the truth without zero non-respondents and perfect accuracy in the responses, so you just have to deal with it.