Class Imbalance in Logistic Regression

Buckeye

Active Member
Hello!

I am trying to predict whether vehicles sell at a "quick auction" using vehicle characteristics such as age, make, actual cash value, etc. I would like to start with something like a mixed logistic regression model. I have about 10,000 cases, and for every 3 events there are 7 non-events (i.e., the rate at which vehicles sell at a quick auction is 30%). I think the general approach here is to over- or under-sample the training data. If I do this, will it change anything about how I interpret the model for inference? I think other options are to use a penalized model of some sort or to include weights in some way? Thanks for the help

hlsmith

Less is more. Stay pure. Stay poor.
I wouldn't consider 3 to 7 imbalanced. Yes, artificially changing the prevalence will monotonically shift the predicted probabilities, and you need a correction to recover probabilities at the true prevalence.
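For what it's worth, a minimal sketch of that correction in plain Python (the function name and prevalence values are my own; this is the standard log-odds shift, assuming simple random over-/under-sampling):

```python
import math

def correct_probability(p_sampled, rho_sampled, rho_true):
    """Map a probability estimated on resampled data back to the true prevalence.

    Resampling to a new event rate shifts the model's intercept by the
    difference in log-odds between the sample and population prevalence;
    undoing that shift recovers probabilities at the true prevalence.
    """
    shift = (math.log(rho_sampled / (1 - rho_sampled))
             - math.log(rho_true / (1 - rho_true)))
    logit = math.log(p_sampled / (1 - p_sampled)) - shift
    return 1 / (1 + math.exp(-logit))
```

For example, a model trained on artificially balanced data (50% events) that outputs 0.5 maps back to 0.3 under the true 30% prevalence: `correct_probability(0.5, 0.5, 0.3)` returns 0.3.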

PS, Is this public data?

Buckeye

Active Member
hlsmith said:
"I wouldn't consider 3 to 7 imbalanced. Yes artificially changing the prevalence will monotonically change the predicted probabilities and requires a correction to get the values for the true prevalence value.

PS, Is this public data?"
No, it is not public data. Thanks for the response! I felt the same way. I'm unsure if there is a rule of thumb for when to consider an outcome imbalanced. I know in fraud cases it could be 99 to 1.

hlsmith

Less is more. Stay pure. Stay poor.
A cutoff of <10% events gets touted, but if the sample size is large enough and there aren't too many predictors, logistic regression can still hold up. The issue comes when the outcome is rare and the categorical predictors are imbalanced as well; in that case the standard errors will be fairly large for some subgroups.

Buckeye

Active Member
As mentioned earlier, I have vehicle age (in years) and the vehicle cash value. Let's say the cash value ranges from $500 to $15,000. Clearly, vehicle age and cash value are not on the same scale, and I'm aware this can affect the performance of the model. If I divide the cash value by 1,000 so that the coefficient can be interpreted in $1,000 increments, will this "help" my model? Is this an okay approach? I've read that we can standardize the variables as well, but then the coefficients are harder to interpret. I would likely include the interaction term between these two variables as well.
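A quick sanity check that dividing by 1,000 only rescales the coefficient and leaves the linear predictor (and hence the predicted probabilities) unchanged — the coefficient values below are made up for illustration:

```python
# Made-up logistic coefficients: intercept and a per-dollar slope for cash value.
b0, b1 = -2.0, 0.0004
cash_value = 12_500.0

# Measuring cash value in dollars vs. in $1,000s: the coefficient scales by
# 1,000 in the opposite direction, so the linear predictor is identical.
logit_dollars = b0 + b1 * cash_value
logit_thousands = b0 + (b1 * 1000) * (cash_value / 1000)

assert abs(logit_dollars - logit_thousands) < 1e-12
```

So the rescaled model is the same model; only the units in which the coefficient is reported change.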

hlsmith

Less is more. Stay pure. Stay poor.
I don't think the original scale will mess with much. Some approaches do better computationally with smaller values. Rescaling is fine, but if I recall correctly it won't change the fit or the predictions, just the units the coefficients are reported in. @Jake (west???, sorry I am forgetting his full last name, 'cookie scientist') has a blog post on the benefits of centering the main effects of an interaction.
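A sketch of why centering is just a reparameterization (all coefficients and means below are made-up numbers): with an interaction in the model, the uncentered "main effect" is the slope when the other variable equals 0, while the centered main effect is the slope at the other variable's mean; the interaction coefficient and the fitted values are unchanged.

```python
# Made-up coefficients for: logit = b0 + b1*age + b2*value + b3*age*value
b0, b1, b2, b3 = 0.5, 0.2, -0.1, 0.05
m1, m2 = 6.0, 8.0   # sample means of age (years) and cash value (in $1,000s)

def lin_pred(x1, x2):
    return b0 + b1*x1 + b2*x2 + b3*x1*x2

# Equivalent parameterization in terms of centered variables:
c0 = b0 + b1*m1 + b2*m2 + b3*m1*m2   # intercept = prediction at the means
c1 = b1 + b3*m2                      # slope of age at value = m2
c2 = b2 + b3*m1                      # slope of value at age = m1
c3 = b3                              # interaction coefficient is unchanged

def lin_pred_centered(x1, x2):
    return c0 + c1*(x1 - m1) + c2*(x2 - m2) + c3*(x1 - m1)*(x2 - m2)

assert abs(lin_pred(4.0, 10.0) - lin_pred_centered(4.0, 10.0)) < 1e-12
```

Same model, same predictions; centering just moves the "main effect" coefficients to a more interpretable reference point.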