Correcting Probabilities for Artificially Balanced Data

hlsmith

Less is more. Stay pure. Stay poor.
#1
I have data from a frequency-matched case-control study. All you need to know is that the outcome variable is artificially balanced (50:50), while the true ratio is 10:90. When you run logistic regression on these data, all outputs are correct (when generalized to the original unbalanced data or to new unbalanced data) except the intercept and anything calculated from it, such as predicted probabilities. As a side note, the predicted probabilities are in the correct rank order, but they need to be transformed to get the correct values for the overall unbalanced data. There are formulae for this; I will post links.
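For reference, here is a minimal Python sketch of the kind of transformation described above, often called prior correction. It assumes sampling depended on the outcome only, that the case fraction in the balanced sample is 0.5, and that the true fraction is 0.1. The function name `correct_probability` and its parameters are made up for illustration and are not taken from any package.

```python
def correct_probability(p, sample_rate, true_rate):
    """Rescale a probability from a model fit on outcome-resampled data.

    p           -- predicted probability from the balanced-data model
    sample_rate -- fraction of cases in the balanced sample (assumed 0.5 here)
    true_rate   -- fraction of cases in the population (assumed 0.1 here)
    """
    # Reweight cases and controls by how much each group was over- or under-sampled.
    case_weight = true_rate / sample_rate
    control_weight = (1.0 - true_rate) / (1.0 - sample_rate)
    numerator = p * case_weight
    return numerator / (numerator + (1.0 - p) * control_weight)


# Rank order is preserved; only the scale changes.
for p in (0.2, 0.5, 0.8):
    print(p, round(correct_probability(p, 0.5, 0.1), 4))
```

A balanced-model prediction of 0.5 maps to 0.1, the assumed population rate, which is a quick sanity check on the formula.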

However, I want to score a new dataset with this model, and the new dataset has also been artificially balanced (50:50). Has anyone else had this scenario, and if so, can I just score the new data and then apply the correction?

Model(Balanced training data) -> score(Balanced validation data) -> correct predictions (Balanced validation data)
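The pipeline above could be sketched roughly as follows in Python, with the correction applied as an offset to the intercept before scoring. This is only an illustration under assumptions: the function names `intercept_offset` and `score_corrected`, the coefficients, and the rows are all hypothetical, and sampling is assumed to depend on the outcome only (50:50 in the sample, 10:90 in the population).

```python
import math


def intercept_offset(sample_rate, true_rate):
    """Log of the ratio by which case odds were inflated through balancing."""
    sample_odds = sample_rate / (1.0 - sample_rate)
    true_odds = true_rate / (1.0 - true_rate)
    return math.log(sample_odds / true_odds)


def score_corrected(rows, intercept, coefs, sample_rate, true_rate):
    """Score rows with a balanced-data logistic model, then correct the intercept."""
    adjusted = intercept - intercept_offset(sample_rate, true_rate)
    probs = []
    for x in rows:
        # Linear predictor on the population scale; slopes are left unchanged.
        lp = adjusted + sum(b * v for b, v in zip(coefs, x))
        probs.append(1.0 / (1.0 + math.exp(-lp)))
    return probs


# Hypothetical fitted values and validation rows, for illustration only.
rows = [[0.0, 1.0], [1.5, 0.0], [-0.5, 2.0]]
print(score_corrected(rows, intercept=0.2, coefs=[0.8, -0.4],
                      sample_rate=0.5, true_rate=0.1))
```

In this sketch the offset uses only the case fraction of the data the model was fit on and the assumed population rate; each row is corrected individually.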

Thanks!
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
Model(Balanced training data) -> score(Balanced validation data) -> correct predictions (Balanced validation data)
Yes, this does seem to work in my simulated attempt. Now I am wondering whether this correction needs to be applied every time a prediction is made based on balanced data. If so, I would think the data science people would run into this all the time with their churn and similar problems. I wonder if there is an R package for this?
 