Why do you need oversampling/undersampling?

#1
Assume the original data contains 1000 goods and 1 bad.
I build a logistic regression and use the model to score the bad, and I get probability = 0.00001.
Then I use oversampling/undersampling to increase/decrease the original data, so now I have 1000 goods and 1000 bads if I use oversampling.
Then I build a logistic model on that data and apply it to the original data, and for that bad I get probability = 0.5.
However, this probability needs to be adjusted to reflect the original data, so after doing some math you get an adjusted probability lower than 0.5 (for example 0.00001). So what is the point of oversampling/undersampling if you are required to adjust the probability?
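For reference, the "some math" in this question is commonly a prior (base-rate) correction. Here is a minimal sketch of that adjustment, assuming the only difference between the resampled and original data is the class proportion. The function name `adjust_probability` and its parameters are made up for illustration; they are not from any library.

```python
def adjust_probability(p_resampled, orig_rate, resampled_rate):
    """Map a probability from a model fit on resampled data back to the
    original class proportions (prior correction).

    p_resampled    : probability of "bad" from the model fit on resampled data
    orig_rate      : fraction of bads in the original data
    resampled_rate : fraction of bads in the resampled training data (must be < 1)
    """
    # Reweight the "bad" and "good" sides by how much each class was
    # over- or under-represented in the resampled data, then renormalize.
    bad_side = p_resampled * orig_rate / resampled_rate
    good_side = (1 - p_resampled) * (1 - orig_rate) / (1 - resampled_rate)
    return bad_side / (bad_side + good_side)


# Hypothetical numbers matching the post: 1000 goods and 1 bad originally,
# 1000 goods and 1000 bads after oversampling.
orig_rate = 1 / 1001
adjusted = adjust_probability(0.5, orig_rate, 0.5)
higher = adjust_probability(0.9, orig_rate, 0.5)
print(adjusted, higher)
```

Under these assumptions, 0.5 maps back to the original base rate (about 0.001), and the adjustment is monotone: a record that scored higher before the correction still scores higher after it, so the ranking of records is unchanged.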
 
#2
Oversampling and undersampling are used to balance the training dataset only. We make sure that the model has enough training instances to learn yes from no in the case of a binary outcome. If we keep the training data imbalanced, the model will mostly see the majority class and may miss the rare class. However, we keep the test/validation set imbalanced, like it is in the wild. Check out this Kaggle dataset as an example:

https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets
Note how "non-fraud" makes up >99% of the data.
"Most of the transactions are non-fraud. If we use this dataframe as the base for our predictive models and analysis we might get a lot of errors and our algorithms will probably overfit since it will "assume" that most transactions are not fraud."
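The train-only resampling described above could be sketched roughly as follows. This is an illustration under assumptions, not the Kaggle notebook's code: the helper names `stratified_split` and `oversample_minority` are hypothetical, labels are assumed to be 0/1 with 1 as the rare class, and the data is made up.

```python
import random


def stratified_split(labels, test_percent, rng):
    """Split indices into train/test, keeping the class ratio in both parts."""
    train, test = [], []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = len(idx) * test_percent // 100
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return train, test


def oversample_minority(train_idx, labels, rng):
    """Duplicate rare-class training rows (with replacement) until balanced."""
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    return train_idx + extra


rng = random.Random(0)
labels = [1] * 10 + [0] * 990  # made-up data: 1% rare class

train_idx, test_idx = stratified_split(labels, 30, rng)
balanced_train = oversample_minority(train_idx, labels, rng)

# The test set is left untouched, so it keeps the original imbalance.
print(sum(labels[i] for i in test_idx), len(test_idx))
print(sum(labels[i] for i in balanced_train), len(balanced_train))
```

The split happens before the resampling on purpose. If rows were duplicated first, copies of the same rare record could land in both the training and test sets and inflate the test results.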
 
#3
Thanks for your reply, but it didn't explain why oversampling/undersampling can help improve the prediction.
 

hlsmith

Less is more. Stay pure. Stay poor.
#4
If you have an original sample of 1000 people with an outcome prevalence of 1%, a naive model would say that no one has the outcome, and it would be 99% correct. But what if the outcome is very deadly - that model wouldn't be helpful even though it is accurate.
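The arithmetic behind that "99% correct but useless" model can be checked in a few lines. The data here is made up to match the example (1000 people, 10 with the outcome), and the variable names are only for illustration.

```python
labels = [1] * 10 + [0] * 990      # 1% prevalence, as in the example
predictions = [0] * len(labels)    # naive model: nobody has the outcome

correct = sum(1 for y, yhat in zip(labels, predictions) if y == yhat)
accuracy = correct / len(labels)

# Recall: the share of true cases that the model actually catches.
caught = sum(1 for y, yhat in zip(labels, predictions) if y == 1 and yhat == 1)
recall = caught / sum(labels)

print(accuracy, recall)
```

Accuracy comes out at 0.99 while recall is 0: the model never finds a single person with the outcome.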

Well, you want to figure out the most consistent predictors of the outcome, but you only have 10 people. Do those 10 people represent everyone who could have had the outcome? Due to sampling variability, etc., it is hard for the model to find a generalizable signal. So when you oversample, you have more cases that represent what attributes may be associated with the outcome. Before doing this you could have had very sparse data: say, of the 10 people with the outcome, there were 4 females, 3 older, and 2 unemployed. When you look further, you could have 1 old female who is unemployed in your sample, and due to sparsity and sampling variability she may not have the outcome, even though that subgroup is actually at higher risk - you just didn't have enough data to find the underlying signal. This is the purpose.

Of note, there are multiple approaches, such as oversampling, undersampling controls (not great for your setting), combined over- and undersampling, and creating synthetic data (not great for your setting, since you need more data to simulate more data). You can also penalize false positives or false negatives with a cost penalty to help capture certain groups.
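One way to picture the cost-penalty idea is a log loss where mistakes on the rare class count more. The sketch below is an assumption-laden illustration, not a specific library's implementation: `weighted_log_loss` and `pos_weight` are hypothetical names, and the data is made up.

```python
import math


def weighted_log_loss(labels, probs, pos_weight=1.0):
    """Average log loss where each positive (label 1) counts pos_weight times.

    Probabilities must be strictly between 0 and 1.
    """
    total, weight_sum = 0.0, 0.0
    for y, p in zip(labels, probs):
        w = pos_weight if y == 1 else 1.0
        total += -w * (math.log(p) if y == 1 else math.log(1 - p))
        weight_sum += w
    return total / weight_sum


labels = [1] * 10 + [0] * 990  # made-up data: 1% prevalence
n = len(labels)

# Two constant predictions: the base rate (0.01) versus 0.5.
plain_base = weighted_log_loss(labels, [0.01] * n)
plain_half = weighted_log_loss(labels, [0.5] * n)
heavy_base = weighted_log_loss(labels, [0.01] * n, pos_weight=99)
heavy_half = weighted_log_loss(labels, [0.5] * n, pos_weight=99)

# Weighting each positive 99 times gives the same loss as duplicating it 99 times.
dup_labels = [1] * 990 + [0] * 990
dup_half = weighted_log_loss(dup_labels, [0.5] * len(dup_labels))

print(plain_base < plain_half, heavy_half < heavy_base)
```

Without the penalty, predicting the base rate has the lower loss; with positives weighted 99 to 1, predicting 0.5 wins. In this sense a cost penalty on the rare class behaves like oversampling by duplication, without actually copying rows.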
 
#5
Thanks for your reply. But only 1 person is bad, not 10. The original data has a proportion of 1000:1, so about 0.1%. I still don't understand why duplicating this record would help.
 

hlsmith

Less is more. Stay pure. Stay poor.
#6
Oversampling means you collect 100K records, so now you have 100 cases, and you don't have to use all of the controls, perhaps only 100 matched controls. You are correct that if you have 1 case you are fucked; you can't create or duplicate if you have only 1. This is statistics, not the Highlander!
 

Dason

Ambassador to the humans
#7
Sometimes a false positive means somebody is very slightly inconvenienced and a false negative means somebody might die. In some cases it's important to have a model that can hopefully identify cases that have a chance of being the event you're looking for.
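That trade-off can be made concrete with a cost-based cutoff. The sketch below uses the standard expected-cost rule (flag a case when the expected cost of missing it exceeds the expected cost of a false alarm); the cost figures and the names `decision_threshold` and `should_flag` are invented for illustration.

```python
def decision_threshold(cost_fp, cost_fn):
    """Probability above which flagging a case has lower expected cost.

    Flag when p * cost_fn > (1 - p) * cost_fp, i.e. p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def should_flag(p, cost_fp, cost_fn):
    """Return True if a case with event probability p should be flagged."""
    return p > decision_threshold(cost_fp, cost_fn)


# Hypothetical costs: a false alarm is a minor inconvenience (1 unit),
# a missed case is 1000 times worse.
threshold = decision_threshold(1, 1000)
flag_rare = should_flag(0.002, 1, 1000)     # small probability, still worth flagging
flag_tiny = should_flag(0.0005, 1, 1000)    # below the cost-based cutoff
print(threshold, flag_rare, flag_tiny)
```

With costs this lopsided, the cutoff drops to roughly 0.001, far below the usual 0.5, so even a case with a 0.2% predicted chance gets flagged.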