Class Distribution and Boosted Model (gbm) Probability Scores

#1
All,

Problem:
I need help to better understand the probability scores produced by a decision tree model. Specifically, I'm using the gbm package in R to create Generalized Boosted Regression Models, but the results I see are common across various ensemble classification models.

I build models on training data with varying degrees of class imbalance. In all cases I'm dealing with two classes (e.g. Yes/No). I typically hold out 10% or more of the training data to score and test the model's validity.
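For concreteness, here is a minimal sketch of the kind of workflow I mean (the data frame, column names, and tuning values are placeholders, not my actual settings):

library(gbm)

set.seed(42)
## hold out ~10% of the data for scoring/validation
holdout_idx <- sample(nrow(my_data), size = round(0.10 * nrow(my_data)))
train <- my_data[-holdout_idx, ]
test  <- my_data[holdout_idx, ]

## gbm's bernoulli distribution expects a numeric 0/1 response
train$y <- ifelse(train$response == "Yes", 1, 0)
test$y  <- ifelse(test$response == "Yes", 1, 0)

fit <- gbm(y ~ . - response, data = train,
           distribution = "bernoulli",
           n.trees = 2000, interaction.depth = 3,
           shrinkage = 0.01, cv.folds = 5)

best_iter  <- gbm.perf(fit, method = "cv")   # pick the iteration chosen by CV
test$score <- predict(fit, newdata = test, n.trees = best_iter, type = "response")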

Findings:
The closer the training data is to a 50/50 distribution (example: 800 Yes records and 800 No records), the higher the probability scores assigned to my test data. In fact, the scores typically range from 0.01 to 0.99. This is what I would expect and what I would like from every model.

However, as the distribution becomes more unbalanced (fewer "Yes" records), to 40%/60% or even 10%/90%, the probability scores have a narrower range and the maximum score is sometimes below 0.5.

As a general rule, if a scored record has a score of 0.5 or greater, it is predicted to be in the "Yes" class. However, this rule becomes irrelevant if all scored records fall below 0.5.

The point of building the model and scoring an unknown universe is ultimately to RANK the records from most likely to least likely to belong to the "Yes" class. However, even in cases where all the scores are less than 0.5, the model does a good job of ranking the records. Is the probability score that is generated irrelevant, and should I just re-scale?
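To make the ranking point concrete, the ranking can be checked independently of the 0.5 cutoff, for example with AUC. A sketch, assuming the scored holdout from above sits in a data frame `test` with a 0/1 column `y` and a probability column `score` (pROC is just one package choice):

library(pROC)

## AUC depends only on the ordering of the scores, not on any cutoff,
## so it measures ranking quality even when every score is below 0.5
auc(roc(test$y, test$score))

## rank the scored records from most to least likely to be "Yes"
ranked <- test[order(test$score, decreasing = TRUE), ]
head(ranked)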


I have two main issues:

Issue #1 - Are these probability scores meaningful as true probabilities? That is, if a model produces zero test records with scores greater than 0.5, is the model poor? Is it common practice to simply re-scale every model's scores to run from 0 to 1?
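For what it's worth, the re-scaling I have in mind is just a min-max stretch; it preserves the ranking, but the stretched values are no longer probabilities (a sketch, again assuming a `score` column on the holdout):

## min-max stretch onto 0-1: the order is unchanged, but the result is NOT
## a calibrated probability
rescale01 <- function(s) (s - min(s)) / (max(s) - min(s))
test$score_rescaled <- rescale01(test$score)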

Issue #2 - If I re-distribute the classes in my training file to 50/50, I sometimes get more predicted "Yes" records than should be expected. For example, a mailing campaign expects a 1-5% response rate, but when the model is built on equal classes (50/50), my scored universe returns 20-25% (sometimes even more) expected respondents. When is it appropriate to balance the training classes, and when is it appropriate to leave them unchanged and unbalanced?
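(For context, a sketch of a prior-correction step that is sometimes applied when a model is trained on an artificially balanced sample but deployed on a population where the "Yes" class is much rarer; the `true_rate` and `train_rate` values below are illustrative, not mine:)

## prior correction: map a score p from a model trained with Yes-rate
## `train_rate` back to a population whose true Yes-rate is `true_rate`;
## the adjustment is monotonic, so the ranking is unchanged
adjust_prior <- function(p, true_rate, train_rate) {
  num <- p * true_rate / train_rate
  num / (num + (1 - p) * (1 - true_rate) / (1 - train_rate))
}

## e.g. a 50/50 training file but an expected 2% real-world response rate
test$score_adjusted <- adjust_prior(test$score, true_rate = 0.02, train_rate = 0.50)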


My test results from 3 models built from the same training data set. The only difference is the class distribution, noted in parentheses.

Model 1: (50/50)
  • No Records = 1,200
  • Yes Records = 1,200
  • Max Scored record = 0.95
  • Median Scored record = 0.48

Model 2: (80/20)
  • No Records = 1,200
  • Yes Records = 300
  • Max Scored record = 0.89
  • Median Scored record = 0.16

Model 3: (95/5)
  • No Records = 1,200
  • Yes Records = 64
  • Max Scored record = 0.43
  • Median Scored record = 0.039
 
#3
Sorry about that. My topic is pretty broad and covers several disciplines (modeling, regression, R, stats, probabilities, etc.), so I posted in two forums to hopefully get more coverage. I understand why you deleted it. Thanks for letting me know.
 
#4
Anybody? 200+ views and no help or thoughts?

If my post is unclear or you would like more information, please let me know.

Any help or thoughts on this subject would be appreciated. Thanks in advance!
 

bugman

Super Moderator
#5
I am familiar with their use, and I have done a couple of working tutorials on BRTs, but the details are a bit sketchy to me and I would rather not hand over incorrect advice. I found this paper very useful, as it discusses some of the finer details. Is this the sort of thing you were after?
 
#6
Thanks for your response, Bugman! I have read the paper you attached, and it's a great resource for boosted models. It certainly provides a good understanding of how the models are created and how they work.
However, it doesn't elaborate much on the two issues I have: class distribution and how to interpret the probability scores.

I have yet to find anything that specifically dives into these two issues and answers my concerns. With class distribution, you read all the time that mailing campaigns typically get response rates of 1-2% and that models are built with 50/50 splits to give the target group a "chance" of being identified. With such an imbalanced dataset, the model could reduce error simply by saying no one should be mailed and be right 98-99% of the time. My concern is when imbalanced datasets should be balanced, and whether there are any tests that can be done to identify such a case.
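A toy illustration of that last point, with made-up numbers:

## with a 2% response rate, the trivial "mail no one" rule is 98% accurate,
## which is why raw accuracy (or a fixed 0.5 cutoff) says little here
y <- c(rep(1, 20), rep(0, 980))   # 2% "Yes" in a universe of 1,000
pred_all_no <- rep(0, length(y))
mean(pred_all_no == y)            # 0.98 accuracy while catching no responders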

Thanks again for your post!
 

bugman

Super Moderator
#7
Hey Collin,

If I were in your shoes, I would go straight to the source and email the authors of that paper. I have generally found that as long as you are willing to show that you have made some effort to find a solution, they are pretty willing to help out. Sorry I can't be more help to you.

If you do find a solution, please re-post in this thread, because I wouldn't mind knowing the answer myself.

P