Post-Processing: Probability Cutoff

Buckeye

Active Member
#1
Hello,

I have a dataset of about 18,900 electric vehicles. I am building a logistic regression model to determine which factors are most predictive of a total loss (i.e., repair cost exceeds vehicle value). The dataset is imbalanced such that total losses occur in <20% of all vehicles. I am not building a fancy ML model or using k-fold cross validation. The current production model most likely uses a balance between precision and recall. I would like to optimize between sensitivity and specificity for my model. My question is, should I split my data into 80% train, 20% test and use a proportion of the training samples as a validation set to calibrate a probability cutoff value? What proportion should I use for the validation set? I say "post-processing" because I will get model estimates and then choose the cutoff value after prediction.
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
Oh fun. Yes, I would have a small random holdout set in this scenario to confirm the threshold selection. 30% could be fine.
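A minimal sketch of that idea, assuming scikit-learn. The 60/20/20 proportions, the column loading helper, and the use of Youden's J (sensitivity + specificity - 1) as the balance criterion are just illustrative choices, not anything fixed in the thread:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, confusion_matrix

# X, y: features and 0/1 total-loss indicator
X, y = load_vehicle_data()  # hypothetical placeholder for your own data loading

# 60% train, 20% validation (for picking the cutoff), 20% test (final check)
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Choose the cutoff on the validation set that maximizes Youden's J (tpr - fpr)
val_probs = model.predict_proba(X_val)[:, 1]
fpr, tpr, thresholds = roc_curve(y_val, val_probs)
best_cutoff = thresholds[np.argmax(tpr - fpr)]

# Confirm the chosen cutoff on the untouched test set
test_preds = (model.predict_proba(X_test)[:, 1] >= best_cutoff).astype(int)
print(confusion_matrix(y_test, test_preds))
```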
 

Buckeye

Active Member
#3
I decided to optimize precision and recall. Preliminary findings are that adding a variable with more detail about the type of accident (intersection, single vehicle, parking lot) helps a lot. I picked up 20% recall for pretty much the same precision.
 

hlsmith

Less is more. Stay pure. Stay poor.
#4
I haven't played around with recall and precision - I use the more medical terms, sensitivity and positive predictive value. I usually play around with making a cost matrix, where I assign values to TP, FP, TN, and FN and optimize the cutoff based on trying to minimize either FP or FN. Usually in medicine we attempt to minimize FN, since that would mean we missed a positive diagnosis. Though over-diagnosis of rare outcomes results in overtreatment of benign cases.

This could be done by giving TN and TP a value of 0, FN a value of -4, and FP a value of -1, etc., i.e., saying FNs are four times worse. Scores can be based on financial costs as well - it all depends.
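For example, a rough sketch of scoring candidate cutoffs with those values (TP = 0, TN = 0, FN = -4, FP = -1), assuming you already have held-out probabilities `val_probs` and labels `y_val` from something like the split above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

costs = {"TP": 0, "TN": 0, "FN": -4, "FP": -1}

def total_cost(y_true, probs, cutoff, costs):
    # Classify at the given cutoff and tally the cost of each cell
    preds = (probs >= cutoff).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, preds, labels=[0, 1]).ravel()
    return tp * costs["TP"] + tn * costs["TN"] + fn * costs["FN"] + fp * costs["FP"]

# Grid-search cutoffs and keep the one with the best (least negative) total
cutoffs = np.linspace(0.05, 0.95, 91)
scores = [total_cost(y_val, val_probs, c, costs) for c in cutoffs]
best_cutoff = cutoffs[int(np.argmax(scores))]
```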

I would be interested in hearing about what you end up doing - I know the business sector regularly uses F-scores or something along those lines.
 

Buckeye

Active Member
#6
Usually in medicine we attempt to minimize FN, since that would mean we missed a positive diagnosis. Though over-diagnosis of rare outcomes results in overtreatment of benign cases.
@hlsmith When you say rare, what proportion of the population has the disease? How do you typically account for this imbalance when you fit the model? Do you include weights in some way? Or can the "overtreatment" of benign cases be attributed to less informative predictors?
 

hlsmith

Less is more. Stay pure. Stay poor.
#7
Yeah, the weights can help ensure the imbalance doesn't come into play too much. Precision is a horizontal calculation in the confusion matrix, so basing something on it may be troublesome. I drafted an abstract for a conference a couple of years ago where I talked about imbalance options in general (e.g., upweighting, downweighting, both, or synthetic data). Its purpose was just to educate myself. I feel like a couple of papers may have come out recently talking about issues with synthetic data or adjustments. Given that the imbalance isn't stark - i.e., you're not at a prevalence of <1% or something - none of this is probably a huge issue.
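As a sketch of what I mean by weights, assuming scikit-learn's logistic regression (the 4x weight here is just an illustrative number, not a recommendation):

```python
from sklearn.linear_model import LogisticRegression

# 'balanced' re-weights classes inversely to their frequency in the loss ...
model = LogisticRegression(max_iter=1000, class_weight="balanced")

# ... or supply explicit weights, e.g. make total losses count 4x as much
model = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 4})
model.fit(X_train, y_train)
```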

I am likely similar to you, in that I am not an expert but just trying to do my best. In the following paper I tried to use some approaches when creating a decision tree for infection detection. I put the code in the supplemental file.

Comparison of the performance of a clinical classification tree versus clinical gestalt in predicting sepsis with extended-spectrum beta-lactamase–producing gram-negative rods | Antimicrobial Stewardship & Healthcare Epidemiology | Cambridge Core