Hello,
I have a dataset of about 18,900 electric vehicles. I am building a logistic regression model to determine which factors are most predictive of a total loss (i.e., the repair cost exceeds the vehicle's value). The dataset is imbalanced: total losses occur in fewer than 20% of all vehicles. I am not building a fancy ML model or using k-fold cross-validation. The current production model most likely balances precision and recall; I would instead like to optimize the trade-off between sensitivity and specificity for my model. My question: should I split my data 80% train / 20% test and hold out a proportion of the training samples as a validation set to calibrate a probability cutoff value? What proportion should I use for the validation set? I say "post-processing" because I will fit the model, obtain predicted probabilities, and then choose the cutoff value after prediction.
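For concreteness, here is a minimal sketch of the workflow I have in mind, using scikit-learn on a synthetic imbalanced dataset (the features and the 25%-of-train validation proportion are placeholders, not settled choices; the point is the split and the cutoff selection):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve

# Synthetic stand-in: ~18,900 rows with ~20% positives to mimic the total-loss rate
X, y = make_classification(n_samples=18900, n_features=10,
                           weights=[0.8, 0.2], random_state=0)

# 80% train / 20% test, then carve 25% of the training set off as validation
# (0.25 * 0.8 = 20% of all data), stratified to preserve the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)

# "Post-processing": choose the cutoff on the validation set only, here by
# maximizing Youden's J (sensitivity + specificity - 1); any other
# sensitivity/specificity trade-off criterion could be swapped in
fpr, tpr, thresholds = roc_curve(y_val, model.predict_proba(X_val)[:, 1])
cutoff = thresholds[np.argmax(tpr - fpr)]

# Final evaluation on the untouched test set with the chosen cutoff
test_pred = model.predict_proba(X_test)[:, 1] >= cutoff
```

The key property I want to preserve is that the test set never influences the cutoff, so the test-set sensitivity/specificity remain honest estimates.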