Simplification of formula in logistic regression

#1
I have performed a logistic regression analysis including five variables and one outcome. However, I would like to simplify the formula significantly for clinical use. So, instead of the formula being something like -12.22+2.33*systolic blood pressure-1.21*temperature etc., I would like to make a scoring system where the score is calculated on the basis of the measured values of the vital signs.

An example could be something like this:
           2 points   1 point   0 points   1 point   2 points
Pulse      -30        31-50     51-100     101-200   201-
Sys. BP    -60        61-100    101-200    201-

However, I have no idea how to find the optimal cut-off points. Do any of you have a suggestion for how to do this in a statistically correct way?
 
#2
Hello!

Firstly, I want to point out that the equation you have posted is technically incorrect for logit models: on its own it gives the log-odds of the outcome, not the probability. For the probability, it should be:

p(outcome) = e^(B0 + B1(X1) + B2(X2) + ...) / [(e^(B0 + B1(X1) + B2(X2) + ...)) + 1]

Notice that you take e to the power of the regression equation you listed, and then divide that by e to the power of the same equation plus 1. This will give you the probability of the outcome occurring. This is what you should use if you want to classify.
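
To make the formula concrete, here is a minimal Python sketch of the calculation. The function name and the coefficients in the example call are made up for illustration (they are not from your model), so treat it as a sketch rather than a finished tool.

```python
import math

def logistic_probability(intercept, coefficients, values):
    """Return p(outcome) = e^z / (e^z + 1), where z = B0 + B1*X1 + B2*X2 + ..."""
    if len(coefficients) != len(values):
        raise ValueError("need exactly one value per coefficient")
    z = intercept + sum(b * x for b, x in zip(coefficients, values))
    # 1 / (1 + e^-z) is algebraically the same as e^z / (e^z + 1);
    # branching on the sign of z keeps math.exp from overflowing.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (ez + 1.0)

# Hypothetical coefficients, for illustration only (not from a fitted model).
p = logistic_probability(-2.0, [0.03, 0.5], [40.0, 1.0])
print(round(p, 3))
```

The value returned is always between 0 and 1, so it can be read directly as a predicted probability.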

Technically you could also use a scoring or classification scheme, but it wouldn't really be any less complicated (you would use Mahalanobis distance, which involves matrix algebra). Maybe you could use discriminant analysis, but it is less robust to violations of multivariate normality, so logistic regression will probably perform better.

The main reason I don't think you should go ahead with your scoring system is that it won't take into account the nonlinear relationship between the probability of the outcome and the predictors. That is why the logit function works well: the impact of an increase in a predictor depends on the starting point.
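
A quick way to see the "depends on the starting point" effect is to apply the same increase on the logit scale from different starting values. This is only an illustrative sketch; the function names and the starting values are arbitrary choices, not anything from your data.

```python
import math

def probability(z):
    """Logistic function: maps a linear predictor z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def probability_change(z_start, step=1.0):
    """Change in probability when the linear predictor rises by `step`."""
    return probability(z_start + step) - probability(z_start)

# The same one-unit increase on the logit scale, from three starting points.
for z_start in (-4.0, 0.0, 4.0):
    print(z_start, round(probability_change(z_start), 3))
```

The change is largest when the starting probability is near 0.5 and much smaller in the tails, which a fixed number of points per band cannot reproduce.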

You might be able to figure out how to translate your logistic regression results to a scoring system like you were thinking of, but it won't have as much precision (because you are making continuous variables into 'buckets'), and honestly it would take some work to make sure your chosen cutoffs reflect your regression results. I don't think it is worth the effort; instead I would make a calculator for routine use.
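
To illustrate the loss of precision from 'buckets', here is a small sketch that scores pulse using the example cut-offs from post #1. Those cut-offs are just the example from the question, not statistically derived values, and the function and constant names are made up for this illustration.

```python
from bisect import bisect_left

# Cut-offs copied from the pulse row of the example in post #1;
# they are illustrative, not statistically derived.
PULSE_UPPER_BOUNDS = [30, 50, 100, 200]  # upper edge of every band but the last
PULSE_POINTS = [2, 1, 0, 1, 2]           # points per band, lowest band first

def pulse_points(pulse):
    """Return the points for a pulse reading under the example cut-offs."""
    band = bisect_left(PULSE_UPPER_BOUNDS, pulse)
    return PULSE_POINTS[band]

# Two readings at opposite ends of the same band get the same score,
# although a model with pulse as a continuous predictor would separate them.
print(pulse_points(101), pulse_points(200))
```

A pulse of 101 and a pulse of 200 both score 1 point here, which is exactly the information the bucketed score throws away.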

See the attached example of a calculator based on logistic regression results; you could just load in the correct coefficients from your results and plug in the scores on the predictors (on the scales you used in your model) to get a quick predicted probability of the outcome. If you do so, make sure the e^[...] calculation uses all of the coefficients; I only have it using the three coefficients you gave me. If you want to give me more results, I can set it up to be ready to use.
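
For anyone who prefers code to a spreadsheet, the same kind of calculator could be sketched in Python roughly like this. It is not the attached calculator: the function name, the predictor names and the coefficients are all hypothetical, and the check on predictor names is one possible way to make sure every coefficient is used.

```python
import math

def predicted_probability(coefficients, measurements):
    """Predicted probability from named logistic regression coefficients.

    `coefficients` maps predictor names to fitted values and must include an
    "intercept" entry; `measurements` maps the same predictor names to the
    patient's values, on the scales used when the model was fitted.
    """
    expected = set(coefficients) - {"intercept"}
    if set(measurements) != expected:
        missing = sorted(expected - set(measurements))
        extra = sorted(set(measurements) - expected)
        raise ValueError(f"missing predictors: {missing}; unexpected: {extra}")
    z = coefficients["intercept"] + sum(
        coefficients[name] * value for name, value in measurements.items()
    )
    # Numerically stable form of e^z / (e^z + 1).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (ez + 1.0)

# Hypothetical coefficients, for illustration only (not from a fitted model).
example_coefficients = {"intercept": -1.5, "systolic_bp": -0.01, "temperature": 0.08}
example_patient = {"systolic_bp": 120.0, "temperature": 38.0}
print(round(predicted_probability(example_coefficients, example_patient), 3))
```

Raising an error when a predictor is missing is a deliberate choice here: a calculator that silently skips a coefficient would still return a plausible-looking probability.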

Hope this helps!
 

CB

Super Moderator
#3
Hi there,

First off, I don't really know how to do what you're trying to do here, so perhaps my reply will not be very helpful (sorry)!

What always strikes me in these cases, though, is that the approach of simplifying a logistic regression model into some simple scoring heuristic scheme is rather strange (although it's still very popular - e.g. used in risk prediction for offenders). Nowadays, computers are ubiquitous - why do we need to simplify the scoring scheme into one that can be scored by hand? Why not save the coefficients into a spreadsheet, let the doctor or nurse enter in the predictor variable values, and provide them directly with the predicted odds that the outcome is present? This would seem a lot more accurate to me, and just as easy to use - but perhaps I'm missing something (I'm not an epidemiologist, obviously!) Hopefully an interesting point for discussion anyway :p

EDIT - snap, Marchhare beat me to it, and even provided a calculator!
 

Link

Ninja say what!?!
#4
Just wanted to say that I agree with CB's point. It'd be much more accurate to just have the clinician enter the values and have the score or probability calculated automatically. To expand on this point, though: you will not get much accuracy unless you put a lot of care into building the model. I'm assuming that you want to use this model for prediction, since you are planning to apply it to outside data. I would run models using cross-validation to try to get the best out-of-sample performance possible.
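
As a rough illustration of what cross-validating a logistic model involves, here is a small self-contained Python sketch. Everything in it is an assumption made for the example: the data are synthetic, there is a single predictor, the model is fitted with plain gradient descent, and the Brier score is just one possible measure of predictive accuracy. In practice you would use your statistics package's own fitting and cross-validation routines.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(xs, ys, learning_rate=0.1, epochs=500):
    """Fit intercept and slope for one predictor by batch gradient descent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        b0 -= learning_rate * sum(errors) / n
        b1 -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
    return b0, b1

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validated_brier(xs, ys, k=5, seed=0):
    """Mean held-out Brier score (squared error of predicted probabilities)."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        b0, b1 = fit_logistic([xs[i] for i in train], [ys[i] for i in train])
        squared = [(sigmoid(b0 + b1 * xs[i]) - ys[i]) ** 2 for i in held_out]
        scores.append(sum(squared) / len(squared))
    return sum(scores) / len(scores)

# Synthetic data for illustration only: one standardised predictor, true slope 2.
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)]
ys = [1 if rng.random() < sigmoid(2.0 * x) else 0 for x in xs]
print(round(cross_validated_brier(xs, ys), 3))
```

Each observation is predicted only by a model that never saw it during fitting, so the score is a fairer guide to how the model would do on new patients. Lower Brier scores are better; always predicting 0.5 would score 0.25.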

HTH