ROC curve for ordinal regression goodness-of-fit

#1
Hello everyone,

I have a problem with the graphical output of a ROC curve I am calculating.
I am sure it is very silly but I cannot figure out where I am going wrong.

I estimate the probability of damage (damage levels 1 to 5) to a number of buildings given one continuous explanatory variable taking n values (Xi), using ordinal regression (proportional odds).

I will call my cumulative fitted probabilities: cyfit, and my actual cumulative probabilities: pDS.

Using Matlab, I want to assess the goodness of fit of my model by plotting ROC curves (one per damage level).

I calculate the number of true positives TP as follows:
if cyfit - pDS >= 0, TP = pDS;
otherwise TP = cyfit

Similarly, for false positives FP:
if cyfit - pDS >= 0, FP = cyfit - pDS;
otherwise FP = 0

Then I sum FP and TP over the n values of X:
TotFP = sum(FP(i)), i = 1 to n
TotTP = sum(TP(i)), i = 1 to n

Then I calculate the rates:
RateFP(i) = FP(i) / TotFP
RateTP(i) = TP(i) / TotTP

Finally I calculate the cumulative rates:
cumulFP(i) = cumulFP(i-1) + RateFP(i)
cumulTP(i) = cumulTP(i-1) + RateTP(i)
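The steps above can be sketched as follows (shown in Python/NumPy for illustration; the cyfit and pDS values are made up, and the MATLAB equivalents are logical indexing and `cumsum`):

```python
import numpy as np

# Hypothetical illustrative values (not from the thread): cyfit holds the
# fitted cumulative probabilities, pDS the actual cumulative probabilities,
# one entry per value of the predictor X.
cyfit = np.array([0.10, 0.35, 0.60, 0.80, 0.95])
pDS   = np.array([0.15, 0.30, 0.55, 0.85, 0.90])

# TP = pDS where cyfit - pDS >= 0, otherwise TP = cyfit
TP = np.where(cyfit - pDS >= 0, pDS, cyfit)
# FP = cyfit - pDS where cyfit - pDS >= 0, otherwise FP = 0
FP = np.where(cyfit - pDS >= 0, cyfit - pDS, 0.0)

# Totals over the n values of X, then rates
RateTP = TP / TP.sum()
RateFP = FP / FP.sum()

# Cumulative rates: cumul(i) = cumul(i-1) + Rate(i)
cumulTP = np.cumsum(RateTP)  # plotted on the y-axis
cumulFP = np.cumsum(RateFP)  # plotted on the x-axis
```

By construction both cumulative rates are non-decreasing and end at 1, so the curve always runs from near (0, 0) to (1, 1).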

So this is where I have an issue: I plot cumulFP on the x-axis and cumulTP on the y-axis, and I end up with ROC curves way below the random-guess line, basically telling me that my fit is really bad,
when, looking at my fitted model, my curves actually seem to give a relatively decent fit...

(see attached files)

I have loads of data and I have this problem with every single case, so I think there must be some issue with my approach in plotting the ROC curve.

If anyone has an idea of where I am going wrong, please let me know...

Thanks!
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
For each individual curve, what exactly were you comparing? Were you using the full dataset, or how were you breaking it up into the individual curves?
 
#3
Originally my dataset gives me the counts of buildings per damage level, from 0 (no damage) to 5 (collapsed), according to an explanatory variable (flow depth). So basically, for each value of flow depth I know how many buildings are at level 0, 1, 2, etc.
I calculate an actual probability of damage from this, pDS.

After doing the OR I am breaking up the dataset by damage level to plot the ROC curves, and I want to know how well the model predicts the buildings being at level x. In other words, my actual positives correspond to the actual probability (or counts) of buildings being at level x, which I compare to the estimated positives given by the model (again, for each individual curve).

I worked with probabilities directly, not with counts of buildings, so in my code TP corresponds to the probabilities correctly estimated by the model, and FP corresponds to the estimated probability of being at damage level x when the buildings are not actually at damage level x.

Maybe I should work with counts?
 

hlsmith

Less is more. Stay pure. Stay poor.
#4
What are you basing this approach on? Have you seen this methodology used anywhere else? If so, can you provide the reference?

If I follow (and I am a little dubious that I do), you are comparing probabilities from odds based on a categorical version of damage against a continuous version of damage. But the probabilities originally came from the categorical versus continuous variables, and now you are comparing them again? This seems pretty muddled, regardless. You have to help us understand the approach and reasoning so that other readers can grasp this construction. Have you made any progress since your last post?
 
#5
After considering what you wrote in your comment, I realized my method did not make too much sense (sorry, I am new to this topic and trying to learn on the job), so I decided to work with one damage state at a time and with counts of buildings.
I have counts of buildings for each value of a continuous variable Xi.

I am still stuck, but this time my problem is the rate calculation. So I'll keep things as simple as they can be:

Say I want to know how many buildings are at level 5 across a range of values of my continuous predictor going from, say, 0 to 20. I fit a model, so I end up with a binary outcome: predicted counts of buildings at level 5, or not.

From there I can compare with the actual counts of buildings at level 5, or not.

So if the predicted number of buildings at level 5 (Pred5) > the actual number of buildings at level 5 (Act5):
TP = Act5; FP = Pred5 - Act5
otherwise:
TP = Pred5; FP = 0

Similarly, if the predicted number of buildings NOT at level 5 (NPred5) > the actual number of buildings NOT at level 5 (NAct5):
TN = NAct5; FN = NPred5 - NAct5
otherwise:
TN = NPred5; FN = 0

Positives = TP + FN (one value per Xi, so I have n values for "Positives")
Negatives = TN + FP (one value per Xi, so I have n values for "Negatives")

The false positive rate for each Xi is: FPR(i) = FP(i) / Negatives(i)
The true positive rate for each Xi is: TPR(i) = TP(i) / Positives(i)
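A minimal sketch of the count-based rules above (Python/NumPy for illustration, with made-up counts; the MATLAB equivalents are `min`, `max`, and `cumsum`):

```python
import numpy as np

# Hypothetical illustrative counts, one entry per value of the predictor Xi:
Pred5  = np.array([2, 5, 9, 14])    # predicted buildings at level 5
Act5   = np.array([3, 4, 10, 12])   # actual buildings at level 5
NPred5 = np.array([18, 15, 11, 6])  # predicted buildings NOT at level 5
NAct5  = np.array([17, 16, 10, 8])  # actual buildings NOT at level 5

# If Pred5 > Act5: TP = Act5, FP = Pred5 - Act5; otherwise TP = Pred5, FP = 0
TP = np.minimum(Pred5, Act5)
FP = np.maximum(Pred5 - Act5, 0)

# If NPred5 > NAct5: TN = NAct5, FN = NPred5 - NAct5; otherwise TN = NPred5, FN = 0
TN = np.minimum(NPred5, NAct5)
FN = np.maximum(NPred5 - NAct5, 0)

Positives = TP + FN
Negatives = TN + FP

# Per-Xi rates; 0/0 produces NaN whenever Negatives(i) = 0
with np.errstate(invalid="ignore", divide="ignore"):
    FPR = FP / Negatives
    TPR = TP / Positives

# Accumulating the per-Xi rates as described: the running sums are not
# bounded by 1
CFPR = np.cumsum(FPR)
CTPR = np.cumsum(TPR)
```

Note that nothing in this construction caps CTPR or CFPR at 1, since each FPR(i) and TPR(i) is itself a rate between 0 and 1 and there are n of them being summed.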

Now, if I am right, the ROC curve is a cumulative plot of TPR vs FPR. But if I calculate cumulative rates as X increases, i.e. CFPR(i) = CFPR(i-1) + FPR(i),
and the same for CTPR(i), I end up with:

1/ values higher than 1, so again this doesn't make sense in the probability domain;
2/ loads of missing values (NaN), particularly in my FPR, because I have many cases with FP = 0 and Negatives = 0 (and 0/0 gives NaN).

I think I am misunderstanding how the cumulative plot of TPR vs FPR is supposed to be built. Any additional insight would be very useful...

Thank you
 