Comparing two methods to measure disease severity

I am testing a new, fast method to score disease severity in an animal model of a human disease. M appears to be less sensitive than the gold standard scoring method, G.
G is discrete and has a range (0,1,2...20). M is continuous and varies from 0 to 1.

The linear correlation coefficient (Pearson) between M and G is about 0.6-0.7 for most studies, when scores are compared across every animal in the study.
In most studies, I have a control group and several treatment groups, each with 5-8 animals. Traditionally, we used G to score each animal and applied Dunnett's test to identify treatment groups that significantly reduced the disease severity score (just looking for p-values < 0.05). When I used M instead, Dunnett's test identified mostly the same treatment groups as significant, but generally with higher p-values. In a dose study, where multiple treatment groups received titrations of a drug, G identified a few more treatment groups as significant (p < 0.05) than M, which missed the very low titrations.

  1. Ultimately, I want to know if M is good "enough" for our needs, because it is so much faster and easier.
  2. What is the best way to formally measure the "sensitivity" of M compared to G?
  3. In practice, M identifies similar groups as significant under Dunnett's test as G does. Should I use p-values from Dunnett's test, or look at confidence intervals, to make this point? I am concerned because p-values don't address effect size, and in reality a treatment that decreases the disease score by less than 20% is not biologically interesting. In fact, compared to G, M does very well for treatments with more than a 20% effect. How do I combine effect size with confidence intervals or p-values to assess whether M can replace G in some circumstances?

I would really appreciate any help!!


Less is more. Stay pure. Stay poor.
There are many options for the many questions you asked. You know the diagnostic values of G based on a gold standard (criterion standard), correct? You would then acquire the same metrics for M based on the criterion standard (e.g., error rate, sensitivity (SEN), specificity (SPEC), PPV, NPV, accuracy). I would start with accuracy (a.k.a. the area under the ROC curve): calculate the AUC for both G and M and test whether they are statistically different. This will tell you whether they differ statistically (not clinically), so do it with an adequate sample. You can also use the classification metrics to evaluate both measures based on false positives and false negatives, see where the tests differ, and examine your disease threshold for M.
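To make the AUC part concrete: for a continuous score, the AUC equals the probability that a randomly chosen diseased animal scores higher than a randomly chosen healthy one (the Mann-Whitney formulation), so it can be computed without any toolbox. A minimal Python sketch, with made-up scores:

```python
def auc(diseased_scores, healthy_scores):
    """Probability that a diseased animal outscores a healthy one
    (ties count 1/2). Equals the area under the ROC curve."""
    wins = 0.0
    for d in diseased_scores:
        for h in healthy_scores:
            if d > h:
                wins += 1.0
            elif d == h:
                wins += 0.5
    return wins / (len(diseased_scores) * len(healthy_scores))

# Hypothetical paired readouts for the same 8 animals:
m_diseased, m_healthy = [0.7, 0.8, 0.55, 0.9], [0.2, 0.4, 0.6, 0.1]
g_diseased, g_healthy = [12, 15, 9, 18], [2, 3, 10, 0]
print("AUC(M):", auc(m_diseased, m_healthy))
print("AUC(G):", auc(g_diseased, g_healthy))
```

Running both methods through the same function gives the two AUCs to compare; testing whether they differ statistically (e.g., accounting for the pairing) is the next step.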

To evaluate whether M can replace G, you need to evaluate the risks of missing disease and cost tradeoffs (perhaps using decision trees).
Thank you for your post.
Let me see if I understand you correctly (I am a biologist without too much statistics training, so I struggle with the terminology).
One big problem I have is that although I have over 1000 animals, they are distributed across 20 different experiments designed to test various types of treatments. There is considerable study-to-study variability: for example, a control group in experiment 1 might have much higher scores than a control group in experiment 2, even though the animals were treated and measured the same way. Sample preparation varies from study to study, and this sample-preparation effect is stronger for M than for G.
In spite of this obvious problem, I pooled all 1000 datapoints together; otherwise I just don't have enough data. I labeled animals from experimental groups with a G score of less than 3 as "healthy" and plotted a histogram of their scores, across all studies. I labeled animals from groups with a G score of more than 9 as "diseased" and plotted a histogram for these. I got two distributions with little overlap (these distributions are not normal, because G is a manual method and all "healthy-looking" animals get a 0).

Then I plotted the M scores for the same animals in two histograms (these do look normal). The two M histograms overlap more than the G ones. Is this an ROC curve? It seems like it from the Wikipedia page. Do you have a suggestion for a better, more reliable resource for a beginner like myself?

So anyway, this gives me a visual idea of false positives and negatives. Then I can say: "if I am looking for a difference between scores greater than 9 and less than 3, the false-positive rate is x and the false-negative rate is y, using M or G." How do I generalize this to other score ranges? What if I want to know the false-positive rate between less than 4 and more than 11?
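To make my question concrete, this is roughly what I mean, as a Python sketch (all cutoffs and scores below are invented; animals between the two G cutoffs are excluded, as in my histograms):

```python
def error_rates(g_scores, m_scores, g_cut_healthy, g_cut_diseased, m_thresh):
    """False-positive / false-negative rates for calling disease with M,
    treating G < g_cut_healthy as truly healthy and G > g_cut_diseased
    as truly diseased (animals in between are excluded)."""
    fp = tn = fn = tp = 0
    for g, m in zip(g_scores, m_scores):
        if g < g_cut_healthy:          # truly healthy
            if m >= m_thresh:
                fp += 1
            else:
                tn += 1
        elif g > g_cut_diseased:       # truly diseased
            if m < m_thresh:
                fn += 1
            else:
                tp += 1
    fpr = fp / (fp + tn) if fp + tn else float("nan")
    fnr = fn / (fn + tp) if fn + tp else float("nan")
    return fpr, fnr

g = [0, 1, 2, 10, 12, 15, 5]
m = [0.05, 0.3, 0.1, 0.7, 0.6, 0.9, 0.4]
print(error_rates(g, m, g_cut_healthy=3, g_cut_diseased=9, m_thresh=0.25))
```

Changing `g_cut_healthy` and `g_cut_diseased` to 4 and 11 would then give the other case I asked about.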

Am I roughly on the right track, or have I misunderstood you?

I don't quite understand how I would use a decision tree, though I have a vague idea. Do you know of a biostats publication / methods paper that uses decision trees that you could suggest?


Before you get to the decision tree, you first need to establish a difference. Are you using any statistical software? The ROC curve is SEN plotted against 1 - SPEC.

Your data do not seem to be paired, or are they? (Do you have a G and an M score for the same animal, or a bunch of Gs for some animals and a bunch of Ms for other animals?) The basic formula for the area under the ROC curve is accuracy, (A + D) / (A + B + C + D), where A, B, C, and D are the cells of your 2x2 classification table for the test versus the criterion standard. Keep reading online and posting questions; it should eventually click. Statistical software would probably help, since once you figure out the basics you may need to think about controlling for the samples or batches of animals.


SEN = sensitivity
SPEC = specificity

You have a gold standard, correct?
Accuracy = (A + D) / (A + B + C + D)
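Spelled out with the usual 2x2 layout (A = true positives, B = false positives, C = false negatives, D = true negatives; the counts here are hypothetical):

```python
# Hypothetical 2x2 classification table: test vs. criterion standard.
A, B, C, D = 40, 10, 5, 45   # TP, FP, FN, TN

sensitivity = A / (A + C)          # true-positive rate (SEN)
specificity = D / (B + D)          # true-negative rate (SPEC)
accuracy = (A + D) / (A + B + C + D)

print(sensitivity, specificity, accuracy)
```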

I will try to better define terms, since you stated that you have limited statistics experience.
Yes, my data are paired: I have a G and an M score for every animal. I am fairly proficient in MATLAB and have some experience with R. In MATLAB I can only find ROC-curve functions for binary outcomes (healthy vs. diseased), but I have a range of disease severity (0-20).
I am interested in whether, within an experiment, my new method will differentiate the groups that improve with a drug from the control group, compared to the old method. Should I draw a new ROC curve for each experiment (each with about 50 animals), or one curve for the pooled data (1000 animals, but considerable variation in the control-group readout due to sample-preparation differences between studies)?

I think I understand the idea of plotting sensitivity (true positive rate) vs specificity (true negative rate).
But there is a problem: there is no one-to-one mapping between M and G, although both scores exist for every animal. G varies from 0 to 20; M varies from 0 to 1.
So say I set the disease/health threshold for G at 1 (define less than 1 as healthy, 1 or more as diseased).
A number of animals will have received a G score of less than 1. These same animals will have M values anywhere between 0 and 1, most likely near 0. But what is the cutoff for M? If I used a cutoff of, say, 0.1, I could count all the animals that scored less than 1 on G AND less than 0.1 on M, over the total (would this be sensitivity?), and all the animals that scored 1 or more on G AND 0.1 or more on M, over all animals (would this be specificity?).
I could then plot sensitivity vs. specificity for all possible thresholds in G.

But how do I define the cut-offs when the scales of G and M are different? Do I first scale M?
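Here is the procedure as I currently understand it, written as a Python sketch (toy numbers; I sweep the observed M values as thresholds rather than rescaling M, with "diseased" fixed once by a G cutoff):

```python
def roc_points(g_scores, m_scores, g_cut):
    """ROC for M, with 'diseased' defined as G >= g_cut.
    Sweeps every observed M value as a candidate threshold."""
    diseased = [g >= g_cut for g in g_scores]
    n_pos = sum(diseased)
    n_neg = len(diseased) - n_pos
    points = []
    for t in sorted(set(m_scores)) + [float("inf")]:
        tp = sum(1 for m, d in zip(m_scores, diseased) if d and m >= t)
        fp = sum(1 for m, d in zip(m_scores, diseased) if not d and m >= t)
        points.append((fp / n_neg, tp / n_pos))  # (1 - SPEC, SEN)
    return points

# Toy paired scores for 6 animals:
g = [0, 2, 5, 11, 14, 18]
m = [0.1, 0.2, 0.4, 0.5, 0.8, 0.9]
pts = roc_points(g, m, g_cut=9)
print(pts)
```

In this sketch the different scales never need to be reconciled: the G cutoff only defines the labels, and the curve is traced entirely in M's own units.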


You should determine your thresholds for disease; there are multiple methods available, including Youden's index. In your first post you stated, "M appears to be less sensitive than the gold standard scoring method, G." Do you have a true gold standard to compare G to? Does the literature report all of the diagnostic properties of G (e.g., SEN, SPEC, etc.), and are these acceptable enough to make it the gold standard? The reason I ask is that you can compare M directly to G, but if G is missing disease, your analyses can be flawed. Ideally, if G is less than perfect, you would compare both G and M to the true gold standard and examine their differences.

Also, since you have a scale of disease severity, you may be interested in running a logistic regression and calculating predicted probability estimates. These would tell you, based on your sample, how probable disease is at each severity level; this may get around having to determine a threshold and other procedures. Then you would continue to apply M and build evidence for its usefulness. Ideally you would be able to say that level-0 animals have a 0.03 probability of disease, level-2 animals a 0.08 probability, etc. (just hypothetical numbers). However, with 20 levels you would want many more animal observations to power the analyses, or you could collapse levels until you have acquired more observations.
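A self-contained sketch of that idea (plain gradient-descent logistic regression in Python/NumPy, with invented data, so no particular toolbox is assumed):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: severity level (0-20) and a binary disease call.
levels = rng.integers(0, 21, size=300)
p_true = 1 / (1 + np.exp(-(levels - 10) / 2))   # made-up true curve
disease = rng.random(300) < p_true

# Fit logistic regression of disease on (scaled) severity level
# by plain gradient descent on the log-loss.
x = levels / 20.0
w, b = 0.0, 0.0
for _ in range(20000):
    p = 1 / (1 + np.exp(-(w * x + b)))
    w -= 0.5 * np.mean((p - disease) * x)
    b -= 0.5 * np.mean(p - disease)

# Predicted probability of disease at selected severity levels.
for level in (0, 5, 10, 15, 20):
    prob = 1 / (1 + np.exp(-(w * level / 20.0 + b)))
    print(f"level {level:2d}: P(disease) = {prob:.2f}")
```

The printed table is exactly the "probability per severity level" readout described above, and the same fitted curve could then be produced for M to compare the two methods.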
There is no good gold standard other than G, unfortunately... G is not perfect (it is a subjective manual method), but it's the best we've got.
I went back to the idea of using classifiers to compute true-positive and false-positive rates as a function of the cutoffs I define as disease. This has in fact produced useful data. Thank you for suggesting ROC curves.

I think the main complication is that there is no binary disease / non-disease cutoff (the quantification is an estimate of the proportion of tissue that is diseased). Some "disease" is seen in healthy animals. But the question is whether drug treatment can decrease the score in an animal with high disease score. I suppose I could do multinomial regression for 20 scores, or bin them as you suggest...