Comparing proportions (or comparing performances of diagnostic tools)

Hey everyone, and thanks in advance for your attention.

I'm trying to compare the performance of 5 different software programs that act as diagnostic tools. All they do is ask the patient questions about symptoms.
I have given each program 20 vignettes, each of which contains 20-40 symptoms that the software should discover by asking questions.

I call the number of symptoms the software discovered, divided by the total number of symptoms the patient had in the vignette, the sensitivity. Averaging these proportions over all vignettes for each program gives the "mean sensitivity", i.e. the sensitivity of my diagnostic tool.

I call the number of symptoms the software discovered, divided by the total number of questions it asked, the "specificity". Again I get 20 proportions per program (one per vignette), which I can average for each program.
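To make the two ratios concrete, here is a minimal sketch for a single tool. The records are made-up numbers, and the names `vignettes` and `per_vignette_proportions` are hypothetical. Note that the second ratio (symptoms found per question asked) is a yield per question rather than specificity in the usual true-negative sense, so the code labels it `hit_rate`.

```python
from statistics import mean

# Hypothetical records for ONE tool, one tuple per vignette:
# (symptoms found, symptoms present in the vignette, questions asked).
# The numbers are invented for illustration only.
vignettes = [
    (12, 20, 30),
    (18, 30, 45),
    (25, 40, 50),
]

def per_vignette_proportions(records):
    """Return two lists: found/present and found/asked, one value per vignette."""
    sens = [found / present for found, present, _ in records]
    hit_rate = [found / asked for found, _, asked in records]
    return sens, hit_rate

sens, hit_rate = per_vignette_proportions(vignettes)

# Averaging the per-vignette proportions gives each vignette equal weight,
# regardless of how many symptoms it contains.
mean_sensitivity = mean(sens)
mean_hit_rate = mean(hit_rate)

print(f"mean sensitivity      = {mean_sensitivity:.3f}")
print(f"mean found / asked    = {mean_hit_rate:.3f}")
```

With the real data there would be 20 tuples per tool, and the same function would be called once for each of the 5 tools.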

Now I want to compare the performance of the diagnostic tools. How would you do that?


I thought one way to do this would be a chi-square test, which could show a significant difference between the sensitivities (or the specificities) of the different tools. What post hoc test would you use (let's say, to show whether one diagnostic tool is performing better, i.e. its ratio is larger than the others')?
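A rough sketch of what the chi-square idea would look like on pooled counts (symptoms found vs. missed, summed over all vignettes per tool). The counts are invented, and the function names are hypothetical. Pooling treats every symptom as an independent observation and ignores that the same 20 vignettes were given to every tool, so the p-value is probably too optimistic. One common post hoc option would be pairwise 2x2 comparisons with a Bonferroni correction.

```python
import math

# Hypothetical pooled counts per tool over all vignettes:
# (symptoms found, symptoms missed). Invented numbers, illustration only.
counts = {
    "A": (300, 300),
    "B": (330, 270),
    "C": (360, 240),
    "D": (390, 210),
    "E": (420, 180),
}

def chi_square_k_by_2(table):
    """Pearson chi-square statistic and degrees of freedom for a k x 2 table."""
    rows = list(table.values())
    col_totals = [sum(r[j] for r in rows) for j in range(2)]
    grand = sum(col_totals)
    stat = 0.0
    for r in rows:
        row_total = sum(r)
        for j in range(2):
            expected = row_total * col_totals[j] / grand
            stat += (r[j] - expected) ** 2 / expected
    return stat, len(rows) - 1

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form holds for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form is only valid for positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

stat, df = chi_square_k_by_2(counts)
p_value = chi2_sf_even_df(stat, df)
print(f"chi-square = {stat:.2f}, df = {df}, p = {p_value:.3g}")
```

With 5 tools the test has 4 degrees of freedom, which is why the even-df closed form is enough here; a statistics package would be the more usual route.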

Any other ideas on how to compare the groups? How could I graphically emphasize the difference in performance?

Thank you. David


TS Contributor
For each of the n=20 vignettes, you have 2*5 interval-scaled measurements, i.e. sensitivity of tool A, sensitivity
of tool B, of C, D, and E; and the same with "specificity".

I do not know whether it makes sense to perform statistical tests here (e.g. n=20 does not give much power). But
you can compare means between tools and correlations between tools. I guess the variability of the sensitivity/
specificity of the tools across the 20 vignettes could be interesting, too.
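A minimal sketch of those descriptive comparisons (means, standard deviations, and pairwise correlations across vignettes). The sensitivities are invented, only three tools and four vignettes are shown to keep it short, and the names `sens`, `pearson`, `summary` and `correlations` are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical per-vignette sensitivities. Each list uses the SAME vignette
# order, so position i in every list refers to the same vignette.
sens = {
    "A": [0.50, 0.60, 0.70, 0.80],
    "B": [0.55, 0.65, 0.75, 0.85],
    "C": [0.80, 0.70, 0.60, 0.50],
}

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Mean and sample standard deviation per tool (variability across vignettes).
summary = {tool: (mean(v), stdev(v)) for tool, v in sens.items()}

# Correlation for every pair of tools: do they struggle on the same vignettes?
tools = sorted(sens)
correlations = {
    (a, b): pearson(sens[a], sens[b])
    for i, a in enumerate(tools)
    for b in tools[i + 1:]
}

for tool, (m, s) in summary.items():
    print(f"tool {tool}: mean = {m:.3f}, sd = {s:.3f}")
for pair, r in correlations.items():
    print(f"r{pair} = {r:.3f}")
```

A high positive correlation between two tools would suggest that the same vignettes are hard for both; a negative one would suggest they fail on different cases.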

With kind regards



Less is more. Stay pure. Stay poor.
Poisson regression, with the count or rate of true positives regressed on the test (the software) as a categorical variable. Depending on the distribution of the number of true positives, another distribution/model may be applicable, but you would need to provide that information.

Did the software ever have a false positive for a condition? Also, not all symptoms contribute equally to a disease state - this approach would not address this.
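As a rough illustration of the rate idea: when the only predictor in a Poisson model with a log link is the tool (as a category) and the number of symptoms present is used as the exposure, the fitted rate for each tool is simply its pooled count divided by its pooled exposure. The sketch below computes those rates and a Wald-type interval for the rate ratio. The data are invented, the function names are hypothetical, and the interval assumes independent Poisson counts, so it ignores that every tool saw the same vignettes. Because the count of found symptoms cannot exceed the number present, a binomial model might fit better.

```python
import math

# Hypothetical data: per tool, one (symptoms found, symptoms present) pair
# per vignette. Invented numbers, illustration only.
data = {
    "A": [(10, 20), (15, 30), (20, 40)],
    "B": [(14, 20), (21, 30), (28, 40)],
}

def pooled_rate(records):
    """Total found, total exposure, and their ratio (the fitted Poisson rate)."""
    found = sum(f for f, _ in records)
    exposure = sum(n for _, n in records)
    return found, exposure, found / exposure

def rate_ratio(records, reference, z=1.96):
    """Rate ratio vs. a reference tool with an approximate 95% Wald interval."""
    y1, _, r1 = pooled_rate(records)
    y0, _, r0 = pooled_rate(reference)
    rr = r1 / r0
    se_log_rr = math.sqrt(1 / y1 + 1 / y0)  # standard error of log(rr)
    lower = rr * math.exp(-z * se_log_rr)
    upper = rr * math.exp(z * se_log_rr)
    return rr, lower, upper

rr, lower, upper = rate_ratio(data["B"], data["A"])
print(f"rate ratio B vs A = {rr:.2f} (approx. 95% CI {lower:.2f} to {upper:.2f})")
```

A rate ratio above 1 means tool B finds more symptoms per symptom present than the reference tool A.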
Thank you hlsmith. Do you mean making the dependent variable the true positive rate (what I called sensitivity, i.e. the number of symptoms the software discovered out of the total number of symptoms in each vignette) and the categorical independent variable the software's identity?

I'm not deeply familiar with Poisson regression, as I come from the medical field rather than statistics. I will try to read up a bit and ask further questions as needed.

I don't think I can define a false positive (even though I somehow defined specificity). As for symptoms contributing differently to the disease state: it is very interesting and would make the comparison more accurate, but I don't have the tools to dive into that.


Less is more. Stay pure. Stay poor.
Poisson distributions are typically used with count data, and you seem to have count data. I will think more about this as well, since all observations got all software, which is a little quirky for traditional simple Poisson regression. This also makes it similar to a cross-over design, with which I don't have much experience.

Unless @Karabiner has suggestions, perhaps searching "Poisson crossover design" may get you started.
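Because every vignette was given to every tool, one simple way to respect that pairing is to compare two tools on their per-vignette differences. The sketch below is not the Poisson cross-over model mentioned above; it swaps in an exact sign-flip permutation test on paired differences, which makes few distributional assumptions. The sensitivities are invented and the function name is hypothetical.

```python
from itertools import product

def paired_sign_flip_test(x, y):
    """Exact two-sided sign-flip permutation test on paired differences x[i] - y[i].

    Under the null hypothesis that the two tools are exchangeable within each
    vignette, every difference is equally likely to be positive or negative.
    """
    diffs = [a - b for a, b in zip(x, y)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Enumerate all 2**n sign patterns (feasible for n = 20: about a million).
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total

# Hypothetical per-vignette sensitivities for two tools, same vignette order.
tool_a = [0.60, 0.70, 0.65, 0.80, 0.75, 0.70]
tool_b = [0.50, 0.55, 0.60, 0.70, 0.60, 0.65]

p = paired_sign_flip_test(tool_a, tool_b)
print(f"two-sided p = {p:.5f}")
```

With 5 tools there are 10 such pairwise comparisons, so some multiplicity correction (for example Bonferroni) would be sensible if all of them are tested.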