Using Bayesian statistics to improve a classification task

I have a question regarding a classification problem that I think can be addressed using Bayesian statistics, but I am not familiar with Bayesian statistics and it would be great to get some support.

Suppose a traditional medical test says that the probability of a random patient being positive for disease Y is 5%, but we know the test is not accurate at identifying positive cases. My assumption is that the real probability might be closer to 30% in the population. We have been collecting data with the traditional test for the last 5 years, but for next year my team has developed a new experimental system to identify the incidence of Y.

On the positive side, the new system is great: it can identify 80% of the positive cases, while the false-positive rate among negative cases is only 5%. On the negative side, this experimental system is quite expensive, and we will not be able to roll it out to the whole patient population next year, only to a subsample (150 out of 300 patients). Finally, all my patients next year will also be checked with the traditional test, so I will be able to compare how the traditional and the experimental systems perform against each other.
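To make those numbers concrete, here is a quick Bayes'-rule check of what a positive result from the new system would actually mean under each prevalence assumption (remember that the 30% figure is my own guess, not measured data):

```python
# Positive predictive value of the new system via Bayes' theorem,
# using the figures quoted above: sensitivity 0.80, false-positive
# rate 0.05. The prevalence values are assumptions, not measurements.

def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """P(disease | test positive) via Bayes' theorem."""
    p_positive = (sensitivity * prevalence
                  + false_positive_rate * (1 - prevalence))
    return sensitivity * prevalence / p_positive

# If the true prevalence really is 30%:
ppv_high = positive_predictive_value(0.30, 0.80, 0.05)  # ~0.87
# If the traditional 5% figure were correct instead:
ppv_low = positive_predictive_value(0.05, 0.80, 0.05)   # ~0.46
```

So a positive flag from the new system carries very different evidence depending on which prevalence assumption is right, which is exactly why I want to estimate it properly.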

I would like to build a classification model, likely a logistic regression, that uses the data from the previous 5 years and calibrates my estimator by leveraging the information collected through the new experimental system, to predict the probability of disease Y given a patient's characteristics.
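To be concrete about what I mean by calibrating: one standard technique I have read about is intercept (prior) correction for logistic regression, where the fitted intercept is shifted from the base rate observed in the training data toward an assumed true rate. This is only a sketch with made-up numbers, not my actual model:

```python
import math

# Prior-correction sketch: a logistic regression fit on 5 years of
# traditional-test labels (~5% positive) can have its intercept shifted
# so predictions reflect an assumed true prevalence (e.g. 30%).
# The coefficient values below are placeholders.

def corrected_intercept(b0, observed_rate, assumed_true_rate):
    """Shift a fitted intercept from the observed base rate to the
    assumed true base rate on the log-odds scale."""
    logit = lambda p: math.log(p / (1 - p))
    return b0 + logit(assumed_true_rate) - logit(observed_rate)

def predict_prob(b0, coefs, x):
    """Standard logistic-regression probability for one patient."""
    z = b0 + sum(b * xi for b, xi in zip(coefs, x))
    return 1 / (1 + math.exp(-z))

# Placeholder fitted intercept of -2.0, recalibrated 5% -> 30%:
b0_corrected = corrected_intercept(-2.0, 0.05, 0.30)
```

Would something like this be a sensible starting point before bringing in the new system's data?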

Any suggestions/resources on how to approach this classification task would be great!


One of the uses of Bayesian analysis is incorporating previous knowledge into the current, ongoing estimation procedure. The only previous knowledge you have mentioned is the estimate of disease incidence produced by the traditional medical test: 5%. Moreover, you are saying that the test does not detect all cases of the disease, so we know that the true incidence rate lies somewhere in the interval [5%, 100%]. You can probably cap this further with a reasonable upper bound to produce [5%, UB%].

In your Bayesian procedure you can set the prior distribution as uniform on [5%, UB%]. When the new data come, you can calculate the posterior distribution of the incidence rate and estimate the incidence rate with the posterior mean... Still, if UB% does not fall too far from 100%, the ultimate Bayesian estimate will not be much different from the traditional, frequentist estimate (no Bayesian calculations).
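A minimal sketch of that procedure, using a grid approximation (the upper bound UB = 60% and the example counts are made up for illustration):

```python
# Grid-approximation posterior for the incidence rate: uniform prior on
# [lower, upper], binomial likelihood for k positives out of n patients,
# posterior mean as the point estimate. All numbers are illustrative.

def posterior_mean_incidence(k, n, lower=0.05, upper=0.60, grid_size=2000):
    grid = [lower + (upper - lower) * i / (grid_size - 1)
            for i in range(grid_size)]
    # Uniform prior is constant on the grid, so only the binomial
    # likelihood matters (the binomial coefficient cancels too).
    weights = [p**k * (1 - p)**(n - k) for p in grid]
    total = sum(weights)
    return sum(p * w for p, w in zip(grid, weights)) / total

# e.g. if the new system flags 45 of the 150 sampled patients:
estimate = posterior_mean_incidence(45, 150)  # close to 45/150 = 0.30
```

With n = 150 the likelihood dominates a flat prior, which is the point above: unless UB% bites, the Bayesian answer will track the frequentist one.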


Is there a gold standard? How do you know how accurate any of these approaches are? How do you know the new method is better?

Per @staassis's suggestion - you could weight both the traditional and the new test going forward, with a prior weight based on the prevalence observed with the traditional approach historically. You can then formally compare the accuracy of the two approaches after weighting them. You could also run them without the prior weights, as a type of sensitivity analysis.
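As a rough illustration of the weighting idea, you could update the historical 5% prevalence with each patient's two test results via their likelihood ratios. This assumes the tests are conditionally independent given disease status, and since the traditional test's sensitivity and false-positive rate were never stated, those numbers below are pure placeholders:

```python
# Sequential Bayesian update of disease probability from two test
# results, on the odds scale. Traditional-test operating characteristics
# (0.50 sensitivity, 0.10 FPR) are placeholders; the new system's
# (0.80, 0.05) come from the original post.

def update_odds(prior_prob, test_positive, sensitivity, false_positive_rate):
    """Return P(disease | test result) given P(disease) beforehand."""
    odds = prior_prob / (1 - prior_prob)
    if test_positive:
        lr = sensitivity / false_positive_rate
    else:
        lr = (1 - sensitivity) / (1 - false_positive_rate)
    post_odds = odds * lr
    return post_odds / (1 + post_odds)

# Start from the historical 5% prevalence, then apply both results
# for a patient who tests positive on both:
p = 0.05
p = update_odds(p, True, sensitivity=0.50, false_positive_rate=0.10)  # traditional
p = update_odds(p, True, sensitivity=0.80, false_positive_rate=0.05)  # new system
```

Running it with and without the first (traditional-test) update is one way to do the sensitivity analysis mentioned above.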
@hlsmith and @staassis thanks so much for your responses. There is no strong evidence about the distribution of the disease in the population, only assumptions. The new method is better than the old one because it focuses on identifying the disease - how much better? We are not sure; these are still assumptions made based on the investigators' experience. Thanks for your feedback again.