Statistical Validation of a Labelling Method

We are currently working on automatic quality estimation of medical data. We use a supervised machine learning approach, so we need labelled data. Unfortunately, expert physicians have limited time, so they cannot label all the data. Non-experts can label the data as well, but they introduce label noise because they make errors. Currently, 4 non-experts are labelling the data, and we want to verify whether this labelling has the same quality as labelling done by an expert.

- The data samples are labelled in 4 categories, 0 to 3.
- 3 to 4 experts are available, with limited time.
- 4 non-experts are available, with enough time.

Next, we need an agreement policy. The idea is to only use data that at least X of the 4 non-experts agree upon, where X is between 2 and 4, and then check whether this method yields the same quality as expert labelling.
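For concreteness, such an agreement policy could be sketched like this (a minimal illustration in plain Python; the function name and example labels are hypothetical, not from the question):

```python
from collections import Counter

def consensus_label(labels, min_agree):
    """Return the majority label if at least `min_agree` of the
    non-expert raters chose it; otherwise None (sample rejected)."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agree else None

# Hypothetical labels from the 4 non-experts for two samples:
print(consensus_label([2, 2, 2, 1], min_agree=3))  # 2 (3 of 4 agree)
print(consensus_label([0, 1, 2, 3], min_agree=2))  # None (no agreement)
```

Only the samples where `consensus_label` returns a category would be kept for training; the rest would be discarded or escalated.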

- What kind of statistical test should be used?
- How many samples should be given to the experts to make this a valid experiment?
- What type of samples should be given to the experts? In other words: should we first pick a policy, then pick samples that pass the policy and check whether the experts label them similarly? Or should we pick "random" samples and check which policy gives satisfying results?

Any ideas are welcome!


It sounds like you need to start off by doing a straight validation study. I know you have limited adjudication resources, but you need to have all adjudicators review the same observations. This means all experts and non-experts. Doing this will let you understand inter-rater reliability among the experts (who are assumed to be the gold standard) and between the experts and the non-experts. If certain non-experts perform better in some areas and vice versa, this can help guide your strategy.

Also, do you have a mechanism in place to train non-experts when they make a discordant classification, so that they understand what went wrong?

Lastly, once you settle on a strategy for going forward, you will still need some random quality checks, with experts reviewing non-expert adjudications. But first you need to establish whether there is a problem at all, by having everyone adjudicate the same samples. Also, can you associate costs with false positives and false negatives, i.e. which are more detrimental?

For full disclosure, this isn't my area of expertise.
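On the "what statistical test" question: a common agreement statistic for this setup is Cohen's kappa (two raters) or Fleiss' kappa (more than two); since the categories 0 to 3 are ordinal, a weighted kappa may be even more appropriate. As a minimal sketch (plain Python, assuming you compare one expert's labels against the non-expert consensus on the same samples), unweighted Cohen's kappa looks like this:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b, categories=(0, 1, 2, 3)):
    """Unweighted Cohen's kappa: observed agreement between two
    raters, corrected for the agreement expected by chance."""
    n = len(rater_a)
    # Proportion of samples where the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical example: expert labels vs. non-expert consensus labels.
expert    = [0, 1, 2, 3, 2, 1]
consensus = [0, 1, 2, 3, 1, 1]
print(round(cohens_kappa(expert, consensus), 3))  # 0.769
```

Values near 1 indicate near-perfect agreement; common rules of thumb treat kappa above roughly 0.8 as "almost perfect", though the threshold you require should depend on the costs of mislabelled training data.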