I am planning a reliability study to test the inter-rater reliability of a histopathological grading scheme (several pathologists grade a set of biopsies under the microscope). The grading scheme has 5 categories on an ordinal scale. There will be multiple raters. I am planning to use Krippendorff's alpha as a reliability measure.
I have 2 questions related to this:
  • Is there any way to perform a sample size calculation?
  • In which way is the result influenced by the number of cases in each category? As some of the categories in my grading scheme are rare, there is a risk that these categories will not be represented in a random sample (or will appear only in very small numbers in a proportionate sample). Is it an option to create a sample with an even distribution of all categories?
Thanks for your help.


Less is more. Stay pure. Stay poor.
I haven't done a rater test in over a decade, but regardless of whether there is an actual power calculation, you could simulate what you expect your results to look like and play around with sample sizes. I get that there could be an infinite number of scenarios to explore, so perhaps create the one you believe is most realistic, plus the two extremes of best-case and worst-case scenarios.

Correct, if you have a finite sample size and rare classifications there could be issues in ruling out chance, but I can't remember the Krippendorff formula well enough to be more useful. You can play around with this in simulations as well.
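
To make the simulation idea concrete, here is a rough Python sketch, not a validated tool. It computes Krippendorff's alpha with the ordinal difference metric for complete data and repeats the study many times under an assumed rater model. Everything about that model is my own assumption for illustration: the prevalence vectors, the probability `p_exact` that a rater reports the true grade, and the rule that a wrong grade is off by exactly one category. The function names are made up as well. Swap in whatever you believe your pathologists would actually do.

```python
import random
from collections import Counter


def krippendorff_alpha_ordinal(units, categories):
    """Ordinal Krippendorff's alpha; units is a list of rating lists, one per biopsy."""
    index = {c: i for i, c in enumerate(categories)}
    k = len(categories)
    coincidence = [[0.0] * k for _ in range(k)]
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a unit with fewer than two ratings carries no pairs
        counts = Counter(index[r] for r in ratings)
        for a, na in counts.items():
            for b, nb in counts.items():
                pairs = na * (nb - 1) if a == b else na * nb
                coincidence[a][b] += pairs / (m - 1)
    totals = [sum(row) for row in coincidence]
    n = sum(totals)

    def delta2(a, b):
        # Ordinal metric: squared count of ratings lying between the two ranks.
        lo, hi = min(a, b), max(a, b)
        return (sum(totals[lo:hi + 1]) - (totals[lo] + totals[hi]) / 2) ** 2

    cells = [(a, b) for a in range(k) for b in range(k)]
    d_obs = sum(coincidence[a][b] * delta2(a, b) for a, b in cells) / n
    d_exp = sum(totals[a] * totals[b] * delta2(a, b) for a, b in cells) / (n * (n - 1))
    if d_exp == 0:
        return float("nan")  # only one category was ever used
    return 1 - d_obs / d_exp


def simulate_alphas(n_cases, n_raters, prevalence, p_exact, n_sims, seed=0):
    """Repeat a hypothetical study n_sims times and return the alpha of each run."""
    rng = random.Random(seed)
    cats = list(range(1, len(prevalence) + 1))
    alphas = []
    for _ in range(n_sims):
        units = []
        for true_grade in rng.choices(cats, weights=prevalence, k=n_cases):
            ratings = []
            for _ in range(n_raters):
                grade = true_grade
                if rng.random() > p_exact:
                    # Assumed error model: off by one grade, clamped to the scale.
                    grade = min(max(grade + rng.choice((-1, 1)), cats[0]), cats[-1])
                ratings.append(grade)
            units.append(ratings)
        alphas.append(krippendorff_alpha_ordinal(units, cats))
    return alphas


def percentile(values, q):
    """Crude percentile by sorted position; good enough for a planning sketch."""
    ordered = sorted(values)
    return ordered[int(q * (len(ordered) - 1))]


if __name__ == "__main__":
    designs = {
        "random (skewed)": [0.40, 0.30, 0.20, 0.07, 0.03],  # assumed prevalence
        "balanced": [0.20, 0.20, 0.20, 0.20, 0.20],
    }
    for label, prevalence in designs.items():
        for n_cases in (30, 60, 120):
            sims = simulate_alphas(n_cases, 4, prevalence, 0.7, 200)
            low, high = percentile(sims, 0.025), percentile(sims, 0.975)
            print(f"{label:16s} n={n_cases:4d}  alpha range: {low:.2f} to {high:.2f}")
```

Running it prints, for each design and number of biopsies, the range in which 95% of the simulated alphas fall. You would increase the number of cases until that range is narrow enough for your purposes, and comparing the skewed and balanced designs shows how much the category distribution moves alpha under your assumptions.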

Good luck!