Intuitively or statistically: is mean interrater reliability useful?

Hello, I am a "non-statistician" writing my master's thesis in health research, and I have (many questions, but for now) one question:

I'll be using Cohen's kappa, with its many problems, and also Gwet's AC to calculate IRR in my thesis.
I'm now writing the literature review, and I find methodological review articles where the samples from the individual articles are pooled (to get a big enough sample) and a mean Cohen's kappa is calculated, for a discussion of IRR indices and how they vary by, for instance, coder properties.

Do you think there are criticisms to be made here?

1: The populations differ between the articles, and so do the methods, so pooling and then calculating a mean has some problems, right?
(My reasoning is that I have intuitively come to think of interrater reliability as a measure of the reproducibility of a specific method of data collection. In my thesis the method is a questionnaire for retrospective case record review, so all IRR / Cohen's kappa tells you is what reproducibility one would expect from this exact method/study, i.e. that questionnaire used on that sample by those coders.) The discussion that follows is then, I guess, for practical uses, where one relates the result to the categories (Landis and Koch) from poor to almost perfect IRR, and asks whether it should move up or down one or more categories.

2: Is mean kappa at all a practically useful measure? In my mind, even if you repeat the same rigid study but with two different pairs of coders, one pair could have very high agreement and the other very low, and that is all you can say about it...?

Thank you :)


TS Contributor
2: Is mean kappa at all a practically useful measure?
Actually: not at all. And I have been wondering for years why it is still in use, because its problems have been demonstrated repeatedly since the 1980s. If the marginals for both raters are not fixed (kappa was developed by Cohen for situations with fixed marginals), then kappa is not interpretable. The claim that kappa corrects for "chance agreement" is misleading, because true agreement is labeled as chance agreement. Kappa depends on the base rates; for example, a very high and true interrater agreement can be associated with an extremely poor kappa if the marginal distribution of observations is very unequal between categories. Because of kappa's dependency on base rates, comparisons between studies with different marginal rates seem pointless, and so does the calculation of summary statistics like a "mean kappa".
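To make the base-rate dependency concrete, here is a minimal sketch with my own toy numbers (not taken from any of the papers below): Cohen's kappa computed from the cell counts of a 2x2 agreement table, for two rater pairs with identical observed agreement (85%) but different marginals.

```python
# Cohen's kappa for two raters and two categories, from the cell
# counts of a 2x2 agreement table:
#   a = both raters say "yes", d = both say "no",
#   b, c = the two kinds of disagreement.
def cohens_kappa(a, b, c, d):
    n = a + b + c + d
    p_observed = (a + d) / n  # raw percent agreement
    # expected agreement from the raters' marginal ("base") rates
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (p_observed - p_expected) / (1 - p_expected)

# Both tables have 85% observed agreement (85 of 100 cases match)...
balanced = cohens_kappa(45, 5, 10, 40)  # marginals near 50/50
skewed = cohens_kappa(80, 10, 5, 5)     # ~90% of ratings in one category

print(f"balanced marginals: kappa = {balanced:.2f}")  # 0.70
print(f"skewed marginals:   kappa = {skewed:.2f}")    # 0.32
```

Same observed agreement, yet kappa drops from "substantial" to "fair" on the Landis & Koch scale purely because the marginals are skewed, which is exactly why averaging kappas across studies with different base rates is so hard to interpret.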

* Brennan RL & Prediger DJ (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement 41: 687-699.
* Feinstein AR & Cicchetti DV (1990). High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology 43: 543-549.
* Cicchetti DV & Feinstein AR (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology 43: 551-558.
* Cook RJ (1998). Kappa and its dependence on marginal rates. In: P. Armitage & T. Colton (eds.), The Encyclopedia of Biostatistics, pp. 2166-2168. New York: Wiley.
* Thompson WD & Walter SD (1988). A reappraisal of the kappa coefficient. Journal of Clinical Epidemiology 41: 949-958.