Weighted kappa - so confused

#1
I have posted about this subject regarding a variety of different inferential statistical tests...so apologies in advance.

I'm running a project where one of my investigations will look at the level of agreement between multiple raters who will be split into two groups: novice vs expert.

They will assess a variety of participants who will be completing a movement screen.

The raters will probably score the participants on an ordinal scale (1-3).

As I'm looking at levels of agreement with ordinal data, this brings me to the weighted kappa.

I am struggling to understand what my sample size should be and how this sample size should be defined.

For example, many studies that I have referenced have used only two raters and many participants.

Surely it makes sense to use as many raters as possible? However, adding participants is an easier way to get more data points (i.e. assessment scores).

So I need to know how many participants I should be recruiting, and then I can make an educated guess at the number of raters (it will be easy to recruit participants but hard to train raters).

I have looked at one paper regarding sample size calculation for kappa. It appears to suggest that in order to compute a sample size you need to know the number of options the raters can select from. Currently I do not know exactly what the scoring mechanism will be for each screen, as the selection of screens forms part of my research.

I'm essentially at a complete loss as to how to start computing my sample size.
 
#3
Is your basic question "Are the ratings given by novices demonstrably (significantly) different from the ratings given by experts?"
Yes, it will be. I suppose the hypothesis will be that there is a difference (due to skill level and knowledge level), with the null being that there is agreement.
 

katxt

Active Member
#4
I imagine that there will be a difference, otherwise you would not have experts.
You will already know the next bit, but I've put it down anyway to clarify my thoughts. The standard approach to hypothesis testing is to get a p value and say "Yes, there is a significant difference" or "No significant difference". The second conclusion is just a face-saving way of saying "I don't know if there is a difference or not", which is something to be strongly avoided after you have spent all the time, money, and goodwill with raters and participants (plus people are reluctant to publish you saying you don't know). So your problem is how many of each is enough to be reasonably certain (say 80%) of getting a significant result and so justifying the time and expense. I suspect that there is no standard answer to this and you will need to work from first principles.
Here is one approach -
First you will need to devise a method of measuring the disagreement between the two groups and getting a p value from it. We can think about this later perhaps if you want some ideas.
Next you need some data to get an idea of how strongly the groups disagree. Do a pilot study of, say, 5 in each group and 10 participants. When you analyse this data you will probably get a non-significant result, but that doesn't matter.
Use the pilot data to generate a larger data set, say 10 per group and 50 participants. Get a p value. Repeat the exercise 20 times (say) and see what proportion of that 20 is significant. This is the power. If it is around 80% then this is a suitable design.
This could be a long process unless you have programming skills. I suggest that you first work out the biggest experiment you can afford to do and find the power of that design. Then either abandon the project because the chance of success is too small, or work out a smaller design that still gives you acceptable power.
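To make that concrete, here is a rough sketch in Python of the simulate-and-count loop described above. Everything in it is an assumption to be replaced with your own choices: the misclassification-rate noise model, the quadratic-weighted kappa as the agreement measure, the permutation test on the difference in within-group agreement, and all of the numbers, which stand in for real pilot data.

```python
# Monte Carlo power sketch: simulate novice and expert ratings on a 1-3 scale,
# test whether experts agree more than novices, and count how often the test
# comes out significant. All settings below are placeholders, not recommendations.
import numpy as np
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(1)

def simulate_ratings(n_raters, n_participants, noise):
    """Each participant has a 'true' score 1-3; a rater replaces it with a
    random score with probability `noise` (a crude stand-in for pilot data)."""
    truth = rng.integers(1, 4, size=n_participants)
    ratings = np.empty((n_raters, n_participants), dtype=int)
    for r in range(n_raters):
        flip = rng.random(n_participants) < noise
        ratings[r] = np.where(flip, rng.integers(1, 4, size=n_participants), truth)
    return ratings

def mean_pairwise_kappa(ratings):
    """Average quadratic-weighted kappa over all pairs of raters in a group."""
    pairs = combinations(range(len(ratings)), 2)
    return np.mean([cohen_kappa_score(ratings[i], ratings[j], weights="quadratic")
                    for i, j in pairs])

def one_experiment(n_per_group, n_participants, expert_noise=0.10,
                   novice_noise=0.35, n_perm=99):
    """Simulate one experiment and return a one-sided permutation p value for
    'experts agree more strongly than novices'."""
    experts = simulate_ratings(n_per_group, n_participants, expert_noise)
    novices = simulate_ratings(n_per_group, n_participants, novice_noise)
    observed = mean_pairwise_kappa(experts) - mean_pairwise_kappa(novices)
    pooled = np.vstack([experts, novices])
    exceed = 0
    for _ in range(n_perm):  # shuffle raters between groups under the null
        idx = rng.permutation(len(pooled))
        diff = (mean_pairwise_kappa(pooled[idx[:n_per_group]])
                - mean_pairwise_kappa(pooled[idx[n_per_group:]]))
        exceed += diff >= observed
    return (exceed + 1) / (n_perm + 1)

# Power = proportion of simulated experiments that are significant at 0.05.
# Slow, but fine for a one-off check of a candidate design.
pvals = [one_experiment(n_per_group=10, n_participants=50) for _ in range(20)]
print("estimated power:", np.mean(np.array(pvals) < 0.05))
```

If the printed power is well below 80%, the design is probably too small; adjust the rater and participant numbers (or the assumed noise rates) and rerun.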
 
#5
Hi, thanks for your detailed answer. I agree; I have been finding info that says some form of pilot study is required... which is a bit hard for a project approval form! Additionally, I also believe that the number of raters required will differ depending on the response options. I.e. if the raters can only rate something "yes" or "no", this will have a different influence on sample size than if they can rate something "yes, no, maybe, unsure"... etc.

So, to give an estimate of sample size, what I’ve done is look through the similar literature I’ve read (inter-rater reliability). There are 7 papers, and most use 2-4 raters (one uses 20). The average number of participants doing the screen is 30, and the average number of data points (scores) is just over 100.

I think I’m going to put this on my form because, at the end of the day, I don’t even know what tests I am doing yet (i.e. what screens), thus I don’t know the scoring system I’ll use. Deciding on the screens is part of the project.
 

hlsmith

Less is more. Stay pure. Stay poor.
#7
I have not read all of the above comments, but yes, you will need to make assumptions about the data collected, differences between raters, etc. Then, given this information, I would think about trying to simulate data and run the weighted kappa calculations to explore the effect needed and the sample size that gives sufficient power. It has been a long time since I've used kappas, but having only 3 rating categories would seem to me to require more data to rule out chance differences between groups, since there won't be much spread in the ratings.
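For what it's worth, the weighted kappa itself is quick to compute once you have ratings. Below is a small, purely illustrative sketch using scikit-learn's cohen_kappa_score with an invented pair of raters on a 1-3 scale; the weights argument is what makes it a weighted (linear or quadratic) kappa rather than a plain one.

```python
# Invented ratings from two raters on a 1-3 ordinal scale, for illustration only.
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 2, 2, 3, 1, 3, 2, 1, 3, 2]
rater_b = [1, 2, 3, 3, 1, 2, 2, 1, 3, 1]

print("unweighted:", cohen_kappa_score(rater_a, rater_b))
print("linear:    ", cohen_kappa_score(rater_a, rater_b, weights="linear"))
print("quadratic: ", cohen_kappa_score(rater_a, rater_b, weights="quadratic"))
```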