Sample size - inter- and intra-rater reliability

#1
Hi all. I was posting questions on this subject in the spring but thought I would start a new thread as things have changed a little bit.

I am just completing my ethics forms for my PhD and need to outline sample sizes for all my studies (some are quant, some qual).

One study requires me to assess the reliability of a screening tool (think of a person doing an action and getting a score based on that action).

In an earlier study, a focus group is going to suggest which particular screening tools I will be using. Therefore, at present I do not know which tools will be taken forward to the reliability study, nor what their scoring mechanisms are, nor whether their outputs will be categorical or ordinal.

What I do know is that I want to test both inter and intra rater reliability.

Most of the similar studies seem to use kappa as their test statistic.

Most of the other studies also use a small number of raters (approx 2).

However, I want to test the reliability between two rater groups (novice raters and expert raters).

I have always been confused as to how many raters and participants I would need. For example, 4 raters looking at 100 participants would give more test observations (400 ratings) than 10 raters looking at 10 participants (100 ratings).

My supervisors suggest I may need to do two calculations: one to determine the sample size for raters and another for participants.

I have tried reading some papers but get confused (my statistical knowledge ends at ANOVAs).

For example, one paper tries to outline sample size calculations, but talks about completing a t-test first?!

I would appreciate some guidance on how to tackle this.

All I know is that I will probably set power to 0.8, with p = 0.05.

Anyone want to help someone very lost?
 
#3
Do you have any typical data?
Nope. The idea is that I want to take something that's been designed by people in white coats in a lab and see if people in the real world can use it (or a version of it).

To that end, in one of my earliest studies I will be presenting them with a synopsis of a variety of evidence-based screens currently developed. They pick the ones that best suit their industry.

Thus, until this stage is complete, I don't know which screens I'll be using, and therefore what kind of data they will produce.
 

katxt

Active Member
#4
All I know is that I will probably set power to 0.8, with p = 0.05.
OK. I can't quite picture what is going on. Setting power = 0.8 and p = 0.05 usually means you are planning some sort of inference test (often regarding mean values) and you want to choose a sample big enough that the probability of getting a p value of 0.05 or less is about 80%, so you can claim success.
In your case, what are you actually testing for? What p value will be generated? By what difference? By what test?
What you are looking for is sample sizes for raters and participants such that, if you have chosen them correctly, there is an excellent chance of getting p < 0.05 and you can draw some conclusion.
Assuming that all goes to plan, what will that conclusion be?
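For what it's worth, here is a minimal sketch in Python (using statsmodels) of the kind of calculation that setting usually refers to, for an ordinary two-sample t-test. The effect size is an arbitrary placeholder for illustration, not something taken from your study:

# Conventional power calculation for a two-sample t-test, just to show what
# "power = 0.8, alpha = 0.05" normally buys you. Cohen's d = 0.5 is a made-up
# planning value.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05,
                                          power=0.8, alternative='two-sided')
print(round(n_per_group))  # roughly 64 participants per group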
kat
 
#5
OK. I can't quite picture what is going on. Setting power = 0.8 and p = 0.05 usually means you are planning some sort of inference test (often regarding mean values) and you want to choose a sample big enough that the probability of getting a p value of 0.05 or less is about 80%, so you can claim success.
In your case, what are you actually testing for? What p value will be generated? By what difference? By what test?
What you are looking for is sample sizes for raters and participants such that, if you have chosen them correctly, there is an excellent chance of getting p < 0.05 and you can draw some conclusion.
Assuming that all goes to plan, what will that conclusion be?
kat
Told you I was lost!

In basic terms... I want to see if two (groups of) raters give the same score after observing the performance of a physical action.

I then want to see whether the same rater gives the same score on the performance of a physical test as they did previously.

I need to know how many raters I need to do both and how many performances (participants) I need.
 

katxt

Active Member
#6
In basic terms... I want to see if two (groups of) raters give the same score after observing the performance of a physical action.
The most common way to check this is kappa, as you have mentioned. However, expressing this as a hypothesis test is problematic. There are expressions for a standard error (SE) for kappa, and these could be used to formulate a hypothesis to be tested (K > 0, for example), but that seems particularly weak.
What value of K would you consider "agreement"?
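If it helps, here is a rough Python sketch of how kappa and a bootstrap standard error / confidence interval could be estimated once you have data. The two raters' scores below are invented placeholders, purely to show the mechanics; for ordinal scores you would probably switch to a weighted kappa (weights="quadratic").

import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Invented example: two raters scoring 20 performances on a 0-2 scale.
rater_a = np.array([0, 1, 1, 2, 0, 1, 2, 2, 1, 0, 1, 2, 0, 1, 1, 2, 0, 2, 1, 1])
rater_b = np.array([0, 1, 2, 2, 0, 1, 2, 1, 1, 0, 1, 2, 0, 1, 1, 2, 1, 2, 1, 1])

kappa = cohen_kappa_score(rater_a, rater_b)

# Bootstrap: resample performances with replacement and recompute kappa.
n = len(rater_a)
boot = []
for _ in range(2000):
    idx = rng.integers(0, n, n)
    boot.append(cohen_kappa_score(rater_a[idx], rater_b[idx]))

se = np.std(boot, ddof=1)
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
print(f"kappa = {kappa:.2f}, bootstrap SE = {se:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")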
 
#7
The most common way to check this is kappa, as you have mentioned. However, expressing this as a hypothesis test is problematic. There are expressions for a standard error (SE) for kappa, and these could be used to formulate a hypothesis to be tested (K > 0, for example), but that seems particularly weak.
What value of K would you consider "agreement"?
SE?
Not sure what similar studies have used for "agreement". Will check.

Thanks for the replies
 

katxt

Active Member
#8
In the usual hypothesis-testing situation you can detect any small difference by taking a large enough sample: you can set the power and calculate the sample size needed. That is not the case with kappa. By taking bigger and bigger samples you do get more and more precise estimates of K, but is that what we are aiming for?
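To see that point, here is a rough simulation sketch (Python). The agreement model is invented (rater B simply copies rater A's category 80% of the time), but it shows the kappa estimate tightening as the number of rated performances grows:

import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(1)

def simulate_kappa(n_subjects, p_copy=0.8, n_categories=3):
    # Toy model: rater B copies rater A with probability p_copy, otherwise guesses.
    a = rng.integers(0, n_categories, n_subjects)
    copy = rng.random(n_subjects) < p_copy
    b = np.where(copy, a, rng.integers(0, n_categories, n_subjects))
    return cohen_kappa_score(a, b)

for n in (20, 100, 500):
    estimates = [simulate_kappa(n) for _ in range(1000)]
    print(f"n = {n:>3}: spread (SD) of the kappa estimate = {np.std(estimates):.3f}")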
 
#9
The most common way to check this is kappa, as you have mentioned. However, expressing this as a hypothesis test is problematic. There are expressions for a standard error (SE) for kappa, and these could be used to formulate a hypothesis to be tested (K > 0, for example), but that seems particularly weak.
What value of K would you consider "agreement"?
Hi

If I am understanding your questions correctly (which is another issue here!), others have defined agreement as follows:

“The interpretation of Cohen’s kappa coefficient utilised the theoretical values set by Fleiss et al. (2003) as < 0.40 poor, 0.41 – 0.75 fair to good and 0.75 – 1.00 very good, with > 0.75 used as a cut off for clinically acceptable measure of interrater agreement (Sim & Wright, 2005).”

Is this what you mean?

The above paper also sets k = 0.00 (as the null value) at 80% power, with a 95% CI.

Other papers are around that cut-off level too.
 

katxt

Active Member
#10
A null of k = 0 means a test of whether there is actually some agreement (k > 0), which you will detect if you take enough data, no matter how small that agreement is. I don't really think that's what you're after. You could make a one-sided test where the hypothesis is that k > 0.75. However, you need to have some typical data to work with - a pilot study, for instance.
My personal view is that the ethics committee has asked for sample sizes as a standard question without really thinking about what they are asking for in this particular study. This doesn't seem to be a power/sample size situation at all to me.
I think I have run out of ideas. My strong suggestion is that you buy an hour's consultation time with a statistician who is used to working in the social sciences. She will either show you how to calculate the sizes (I would be interested to know too) or will give you a report to take back to the committee saying the request is unreasonable at this stage of your planning. It would be money well spent.
Regards, kat
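If you did want to pursue the one-sided k > 0.75 idea, by the way, a crude simulation is one way to attach a sample size to it. Everything in the sketch below is an assumption for planning purposes - the toy agreement model, the assumed true kappa of roughly 0.9, and the use of a bootstrap lower confidence bound as the "test" - and real pilot data would replace the toy model:

import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(2)

def simulate_pair(n_subjects, p_copy=0.9, n_categories=3):
    # Toy model: rater B copies rater A with probability p_copy, otherwise guesses.
    # Under this model the true kappa works out to roughly p_copy.
    a = rng.integers(0, n_categories, n_subjects)
    copy = rng.random(n_subjects) < p_copy
    b = np.where(copy, a, rng.integers(0, n_categories, n_subjects))
    return a, b

def lower_bound(a, b, n_boot=300):
    # One-sided 95% bootstrap lower bound for kappa.
    n = len(a)
    boot = [cohen_kappa_score(a[idx], b[idx])
            for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    return np.percentile(boot, 5)

for n in (30, 60, 120):
    hits = sum(lower_bound(*simulate_pair(n)) > 0.75 for _ in range(100))
    print(f"n = {n:>3} performances: estimated power = {hits / 100:.2f}")

The printed "power" is just the proportion of simulated studies whose lower bound clears 0.75; you would increase n until that proportion reaches about 0.8. It is brute force and a little slow, but it makes the assumptions explicit.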
 
#11
A null of k = 0 means a test of whether there is actually some agreement (k > 0), which you will detect if you take enough data, no matter how small that agreement is. I don't really think that's what you're after. You could make a one-sided test where the hypothesis is that k > 0.75. However, you need to have some typical data to work with - a pilot study, for instance.
My personal view is that the ethics committee has asked for sample sizes as a standard question without really thinking about what they are asking for in this particular study. This doesn't seem to be a power/sample size situation at all to me.
I think I have run out of ideas. My strong suggestion is that you buy an hour's consultation time with a statistician who is used to working in the social sciences. She will either show you how to calculate the sizes (I would be interested to know too) or will give you a report to take back to the committee saying the request is unreasonable at this stage of your planning. It would be money well spent.
Regards, kat
Thanks for the advice. It is something I will look into.

You say it might be an inappropriate request at this stage, but at what stage should I be able to come up with one? After I know which screens I am using, and thus how many possible scoring criteria there are for each?