I imagine that there will be a difference, otherwise you would not have experts.

You will already know the next bit, but I've put it down anyway to clarify my thoughts. The standard approach to hypothesis testing is to get a p value and conclude either "yes, there is a significant difference" or "no significant difference". The second conclusion is just a face-saving way of saying "I don't know whether there is a difference or not", which is something to be strongly avoided after you have spent all that time, money, and goodwill with raters and participants (plus people are reluctant to publish you saying you don't know). So your problem is how many of each is enough to be reasonably certain (say 80%) of getting a significant result, and so justify the time and expense. I suspect there is no standard answer to this and you will need to work from first principles.

Here is one approach -

First you will need to devise a method of measuring the disagreement between the two groups and getting a p value from it. We can think about this later perhaps if you want some ideas.

Next you need some data to get an idea of how strongly the groups disagree. Do a pilot study of, say, 5 raters in each group and 10 participants. When you analyse these data you will probably get a non-significant result, but that doesn't matter.

Use the pilot data to generate a larger simulated data set, say 10 raters per group and 50 participants. Get a p value. Repeat the exercise 20 times (say) and see what proportion of those 20 runs is significant. That proportion is the power. If it is around 80% then this is a suitable design.
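That simulate-and-count loop is only a few lines of code. Here is a minimal sketch in Python; everything in it is a placeholder — I've assumed a normal rating model, an expert-vs-novice shift of `effect` standard deviations, and a Mann-Whitney test on the per-participant group means. You would substitute the effect size estimated from your pilot and your own disagreement measure.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

def simulate_power(n_raters_per_group, n_participants, effect=0.5,
                   n_sims=1000, alpha=0.05):
    """Estimate power by simulation: each simulated experiment draws
    ratings for two rater groups, averages over raters to get one
    score per participant per group, compares the groups with a
    Mann-Whitney U test, and counts the proportion of significant
    results across simulations."""
    hits = 0
    for _ in range(n_sims):
        # Hypothetical rating model: continuous scores, with the
        # expert group shifted by `effect` standard deviations.
        novices = rng.normal(0.0, 1.0, size=(n_participants, n_raters_per_group))
        experts = rng.normal(effect, 1.0, size=(n_participants, n_raters_per_group))
        # One averaged score per participant in each group.
        _, p = mannwhitneyu(novices.mean(axis=1), experts.mean(axis=1))
        hits += p < alpha
    return hits / n_sims
```

With the pilot estimate plugged into `effect`, you can vary `n_raters_per_group` and `n_participants` until the returned power is near 0.8. Twenty repeats gives a rough answer; a computer can just as easily do 1000, which gives a much more stable estimate.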

This could be a long process unless you have programming skills. I suggest that you first work out the biggest experiment you can afford and find the power of that design. Then either abandon the project because the chance of success is too small, or work out a smaller design that still gives acceptable power.
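A concrete version of that strategy: start at a hypothetical largest affordable size and step downward, printing simulated power at each step. This sketch again uses a placeholder normal model, with a plain two-sample t-test standing in for whatever analysis you actually settle on; the effect size would come from your pilot.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

def power(n_participants, effect=0.5, n_sims=500, alpha=0.05):
    """Simulated power of a two-sample t-test at the given
    standardised mean difference (a placeholder you would
    estimate from the pilot data)."""
    sig = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n_participants)
        b = rng.normal(effect, 1.0, n_participants)
        sig += ttest_ind(a, b).pvalue < alpha
    return sig / n_sims

# Start from a hypothetical largest affordable design and shrink it,
# watching for the point where power drops below your target.
affordable = 80
for n in range(affordable, 9, -10):
    print(n, power(n))
```

You would stop shrinking at the smallest `n` whose printed power is still around your target (say 0.8), or abandon the project if even `affordable` falls short.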

Hi, thanks for your detailed answer. I agree; I have been finding info that says some form of pilot study is required, which is a bit hard for a project approval form! Additionally, I also believe that the number of raters required will differ depending on the response options: if the raters can only rate something "yes" or "no", this will have a different influence on sample size than if they can rate something "yes, no, maybe, unsure", etc.
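That point about response options can itself be checked by simulation. A hedged sketch, assuming a hypothetical underlying continuous rating that gets coarsened into `n_levels` ordered options before analysis — the thresholds, effect size, and Mann-Whitney comparison are all placeholders:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(2)

def power_for_scale(n_levels, n_participants=30, effect=0.8,
                    n_sims=400, alpha=0.05):
    """Simulated power when a continuous rating is coarsened to
    n_levels ordered response options before the two groups are
    compared with a Mann-Whitney U test."""
    # Evenly spaced hypothetical thresholds; n_levels=2 cuts at 0.
    cuts = np.linspace(-1.5, 1.5, n_levels + 1)[1:-1]
    sig = 0
    for _ in range(n_sims):
        a = np.digitize(rng.normal(0.0, 1.0, n_participants), cuts)
        b = np.digitize(rng.normal(effect, 1.0, n_participants), cuts)
        sig += mannwhitneyu(a, b).pvalue < alpha
    return sig / n_sims
```

Comparing `power_for_scale(2)` with `power_for_scale(5)` gives a rough feel for how much a yes/no scale costs you relative to a 5-option scale, under these assumptions.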

So, what I’ve done in order to give an estimate of sample size is look through the similar literature I’ve read (on inter-rater reliability). There are 7 papers, and most use a rater sample size of 2-4 (one uses 20). The average number of participants doing the screen is 30, and the average number of data points (scores) is just over 100.

I think I’m going to put this on my form because, at the end of the day, I don’t even know what tests I am doing yet (i.e. which screens), so I don’t know the scoring system I’ll use. Deciding on the screens is part of the project.