# PhD – Power calculation

#### obh

##### Active Member
Hi Burnsie,

If the goal is to compare novices to experts, then the sample size I calculated for the t-test should be okay (if you use a t-test), but even if you used the Mann-Whitney U test, I assume it would still be a good estimate.
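For reference, here is a minimal sketch of the usual normal-approximation sample-size formula behind a two-sample t-test power calculation (the function name and defaults are mine; an exact t-based tool such as G*Power will give a slightly larger n):

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided two-sample test using the
    normal approximation: n = 2 * ((z_{alpha/2} + z_beta) / d)^2,
    where d is Cohen's d (standardised mean difference)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# A "large" effect (d = 0.8) needs about 25 per group at 80% power;
# a "medium" effect (d = 0.5) needs about 63 per group.
```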

But is it really the goal?
Regarding the model, what do you expect to get? Something like:

| Screening tool | Novice | Expert | Difference |
|---|---|---|---|
| A | 2 | 2.1 | 0.1 |
| B | 2 | 2.9 | 0.9 |
| C | 3.9 | 5 | 1.1 |

Is screening tool A better than C because the difference is only 0.1?
Or is screening tool C better because both novices and experts get better results with it?

#### Burnsie_UK

##### New Member
The screen would be better if there is closer agreement between the two groups.

#### obh

##### Active Member
> The screen would be better if there is closer agreement between the two groups.

So A is better than C?
But doesn't that say that both fail to do the screening?

#### Burnsie_UK

##### New Member
Are you reporting the means of the whole data set? The scores can only be whole numbers... obviously the means can be dp's.

But yes, the closer the means, surely the closer the agreement between the novice and the expert. This means the novice finds the same info from the screen as the expert, and thus you could generalise that all novices should be able to use the screening tool to get the same data as an expert would.

#### obh

##### Active Member
What is "dp's"?

Sorry, I still don't understand the experiment.
I don't understand the scale: 1 = fail, 2 = minor faults, 3 = no faults.

Let's say you have 4 raters and 10 participants.

So the 4 raters (2 novices, 2 experts) check participant #1:

novice rater 1: 1 = fail
novice rater 2: 2 = minor faults
expert rater 1: 1 = fail
expert rater 2: 3 = no faults

What does this mean?
Does it mean the experts think (1+3)/2 = 2, i.e. minor faults,
and the novices (1+2)/2 = 1.5, i.e. between fail and minor faults?
So you assume the experts give the correct answer and you want the novices to give a similar answer?
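In code, the averaging described above looks like this (the ratings are the hypothetical ones from the example):

```python
# Hypothetical ratings for participant #1 on the 1-3 scale
# (1 = fail, 2 = minor faults, 3 = no faults).
novice_ratings = [1, 2]  # novice rater 1, novice rater 2
expert_ratings = [1, 3]  # expert rater 1, expert rater 2

novice_mean = sum(novice_ratings) / len(novice_ratings)  # 1.5
expert_mean = sum(expert_ratings) / len(expert_ratings)  # 2.0
difference = abs(expert_mean - novice_mean)              # 0.5
```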

#### Burnsie_UK

##### New Member
> What is "dp's"?
>
> Sorry, I still don't understand the experiment.
> I don't understand the scale: 1 = fail, 2 = minor faults, 3 = no faults.
>
> Let's say you have 4 raters and 10 participants.
>
> So the 4 raters (2 novices, 2 experts) check participant #1:
>
> novice rater 1: 1 = fail
> novice rater 2: 2 = minor faults
> expert rater 1: 1 = fail
> expert rater 2: 3 = no faults
>
> What does this mean?
> Does it mean the experts think (1+3)/2 = 2, i.e. minor faults,
> and the novices (1+2)/2 = 1.5, i.e. between fail and minor faults?
> So you assume the experts give the correct answer and you want the novices to give a similar answer?
Yes.

So the scale is/will be a subjective scoring system (to remove the need for experience, measuring equipment, etc.).

For example, you do a hop on one leg

If you fall over when landing = 1

if you wobble with arms out when landing = 2

if you don’t wobble when landing = 3

I want to test whether raters will report the same answers when observing a participant, irrespective of their level (i.e. novice or expert).

and..... therefore, I want to know whether I need 10 each of novices and experts testing 10 participants, or 100 each of novices and experts testing 10 participants… and this is the only thing I need at the moment (plus an understanding of why I'm giving that number).

Again, because you can only score 1, 2 or 3 on the test (as opposed to being able to score 1-100), does that mean I should have more raters, more participants, or both??
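One way to get a feel for that question is simulation. The sketch below (my own construction, not from any paper) estimates the power of a simple z-test on the novice-vs-expert mean difference, for made-up scoring probabilities on the 1-3 scale, so you can see how power changes with the number of raters:

```python
import math
import random
from statistics import NormalDist, mean, pvariance

def simulated_power(n_raters, novice_probs, expert_probs,
                    alpha=0.05, reps=2000, seed=1):
    """Monte Carlo estimate of the power of a two-sided z-test on the
    difference in mean scores, with n_raters per group on a 1-3 scale.
    novice_probs / expert_probs are each group's chances of scoring 1, 2, 3."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(reps):
        nov = rng.choices([1, 2, 3], weights=novice_probs, k=n_raters)
        exp = rng.choices([1, 2, 3], weights=expert_probs, k=n_raters)
        se = math.sqrt(pvariance(nov) / n_raters + pvariance(exp) / n_raters)
        if se > 0 and abs(mean(nov) - mean(exp)) / se > crit:
            hits += 1
    return hits / reps
```

For example, if novices mostly score 1-2 and experts mostly 2-3 (probabilities `[0.5, 0.3, 0.2]` vs `[0.2, 0.3, 0.5]`), the estimated power rises substantially going from 10 to 40 raters per group.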

(Bet you wish you'd never opened this thread, don't you!!!)

#### obh

##### Active Member
Hi Burnsie,

PS: what model did other researchers use?

I'm still trying to guess your model... it seems that you are not sure yourself?

If I understand you correctly, you test several tools on several subjects, scored by several raters?

And for each tool, you check the difference between the average mark of the novices (over all subjects) and the average mark of the experts (over all subjects).
Now you get something like:

Average difference = absolute ( average(experts) - average(novices) )

| Tool | Average difference |
|---|---|
| Tool A | 0.5 |
| Tool B | 0.8 |
| Tool C | 0.2 |
| Tool D | 0.3 |

Now you need to show that the differences 0.2 < 0.3 < 0.5 < 0.8 are significant (for each pair)?
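A sketch of that computation, using made-up group means chosen to reproduce the numbers in the table:

```python
# Hypothetical mean scores per tool (novices vs experts, over all subjects),
# chosen so the absolute differences match the table: 0.5, 0.8, 0.2, 0.3.
tool_means = {
    "Tool A": {"novices": 2.5, "experts": 3.0},
    "Tool B": {"novices": 2.0, "experts": 2.8},
    "Tool C": {"novices": 2.9, "experts": 3.1},
    "Tool D": {"novices": 2.4, "experts": 2.7},
}

# Average difference = | average(experts) - average(novices) | per tool.
avg_diff = {tool: abs(m["experts"] - m["novices"])
            for tool, m in tool_means.items()}
```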

> (Bet you wish you'd never opened this thread, don't you!!!)

It is very important to explain the research well, but briefly, at the beginning, so you don't get responses aimed at a different model...
Simple examples always help.

#### Burnsie_UK

##### New Member
Sorry for being annoying, and thanks for your speedy and detailed replies.

So we're singing from the same hymn sheet... what do you mean by "model"?

#### obh

##### Active Member
Model... what test do you want to use?

A paired t-test, where one group is the average of the novices and the other the average of the experts?

Anyway, as noetsi wrote, test power is not everything: you may have good power if you take one expert and one novice with many subjects, but still get bad results.

#### Burnsie_UK

##### New Member
Ah, the model the literature has used to compare is the general ICC and also the Kappa.

#### obh

##### Active Member
The Kappa model uses the z-distribution, so if you use this model I assume you can use the power of the z-distribution to choose a sample size,
or define the required MOE of the confidence interval.
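For what it's worth, a minimal sketch of (unweighted) Cohen's kappa for two raters; the ratings below are made up for illustration:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is
    the observed agreement and p_e the chance agreement expected from
    each rater's marginal category frequencies."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Made-up scores of one novice and one expert over 10 participants
# on the 1-3 scale; they agree on 8 of 10.
novice = [1, 2, 3, 2, 1, 3, 2, 2, 1, 3]
expert = [1, 2, 3, 3, 1, 3, 2, 1, 1, 3]
kappa = cohens_kappa(novice, expert)  # about 0.71
```

A sample-size calculation for kappa would then fix the MOE you can tolerate on kappa's confidence interval, using its standard error under the z approximation.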

#### noetsi

##### Fortran must die
Are you going to randomly assign experts and novices to the tools? How are you going to deal with differences between the groups other than the one you are interested in?

This is a different issue from statistical power, but an extremely important one. Is there anyone in your program (I mean a professor) who is an expert in design of experiments? The issues you are dealing with go way beyond statistical power.

But you may have already decided this. If you have a specific test in mind, there are tools like G*Power that help determine power; you might review those. But again, choosing your method is the real key here, not sample size.