Hi, I'd first like to say that I've put considerable effort into finding the answer myself! I think the chi-squared test is the way to go, but I just want to confirm and ask for opinions.

Background: I'm comparing apples to apples: a novel test against a gold standard (electrocardiogram, ECG). The data are categorical in that the question will be "does the ECG show X?" and the answer will be limited to yes or no. There are 30 subjects and each will have 3 ECGs: one gold standard and 2 variations from the novel test. Two independent examiners will be asked to answer the question. As all the tests should answer 'yes', there will also be 30 ECGs that should answer 'no' to act as a control. The data are not expected to be normally distributed, in that the subjects will likely all be old (>50) and male.

Any help is greatly appreciated.




Not a robit
There are three tests: do you want to compare A to B, A to C, and B to C? Do you know that the gold standard is 100% accurate? What is the sample size?
Thank you for your reply. Sorry for the omission. It will be comparing A to C and B to C with C being the gold standard. 30 subjects. The gold standard in this case is considered 100% accurate.



TS Contributor
What do you want to achieve by carrying out statistical tests of significance?
A test of significance concerns the null hypothesis that the difference
between the gold standard and the novel instrument is exactly = 0.0000 in the
population. Since the gold standard is 100% accurate, for any instrument with
only 99.999% accuracy the null hypothesis will be rejected; all you need is a large
enough sample size. But if you get a non-significant result, then you will have
made a beta (type II) error, due to your tiny sample size.
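
To put rough numbers on that argument, here is a minimal Python sketch. It assumes a deliberately simplified test in which the null hypothesis of exactly 100% accuracy is rejected as soon as a single disagreement with the gold standard is observed; the function names are made up for illustration.

```python
import math


def prob_at_least_one_error(accuracy, n):
    """Chance of seeing at least one disagreement in n independent readings.

    Under the simplified test assumed here, this is the power to reject
    the null hypothesis of exactly 100% accuracy.
    """
    if not 0.0 < accuracy < 1.0:
        raise ValueError("accuracy must be strictly between 0 and 1")
    return 1.0 - accuracy ** n


def n_for_power(accuracy, power=0.8):
    """Smallest n for which the chance of at least one disagreement reaches `power`."""
    if not 0.0 < accuracy < 1.0:
        raise ValueError("accuracy must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - power) / math.log(accuracy))


# 99.999% accuracy is the figure used in the post; 30 is the planned sample size.
power_at_30 = prob_at_least_one_error(0.99999, 30)
needed_n = n_for_power(0.99999, power=0.8)
print(f"power with 30 subjects: {power_at_30:.5f}")
print(f"subjects needed for 80% power: {needed_n}")
```

With 30 subjects the power is a tiny fraction of a percent, while a sample in the hundreds of thousands would be needed to detect such a small departure from 100% reliably.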

So, what is achieved by a test which will either tell you what you knew before
(the instruments do not agree 100% / the novel instrument is not 100% precise), or
which will give the wrong result (failure to reject the null although the
null is incorrect)? I'd guess that the actual size of the difference between the
instruments is of concern, not whether they differ at all?
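
If the size of the difference is what matters, one option is to report the proportion of agreement with the gold standard together with a confidence interval, rather than a p-value. The sketch below uses the Wilson score interval, which is my choice for illustration and not something fixed by the study design; the count of 27 agreements out of 30 is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z = 1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p_hat = successes / n
    denom = 1.0 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical result: novel test A agrees with the gold standard in 27 of 30 subjects.
low, high = wilson_interval(27, 30)
print(f"agreement = {27 / 30:.2f}, approx. 95% CI = ({low:.2f}, {high:.2f})")
```

With only 30 subjects the interval is wide, which shows directly how much (or how little) the sample can say about the difference.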

With Kind regards

Thanks Karabiner,

OK, so thinking about it, the gold standard is not 100% accurate. There is the possibility that the novel test will be more accurate in one configuration but less accurate in the other (A and B). The issue is that the analysis of the data is subjective in real life, which is why I have kept the data categorical. Yes, the actual difference is important, but there would be hundreds of variables or answers from the analyser/reviewer. This analysis is for an initial review and is not expected to be as in-depth as the final analysis.


Hi again,

So after some more research, I believe that Cohen's kappa is the correct statistic for the level of inter-rater agreement in this data. My question is whether in SPSS I would need to run two separate tests (A vs C and B vs C), or whether there is a way of doing them together?
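
For reference, Cohen's kappa is defined for one pair of ratings at a time, so A vs C and B vs C are two separate calculations. The Python sketch below (not SPSS syntax) computes both in one loop. The yes/no answers are invented purely for illustration and the function name is my own.

```python
def cohen_kappa(ratings_1, ratings_2):
    """Cohen's kappa for two equal-length sequences of categorical ratings."""
    if len(ratings_1) != len(ratings_2) or len(ratings_1) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_1)
    categories = set(ratings_1) | set(ratings_2)
    # Observed agreement: share of items given the same answer by both.
    p_observed = sum(a == b for a, b in zip(ratings_1, ratings_2)) / n
    # Chance agreement: product of the marginal proportions, summed over categories.
    p_expected = sum(
        (ratings_1.count(c) / n) * (ratings_2.count(c) / n) for c in categories
    )
    if p_expected == 1.0:
        return float("nan")  # kappa is undefined when both use a single category
    return (p_observed - p_expected) / (1.0 - p_expected)


# Invented answers to "does the ECG show X?" for 12 ECGs (6 'yes' cases, 6 controls).
gold_c = ["yes"] * 6 + ["no"] * 6
test_a = ["yes"] * 5 + ["no"] * 1 + ["no"] * 6
test_b = ["yes"] * 4 + ["no"] * 2 + ["yes"] * 1 + ["no"] * 5

kappas = {}
for label, ratings in (("A vs C", test_a), ("B vs C", test_b)):
    kappas[label] = cohen_kappa(ratings, gold_c)
    print(f"{label}: kappa = {kappas[label]:.3f}")
```

The same function could also be applied to the two examiners' answers on the same set of ECGs to measure agreement between the examiners themselves.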