I want to know whether a test I am working on has good sensitivity and specificity (compared with a gold standard test), but I am not very familiar with statistics. My doubt arises because instead of a dichotomous (yes/no) variable, there are 4 nominal variables.

Should I apply the "test vs gold standard" test considering the test as a whole, or should I do 4 different tests (one for each variable measured)? Or is there a better test for my purpose? I am asking because my test seems more specific in some categories than in others, and that would be worth discussing in my work.

Omega Contributor
Please provide much more information. I am unsure whether you are talking about 4 variables or 4 levels of one variable, etc.

Given your reply, there is a chance you may need to look at inter-rater reliability approaches instead.
Sorry, let me explain a little bit more:

I have developed a test that classifies words from a given list into 1 of 4 nominal categories (specialized word, non-specialized word, semi-specialized word and special word). In fact, the first 3 categories could be considered ordinal, but the last one is very different, so I think it is more convenient to treat all of them as nominal.

Now I have another test, which is the gold standard test to classify words according to the above-mentioned categories.

I want to see if my test is good or bad in terms of specificity and sensitivity in comparison with the gold standard test.

To make it simple: which statistical method should I apply for this purpose? (I am interested in seeing whether the test is good or bad for each category, rather than as a whole.)

Omega Contributor
Have you looked into what people do in Natural Language Processing or with naïve Bayes classifiers? I don't have experience in these areas, so I would be of little help in pointing you in a relevant direction.
Well, I have checked it out and I have seen complex models that I am not really into, since my research is only partially related to this field. But I will try to find out more.



TS Contributor
I am interested in seeing if the test is good or bad for each category, rather than as a whole
So, with regard to each category (e.g. category A), for each object you have a dichotomous outcome: "belongs to category A" / "does not belong to category A". You have these outcomes for your own measurement and for the gold standard. You can apply the common descriptive statistics for this (sensitivity, specificity, NPP, PPP...).
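
To make the "category A vs. not category A" idea concrete, here is a minimal sketch in plain Python. It is only an illustration: the function name, the category labels and the small word lists are made up for the example, not taken from the actual data. Each category in turn is treated as the "positive" class, the other three are lumped together as "negative", and the usual 2x2 counts are turned into sensitivity, specificity and the positive/negative predictive values.

```python
def one_vs_rest_metrics(gold, predicted, category):
    """Collapse multi-category labels to 'category' vs 'not category'
    and return the usual 2x2 descriptive statistics."""
    tp = fp = fn = tn = 0
    for g, p in zip(gold, predicted):
        if p == category and g == category:
            tp += 1  # both say "category"
        elif p == category:
            fp += 1  # new test says "category", gold standard disagrees
        elif g == category:
            fn += 1  # gold standard says "category", new test missed it
        else:
            tn += 1  # both say "not category"

    def ratio(num, den):
        # Undefined (NaN) when the denominator is zero.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical labels for 8 words (illustration only).
gold = ["specialized", "specialized", "non-specialized", "semi-specialized",
        "special", "special", "non-specialized", "semi-specialized"]
predicted = ["specialized", "semi-specialized", "non-specialized", "semi-specialized",
             "special", "non-specialized", "non-specialized", "specialized"]

# One set of statistics per category, as asked for in the question.
for cat in ["specialized", "non-specialized", "semi-specialized", "special"]:
    print(cat, one_vs_rest_metrics(gold, predicted, cat))
```

With real data you would replace the two lists with the gold standard labels and your test's labels, in the same word order. Reporting the four sets of numbers side by side shows directly whether the test is more specific in some categories than in others.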

