Comparing absolute values of difference scores between two groups

I have two groups (A and B) that each completed a different version of an experimental measure (measure A and measure B). Both groups also completed a standard, "gold standard" measure. For the purposes of my study, I am treating the "gold standard" measure as a participant's true score. If I look only at group means, there does not appear to be any difference between the measures: both groups obtained very similar scores on their respective versions of the experimental measure, and both groups obtained similar means on the "gold standard" measure. However, I noticed that group A's scores on measure A had a lot more variance than group B's scores on measure B. So, although measures A and B did about as well as each other and matched the gold standard means, measure A appears to be a lot less reliable when you consider the scores of individual participants. Basically, people who completed measure A tended to give more extreme scores, both overshooting and undershooting their true score (obtained by the gold standard measure), but these errors cancelled each other out, so their group mean was the same as group B's. This difference in variance is also noticeable in the correlation coefficients: measure A has a lower correlation with the gold standard measure than measure B does. I compared the correlation coefficients with a Fisher Z test, and the difference is statistically significant.
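For reference, the Fisher Z comparison of two independent correlations can be sketched in Python like this (a minimal sketch; the function name and the example correlations and sample sizes below are purely illustrative, not my actual results):

```python
import numpy as np
from scipy.stats import norm

def fisher_z_test(r1, n1, r2, n2):
    """Two-tailed test of whether two independent correlations differ."""
    z1 = np.arctanh(r1)  # Fisher r-to-z transform of each correlation
    z2 = np.arctanh(r2)
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # SE of z1 - z2
    z = (z1 - z2) / se
    p = 2 * norm.sf(abs(z))  # two-tailed p-value
    return z, p

# Example: r = .55 (n = 60) in group A vs. r = .80 (n = 60) in group B
z, p = fisher_z_test(0.55, 60, 0.80, 60)
```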

So, here's where I'm hung up:

I'm wondering if it would be okay to calculate the absolute value of the difference score between the gold standard and the experimental measure for each participant. I would then compare the means of these absolute differences between the groups with a t-test or ANOVA. I think this would be more meaningful than simply comparing correlation coefficients, because it would quantify, in meaningful units, how much each participant completing measure A missed their true score by on average, and let me make a statement about the significance of the difference between the groups. Is this kosher? Is there a better procedure for me? I hope the information I provided wasn't too vague, but I am a first-time poster and a little shy about asking questions on a public forum.
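In case it helps, here is how the proposed absolute-difference comparison could be sketched (the data below are simulated for illustration only; the variable names and error sizes are mine):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Simulated data: each participant has a gold-standard ("true") score and an
# experimental score; measure A is assumed noisier than measure B.
gold_a = rng.normal(50, 10, size=40)
gold_b = rng.normal(50, 10, size=40)
exp_a = gold_a + rng.normal(0, 8, size=40)   # measure A: large error
exp_b = gold_b + rng.normal(0, 3, size=40)   # measure B: small error

# Per-participant absolute difference from the gold standard.
abs_err_a = np.abs(exp_a - gold_a)
abs_err_b = np.abs(exp_b - gold_b)

# Welch's t-test (equal_var=False), since the error spreads clearly differ.
t, p = ttest_ind(abs_err_a, abs_err_b, equal_var=False)
```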
I don't want to be the bearer of bad tidings, but from what I understand of what you have done, you have a small problem. However, others may be able to advise on ways around it.
From my understanding, you gave group A measure A and Gold, and you gave group B measure B and Gold. Please advise if I have misunderstood. If you had given both groups both tests, you would be OK.

The tests you discuss, ANOVA and the t-test, are for comparing the same measure between groups, across conditions, across time, etc. You can't use a t-test or ANOVA to compare test A with test B. You can use ANOVA or a t-test to compare Gold scores between group A and group B.

Regarding the bigger question: what you appear to be doing is a validity/reliability analysis testing two alternative measures against a gold standard. The types of analyses to use here are correlation or regression (e.g. predicting Gold from A or B), reliability coefficients (alphas etc.), and perhaps item analyses on A and B to find which items on each perform better.

However, given that you gave tests A and B to two different groups, the conclusions you can draw will be a bit limited.

Regarding the variance of A versus B: greater variance in scores between participants is not in itself a sign that a measure is unreliable. It could actually be regarded as a good thing, because it gives you greater discrimination between scores. What you want is a high reliability coefficient, i.e. that the items are answered consistently by participants in relation to the overall measure, so that the internal consistency of the scale is good. You also want a good correlation (predictive capacity, criterion validity) between A and Gold, and between B and Gold.
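For what it's worth, internal consistency (Cronbach's alpha) is straightforward to compute from a participants-by-items score matrix; a minimal sketch (the function name is mine):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_participants, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of total scores
    return k / (k - 1) * (1 - item_vars / total_var)

# Perfectly consistent items (every item gives the same ranking) -> alpha = 1.
alpha = cronbach_alpha([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]])
```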

So you could do a few useful analyses in terms of comparing A and B with Gold, but it is hard to say for sure that A is better or worse than B, because you gave the tests to two different samples. Even if the samples are very similar and the conditions were very similar, there is a certain amount of additional unexplained variance caused by using different samples.
Thank you for the reply, statsanon. I had considered the dilemma you brought up, but I gave the measures to two different groups for a reason. I neglected to mention that participants were randomly assigned to the groups, while all other procedures were held constant. Basically, measures A and B are the same, but their instructions are different, so it would not have been possible to give one group both measures (they would basically be getting two different sets of instructions for the same measure). So I'm interested in the effect of this manipulation. Ideally, I would just use my "gold standard" measure for all participants in future research studies, but it is cumbersome and time-consuming, and I'm trying to determine which is the better version of the brief measure (A or B).

I'm not sure if this information makes my problem less of a concern, but my data is what it is, and I'm just looking for a way to present it most clearly. I am less concerned about variability within each measure than about the difference in responses between the experimental measures and the gold standard. While measures A and B both obtain similar group means, and these means are similar to those obtained by the gold standard, participants in group A show a wider spread of scores on measure A relative to their gold standard scores. Measure B appears to have more precision. Unfortunately, when you just look at overall group means, the measures appear the same.
To clarify further, when I say "similar scores" I am referring to the overall group means for each measure, not individual participant scores. When looking at individual participants, measure A overestimated and underestimated the gold standard scores more than measure B did; measure B appears to have more precision.
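If the question is specifically whether group A's difference scores are more spread out than group B's, one standard option is Levene's test for unequal variances. A sketch with simulated, purely illustrative difference scores:

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(1)

# Simulated signed difference scores (experimental minus gold standard).
diff_a = rng.normal(0, 8, size=40)  # group A: wide spread around true score
diff_b = rng.normal(0, 3, size=40)  # group B: tight spread

# Levene's test is fairly robust to non-normality; a small p-value suggests
# the two groups' difference scores have unequal spread.
stat, p = levene(diff_a, diff_b)
```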

Thanks for clarifying. In the case you describe, you could use ANOVA or a t-test to compare differences in means. However, that really isn't going to tell you all you need to know. Which of your measures, A or B, has the higher alpha, and which has the higher correlation with the Gold standard? I know you may ideally want a 1-to-1 correspondence between your alternative measure and Gold, but there could be situations where a measure has a greater range of scores yet is in fact a more reliable measure and a better predictor of the Gold standard. To test the reliability of your measures you should also consider a retest at a later date (test-retest reliability) and look at other forms of reliability and validity.

So which has the best correlation and which has the best alpha?