What test to identify words used more frequently by one group than another?


New Member
I have two groups of subjects, A and B. I have a text sample produced by each subject. My goal is to identify those words that are used significantly more frequently by one group than by the other (probably in terms of percentage of total words rather than number of uses, because group A subjects usually have longer text samples).

There are 16 subjects in group A and 18 in group B, with a few hundred words for each subject. There are about 1500 distinct words total.

A few approaches that I've considered already (please let me know if these need clarification):

Initially, I simply looked at those words for which the difference between total percentages is large -- e.g. the word "***" constitutes 7% of all words for group A but only 5% of all words for group B, yielding a large difference of 2%. This approach, however, fails to identify many words that are rarer but still have significant differences.

I also looked at ratios of percentages -- e.g. "YYY" is 8% of all words for B but only 1% of all words for A. With this approach, it's hard to tell what's significant (0.07% vs 0.01%?), and there's no clear cutoff.
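
To make the two descriptive approaches above concrete, here is a minimal sketch that computes each word's share of all words per group, then the difference and the ratio of those shares. The toy texts and the helper name `relative_freqs` are made up for illustration, and splitting on whitespace is an assumed, very crude tokenizer.

```python
from collections import Counter

def relative_freqs(texts):
    """Pool a group's texts and return each word's share of all words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Hypothetical toy samples, one string per subject.
group_a = ["the cat sat on the mat", "the dog ran"]
group_b = ["a cat ran", "a dog sat on a mat"]

freq_a = relative_freqs(group_a)
freq_b = relative_freqs(group_b)

for word in sorted(set(freq_a) | set(freq_b)):
    pa = freq_a.get(word, 0.0)
    pb = freq_b.get(word, 0.0)
    diff = pa - pb
    # The ratio is undefined when the word never occurs in group B.
    ratio = pa / pb if pb > 0 else float("inf")
    print(f"{word:>4}  A={pa:.3f}  B={pb:.3f}  diff={diff:+.3f}  ratio={ratio:.2f}")
```

Note that the ratio blows up whenever a word is absent from one group, which is part of why rare words are hard to judge this way.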

I was thinking of running an analysis in which, for each word, I would calculate the percentage that word constitutes in each subject's text. For each word, I would then run a t-test of the percentages for that word in group A vs group B. For example, if "ZZZ" was 3% for subject A1, 4% for A2, 5% for A3, 10% for B1, 11% for B2, and 12% for B3, then that word would be significantly more common in group B (if those were the only subjects). However, for all but the most common words I would be dealing with 0% for most subjects, which would skew the data enough that a t-test would be inappropriate.
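
As a rough sketch of the per-word t-test idea, the snippet below computes a pooled-variance (Student's) two-sample t statistic for the "ZZZ" numbers, using only the standard library. The function names and the rare-word percentages are hypothetical; the equal-variance form is an assumption, and turning the statistic into a p-value would still need a t distribution table or a statistics package.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """Student's two-sample t statistic (equal variances assumed) and its df.

    Fails with a division by zero if both samples have zero variance.
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    se = sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / se, na + nb - 2

def zero_share(values):
    """Fraction of subjects who never used the word."""
    return sum(1 for v in values if v == 0) / len(values)

# Per-subject percentages for "ZZZ" from the example above.
zzz_a = [3, 4, 5]
zzz_b = [10, 11, 12]
t_stat, df = pooled_t(zzz_a, zzz_b)
print(f"t = {t_stat:.2f}, df = {df}")

# Hypothetical rare word: most subjects never use it at all.
rare_word_pct = [0, 0, 0, 0.4, 0, 0, 0.3, 0]
print(f"share of zeros = {zero_share(rare_word_pct):.2f}")
```

The second part shows the concern about rare words: when most per-subject percentages are exactly zero, the data are far from the roughly normal shape the t-test relies on.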

Any suggestions about an appropriate approach would be appreciated. I'm not familiar with many techniques beyond those taught in a basic statistics course, but would be happy to learn if pointed in the right direction. Thank you!


I've seen chi-squared tests used here for each word, but I have also seen literature arguing against the practice. Not to complicate things, but you also have nested data: words nested within subjects nested within groups.
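
For reference, a per-word chi-squared test on pooled counts could be sketched as below. The counts are hypothetical (70 of 1000 words in group A vs 50 of 1000 in group B, i.e. 7% vs 5%), the function name is made up, and no continuity correction is applied. Pooling treats every word occurrence as independent, which is exactly what the nesting within subjects calls into question.

```python
def chi_squared_2x2(word_a, total_a, word_b, total_b):
    """Pearson chi-squared statistic (1 df, no continuity correction)
    for a 2x2 table: this word vs all other words, group A vs group B."""
    a, b = word_a, total_a - word_a
    c, d = word_b, total_b - word_b
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # Degenerate table, e.g. the word never occurs in either group.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

# Hypothetical pooled counts: 7% of 1000 words vs 5% of 1000 words.
stat = chi_squared_2x2(70, 1000, 50, 1000)
critical_05 = 3.841  # chi-squared critical value for 1 df at alpha = 0.05
print(f"chi2 = {stat:.3f}, exceeds 5% critical value: {stat > critical_05}")
```

With these made-up totals, even a 7% vs 5% split falls short of the 5% critical value, so how large a difference "looks" says little without the underlying counts.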


No cake for spunky
Why would you have nesting if all you care about is whether Group A differs from Group B in the words they use, which seems to be what the OP is interested in? I don't understand that point.

I am not sure what statistical test is appropriate here, but reading the OP's comments it seems to me that the issue is not which statistic to use, but how great a difference suggests that the groups vary significantly. That is an issue of theory, not statistics. If you believe, for some substantive or theoretical reason, that a one percent difference is enough to matter, then it does. There is no statistical basis to reject this (unless the issue is that you have a sample and are not sure whether it represents the population - something I did not see in the OP's comments).

For example:

This approach, however, fails to identify many words that are rarer but still have significant differences.
Based on what criteria? That is, you initially say that a certain difference is large enough to matter, but then reject the results because some words with significant differences don't show up under this definition. So the logic at the end of that paragraph seems to contradict the logic at the beginning for no obvious reason. In any case, just pick what you believe is a justified difference and stick with it.

If you believe that a one percent difference in word usage is the right number to use - for theoretical reasons, which should guide such decisions, or because of common usage in your field - then use that number. Don't reject it because some words you think are important don't show up - that is post hoc logic that IMHO invalidates your results. It's like doing a t-test and then rejecting the results because you don't agree that they make sense.

That biases the results and defeats the whole point of statistical tests :p


TS Contributor
I'm skeptical about the approach: you want to discover words that are used more often by one group than by the other, so you consider a lot of words and want to run multiple t-tests. But when you run that many tests, some words are bound to come out as used significantly more often by one group purely by chance. The asymptotics behind a t-test simply do not validate the use you are suggesting of running so many tests. Usually a small number of tests is allowed with a rule-of-thumb correction, but in your case the number is not small, since you have 1500 different words.
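
To put rough numbers on the multiple-testing concern, here is a small sketch of the arithmetic for 1500 tests at the conventional 5% level. The choice of 0.05 and of the Bonferroni correction as the "rule of thumb" are assumptions made for illustration, and the probability of at least one false positive assumes the tests are independent, which word frequencies are not.

```python
n_tests = 1500   # one test per distinct word
alpha = 0.05     # assumed per-test significance level

# If no word truly differed between groups, this many tests would still
# be expected to come out "significant" by chance alone.
expected_false_positives = n_tests * alpha

# Chance of at least one false positive, assuming independent tests.
prob_at_least_one = 1 - (1 - alpha) ** n_tests

# Bonferroni: divide alpha by the number of tests.
bonferroni_alpha = alpha / n_tests

print(f"expected false positives: {expected_false_positives:.0f}")
print(f"P(at least one false positive): {prob_at_least_one:.6f}")
print(f"Bonferroni per-test threshold: {bonferroni_alpha:.2e}")
```

With only 34 subjects, a per-test threshold this strict leaves very little power, which is the practical side of the objection.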


New Member
Thanks for the replies -- especially to noetsi. I think that basically is my problem; I'm uncomfortable using the ratios or differences because there's no clear way to assess what is meaningful and what is due to there being too few data points. For example, let's say that "AAA" is 16% of all words for A and 4% for B, while "BBB" is .05% for A and .01% for B. Intuitively, it would seem that the first difference is more meaningful than the second, even though the ratio is smaller, since in the second case we're talking about just a few data points.
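
One way to see the "few data points" worry is to convert the percentages back into approximate counts. The sketch below assumes 300 words per subject purely for illustration (the thread only says "a few hundred"), with 16 subjects in group A and 18 in group B; `expected_count` is a hypothetical helper.

```python
def expected_count(pct, n_subjects, words_per_subject):
    """Approximate number of occurrences implied by a percentage of all words."""
    return pct / 100 * n_subjects * words_per_subject

WORDS_PER_SUBJECT = 300  # assumption: "a few hundred words for each subject"

examples = {
    "AAA": (16.0, 4.0),    # percent of all words in A, in B
    "BBB": (0.05, 0.01),
}

for word, (pct_a, pct_b) in examples.items():
    count_a = expected_count(pct_a, 16, WORDS_PER_SUBJECT)
    count_b = expected_count(pct_b, 18, WORDS_PER_SUBJECT)
    print(f"{word}: about {count_a:.1f} uses in A, {count_b:.1f} in B")
```

Under that assumption "BBB" corresponds to only two or three occurrences in group A and fewer than one in group B, so its larger ratio rests on almost no data.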

If you think ratio and/or difference of percentages are a good way to approach it -- do you have any suggestions about how I could define a reasonable threshold?

Thank you!


No cake for spunky
I am not sure if it is a good or bad way to approach it statistically. My point was simply that the issue you raise does not seem to be primarily a statistical one. It is a design or theory-based one.

Because I am not an expert at all in your field, I will give my generic suggestion. You should look at the literature (you may have already) and see what threshold (and, for that matter, what method) others chose. That is by far the safest way. That way, if someone objects, you have support for your approach. I would think linguistics journals would be one place to look - again, you may have already.

Another point is that 34 cases is in all likelihood not enough to generalize to a population, and will give very weak statistical power.


TS Contributor
I couldn't agree more with Noetsi: you need some sort of theoretical guidance. You cannot set up a hypothesis test without a hypothesis, and you cannot invent your hypothesis from the data and then test it on the same data. Either you come up with a hypothesis independently of the data, e.g. by going through the relevant theoretical literature, or you adopt an exploratory design where you merely describe the differences. But even then theoretical literature might be needed. After all, should you group words according to their semantic units or their syntactic ones? Words could differ simply in length, with the hypothesis being that academics use longer words than non-academics. Words could also be grouped according to whether or not they are slang, and certain forms of slang could serve as part of constructing identity, etc. Without these basic ontological issues settled by theoretical assumptions, it is really hard, if not impossible, to get anything informative from the data.