I have two groups of subjects, A and B. I have a text sample produced by each subject. My goal is to identify those words that are used significantly more frequently by one group than by the other (probably in terms of percentage of total words rather than raw counts, because group A subjects usually produce longer text samples).
There are 16 subjects in group A and 18 in group B, with a few hundred words for each subject. There are about 1500 distinct words total.
A few approaches that I've considered already (please let me know if these need clarification):
Initially, I simply looked at those words for which the difference between total percentages is large -- e.g. the word "***" constitutes 7% of all words for group A but only 5% of all words for group B, yielding a large difference of 2%. This approach, however, fails to identify many words that are rarer but still have significant differences.
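For concreteness, here's a rough Python sketch of what I did (the token lists are made-up placeholders; the real data would come from my samples):

```python
from collections import Counter

# Hypothetical token lists, one per subject (placeholders for the real samples)
texts_a = [["the", "cat", "sat"], ["the", "dog", "ran"]]
texts_b = [["a", "cat", "ran"], ["a", "bird", "flew", "ZZZ"]]

def group_percentages(texts):
    """Pool all tokens in a group and return each word's share of the group total."""
    counts = Counter(token for text in texts for token in text)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

pct_a = group_percentages(texts_a)
pct_b = group_percentages(texts_b)

# Difference in percentage points for every word seen in either group
vocab = set(pct_a) | set(pct_b)
diffs = {w: pct_a.get(w, 0.0) - pct_b.get(w, 0.0) for w in vocab}
for w in sorted(diffs, key=lambda w: abs(diffs[w]), reverse=True)[:5]:
    print(w, diffs[w])
```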
I also looked at ratios of percentages -- e.g. "YYY" is 8% of all words for B but only 1% of all words for A. With this approach, it's hard to tell what counts as significant (is 0.07% vs 0.01% a meaningful difference or just noise from rare words?), and there's no clear cutoff.
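In code, continuing from the sketch above (and this already shows the problem: words absent from one group make the ratio undefined or infinite without some kind of smoothing):

```python
# Ratio of group percentages, reusing pct_a/pct_b from the sketch above.
# Words missing from either group are skipped here, which silently drops
# exactly the rare words whose ratios look most extreme.
ratios = {w: pct_b[w] / pct_a[w] for w in set(pct_a) & set(pct_b)}
for w in sorted(ratios, key=ratios.get, reverse=True)[:5]:  # most overused by B
    print(w, ratios[w])
```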
I was thinking of running an analysis in which, for each word, I would calculate the percentage that word constitutes in each subject's text. For each word, I would then run a t-test comparing those percentages between group A and group B. For example, if "ZZZ" were 3% for subject A1, 4% for A2, 5% for A3, 10% for B1, 11% for B2, and 12% for B3, then that word would be significantly more common in group B (if those were the only subjects). However, for all but the most common words, most subjects would have 0%, which would skew the distributions enough that a t-test would be inappropriate.
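Here's roughly what I had in mind, sketched with scipy's two-sample t-test on made-up counts (though, as I said, the zeros probably make this invalid for rare words):

```python
from collections import Counter
from scipy.stats import ttest_ind

# Hypothetical per-subject word counts (one Counter per subject's text sample)
subjects_a = [Counter(the=50, ZZZ=3), Counter(the=40, ZZZ=4), Counter(the=60, ZZZ=5)]
subjects_b = [Counter(the=30, ZZZ=10), Counter(the=35, ZZZ=11), Counter(the=45, ZZZ=12)]

def word_pcts(subjects, word):
    """Fraction of each subject's tokens that are `word` (0.0 if the word is absent)."""
    return [counts[word] / sum(counts.values()) for counts in subjects]

# Welch's t-test (equal_var=False) avoids assuming equal variance in the two groups
stat, p = ttest_ind(word_pcts(subjects_a, "ZZZ"),
                    word_pcts(subjects_b, "ZZZ"),
                    equal_var=False)
print(f"t = {stat:.2f}, p = {p:.4f}")
```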
Any suggestions about an appropriate approach would be appreciated. I'm not familiar with many techniques beyond those taught in a basic statistics course, but would be happy to learn if pointed in the right direction. Thank you!