# Identify Significance of Frequency

MojoHR

Hello everyone. I'm doing content analysis on a Twitter account. I have a number of hashtags and frequencies for each of them. Let's say I have a table like this:

Name of the Hashtag---------------Frequency
Hashtag 1-------------------------------356
Hashtag 2-------------------------------298
Hashtag 3-------------------------------255
.
.
.
Hashtag 154 to Hashtag 194-----------1

Now I want to present the most frequent hashtags in my results, and I'm wondering what the criteria should be, or how I should set a threshold, for selecting and presenting the most significant ones.
Thanks

Karabiner

What do you mean by "significant" here? Something like large, important, meaningful?

You could start by creating a line or bar chart (X-Axis: rank, Y-Axis: frequencies).
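A minimal sketch of that rank-versus-frequency view, using only the standard library and a text bar chart in place of a plotting package. Only the first three counts come from the table above; the remaining counts, and the helper names `rank_frequency` and `text_bar_chart`, are made up for illustration.

```python
# Hypothetical counts: only the first three values appear in the thread,
# the rest are invented so the example has a visible tail.
counts = {
    "Hashtag 1": 356,
    "Hashtag 2": 298,
    "Hashtag 3": 255,
    "Hashtag 4": 90,
    "Hashtag 5": 40,
    "Hashtag 6": 12,
    "Hashtag 7": 1,
}


def rank_frequency(counts):
    """Return (rank, name, frequency) tuples, most frequent first."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(rank, name, freq) for rank, (name, freq) in enumerate(ordered, start=1)]


def text_bar_chart(rows, width=50):
    """Render one line per row, with a bar scaled to the largest frequency."""
    top = max(freq for _, _, freq in rows)
    lines = []
    for rank, name, freq in rows:
        # Every row gets at least one mark so rare hashtags stay visible.
        bar = "#" * max(1, round(width * freq / top))
        lines.append(f"{rank:>3} {name:<12} {bar} {freq}")
    return lines


rows = rank_frequency(counts)
print("\n".join(text_bar_chart(rows)))
```

With real data, a sharp change in bar length between two adjacent ranks is the kind of visual break one would look for before settling on a cutoff.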

MojoHR

Well, I want to discuss a few of them (say 7 or 10) in detail, so I need statistical criteria to choose the most important ones. I mean, if the editor asks me on what grounds I chose to report only the first 10, there should be a valid justification behind the threshold. I hope this is clearer now.

Karabiner

Statistical criteria will not select the most important ones. Importance is a matter of judgement.
IMO you can use some arbitrary but sensible criterion just as well. For example, does the proposed line chart show some "elbow"? Or you could use those hashtags which together represent 50% or 30% or whatever % of the total frequency. Or a hashtag is included if its frequency is higher than the mean frequency across the 194 hashtags (I guess one could even apply some statistical test here, but this could be further complicated by the fact that these are not completely independent observations, i.e. one tweet can have more than one hashtag).
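The three rules of thumb above can be sketched as follows. This is an illustration under stated assumptions, not a recommended procedure: the counts are hypothetical apart from the first three, the function names are invented, and "elbow" is read here as the largest drop between consecutive ranks, which is only one of several possible definitions. The list is assumed to be sorted in descending order.

```python
from statistics import mean

# Hypothetical counts in descending order; only 356, 298 and 255
# come from the thread.
freqs = [356, 298, 255, 90, 40, 12, 5, 2, 1, 1]


def top_by_cumulative_share(freqs, share=0.5):
    """Smallest number of top-ranked items whose counts reach `share` of the total."""
    total = sum(freqs)
    running = 0
    for k, f in enumerate(freqs, start=1):
        running += f
        if running >= share * total:
            return k
    return len(freqs)


def above_mean(freqs):
    """Number of items whose count exceeds the mean count."""
    m = mean(freqs)
    return sum(1 for f in freqs if f > m)


def elbow_by_largest_drop(freqs):
    """Number of items ranked before the largest drop between neighbours."""
    drops = [freqs[i] - freqs[i + 1] for i in range(len(freqs) - 1)]
    return drops.index(max(drops)) + 1


print(top_by_cumulative_share(freqs, 0.5))  # hashtags covering 50% of all uses
print(above_mean(freqs))                    # hashtags above the mean frequency
print(elbow_by_largest_drop(freqs))         # hashtags before the biggest gap
```

The rules need not agree with each other, which is consistent with the point that the cutoff is a judgement call to be stated and defended rather than derived.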

MojoHR

Thanks Karabiner,
I guess what you are suggesting makes sense; I like the idea of using the mean frequency as the inclusion criterion. I'm still not sure whether a test like the Wilcoxon signed-rank test can help me here, given the dependency between observations in my corpus. I appreciate your suggestion anyway.

Karabiner

The Wilcoxon signed-rank test cannot be used here; you do not have an interval-scaled variable.

If you ignore the dependency of observations, then a Chi² test could be used for comparing frequencies between hashtags. Or you could construct a confidence interval for the mean frequency and choose hashtags with frequencies higher than the upper limit of that CI.

This doesn't make an awful lot of sense, but looks nicely "statistical".
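For anyone who wants to try both ideas anyway, here is a rough standard-library sketch. It rests on several assumptions not stated in the thread: the counts are hypothetical apart from the first three, the interval is a normal-approximation 95% CI (doubtful for counts this skewed), the Chi² comparison is read as a goodness-of-fit statistic against equal expected counts, and the dependency between hashtags in the same tweet is ignored. The function names are invented for the example.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical counts; only 356, 298 and 255 come from the thread.
freqs = [356, 298, 255, 90, 40, 12, 5, 2, 1, 1]


def mean_ci_upper(freqs, level=0.95):
    """Upper limit of a normal-approximation confidence interval for the mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean(freqs) + z * stdev(freqs) / sqrt(len(freqs))


def chi_square_uniform(freqs):
    """Pearson chi-square statistic against equal expected counts, with its df."""
    expected = sum(freqs) / len(freqs)
    stat = sum((f - expected) ** 2 / expected for f in freqs)
    return stat, len(freqs) - 1


upper = mean_ci_upper(freqs)
selected = [f for f in freqs if f > upper]  # counts above the CI's upper limit
stat, dof = chi_square_uniform(freqs)

print(f"upper CI limit: {upper:.1f}, selected counts: {selected}")
print(f"chi-square statistic: {stat:.1f} on {dof} degrees of freedom")
```

Turning the statistic into a p-value needs a chi-square distribution function, which the standard library does not provide, so that step is left out here.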

MojoHR

You're right Karabiner. Thanks a ton