Extreme Value Theory and Peaks Over Threshold Analysis to determine extreme values - the best method?

#1
Hi there,

This type of statistics is very new to me, so I would be very thankful for some assistance! Apologies in advance for any oversights or misunderstandings!

Background: I have some data on email address usage by single users (images attached), detailing how many different email accounts each user has used in a given time period. Importantly, the vast majority of users are humans; however, a small subset of users are machines.

As you can imagine, the vast majority of human users use only 1 email account, although some may use 2, 3 or more. Machine users, however, often utilise numerous accounts, which can be in the 100s or even 1,000s (think spam-type accounts). Nonetheless, it is still feasible (although unlikely) that a human user could be responsible for the use of (say) 50-100 email accounts in the time period (think manual spamming).

My objective: I am trying to calculate a threshold value which will allow me to accurately discriminate between a human user and a machine user based on the number of email accounts used (n). I am happy to accept a method/value which is too 'safe' (i.e. a higher 'cut off' value which encompasses all human users at the cost of including a small number of machine users).

Initial thoughts: I have read up on Extreme Value Theory (EVT) and have considered using a Peaks Over Threshold (POT) analysis, assuming my data fits a Generalised Pareto Distribution (GPD).
  • Would this be a suitable type of analysis to employ? Are there better alternatives?
  • Can this method work with my discrete data rather than continuous?
  • Should I use a subset of the data only? (e.g. only data where n>10?)
  • How do I calculate the parameters needed to establish a threshold value? Specifically, the shape parameter (γ) and scale parameter (β)?
I'd be really grateful for some advice on methodology, as I'd like to apply the same calculations on data from different providers and across different time periods.
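
For concreteness, the kind of POT calculation asked about above might be sketched as follows. This is only an illustration under assumptions, not a recommended method: the counts are synthetic (they are not the real data), the POT threshold u = 10 simply echoes the "n>10" example, the 1-in-10,000 target tail probability is made up, and SciPy's genpareto is just one tool that can fit γ and β by maximum likelihood. The GPD is continuous, so applying it to integer counts is itself an approximation.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)

# Synthetic stand-in for the real counts (assumption): heavy-tailed integers >= 1.
counts = np.floor(rng.pareto(1.5, size=50_000) + 1).astype(int)

u = 10  # hypothetical POT threshold, echoing the "n > 10" example
excesses = counts[counts > u] - u  # amounts by which counts exceed u

# Fit a GPD to the excesses with the location fixed at 0.
# SciPy's shape parameter corresponds to gamma, its scale to beta.
gamma, _, beta = genpareto.fit(excesses, floc=0)

# Tail estimate: P(N > x) ~= p_u * (1 - G(x - u)), where G is the fitted GPD cdf
# and p_u is the empirical fraction of users above u.
p_u = excesses.size / counts.size
target = 1e-4  # hypothetical acceptable tail probability
cutoff = u + genpareto.ppf(1 - target / p_u, gamma, loc=0, scale=beta)

print(f"gamma={gamma:.3f}, beta={beta:.3f}, cutoff={cutoff:.1f}")
```

Note that a cut-off obtained this way describes the tail of the combined (human plus machine) population; whether it separates the two groups is a separate question.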

Happy to answer any questions on my data, and many thanks for your help!
 


#2
This is a Poisson distribution; n is too large for my computer.
There's some data missing.
Neither of the above matters: clearly your threshold number is 7-10 or less, and no statistical test is going to make any number from 7 to 10 more certain.