z scores for anomaly detection

#1
hi all,

I'm exploring z-scores as a method to spot anomalies in my data. What I don't really understand is the benefit of using a z-score over just looking at the percentage difference between a value and the mean of the data.

For example, if I calculate that a value is 70% different from the mean, isn't that enough to assess whether it's an anomaly? What is the added value of calculating its z-score?

Thanks for any tips!
Pat
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
Z-scores tell you how many standard deviations a value is from the mean, and we know the general coverage of standard deviations. For example, about 68%, 95%, and 99.7% of the data land within 1, 2, and 3 standard deviations of the mean.
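To make the contrast with a plain percentage difference concrete, here is a minimal sketch. The two series and the helper names `pct_diff` and `z_score` are made up for illustration; both series have a mean of 100, so a value of 170 is 70% above the mean in each case, yet the z-scores differ widely because the spread differs.

```python
from statistics import mean, stdev

def pct_diff(value, data):
    """Percentage difference between value and the mean of data."""
    m = mean(data)
    return 100.0 * (value - m) / m

def z_score(value, data):
    """Number of sample standard deviations between value and the mean of data."""
    return (value - mean(data)) / stdev(data)

# Hypothetical histories, both with mean 100 but very different spread.
steady = [98, 101, 100, 99, 102, 100, 97, 103]
volatile = [40, 160, 90, 130, 60, 150, 70, 100]

for name, series in (("steady", steady), ("volatile", volatile)):
    print(name, round(pct_diff(170, series), 1), round(z_score(170, series), 2))
```

In the steady series 170 is far outside anything seen before, while in the volatile series it is within the usual month-to-month swing, even though the percentage difference is identical.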

@Miner - any input for a person looking for outliers/anomalies?
 

Dason

Ambassador to the humans
#3
"For example 1, 2, and 3 standard deviations represent 68%, 95%, ~99% data land within."
That's only true if the data follow a normal distribution and you're using the true parameters rather than estimates. For large enough sample sizes the estimates will mostly work, but you might want to consider a robust estimate if you expect some values that don't actually follow the distribution.

For the general case you can use Chebyshev's Inequality.
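As a rough illustration of how much weaker the distribution-free guarantee is, the sketch below compares the Chebyshev lower bound, at least 1 - 1/k^2 of the data within k standard deviations, with the coverage under a normal distribution. The function names are only illustrative.

```python
import math

def chebyshev_min_coverage(k):
    """Lower bound on the fraction within k standard deviations, for any distribution with finite variance."""
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0.0, 1.0 - 1.0 / k**2)

def normal_coverage(k):
    """Fraction within k standard deviations if the data are exactly normal."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(chebyshev_min_coverage(k), 4), round(normal_coverage(k), 4))
```

For k = 3, Chebyshev only guarantees about 89% coverage, so up to roughly 11% of points could lie beyond 3 standard deviations without contradicting it, compared with about 0.3% under normality.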
 

hlsmith

Less is more. Stay pure. Stay poor.
#4
Correct, I thought about saying that, but it is Monday. I would also imagine that if the value really were an anomaly, it would be pulling the mean toward itself, so an erroneous anomaly would actually be further from the true mean than the above process suggests.
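A small sketch of that effect, and of one possible robust alternative: a z-score built from the median and the median absolute deviation (MAD). This is a swapped-in technique, not something prescribed earlier in the thread. The data and function names are invented, and the 1.4826 factor is the usual scaling that makes the MAD comparable to a standard deviation under normality.

```python
from statistics import mean, median, stdev

def classic_z(value, data):
    """Ordinary z-score using the sample mean and standard deviation."""
    return (value - mean(data)) / stdev(data)

def robust_z(value, data):
    """Z-score analogue using the median and scaled MAD instead of mean and SD."""
    med = median(data)
    mad = median(abs(x - med) for x in data)
    if mad == 0:
        raise ValueError("MAD is zero; the robust z-score is undefined")
    return (value - med) / (1.4826 * mad)

# Hypothetical sample with one extreme value included in the estimates.
data = [100, 102, 98, 101, 99, 103, 97, 500]

print(round(classic_z(500, data), 2))  # the outlier inflates the mean and SD
print(round(robust_z(500, data), 2))   # median and MAD are barely affected
```

Here the extreme value drags the mean up to 150 and inflates the standard deviation so much that its own classic z-score stays below 3, while the median-based score flags it clearly.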
 
#5
Thanks all.
Follow-up question:
What I'm trying to work out is the best way to spot outliers in the following scenario:

I have a set of retailers that sell a subscription product. Each month I have the total number of 'new subscriptions' for each retailer, and I want to flag any unusually high (or low) number of new subscriptions. If I use the z-score method, I plan to do the following:

Calculate a z-score for each retailer based on the mean of its own new-subscription figures over the past x months, then extract the cases where the score is above 3 or below -3 (as these seem to be the usual z-score thresholds for outliers).
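A minimal sketch of that plan, under several assumptions: the retailer names and counts are invented, `window` stands in for the "past x months", and the current month is deliberately left out of its own baseline so that it cannot pull the mean and standard deviation toward itself.

```python
from statistics import mean, stdev

def flag_months(counts, window=6, threshold=3.0):
    """Return (month_index, z) pairs where a month's count is more than
    `threshold` standard deviations from the mean of the previous `window` months."""
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]  # trailing window, excludes month i
        sd = stdev(history)
        if sd == 0:
            continue  # flat history: z-score undefined, skip this month
        z = (counts[i] - mean(history)) / sd
        if abs(z) > threshold:
            flagged.append((i, z))
    return flagged

# Hypothetical monthly new-subscription counts per retailer.
subscriptions = {
    "retailer_a": [120, 125, 118, 122, 121, 119, 123, 210],
    "retailer_b": [10, 14, 9, 12, 11, 13, 12, 11],
}

flags = {name: flag_months(counts) for name, counts in subscriptions.items()}
for name, hits in flags.items():
    print(name, [(i, round(z, 2)) for i, z in hits])
```

Because each retailer is scored against its own history, the large and small retailers are not compared on the same scale. With only a handful of months in the window the mean and standard deviation are noisy estimates, so the cutoff of 3 is best treated as a rough screen rather than a precise probability statement.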

Also note that my data is heavily skewed, with a very small number of 'large retailers' making up most of the new subscriptions each month.

The reason I want to find outliers is to identify strange behaviour that may help me understand the data, e.g. have there been big price promotions for some retailers? I 'think' this method could be helpful, but I wanted to check with this forum and ask whether there are better solutions.

Thanks!