Appropriate use of z-scores

#1
I was thinking about z-scores and I'm curious about their usage when data are skewed/non-normal. I often see z-scores being used to identify outliers, e.g. with |z| > 1.96, 2.58, etc. However, the z-score calculation z = (x - mean(x)) / stdev(x) depends on the mean, and the mean is not an appropriate measure of central tendency when data are skewed. So when data are skewed, I think it's inappropriate to use z-scores, and we should instead favour an alternative like Tukey's classification (how boxplots define outliers; I can't remember exactly what it's called). Is this correct?
 

obh

Active Member
#2
Hi,

I agree with what you wrote.
Generally I prefer to use the Tukey fence.

I don't think it is a big mistake to use the normal-distribution cutoffs, because they are only a rule of thumb: flagged points still require further investigation to understand whether they really are outliers.
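
To make the comparison concrete, here is a minimal sketch of the two rules side by side. The sample is made up for illustration, the fences use the conventional 1.5 * IQR multiplier, and the quartiles come from quantile() with its default settings, so treat it as one possible way to set this up rather than the definitive one.

```r
# Made-up right-skewed sample (an assumption, not data from this thread)
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 12, 40)

# z-score rule: flag values more than 1.96 standard deviations from the mean
z <- (x - mean(x)) / sd(x)
z_outliers <- x[abs(z) > 1.96]

# Tukey fences: flag values more than 1.5 * IQR beyond the quartiles
q <- quantile(x, c(0.25, 0.75), names = FALSE)
fence <- 1.5 * (q[2] - q[1])
tukey_outliers <- x[x < q[1] - fence | x > q[2] + fence]

z_outliers      # 40
tukey_outliers  # 12 40
```

The value 40 inflates both the mean and the standard deviation, so 12 ends up with a z-score of only about 0.43 and is not flagged, while the quartile-based fences flag both. Note that boxplot() computes its hinges via fivenum(), which can differ slightly from the quantile() quartiles used here.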
 
#3
and the mean is not an appropriate measure of central tendency when data are skewed.
I don't agree. I think that the mean is an appropriate measure of central tendency, not only because the sample mean is an unbiased estimate of the population mean, but because you are often interested in the mean or the sum. Suppose you apply for a job: then it is the wage sum over a longer period that is relevant, not the week-by-week median. You will want to know the sum, i.e. the mean.
 

obh

Active Member
#4
Hi Greta :)

You could say that both the mean and the median are good measures of central tendency; each captures a different aspect. The question now is which one is better for detecting outliers.
Using the mean with a skewed distribution gives uneven tails, so you will potentially flag more outliers in one tail than in the other.
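
A rough sketch of the uneven tails, assuming a lognormal shape purely for illustration. The sample is built from evenly spaced quantiles via ppoints() so the result is reproducible without a random seed.

```r
# Deterministic right-skewed sample: 200 evenly spaced lognormal quantiles
# (the lognormal shape is an assumption chosen for illustration)
x <- qlnorm(ppoints(200))

z <- (x - mean(x)) / sd(x)

n_lower <- sum(z < -1.96)  # values flagged in the short left tail
n_upper <- sum(z >  1.96)  # values flagged in the long right tail
c(lower = n_lower, upper = n_upper)
```

With this sample all flagged values should sit in the upper tail: every value is positive, and zero itself is less than 1.96 standard deviations below the mean, so nothing on the lower side can cross the cutoff.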
 
#5
When the data are highly skewed, say the distribution of age at death, the mean is pulled in the direction of the outliers. The mean is not robust to outliers, so I think the median would be a better representation of central tendency in such cases.
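
As a small illustration with hypothetical ages at death (the numbers are invented, not real mortality data), two early deaths move the mean a lot and the median very little:

```r
# Hypothetical ages at death (made up for illustration), skewed by two early deaths
ages <- c(2, 45, 70, 74, 78, 80, 82, 85, 88, 91)

mean(ages)    # 69.5
median(ages)  # 79

# Drop the two early deaths, only to see how far each statistic moves
typical <- ages[ages >= 70]
mean(typical)    # 81
median(typical)  # 81
```

The mean shifts by 11.5 years and the median by 2. Dropping the two values here is only a way of showing sensitivity, not a suggestion that they should be deleted.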
 
#6
Ok, I see your point that it is an unbiased estimate of the population mean. But I am not sure if you are disagreeing with the entire post or with just the quoted statement?
 
#7
......and the mean is not an appropriate measure of central tendency when data are skewed.
I should have said that it was just this statement that I disagreed with.

Whether you are interested in the mean or the median (of course both are good measures of location) depends on your objective. If you apply for a job and are going to work there for 100 weeks, which would you rather be informed about: the weekly median salary or the mean salary? I would like to know the mean, since it corresponds best to the wage sum.

Or if you are at a hospital department and you want to cure patients with treatment A or B, then you are more interested in the mean, since it is related to the long-run sum.

Consider these two results for a and b:

Code:
> a <- c(1  , 2, 4,  8, 16)
> b <- c(1/4, 1, 4, 16, 64)

> mean(a)
[1] 6.2

> mean(b)
[1] 17.05

> sum(a)
[1] 31

> sum(b)
[1] 85.25
Clearly the median is the same, 4, but the mean of b is much higher.

Also, there is a tendency to talk about outliers as some kind of error. But lots of natural data are highly skewed, e.g. income distributions and concentrations of environmental pollutants. There is nothing wrong with these values, and deleting them would be wrong.