Please help... normal distribution of data and checking normality?

#1
Hi,

I have a set of data and I would like to find out whether it is parametric or non-parametric, in order to decide which test to use next to compare differences between 2 groups.

I have been told to do a histogram followed by a Shapiro-Wilk test, which I have done.

The problem is that the data represent volumes in litres, which cannot be negative. I have 30 sets of values, so a fairly small number, and they range from around 0.80 L to 4.0 L. It is impossible for these volumes to be below around 0.5 L or above 5.0 L, even in the normal population, due to what they represent.

The histogram has shown that my data is not normally distributed, because the smooth line has gone into the negatives. Furthermore, the Shapiro-Wilk test has also shown that my data is very far from being normally distributed, but again I feel it has probably taken negative values into account.

I am using SPSS.

Can anyone please give some advice?
 

Karabiner

TS Contributor
#2
I have a set of data and I would like to find out whether it is parametric or non-parametric
There is no such thing as parametric or non-parametric data.
There are non-parametric tests, though, i.e. tests which do not require certain assumptions to be fulfilled,
e.g. that measurements are interval scaled.
I have been told to do a histogram followed by a Shapiro-Wilk test, which I have done.
Histograms are useless, IMHO, because assumptions deal with distributions in the population,
not in the sample. The Shapiro-Wilk test is useless, IMHO, because with a small sample it has little
statistical power and cannot detect medium-sized deviations. With a large sample, it has too much
power and detects irrelevant deviations.
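
To illustrate the power issue, here is a minimal simulation sketch (in Python/SciPy rather than SPSS; the sample sizes and the mildly skewed gamma population are purely illustrative assumptions, not your data):

Code:
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
reps = 1000

for n in (30, 3000):
    # draw mildly non-normal samples and count how often Shapiro-Wilk rejects
    rejections = 0
    for _ in range(reps):
        x = rng.gamma(shape=10.0, scale=1.0, size=n)  # mildly skewed population
        w, p = stats.shapiro(x)
        rejections += p < 0.05
    print(f"n = {n}: Shapiro-Wilk rejects normality in {rejections / reps:.0%} of samples")

In this setup the rejection rate typically stays low at n = 30 even though the population is not normal, and approaches 100% at n = 3000 even though the deviation is practically harmless.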

If you want to perform a t-test for the comparison of means
between two samples, then preferably each of the samples (not the total sample)
should be from a normally distributed population. But if your total sample size
is large enough (often n >= 30 is assumed as sufficient), then the t-test is valid
even if the two populations are non-normal. What is more important is
equal variances in the 2 populations, if sample sizes are unequal. So one
should use the Welch-corrected t-test by default.
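
In SPSS this is the "Equal variances not assumed" row of the Independent-Samples T Test output. For reference, here is what the same comparison might look like as a minimal sketch in Python/SciPy (the two groups of volumes are made-up numbers; equal_var=False requests the Welch correction):

Code:
import numpy as np
from scipy import stats

# hypothetical volumes in litres for two groups (made-up numbers)
group_a = np.array([1.2, 2.4, 3.1, 2.8, 1.9, 2.2, 3.5, 2.0, 1.7, 2.9])
group_b = np.array([1.0, 1.8, 2.1, 1.6, 2.3, 1.4, 2.7, 1.9, 1.5, 2.0])

# equal_var=False gives the Welch-corrected t-test (unequal variances allowed)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"Welch t = {t_stat:.3f}, p = {p_value:.3f}")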

If you feel uneasy with the Welch t-test, then you should indeed consider the U test.
It does not compare means, but rather which group tends to have higher values
(expressed as ranks). Often, this is more or less what one wants to know.
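
In SPSS the U test is found under the Nonparametric Tests menu (Mann-Whitney U). As a minimal sketch outside SPSS, again with made-up volume data:

Code:
import numpy as np
from scipy import stats

# hypothetical volumes in litres for two groups (made-up numbers)
group_a = np.array([1.2, 2.4, 3.1, 2.8, 1.9, 2.2, 3.5, 2.0, 1.7, 2.9])
group_b = np.array([1.0, 1.8, 2.1, 1.6, 2.3, 1.4, 2.7, 1.9, 1.5, 2.0])

# the U test works on ranks: does one group tend to have higher values?
u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.3f}")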

With kind regards

Karabiner
 
Last edited:

obh

Well-Known Member
#3
There is no such thing as parametric or non-parametric data.
There are non-parametric tests, though, i.e. tests which do not require certain assumptions to be fulfilled,
e.g. that measurements are interval scaled.

Histograms are useless, because assumptions deal with distributions in the population, not
in the sample. The Shapiro-Wilk test is useless, because with a small sample it has little
statistical power and cannot detect medium-sized deviations. With a large sample, it
has too much power and detects irrelevant deviations.
Karabiner
Hi @Karabiner

Everything you wrote is correct (as always), but a histogram and the SW test can still give you some more insight.
As you say, SW with a small sample size cannot prove normality, and it may have weak power for a small or medium effect size (the effect size here being the deviation from normality), but it may still have enough power to detect a large effect size, i.e. to flag data that deviate extremely from normality.

The SW test with a large sample size, like any other test, may identify a very small effect size which is practically meaningless, as there will always be some small effect size (no distribution is perfectly normal).
The problem with the SW test is that it doesn't show the effect size. If it did, you could see a significant p-value with a small effect size and still treat the data as practically normal.
As I understand it, there is no simple way to calculate this effect size, which would otherwise solve the large-sample problem.

The " n >= 30" is a nice rule of thumb, but for extreme asymmetrical data, it might not be good enough.
 

Karabiner

TS Contributor
#4
Everything you wrote is correct (as always), but a histogram and the SW test can still give you some more insight.
So I edited my contribution, and added two IMHOs.
As you say, SW with a small sample size cannot prove normality, and it may have weak power for a small or medium effect size (the effect size here being the deviation from normality), but it may still have enough power to detect a large effect size, i.e. to flag data that deviate extremely from normality.
Well, if the sample is large enough to permit interpreting sample effect size measures
as approximations of the true parameter (in the population), because sampling error is small,
then why should I check such an assumption anyway? Except for the statistical test of significance
for the Pearson correlation [which produces a meaningless, dimensionless coefficient and
can often be substituted by simple linear regression], I cannot readily recall any procedure
which compellingly assumes that something is normally distributed in the population, if sample
size is large.
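
Regarding the substitution by simple linear regression: in the two-variable case, the t-test of the regression slope gives the same p-value as the Pearson correlation test, which a minimal sketch with made-up data shows:

Code:
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)   # made-up linearly related data

r, p_corr = stats.pearsonr(x, y)
reg = stats.linregress(x, y)        # simple linear regression y = a + b*x

# the slope test and the correlation test are equivalent here
print(f"Pearson test p = {p_corr:.6f}, slope test p = {reg.pvalue:.6f}")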

The " n >= 30" is a nice rule of thumb, but for extreme asymmetrical data, it might not be good enough.
This is a necessary qualification of my contribution, thank you. The n = 30 is just a guideline,
and indeed cannot be considered as carved in stone. But if, in my work, sample data appear
very asymmetrical, then there usually is a substantial reason for this, i.e. I expected that there
would be a non-symmetrical distribution, or post hoc a theoretically sound explanation emerges.
This is the case with e.g. clinical scales applied in the general population, reaction times, income, etc.;
generally speaking, it can often happen with variables that have a natural zero.

With kind regards

Karabiner
 
Last edited:

obh

Well-Known Member
#5
Hi @Karabiner

Sorry for the late response.
I wrote about the effect size of the deviation of the population distribution from normality.

As you wrote, we don't need the population distribution to be normal; usually we only need the sample mean to be normally distributed, and we have the CLT to help us.

1. So let's see if I understand your n >= 30 rule of thumb. It is as follows:
You use it only if you don't have a good reason to assume asymmetrical data.
Reasons for asymmetrical data would be, for example:
1. Clinical scales, Likert?
2. Any variable with a natural zero (like reaction time, income, etc.)

If you don't have the above reasons, then you use the 30 rule even if the histogram is asymmetrical? (as this may be random skewness due to the small sample size)

2. What about the F-test for variances? As far as I know it is sensitive to deviations from normality?
What about the chi-squared test for a variance? (See the sketch at the end of this post.)

3. What do you do when the sample size is lower than 30?
Below what sample size should we say that, if we don't have a strong prior assumption that the population distribution is normal, we should use a non-parametric test?

4. I assume the answers depend on each test's sensitivity to the normality assumption, and on the IMHOs :)
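
Regarding question 2, a small simulation makes the sensitivity visible. A minimal sketch, assuming exponential (skewed) data with truly equal variances, so a well-behaved test should reject at roughly the nominal 5% rate:

Code:
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
reps, n, alpha = 5000, 50, 0.05
f_rejections = levene_rejections = 0

for _ in range(reps):
    # two skewed samples from the SAME population, so the true variances are equal
    a = rng.exponential(scale=1.0, size=n)
    b = rng.exponential(scale=1.0, size=n)

    # classical F-test for equality of variances (ratio of sample variances)
    f = np.var(a, ddof=1) / np.var(b, ddof=1)
    p_f = 2 * min(stats.f.cdf(f, n - 1, n - 1), stats.f.sf(f, n - 1, n - 1))
    f_rejections += p_f < alpha

    # Brown-Forsythe variant of Levene's test, much more robust to non-normality
    levene_rejections += stats.levene(a, b, center="median").pvalue < alpha

print(f"false-rejection rate, F-test: {f_rejections / reps:.1%}")
print(f"false-rejection rate, Levene: {levene_rejections / reps:.1%}")

In this setup the F-test's false-rejection rate typically ends up well above 5%, while Levene's stays close to the nominal level, which is why the F-test is usually avoided with non-normal data.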