Testing normality of distribution: Shapiro-Wilk's test/Levene's test?

I have data from an experiment with 5 conditions. The goal is to compare the impact of treatment on dependent variables between the conditions. To determine whether to use ANOVA or non-parametric equivalents, I conducted Shapiro-Wilk's test. It shows that the data is not normally distributed, so I decided to use the Kruskal-Wallis test for the analysis. But then I ran Levene's test to analyze the homogeneity of variance between the groups, and the result shows that the variances are homogeneous. Could I use ANOVA instead of a non-parametric test? Thank you for your help.


Super Moderator
Hi there, sorry for the delay in releasing your post - it was caught in the spam filter for some reason. A couple of thoughts:

1) ANOVA (and other regression models) do not assume that the marginal ("overall") distribution of the dependent variable is normal. ANOVA does assume that the distribution of the DV is normal within each group. That said, testing this with a Shapiro-Wilk test is virtually pointless: If the sample is small, the normality assumption matters, but the Shapiro-Wilk test will have poor power to detect violations of normality; if the sample is large, the Shapiro-Wilk test will have good power, but the normality assumption probably won't matter (due to the central limit theorem).

2) A non-significant Levene's test statistic indicates a lack of evidence to reject a null hypothesis that the variances are equal. It doesn't necessarily indicate that the variances are homogeneous; a non-significant result might just be due to low power.

3) A Kruskal-Wallis test is a non-parametric alternative to ANOVA, but it tests a completely different null hypothesis (that the mean ranks are equal across the populations). That might not be what you're interested in testing. If all you're worried about is normality, a simpler alternative would be to use ANOVA, but apply bootstrapping or a permutation test to calculate confidence intervals or p values.
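The permutation idea in point 3 can be sketched roughly as follows. This is only a minimal illustration using the Python standard library: the function names `f_statistic` and `permutation_anova` and the three small groups are made up for the example, and the p value is estimated from random relabelings rather than from the F distribution.

```python
import random

def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups (lists of numbers)."""
    pooled = [x for g in groups for x in g]
    grand_mean = sum(pooled) / len(pooled)
    means = [sum(g) / len(g) for g in groups]
    k, n = len(groups), len(pooled)
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def permutation_anova(groups, n_perm=2000, seed=0):
    """Return (observed F, permutation p value) by shuffling group labels."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:  # cut the shuffled pool back into groups of the original sizes
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            at_least_as_extreme += 1
    # The +1 counts the observed arrangement itself, so p is never exactly 0.
    return observed, (at_least_as_extreme + 1) / (n_perm + 1)

# Hypothetical data, for illustration only.
example = [[3, 4, 2, 5, 4], [4, 5, 5, 3, 6], [6, 5, 7, 6, 4]]
f_obs, p_perm = permutation_anova(example)
print(f"F = {f_obs:.3f}, permutation p = {p_perm:.4f}")
```

The test statistic is still the usual F, so the hypothesis is still about group means; only the reference distribution no longer relies on normality.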

Hope that helps!
Thank you for your response! A couple of thoughts:
- How do I know if Levene's test brings a non-significant result because of low power, rather than because the variances are equal?
- Several sources outline that the general assumptions of ANOVA assume both normal distribution of the DVs (that is what I have been testing with Shapiro-Wilk) and homogeneity of variances (testing with Levene's test). See e.g. here: https://statistics.laerd.com/statistical-guides/one-way-anova-statistical-guide-2.php. That is why I'm concerned about distributions, although ANOVA is pretty resistant to non-normality.
- Non-parametric alternatives are recommended for small-sample (30 participants per condition) Likert-scale data, see e.g. https://dl.acm.org/citation.cfm?id=1753686&dl=ACM&coll=DL&CFID=997388384&CFTOKEN=76933434
- Kruskal-Wallis doesn't use means, it uses medians. See e.g. here: http://www.statisticshowto.com/kruskal-wallis/, https://statistics.laerd.com/spss-tutorials/kruskal-wallis-h-test-using-spss-statistics.php
- I use Dunn post-hoc test for pairwise comparisons.
Thanks a lot


TS Contributor
Just to build on what CB said: you should run the ANOVA and then do the diagnostics, i.e. check whether the residuals are normal, whether the variances are heterogeneous (like the horn shape in the residual plot), etc. This is generally much easier and more sensible than running checks before the test. If your residuals show patterns and/or non-normality, you might want to either use more advanced techniques (like a data transformation) or move over to a non-parametric test. Non-parametric tests generally have lower power, so you might want to consider gathering more data in that case.
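As a rough sketch of what "diagnostics after the fit" means for a one-way ANOVA: the fitted value of each observation is its group mean, and the residual is the observation minus that mean. The helper names and the three groups below are invented for illustration; in practice you would plot the residuals against the fitted values rather than just print summaries.

```python
from statistics import mean, stdev

def anova_residuals(groups):
    """Return (fitted, residual) pairs; the fitted value is the group mean."""
    pairs = []
    for values in groups.values():
        m = mean(values)
        pairs.extend((m, x - m) for x in values)
    return pairs

def spread_by_group(groups):
    """Residual SD per group (equal to the group SD, since residual = x - group mean)."""
    return {name: stdev(values) for name, values in groups.items()}

# Hypothetical data: group C is deliberately more spread out than A and B.
groups = {
    "A": [4.1, 5.0, 4.7, 5.3, 4.9],
    "B": [6.2, 5.8, 6.9, 6.1, 6.5],
    "C": [7.0, 9.5, 5.2, 11.1, 8.3],
}

sds = spread_by_group(groups)
ratio = max(sds.values()) / min(sds.values())
for name, sd in sds.items():
    print(f"group {name}: residual SD = {sd:.2f}")
print(f"largest / smallest SD = {ratio:.1f}")
```

A spread that grows with the fitted value is the "horn shape" mentioned above; the pairs returned by `anova_residuals` are what you would put on the two axes of the residual plot.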


TS Contributor
- How do I know if Levene's test brings a non-significant result because of low power, rather than because the variances are equal?
Small sample size. For example, with a total n=40, the power to detect that population SDs of 5 versus 9 are different would just be 40.5%
https://ncss-wpengine.netdna-ssl.co.../PASS/Levene_Test_of_Variances-Simulation.pdf page 553-11
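If you want a feel for how low the power can be in your own design, a small simulation is one way to estimate it. The sketch below assumes NumPy and SciPy are installed and uses `scipy.stats.levene` with `center="median"`; the function name `levene_power`, the choice of two normally distributed groups of 20, and the number of simulations are my assumptions for illustration, so the result need not match the figure in the NCSS document.

```python
import numpy as np
from scipy import stats

def levene_power(sds, n_per_group, n_sim=2000, alpha=0.05, seed=0):
    """Share of simulated data sets in which Levene's test rejects equal variances."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sim):
        # One normal sample per group, all with mean 0 and the given population SD.
        samples = [rng.normal(0.0, sd, n_per_group) for sd in sds]
        if stats.levene(*samples, center="median").pvalue < alpha:
            rejections += 1
    return rejections / n_sim

# Assumed scenario: two groups of 20 (total n = 40), population SDs 5 and 9.
print(f"estimated power: {levene_power((5.0, 9.0), n_per_group=20):.2f}")
```

Rerunning with your own group sizes and a variance difference you would consider important shows how often the test would miss it.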

- Several sources outline that the general assumptions of ANOVA assume both normal distribution of the DVs (...). See e.g. here: https://statistics.laerd.com/statistical-guides/one-way-anova-statistical-guide-2.php
But your source does NOT outline what you say. Instead, it states:
"1.The dependent variable is normally distributed in each group that is being compared in the one-way ANOVA..."
Careful reading is highly recommended.

With kind regards

My view is that few things have done as much damage to the practice of statistics as the dichotomy of "parametric" and "non-parametric" methods. That division seems to be the story told in elementary statistics books. The division was correct in the 1950s. But in the 1960s the Box-Cox transformation came (to transform to approximate normality) and in the 1970s the generalized linear models (with many other parametric distributions), and these were at least well established by the 1990s. Many other non-parametric methods appeared as well.

To my knowledge CBear is correct in that the Wilcoxon-Mann-Whitney/Kruskal-Wallis test is based on the null hypothesis that the MEAN of the ranks is the same. I am sorry, but I am too lazy to search for sources for that.

I am sorry, but I don't trust the sources that luckycat gives in post 3. We all know that some sources on the internet are not reliable.

But Fagerland and Sandvik (2009), "Performance of five two-sample location tests for skewed distributions with unequal variances", say that it is not generally true that the Wilcoxon-Mann-Whitney is a test of medians.
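A tiny made-up example may help show why the rank-based statistic is not simply a comparison of medians. The two samples below are invented so that their medians are identical while their mean ranks differ; the helper names are hypothetical, and the H formula is the textbook one without tie correction (there are no ties here).

```python
from statistics import median

def mean_ranks(groups):
    """Mean rank of each group in the pooled sample (assumes no tied values)."""
    pooled = sorted(x for g in groups for x in g)
    rank = {x: i + 1 for i, x in enumerate(pooled)}
    return [sum(rank[x] for x in g) / len(g) for g in groups]

def kruskal_h(groups):
    """Kruskal-Wallis H statistic without tie correction."""
    n_total = sum(len(g) for g in groups)
    weighted = sum(len(g) * r ** 2 for g, r in zip(groups, mean_ranks(groups)))
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)

a = [1, 2, 9, 11, 12, 13]
b = [3, 4, 5, 15, 16, 17]

print(median(a), median(b))    # 10.0 10.0
print(mean_ranks([a, b]))      # [5.5, 7.5]
print(round(kruskal_h([a, b]), 3))
```

H is built entirely from the mean ranks, so it is nonzero here even though the two medians coincide. The test only becomes a statement about medians under extra assumptions, such as the distributions having the same shape.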

Have a look at Fagerland, Sandvik and Mowinckel (2015) where the abstract says:

The Welch U test (the T test with adjustment for unequal variances) and its associated confidence interval performed well for almost all situations considered. The Brunner-Munzel test also performed well, except for small sample sizes (10 in each group). The ordinary T test, the Wilcoxon-Mann-Whitney test, the percentile bootstrap interval, and the bootstrap-t interval did not perform satisfactorily.

The difference between the means is an appropriate effect measure for comparing two independent discrete numerical variables that has both lower and upper bounds. To analyze this problem, we encourage more frequent use of parametric hypothesis tests and confidence intervals.