# Clarification in statistical test

#### Alisonjones

##### New Member
Hi. I would like to know if the Mann-Whitney U test is appropriate. My data is non-normal. I want to compare 3 categories (human trail, fences, and control), with the dependent variable being animal trail lengths. Is this an appropriate test to use, or is there a better-suited statistical test?

#### katxt

##### Member
The Kruskal-Wallis test is the one for three groups, so long as the three groups have the same shape but possibly different locations. Failing that, a permutation test should work. kat
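A permutation test for three groups can be sketched as follows. This is a minimal Python illustration (the data, group sizes, and choice of test statistic are all assumptions for demonstration, not the poster's actual trail lengths): shuffle the group labels many times and see how often the shuffled statistic is at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical skewed "trail length" samples for the three categories
groups = [rng.exponential(12.0, 25),   # human trail
          rng.exponential(9.0, 25),    # fences
          rng.exponential(5.0, 25)]    # control
labels = np.repeat([0, 1, 2], [len(g) for g in groups])
values = np.concatenate(groups)

def between_group_spread(values, labels):
    """Test statistic: variance of the group means (large when groups differ)."""
    means = [values[labels == k].mean() for k in range(3)]
    return np.var(means)

observed = between_group_spread(values, labels)

# Reshuffle the labels many times to build the null distribution
perms = 5000
count = 0
for _ in range(perms):
    shuffled = rng.permutation(labels)
    if between_group_spread(values, shuffled) >= observed:
        count += 1

# Add-one correction gives a valid permutation p-value
p_value = (count + 1) / (perms + 1)
print(f"permutation p-value: {p_value:.4f}")
```

Any statistic that grows when the groups separate (variance of group means, a Kruskal-Wallis H, etc.) works here; only the label shuffling carries the null hypothesis.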

#### obh

##### Active Member
Hi Alison and Kat,

The Mann-Whitney U test compares only 2 groups. The Kruskal-Wallis test is the non-parametric equivalent of the one-way ANOVA mentioned above, with H0 being that the ranks of all groups are equal. So it is a good suggestion.

If you want to compare each pair (ab, ac, bc), then you must use a lower significance level (α) to avoid rejecting a correct H0 (type I error), since you are doing 3 tests and not only one. (Each test is a random variable; the probability of randomly getting an extreme value is higher with 3 variables.)

> the three groups have the same shape but possibly different locations.

In the Mann-Whitney U test (and I guess in Kruskal-Wallis as well), the "same shape" assumption is not mandatory if you compare the ranks, i.e. you compare the entire distributions and not only one statistic like the median or mean. If you want to compare the medians, then you do need this assumption.

(For symmetrical distributions, the mean equals the median.)
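The procedure described above (omnibus test first, then Bonferroni-adjusted pairwise comparisons) can be sketched in Python with scipy; the data here are hypothetical, and exponential samples stand in for skewed trail lengths:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical skewed "trail length" samples for the three categories
human_trail = rng.exponential(scale=12.0, size=25)
fences = rng.exponential(scale=9.0, size=25)
control = rng.exponential(scale=5.0, size=25)

# Omnibus Kruskal-Wallis test across all three groups
h_stat, p_kw = stats.kruskal(human_trail, fences, control)
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.4f}")

# Pairwise Mann-Whitney U follow-ups at a Bonferroni-corrected level
alpha_corrected = 0.05 / 3  # three pairwise comparisons
for name, (a, b) in {
    "human vs fences": (human_trail, fences),
    "human vs control": (human_trail, control),
    "fences vs control": (fences, control),
}.items():
    u_stat, p = stats.mannwhitneyu(a, b, alternative="two-sided")
    print(f"{name}: U = {u_stat:.1f}, p = {p:.4f}, "
          f"reject at corrected alpha: {p < alpha_corrected}")
```

Dividing α by the number of pairwise tests is the Bonferroni correction; it keeps the family-wise type I error at roughly 0.05 across the three comparisons.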

#### Karabiner

##### TS Contributor
> My data is non normal.

It is NOT an assumption of a t-test or an analysis of variance that the "data" have to be normal (who told you so, by the way?).

If the sample size is small, then the prediction errors (residuals) of your ANOVA should preferably
be sampled from a normally distributed population of residuals.

If your sample size is large, then the t-test or the F-test, respectively, is considered
robust against violations of even that assumption.

How large is your sample size?

With kind regards

Karabiner
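The distinction made above (normality concerns the residuals, not the raw data) can be seen in a small Python sketch. The data are hypothetical: three groups with different means but normal errors around each group mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical one-way layout: three groups whose means differ, but whose
# errors around each group mean are normal
groups = [rng.normal(loc=mu, scale=2.0, size=20) for mu in (5.0, 8.0, 12.0)]

# Pooled raw data mixes three different means, so a normality test on the
# raw data answers the wrong question...
pooled = np.concatenate(groups)
w_raw, p_raw = stats.shapiro(pooled)
print(f"Shapiro-Wilk on raw data:  p = {p_raw:.4f}")

# ...the small-sample assumption concerns the residuals: each observation
# minus its own group mean
residuals = np.concatenate([g - g.mean() for g in groups])
w_res, p_res = stats.shapiro(residuals)
print(f"Shapiro-Wilk on residuals: p = {p_res:.4f}")
```

The pooled data can look "non-normal" purely because the group means differ, while the residuals, which are what the assumption is about, are perfectly normal here.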

#### obh

##### Active Member
Hi Karabiner

Correct, but I may rephrase it another way:
The t-test uses the t-distribution.
The t-distribution assumes normal data (it is used instead of z when you don't know the population standard deviation).
Due to the central limit theorem, when the sample size is large enough the sampling distribution of the mean is usually approximately normal,
so you can use the t-test.

I also tried to run a simple simulation (hopefully correct?).
If the t-test were exact for this data, the expected result would be a rejection rate of 0.05, the probability of rejecting a correct H0.

I should also show the power to reject an incorrect H0...

I use a chi-square with df = 4, independent of the sample size, just to show a non-symmetrical distribution. (Usually, the df goes up with the sample size.)
I tried the same for a normal distribution as well.

```
df <- 4                                         # degrees of freedom
reps <- 200000                                  # number of simulations per sample size
sample_size <- c(2,4,6,8,10,15,20,25,30,35,40)  # sample sizes

mean_pvalues <- numeric(length(sample_size))
set.seed(1)

j <- 1
for (n in sample_size)
{
  pvalues <- numeric(reps)
  for (i in 1:reps)
  {
    x1 <- rchisq(n, df, ncp = 0)
    x2 <- rchisq(n, df, ncp = 0)
    pvalues <- t.test(x2, x1, alternative = "greater")$p.value
  }
  mean_pvalues[j] <- mean(pvalues < 0.05)
  j <- j + 1
}
mean_pvalues
plot(sample_size, mean_pvalues)
lines(sample_size, mean_pvalues)
```

And it compares the chi-squared distribution (blue) with a normal one (red).

Last edited:

#### Dason

##### Ambassador to the humans
Was that the literal code you used?

#### obh

##### Active Member
Hi Dason,

I used the following code; I added the normal distribution and the colors, and increased reps.

```
df <- 4                                         # degrees of freedom
reps <- 800000                                  # number of simulations per sample size
sample_size <- c(2,4,6,8,10,15,20,25,30,35,40)  # sample sizes

mean_pvalues <- numeric(length(sample_size))
set.seed(1)

j <- 1
for (n in sample_size)
{
  pvalues <- numeric(reps)
  for (i in 1:reps)
  {
    x1 <- rchisq(n, df, ncp = 0)
    x2 <- rchisq(n, df, ncp = 0)
    pvalues[i] <- t.test(x2, x1, alternative = "greater")$p.value
  }
  mean_pvalues[j] <- mean(pvalues < 0.05)
  j <- j + 1
}
mean_pvalues
plot(sample_size, mean_pvalues)
lines(sample_size, mean_pvalues, col = "blue")

#-2.-------------------
mu <- 10                                        # mean under the null hypothesis
sigma <- 20                                     # standard deviation
#reps <- 800000                                 # number of simulations per sample size
sample_size <- c(2,4,6,8,10,15,20,25,30,35,40)  # sample sizes

mean_pvalues2 <- numeric(length(sample_size))

set.seed(1)

j <- 1
for (n in sample_size)
{
  pvalues <- numeric(reps)
  for (i in 1:reps)
  {
    x1 <- rnorm(n, mu, sigma)
    x2 <- rnorm(n, mu, sigma)
    pvalues[i] <- t.test(x2, x1, alternative = "greater")$p.value
  }
  mean_pvalues2[j] <- mean(pvalues < 0.05)
  j <- j + 1
}
lines(sample_size, mean_pvalues2, col = "red")
```

#### obh

##### Active Member
Okay, I fell for it again.
For some reason the website doesn't like the [ i ] and removes it from the code I paste.
I will update the code to [ i ] with spaces.

But the [j] works fine

#### Dason

Use code tags around your code and that won't happen

#### obh

##### Active Member
What is the code tag? ~~code ~~

Anyway, my conclusion from the simulation is that even for a non-symmetrical distribution the t-test will do a good job, in this specific example from around n = 20. (I know the general rule of thumb is around 30.)

Is this a correct conclusion from this simulation?

#### Dason

Right before your code block put

[ code ]

And when you're done end it with

[ /code ]

With no spaces and it will render better and keep spaces/indentation intact.

#### Dason

I mean... it's approximately correct for a chi-square with 4 degrees of freedom, which isn't really that badly behaved a distribution.
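A commonly cited example of a much worse-behaved distribution is a lognormal with a large sigma, which is far more right-skewed than a chi-square with df = 4. A Python sketch of the same type of simulation (parameters and sample size are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
reps, n, alpha = 20000, 20, 0.05

# Both samples always come from the same lognormal distribution
# (sigma = 2 makes it heavily right-skewed), so every rejection of H0
# is a type I error; an exact test would reject about 5% of the time.
rejections = 0
for _ in range(reps):
    x1 = rng.lognormal(mean=0.0, sigma=2.0, size=n)
    x2 = rng.lognormal(mean=0.0, sigma=2.0, size=n)
    if stats.ttest_ind(x2, x1, alternative="greater").pvalue < alpha:
        rejections += 1

print(f"empirical type I error rate at n = {n}: {rejections / reps:.4f}")
```

For a distribution this skewed, the one-sided rejection rate typically sits noticeably away from the nominal 0.05 at small n, unlike the chi-square case in the thread.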

#### obh

##### Active Member
Yes, that's what I meant.
I ran a special case to get a feel for the statistics.

Is the "reasonably symmetrical" requirement for the t-test also needed when n >= 30?

Do you know any example of a "badly behaved" distribution in R?

Last edited: