Clarification in statistical test

#1
Hi. I would like to know if the Mann-Whitney U test is appropriate. My data are non-normal. I want to compare 3 categories (human trail, fences and control), with the dependent variable being animal trail length. Is this an appropriate test to use, or is there a better-suited statistical test?
 
#2
The Kruskal-Wallis test is the one for three groups, so long as the three groups have the same shape but possibly different locations. Failing that, a permutation test should work. kat
 

obh

Active Member
#3
Hi Alison and Kat,

The Mann-Whitney U test compares only 2 groups. The Kruskal-Wallis test is the non-parametric equivalent of the one-way ANOVA mentioned above, with H0 that the ranks of all groups are equal. So it is a good suggestion :)

If you want to compare each pair (ab, ac, bc), then you must use a lower significance level (α) to avoid rejecting a correct H0 (a type I error), since you are doing 3 tests and not only one. (Each test is a random variable; the probability of randomly getting an extreme value is higher with 3 variables.)
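As a minimal sketch of this in R (the trail-length values and group labels below are invented for illustration), `pairwise.wilcox.test` runs the Mann-Whitney (Wilcoxon rank-sum) test on every pair with a multiplicity correction built in:

```r
# Invented trail-length data for three treatments
set.seed(1)
trail_length <- c(rchisq(10, df = 4), rchisq(10, df = 4) + 1, rchisq(10, df = 4) + 2)
treatment <- factor(rep(c("human_trail", "fence", "control"), each = 10))

# Pairwise Mann-Whitney (Wilcoxon rank-sum) tests with a Bonferroni
# correction, so the family-wise type I error stays near the nominal 0.05
res <- pairwise.wilcox.test(trail_length, treatment, p.adjust.method = "bonferroni")
res
```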

as the three groups have the same shape but possibly different locations.
In the Mann-Whitney U test (and I guess in Kruskal-Wallis as well), the "same shape" assumption is not mandatory if you compare the ranks, i.e. you compare the entire distributions and not only one statistic like the median or mean. If you want to compare the medians, then you do need this assumption.

(For symmetrical distributions the mean equals the median.)
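For completeness, a sketch of the Kruskal-Wallis test itself in base R, again with invented data standing in for the real trail lengths:

```r
# Invented trail-length data for the three categories
set.seed(1)
trail_length <- c(rchisq(10, df = 4), rchisq(10, df = 4), rchisq(10, df = 4))
treatment <- factor(rep(c("human_trail", "fence", "control"), each = 10))

# H0: all three groups have the same rank distribution
kw <- kruskal.test(trail_length ~ treatment)
kw
```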
 

Karabiner

TS Contributor
#4
My data is non normal.
It is NOT an assumption of a t-test or an analysis of variance that the "data" have to be normal
(who told you so, by the way?).

If the sample size is small, then the prediction errors (residuals) of your ANOVA should preferably
be sampled from a normally distributed population of residuals.
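A sketch of that residual check in R, using simulated placeholder data (the values and group labels are made up):

```r
# Placeholder data: a skewed outcome across three groups
set.seed(1)
trail_length <- rchisq(30, df = 4)
treatment <- factor(rep(c("a", "b", "c"), each = 10))

fit <- aov(trail_length ~ treatment)

# Inspect the residuals, not the raw data:
# a QQ-plot plus a Shapiro-Wilk test of the residuals
qqnorm(residuals(fit)); qqline(residuals(fit))
sw <- shapiro.test(residuals(fit))
sw
```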

If your sample size is large, then the t-test or the F-test, respectively, is considered
robust against violations of even that assumption.

How large is your sample size?

With kind regards

Karabiner
 

obh

Active Member
#5
Hi Karabiner

Correct, but let me rephrase it another way:
The t-test uses the t-distribution.
The t-distribution assumes normal data (it is used instead of z when you don't know the population standard deviation).
Due to the central limit theorem, when the sample size is large enough the distribution of the sample mean is approximately normal,
so you can use the t-test.

I also tried to run a simple simulation (hopefully correct ???)
If the t-test were perfect for this data, the expected result would be a rejection rate of 0.05: the probability of rejecting a correct H0.

I should also show the power to reject an incorrect H0...
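For what it's worth, a quick sketch of such a power check, making H0 false by shifting the second sample (the +2 shift and n = 20 are arbitrary choices, not values from the thread):

```r
set.seed(1)
df <- 4       # degrees of freedom of the chi-square
reps <- 5000  # number of simulated tests
n <- 20       # sample size per group

pvalues <- numeric(reps)
for (i in 1:reps) {
  x1 <- rchisq(n, df)
  x2 <- rchisq(n, df) + 2   # H0 is false: x2 is shifted upwards
  pvalues[i] <- t.test(x2, x1, alternative = "greater")$p.value
}
power <- mean(pvalues < 0.05)  # share of simulations that reject H0
power
```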

I use a chi-square distribution with df=4, independent of the sample size, just to have a non-symmetrical distribution. (Usually the df goes up with the sample size.)
I tried the same also for a normal distribution.

df <- 4 # degree of freedom
reps <- 200000 # number of simulations per one sample size
sample_size=c(2,4,6,8,10,15,20,25,30,35,40) # sample size

mean_pvalues <- numeric(length(sample_size))
set.seed(1)

j <- 1
for (n in sample_size)
{
  pvalues <- numeric(reps)
  for (i in 1:reps)
  {
    x1 <- rchisq(n, df, ncp = 0)
    x2 <- rchisq(n, df, ncp = 0)
    pvalues[i] <- t.test(x2, x1, alternative = "greater")$p.value
  }
  mean_pvalues[j] <- mean(pvalues < 0.05)
  j <- j + 1
}
mean_pvalues
plot(sample_size,mean_pvalues)
lines(sample_size,mean_pvalues)


[attachment: plot of rejection rate vs. sample size]

And this compares the chi-squared distribution (blue) with a normal one (red):

[attachment: plot of rejection rates for chi-squared (blue) vs. normal (red) data]
 

obh

Active Member
#7
Hi Dason,

I used the following code; I added the normal distribution and colors, and increased reps.

Code:
df <- 4  # degrees of freedom
reps <- 800000  # number of simulations per sample size
sample_size <- c(2,4,6,8,10,15,20,25,30,35,40)  # sample sizes

mean_pvalues <- numeric(length(sample_size))

set.seed(1)

j <- 1
for (n in sample_size)
{
  pvalues <- numeric(reps)
  for (i in 1:reps)
  {
    x1 <- rchisq(n, df, ncp = 0)
    x2 <- rchisq(n, df, ncp = 0)
    pvalues[i] <- t.test(x2, x1, alternative = "greater")$p.value
  }
  mean_pvalues[j] <- mean(pvalues < 0.05)
  j <- j + 1
}
mean_pvalues
plot(sample_size, mean_pvalues)
lines(sample_size, mean_pvalues, col = "blue")
#-2.-------------------
mu <- 10  # mean under the null hypothesis
sigma <- 20  # standard deviation
#reps <- 800000  # number of simulations per sample size

mean_pvalues2 <- numeric(length(sample_size))

set.seed(1)

j <- 1
for (n in sample_size)
{
  pvalues <- numeric(reps)
  for (i in 1:reps)
  {
    x1 <- rnorm(n, mu, sigma)
    x2 <- rnorm(n, mu, sigma)
    pvalues[i] <- t.test(x2, x1, alternative = "greater")$p.value
  }
  mean_pvalues2[j] <- mean(pvalues < 0.05)
  j <- j + 1
}
mean_pvalues2
#plot(sample_size, mean_pvalues2)
lines(sample_size, mean_pvalues2, col = "red")
 

obh

Active Member
#9
Okay, I fell for it again :)
For some reason the website doesn't like the [ i ] and removes it from the code I paste :(
I will update the code to [ i ] with spaces :)

But the [j] works fine.
 
#11
What is the code tag? ~~code ~~

Anyway, my conclusion from the simulation is that even for a non-symmetrical distribution the t-test will do a good job; in this specific example, from around n=20. (I know the general rule of thumb says 30.)

Is this a correct conclusion from this simulation?
 

Dason

Ambassador to the humans
#12
Right before your code block put

[ code ]

And when you're done end it with

[ /code ]

With no spaces and it will render better and keep spaces/indentation intact.
 
#14
Yes, that's what I meant.
I ran a special case to get a feel for the statistics.

Is the "reasonably symmetrical" requirement for the t-test also needed when n >= 30?

Do you know any example of a "badly behaved" distribution in R?
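One guess at a badly behaved case (offered as an illustration, not a canonical answer): a strongly skewed lognormal. Rerunning the same kind of simulation with n = 30 shows how far the rejection rate can drift from the nominal 0.05:

```r
set.seed(1)
reps <- 5000  # number of simulated tests
n <- 30       # sample size per group

pvalues <- numeric(reps)
for (i in 1:reps) {
  x1 <- rlnorm(n, meanlog = 0, sdlog = 2)  # heavy right skew
  x2 <- rlnorm(n, meanlog = 0, sdlog = 2)  # same distribution: H0 is true
  pvalues[i] <- t.test(x2, x1, alternative = "greater")$p.value
}
rate <- mean(pvalues < 0.05)  # compare with the nominal 0.05
rate
```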
 