CLT - What sample size is large enough? Your thoughts?

obh

Active Member
#1
Hi, I will be glad to read your thoughts.

What sample size is large enough so I can assume normality based on the Central Limit Theorem? Some write 30, some write 100.
I ran some simulations, and it seems that for some skewed data (finite variance, independent observations...) you need a sample bigger than 200.
And what counts as "reasonably symmetrical"?

I ran simulations from F(8,8) to F(19,19) with a large number of repeats (100,000) and checked what sample size brings the average close to the Normal distribution.

What is "close to the Normal distribution"? I thought about two options:
1. The sampling distribution's skewness < 0.5 and its excess kurtosis < 0.5.
2. The Shapiro-Wilk (SW) test - since it is limited to 5,000 observations, average the results over blocks of 5,000. It is definitely too powerful for a large sample, so maybe use a low significance level such as 0.01 or 0.001.

Currently I use option 1; I may also try option 2.
I also ran the following regression:
DV: sample size
IVs (population parameters): skewness, excess kurtosis, skewness*kurtosis, excess kurtosis^2, skewness^2

A potential problem: in practice you know only the sample statistics (skewness, kurtosis), while the regression is based on the true population values.
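For illustration, a minimal sketch of how such a regression could be set up in R; the data frame and its values here are made up, standing in for the simulation output (one row per simulated population, with the smallest sample size that met the normality criterion):

Code:
# Illustrative (made-up) simulation output: min_n is the smallest sample size
# that met the normality criterion; skew and kurt are the population's true
# skewness and excess kurtosis.
sim_results <- data.frame(
  min_n = c(60, 90, 130, 180, 240, 310, 400, 500),
  skew  = c(1.2, 1.5, 1.8, 2.1, 2.4, 2.8, 3.2, 3.7),
  kurt  = c(2.5, 4.0, 6.0, 8.5, 11.5, 15.0, 20.0, 27.0)
)

# Regress the required sample size on the population shape parameters
fit <- lm(min_n ~ skew + kurt + I(skew * kurt) + I(skew^2) + I(kurt^2),
          data = sim_results)
summary(fit)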

Your thoughts? Any recommended article?
 
#2
What would be a relevant criterion to check whether the central limit theorem is applicable?

I would say that it is the error rate (under the null hypothesis). If your error rate deviates a lot from 0.05, I would say that the test is not so good.

For me, if the actual error rate is 6% or 7%, in contrast to the nominal error rate of 5%, I would find it acceptable.
(This is in comparison to e.g. omitted variable bias, where I believe the error rate can be 30% or 40% or higher.)

So when I do a simulation, I check the actual error rate as compared to 5%.

(Also, to do 100,000 simulations seems too much. The number of "significances" will be binomially distributed with p and n=100,000. Check it!)
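For example, assuming a true rate of 0.05, the binomial standard error shows how much the estimated rejection rate can wobble for different numbers of repeats:

Code:
# Monte Carlo standard error of the estimated rejection rate, assuming the
# true rate is 0.05, for different numbers of repeats
p <- 0.05
reps <- c(1000, 10000, 100000)
se <- sqrt(p * (1 - p) / reps)
round(data.frame(reps, se, ci_half_width_95 = 1.96 * se), 5)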
 

Dason

Ambassador to the humans
#3
(Also, to do 100,000 simulations seems too much. The number of "significances" will be binomially distributed with p and n=100,000. Check it!)
It might seem like too much but it still completes the simulation almost instantly and there really is no downside to having a larger sample size.
 

obh

Active Member
#4
Thank you Greta!

How do I calculate the error rate? Did I do it correctly, or is there a better way?

If, for example, I run 10,000 repeats with sample size = 30 (I use two different words to avoid confusion: repeats = 10,000, sample size = 30):

1. I divide the histogram into 8 ranges ([-∞,-3], [-3,-2], ..., [2,3], [3,∞]).
2. I could use ~40 ranges in the simulation, and they should probably have equal probabilities instead of equal widths...
3. Sum the number of occurrences in each range and divide by the total occurrences over all 10,000 repeats.
4. Calculate the Normal probability of being in this range, for example P([2,3]) = P(Z≤3) - P(Z≤2).
5. Error rate = sum(|Actual P - Normal P|) = 0.17. It should be 0.05 or less to count as normal. (A small R sketch of this calculation is below.)
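A minimal R sketch of steps 1-5, assuming the simulated sample means are standardized before binning (the sample size 30 from F(15,15) is just for illustration):

Code:
# 10,000 standardized sample means (sample size 30 from F(15,15), for example)
z <- scale(replicate(10000, mean(rf(30, df1 = 15, df2 = 15))))

breaks   <- c(-Inf, -3:3, Inf)                 # the 8 ranges
actual_p <- as.numeric(table(cut(z, breaks))) / length(z)
normal_p <- diff(pnorm(breaks))                # Normal probability of each range
error_rate <- sum(abs(actual_p - normal_p))    # compare actual vs. Normal bin probabilities
error_rate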


(Also, to do 100,000 simulations seems too much. The number of "significances" will be binomially distributed with p and n=100,000. Check it!)
I started with 10,000 repeats and raised it until I got similar results (converging to the same sample size).

Maybe with the "error rate" option, I will be able to have a smaller number of repeats.
As Dason wrote, raising the number of repeats won't damage the results (only warm my laptop :) )


Example calculation of error_rate (following the attached png; I also attached the Excel file):
[Attached image: example error_rate calculation]
 


#5
(Also, to do 100,000 simulations seems too much. The number of "significances" will be binomially distributed with p and n=100,000. Check it!)
It might seem like too much but it still completes the simulation almost instantly and there really is no downside to having a larger sample size.
Sure! But with that attitude one might as well do 100 million simulations.

As someone said: "When statisticians do simulations, they tend to forget that they are statisticians."

(I believe that I have read that when doing a permutation test simulation, 1,000 repeats are enough. Why do far too many?)

How do I calculate the error rate? Did I do it correctly, or is there a better way?
Well, I just mean that if you are simulating under the null, then you should get an error rate of around 5% (your nominal significance level). You can just count the number of significances, and that will be your achieved significance level. And I say that if the achieved significance level is larger than 7 or 8%, then the estimator is not so good.

Code:
y1 <- 700
n1 <- 10000

p1 <- y1/n1

p1
# [1] 0.07

#confidence interval:
p1 + c(-1, +1)*1.96*sqrt(p1*(1-p1)/n1 )
# [1] 0.06499912 0.07500088

But, I guess that there are many other ways to simulate.
 

obh

Active Member
#6
Thanks Greta!

As someone said: "When statisticians do simulations, they tend to forget that they are statisticians."
:D Don't think, just use the club! (simulation)
(I believe that I have read that when doing a permutation test simulation, 1,000 repeats are enough. Why do far too many?)
I usually start with a small number of repeats, then multiply it again and again until it converges to similar results. In this case, 1,000 wasn't enough.


Well, I just mean that if you are simulating under the null, then you should get an error rate of around 5% (your nominal significance level). You can just count the number of significances, and that will be your achieved significance level. And I say that if the achieved significance level is larger than 7 or 8%, then the estimator is not so good.

Code:
y1 <- 700
n1 <- 10000

p1 <- y1/n1

p1
# [1] 0.07

#confidence interval:
p1 + c(-1, +1)*1.96*sqrt(p1*(1-p1)/n1 )
# [1] 0.06499912 0.07500088
But, I guess that there are many other ways to simulate.

So you say that the success decision should depend on the number of repeats?

I tried the following:

The overall program runs a "lion in the desert" (binary search) algorithm until it finds the smallest sample size that meets the decision criterion (pass1 is TRUE).
But to keep it simple, the following code is only for one decision, the example case of sample size = 30 from F(15,15). (A rough sketch of the search wrapper is shown after the code.)

pass1 should answer the question: "Is the average normally distributed?"

I run the Shapiro-Wilk test with a significance level of 0.05, which is the default.
So did you suggest that I use 0.05 in the SW test but alpha = 0.07 in the decision criterion?

Code:
n1 <- 30; df_1 <- 15; df_2 <- 15
group <- 100   # one Shapiro-Wilk test is run over 100 averages
reps  <- 2000  # run the Shapiro-Wilk test 2000 times
alpha <- 0.05  # used for the decision, not for the SW test. Do you suggest alpha=0.07?
# The decision is one-tailed, so I used Z_0.95 = 1.645
critical_p <- alpha + 1.644854 * sqrt(alpha * (1 - alpha) / reps)

#----------------
sw1 <- numeric(reps)
for (i in 1:reps) {
  avg1 <- numeric(group)
  for (j in 1:group) {
    x <- rf(n = n1, df1 = df_1, df2 = df_2)  # sample of size n1 from F(df_1, df_2)
    avg1[j] <- mean(x)
  }
  sw1[i] <- shapiro.test(avg1)$p.value
}
sw_avg <- mean(sw1 < alpha)        # proportion of significant SW tests
pass1  <- (sw_avg < critical_p)    # TRUE: averages treated as normal
#----------------
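And a rough sketch of the surrounding "lion in the desert" search, wrapping the block above in a function (the search bounds 2 and 1000 are just assumptions):

Code:
# Hypothetical wrapper: binary search for the smallest sample size n1 whose
# averages pass the normality decision above.
decision_pass <- function(n1, df_1 = 15, df_2 = 15,
                          group = 100, reps = 2000, alpha = 0.05) {
  sw1 <- replicate(reps, {
    avg1 <- replicate(group, mean(rf(n = n1, df1 = df_1, df2 = df_2)))
    shapiro.test(avg1)$p.value
  })
  critical_p <- alpha + 1.644854 * sqrt(alpha * (1 - alpha) / reps)
  mean(sw1 < alpha) < critical_p
}

lo <- 2; hi <- 1000                 # assumed search bounds
while (lo < hi) {
  mid <- floor((lo + hi) / 2)
  if (decision_pass(mid)) hi <- mid else lo <- mid + 1
}
lo                                  # smallest sample size that passed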
It doesn't seem right, as I get the following results (overall algorithm):
Reps     n1
1000     473
2000     494
4000     938
8000     969
16000    1000
 
#7
So did you suggest that I use 0.05 in the SW test but alpha = 0.07 in the decision criterion?
No, I just sum the number of significances. Under the null they should be around 5 percent. If it is much higher, then the test is not so good.

A significant Shapiro-Wilk test together with an achieved significance level close to 5% indicates that the test is robust to non-normality.
 

obh

Active Member
#8
I use a success criterion. If it fails, I try a bigger sample size until it passes.

Did you suggest in your "spoiler" that I should fix the 0.05 based on the number of repeats, using the confidence interval bound instead of 0.05?
Code:
critical_p <- 0.05 + 1.644854 * sqrt(0.05 * (1 - 0.05) / reps)
# reps - the number of repeats

I tried it, and it doesn't seem to converge, as you can see in the post above.
 
#9
Did you suggest in your "spoiler" that I should fix the 0.05 based on the number of repeats, using the confidence interval bound instead of 0.05?
No, I just wanted to point out that if a result is significant (coded as 1) or not (coded as 0), then the number of significances is binomially distributed, and you can do a test or a confidence interval for the proportion of significances. That will give you a clue as to whether the number of repeats is enough and whether it deviates from 0.05.
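For example, with illustrative numbers (620 significances out of 10,000 repeats under the null):

Code:
# Exact test and confidence interval for the proportion of significances
binom.test(x = 620, n = 10000, p = 0.05)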

I use a success criterion. If it fails, I try a bigger sample size until it passes.
Is that a criterion that is based on sequential inference?

Or is it that when statisticians do simulation, they tend to forget that they are statisticians?
 

obh

Active Member
#10
Hi Greta,

Is that a criterion that is based on sequential inference?
Or is it that when statisticians do simulation, they tend to forget that they are statisticians?
I use the binary search to reduce the number of checks.
No, I didn't forget the statistics... I understand that each additional iteration increases the overall probability of a mistake: 1 - (1 - p(mistake))^iterations.

Now the question is what is the criterion for one iteration?
For normal data, I expect the proportion of significant (not normal) results to be p = 0.05.
I also understand that, assuming a normal approximation, the CI for the proportion of significances is: 0.05 ± 1.96*sqrt(0.05*(1-0.05)/reps).
I allowed a confidence level of 0.8, CI: 0.05 ± 1.28*sqrt(0.05*(1-0.05)/reps).

I thought about the following (a rough code sketch is below):
p - the proportion of significant SW results (error rate)
If the CI is [L, R] (for example L = 0.041, R = 0.0588):
p > R: FALSE - does not distribute normally.
p < L or reps > 15,000: TRUE - distributes normally.

Otherwise: run the check again with 2*reps.
The "otherwise" branch may run more than once.
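A rough sketch of that doubling rule, with sw_error_rate() standing in for the Shapiro-Wilk simulation above (a hypothetical helper that returns the proportion of significant SW tests for a given number of repeats):

Code:
# Hypothetical sequential decision: double the repeats until the observed error
# rate p is clearly outside or inside the 80% band around 0.05.
sequential_decision <- function(sw_error_rate, reps = 1000, max_reps = 15000,
                                alpha = 0.05, z = 1.28) {
  repeat {
    p    <- sw_error_rate(reps)                 # proportion of significant SW tests
    half <- z * sqrt(alpha * (1 - alpha) / reps)
    if (p > alpha + half) return(FALSE)                    # not normal
    if (p < alpha - half || reps > max_reps) return(TRUE)  # treat as normal
    reps <- 2 * reps                                       # "other": rerun with more repeats
  }
}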

Does this make sense, or do you have a better idea? Thanks
 
#11
Maybe I have misunderstood what the question is here.
What sample size is large enough so I can assume normality based on the Central Limit Theorem? Some write 30, some write 100.
I believed that the question was: when can I use a test of whether the population mean mu1 is equal to mu2, when the sample size is n and the test is based on the assumption of normality of the sample means?

Example: you take a sample of size n from the exponential distribution and compare it with another sample from the exponential distribution.

Then, if the t-test is not so sensitive to non-normality, I would say that the t-test is OK (if the observed proportion of significances is around 5%).

So if the sample size of n=30 from two exponentials gives an error rate of about 5% when tested with a t-test, then I would say that the sample size of 30 is large enough.
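For example, something like:

Code:
# Two samples of n = 30 from the same exponential: under the null, the t-test
# rejection rate should be close to the nominal 5%.
reps  <- 10000
pvals <- replicate(reps, t.test(rexp(30), rexp(30))$p.value)
mean(pvals < 0.05)   # achieved significance level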

But suppose that you want to do a confidence interval for the variance when you have taken a sample of 30 from the exponential distribution. I believe that I have read that this would be sensitive to deviations from normality, so that a confidence interval for the variance would cover the true value less often than 95%. Therefore I would say that 30 is not enough.

Here it seems to me that @obh wants to test if the mean is normal. The Shapiro-Wilk test will be sensitive to that. That is what the test is designed to do - to detect non-normality. If it could not detect non-normality, it would be useless. But the t-test is not sensitive to non-normality. For the t-test, 30 is enough.

So, as usual, it depends, (on what you want to do).
 

obh

Active Member
#12
Hi Greta,

Maybe I have misunderstood what the question is here.
I believed that the question was: when can I use a test of whether the population mean mu1 is equal to mu2, when the sample size is n and the test is based on the assumption of normality of the sample means?
Yes and no. You misunderstood the question, but in the back of my mind there was also the next question about the t-test :)

So if the sample size of n=30 from two exponentials gives an error rate of about 5% when tested with a t-test, then I would say that the sample size of 30 is large enough.
The skewness of the exponential distribution is 2. What about more skewed data, with a skewness of 5 or 6?
Will a sample size of 30 still be okay for the t-test?
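(One could check that with the same kind of simulation, e.g. using a Gamma distribution with shape 0.16, whose skewness 2/sqrt(0.16) is 5:)

Code:
# Same t-test check, but with a much more skewed distribution:
# Gamma(shape = 0.16) has skewness 2/sqrt(0.16) = 5.
reps  <- 10000
pvals <- replicate(reps, t.test(rgamma(30, shape = 0.16),
                                rgamma(30, shape = 0.16))$p.value)
mean(pvals < 0.05)   # achieved significance level at sample size 30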

Thanks
 

obh

Active Member
#14
Of course, I will :)

Back to the question: what error rate around 0.05 is acceptable when using a significance level of 0.05?
Is there any agreed method?