Central Limit Theorem Indicative of Underlying Distribution?

Hi All,

I've been reading threads off and on for a couple days, and I can't get a good feel for how to answer this question I've got stuck in my head. I'm sure the answer is in here, but I'm an engineer, not a statistician, so I can't make the leap to apply the posts I've read to the question I have.

My question relates to the Central Limit Theorem and how it can be applied to glean information about the underlying distribution. I understand that if you take the mean of a large enough sample of data, it will be close to the underlying population mean, and its variability is governed by the population standard deviation. However, what does this really tell me? If the underlying distribution is normal, then I can understand that the results could be directly applicable and could help me predict the underlying distribution, but what if the underlying distribution is not normal? What value does being able to estimate the mean really have? Consider the below:

You have a bag full of identically shaped/sized stones. The stones each have a number printed on them from 1-10. I want to be able to determine the likelihood of pulling a stone with a '10' on it, and I'd also like to know what the underlying distribution is. Are the stones numbered following a uniform, normal, bimodal, or some other distribution? Can I even use the central limit theorem to answer this question, or do I simply need to take a certain number of samples to get a good estimate here? If I simply have to take samples, how many do I have to take?

I've coded up some Perl scripts that generate large volumes of "population data" based on distributions that I define, which I then 'sample' and run various tests on, but I was hoping that a real stats person could help shed some light on the actual theory and math behind this, since my empirical tinkering really isn't giving me the direction I'd like.

Thanks for any help!


TS Contributor
My understanding of the CLT is that if you take samples from your population, which may have any distribution whatsoever, the mean of the sample will be approximately normally distributed. So, in your example, if you take 10 stones, say, calculate the mean of the ten, then put the stones back and repeat - the mean of the distribution of the calculated means will be 5.5, with a standard deviation of about 2.87/sqrt(10) ≈ 0.91 (2.87 being the standard deviation of a discrete uniform distribution on 1-10, i.e. sqrt((10^2 - 1)/12)). If the distribution of the numbers in your population is not uniform but something else, then the mean and standard deviation of the sample means will change accordingly, but the distribution will stay approximately normal.
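This resampling scheme is easy to check by simulation; here is a minimal sketch in Python (the repetition counts are arbitrary, chosen only to make the estimates stable):

```python
import random
import statistics

# Draw 10 stones (with replacement) from a bag numbered 1-10 uniformly,
# record the sample mean, put the stones back, and repeat many times.
random.seed(1)
sample_means = []
for _ in range(100_000):
    draw = [random.randint(1, 10) for _ in range(10)]
    sample_means.append(statistics.mean(draw))

# The CLT says these means should cluster around the population mean
# (5.5) with SD about 2.87/sqrt(10) ~ 0.91.
mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
print(mean_of_means, sd_of_means)
```

A histogram of `sample_means` would show the familiar bell shape even though the underlying stone distribution is flat.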

AFAIK the CLT will not help you to figure out the underlying distribution; to do that you would need to run specific tests. What it might help you with is the other way of reading the CLT: if your data is generated by a process that is the sum of small independent random effects, then you can expect that your data will be approximately normally distributed.
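That "other reading" can be demonstrated directly: build a quantity as the sum of many small independent effects (here each effect is uniform, purely for illustration) and it comes out looking normal.

```python
import random
import statistics

# Sum 50 small independent uniform effects, many times over.
random.seed(2)
totals = [sum(random.uniform(-1, 1) for _ in range(50))
          for _ in range(50_000)]

# One quick normality check: a normal distribution puts about 68.3% of
# its mass within one SD of the mean.
mu = statistics.mean(totals)
sd = statistics.stdev(totals)
within_1sd = sum(abs(t - mu) <= sd for t in totals) / len(totals)
print(round(within_1sd, 3))  # close to 0.683
```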

Thanks for the reply, rogojel. In order for the central limit theorem to apply to an arbitrary distribution, the usual rule of thumb is a sample size of ~30. After some more reading, it appears that this number is based on empirical Monte Carlo simulation and has no firm theoretical basis. My understanding is that researchers basically took strongly skewed distributions (such as the exponential) and ran empirical studies to see how many samples needed to be taken to ensure an approximately normal distribution of the sample means.
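That rule of thumb is easy to probe empirically. A small sketch (sample sizes and repetition counts are arbitrary): draw sample means from an exponential population, whose skewness is 2, and watch the skewness of the sample-mean distribution shrink roughly as 2/sqrt(n).

```python
import random
import statistics

random.seed(3)

def skewness(xs):
    """Sample skewness: mean of standardized cubes."""
    mu = statistics.mean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - mu) / sd) ** 3 for x in xs) / len(xs)

skews = {}
for n in (2, 30, 200):
    # Distribution of the mean of n exponential(1) draws.
    means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
             for _ in range(20_000)]
    skews[n] = skewness(means)
    print(n, round(skews[n], 2))
```

At n = 30 the residual skewness is already modest, which is roughly where the "~30" convention comes from; for a mildly skewed population it would vanish even faster.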

Anyway, the reason I mention that is because I've written some more code that will basically do just this. I can take millions of collections of samples, each with 'n' discrete samples, and then figure out empirically how large 'n' has to be in order to estimate the population distribution with a certain degree of confidence. I'll then run this study for a myriad of different population distributions to get some guidance on how many samples are needed in the worst case.

So for example, I'll take a sample that has n=50 discrete stone selections. I'll then bucket this and count how many of those stones were '1', '2', etc. I normalize this number to get a percentage and then take another sample with n=50. I do this process say a few million times, and I can then build histograms for each of the possible stone values. So if I run 1 million collections of n=50, I now have 1 million samples showing the percentage of stones that were '5'. If I plot these percentages, I assume that per the central limit theorem, the mean of these percentages should converge to the true population proportion of '5' stones and that their distribution should be approximately normal. In other words, I simply need to make each collection large enough (set n high enough) that the proportions of the stone values I care about look normal.
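The bucketing procedure described above can be sketched as follows (this assumes a uniform bag for concreteness, and uses far fewer collections than the millions mentioned, just to keep it quick):

```python
import random
from collections import Counter

random.seed(4)
n = 50                 # stones per collection
collections = 10_000   # repeated collections (illustrative count)

# For each collection, record the fraction of stones that came up '5'.
fraction_of_fives = []
for _ in range(collections):
    counts = Counter(random.randint(1, 10) for _ in range(n))
    fraction_of_fives.append(counts[5] / n)

# For a proportion, the CLT says the mean of these fractions should be
# near the true proportion of '5' stones: 0.10 for a uniform bag.
mean_fraction = sum(fraction_of_fives) / collections
print(mean_fraction)
```

Repeating this with `counts[1]` or `counts[10]` gives the tail-stone histograms the post goes on to discuss.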

Unfortunately, I've found that for my initially assumed distribution (which was a skewed normal with a very small sigma), if I want to properly characterize the tail of the distribution (i.e., if I want to know the number of '1' and '10' stones with high accuracy), I need something on the order of 800-1000 samples per collection. Any fewer than this, and the distribution of the sample proportions for the '1' and '10' stones does not look normally distributed. If I have a distribution with a larger sigma, or if I don't care about the '1' and '10' stones and look only at stones with larger percentages, then a sample size of n=30 or even less is sufficient.
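There is a standard explanation for this: the count of, say, '10' stones in a collection of n is Binomial(n, p), whose skewness is (1 - 2p)/sqrt(n p (1 - p)). For a rare stone (small p), that skewness stays large until n itself is large, so the normal approximation kicks in late. A quick sketch (p = 0.01 is an illustrative tail probability, not a number from this thread):

```python
import math

def binom_skewness(n, p):
    # Skewness of a Binomial(n, p) count; ~0 means nearly normal.
    return (1 - 2 * p) / math.sqrt(n * p * (1 - p))

for n in (30, 100, 1000):
    print(n, round(binom_skewness(n, 0.01), 2))
# A common rule of thumb asks for n*p >= 5 (some texts say 10), i.e.
# n >= 500-1000 when p = 0.01 -- consistent with the 800-1000 figure
# found empirically above.
```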

I don't know if any of the above makes sense, but I do think I have a way of figuring this out, at least from an empirical data collection standpoint. I'd still like more input since I'd like a better way to determine the tails of a population distribution without having to take 1000 samples. I was hoping that maybe the hypothesis testing aspect of the CLT could somehow be applied, but that's where my comprehension of how to apply the theorem falls a bit flat.


Less is more. Stay pure. Stay poor.
I apologize for not reading the entirety of your last post, so I may be missing part of your thoughts, but I think in general you may be blending components of the CLT and the Law of Large Numbers.