Very very weird observation...

#1
:confused: I have a data set of 15,000 points. Its variance is 2.7950. I plot the histogram and it looks really quite Gaussian (Gaussian, not heavy-tailed Student t).

I take the mean of every 2 data points, which gives me 7,500 means. Their variance is 1.4444 -- about 1/2 of 2.7950, which seems reasonable.

I take the mean of every 5 data points, which gives me 3,000 means. Their variance is 0.5889 -- about 1/5 of 2.7950, which also seems reasonable.

Similarly, 10-wise mean: variance 0.2757 -- about 1/10 of 2.7950 -- reasonable.

Now, 20-wise mean: variance 0.1010 -- ONLY about 1/30 of 2.7950.

60-wise mean: variance 0.0133 -- ONLY about 1/200 of 2.7950.

I got the data from sensor measurements in an experiment, so there was no deliberate tampering with the data. How come, when means are taken over large samples, the variance of the resulting means shrinks much faster than it reasonably should? What could be causing this?
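In MATLAB terms, the check I'm doing looks roughly like this (x stands in for my actual 15,000 readings; here it's just filled with synthetic normal data):

```matlab
% Sketch of the block-mean check; x stands in for the real sensor data.
x = randn(15000, 1);   % placeholder -- the real measurements go here
for n = [2 5 10 20 60]
    m  = floor(numel(x)/n);                 % number of complete blocks
    xb = mean(reshape(x(1:m*n), n, m), 1);  % n-wise block means
    fprintf('n = %2d: var of means = %.4f, var(x)/n = %.4f\n', ...
            n, var(xb), var(x)/n);
end
```

For i.i.d. data the two columns should agree; mine agree up to about n = 10 and then fall apart.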
 
#2
Kazec, you should consider the formula you are using to take the mean of means:

I am assuming you have your data ordered from x_1 to x_15000, and that you are taking the ordered block means xbar_i = (x_i + ... + x_{i+n-1})/n, where n is the size of the blocks you are averaging over.

Then you are computing the grand mean xbar_bar = SUM(xbar_j) for j = 1 to N/n, divided by N/n, where N is the total sample size.

If you write that out in summation notation and fiddle around with the first and second derivatives, or perhaps graph your formula in Maple or Mathematica or whatever software you have, you should get some idea of what is going on.
 
#4
Hi Ironman. I'm not taking means of sorted data. The data are used in the sequence they were acquired, and if I plot them against time, the plot really looks like white Gaussian noise. The autocorrelation is almost 0. The histogram looks normal.

I used MATLAB to calculate the variances. For small sample size n, the variance of the means changes as predicted by the properties of the normal distribution. When n becomes big, say n > 20, the decrease in variance becomes much faster, as you can see in the numbers I posted. ... How is that possible...?

The MATLAB function VAR is fine. I generated some normal data and calculated their VAR, and the results are reasonable! But my experimental data are not, even though in many other ways they look well behaved!
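For the record, the sanity check was just this (synthetic data only):

```matlab
% Sanity check: i.i.d. Gaussian data with the same variance as my
% measurements behave exactly as predicted, so VAR is not the culprit.
z = sqrt(2.7950) * randn(15000, 1);
for n = [2 5 10 20 60]
    m  = floor(numel(z)/n);
    zb = mean(reshape(z(1:m*n), n, m), 1);   % n-wise block means
    fprintf('n = %2d: var of means = %.4f (expected %.4f)\n', ...
            n, var(zb), var(z)/n);
end
```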
 
#6
Hi. As for what I'm actually doing, it's a long story; basically it's about data fusion, and I won't try your patience by putting down the details. But since means over large samples exhibit unreasonably small variances, my Gaussian assumption on the experimental data seems invalid. There must be a flaw in the assumption, and the assumption is a theoretical basis for my research.

However, in all the other aspects we can think of, the data look quite Gaussian... I've been thinking it through, but so far in vain...
 
#7
Kazec, here's a thought: look at the variance of the 7,500 means (n=2), the 5,000 means (n=3), the 3,750 means (n=4), etc., and record the variance each time. Then do some analysis on how the variance changes with block size; finding a direct relationship may give you more insight into what is going on. (I'd be curious what kind of relationship it is -- for i.i.d. data the variance should fall like some constant A/n, i.e. the standard deviation like A/sqrt(n).)
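Something like this sketch, say (assuming your data sit in a vector x):

```matlab
% Sweep the block size n and fit the scaling of the block-mean variance.
ns = 2:60;
v  = zeros(size(ns));
for k = 1:numel(ns)
    n    = ns(k);
    m    = floor(numel(x)/n);
    v(k) = var(mean(reshape(x(1:m*n), n, m), 1));  % var of n-wise means
end
p = polyfit(log(ns), log(v), 1);   % log-log fit: v ~ A * n^p(1)
fprintf('fitted exponent = %.2f (i.i.d. data would give about -1)\n', p(1));
loglog(ns, v, 'o', ns, exp(polyval(p, log(ns))), '-');
xlabel('block size n'); ylabel('variance of block means');
```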

It's either that or start doing some hardcore mathematical statistics on the variance formula MATLAB is using. At least that's my thought.
 
#8
Ironman, thanks for following the thread. I take the point in your last post: means over large samples give unreasonably small variance, while the marginal density of my data looks Gaussian.

I thought about the experiment carefully and came up with this idea. The data were acquired from a sensor measuring a SUPPOSEDLY constant quantity (temperature, to be specific). The temperature in fact keeps changing, though on a very small scale compared to the sensor noise power. So the data should really be seen as a sum of Gaussian noise and some signal fluctuation. I was fooled by the Gaussian histogram, which concealed the fact that there was a non-Gaussian component hidden in the data.
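To convince myself a histogram really can hide such a component, I cooked up a toy example (not my real data; here the hidden part is a slow sinusoid, so the block-mean variance happens to deviate upward rather than downward -- the point is only that the histogram can't tell):

```matlab
% Toy example: small slow fluctuation buried in Gaussian noise.
t      = (1:15000)';
signal = 0.3 * sin(2*pi*t/2000);             % slow, small-scale drift
y      = signal + sqrt(2.7950) * randn(15000, 1);
hist(y, 50);                                  % still looks Gaussian
for n = [10 60]
    m  = floor(numel(y)/n);
    yb = mean(reshape(y(1:m*n), n, m), 1);    % n-wise block means
    fprintf('n = %2d: var of means = %.4f vs var(y)/n = %.4f\n', ...
            n, var(yb), var(y)/n);
end
```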
 
#10
So here is a further observation.

If you are familiar with control theory, the simulation is easy to understand.

Construct a system with 1/(s+1) in the forward path and unity negative feedback. Call the output of 1/(s+1) "x". Introduce Gaussian noise at "x" so that the fed-back quantity is "y = x + noise". Simulations show that y exhibits most Gaussian properties, except that means over large samples have wackily small variance.

Since y is the sum of the noise and "x", I'm almost sure it's "x", the low-pass filtered quantity, that contributes the non-normality. But I still don't see the picture clearly. Any ideas?
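In discrete time, the loop boils down to something like this (a crude forward-Euler sketch with an assumed step size, not my original simulation code):

```matlab
% Plant 1/(s+1) with unity negative feedback, zero reference, and
% Gaussian noise w added to the plant output x before feedback.
% Plant input: u = -(x + w), so  x' = -x + u = -2x - w.
dt = 0.05; N = 15000;
w  = randn(N, 1);                  % Gaussian noise
x  = zeros(N, 1);
for k = 1:N-1
    x(k+1) = x(k) + dt * (-2*x(k) - w(k));   % Euler step of x' = -2x - w
end
y = x + w;                          % the fed-back quantity
for n = [2 10 60]
    m  = floor(N/n);
    yb = mean(reshape(y(1:m*n), n, m), 1);   % n-wise block means
    fprintf('n = %2d: var of means = %.4f vs var(y)/n = %.4f\n', ...
            n, var(yb), var(y)/n);
end
```

For small n the two numbers track each other; for large n the variance of the means drops well below var(y)/n, just like my sensor data.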
 
#11
Have you tried sampling just your noise? It may sound weird, but it could possibly explain what you're seeing in your y observations. I'm sorry if this isn't helpful, but my knowledge of control theory is limited.
 
#12
OK, I did a little experiment. I put my temperature sensor in boiling water, so that the true temperature is kept constant and the sensor reading is just constant + noise. The data collected this way really do behave better than those I collected from the output of a controlled heater. Now I'm fairly sure it's the change in the underlying quantity (temperature, in this case) that causes the problem. But how it does that, I still don't know.