Confidence interval for single sample analysis

#1
Hi,
I work in a phytochemistry lab where substances are analysed by gas chromatography to estimate quantities of individual compounds in a sample. I want to work out how much variation there is in our process. To do so, I prepared three samples from the same source of tea tree oil, and injected each sample six times into the GC machine. Each injection was analysed for the quantity of 15 different compounds. So for each compound there are 18 measurements. I've calculated the mean and standard deviation for each of these.

Normally we would only analyse a given sample once. I want to use the above data to be able to say that the score we measured for a given compound is accurate within a given range. How can I work out what that range is?
 

obh

Active Member
#2
One tree => 3 main samples => 6 measurements for each main sample?

So one source of random variation is the different location in the tree?
And another source of random variation is the measurement tool?
 
#3
No, the tea tree oil is distilled already. It's all one source, already bottled. I just made three preparations to determine if my preparation process introduces variation. The other variation is in the injection and measurement process in the gas chromatograph.
 

obh

Active Member
#6
Hi PN,

For next time, it would be better to have the data instead of a photo :)

Just by looking, it looks random.

I checked the first column (a-pinene).
I ran a one-way ANOVA to compare the averages of the 3 groups. The p-value is fairly low but not significant (0.1979).
Since the sample size is small, the test has little power, and as you can see in the histogram there may still be a difference between the groups.

I ran Shapiro-Wilk and the p-value is also fairly low but not significant. Similarly, the power can't be high, and if you look at the histogram it doesn't look like a normal distribution.

Practically, I don't see a reason not to use the t confidence interval, assuming normality.
Mean confidence interval: [2.643711 , 2.650597]
2.647154 ± 0.00344308




confidence example.png
http://www.statskingdom.com/40_confidence_interval.html
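
For anyone who wants to reproduce this kind of interval, here is a minimal Python sketch of a t confidence interval for a mean. The 18 values are made up for illustration (they are not the a-pinene data from this thread), the function name `t_confidence_interval` is my own, and 2.1098 is the tabulated two-sided 95% t value for 17 degrees of freedom. Like the interval above, it treats the 18 measurements as independent.

```python
import math
import statistics

# Two-sided 95% critical value of the t distribution with 17 degrees of
# freedom (n = 18), taken from a standard t table.
T_CRIT_95_DF17 = 2.1098


def t_confidence_interval(values, t_crit):
    """Return (mean, half_width) of a t interval for the mean.

    t_crit must match the confidence level and len(values) - 1 degrees
    of freedom; it is passed in so that no extra library is needed.
    """
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator)
    half_width = t_crit * sd / math.sqrt(n)
    return mean, half_width


# Hypothetical area-percent values: 3 preparations x 6 injections.
values = [
    2.641, 2.643, 2.640, 2.644, 2.642, 2.641,
    2.655, 2.653, 2.656, 2.654, 2.652, 2.655,
    2.646, 2.648, 2.645, 2.647, 2.649, 2.646,
]

mean, half_width = t_confidence_interval(values, T_CRIT_95_DF17)
print(f"mean = {mean:.6f}")
print(f"95% CI = [{mean - half_width:.6f}, {mean + half_width:.6f}]")
```

For a different sample size or confidence level the critical value has to be looked up again.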
 
#7
OK, thanks for your input. For the record, I cut and pasted straight from Excel; it was the website that interpreted it as an image.

Is this confidence interval based on 95% confidence? Also, if I work this out for all compounds, then work out the confidence interval as a percentage of the mean of each (eg 0.00344308 / 2.647154 * 100 = 0.1300676%), can I then average these values to give a rough generalised confidence interval?
 
#8
It's all one source, already bottled.
So, there is just one product. The mean is the same.

I just made three preparations to determine if my preparation process introduces variation.
So the n for preparations is 3 (n_prep = 3).

The other variation is in the injection and measurement process in the gas chromatograph.
And the number of measurements is 6 (n_measure = 6).

It seems that the poster wants to estimate the variance components. So how will the total variation split up into preparation variation ( (sigma_prep)^2 ) and measurement variation (say (sigma_measure)^2 )?

Search for variance components.

Of course it is good to investigate the sources of variation.
(But still, to estimate a mean, it is better to have many products (now you have one) and few preparations and few measurements than the other way around: few products, many preparations and even more measurements.)
 

obh

Active Member
#9
So, there is just one product. The mean is the same.
It seems that the poster wants to estimate the variance components. So how will the total variation split up into preparation variation ( (sigma_prep)^2 ) and measurement variation (say (sigma_measure)^2 )?
Search for variance components.
Hi @GretaGarbo

Interesting :) Thanks for your nice input as always.

As I understand the goal is to calculate the confidence interval of each component.
Due to the two-step process, it seems that the data distribution is not normal (as expected).
But at least for the first component it is reasonably symmetrical.

So practically, wouldn't it be okay to use the t confidence interval (at least for the first component)?
 
#10
As I understand the goal is to calculate the confidence interval of each component.
Yes, a confidence interval for each variance component.

A confidence interval for a variance is given in this link.

Due to the two-step process, it seems that the data distribution is not normal (as expected)
But a model for the data can be:

y = mu + epsilon + u

Where y is the measured value, mu is the population mean, epsilon is the preparation random variable (with expected value of zero) and u is the measurement error (with expected value of zero). Now you want to estimate the variance of epsilon and the variance of u.

I agree that y is not normally distributed, but epsilon and u can be normal.

An estimate of mu and its confidence interval are relatively robust to deviations from normality. But confidence intervals for a variance are not so robust (as I remember it); they are fairly sensitive to the assumption of normality. But it can be good to do such a calculation. (Often the interval is longer than one would initially expect.)

There is an ugly thing with this. When doing the calculations, the estimate of the variance of epsilon can be negative. Of course a variance cannot be negative, but it can be zero or very small. An alternative is maximum likelihood estimation, but it is still a mess (especially if the population variance is zero).
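
To make the calculation concrete, here is a minimal Python sketch of the classical ANOVA (method of moments) estimates for the model y = mu + epsilon + u with balanced data. The function name `variance_components` and the numbers are hypothetical, not the data from this thread. It returns the raw estimate of the preparation variance as well as the version truncated at zero, so the negative case mentioned above is visible.

```python
import statistics


def variance_components(groups):
    """ANOVA estimates for a balanced one-way random effects model.

    groups: one list of measurements per preparation, all equally long.
    Returns (var_prep, var_measure, var_prep_raw), where var_prep is the
    raw estimate truncated at zero.
    """
    a = len(groups)      # number of preparations
    n = len(groups[0])   # measurements per preparation
    if any(len(g) != n for g in groups):
        raise ValueError("balanced data expected")

    group_means = [statistics.mean(g) for g in groups]
    grand_mean = statistics.mean(group_means)

    # Mean squares between and within preparations.
    ms_between = n * sum((m - grand_mean) ** 2 for m in group_means) / (a - 1)
    ms_within = sum(
        (y - m) ** 2 for g, m in zip(groups, group_means) for y in g
    ) / (a * (n - 1))

    # E[ms_within] = var_measure, E[ms_between] = var_measure + n * var_prep.
    var_measure = ms_within
    var_prep_raw = (ms_between - ms_within) / n  # can come out negative
    return max(var_prep_raw, 0.0), var_measure, var_prep_raw


# Hypothetical data: 3 preparations x 6 injections.
groups = [
    [2.641, 2.643, 2.640, 2.644, 2.642, 2.641],
    [2.655, 2.653, 2.656, 2.654, 2.652, 2.655],
    [2.646, 2.648, 2.645, 2.647, 2.649, 2.646],
]
var_prep, var_measure, var_prep_raw = variance_components(groups)
print(f"preparation variance: {var_prep:.3e}")
print(f"measurement variance: {var_measure:.3e}")
```

With only three preparations the preparation variance is estimated from two degrees of freedom, so the estimate is very imprecise whatever method is used.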
 
#11
Also, if I work this out for all compounds, then work out the confidence interval as a percentage of the mean of each (eg 0.00344308 / 2.647154 * 100 = 0.1300676%), can I then average these values to give a rough generalised confidence interval?
I can not see any basis for this.

That is supposed to be the coefficient of variation (cv). cv = 100*sigma/mean.

Why would the cv be constant and the same for all substances? What is the basis for that? It would be nice to see some evidence from chemical data to evaluate whether this is a good rule of thumb.

Second, if the cv is constant, why would the mean of the cvs for the different compounds be a good estimator? In what way would that give a 95% confidence interval?
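
One thing worth separating here: the half-width of a confidence interval divided by the mean is not the same quantity as the cv. The sketch below, in Python with made-up numbers (not the thread's data) and helper names of my own, computes both for one set of 18 values. For a 95% t interval the ratio of the two is t/sqrt(n), which is about 0.5 for n = 18 and shrinks as n grows, while the cv does not depend on n in that way.

```python
import math
import statistics

T_CRIT_95_DF17 = 2.1098  # tabulated two-sided 95% t value for 17 df


def cv_percent(values):
    """Coefficient of variation: 100 * sample SD / mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)


def relative_half_width_percent(values, t_crit):
    """Half-width of the t interval for the mean, as a percentage of the mean."""
    n = len(values)
    half_width = t_crit * statistics.stdev(values) / math.sqrt(n)
    return 100 * half_width / statistics.mean(values)


# Hypothetical area-percent values for one compound (n = 18).
values = [
    2.641, 2.643, 2.640, 2.644, 2.642, 2.641,
    2.655, 2.653, 2.656, 2.654, 2.652, 2.655,
    2.646, 2.648, 2.645, 2.647, 2.649, 2.646,
]

cv = cv_percent(values)
rel_hw = relative_half_width_percent(values, T_CRIT_95_DF17)
print(f"cv = {cv:.4f}%")
print(f"relative CI half-width = {rel_hw:.4f}%")
print(f"ratio = {rel_hw / cv:.4f}")
```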

- - -

I know that chemists are fond of taking n=3, but why? Sometimes you need n = 10000 and sometimes n=1.

In an ANOVA, based on the normal distribution, the standard deviation is supposed to be constant. It is the nuisance parameter. But in the gamma distribution the cv is constant and is the nuisance parameter. These compounds cannot be negative. A normally distributed variable can be negative, so in a way it does not fit. But a gamma-distributed variable cannot be negative. So why not use the gamma distribution?

I don't know if it is possible to extract a sort of cv from variance components, but it is worth testing.

I hypothesize that it is possible to do a general likelihood test over all samples and all compounds to test constancy of the cv, and to do a maximum likelihood estimate of its value.
 
#12
Most of this conversation is going over my head because I'm not a statistician, but one thing I'm curious about is why the data wouldn't be a normal distribution? It seems like it should be, so I'm wondering if it just doesn't look normal because I don't have enough samples?
 

obh

Active Member
#13
Most of this conversation is going over my head because I'm not a statistician, but one thing I'm curious about is why the data wouldn't be a normal distribution? It seems like it should be, so I'm wondering if it just doesn't look normal because I don't have enough samples?
I don't think so.

The experiment is a two-step process and each step adds its own variance.
So the question should be the opposite: "why should it be normally distributed"?

Surely a simple example will be easier to understand. If you sample the weights of 6 ants, 6 dogs and 6 elephants, will this data be normally distributed?

I thought that if you sample 6 donkeys, 6 horses and 6 zebras you may assume the distribution is symmetrical and close enough to normal, so the t distribution will give "reasonable" results. But Greta suggested a better model.

Independently... In many cases, when the sample is big enough, the sample mean will be nearly normally distributed (read about the central limit theorem).
So probably in your experiment, with more preparations and more measurements, the mean will be close to normally distributed.
 
#14
I agree that y is not normally distributed, but epsilon and u can be normal.
Hey, wait a minute!

The sum of two (jointly) normally distributed variables will be normally distributed, with variance equal to the sum of the variances (plus 2 times the covariance) and with mean equal to the sum of the means. So I was wrong when I said that y would not be normal. If epsilon and u are normally distributed, then y is too.
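
A quick simulation can be used to check the variance part of this. The sketch below is in Python, and all of the numbers (the mean and the two standard deviations) are hypothetical. It draws epsilon and u independently, so the covariance term is zero, and compares the variance of y with the sum of the two variances.

```python
import random
import statistics

random.seed(1)  # fixed seed so that the run is reproducible

MU = 2.647              # hypothetical true mean
SIGMA_PREP = 0.006      # hypothetical SD of the preparation effect (epsilon)
SIGMA_MEASURE = 0.002   # hypothetical SD of the measurement error (u)
N = 50_000              # number of simulated values

# Each y gets its own independent epsilon and u.
y = [
    MU + random.gauss(0, SIGMA_PREP) + random.gauss(0, SIGMA_MEASURE)
    for _ in range(N)
]

expected_var = SIGMA_PREP**2 + SIGMA_MEASURE**2
observed_var = statistics.variance(y)
print(f"sum of the two variances: {expected_var:.3e}")
print(f"variance of simulated y:  {observed_var:.3e}")
```

The two printed values should agree closely, up to simulation noise.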


Most of this conversation is going over my head
What software are you using? Maybe we can suggest something to make it easier to estimate the variance components.
 
#16
So the t confidence interval will be good after all ...
Yes you can make a confidence interval for the overall mean. But as I have tried to say above, I believe that the original poster is more interested in how it varies between preparations and measurements.

And you need a bigger preparation sample size.
That is often said here on talkstats. And I don't agree, because we don't know how precise these estimates would be and we don't know how precise the original poster wants them to be.

I think it would be good if someone could find a good link with examples about how to calculate variance components.
 

obh

Active Member
#17
Yes you can make a confidence interval for the overall mean. But as I have tried to say above, I believe that the original poster is more interested in how it varies between preparations and measurements..
Yes, I understood what you wrote. @Phenomniverse should say what he is interested in.
Or do you mean he should be interested in how it varies between preparations and measurements? Like choosing how to divide the effort between preparations and measurements, for better optimization :) ?

That is often said here on talkstats. And I don't agree because we don't know how precise these estimates would be and we don't know how precise the original poster wants them to be.
Generally, I agree with what you said for a sample with one level. But in this special case, I analyzed the a-pinene and there is a big variance between the preparations and a smaller one between the measurements, so it doesn't make sense to do 3 preparations and 6 measurements: you have a sample size of 18, but it is more like a sample size of 3.
Whoa, I think I understand your point :) As I understand it intuitively, you want to let the model choose?
 
#18
I analyzed the a-pinene and there is a big variance between the preparations and smaller between the measurements,
OK, so there is a large variance between preparations and a smaller one between measurements. But the original poster (OP) did not know that.


so it doesn't make sense to do 3 preparations and 6 measurements. as you have a sample size of 18 and it is more like a sample size of 3.
Again, the original poster (OP) did not know that. Besides, if it costs very little to make many measurements, then it is good to measure many. Otherwise, it is best to do two measurements per preparation and many preparations if you are interested in variance components. If you are not, it is better to have one measurement per preparation and many preparations.
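
The trade-off can be put into numbers. Under the model y = mu + epsilon + u, the variance of the overall mean is sigma_prep^2 / n_prep + sigma_measure^2 / (n_prep * n_measure). Here is a small Python sketch comparing three ways of spending 18 injections. The variance components are hypothetical, chosen so that the preparation variance dominates, and the function name is my own.

```python
def variance_of_mean(n_prep, n_measure, var_prep, var_measure):
    """Variance of the overall mean for n_prep preparations with
    n_measure injections each, under y = mu + epsilon + u."""
    return var_prep / n_prep + var_measure / (n_prep * n_measure)


# Hypothetical variance components (preparation variance dominates).
VAR_PREP = 3.6e-5
VAR_MEASURE = 4.0e-6

# Three designs that all use 18 injections in total.
for n_prep, n_measure in [(3, 6), (9, 2), (18, 1)]:
    v = variance_of_mean(n_prep, n_measure, VAR_PREP, VAR_MEASURE)
    print(f"{n_prep:2d} preparations x {n_measure} injections: "
          f"variance of mean = {v:.3e}")
```

The measurement term is the same in all three designs because the total number of injections is fixed, so only the preparation term changes. With one injection per preparation the two components can no longer be separated, which is why two injections are suggested when the components themselves are of interest.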

I just hope she (he) can estimate these components.
 
#20
The purpose was twofold. Firstly, usually we will get a sample and do one preparation and one measurement. We want to be able to give a margin for error on the measurement we provide to the client. For example, if we measure the a-pinene to be 2.63938, should we say plus or minus 0.001, or 0.1, or what? I think the best way to estimate the precision of the measurement would be to calculate the relative standard deviation, that is the standard deviation as a percentage of the mean for each compound.
A second consideration was to see how much of the overall variation is contributed by the preparation process, and how much by the gas chromatography (the measurement process). I wondered if an ANOVA test would be appropriate here.
Finally, it might be relevant to point out that the values given are area percent values. That is, each compound value represents the area of a peak on a chromatography readout, expressed as a percentage of all the peaks that are integrated. The named peaks contribute about 90% of the total area integrated. So the values for the respective compounds are not entirely independent of each other.
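
For the first purpose (a margin for one preparation and one injection), a rough sketch of how the two variance components from the model discussed earlier in the thread could be combined is given below, in Python. Everything here is an assumption for illustration: the variance components and the mean are made up, the function names are my own, the components are treated as if they were known exactly, and 1.96 is the normal 95% value. With only three preparations the preparation component is very uncertain, so a properly calculated margin would be wider than this.

```python
import math


def single_measurement_margin(var_prep, var_measure, z=1.96):
    """Approximate margin for ONE preparation measured ONCE.

    A single result carries both sources of variation, so the two
    variances are added before taking the square root.
    """
    total_sd = math.sqrt(var_prep + var_measure)
    return z * total_sd


def relative_sd_percent(var_prep, var_measure, mean):
    """Total standard deviation as a percentage of the mean."""
    return 100 * math.sqrt(var_prep + var_measure) / mean


# Hypothetical values for one compound.
VAR_PREP = 3.6e-5
VAR_MEASURE = 4.0e-6
MEAN = 2.647

margin = single_measurement_margin(VAR_PREP, VAR_MEASURE)
rsd = relative_sd_percent(VAR_PREP, VAR_MEASURE, MEAN)
print(f"approximate margin: +/- {margin:.4f}")
print(f"relative standard deviation: {rsd:.3f}%")
```

This also ignores the point about area percent values: it treats each compound on its own.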