# standard deviation of multiple sample sets

#### shoes

##### New Member
I'm trying to determine the standard deviation of multiple sample sets (measurements A, B, C taken on 3 different days), for which I know the means and standard deviations (but not the individual values). In a related thread I saw the advice to check out "pooled standard deviation" at http://en.wikipedia.org/wiki/Pooled_standard_deviation, but this doesn't seem to fit my case. The means of A, B, and C vary a bit (with a standard deviation of their means of, say, 0.1), while each individual standard deviation (sA, sB, sC) is pretty tight (say 0.01 - fyi, these standard deviations reflect errors in the measurement device). The wiki link gives a formula based only on sA, sB, and sC, which not surprisingly gives a low standard deviation for the whole population. I know this can't be right, since the means of A, B, and C have greater variability.

#### Masteras

##### TS Contributor
The pooled variance is used in cases where you can assume that the variances do not differ significantly. You say they differ. Have you tested that hypothesis using Levene's test for equality of variances?

#### Dragan

##### Super Moderator
I'm trying to determine the standard deviation of multiple sample sets .... Thanks!

Let me just ask the following for clarification of your problem. Are you suggesting that your scenario is this:

Let X={x1,x2,…,xN1}, Y={y1,y2,…,yN2}, Z={z1,z2,…,zN3} denote 3 data sets with known means and standard deviations (not necessarily with equal sample sizes).

Let A be the union of these data sets, i.e.
A={x1,x2,…,xN1,y1,y2,…,yN2,z1,z2,…,zN3}.

Now, are you asking what the mean and standard deviation of A are when you don't have the data but know only the means and standard deviations of X, Y, and Z? Is the scenario I describe correct?

#### shoes

##### New Member
Yes, you have it right. I'd also say I wouldn't want to weight X, Y, and Z by the # of measurements of each, as you have effectively done in "A", but since each has the same # of measurements, this is moot. Put another way, consider I have 3 people, with 3 measurements of each one's height, and I'm given the mean value and standard deviation for each person. How does one calculate the standard deviation for the 3 people?
Thank you!

#### Dragan

##### Super Moderator
Yes, you have it right.
Thank you!
Okay, here are the formulae you need. They will give you the (exact) mean and variance as if you actually had the data. What you need to do is merge the data sets one at a time, carrying the merged result forward into the merge with the next data set.

mean=[n1 /(n1+n2)]*Xbar1 + [n2 /(n1+n2)]*Xbar2

variance=[ n1^2*Var1 + n2^2*Var2 - n1*Var1 - n1*Var2 - n2*Var1 - n2*Var2 + n1*n2*Var1 + n1*n2*Var2 + n1*n2*(Xbar1 - Xbar2)^2 ] / [ (n1+n2-1)*(n1+n2) ]

I’ll show an example for the means so you can get the idea on how to do this. This idea is the same for the variance (standard deviation).

Example: Suppose I have 3 data sets with:

Xbar1=5; Std.dev1=2; Var1=4; n1=10
Xbar2=15; Std.dev2=3; Var2=9; n2=15
Xbar3=8; Std.dev3=5; Var3=25; n3=20

Now to get the mean of the 3 data sets apply the first two sets of statistics

mean(1,2) = [10 /(10+15)]*5 + [15 /(10+15)]*15 =11

Now, use this result as follows:

mean(1,2,3) = [25 /(25+20)]*11 + [20 /(25+20)]*8 = 9.66666.

Now just apply this idea using the formula for variance above.

Obviously, in the end just take the sqrt of the variance to get the standard deviation for the merged (3) sets of data.

BTW, this idea is completely general for k sets of data.
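As a minimal sketch, the merge can be written as a small Python function and applied to the three example data sets above. The function name `merge_stats` and the `(n, mean, variance)` tuple layout are illustrative choices, not part of any library, and the variances are assumed to be sample variances (n-1 denominator).

```python
import math

def merge_stats(n1, mean1, var1, n2, mean2, var2):
    """Merge two groups, each given as (n, mean, sample variance)."""
    n = n1 + n2
    mean = (n1 / n) * mean1 + (n2 / n) * mean2
    # Numerator of the variance formula above, term by term.
    numerator = (
        n1**2 * var1 + n2**2 * var2
        - n1 * var1 - n1 * var2 - n2 * var1 - n2 * var2
        + n1 * n2 * var1 + n1 * n2 * var2
        + n1 * n2 * (mean1 - mean2) ** 2
    )
    var = numerator / ((n - 1) * n)
    return n, mean, var

# Example data from above: (n, mean, variance) for each data set.
groups = [(10, 5.0, 4.0), (15, 15.0, 9.0), (20, 8.0, 25.0)]

# Merge one set at a time, carrying the running result forward.
n, mean, var = groups[0]
for n_next, mean_next, var_next in groups[1:]:
    n, mean, var = merge_stats(n, mean, var, n_next, mean_next, var_next)

sd = math.sqrt(var)
print(n, mean, var, sd)
```

For these numbers the first merge gives a variance of 31.75, and the merged result for all three sets is a variance of 1337/44 (about 30.39), i.e. a standard deviation of about 5.51.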


#### shoes

##### New Member
Okay, here is the formulae you need.

Thanks for the formula. This weights each mean (and standard deviation) by the number of measurements of each, which is not exactly intuitive to me. For example, if I have 2 people, and A is 4' tall with 3 measurements, while B is 7' tall with 27 measurements, the mean height is:

mean(A,B) = [3/(27+3)]*4 + [27/(30)]*7 = 6.7 feet.

Very odd indeed, since I'd expect the mean to be at least near 5.5', but I'll take your word for it - perhaps an indication that one really should have equal numbers of measurements.
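The two conventions in play here can be sketched side by side. This is only an illustration with hypothetical helper names (`weighted_mean`, `unweighted_mean`): the first pools all 30 measurements, the second treats each person as a single unit regardless of how often they were measured.

```python
def weighted_mean(means, counts):
    """Mean of all measurements pooled together (weights = measurement counts)."""
    total = sum(counts)
    return sum(n * m for m, n in zip(means, counts)) / total

def unweighted_mean(means):
    """Mean of the per-person means (each person counts once)."""
    return sum(means) / len(means)

means = [4.0, 7.0]   # heights of A and B in feet
counts = [3, 27]     # number of measurements of each

print(weighted_mean(means, counts))  # about 6.7, the mean of the 30 measurements
print(unweighted_mean(means))        # 5.5, the mean of the 2 people
```

Which one is appropriate depends on whether the quantity of interest is the population of measurements or the population of people.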


#### dervast

##### New Member
Could you please tell me what this "theory" is called? I want to do the same, but I would like to study the theory first.

One more thing. Is it possible to have a more general equation that could be used for more groups? I have something like 10 such populations, so applying your equation 9 times is a little bit time consuming.

Best Regards
Alex.


#### BGM

##### TS Contributor
Suppose your data set has a total of $$r$$ groups, with
sample size $$n_i$$ for each group.

Furthermore, suppose you already have the
sample mean estimate
$$\bar{X_i} = \frac {\sum_{j=1}^{n_i}X_{ij}} {n_i}$$
and the sample variance estimate
$$\hat{\sigma}_i^2 = \frac {\sum_{j=1}^{n_i}(X_{ij}-\bar{X_i})^2} {n_i - 1}$$
for each group, i.e. $$i = 1, 2, ..., r$$

Then the pooled sample mean $$= \frac {\sum_{i=1}^r\sum_{j=1}^{n_i}X_{ij}} {\sum_{i=1}^rn_i} = \frac {\sum_{i=1}^rn_i\bar{X_i}} {\sum_{i=1}^rn_i}$$
and the pooled sample variance $$= \frac {\sum_{i=1}^r \sum_{j=1}^{n_i}(X_{ij}-\bar{X_i})^2} {\sum_{i=1}^r(n_i - 1)} = \frac {\sum_{i=1}^r (n_i - 1)\hat{\sigma}_i^2} {\sum_{i=1}^r(n_i - 1)}$$

It would be the same if you got the data in the form of the sufficient statistics
$$\sum_{j=1}^{n_i}X_{ij}, \sum_{j=1}^{n_i}X_{ij}^2$$ in each group i
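A minimal Python sketch of these two pooled formulas, assuming each group is supplied as a hypothetical `(n_i, mean_i, var_i)` tuple with `var_i` the sample variance:

```python
def pooled_stats(groups):
    """Pooled mean and pooled variance from per-group (n, mean, sample variance)."""
    total_n = sum(n for n, _, _ in groups)
    pooled_mean = sum(n * m for n, m, _ in groups) / total_n
    # Weighted average of the within-group variances (weights n_i - 1).
    pooled_var = (
        sum((n - 1) * v for n, _, v in groups)
        / sum(n - 1 for n, _, _ in groups)
    )
    return pooled_mean, pooled_var

# Dragan's example groups from earlier in the thread.
groups = [(10, 5.0, 4.0), (15, 15.0, 9.0), (20, 8.0, 25.0)]
pooled_mean, pooled_var = pooled_stats(groups)
print(pooled_mean, pooled_var)
```

For Dragan's example this gives a pooled variance of 637/42 (about 15.17). Note that the pooled variance uses only the within-group deviations, so the group means do not enter it at all.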

#### dervast

##### New Member
I would like to thank you for your reply. As Masteras said before, pooled statistics can only be used for samples whose variances do not differ too much. (Actually, how do you know that the variances do not differ too much to use this technique?)

In my case the mean value is the same and only the variances change.
Here are some typical examples from my study:
1) N(119,3)
2) N(119,12)
3) N(119,8)
4) N(119,30)

Best Regards
Alex.


#### avbferry

##### New Member
Could someone kindly respond to dervast's question above "Actually How do you know that the variances do not differ too much to use this technique?"

If the variances differ too much, what technique should we be using?

I am also interested to know.

Thanks!

#### wjt

##### New Member
For example, if I have 2 people, and A is 4' tall with 3 measurements, while B is 7' tall with 27 measurements, the mean height is:

mean(A,B) = [3/(27+3)]*4 + [27/(30)]*7 = 6.7 feet.

Very odd indeed
No, I do not agree with shoes' comments here. Statistics is a branch of mathematics and must always deal honestly with the data. If we measure A 3 times, we have 3 pieces of data. Since all 3 pieces of data enter the statistical process, I cannot accept that together they carry only 1 unit of weight. I would think your example is the same as having 3 persons of size A and 27 persons of size B. So what Dragan said was reasonable in this scenario. His idea was not very odd.


#### splictionary

##### New Member
Thank you Dragan, does this method have a name? Have been trying to work out this problem for a few days now.

Thank you wjt for the PDF, I can now call this "joint standard deviation".


#### Jo87

##### New Member
Soooo happy I found this discussion: I've had exactly the same problem a few days ago and couldn't find a solution. The replies of Dragan and wjt are very helpful.

I would also be very interested in a general equation (and a name of this method) to calculate the variance (as shown by Dragan). As I understand, the equation presented by BGM isn't the same since variance between the mean values is not considered?!?

Cheers,
Jo
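One way to write Dragan's pairwise merge for k groups in a single step is to add a between-group term to the within-group sum of squares: variance = [ sum (n_i - 1)*Var_i + sum n_i*(Xbar_i - Xbar)^2 ] / (N - 1), where Xbar is the overall weighted mean and N is the total sample size. That second sum is exactly what BGM's pooled variance leaves out. A minimal sketch, with the function name `combined_stats` and the `(n, mean, variance)` tuples as illustrative assumptions:

```python
import math

def combined_stats(groups):
    """Mean and variance of the union of k groups given (n, mean, sample variance)."""
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / total_n
    # Spread of the data around each group's own mean.
    within = sum((n - 1) * v for n, _, v in groups)
    # Spread of the group means around the overall mean.
    between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    variance = (within + between) / (total_n - 1)
    return total_n, grand_mean, variance

# Dragan's example groups.
groups = [(10, 5.0, 4.0), (15, 15.0, 9.0), (20, 8.0, 25.0)]
total_n, grand_mean, variance = combined_stats(groups)
print(total_n, grand_mean, variance, math.sqrt(variance))
```

For Dragan's example this gives the same variance, 1337/44 (about 30.39), as merging the three sets one pair at a time with his formula.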

#### bs0srj

##### New Member
Hi.
I wonder if you guys could help. I'm a scientist and looking to present my research.
I'm measuring two linked values, substance production and cell number, and present my research as a value of substance produced per cell. I have experimental data with 25 observations each for several different conditions, measuring the amount of substance produced and the number of cells per reaction (this varies depending on the condition, so it is not constant), each giving a mean and standard deviation. I then take the mean values from each set of observations to give the mean substance production per cell. However, I would also like to be able to present the standard deviation for the substance/cell value. I'm sure there must be an equation to let me combine the standard deviations of substance production and cell number to give an overall standard deviation, but I don't know what it is! Can anyone help?
Many thanks.

#### Jo87

##### New Member
Hi bs0srj,

Just for clarification purposes: You grew n batches of cells each under different conditions with 25 observations for cell number and substance produced each. Subsequently, you took the mean and standard deviation for cell number and substance produced for each batch, and calculated the quotient to derive the mean substance produced per cell for each batch?!

If you want to calculate the standard deviation for this quotient, you have to apply the rules of error propagation. For multiplication and division, assuming the errors in a and b are uncorrelated, the rule is as follows:

If $$c = a \cdot b$$ or $$c = \frac{a}{b}$$

then $$\frac{\sigma_{c}}{\left | c \right |} = \sqrt{\left( \frac{\sigma_{a}}{a}\right )^{2} + \left(\frac{\sigma_{b}}{b}\right )^{2}}$$

Also have a look here: http://en.wikipedia.org/wiki/Propagation_of_uncertainty
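A minimal sketch of the quotient rule in Python. The function name `ratio_with_sd` and the numbers are made up for illustration, and the rule is a first-order approximation that assumes the two quantities are uncorrelated; if substance production and cell number vary together, a covariance term would be needed.

```python
import math

def ratio_with_sd(a, sd_a, b, sd_b):
    """Return c = a / b and its propagated standard deviation.

    Assumes a and b are uncorrelated and both non-zero.
    """
    c = a / b
    relative_sd = math.sqrt((sd_a / a) ** 2 + (sd_b / b) ** 2)
    return c, abs(c) * relative_sd

# Hypothetical values: mean substance 10 (sd 1), mean cell number 5 (sd 0.5).
per_cell, per_cell_sd = ratio_with_sd(10.0, 1.0, 5.0, 0.5)
print(per_cell, per_cell_sd)
```

With these made-up inputs each quantity has a 10% relative standard deviation, so the quotient of 2.0 has a relative standard deviation of about 14%.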

Hope that helps!

#### srk

##### New Member
Dragan and wjt--your posts were very helpful and I used the equation to calculate standard deviation for the average of 3 samples with 3 replicate analyses each. I am reporting the results and was asked to use the equation in the report and also want to provide a citation.

Can I cite something formally and what should I refer to the result as that is understood by the statistical and scientific community? I have been referring to the value as the overall standard deviation, but is there a formal name? Is joint standard deviation used? I could not find a convention for naming this value and could not find this equation in any text book. I know these posts are going back awhile but I appreciate the help! I really need to provide a reference.