# 2 sample z test - Standardized Effect Size

#### obh

##### Member
Hi,

I'm comparing the averages of 2 groups when each group's standard deviation is known: σ1, σ2.

Cohen's effect size = |avg(x1) - avg(x2)| / σ pooled

How do you calculate σ pooled for a 2 sample z test?

σ pooled^2 = (n1σ1^2 + n2σ2^2)/(n1 + n2) {similar to the t-test}

or just a simple average, σ pooled = (σ1 + σ2)/2?

#### hlsmith

##### Not a robit
See Introduction to Meta-Analysis by Borenstein et al. - available online, page 26.

#### obh

##### Member
Thanks hlsmith

I can't see this page. Do you recommend buying this book?

#### hlsmith

##### Not a robit
I believe there are free pdfs available of this book on the web.

#### obh

##### Member
Hi hlsmith,

As I understand it, the book describes the standardized effect size for the 2 sample t test, where you use the sample standard deviations:

S pooled^2 = ((n1-1)S1^2 + (n2-1)S2^2)/(n1+n2-2)

The question is how to calculate Cohen's effect size for the 2 sample z test, when you know the standard deviations of the two groups: σ1, σ2.

The book treats only the case σ1 = σ2: effect size = (μ1-μ2)/σ

I know the basic assumption behind "pooling" is that the population standard deviations are the same and only the sample standard deviations differ. In this case we know that the population standard deviations are not the same, so is the average of the standard deviations more appropriate, or the average of the variances?
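
The sample-based pooled formula quoted above can be sketched in R as follows. This is only an illustration: the function name `pooled_sd_t` and the input values are made up for the example and do not come from the book.

```r
# Sample-based pooled SD, as used for the 2 sample t test effect size.
# s1, s2: sample standard deviations; n1, n2: sample sizes.
pooled_sd_t <- function(s1, s2, n1, n2) {
  sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
}

# Hypothetical inputs, chosen only for illustration.
pooled_sd_t(s1 = 5, s2 = 10, n1 = 30, n2 = 50)

# With equal sample sizes the weights cancel and the result is the
# root of the plain average of the two variances.
pooled_sd_t(s1 = 5, s2 = 10, n1 = 40, n2 = 40)  # same as sqrt((25 + 100) / 2)
```

Note that the weights come from the sample sizes, which is exactly the information that is missing when only σ1 and σ2 are given.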

#### hlsmith

##### Not a robit
No, I agree - if you need to address that, it needs some type of weighted pooling. I would imagine there should be a formula out there. You didn't see anything in that book? I don't recall ever doing that process myself. If I were you, I would continue combing the web to confirm your approach or find a formula. It seems like it would be a common process in meta-analyses.

#### ondansetron

##### TS Contributor
I think the Satterthwaite formula for calculating a t-test with unequal variances might be useful, because the gist is to pool and use the appropriate (approximate) degrees of freedom. This is just off the top of my head.
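
For reference, the Welch-Satterthwaite approximation is usually written as below. The notation is assumed here ($$s_i$$ are sample standard deviations, $$n_i$$ sample sizes), and it applies to the unequal-variance t-test rather than to a z-test with known σ:

$$t=\frac{\bar{x}_1-\bar{x}_2}{\sqrt{s_1^{2}/n_1+s_2^{2}/n_2}},\qquad \nu\approx\frac{\left(s_1^{2}/n_1+s_2^{2}/n_2\right)^{2}}{\dfrac{(s_1^{2}/n_1)^{2}}{n_1-1}+\dfrac{(s_2^{2}/n_2)^{2}}{n_2-1}}$$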

#### obh

##### Member
Thanks hlsmith,

Thanks for the book suggestion. I didn't find anything related to the case where you know the two groups' standard deviations.
It is always sample standard deviations from other sources. Probably the z test is much less practical ...
I guess the z test was used more when people used tables for calculations; the z table is more detailed, with no df ...

Of course, I searched the web before asking the question, but the magic word "meta-analyses" gave more results. I found one video that suggested: (σ1+σ2)/2
and another place: SQRT((Treatment group variance + Control group variance)/2) https://www.creative-wisdom.com/teaching/WBI/es.shtml

So probably the average of the variances is the correct one ...
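
To make the two candidates concrete, here is a small R sketch that computes both denominators and the resulting standardized difference. The function names and numbers are hypothetical, and the sketch does not settle which definition is the accepted one.

```r
# Two candidate denominators for a standardized difference when the
# population SDs sigma1 and sigma2 are known (names are illustrative).
avg_of_sds <- function(sigma1, sigma2) (sigma1 + sigma2) / 2
root_mean_var <- function(sigma1, sigma2) sqrt((sigma1^2 + sigma2^2) / 2)

# Standardized difference for a given denominator.
effect_size <- function(mean1, mean2, denom) abs(mean1 - mean2) / denom

# Hypothetical values, for illustration only.
sigma1 <- 5; sigma2 <- 10
avg_of_sds(sigma1, sigma2)     # 7.5
root_mean_var(sigma1, sigma2)  # sqrt(62.5), about 7.91

effect_size(2, 6, avg_of_sds(sigma1, sigma2))
effect_size(2, 6, root_mean_var(sigma1, sigma2))
```

The two denominators coincide when σ1 = σ2; otherwise the root of the averaged variances is the larger of the two, so it gives the smaller effect size.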

#### obh

##### Member
Hi Ondansetron,

The question is about the z-test.
Do you mean applying the same formula used for the t-test to the z-test?

I think, but am not sure, that the average of the variances is the correct answer.

#### ondansetron

##### TS Contributor
Sorry, I saw the t-test part.

1) How do you know the variances?
2) I might be wrong, but I think the variance of a sum (or difference) is just the sum of the variances, assuming the two variables in the sum or difference are independent. For non-independent variables you need to adjust for the covariance between them. @Buckeye @Dason are going to be able to clarify better, but in my mind, it's irrelevant that it's a "Cohen's" standardized difference; it's really just a problem of solving for the variance of a difference (which is similar to the variance of a sum).

#### obh

##### Member
I guessed you saw the t-test.

Cohen's effect size uses the population's standard deviation, not the statistic's standard deviation.

The population standard deviation when you mix 2 groups depends on the number of items from each group in the entire population (not in the samples).

But I believe the entire population is not relevant when you compare 2 groups, and it should be a 50%/50% mixture.
So the answer is: sqrt((variance1 + variance2) / 2)

#### ondansetron

##### TS Contributor
Variance of a sum (or difference) of two independent random variables is the sum of the variances: Var(X+Y) OR Var(X-Y) = Var(X) + Var(Y). You would then take the square root of Var(X-Y) to get the SD(X-Y).
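
Written out with the covariance term included (a standard identity, with X and Y denoting the two random variables), the rule is:

$$\operatorname{Var}(X\pm Y)=\operatorname{Var}(X)+\operatorname{Var}(Y)\pm 2\operatorname{Cov}(X,Y)$$

For independent variables the covariance is zero, so $$\operatorname{SD}(X-Y)=\sqrt{\sigma_1^{2}+\sigma_2^{2}}$$. For example, with standard deviations of 5 and 10 this gives $$\sqrt{125}\approx 11.18$$.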

Maybe I'm missing something here...

#### obh

##### Member
The calculation is for the population standard deviation, not for a combination of 2 random variables (X+Y) or (X-Y).

For example, taking the specific case variance1 = variance2 = 8, the population variance should be 8, and the standard deviation sqrt(8).

#### ondansetron

##### TS Contributor
I'm not an R aficionado, but here is a quick simulation to show that it is incorrect to simply average to get the pooled SD for two independent random variables.

```r
> set.seed(123)
> x <- rnorm(100, 2, 5)
> y <- rnorm(100, 6, 10)
> diffxy<-x-y
> SD(x)
Error in SD(x) : could not find function "SD"
> sd(x)
[1] 4.564079
> sd(y)
[1] 9.669866
> sd(diffxy)
[1] 10.89538
> diffxy<-(x-y)
> sumxy<-x+y
> sd(sumxy)
[1] 10.48642
> var(x)
[1] 20.83082
> var(y)
[1] 93.50631
> var(diffxy)
[1] 118.7092
> set.seed(1234)
> a<- rnorm(100, 2, 2)
> b<- rnorm(100, 5, 2)
> diffab<-a-b
> sumab<-a+b
> var(a,b, diffab, sumab)
Error in var(a, b, diffab, sumab) : invalid 'use' argument
In addition: Warning message:
In if (is.na(na.method)) stop("invalid 'use' argument") :
  the condition has length > 1 and only the first element will be used
> var(a)
[1] 4.03532
> var(b)
[1] 4.261643
> var(sumab)
[1] 8.086441
> var(diffab)
[1] 8.507485
```

I even left in my error messages to show how bad I am at R, but the point remains clear. The first case with X and Y has different variances (assigned SD of 5 and 10, var of 25 and 100 to x and y, respectively) and shows you the variance is additive between the two. The second case is A and B with the same variance (SD) assigned (SD of 2, var of 4) to show you the pooled variance is again additive (equals 8 rather than 4 as you claim).

I'm not sure if there is a miscommunication but I think I can't really make any other comments. The numerator is the difference in means of a random variable and the appropriate variance/sd would be the sum of the variances (again assuming two independent RVs). You just account for the covariance if they're not independent.
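
One way to see where the two views in this thread may diverge is to put the quantities side by side. The sketch below uses illustrative function names; it shows the usual two-sample z statistic, whose denominator is the standard error of the difference in sample means, next to the SD of a difference of two single observations. Neither is necessarily the denominator intended for Cohen's effect size.

```r
# Two-sample z statistic with known population SDs (names are illustrative).
# The denominator is the standard error of the difference in sample means,
# which shrinks as n1 and n2 grow.
z_statistic <- function(mean1, mean2, sigma1, sigma2, n1, n2) {
  (mean1 - mean2) / sqrt(sigma1^2 / n1 + sigma2^2 / n2)
}

# SD of the difference X - Y of two independent single observations.
sd_of_difference <- function(sigma1, sigma2) sqrt(sigma1^2 + sigma2^2)

# Inputs matching the parameters assigned in the first simulation above.
z_statistic(2, 6, sigma1 = 5, sigma2 = 10, n1 = 100, n2 = 100)
sd_of_difference(5, 10)  # sqrt(125), about 11.18
```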

#### spunky

##### Doesn't actually exist
> But I believe the entire population is not relevant when you compare 2 groups, and it should be a 50%/50% mixture.
> So the answer is: sqrt((variance1 + variance2) / 2)

If you're assuming a Gaussian mixture then your answer is wrong UNLESS you further assume that the population means are equal (both being zero is one such case). The general expression for the variance of a mixture is:

$$\sigma^{2}=\sum_{i=1}^{k}p_i(\mu^{2}_i+\sigma^{2}_i)-\mu^{2}$$ (where $$\mu$$ is the grand mean of the mixture)

So if you have a two-component mixture, the variance is:

$$p_1\sigma^{2}_{1}+p_2\sigma^{2}_{2}+[p_1\mu^{2}_{1}+p_2\mu^{2}_{2}-(p_1\mu_1+p_2\mu_2)^{2}]$$

To get the expression that you posted, everything in the square brackets should be 0. The bracket simplifies to $$p_1p_2(\mu_1-\mu_2)^{2}$$, so that can only happen if the two population means are equal.
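
The two-component expression can be checked numerically with a short R sketch. The helper `mixture_var` and the parameter values are hypothetical. Expanding the bracketed term gives p1·p2·(μ1-μ2)^2, which is zero whenever the two means coincide.

```r
# Variance of a finite mixture from component weights p, means mu and
# standard deviations sigma: sum of p_i * (mu_i^2 + sigma_i^2) minus
# the squared grand mean.
mixture_var <- function(p, mu, sigma) {
  grand_mean <- sum(p * mu)
  sum(p * (mu^2 + sigma^2)) - grand_mean^2
}

# Hypothetical 50/50 mixture with different means and SDs.
p <- c(0.5, 0.5); mu <- c(2, 3); sigma <- c(10, 20)

within  <- sum(p * sigma^2)                 # average of the variances: 250
between <- p[1] * p[2] * (mu[1] - mu[2])^2  # bracketed term: 0.25

mixture_var(p, mu, sigma)  # 250.25
within + between           # 250.25
```

With these values the average of the variances misses the mixture variance only by the small between-means term.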

#### obh

##### Member
Good morning Ondansetron.

You proved (nicely) with simulation the basic statistics Var(x+y)=var(x)+var(y) and Var(x-y)=var(x)+var(y).
Thanks for the nice demonstration. But the pooled variance is not the variance of (X+Y); it is the variance of the entire population.
When using the t-test you treat the 2 samples as one big sample to estimate the pooled variance: S pooled^2 = ((n1-1)S1^2 + (n2-1)S2^2)/(n1+n2-2)
(I think the same is done for Welch's unequal variances? It should be like the z-test below, but here the standard deviation of the group with more values is more accurate ... not sure.)

In the z-test, you assume you know the variances, so there is no sample to calculate from. So the assumption is probably an equal number of values in each group.

R demo:

```r
> x <- rnorm(1000, 2, 10)
> y <- rnorm(1000, 3, 20)
> z <- c(x, y)
> var(x)
[1] 108.4411
> var(y)
[1] 378.2819
> var(z)
[1] 243.2904
```

(var(x) + var(y))/2 = 243.3615 (not exactly the same, as the averages are not exactly the same)
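
The small gap can be accounted for exactly. The sketch below is a reproducible variant of the demo; the seed is an arbitrary assumption, so the numbers will differ from those above. It uses the standard decomposition of the combined-sample sum of squares into a within-group part and a part driven by the gap between the two sample means.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n <- 1000
x <- rnorm(n, 2, 10)
y <- rnorm(n, 3, 20)
z <- c(x, y)

# Average of the two sample variances.
avg_var <- (var(x) + var(y)) / 2

# Exact decomposition of the variance of the concatenated sample
# (equal group sizes n): a within-group part plus a part driven by
# the gap between the two sample means.
decomposed <- ((n - 1) * (var(x) + var(y)) +
               (n / 2) * (mean(x) - mean(y))^2) / (2 * n - 1)

var(z)      # variance of the combined sample
decomposed  # agrees with var(z) up to floating-point error
avg_var     # close, but misses the between-means part
```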

#### obh

##### Member
Thanks Spunky,

It is clear that the average of the variances doesn't give the exact population variance.

You can see in my response to Ondansetron that the average of the variances gives an almost correct result in the R demo.

Anyway, my original question was how to calculate Cohen's effect size for the 2 sample z test.

I now believe that the common practice is to use the average of the variances. Do you know otherwise?