Sample mean differs from population mean due to weighting

#1
Hi folks,

The disclaimer first: I'm not primarily trained in statistics, so this problem might sound very naive to experienced statisticians. Your expertise is what I'm looking for, and any help is highly appreciated.


Now the problem:
I am supposed to calculate total chocolate consumption in a population. Further, I need to calculate per capita chocolate consumption in the overall population, per capita chocolate consumption in rural and urban areas of the population, per capita chocolate consumption by age groups and by income groups.

A primary survey was conducted in a representative random sample of the population wherein annual chocolate consumption data was collected. The sample was created using a two-tier stratification. Tier I was by geographical region and Tier II was by rural and urban within each geographical region. This is illustrated as follows:

Sample size: 1200 persons
Number of regions: 4
Sample size per region: 300

Within each region the sample was further subdivided into rural and urban based on the population mix of that region.

Once the survey data was received, aggregations were done as follows:

Total consumption (T) = sum of consumption in each region (T1 + T2 + T3 + T4)

Consumption in a region = consumption in the region's urban stratum + consumption in the region's rural stratum [e.g. T1 = Tu1 + Tr1, etc.]

Consumption in a region's urban/rural stratum = (consumption in the sample from that stratum / sample size in that stratum) * population of that stratum
[e.g. Tu1 = Cu1/Su1*Pu1 ; Tr1 = Cr1/Sr1*Pr1]


Now that we have the total consumption, calculating overall per capita consumption is fairly simple... Y = T/P

Per capita consumption by rural/urban stratum is calculated as:
Yu = (Tu1 + Tu2 + Tu3 + Tu4) / (Pu1 + Pu2 + Pu3 + Pu4)

Yr = (Tr1 + Tr2 + Tr3 + Tr4) / (Pr1 + Pr2 + Pr3 + Pr4)
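
To make the aggregation above concrete, here is a minimal Python sketch of the same formulas. All figures are made up for illustration and are not from the survey; the names `strata`, `stratum_total` and `per_capita` are hypothetical helpers, not part of any library.

```python
# Hypothetical figures for illustration only; none come from the actual survey.
# Key: (region, area). Value: (sample consumption C, sample size S, population P).
strata = {
    (1, "urban"): (900.0, 180, 90_000),
    (1, "rural"): (240.0, 120, 60_000),
    (2, "urban"): (300.0, 100, 200_000),
    (2, "rural"): (200.0, 200, 400_000),
    (3, "urban"): (300.0, 150, 150_000),
    (3, "rural"): (150.0, 150, 150_000),
    (4, "urban"): (240.0, 60, 60_000),
    (4, "rural"): (480.0, 240, 240_000),
}


def stratum_total(c, s, p):
    """Scale the stratum's sample mean up to its population: C / S * P."""
    return c / s * p


# Estimated total consumption and population for every stratum.
totals = {key: stratum_total(*vals) for key, vals in strata.items()}
population = {key: vals[2] for key, vals in strata.items()}

T = sum(totals.values())       # T = T1 + T2 + T3 + T4
P = sum(population.values())   # total population
Y = T / P                      # overall per capita consumption


def per_capita(area):
    """Per capita consumption over all strata of one area type."""
    t = sum(v for (_, a), v in totals.items() if a == area)
    p = sum(v for (_, a), v in population.items() if a == area)
    return t / p


Yu = per_capita("urban")
Yr = per_capita("rural")
print(f"Y = {Y:.2f}, Yu = {Yu:.2f}, Yr = {Yr:.2f}")
```

Because Y is a population-weighted combination of Yu and Yr, it always lies between the two.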

The problem is, when I try to calculate per capita consumption by income group and by age group, I don't have data on the distribution of the population by income and age. Hence I resort to the crude method of using the unweighted sum of consumption in the sample for these calculations. This leads to problems: e.g. if consumption of chocolates is far higher in one of the geographical regions than in all the others, the overall per capita consumption falls outside the range of per capita consumption by age.

For example, in my data, the results are as follows:

Age group    Per capita chocolate consumption
0 - 7        2.75
8 - 12       4.07
13 - 19      4.86
20 - 35      7.42
36 - 45      7.65
46 - 60      8.58
Above 60     10.88

whereas in the overall stratified sample, the per capita consumption is only 1.7 units.
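
A small Python sketch of how such a gap can arise, assuming two equally sized samples drawn from regions of very different population. The numbers are invented purely to show the effect and are not the survey data.

```python
# Hypothetical example: region -> (sample size, sample mean consumption, population).
# Region A is small but eats a lot of chocolate; region B is large and eats little.
regions = {
    "A": (300, 10.0, 50_000),
    "B": (300, 1.0, 950_000),
}

# Unweighted sample mean: every respondent counts equally,
# so region A is heavily over-represented relative to its population.
unweighted = (
    sum(n * mean for n, mean, _ in regions.values())
    / sum(n for n, _, _ in regions.values())
)

# Population-weighted mean: each region's sample mean counts in
# proportion to the region's population.
weighted = (
    sum(mean * pop for _, mean, pop in regions.values())
    / sum(pop for _, _, pop in regions.values())
)

print(f"unweighted sample mean:   {unweighted:.2f}")
print(f"population-weighted mean: {weighted:.2f}")
```

With these made-up figures the unweighted mean is 5.5 while the weighted mean is 1.45, so figures computed the two ways are not comparable.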

A similar discrepancy is also observed in the distribution by income groups.



I understand that the source of the problem is the non-availability of age and income distribution data for the various geographical and rural/urban strata. However, I hope a statistical solution to this problem exists.

I would highly appreciate it if anyone could advise on this and point me to some resources that I can refer to.

Thanks,
Sumeet
 

katxt

Active Member
#2
Strange things can happen with averages, like Simpson's paradox, but in this case the difference seems much too large. Can you explain where the age data came from, and how you calculated the averages?
 
#3
Hi katxt,
Thank you for your time and effort.

The age data came from the primary survey (interviews with random households) I had conducted.

Calculation of averages

Total consumption was calculated as a sum of consumption in various strata. Per capita consumption in the overall sample (which yielded a result of 1.7 units) was calculated by dividing this total consumption by total population (which is the sum of population of all strata).

Per capita consumption within each age bracket was calculated as a simple mean from the sample (e.g. the sum of consumption reported by 8-12 year olds divided by the number of persons in that age bracket). I know this is not the ideal way to do it, but I don't have the age distribution data for the overall population. The solution to my problem could perhaps lie here, if you could suggest a more accurate way to find the average consumption within different age brackets.

I have tried to explain the calculations in more detail in the original post (#1).

Looking forward to your views.

Regards,
rsindore
 

katxt

Active Member
#4
I assume that they are random samples of 300 out of strata with populations P1, P2, P3, and P4, so each sample can be scaled up to a reasonable estimate of the full stratum value by multiplying by Pi/300.
If we look at one group, say 20 to 25, we will have n1, n2, ... people from that group in the sample from each stratum, with chocolate totals of T1, T2, ...
Our best estimate of the average consumption of this group over the entire population is (total consumed by the group)/(total number in the group)
= (T1*P1/300 + T2*P2/300 + ... )/(n1*P1/300 + n2*P2/300 + ... ).
The 300s all cancel in this case, so the average for the group is (T1*P1 + T2*P2 + ... )/(n1*P1 + n2*P2 + ... ).
For the overall average consumption, take T1, T2, ... as the totals over each whole sample of 300, so every ni = 300 and the overall average is (T1*P1 + T2*P2 + ... )/(P1 + P2 + ... )/300.
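
The same calculation as a minimal Python sketch. It assumes each respondent is given a weight of (stratum population / stratum sample size), which reduces to Pi/300 when every stratum has a sample of 300. The records, populations, and the helper `weighted_group_means` are all hypothetical and only for illustration.

```python
from collections import defaultdict

# Hypothetical stratum populations and sample sizes (not the survey's figures).
population = {"A": 1_000, "B": 9_000}
sample_size = {"A": 2, "B": 3}

# Hypothetical respondent records: (stratum, age group, consumption).
records = [
    ("A", "8-12", 10.0),
    ("A", "20-35", 12.0),
    ("B", "8-12", 1.0),
    ("B", "8-12", 3.0),
    ("B", "20-35", 2.0),
]


def weighted_group_means(records, population, sample_size):
    """Estimate per capita consumption per group using stratum weights.

    Each respondent stands for population / sample_size people of their
    stratum, so both the group's consumption and its head count are
    scaled by that weight before dividing.
    """
    consumed = defaultdict(float)  # estimated total consumed by each group
    people = defaultdict(float)    # estimated number of people in each group
    for stratum, group, consumption in records:
        weight = population[stratum] / sample_size[stratum]
        consumed[group] += weight * consumption
        people[group] += weight
    return {group: consumed[group] / people[group] for group in consumed}


by_age = weighted_group_means(records, population, sample_size)

# Overall mean: the same estimator with everyone placed in a single group.
everyone = [(stratum, "all", c) for stratum, _, c in records]
overall = weighted_group_means(everyone, population, sample_size)["all"]

print(by_age)
print(overall)
```

Since the group figures and the overall figure now use the same weights, the overall mean falls within the range of the group means.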
 
#5

Understood and implemented. :)
This is indeed the solution I was looking for... Thanks again :)

In case I need to communicate this to someone briefly, is there a specific term for this kind of calculation? E.g. "The workaround to this problem was found using the _______ method."?