Standard deviation and sample size

#1
Hi. I have no background in statistics, so please keep things fundamental for me. Here are some basic questions about statistics and standard deviation...

Assume a study of 20 samples (please don't mind the sample size) found that, for example, the average American spends 3 +/- 2 hours a day gaming.

But suppose we want to predict/forecast how many hours the whole US (say 300 million people) spends gaming per year (say 300 days, for easy calculation). Of course the statistic won't be 270 +/- 180 billion hours per year. Firstly, the standard deviation would be too large for the study to be very meaningful. Secondly, we would expect that when you add that many samples together, the fluctuations should cancel each other out, so we should get much closer to 270 billion hours rather than the large 90-450 billion hour range that the standard deviation implies...


So my first question is: is there a name or rule in statistics pointing out that if you want to use a small sample to predict a much larger one, it's better not to use the standard deviation, because it's misleading and useless if you do? And maybe it's better to use some other method?

My second question is: what is the difference between, say, picking a small study sample and monitoring it for a much longer time (for example, pick 5 random people and see how much they play games every day for a month), versus picking 150 people and asking how many hours of games they played today? I know the first is skewed if you don't pick study candidates who can represent the whole population, while the latter is skewed if you pick the wrong time or location (when a blockbuster game has just been released, or during a weekend, which will show a much higher average gaming time). But is there a more academic answer that shows the difference between the two? If there is even a name for it to refer to, that would be perfect.

Thanks guys :)
 

CB

Super Moderator
#2
Assume a study of 20 samples (please don't mind the sample size) found that, for example, the average American spends 3 +/- 2 hours a day gaming.

But suppose we want to predict/forecast how many hours the whole US (say 300 million people) spends gaming per year (say 300 days, for easy calculation). Of course the statistic won't be 270 +/- 180 billion hours per year.
Nope it won't. The overall sum will be 3*300*300,000,000 = 270b, yes. But you can't just multiply the observation-level standard deviation by 300*300,000,000 to get the population-level standard deviation. To calculate the standard deviation of a sum of independent random variates we need to square the unit-level standard deviation to get the variance first, then multiply it by the number of variates (300*300,000,000), then take the square root:

sqrt(2^2*300*300,000,000) = 600,000.

That gets us a much smaller standard deviation of 600,000 hours. Note that this assumes (probably unrealistically) that the hours spent gaming per person per day are all independent of one another.

So if we observed the total number of gaming hours of the whole US population over a number of years, and assumed that the number of gaming hours per person was stable at 3 hours per day (SD 2 hours) over the whole period, with each observation being independent of the others, the yearly sum would have a mean of 270b hours and an SD of 600,000 hours.
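
If a concrete check helps, here is a quick simulation sketch in R (scaled down to 90,000 person-days so it runs fast, and drawing daily hours from a normal distribution purely for illustration; the same sqrt(n) scaling applies at the full 300 days * 300 million scale):

Code:
# The SD of a sum of n independent variates with SD sigma is
# sigma*sqrt(n), not n*sigma
set.seed(1)
n    <- 90000                                  # person-days (small scale)
sums <- replicate(1000, sum(rnorm(n, mean = 3, sd = 2)))
mean(sums)      # about 3*n = 270,000 hours
sd(sums)        # about 2*sqrt(n) = 600 hours, nowhere near 2*n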

So my first question is: is there a name or rule in statistics pointing out that if you want to use a small sample to predict a much larger one, it's better not to use the standard deviation, because it's misleading and useless if you do?
No, because this isn't really true - hopefully the explanation above will help you see why.

My second question is: what is the difference between, say, picking a small study sample and monitoring it for a much longer time (for example, pick 5 random people and see how much they play games every day for a month), versus picking 150 people and asking how many hours of games they played today? I know the first is skewed if you don't pick study candidates who can represent the whole population, while the latter is skewed if you pick the wrong time or location (when a blockbuster game has just been released, or during a weekend, which will show a much higher average gaming time). But is there a more academic answer that shows the difference between the two? If there is even a name for it to refer to, that would be perfect.
The first design is called a longitudinal design, and the second design is called a cross-sectional design. They allow you to answer different questions. It's hard to say which is "better" because it depends on what you want to find out.
 
#3
Nope it won't. The overall sum will be 3*300*300,000,000 = 270b, yes. But you can't just multiply the observation-level standard deviation by 300*300,000,000 to get the population-level standard deviation. [...] That gets us a much smaller standard deviation of 600,000 hours. [...]

The first design is called a longitudinal design, and the second design is called a cross-sectional design. They allow you to answer different questions. [...]
Hey, thanks.

It isn't quite what I would expect, but I guess it still provides a similar result? Like, 270b +/- 600,000 would kinda make the SD meaningless in this case, right? I mean, even with a smaller population, let's say a school of 300 students and how many hours they spend on gaming per year, it will still be 270,000 +/- 600, still fairly insignificant.

Is there a name for this, or is saying "because we multiply it by millions, the SD becomes insignificant" enough?

I will look at longitudinal and cross-sectional designs. I mean, as long as one is not better than the other by default, I can still argue/prove that one suits my case better :)


Edit: while we are at it, is there a way to defend our selection of a longitudinal experimental design to a general audience who doesn't quite know the process? In my case, doing a longitudinal design is a must and much more convenient, and you still end up paying for 150 samples by the end of it. But saying "we monitored 5 people for a whole month" doesn't sound as grand as "we did (big number) 150 surveys".
 

CB

Super Moderator
#4
Like, 270b +/- 600,000 would kinda make the SD meaningless in this case, right? [...] Is there a name for this, or is saying "because we multiply it by millions, the SD becomes insignificant" enough?
I don't really understand why you think the SD is meaningless and "insignificant" here. Imagine that we observed the daily gaming hours of the entire population of 300 million people over many years, and each year we added up the gaming hours of the entire population. We assume that the daily gaming hours for individuals are independently distributed and stay stable at M = 3h, SD = 2h throughout the study period. Then the standard deviation of the yearly totals will be 600,000 hours. This standard deviation is no more or less meaningful than any other standard deviation. Can you explain why you see the number as meaningless?

Edit: while we are at it, is there a way to defend our selection of a longitudinal experimental design to a general audience who doesn't quite know the process? In my case, doing a longitudinal design is a must and much more convenient, and you still end up paying for 150 samples by the end of it. But saying "we monitored 5 people for a whole month" doesn't sound as grand as "we did (big number) 150 surveys".
You will need to explain why, for your specific research questions, this design is preferable. In general, longitudinal designs are popular when you want to know how individuals' behavior changes over time, and/or if you are trying to gather evidence to support causal conclusions in a case where a true experiment is impossible (longitudinal designs help establish temporal precedence, which is one of the requirements to establish causality). Obviously in your case a limitation of following just 5 people would be that you have very little evidence to make generalizations to a wider population. But again, it all depends on what you're trying to find out.
 
#5
I don't really understand why you think the SD is meaningless and "insignificant" here. [...] Can you explain why you see the number as meaningless?
To my understanding, the SD tells you how much the data is dispersed around its mean, right? So, for example, if the calculated result is correct and we actually surveyed the whole population in this case, most likely the total gaming hours would fall between 270b - 600k and 270b + 600k.

Based on the accuracy and precision definitions in Wikipedia, my conclusion (the US spends 270b hours on gaming each year) may not be very accurate / high in trueness, as it is only based on 20 samples. But the result should be fairly precise, because the SD here is tiny compared to the total value?

But because every time you scale a small number up by a factor of 1,000 or 10,000, the SD always becomes really small in comparison to the total value. Would that mean that every time we use a really small population to predict a much larger population, it is unnecessary to take the SD into account?


Btw, thanks a lot for persisting with me on this :)
 

CB

Super Moderator
#6
To my understanding, the SD tells you how much the data is dispersed around its mean.
That's right.

So, for example, if the calculated result is correct and we actually surveyed the whole population in this case, most likely the total gaming hours would fall between 270b - 600k and 270b + 600k.
I do wonder here if you might be confusing the standard deviation - which tells us about the dispersion of the data - with some measure of how certain we can be that the true parameter (e.g., the true population mean) falls within some particular interval. These are two quite different things.

If we already knew for certain that the true mean of daily gaming hours per person was 3 hours, with standard deviation 2, then the standard deviation of annual summed gaming hours for a population of 300 million would be 600,000 hours. (Given the simplifying assumptions discussed above.) This is how I'd interpreted your question: as being about what the standard deviation of a sum of random variables is, given that those random variables have a known standard deviation.

However, what really might be of more interest to you here is calculating an interval within which you can be reasonably sure the true population mean falls, based only on a sample of data.

For example, say you collect data from 20 people, and find that within this sample, the mean daily gaming hours are 3h, and standard deviation 2h. You can use this online calculator to obtain a 95% confidence interval for the population mean. This confidence interval would be 2.06 to 3.94 hours.

Given a "year" of 300 days and a population of 300 million, this would make your 95% confidence interval for the annual summed population gaming hours 185.4 billion to 354.6 billion hours. Which is obviously quite a wide range. The bigger your sample, the smaller this range would be.
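
If you'd rather reproduce this without the online calculator, here is a small R sketch using only the summary statistics above:

Code:
# 95% CI for the mean from summary statistics (n = 20, mean 3, SD 2),
# then scaled up to the annual population total
n  <- 20; m <- 3; s <- 2
se <- s / sqrt(n)                          # standard error of the mean
ci <- m + c(-1, 1) * qt(0.975, n - 1) * se
ci                                         # about 2.06 to 3.94 hours/day
ci * 300 * 300e6                           # about 185 to 355 billion hours/year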

Based on the accuracy and precision definitions in Wikipedia, my conclusion (the US spends 270b hours on gaming each year) may not be very accurate / high in trueness, as it is only based on 20 samples. But the result should be fairly precise, because the SD here is tiny compared to the total value?
In stats we don't generally use the terms accuracy and precision in the sense that they're used in that WP article, but conceptually it's really the other way round. If you collect random samples of 20 observations, the expected value of the sample mean - i.e. the long-term average over repeated samplings - will be equal to the population mean. In stats we call this unbiasedness; in the terminology you mention, this measurement would be accurate. However, because the sample size is quite small, the sample means will be quite dispersed either side of the true mean. The mean from any one sample could be quite far from the true population mean. In the terminology you mention, the estimation method is not very precise.
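
A small simulation makes the distinction concrete (the normal(3, 2) population here is hypothetical, purely for illustration):

Code:
# With samples of n = 20, the sample mean is unbiased but not very precise
set.seed(7)
means <- replicate(10000, mean(rnorm(20, mean = 3, sd = 2)))
mean(means)    # about 3: the long-run average equals the true mean
sd(means)      # about 0.45 (= 2/sqrt(20)): any single mean can be far off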

Would that mean that every time we use a really small population to predict a much larger population, it is unnecessary to take the SD into account?
Not really. When trying to make inferences about a population mean, the standard deviation of the variable within the sample we've collected will always be an important input when calculating an interval estimate for the true population mean.

Hope that all makes more sense now!
 
#7
I don't understand everything you said, but I've started to realise that my take on the SD was very misleading.

But first and most important: is there a way to translate the R^2 value (from a line of best fit) into a 95% confidence interval?

Let me explain. I used gaming hours because I thought it would be less confusing. But since you have gone so far for me, I will try to be more specific. The actual data I worked with concerns the constant rate at which chemical X degrades in water. The rate of degradation is slow, say 1 unit/L/day, while the total concentration is high, 100 units/L. So you can see that even a 0.3% error in setup, collection and analysis results in a fluctuation of 1 +/- 0.6 unit/L/day. And since we are looking at pond scale here, it's more like units/gigalitre/year.

What I did is measure how much it degrades periodically over a long time (with duplicate sampling and analysis). Duplicates aside, the number of independent data points I collected is 20. So I ran a line of best fit through them all to find the constant rate, and relied on R^2 to justify my results. The problem is that some higher-ups, not so sure about the whole methodology, are asking about the SD, as they presume that because I have ONE number (the rate), I only took and analysed ONE sample and spent the rest of the month on YouTube or something like that.


So, as you can guess, I am basically trying to defend my case, because all of the experts in that specific field have told me that the method was correct. What I have to explain is why I don't show an SD in this situation.
 
#8
It is much better if you explain the actual problem.

Let me explain. I used gaming hours because I thought it would be less confusing.
Using a made-up hypothetical example just causes confusion.

Why don't you show us the data? 20 points is not much. (If they are "top secret" you can multiply them by 2.5 and add 3, or whatever.)

The problem is that some higher-ups [...]
Is it the top management that is causing the problem?

If this is about water pollution I believe that there is at least one member here that knows a lot about that. (So it can be an advantage to be open about this.)
 

CB

Super Moderator
#9
Ok. I think the description given does change things somewhat. In the gaming hours analogy, the hypothetical goal was to estimate a population mean, whereas your substantive research question is actually about estimating a relationship (between time and concentration, I think?)

I assume that the 20 independent data points you mention are all at different points in time?

I agree with Greta that one of the ecologists on the forum might be able to give some good help here. But a couple of thoughts that strike me:

1. The standard error of the estimate (i.e. the standard deviation of the regression residuals) might be a more useful quantity to report than the standard deviation of the data points themselves. The SEE would tell you how much variability there is around the line of best fit. Obviously it's closely related to the R^2 (see the small R illustration after this list).

2. Are you sure a model assuming a constant rate of degradation is appropriate? I do wonder whether the rate of decay might depend on the concentration in the water... (But I am not a biologist or ecologist, so don't worry if that's a dumb issue to raise!)
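
On point 1, here is a small R illustration on made-up data (the x and y values are hypothetical; the point is only how the SEE and R^2 relate, both being computed from the residual sum of squares):

Code:
# The SEE (residual standard error) and R^2 both come from the
# residual sum of squares of the fitted line
set.seed(42)
x   <- 1:30
y   <- 100 - 0.5 * x + rnorm(30, sd = 0.5)   # made-up decay-like data
fit <- lm(y ~ x)
summary(fit)$sigma                  # the SEE
summary(fit)$r.squared              # R^2
rss <- sum(resid(fit)^2)            # residual sum of squares
tss <- sum((y - mean(y))^2)         # total sum of squares
sqrt(rss / (length(y) - 2))         # reproduces the SEE
1 - rss / tss                       # reproduces R^2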
 
#10
Thanks guys.

I'm sure that, based on past research, the rate should be constant. And even if it isn't, it's better to treat it as if it is.

A small example set of data is here. We ran the experiments in triplicate, then independently tested random samples, which is why we have a varying number of samples at certain times. Please don't mind that... it's just part of the design.

To find the rate, basically I would just go with a line of best fit + show the equation. But like I said, we only looked at a small scale of 1 L here. The real thing is at the scale of a huge pond, and it's totally mixed, so we can expect the whole thing to be uniform.




Is it the top management that is causing the problem?
Sort of. Our test is different from what they usually see, so they were like: "Did you only do it once? Where is the standard deviation?! What do you mean it isn't needed? Everything needs an SD. Everything!"

1. The standard error of the estimate (i.e. the standard deviation of the regression residuals) might be a more useful quantity to report than the standard deviation of the data points themselves. The SEE would tell you how much variability there is around the line of best fit. Obviously it's closely related to the R^2. [...]
Is it the one computed with the STEYX function in Excel?

I just tried it and got the number. The only issue is... I don't know whether it's high or low (I mean, unlike R^2, where you kinda expect 95%+ to be "high", here it's just a number).


PS: I'm not in a hurry, as it's a small part of a much bigger project that is still ongoing, so there's no rush yet. Just point me in any direction.
 
#11
I believe that it is the standard error in the slope that Risingstar needs.

I entered the data in R (one dark stormy rainy night when I couldn't sleep). R is free and you can run it (the code below) by copying in one line at a time.


Code:
one   <- c(99, 99, 99, 99, 99)
t1    <- c( 1,  1,  1,  1,  1)
two   <- c(98.788, 98.280, 98.364, 98.331, 98.447)
t2    <- c(     2,      2,      2,      2,      2)
three <- c( 98.223, 98.023, 98.198)
t3    <- c(      3,      3,      3)
four  <- c(97.667, 97.302, 97.259)
t4    <- c(      4,      4,      4)
five  <- c(96.804, 96.174, 96.623, 96.865)
t5    <- c(     5,      5,     5,      5)
six   <- c(96.434, 96.454, 96.291)
t6    <- c(     6,      6,      6)
seven <- c(95.720)
t7    <- c(7)

chem <- c(one, two, three, four, five, six, seven)
time <- c(t1, t2, t3, t4, t5, t6, t7)

plot(time, chem, ylim= c(90, 100))
abline(lm(chem ~ time))

summary(lm(chem ~ time))

This is the result of the estimation:

Code:
Call:
lm(formula = chem ~ time)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.64880 -0.08222 -0.02791  0.11385  0.31137 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 99.57919    0.08874 1122.09   <2e-16 ***
time        -0.55128    0.02301  -23.96   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2104 on 22 degrees of freedom
Multiple R-squared:  0.9631,	Adjusted R-squared:  0.9614 
F-statistic:   574 on 1 and 22 DF,  p-value: < 2.2e-16
The estimated slope is -0.55128 and the standard error is 0.02301. That means that the 'chemical' is decreasing by about one half percentage unit per time period.

The chemical is decreasing by 0.55 and its 'standard deviation' (the correct name is its standard error) is 0.02301. So the 'uncertainty' (to use a non-technical term that the top management will hopefully understand) in the slope is 0.02301.

A (95%) confidence interval is:

Lower: -0.55128 - 1.96*0.02301 = -0.5963796

Upper: -0.55128 + 1.96*0.02301 = -0.5061804

So, with rounded numbers, the slope was estimated to be -0.55 (95% CI [-0.60; -0.51]).
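
R can also produce this interval directly from the fitted model (it uses the exact t distribution rather than 1.96, so it comes out very slightly wider):

Code:
# 95% confidence interval for the slope, straight from the model above
confint(lm(chem ~ time), "time", level = 0.95)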


graphics code below:

Code:
# you need to install "ggplot2" first before you can use it:
# install.packages("ggplot2")
library(ggplot2)

# qplot() is deprecated in recent ggplot2, so the full ggplot() form is used
p <- ggplot(data.frame(time, chem), aes(x = time, y = chem)) +
       geom_point() +
       geom_smooth(method = "lm", formula = y ~ x) +
       ylim(90, 100) +
       ylab("Amount of Chemical (%)")
p + theme_bw()

Sorry I don't know how to insert such a graph here on talkstats. Maybe someone else can help.

PS: I'm not in a hurry, as it's a small part of a much bigger project that is still ongoing, so there's no rush yet. Just point me in any direction.
It is before the project has started (or at least before it is finished) that the project can be improved. If you have questions about that it is better to ask them as early as possible. It makes me sad when I think about all the projects that are poorly designed and naively analysed.
 
#12
Ah, thanks GretaGarbo :)

It's all a learning process on my side here, so hopefully things will get better as I gain more experience dealing with these. Just two more quick questions from me, if you don't mind:
  1. Is the 1.96 (used to calculate the 95% confidence interval) a constant number?
  2. Assuming the slope (decrease rate) does not change but the initial amount of chemical is higher (say 200 instead of 99), that still won't affect the standard error value, right?

Thanks
 
#13
Is the 1.96 (used to calculate the 95% confidence interval) a constant number?
I made a pedagogical simplification with the "1.96" value [which comes from a 95% confidence interval based on the normal distribution]. The formally more correct value would be the t-value based on 22 degrees of freedom (as in Risingstar's case).

There is a t-table at the top of this site (click on "normal table" and then on "t-table").

The t-value that gives a 95% confidence level with 22 degrees of freedom is 2.074. (This value is close to 1.96; as the sample size increases, the t-value tends towards 1.96.)

I thought that it was difficult enough for Risingstar, so I did not include that. That would make the confidence interval:

-0.55128 +/- 2.074*0.02301
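
In R, the critical value and the resulting interval can be computed directly (reusing the slope and standard error estimated above):

Code:
# t critical value for 95% confidence with 22 degrees of freedom,
# and the resulting confidence interval for the slope
qt(0.975, df = 22)                             # 2.0739
-0.55128 + c(-1, 1) * qt(0.975, 22) * 0.02301  # about -0.599 to -0.504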


Assuming the slope (decrease rate) does not change but the initial amount of chemical is higher (say 200 instead of 99), that still won't affect the standard error value, right?
If all the values were increased by 100 units, the slope and standard error would be the same, but the intercept would increase.
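
A quick check of this claim, reusing the chem and time vectors from the code block above:

Code:
# Shifting all responses up by a constant changes only the intercept;
# the slope and its standard error stay the same
fit1 <- lm(chem ~ time)
fit2 <- lm(I(chem + 100) ~ time)
summary(fit1)$coefficients["time", ]   # slope -0.55128, SE 0.02301
summary(fit2)$coefficients["time", ]   # identical slope and SE
coef(fit1)["(Intercept)"]              # about 99.58
coef(fit2)["(Intercept)"]              # about 199.58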

(By the way I question the first observation. There was not five measurement that were exactly 99.000?)