Extrapolation - Getting info of a population from a sample

#1
Hello,

I'm working on a problem where I have 30% of a full dataset and I have to estimate the generalization error.
To be more precise, let's say I have the transaction data of the clients of a bank which has 30% of the country's market.
I can easily get the mean, standard deviation and so on of my dataset but I can't figure out how to extrapolate.

I know all about basic statistics and so I went through all my lectures to try to find the answer.
I would like to calculate the sampling error of my dataset.

To explain clearly:

population: the information of all transactions in a certain country.
mean : m (unknown)
standard deviation: sigma (unknown)

my dataset: the information of the transactions of the clients of 1 bank owning 30% of the country's market (so 30% of the whole big dataset).
mean : m* (known)
standard deviation : s (known)

The problem is that in all the examples I see, the formulas include the standard deviation of the population, which I don't have.

I used this formula:
m = m* ± 1.96*sigma/√(n), with n being the size of my sample
They use "sigma" which I don't know.


The second part would be to get the error of the variance (or standard deviation), but I'm not there yet.

Any help would be appreciated,
Thank you
Nicolas
 
#3
Ok, thank you! So you use the fact that sigma=s*√(n/(n-1))?

Otherwise would you have any leads to find the standard deviation error or do I stick with the formula I just wrote?
 
#4
Ok, thank you! So you use the fact that sigma=s*√(n/(n-1))?
No.
(I don't know what you mean by "sigma=s*√(n/(n-1))".)

Sigma is the standard deviation in the population.
s is the standard deviation in the sample.

s is an estimate of sigma.

The standard error is s/√(n). That gives the "uncertainty" in the sample mean.
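As a rough sketch of how s and the standard error could be computed from raw data (the sample values below are hypothetical, made up only for illustration), using Python's standard library. It also shows where a factor √(n/(n-1)) does appear: it is the ratio between the standard deviation computed with n-1 in the denominator and the one computed with n, both from the same sample. It is not a relation between s and sigma.

```python
import math
import statistics

# Hypothetical sample values (cm), made up purely for illustration.
sample = [172.0, 181.5, 176.0, 190.0, 168.5, 183.0, 179.0, 185.5]
n = len(sample)

# Sample standard deviation s: divides by n - 1.
s = statistics.stdev(sample)

# Standard deviation computed with n in the denominator.
sd_n = statistics.pstdev(sample)

# Standard error of the sample mean.
standard_error = s / math.sqrt(n)

# The two standard deviations differ by exactly sqrt(n / (n - 1)).
ratio = s / sd_n

print(f"s = {s:.3f}, divide-by-n sd = {sd_n:.3f}")
print(f"standard error = {standard_error:.3f}")
print(f"ratio = {ratio:.4f}, sqrt(n/(n-1)) = {math.sqrt(n / (n - 1)):.4f}")
```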

Example:

The population mean height of males (in my country) is 180 cm with a standard deviation sigma of 7 cm.

If we randomly select n=100 males and get a sample mean of 181 cm, then how "uncertain" is that?

The standard error will be s/√(n) = 7/√(100) = 7/10 = 0.7

The t value with 99 degrees of freedom will be 1.984217 (You can get that from a table. I got it from the software R.)

So the uncertainty will be t*s/√(n) = 1.984217*0.7 = 1.388952

So the "uncertainty" in the sample mean of 181 is +/- 1.388952
So the 95% confidence interval is [179.611 ; 182.389]
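The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the t value is hard-coded from the number quoted above rather than computed, and the variable names are my own.

```python
import math

sample_mean = 181.0   # cm, sample mean from the example
s = 7.0               # cm, standard deviation used in the example
n = 100               # sample size
t_crit = 1.984217     # t value for 99 degrees of freedom, as quoted above

# Standard error of the sample mean.
standard_error = s / math.sqrt(n)

# Half-width of the 95% confidence interval.
margin = t_crit * standard_error

ci_low = sample_mean - margin
ci_high = sample_mean + margin

print(f"standard error = {standard_error:.1f}")
print(f"margin = {margin:.6f}")
print(f"95% CI = [{ci_low:.3f} ; {ci_high:.3f}]")
```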

I hope this helps. :)
 
#5
Your example totally illustrates my problem.
You say that the mean of the population is 180 cm and its standard deviation is sigma=7 cm.
Then you take a sample and use s/√(n) to calculate the standard error. But why do you say s=7? Why would the estimate of sigma be directly equal to sigma?

In any case thank you for your help.
 
#6
But why do you say s=7? Why would the estimate of sigma be directly equal to sigma?
No, I should have made the example with, say, s=6.5. That would have been a better example. Of course sigma is (in general) not known.

If sigma happened to be known (maybe from a large population study or many other similar studies) then we would use 1.96*sigma/√(n) instead of t*s/√(n).
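To put the two cases side by side, here is a small sketch with the numbers from this thread (n=100, sigma=7 if known, s=6.5 if estimated from the sample). The 1.96 comes from the normal distribution and is computed with the standard library; the t value for 99 degrees of freedom is again hard-coded from the earlier post.

```python
import math
from statistics import NormalDist

n = 100

# Case 1: sigma is known, so the normal quantile (about 1.96) is used.
sigma = 7.0
z_crit = NormalDist().inv_cdf(0.975)
margin_known = z_crit * sigma / math.sqrt(n)

# Case 2: sigma is unknown, so the sample estimate s and the t value are used.
s = 6.5
t_crit = 1.984217  # t value for 99 degrees of freedom, quoted earlier
margin_unknown = t_crit * s / math.sqrt(n)

print(f"z = {z_crit:.4f}, margin with known sigma = {margin_known:.4f}")
print(f"t = {t_crit:.4f}, margin with estimated s = {margin_unknown:.4f}")
```

With n=100 the t value and 1.96 are already close, so the difference between the two margins here comes mostly from s differing from sigma.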