Appropriate Sample Size

#1
Hi all -

I'm very new to stats so looking for help with a problem I'm having. Any resources or tips would be greatly appreciated!

I have a huge population dataset under different categories. This dataset contains one record per person, and has the following qualities:

MEMBER_ID (one record per person, one ID per person)
US_STATE (state abbreviation they live in)
MALE_OR_FEMALE (two options in this field)
TOTAL_INCOME

This dataset is about 10,000,000 records. What I'd like to do is sample this dataset and use the sample to represent the entire population. I need to figure out what an appropriate sample size is in two different scenarios:

1. Sample size to represent the population as a whole
2. Sample size for each bucket to represent each bucket accurately (NY/Male, NY/Female, AL/Male, AL/Female, etc.)

For 2, I think I saw something about a Chi-Square test by buckets, but I need help. Essentially, I want to know what sample sizes I should take in both scenarios to make sure my sample represents the population.

Thank you for any help or guidance you can provide!
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
You word this as though you have access to the full population data. If so, why not use the whole thing?

In both scenarios, you just use random sampling without replacement. Boom, you are done.

Is there a particular question you are trying to power using these data, or a certain level of precision you need on an estimate? If not, a simple random sample without replacement is the go-to.
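For instance, here is a minimal pandas sketch of both scenarios. The column names come from your first post; the file name and the sample sizes are just placeholders.

```python
import pandas as pd

# Hypothetical file holding the 10M records described in post #1.
df = pd.read_csv("members.csv")

# Scenario 1: one simple random sample of the whole dataset,
# drawn without replacement (pandas' default).
overall_sample = df.sample(n=10_000, random_state=42)

# Scenario 2: the same number of records from every state/gender bucket,
# so each bucket is estimated with roughly equal precision.
bucket_sample = (
    df.groupby(["US_STATE", "MALE_OR_FEMALE"])
      .sample(n=200, random_state=42)
)
```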
 
#3
I'm trying to get a large enough sample in both cases to be confident that the sample represents the whole dataset.

The whole dataset cannot be used because I want to test hundreds of interations of different scenarios, which would take far too long / too much data to run with the whole dataset.

In 1, is 1,000 enough to represent the income distribution? Is 10,000?

In 2, if 10,000 is enough in 1 to represent the income distribution as a whole, does it accurately reflect the state/gender distributions as well? Is a higher number needed in order to achieve this?
 

katxt

Well-Known Member
#4
I'm trying to get a large enough sample in both cases to be confident that the sample represents the whole dataset.
If the sample is truly random, it will automatically represent the whole data set. The thing that changes with the sample size is the accuracy of your estimates.
So, how accurate do you want your estimates to be? Decide that first, then get a random sample of, say, a few hundred to determine the variability of the data, and then you can do a proper power/sample size calculation.
Alternatively, decide how much data you can handle with your research budget and use the most you can afford.
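To make that concrete, here is a rough sketch of the kind of sample size calculation I mean, for estimating mean income to within a chosen margin of error. The standard deviation would come from your pilot sample; every number below is made up.

```python
from math import ceil
from scipy.stats import norm

def sample_size_for_mean(sd, margin, confidence=0.95, population=10_000_000):
    """Records needed so the confidence interval on mean income is about +/- margin."""
    z = norm.ppf(1 - (1 - confidence) / 2)         # e.g. 1.96 for 95% confidence
    n0 = (z * sd / margin) ** 2                    # infinite-population sample size
    return ceil(n0 / (1 + (n0 - 1) / population))  # finite population correction

# Pilot SD of income $40,000, mean pinned down to within +/- $1,000:
print(sample_size_for_mean(sd=40_000, margin=1_000))  # 6143 with these made-up numbers
```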
 

hlsmith

Less is more. Stay pure. Stay poor.
#5
Yeah, if you have the population, use it. Run analyses on a server if you are at a university; if you are using a local machine, run it in parallel. If you don't have multiple cores, run it overnight. Not using the full data is running away from the real problem: your inability to truly answer your questions definitively. Write out your objectives and figure out how to use all the data, period.

If you are planning to write this up and publish, and I were one of your peer reviewers, I would demand that you use all the data.

PS, Explain what you mean by 'interactions'!
 
#6
Yeah, if you have the population, use it. Run analyses on a server if you are at a university; if you are using a local machine, run it in parallel. If you don't have multiple cores, run it overnight. Not using the full data is running away from the real problem: your inability to truly answer your questions definitively. Write out your objectives and figure out how to use all the data, period.

If you are planning to write this up and publish, and I were one of your peer reviewers, I would demand that you use all the data.

PS, Explain what you mean by 'interactions'!
Apologies - I meant iterations :) Running 500 iterations/scenarios on roughly 10,000,000 rows x 500 columns is too cumbersome, so I'm trying to see if running them on a much smaller set can get me insights quickly before testing on the larger dataset.

katxt, are there any measurements/tests/equations you can point me to for determining confidence/accuracy given a specific sample size?
 

katxt

Well-Known Member
#7
Google "power analysis sample size". It's a very common technique with lots of calculators online, if you understand what is going on. However, if your project is important, buy an hour or so of a statistician's time rather than rely on advice given by folk on the net who are certainly competent and well meaning but aren't in a position to appreciate exactly what you are trying to do in this complex situation.
A bigger problem for your investigation is that it sounds like you will have 500 sets of p values. How will you distinguish between true positives and false ones? You really need to have some face-to-face time with a statistics expert you can trust and quote.
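As a rough sketch of both points using statsmodels (the effect size, alpha, and power below are only illustrative, and the p values are simulated stand-ins):

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower
from statsmodels.stats.multitest import multipletests

# Sample size per group needed to detect a "small" standardized difference
# (Cohen's d = 0.2) with 80% power at alpha = 0.05.
n_per_group = TTestIndPower().solve_power(effect_size=0.2, alpha=0.05, power=0.8)
print(round(n_per_group))  # roughly 394 per group

# With ~500 p values, control the false discovery rate (Benjamini-Hochberg)
# rather than reading raw p values one by one.
p_values = np.random.uniform(size=500)  # stand-in for the 500 test results
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(reject.sum(), "tests survive the FDR adjustment")
```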
 
#8
Agreed. I think this was more of a starting point to get the conversation going on whether or not it's something doable in our department, and how to go about it. Unfortunately, I just don't have the knowledge base to answer accurately.

I'll see what we can do and find a data scientist in house :) thanks for your help.
 

hlsmith

Less is more. Stay pure. Stay poor.
#9
Alright, for a sample size calculation you should pick the closest (smallest-difference) comparison you can think of that you are interested in and power for that. All other tests should then be reasonably powered. You can validate the results using the data you didn't sample.

HOWEVER, if you use all of your data you don't have to run any tests at all. If you find a difference between two groups in income, guess what, there is a difference in income. No formal test is actually needed, and running them will only confuse the point. When you have the full population, testing on samples is pointless and may by chance mask the truth!
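A small sketch of the validation idea in the first paragraph, using the column names from post #1 (the file name and sample size are placeholders): draw the working sample, then check its estimates against the records that were not sampled.

```python
import pandas as pd

df = pd.read_csv("members.csv")               # hypothetical file name
sample = df.sample(n=10_000, random_state=1)  # the working sample
holdout = df.drop(sample.index)               # everything that was not sampled

# Compare mean income by state in the sample versus the unsampled remainder.
comparison = pd.DataFrame({
    "sample": sample.groupby("US_STATE")["TOTAL_INCOME"].mean(),
    "holdout": holdout.groupby("US_STATE")["TOTAL_INCOME"].mean(),
})
comparison["diff"] = comparison["sample"] - comparison["holdout"]
print(comparison.sort_values("diff"))
```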
 

katxt

Well-Known Member
#10
We have talked about the population, but I suspect that the 10M cases may not be a "population" in the classical sense. Income, for example, is not fixed but varies from time to time, so the incomes we have are at best a snapshot of those at a particular point in time, and if we went back we might well find them different, along with many of the other 500 columns. If we want to make statements about incomes in general, we are actually treating the 10M cases as a very large sample of all the possible relevant incomes.

In this situation we can make statistical inferences, but we are massively overpowered. With such a large sample, the great majority of tests will prove to be statistically significant, but of little practical importance. (In real life the null hypothesis is very seldom true.) In the first part of post #9 above, hlsmith makes what seems to me a sensible suggestion.
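To illustrate the overpowering, here is a simulated example (all numbers invented): with millions of records per group, a $200 difference in mean income that nobody would care about still produces a vanishingly small p value.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
men = rng.normal(loc=50_000, scale=40_000, size=5_000_000)
women = rng.normal(loc=50_200, scale=40_000, size=5_000_000)  # means only $200 apart

t_stat, p_value = ttest_ind(men, women)
print(f"p = {p_value:.1e}")  # far below 0.05 despite a practically trivial gap
```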