Is a random sample from a random sample a random sample?

#1
Sorry for the confusing title.
I have a population dataset with ~15 million records from which I must produce a random sample of 100K records for a survey. There is some processing and filtering that we do to the population (like removing inactive accounts or removing people surveyed in the last 6 months) that typically eliminates 5-10% of the records before the sample is taken.
Would it be valid to take a random sample of 1 million records (or 2M, 3M, etc.) before the data is processed, and then take the final 100K sample from that? By valid I mean: can we still generalize observations from the 100K sample to the original population?
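Roughly, what I have in mind looks like this (just a sketch in Python/pandas; the file name, column names, and filter conditions below are placeholders, not my real steps):

```python
import pandas as pd

# Placeholder file/column names; the real population is ~15 million records.
population = pd.read_csv("population.csv")

# Step 1: random sample of 1 million records before any processing.
prelim = population.sample(n=1_000_000, random_state=42)

# Step 2: processing/filtering (placeholders for the real steps, e.g.
# dropping inactive accounts or people surveyed in the last 6 months).
eligible = prelim[prelim["account_active"] & ~prelim["surveyed_last_6_months"]]

# Step 3: final random sample of 100K from whatever survives the filters.
final_sample = eligible.sample(n=100_000, random_state=42)
```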
Thanks very much in advance for any help.
 

Jake

Cookie Scientist
#2
I don't really see what the two-step sampling procedure buys you. In both cases you are systematically excluding cases that meet certain criteria, so of course you can only draw conclusions about the population of people who do not meet those criteria. I don't think the two-step procedure is detrimental, but I don't see the benefit either.
 
#3
Hi, thanks very much for your reply.
Currently I have about a dozen filtering steps on the population dataset before I take my random sample. If I can run those steps against 1 million records rather than 15 million records and still be able to generalize the results to the full 15 million population, I'll save a lot of time building this sample.
 

Jake

Cookie Scientist
#4
Well, your final sample size is going to be 100,000, right? So why not just draw a 100,000-record sample and then do your processing steps only on that sample? This seems just as sound as anything else we've discussed.
 

Dason

Ambassador to the humans
#5
I think they're hoping to get a final sample size of exactly 100,000, whereas if they draw 100,000 from the original population and then do the filtering they'll have fewer than 100,000. If they draw 1,000,000, do the filtering, and then draw 100,000 from what's left, they'll definitely have 100,000.
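A quick toy simulation of that (plain Python, with a made-up ~12% filter failure rate standing in for the real filters) shows the difference:

```python
import random

random.seed(1)
N = 15_000_000
population = range(N)

def passes_filters(record_id):
    # Stand-in for the real eligibility checks: here roughly 12% of
    # records "fail" (ids ending in 00-11). Made-up rate for illustration.
    return record_id % 100 >= 12

# Draw 100,000 directly, then filter: ends up short of 100,000.
direct = [r for r in random.sample(population, 100_000) if passes_filters(r)]

# Draw 1,000,000, filter, then draw 100,000 from the survivors: exactly 100,000.
survivors = [r for r in random.sample(population, 1_000_000) if passes_filters(r)]
two_step = random.sample(survivors, 100_000)

print(len(direct), len(two_step))  # roughly 88,000 vs. exactly 100,000
```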
 
#7
I wish I could, that would really cut down on the run time for my script. Unfortunately, each of those filtering steps will invalidate or remove some records. Typically, when I run the entire population I'll lose 10-15% of the starting records to the filters. I suppose I could run with just 150K records and then, after the filtering process, prune the file down to 100K records, but that wouldn't seem like a random sample. Anyway, thanks again for your help.
 
#8
Isn't it the case that a simple random sample (of, say, 10%) taken out of a simple random sample (of, say, again 10%) of a population is itself a simple random sample of that population (in this case a 1% sample, because 0.10 * 0.10 = 0.01)?

Even if one filters away some of the ineligible units, for example if one deletes all children and retired people.

Even if it is done in several steps, as long as the sampling probabilities can be computed at each step, I believe the final sampling probabilities, and therefore the weights, can be computed. (If it is just a simple random sample, the weight for the mean is simply 1/n, where n is the sample size.)
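For example, with the sizes mentioned in this thread, and ignoring the filtering for the moment (the filtering only restricts the conclusions to the eligible units, as discussed above), the selection probabilities just multiply:

```python
N       = 15_000_000  # population size
n_step1 =  1_000_000  # first simple random sample
n_final =    100_000  # final simple random sample

p_step1 = n_step1 / N          # P(selected in step 1)
p_step2 = n_final / n_step1    # P(selected in step 2 | selected in step 1)

p_overall = p_step1 * p_step2  # = n_final / N, same as one SRS of 100K from 15M
weight = 1 / p_overall         # design weight: each sampled record represents 150 records

print(p_overall, n_final / N, weight)  # 0.00666..., 0.00666..., 150.0
```

If the filtering happens between the two steps, p_step2 becomes n_final divided by the number of eligible records that survive the filters, which is also known, so the probabilities and weights can still be computed.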

Kurt b wrote:
“but that wouldn't seem like a random sample.”
To me it seems like a random sample.
 

Dason

Ambassador to the humans
#9
I agree with GretaGarbo, as long as you're randomizing at each step and not doing some sort of convenience sample. So don't just grab 1,000,000 randomly, do the filtering, and then take the top 100,000; that wouldn't be a random sample anymore.
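In code, the difference is just which 100,000 you keep after filtering; something like this (a sketch, assuming the filtered records sit in a pandas DataFrame called filtered):

```python
# Random draw from the filtered records: still a random sample.
final_sample = filtered.sample(n=100_000, random_state=42)

# Taking the "top" rows instead: NOT a random sample if the row order
# means anything (sorted by id, signup date, region, ...).
not_random = filtered.head(100_000)
```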