Sorry for the confusing title.
I have a population dataset of ~15 million records from which I must draw a random sample of 100K records for a survey. Before the sample is taken, we do some processing and filtering on the population (such as removing inactive accounts and removing people surveyed in the last 6 months), which typically eliminates 5-10% of the records.
Would it be valid to take a random sample of 1 million records (or 2M, 3M, etc.) before the data is processed, apply the filtering to that subsample, and then draw the final 100K sample from it? By valid I mean: can we still generalize observations from the 100K sample to the original population? A sketch of the procedure I have in mind is below.
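For concreteness, here is a minimal sketch of the two-stage procedure I'm describing, assuming the data sits in a pandas DataFrame. The column names (`active`, `last_surveyed`) and the file name are hypothetical stand-ins for whatever the real dataset uses:

```python
import pandas as pd

# Hypothetical input: ~15 million population records.
population = pd.read_csv("population.csv", parse_dates=["last_surveyed"])

# Stage 1: random subsample taken BEFORE any processing/filtering.
subsample = population.sample(n=1_000_000, random_state=42)

# Filtering applied to the subsample instead of the full population,
# e.g. drop inactive accounts and anyone surveyed in the last 6 months.
cutoff = pd.Timestamp.today() - pd.DateOffset(months=6)
eligible = subsample[subsample["active"] & (subsample["last_surveyed"] < cutoff)]

# Stage 2: final 100K survey sample drawn from the filtered subsample.
survey_sample = eligible.sample(n=100_000, random_state=42)
```

The alternative (current) approach would be to run the same filtering on the full 15M records first and only then call `.sample(n=100_000)`; my question is whether the subsample-first version changes what the 100K sample can tell us about the original population.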
Thanks very much in advance for any help.