Best techniques for surveying large populations

#1
Hello,
I'm struggling with some statistical concepts for a survey I'm implementing. I'm planning on using SAS for this.
Here is the problem:
I have to check a large company's geo database in the field. That is, the company has many kinds of equipment (about 20 kinds, with millions of each) spread all over the state, and I have to check how well that database reflects reality (whether each piece of equipment exists and whether it is of the kind the database claims).
The state is really big, about the size of a medium-sized European country, and it also has some transportation issues. So first I created strata for each equipment type, and then I'll do single-stage cluster sampling (with clusters of unequal size) of the cities in the state.
I've decided to use sampling proportional to size within each cluster (a city with more equipment gets more samples).

Then I was thinking of using dummy variables for the things I'll see in the field (i.e. 1 or 0 if the equipment exists; 1 or 0 if its location is correct; 1 or 0 if it's the same kind; 1 or 0 if it's the same power spec, etc.).
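As a rough illustration of how such 0/1 indicators could be summarized within one stratum, here is a minimal Python sketch of a ratio estimate of a proportion with a cluster-level standard error. It assumes the cities were drawn with equal probability and ignores the finite population correction; the function name and the numbers are hypothetical, and a design with unequal selection probabilities would need weights (for example via a survey procedure in SAS).

```python
import math

def cluster_proportion(found, checked):
    """Ratio estimate of a proportion from cluster (city) totals.

    found[i]   = number of items in city i where the indicator was 1
    checked[i] = number of items inspected in city i
    Assumes equal-probability selection of cities, no finite population correction.
    """
    n = len(checked)                      # number of sampled cities
    p_hat = sum(found) / sum(checked)     # overall proportion
    m_bar = sum(checked) / n              # average items inspected per city
    # Between-city residuals drive the variance (linearization estimator)
    resid_ss = sum((y - p_hat * m) ** 2 for y, m in zip(found, checked))
    variance = resid_ss / (n * (n - 1) * m_bar ** 2)
    return p_hat, math.sqrt(variance)

# Hypothetical field results for four cities
p_hat, se = cluster_proportion(found=[19, 44, 9, 27], checked=[20, 50, 10, 30])
print(p_hat, se)
```

An approximate 95% interval would then be `p_hat ± 1.96 * se`, which is only reasonable when the number of sampled cities is not too small.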
My questions for you guys are:
- Is this a good method?
- Does this method result in a probability sample?
- How do I calculate the error and variance of my sample after I check it (can I do it before)? Do I use something like linear regression?
- How big should my sample be for 95% confidence and 5% error? How do I calculate it, since I'm using two types of sampling methods?
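For the sample-size question, a common starting point is the simple-random-sampling formula for a proportion, n0 = z² p (1 − p) / e², optionally shrunk by a finite population correction for small strata. The Python sketch below assumes the worst case p = 0.5 and z = 1.96 for 95% confidence; the function name is hypothetical, and the result is a baseline per stratum before any adjustment for clustering.

```python
import math

def srs_sample_size(margin, z=1.96, p=0.5, population=None):
    """Sample size to estimate a proportion under simple random sampling.

    margin     = desired half-width of the confidence interval (0.05 for 5%)
    z          = normal quantile (1.96 for 95% confidence)
    p          = anticipated proportion (0.5 is the most conservative choice)
    population = stratum size; if given, apply the finite population correction
    """
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

print(srs_sample_size(0.05))                   # very large stratum
print(srs_sample_size(0.05, population=300))   # a stratum with only a few hundred items
```

With these inputs the large-stratum answer is 385 and the 300-item stratum needs 169, which shows why the rare, expensive equipment types need a proportionally much larger share inspected.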

I'm an electrical engineer with scarce statistical knowledge.

Thank you very much for your time and attention.
 

hlsmith

Not a robit
#2
So you have potential population numbers (which may or may not be correct). Now you want to sample from them to see if the sample has comparable properties to those potential for the population? Is this correct?

Well, yeah, using samples proportional to the population size seems fine. Though having the sample be as random as possible usually gives desirable properties. The sample size depends on how far off you think the sample may be from the population (effect size).
 
#3
So you have potential population numbers (which may or may not be correct). Now you want to sample from them to see if the sample has comparable properties to those potential for the population? Is this correct?
It is exactly that.

Well, yeah, using samples proportional to the population size seems fine. Though having the sample be as random as possible usually gives desirable properties. The sample size depends on how far off you think the sample may be from the population (effect size).
The thing is that I have one kind of equipment that numbers in the tens of millions (like street transformers) and others that are much more expensive and number only a few hundred. I thought that a stratified sample would be best in this scenario; is this right?
Also, I understand that cluster sampling would substantially increase my variance, but it's the only feasible thing I can think of.
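The variance penalty from clustering is often approximated with a design effect, deff = 1 + (m − 1)ρ, where m is the average number of items inspected per city and ρ is the intra-cluster correlation. The Python sketch below inflates a simple-random-sampling size by that factor; the function names and the values m = 20 and ρ = 0.05 are purely illustrative assumptions, since ρ is not known here and would have to come from a pilot or a similar audit.

```python
import math

def design_effect(avg_cluster_size, icc):
    """Approximate design effect for cluster sampling: 1 + (m - 1) * rho."""
    return 1 + (avg_cluster_size - 1) * icc

def inflated_sample_size(n_srs, avg_cluster_size, icc):
    """Scale a simple-random-sampling size up to account for clustering."""
    return math.ceil(n_srs * design_effect(avg_cluster_size, icc))

# Illustrative inputs: 385 from the SRS formula, 20 items per city, rho = 0.05
print(design_effect(20, 0.05))
print(inflated_sample_size(385, 20, 0.05))
```

The formula also shows the trade-off: visiting more cities with fewer items each keeps m small and limits the inflation, at the cost of more travel.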
 

hlsmith

Not a robit
#4
Was the population list correct at one point? That to me is where things get weird. The rest can just emulate representative sampling strategies, where you can do stratified (group) sampling, which is still random in some way: you take all eligible units and randomly select some within the groups.

But you are comparing outcomes to a possibly unreliable population, this actually probably means your samples are the truth, so you are comparing the population list to the samples. OK - I just cleared that up in my mind a little.
 
#5
Was the population list correct at one point?
It was never audited this way; I'm proposing this new methodology.
I believe the population is pretty accurate, since the company uses it in its technical departments. But I can't be sure of that; some companies just don't keep clean records.

But you are comparing outcomes to a possibly unreliable population, this actually probably means your samples are the truth, so you are comparing the population list to the samples. OK - I just cleared that up in my mind a little.
Does this mean that the sampling strategy is OK?
 

hlsmith

Not a robit
#7
Well, yeah. This seems like it falls under a type of industrial engineering or quality control process, which probably has a better framework. I am just generalizing basic statistics. In doing that, you could state a priori how big a difference would be considered important, determine which statistical test you would want to use, and figure out the power based on the acceptable tradeoff between type I and type II errors. The selection of the test would be important (e.g., chi-square, or perhaps a non-inferiority or superiority test, or maybe even calculating sensitivity or specificity).

@Miner - do you have any suggestions?
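
The a priori planning described in post #7 can be sketched in Python for one simple case: a one-sided, one-sample test that the true accuracy is below a claimed value, using the normal approximation. The claimed accuracy of 95%, the "important" shortfall to 90%, and the function name are hypothetical choices for illustration; the result assumes simple random sampling, so a clustered design would need it inflated by a design effect.

```python
import math
from statistics import NormalDist

def n_for_one_sample_proportion(p0, p1, alpha=0.05, power=0.80):
    """Sample size for a one-sided test of H0: p = p0 against H1: p = p1.

    Normal approximation, simple random sampling assumed.
    alpha = type I error rate, 1 - power = type II error rate.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(p0 * (1 - p0))
                 + z_beta * math.sqrt(p1 * (1 - p1)))
    return math.ceil((numerator / (p0 - p1)) ** 2)

# Hypothetical: database claimed 95% accurate, a drop to 90% would matter
print(n_for_one_sample_proportion(0.95, 0.90))
# A smaller difference of interest needs a much larger sample
print(n_for_one_sample_proportion(0.95, 0.93))
```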