Census data: Should I worry about sample size and p-values?

#1
Hi all,
I have a question about statistical power and sample size. (1) Do I have to worry about sample size in a multiple logistic regression if I am using all the individuals in a population (a census) and not a sample? Let’s say that I want to see how many tourists in a resort report a complaint. This would be my dependent variable (complaint Yes/No). I have 7 independent variables, including age, sex, ethnicity, past complaint (yes/no), etc. Again, I am including all guests in a period of time (not a sample). The problem is that a very small proportion of people report a complaint (DV): 271 reported a complaint and 32,469 did not (so less than 1% report a complaint). I worry that some cells (categories) of my independent and dependent variables will not contain any people, since not many people said Yes for my DV. For example, the Asian category may have only 2 people, neither of whom reported a complaint. (2) Would this affect the regression and the p-values? We expected people with previous complaints to be more likely to complain, but in my regression analysis this is not statistically significant, even though the OR is 1.6. (3) Can this be affected by the low N? (4) Should I report and consider p-values at all, since I don’t have sampling error? I would appreciate any help/ideas!

Thank you!
 

noetsi

Fortran must die
#2
The Census does not really have the whole population. Commonly it uses samples, as with the ACS, and even when in theory it covers the population it likely does not capture everyone, because some people don't get the census and some don't return it. That said, if you have the population you know the true effect size. I don't see how Type I or Type II errors can apply when you know beyond any doubt what the true population result is. For the same reason, I don't think p-values matter when you know the true effect size and there is no sampling error.

Having too few people in a cell might cause your regression not to run. But I think a more basic issue is whether your results are reasonable. If only a tiny portion of the population has complaints, does that mean they truly don't have any or they just did not go through the process of complaining?
 

rogojel

TS Contributor
#3
hi,
this will depend on the scope of your conclusions. If you only describe what happened up to this moment, then you are probably right and you need not worry. If you or the readers of your report interpret your results as saying something about the future as well (like having to take some action based on the knowledge that 35% of tourists complained in the past), then you conceptually have a sample of the population of tourists past and future, and all the sample size issues should be considered. This is tricky because your readers might look to the future even if you explicitly state that you do not draw any such conclusions.

regards
 
#4
I am sorry for the late reply. I read both of your comments and I still have a couple of questions.

1. I think it makes sense to consider the p-values. The main task of this project is to predict which guests will report a complaint during their stay, so I can treat my population as a sample of a future guest universe. This can be very tricky, since in reality this is not a sample of a population. Thus I am not estimating a parameter of a static population; my universe is constantly moving. For example, the next group of guests can be totally different from the population on which I ran the logistic regression. Is there a standard terminology for this type of population/sample? Does it make sense to consider and use the p-values?

2. Since less than 1% of the guests reported a complaint during their stay, I have cells with very few observations. When I ran my logistic regression (using Stata), two race categories had no complaints at all; Stata recognized this and alerted me with the following message: "Asian and Indian was dropped because it predicts failure perfectly". That is, none of the Asian and Indian guests reported a complaint (DV). How would this affect my regression? Why is Stata dropping these observations? What if this is true, for example, that if you are Asian you have 0 probability of reporting a complaint? Or, in order to generate a logistic regression coefficient, does at least one person need to be in each of the categories?


3. On the other hand, we were expecting that past complaints (IV) would predict our DV (complaint in the current stay). The regression shows that those with a previous complaint have 1.6 times the odds of complaining, but it is not significant. Can this be affected by the few people who had a complaint in the past, as well as by the low proportion of people who complained in their current stay (DV)? Namely, 13 out of the 271 people who currently complained also had a complaint in the past. On the other hand, 133 out of the 32,469 people who did not complain had a complaint in the past. Are these numbers too small, and do they therefore influence my p-value? If I had a larger N, would the result become significant?
 

noetsi

Fortran must die
#5
hi,
this will depend on the scope of your conclusions. If you only describe what happened up to this moment, then you are probably right and you need not worry. If you or the readers of your report interpret your results as saying something about the future as well (like having to take some action based on the knowledge that 35% of tourists complained in the past), then you conceptually have a sample of the population of tourists past and future, and all the sample size issues should be considered. This is tricky because your readers might look to the future even if you explicitly state that you do not draw any such conclusions.

regards
This is true, of course, only if the effect size varies in other populations. :p
 

noetsi

Fortran must die
#6
1. I think it makes sense to consider the p-values. The main task of this project is to predict which guests will report a complaint during their stay, so I can treat my population as a sample of a future guest universe. This can be very tricky, since in reality this is not a sample of a population. Thus I am not estimating a parameter of a static population; my universe is constantly moving. For example, the next group of guests can be totally different from the population on which I ran the logistic regression. Is there a standard terminology for this type of population/sample? Does it make sense to consider and use the p-values?
If you think of your population as a sample of some other population, you should use p-values. If you don't, you should not. That pretty much is the standard terminology: a population is the entire unchanging population, and a sample is a portion of a larger unknown population. One practical problem here is that many analyses require a "random sample" and you are not sampling randomly (well, you don't know if you are or not, since you are not sampling at all in the classical sense).

2. Since less than 1% of the guests reported a complaint during their stay, I have cells with very few observations. When I ran my logistic regression (using Stata), two race categories had no complaints at all; Stata recognized this and alerted me with the following message: "Asian and Indian was dropped because it predicts failure perfectly". That is, none of the Asian and Indian guests reported a complaint (DV). How would this affect my regression? Why is Stata dropping these observations? What if this is true, for example, that if you are Asian you have 0 probability of reporting a complaint? Or, in order to generate a logistic regression coefficient, does at least one person need to be in each of the categories?
I don't believe it is possible to estimate a coefficient for a category with no variance in the outcome, which is why Stata dropped it.
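
A minimal sketch of why a category with zero complaints has no finite logistic coefficient, assuming a hypothetical group of 40 guests (the group size and the function name `log_likelihood` are illustrative, not from the thread). With no events, the log-likelihood keeps rising as the group's log-odds goes toward minus infinity; with even one event, it peaks at a finite value.

```python
import math

def log_likelihood(beta, n_events, n_non_events):
    """Bernoulli log-likelihood for a group whose log-odds of complaining is beta."""
    log_p = -math.log1p(math.exp(-beta))   # log P(complaint)
    log_q = -math.log1p(math.exp(beta))    # log P(no complaint)
    return n_events * log_p + n_non_events * log_q

# Candidate log-odds values, moving toward minus infinity
betas = [-2, -5, -10, -20]

# Hypothetical group of 40 guests with no complaints: no maximum is ever reached
zero_event = [log_likelihood(b, 0, 40) for b in betas]

# Same group size with a single complaint: the likelihood peaks and then falls
one_event = [log_likelihood(b, 1, 39) for b in betas]

for b, ll0, ll1 in zip(betas, zero_event, one_event):
    print(f"beta={b:>4}  zero events: {ll0:.6f}  one event: {ll1:.6f}")
```

Because the zero-event column never stops improving, software has to either drop the category (as Stata reports doing here) or use a different estimation approach.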

3. On the other hand, we were expecting that past complaints (IV) would predict our DV (complaint in the current stay). The regression shows that those with a previous complaint have 1.6 times the odds of complaining, but it is not significant. Can this be affected by the few people who had a complaint in the past, as well as by the low proportion of people who complained in their current stay (DV)? Namely, 13 out of the 271 people who currently complained also had a complaint in the past. On the other hand, 133 out of the 32,469 people who did not complain had a complaint in the past. Are these numbers too small, and do they therefore influence my p-value? If I had a larger N, would the result become significant?
I know that very low variation in a predictor (say, 95 percent in one group) can negatively affect a regression. How it influences odds ratios I have never read.
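
To see how one sparse cell can drive the uncertainty, here is a minimal sketch of a crude (unadjusted) odds ratio with a Wald interval, built from the 2x2 counts quoted in question 3. It assumes the non-complaining group is the 32,469 guests mentioned in the first post, and the function name `crude_odds_ratio` is made up for illustration. Because it ignores the other six predictors, it need not match the odds ratio of 1.6 from the full model.

```python
import math

def crude_odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio and Wald interval for a 2x2 table.

    a: past complaint, complained now      b: no past complaint, complained now
    c: past complaint, no complaint now    d: no past complaint, no complaint now
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR): the smallest cell contributes the largest term
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, se_log_or, (lower, upper)

# Counts from the thread (32,469 non-complainers is an assumption, see above)
past_and_current = 13
current_only = 271 - 13
past_only = 133
neither = 32469 - 133

odds_ratio, se, ci = crude_odds_ratio(past_and_current, current_only, past_only, neither)

# Share of the variance of log(OR) that comes from the 13-guest cell alone
share_from_smallest_cell = (1 / past_and_current) / se**2
print(odds_ratio, ci, share_from_smallest_cell)
```

The 1/13 term dominates the standard error, so the width of the interval is set mostly by the smallest cell rather than by the total N.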