Mixed model regression for income by ZIP

#1
I'm running a regression analysis to understand the impact of income (predictor variable) on scores on an instrument (dependent variable). The only problem is that I don't know each individual's income; I've estimated it using their ZIP code. I've experimented with a few different methods to account for the fact that 2 or more people might share a ZIP code, and therefore an estimated income, but do not necessarily have the same actual income. Here's my latest approach:

A mixed model with ZIP code as the random effect. However, there is an average of only 2 observations per group. So, I estimated income by county (instead of by ZIP) and now there is a mean of 5 observations per group. That reduced my p-value a little, but it's still significant, and I think it might be safer to defend statistically. I'd appreciate any thoughts on this approach, or if you know of another way to do it.

Thank you!
 

hlsmith

Omega Contributor
#2
So ZIPS have on average about 2 people in them and counties on average have about 5 people in them. I am guessing that when you use county you have less precise data about income, meaning it is not as specific to the individual since it is county level.

If all of this is true, perhaps just give each person their ZIP income and don't use a multilevel model, but incorporate robust standard errors to partially account for the grouping variable you aren't modeling any more. I think such an approach isn't that uncommon when there are few individuals per group or when the groups don't explain much variability in the model. It is kind of a hack, but better than not controlling for the variability at all.
 
#3
Thank you. I'm a stats-amateur - could you elaborate on how I would do this? I'm using STATA - is there a special "robust" regression analysis to run?
 

hlsmith

Omega Contributor
#4
Not a STATA user, but they are usually called:

Robust standard errors
Sandwich estimators
White-Huber SE, or a derivative of those names.

What you are trying to do is account for the between-group and within-group variability. If that isn't addressed via modeling or via the SEs, you risk type I errors (false positives).
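To make the idea concrete, here is a minimal sketch of a cluster-robust (sandwich) standard error for a regression with one predictor, written in plain Python rather than Stata. The function name `cluster_robust_ols` and the toy data are made up for illustration, and the finite-sample correction G/(G-1) * (N-1)/(N-K) is an assumption on my part about the usual convention. If I remember right, Stata exposes this through the `vce(robust)` and `vce(cluster clustvar)` options of `regress`, but check the manual since I'm not a Stata user.

```python
from collections import defaultdict
from math import sqrt


def cluster_robust_ols(y, x, groups):
    """OLS of y on an intercept and x, with cluster-robust (sandwich) SEs.

    Returns ((intercept, slope), (se_intercept, se_slope)).
    Needs at least 2 clusters and a predictor that is not constant.
    """
    n, k = len(y), 2  # k = number of coefficients (intercept + slope)

    # "Bread": (X'X)^-1 for the design matrix X = [1, x]
    sx = sum(x)
    sxx = sum(v * v for v in x)
    det = n * sxx - sx * sx
    bread = [[sxx / det, -sx / det], [-sx / det, n / det]]

    # OLS coefficients: beta = (X'X)^-1 X'y
    sy = sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    b0 = bread[0][0] * sy + bread[0][1] * sxy
    b1 = bread[1][0] * sy + bread[1][1] * sxy

    # Sum x_i * residual_i within each cluster (e.g. each ZIP code)
    scores = defaultdict(lambda: [0.0, 0.0])
    for yi, xi, g in zip(y, x, groups):
        resid = yi - b0 - b1 * xi
        scores[g][0] += resid
        scores[g][1] += xi * resid

    # "Meat": sum over clusters of the outer product of the cluster scores
    meat = [[0.0, 0.0], [0.0, 0.0]]
    for s in scores.values():
        for r in range(2):
            for c in range(2):
                meat[r][c] += s[r] * s[c]

    def matmul(a, b):
        return [[sum(a[r][m] * b[m][c] for m in range(2)) for c in range(2)]
                for r in range(2)]

    # Sandwich: bread * meat * bread, then the finite-sample correction
    v = matmul(matmul(bread, meat), bread)
    n_clusters = len(scores)
    adj = (n_clusters / (n_clusters - 1)) * ((n - 1) / (n - k))
    se = tuple(sqrt(adj * v[i][i]) for i in range(2))
    return (b0, b1), se


# Toy data (hypothetical): income estimate shared within each ZIP code
income = [0.0, 0.0, 1.0, 1.0]
score = [1.0, 3.0, 4.0, 8.0]
zip_code = ["A", "B", "A", "B"]
coefs, ses = cluster_robust_ols(score, income, zip_code)
print("coefficients:", coefs)
print("cluster-robust SEs:", ses)
```

If every person is given their own cluster ID, the same code reduces to the ordinary heteroskedasticity-robust (White/Huber) standard errors with the N/(N-K) correction, so the clustered version is just the generalization that lets errors be correlated within a ZIP code.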

Good luck and keep us posted!