Linear Probability Models and Complex Survey Data

Lazar

Phineas Packard
#1
So there has been a movement back toward linear probability models in cases where multi-group comparisons are of focal interest. The rationale is that cross-group comparisons of parameter estimates are unreliable in probit and logit models and require a fair bit of finessing to get right.

So I now get frequent requests from reviewers to run an LPM in addition to probit/logit. This is fine, and I can deal with heteroskedasticity in the standard errors with the sandwich package in R. The issue comes when there is a complex sample design. With PISA, PIAAC, TIMSS, etc., for example, I get 80 replicate weights that are designed to account for the complex sample design. These are essentially pre-specified bootstrap selectors for the sample. Given that the SEs come from the variance across these repeated replications, do I need to deal with heteroskedasticity in the LPM? If so, does anyone know how? I am using the survey package in R.
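
To make this concrete, here is roughly the setup (a sketch with PISA-style names: W_FSTUWT is the final student weight, W_FSTR1-W_FSTR80 the Fay replicate weights, and pisa/outcome/escs are stand-ins for the real data and variables):

library(survey)

# Replicate-weight design: PISA uses Fay's method with rho = 0.5.
des <- svrepdesign(weights    = ~W_FSTUWT,
                   repweights = "W_FSTR[0-9]+",  # matches all 80 columns
                   type       = "Fay",
                   rho        = 0.5,
                   data       = pisa)

# LPM: svyglm with the default gaussian family is just least squares,
# with SEs computed from the spread across the 80 replicates.
lpm <- svyglm(outcome ~ escs, design = des)

# Logit on the same design, for comparison (quasibinomial avoids
# spurious warnings about non-integer successes).
logit <- svyglm(outcome ~ escs, design = des, family = quasibinomial())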
 

vinux

Dark Knight
#2
Not answering your question, but: the LPM is not a bad model if you are interested in central values (mean, median, etc.). The non-tail part of the sigmoid curve is well approximated by a straight line.
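
For example, near the middle the logistic curve is almost exactly the line 0.5 + x/4 (its tangent at zero), which you can check in a couple of lines of R:

# plogis() is the logistic CDF; compare it with its tangent at 0.
x <- seq(-1, 1, by = 0.25)
round(cbind(logistic = plogis(x), linear = 0.5 + x / 4), 3)
# The columns agree to within about 0.02 for |x| <= 1, i.e. for p
# roughly between 0.3 and 0.7; the approximation fails in the tails.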
 

spunky

Doesn't actually exist
#3
OK, I am far from an expert on this and willing to admit that I may be wrong, but for the sake of saying something, here are my two cents.

do I need to deal with heteroskedasticity in the LPM?
I would be inclined to say "yes". I mean, your dependent variable is still constrained to be either 0 or 1, right? I guess it is reasonable to claim that it is Bernoulli-distributed, and the mean and variance of a Bernoulli distribution are not independent (the variance is p(1-p)). So you're still stuck with the problem that, simply by the way the distribution behaves, the value of the mean drives the variance of the residuals. So I would still be inclined to use some type of robust correction to the standard errors.
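
In the plain random-sampling case I mean something like this (dat and the variables are hypothetical):

library(sandwich)
library(lmtest)

# LPM by OLS, then heteroskedasticity-consistent (HC1) standard errors.
fit <- lm(y ~ x1 + x2, data = dat)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))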

If so, does anyone know how?
I'm not sure if this is relevant at all, but have you heard of the wild bootstrap? One of my profs was obsessed with it last semester and gave us R code to implement it both on its own AND in the presence of missing data with multiple imputation (he was also infatuated with multiple imputation). Would that be helpful to you? It does account for heteroskedasticity! :D
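
The basic version is only a few lines anyway; this is my own sketch of the fixed-design Rademacher flavour, not his code:

# Wild bootstrap for OLS: resample by randomly flipping the sign of
# each residual, which preserves observation-level error variances.
wild_boot <- function(fit, B = 999) {
  X    <- model.matrix(fit)
  res  <- residuals(fit)
  yhat <- fitted(fit)
  t(replicate(B, {
    ystar <- yhat + res * sample(c(-1, 1), length(res), replace = TRUE)
    qr.coef(qr(X), ystar)     # OLS coefficients on the bootstrap sample
  }))
}

fit <- lm(y ~ x, data = dat)  # hypothetical data
apply(wild_boot(fit), 2, sd)  # heteroskedasticity-robust bootstrap SEs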
 

Lazar

Phineas Packard
#4
I would be inclined to say "yes". I mean, your dependent variable is still constrained to be either 0 or 1, right? I guess it is reasonable to claim that it is Bernoulli-distributed, and the mean and variance of a Bernoulli distribution are not independent (the variance is p(1-p)). So you're still stuck with the problem that, simply by the way the distribution behaves, the value of the mean drives the variance of the residuals. So I would still be inclined to use some type of robust correction to the standard errors.
I am inclined to agree, but I am not certain how this would be done in this context, given that the standard errors are taken from the variance in the point estimates across the replicates.
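
Just to make the issue concrete, the replicate SEs are formed roughly like this (Fay's method with rho = 0.5, as in PISA; data and variable names are again stand-ins):

# Full-sample estimate plus one re-estimate per replicate weight.
full <- coef(lm(outcome ~ escs, data = pisa, weights = W_FSTUWT))
reps <- sapply(1:80, function(r) {
  w <- pisa[[paste0("W_FSTR", r)]]
  coef(lm(outcome ~ escs, data = pisa, weights = w))
})

# Fay variance: scaled squared deviations of the 80 replicate
# estimates around the full-sample estimate.
fay_var <- rowSums((reps - full)^2) / (80 * (1 - 0.5)^2)
sqrt(fay_var)  # the reported SEs

The residuals never enter the variance formula, which is why I am unsure where a heteroskedasticity correction would even go.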

I'm not sure if this is relevant at all, but have you heard of the wild bootstrap? One of my profs was obsessed with it last semester and gave us R code to implement it both on its own AND in the presence of missing data with multiple imputation (he was also infatuated with multiple imputation). Would that be helpful to you? It does account for heteroskedasticity! :D
Not really helpful in this case, as the form of the 'bootstraps' is predefined by the survey organisers to account for the complex sample, but still cool :)
 
#5
If so, does anyone know how?
I don't know! But here are my thoughts.

So there has been a movement back toward linear probability models in cases where multi-group comparisons are of focal interest. The rationale is that cross-group comparisons of parameter estimates are unreliable in probit and logit models and require a fair bit of finessing to get right.
The usual logit model with link function g(·) is often written as:

g(p) = log(p/(1-p)) = beta'x

As I understand it, the linear probability model is just a model with identity link:

p = beta'x
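
In R terms this is just a choice of link/estimator (dat and the variables are hypothetical):

# The logit fit and the LPM on the same data.
logit <- glm(y ~ x, family = binomial(link = "logit"), data = dat)
lpm   <- lm(y ~ x, data = dat)   # OLS, the usual way the LPM is fitted

# The identity-link binomial GLM is the ML version of the LPM; it can
# fail when fitted probabilities leave (0, 1), so give it OLS starts.
lpm_ml <- glm(y ~ x, family = binomial(link = "identity"),
              data = dat, start = coef(lpm))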

Maximum likelihood estimation in a generalized linear model is done by iteratively reweighted least squares (IRLS):

(X'W(t)X) beta(t+1) = X'W(t)y (conventional nomenclature)

where the weights W(t) are updated in each round. But this only takes care of the variance increasing as p gets closer to 0.5, not of the sampling design.
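
As a sketch in code (for the identity link the working response is just y, so each iteration is weighted least squares with weights 1/(p(1-p))):

# Hand-rolled IRLS for the identity-link binomial model.
irls_lpm <- function(X, y, tol = 1e-8, maxit = 50) {
  beta <- qr.coef(qr(X), y)                 # start from plain OLS
  for (i in seq_len(maxit)) {
    p <- pmin(pmax(drop(X %*% beta), 1e-6), 1 - 1e-6)  # keep p in (0,1)
    w <- 1 / (p * (1 - p))                  # Bernoulli variance weights
    beta_new <- qr.coef(qr(X * sqrt(w)), y * sqrt(w))  # one WLS step
    if (max(abs(beta_new - beta)) < tol) break
    beta <- beta_new
  }
  beta
}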

Let's write the model as:

Y = beta'x + eps

where Y is still Bernoulli-distributed with mean beta'x and an awkward disturbance term eps.

Now suppose that there is a complex sampling design, so that there is a random selection variable S (selected or not selected) and the selection probability is not constant, as it would be under simple random sampling.

Now my point is that if the disturbance term eps in the regression model and S are statistically independent, then, my guess is, you can ignore the complex sampling design and estimate as if it were a simple random sample.

If they are independent, the likelihood just multiplies the densities:

L = f(s)*f(y; beta), so in the log-likelihood the sampling part is just an additive constant: log L = log f(s) + log f(y; beta), which drops out when maximizing over beta.

But if the sampling design is not independent, so that the sampling probability depends on beta, then maybe that can be modelled in the joint likelihood:

L =f(s, y; beta)

If it is estimated with ML (and for me ML is maximum likelihood!) the variances and covariances can be found from the inverse of the information matrix.

But I believe that this was solved long ago. I just guess that the results have not been used very much.
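
A toy simulation illustrates the independence point (everything here is invented for illustration):

set.seed(1)
n <- 1e5
x <- runif(n, -2, 2)
y <- rbinom(n, 1, 0.3 + 0.1 * x)       # true LPM: beta = (0.3, 0.1)

# Selection depends only on x, so it is independent of the disturbance:
# the unweighted LPM on the selected cases is still consistent.
s1 <- rbinom(n, 1, plogis(x))
coef(lm(y ~ x, subset = s1 == 1))      # close to (0.3, 0.1)

# Selection depends on y itself (informative sampling): ignoring the
# design now biases both coefficients.
s2 <- rbinom(n, 1, ifelse(y == 1, 0.8, 0.2))
coef(lm(y ~ x, subset = s2 == 1))      # visibly off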

Lazar wants cross-group comparisons of parameter estimates. Then I guess that he needs not only the standard errors but also the covariances of the parameter estimates.
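
I assume the survey package exposes these via vcov() on the fitted model. A sketch, reusing a replicate design object des like the one Lazar described (coefficient names depend on the factor coding and are hypothetical):

# Interacted LPM: does the escs slope differ between the groups?
fit <- svyglm(outcome ~ escs * gender, design = des)
V   <- vcov(fit)                          # full replicate-based vcov

b_diff  <- coef(fit)["escs:genderMale"]   # difference in slopes
se_diff <- sqrt(V["escs:genderMale", "escs:genderMale"])
c(diff = b_diff, z = b_diff / se_diff)

# The second group's own slope needs the covariance term explicitly:
se2 <- sqrt(V["escs", "escs"] +
            V["escs:genderMale", "escs:genderMale"] +
            2 * V["escs", "escs:genderMale"])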

I don't know much about bootstrapping in this case. But I would guess that the maximum likelihood estimates would be more precise.