Modelling percentages in the closed interval [0, 1]

#1
Hi all,

I am trying to model percentages that lie in the closed interval [0, 1].
Specifically, I am not modelling the probability of a yes/no outcome (for which logistic regression is often used), but a continuous numeric variable that can take any value in the [0, 1] range, including 0 and 1.

I thought of running a linear regression on the logit transformation, ln[y/(1-y)], but it's not defined at the extremes 0 and 1, which is a huge problem because my data is concentrated exactly at the extremes.
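
To make the issue concrete, here is a tiny R illustration (just a sketch; qlogis() is R's built-in logit):

# qlogis() computes the logit, ln(y/(1-y))
qlogis(c(0, 0.25, 0.5, 0.75, 1))
# returns -Inf -1.0986 0.0000 1.0986 Inf
# the boundary values map to -Inf and Inf, so they cannot enter a
# linear regression on the transformed scale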

I know I could use decision trees or neural networks, but I was wondering if there is any way to apply a regression model to this problem.

Examples of this kind of modelling are the % of available time spent on a certain task, the % of available money dedicated to a given activity, etc.

Thanks a lot for any insight!
 
#3
There is a brief review in this article in the Stata Journal.
Thank you, that's very useful!
I have gone through your link, and the paper it references:

Papke, L. E., and J. M. Wooldridge. 1996. Econometric methods for fractional response variables with an application to 401(k) plan participation rates. Journal of Applied Econometrics 11: 619–632.

That method makes use of the logit link function (that is, the logit transformation of the response variable) and the binomial distribution, which may be a good choice of family even if the response is continuous.
However, I don't understand how this can solve the problem, since the logit transformation is not defined at the extremes 0 and 1.

My statistics knowledge is rusty, so apologies if I am missing something very basic here, but any help would be greatly appreciated!

Thanks!

PS The Stata code reported in one of the examples also uses robust estimation:
glm meals yr_rnd parented api99, link(logit) family(binomial) vce(robust) nolog
 

maartenbuis

TS Contributor
#4
The link function is for the mean (conditional on the explanatory variables), not the actual observations. The conditional mean cannot be exactly 0 or 1, but the actual observations can be. So, if you use this model to explain a proportion and you find an exact 0 in your data, that means the probability is very low (but not 0) and this observation just happened to get 0 successes. I can imagine enough situations where this could make sense.

The vce(robust) option is crucial for this model, as it is a maximum quasi-likelihood model, not a maximum likelihood model.
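
As a quick numerical illustration of that first point (an R sketch with made-up numbers, not anyone's real data): even when the conditional mean is strictly positive, exact 0s are common among the observed proportions.

set.seed(1)
# true success probability 0.01, each unit observed over 10 trials
prop <- rbinom(10000, size = 10, prob = 0.01) / 10
mean(prop == 0)  # roughly (1 - 0.01)^10, i.e. about 0.90 of the
                 # observed proportions are exactly 0
mean(prop)       # yet the average stays close to the true mean, 0.01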
 
#5
I see.

I don't have access to Stata at the moment; what would happen if I tried to fit a generalised linear model, using the logit link function and the binomial distribution but not the robust estimation, with other software such as R or JMP (by SAS)? I mean, I guess the software would run a maximum likelihood estimation, but would this be theoretically wrong, would it produce unreliable results in practice, etc.?

I am also considering (even if it is not theoretically sound) fitting a linear regression on the logit of the percentage, i.e. on the log-odds, after converting the extreme values 0 and 1 to something like 1e-20 and 1-1e-20, i.e. to values for which the logit can be calculated but which are close enough to 0 and 1 that the difference is not material.
 

maartenbuis

TS Contributor
#6
The point estimates will be the same, but the standard errors would be very incorrect. However, I would be very surprised if R or SAS could not compute robust standard errors. I don't know these packages well enough to tell you exactly what you would need to type, but both are serious statistical packages, so they will probably have them, maybe under a different name. Alternative names for robust standard errors are Huber, White, Huber-White, and sandwich standard errors.
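
For example, in R this would look something like the sketch below (assuming the sandwich and lmtest packages are installed; the variable and data names are hypothetical):

library(sandwich)  # sandwich (Huber-White) variance estimators
library(lmtest)    # coeftest() accepts a user-supplied vcov

# fractional logit: the quasi-binomial family lets y be any value
# in [0, 1] while the mean follows a logit curve
fit <- glm(y ~ x1 + x2, family = quasibinomial(link = "logit"),
           data = mydata)

# replace the model-based standard errors with sandwich ones
coeftest(fit, vcov = vcovHC(fit, type = "HC0"))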

If you want to go the linear regression route, I would definitely not use values as extreme as 1e-20, as that would create huge outliers. The challenge then is to "nudge" the boundary values in enough that they do not become outliers, but not so much that they stop being close to 0 or 1.
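
One commonly used compromise is the rescaling suggested by Smithson and Verkuilen, which shrinks every value slightly toward 0.5 by an amount that depends on the sample size n; a small R sketch (the function name is mine):

squeeze <- function(y, n = length(y)) (y * (n - 1) + 0.5) / n

squeeze(c(0, 0.2, 0.5, 1))
# with n = 4 this gives 0.125 0.275 0.500 0.875: the boundaries move
# in just enough to make the logit finite, without the huge outliers
# that converting to 1e-20 would create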
 
#9
The link function is for the mean (conditional on the explanatory variables) not the actual observations. The conditional mean cannot be exactly 0 or 1, but the actual observations can be. [...]
You're a star :tup: - I should pay you a consulting fee :D

Can I abuse your patience just a bit more? Can you help me understand this about generalised linear models?

Let's ignore zeros and 1s for a second, and suppose my data is only on the open interval (0,1).
What is the difference between:

1) calculating logit(y), and fitting a linear regression on logit(y), i.e. X Beta = logit(y)

2) fitting a generalised linear model, with the logit as the link function, to model y

Do I understand correctly that (2) means modelling y = G^-1(X Beta), where G^-1 is the inverse of the logit function and X Beta is the linear combination of the observations and the parameters of the regression?

If I have 0s and 1s in my data, I cannot apply (1) because I cannot calculate logit(y); but if I don't have any 0s or 1s, would the two approaches yield the same estimates? :confused:

Thanks a lot!
 

maartenbuis

TS Contributor
#10
If you do a linear regression on the logit transformed proportion you would be modeling \(E(\Lambda(y))\), while a GLM with the logit link function would be modeling \(\Lambda(E(y))\). \(\Lambda(\cdot)\) is the logit transformation, so \(\Lambda(y) = \ln\left(\frac{y}{1-y}\right)\), and \(E(\cdot)\) is the expectation operator. Since \(\Lambda(\cdot)\) is a nonlinear transformation \(E(\Lambda(y)) \neq \Lambda(E(y))\).
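
A small simulated example in R makes the difference visible (the data-generating process is made up purely for illustration):

set.seed(2)
n <- 10000
x <- rnorm(n)
mu <- plogis(0.5 + x)                # so logit(E(y)) = 0.5 + x
y <- rbeta(n, mu * 5, (1 - mu) * 5)  # proportions strictly in (0, 1)

# (1) linear regression on the transformed response: targets E(logit(y))
coef(lm(qlogis(y) ~ x))

# (2) GLM with logit link: targets logit(E(y)), so its estimates
# should land near the true (0.5, 1); the coefficients from (1)
# drift away, because E(logit(y)) is not logit(E(y))
coef(glm(y ~ x, family = quasibinomial(link = "logit")))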
 
#11
From my textbook, a logit model is used when there is some underlying Bernoulli experiment whose result can be 0 or 1 (healthy or deceased, success or failure), where each experiment is independent and has probability p (0 < p < 1). The sum of such events would be binomially distributed.

I don't understand the link above to the Stata Journal. If y takes the value 0 or 1, it "will not work" with the link ln(y/(1-y)), since that would mean ln(0/(1-0)) or ln(1/(1-1)), i.e. taking the log of 0 or dividing by 0, which is not "allowed".

The usual thing is to use ln(p/(1-p)), as maartenbuis points out above.

But if the response variable is a share, like "what share of your income is used on food?" or "what is the share of oxygen in the air where you are right now?", then such a quantity is not based on a Bernoulli experiment, a 0-or-1 experiment.

In such a case I would say that it would be more appropriate to model with a beta distribution, as Dason points out above. That distribution does not "allow" 0 or 1, but the data can be modelled with a zero-inflated beta distribution (on one side of the scale), with the values rescaled so that the 1 values become something like 0.99. This can be estimated with various software.
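
In R, for instance, a sketch could look like this (assuming the betareg and gamlss packages; the variable and data names are hypothetical):

library(betareg)   # beta regression, response strictly in (0, 1)
# only works after exact 0s and 1s have been rescaled inward
fit_beta <- betareg(y ~ x1 + x2, data = mydata)

library(gamlss)    # BEINF = beta distribution inflated at 0 and 1
# models the exact 0s and 1s explicitly instead of rescaling them
fit_beinf <- gamlss(y ~ x1 + x2, family = BEINF, data = mydata)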
 

maartenbuis

TS Contributor
#12
The key thing about a fractional logit model is that it is a maximum quasi-likelihood model, not a maximum likelihood model. This means that it does not model the entire distribution of the dependent variable, just the conditional mean. The main advantage is that it is fairly robust: a beta regression can give biased estimates when you misspecify the parameter that governs the conditional variance, while a fractional logit is fairly good at ignoring such problems. The main disadvantage is that if you are interested in features other than effects on the conditional mean (e.g. the conditional variance, quartiles, etc.), the fractional logit model obviously cannot give you those.

I like beta regression and zero-one-inflated beta regression --- in fact I (co-)authored programs in Stata that implement these --- but as a first model for fractional data with exact 0s and 1s I would recommend a fractional logit because of its robustness.
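
To see what that robustness claim means in practice, here is a simulation sketch in R (my own made-up data-generating process, with the precision deliberately depending on x):

set.seed(3)
n <- 5000
x <- runif(n)
mu <- plogis(-1 + 2 * x)   # true conditional mean, logit-linear in x
phi <- exp(1 + 3 * x)      # precision varies with x
y <- rbeta(n, mu * phi, (1 - mu) * phi)

library(betareg)
# beta regression with a constant precision parameter: the variance
# model is misspecified here, which can bleed into the mean estimates
coef(betareg(y ~ x))

# fractional logit: only the conditional mean is specified, so the
# estimates should stay close to the true values (-1, 2)
coef(glm(y ~ x, family = quasibinomial(link = "logit")))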