Logistic regression with partly aggregated data


I want to use logistic regression on a data set. The y variable is a binary outcome (yes or no to adverse effects) and the x variable is the blood concentration of a protein (a continuous variable). I want to test whether a higher concentration is associated with more frequent adverse effects.

I have 450 observations. 150 of them (1/3 of all observations) have a concentration below the laboratory detection limit (<100 units); the rest lie between 100 and 3500 units.

My question: when doing a logistic regression on these data, where 1/3 of the observations are "pooled", can I just set those values equal to the lowest detectable value (100 units), or do I need to make further adjustments to my regression or data to get the best model?

Thank you in advance


Omega Contributor
Is there a known pattern in the values that would allow them to be imputed? I would also look to the literature to see if this has come up before.
There are population-based concentration distributions for the x variable, but I cannot discriminate between the observations, so I do not see how I could assign them different x values. Would it be a statistical problem to assign them all the same value (the lowest detectable limit) when using logistic regression?
Alternatively, I may be able to find the most accurate average x value.
I have yet to find discussions of similar problems.

Thank you in advance


Omega Contributor
Is this the only term you are modeling in the logistic regression?

Hmm, not my area, but this definitely has to have come up previously. Almost any approach is going to result in a loss of information. Does the bioassay say at what level the values become undetectable? That, and the prior literature, would be worth checking. An assumption in logistic regression is linearity in the logit, which in my interpretation means a linear relationship between the continuous variable and the log-odds of the outcome. I am not sure how experienced you are, so I will throw out a few ideas, but first I will note that running a bunch of approaches and then selecting one puts you at risk of false discovery of significant results.

The first thing I would think about doing is running a Generalized Additive Model (GAM) on the data you have, excluding the lowest detectable values (LDVs), i.e. the observations below the detection limit. This model will help you understand whether or not the relationship is linear over the values that you do have. Another option may be selecting a couple of values for the LDVs, running the model with each, and seeing what that does to your estimates. You could always run models with multiple values and report them all, if no issues occur. Another option: if you selected a value for these LDVs, you could also simulate around it to get a little more variability in their values. I have also seen robust standard errors (SEs) for logistic regression mentioned somewhere; I am not sure whether you should think about these as well.
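
To make the "try a couple of values for the LDVs" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the data are simulated (not the poster's), the candidate substitution values (0, half the limit, limit/sqrt(2), the limit itself) are just examples, x is rescaled to thousands of units, and `fit_slope` is a hypothetical hand-written Newton-Raphson fit so the example needs only the standard library. In practice you would use your usual statistics package.

```python
import math
import random

LIMIT = 100.0  # detection limit from the question

def fit_slope(xs, ys, n_iter=25):
    """Newton-Raphson fit of logodds(y) = b0 + b1*x; returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = s0 = s1 = s2 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p          # gradient w.r.t. b0
            g1 += (y - p) * x    # gradient w.r.t. b1
            s0 += w              # entries of the 2x2 information matrix
            s1 += w * x
            s2 += w * x * x
        det = s0 * s2 - s1 * s1
        b0 += (s2 * g0 - s1 * g1) / det
        b1 += (s0 * g1 - s1 * g0) / det
    return b0, b1

# Simulated stand-in data (assumption): 150 true values below the limit,
# 300 between the limit and 3500, outcome generated from the true value.
random.seed(0)
true_x = [random.uniform(0, LIMIT) for _ in range(150)]
true_x += [random.uniform(LIMIT, 3500) for _ in range(300)]
ys = [1 if random.random() < 1 / (1 + math.exp(-(-2.0 + 0.0008 * x))) else 0
      for x in true_x]

# Refit with each candidate substitution for the below-limit observations.
slopes = {}
for sub in (0.0, LIMIT / 2, LIMIT / math.sqrt(2), LIMIT):
    xs = [(sub if x < LIMIT else x) / 1000.0 for x in true_x]  # thousands of units
    slopes[sub] = fit_slope(xs, ys)[1]

for sub, b1 in slopes.items():
    print(f"substitute {sub:6.1f}: slope per 1000 units = {b1:.4f}")
```

If the slopes barely move across the substitutions, the choice probably matters little for your conclusion; if they move a lot, that is worth reporting. The false-discovery caveat above still applies if you pick one result after seeing them all.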


Active Member

If you're still around, you might try the following. Create a variable z that takes the value 0 if x is below the detection limit, and 1 otherwise; and assign a value of 0 (or any numerical value at all, it won't matter) to x if x is below the detection limit. Then run the following logistic regression model:

logodds(y) = b0 + b1*z + b2*z*x .

In this model, exp(b0) is the predicted odds of y when x is below the detection limit; exp(b2) is the estimated odds ratio for a 1-unit increase in x, given that x is above the detection limit; and exp(b0 + b1 + b2*x) is the predicted odds of y at detectable concentration x. Note that z is a necessary term in the model, but its regression coefficient, b1 (or exp(b1)), alone has no meaningful interpretation.
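
A minimal sketch of this model in Python, under stated assumptions: the data are simulated (not the poster's), x is rescaled to thousands of units so that exp(b2) is an odds ratio per 1000 units, and `fit_logistic`, `solve` and `design` are hypothetical helper names written with only the standard library. It also checks the claim that the value assigned to x below the detection limit does not matter.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for row, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, row))))
            w = p * (1.0 - p)
            for a in range(k):
                grad[a] += (yi - p) * row[a]
                for c in range(k):
                    hess[a][c] += w * row[a] * row[c]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

def design(data, placeholder):
    """Rows [1, z, z*x]; x below the limit (None) gets an arbitrary placeholder."""
    rows = []
    for x, _ in data:
        z = 0 if x is None else 1
        xv = placeholder if x is None else x / 1000.0  # thousands of units
        rows.append([1.0, float(z), z * xv])
    return rows

# Simulated stand-in data (assumption): x is None when below the detection limit.
random.seed(1)
data = []
for i in range(450):
    x = None if i < 150 else random.uniform(100, 3500)
    p = 0.15 if x is None else 1 / (1 + math.exp(-(-2.0 + 0.0008 * x)))
    data.append((x, 1 if random.random() < p else 0))
y = [yi for _, yi in data]

beta_a = fit_logistic(design(data, 0.0), y)    # placeholder 0
beta_b = fit_logistic(design(data, 123.0), y)  # any other placeholder
b0, b1, b2 = beta_a

print("odds of y below the detection limit:", math.exp(b0))
print("odds ratio per 1000 units, above the limit:", math.exp(b2))
print("predicted odds at x = 2000:", math.exp(b0 + b1 + b2 * 2.0))
```

Because z*x is 0 for every below-limit observation, the two fits use the same design matrix and give the same coefficients. exp(b0) also reproduces the observed odds of y in the below-limit group, since b0 is estimated from that group alone.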