Trend analysis / regression model with missing data


I want to apply a Poisson GLM to count data to analyze trends. However, there are missing counts (i.e., missing values in the outcome variable) in the dataframe. Is there a recommended way to deal with this? I know "multiple imputation" methods exist to fill the gaps in the data, but is that the most recommended approach if I subsequently want to fit a regression model, e.g., for trend analysis?



No cake for spunky
I know nothing about Poisson GLMs. But, given that statisticians disagree about everything, I think multiple imputation is the state of the art for missing data that will be used in regression, or at least it is among the most recommended methods. It is anything but simple for non-interval data.


Less is more. Stay pure. Stay poor.
Agreed that multiple imputation is a preferred method. But what do you know about your missingness pattern? Also, some multilevel approaches for clustered data with repeated measures can handle this type of issue and still run with missing data.


Less is more. Stay pure. Stay poor.
I just very briefly skimmed the paper. I am not sure you need to use MID, and it may depend on the type of missingness that you have. Also, by the authors' own comment, MI converges to MID as the number of imputations approaches infinity. Well, it has been 9 years since that paper, and today's processing power lets us run many more imputations easily - this may negate whether you need to go the MID route.

Do you know the mechanism behind your missingness? That is the most important thing to direct you. Also, what is your sample size and how much data is missing?
Thank you for the interesting information. The type of missingness is "missing at random" (MAR), i.e., the missingness can be related to fully available covariates. Furthermore, approximately 20% of the data are missing; I have roughly 1,000 observations for each of 10 different sites (thus 10,000 observations nested within sites).

Unfortunately I can't open your provided link, "ovid login failure", could you please send me title/authors of the publication? Thanks!

The key thing I want to understand is how imputed values are correctly handled within the regression analysis, since: 1) they do not really add information to the data, only additional data points, so the result should be some kind of "pseudoreplication"; 2) imputed values carry uncertainty, and the question is how to propagate this uncertainty into the SEs of the regression coefficients.


Less is more. Stay pure. Stay poor.
Review: JAMA Guide to Statistics and Methods
Analyzing Repeated Measurements Using Mixed Models
January 2016, Volume 315, Issue 4
Michelle A. Detry, PhD; Yan Ma, PhD

Well, if data are systematically missing (non-ignorable), then you would need those values back to avoid a biased estimate. Next, multiple imputation does, as I believe you mentioned, fold the variability of the imputed values into the SEs. This accounts for our uncertainty about those imputed values.
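The mechanism behind this is Rubin's rules: each of the m imputed datasets yields an estimate and a variance, and the pooled variance adds the between-imputation spread on top of the average within-imputation variance. That between term is also why MI is not pseudoreplication - disagreement among imputations inflates the SE rather than shrinking it. A minimal numpy sketch with toy numbers (the five "slopes" below are made up for illustration):

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Pool per-imputation estimates and variances via Rubin's rules."""
    q = np.asarray(estimates, dtype=float)  # one estimate per imputation
    u = np.asarray(variances, dtype=float)  # one squared SE per imputation
    m = len(q)
    qbar = q.mean()                  # pooled point estimate
    w = u.mean()                     # within-imputation variance
    b = q.var(ddof=1)                # between-imputation variance
    t = w + (1 + 1 / m) * b          # total variance
    return qbar, np.sqrt(t)

# Five hypothetical imputations of the same trend slope, each with SE 0.02
est, se = rubin_pool([0.11, 0.09, 0.12, 0.10, 0.08], [0.02**2] * 5)
print(est, se)  # pooled SE (~0.026) exceeds the within-imputation SE (0.02)
```

If the imputations agreed perfectly, b would be zero and the pooled SE would equal the ordinary one; the more they disagree, the larger the pooled SE.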


No cake for spunky
If your data are missing not at random (MNAR), you're basically out of luck. Multiple imputation and all similar approaches only work if data are missing at random (MAR). About the best you can do with MNAR data is some form of sensitivity analysis. Allison wrote an article on this, I will try to find it...