# What is the appropriate model?

#### Doktor Baumel

##### New Member
Dear Statisticians,

As the headline suggests, I am looking for the appropriate model for my research. I guess it is quite simple, but I am afraid I can't see the wood for the trees...

We have longitudinal data from pupils, surveyed over four waves/years. Our main focus is on injuries in the school environment. In each wave, we therefore ask pupils whether they suffered an injury within the last year. If they answer "yes", they subsequently have to answer several follow-up questions about that injury.

I am interested in whether repeated participation in the panel waves leads to underreporting, meaning that pupils who participated more often should show a lower propensity to answer "yes" to our injury question (because they are aware that many questions will follow, and pupils are lazy, as you know).

More precisely, my dependent variable is a dummy indicating whether a pupil reported in wave 4 having suffered such an injury (1) or not (0). My key independent variable is the frequency of participation in the three earlier waves. So I created a variable that counts in how many of those waves each pupil took part (it ranges from 0, for those who participate in wave 4 for the first time, to 3). Since the sample was refreshed in wave 3, and because some pupils were ill or forgot to bring their parents' declaration of consent on the day of the survey, this variable is distributed quite evenly in wave 4.

If I now run a multivariable logistic regression model with injury in wave 4 as my depvar and the number of participations as my key indepvar (together with some controls), this works perfectly fine. However, we know that some pupils are especially susceptible to injuries. Therefore, I expect that having reported (at least) one injury in the first three waves has an enormous effect on injuries in the subsequent waves. In my opinion, I have to control for this in order to avoid estimating a biased coefficient for my key indepvar. So I created a variable indicating the number of injuries reported in the first three waves and added it as another indepvar in my model. Indeed, the effect of my key indepvar changed remarkably - but the problem is, unsurprisingly, that both indepvars correlate considerably (.40). I guess that this is not a huge problem in principle, but many combinations of those variables do not exist and are completely implausible: if someone, for example, has not yet participated in the study (apart from wave 4), she is simply not able to have reported one or more injuries in the first three waves... So I am wondering whether this might create trouble with my reviewers...
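To make the empty-cell issue concrete, here is a minimal sketch on simulated data (not the study data). The per-wave injury probability of 0.3, the sample size, and all variable names are assumptions chosen purely for illustration; the sketch also assumes at most one reported injury per attended wave, as the yes/no question implies.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the sketch is reproducible


def simulate_pupil(rng, p_injury=0.3):
    """Return (prior_waves, prior_injuries) for one hypothetical pupil."""
    prior_waves = rng.randint(0, 3)  # number of waves 1-3 attended
    # An injury can only be reported in a wave the pupil actually attended.
    prior_injuries = sum(rng.random() < p_injury for _ in range(prior_waves))
    return prior_waves, prior_injuries


# Count how many simulated pupils fall into each (waves, injuries) cell.
cells = Counter(simulate_pupil(rng) for _ in range(2000))

# Cross-tabulation: rows = prior waves attended, columns = prior injuries.
for waves in range(4):
    print(waves, [cells[(waves, inj)] for inj in range(4)])

# Every cell with more injuries than attended waves is structurally empty.
impossible = sum(n for (waves, inj), n in cells.items() if inj > waves)
print("pupils in impossible cells:", impossible)
```

The table is lower triangular by construction, which is where the correlation between the two predictors comes from: the count of prior injuries is bounded above by the count of prior participations.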

Moreover, one would expect the negative impact of the frequency of participation in the first three waves on the propensity to report an injury in the fourth to be even more pronounced if someone has reported (at least) one injury in the past. This is because only in that case is it certain that the pupil is aware of the questions that follow our injury question (which may lead to underreporting in the subsequent waves). To test this, I would need an interaction term, frequency of participation in waves 1-3 * number of injuries reported in waves 1-3, which would heavily increase my problems with collinearity...

All in all, I am very unsure about the appropriate model for my research and I would really appreciate any help!

#### kiton

##### Member
The correlation of .4 does not seem to exceed the commonly used threshold of .5 (at least in my field -- MIS). Therefore, I wouldn't be concerned about it. To obtain additional evidence that collinearity does not have a negative influence on your estimates (CIs, technically), you can run a variance inflation factor (VIF) test and/or condition number test. The common thresholds are 5 and 15, respectively.
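As a rough back-of-the-envelope check, the VIF can be worked out by hand in the simplest case. The sketch below assumes a model with exactly two predictors, where the VIF of each is 1 / (1 - r²); with additional controls in the model the actual VIFs will differ and should be computed from the full design matrix. The function name is hypothetical.

```python
def vif_two_predictors(r):
    """VIF of either predictor when the model contains exactly two
    predictors whose Pearson correlation is r: VIF = 1 / (1 - r**2)."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 1.0 / (1.0 - r ** 2)


r_reported = 0.40  # correlation between the two indepvars, as reported above
vif = vif_two_predictors(r_reported)
print(f"VIF at r = {r_reported}: {vif:.2f}")  # about 1.19
```

Under that two-predictor assumption, a correlation of .40 gives a VIF of about 1.19, far below the threshold of 5.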

Now, to decrease the collinearity that comes with including the interaction term, simply mean-center the interacting predictors before multiplying them. That should address the issue (double-check it with the VIF).
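A minimal sketch of why centering helps, again on simulated data rather than the study data: it compares how strongly the participation count correlates with the product term before and after mean-centering. The injury probability, sample size, and variable names are assumptions for illustration only.

```python
import random
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


rng = random.Random(0)  # fixed seed for reproducibility

# Hypothetical predictors: waves attended (0-3) and injuries reported in them.
waves = [rng.randint(0, 3) for _ in range(5000)]
injuries = [sum(rng.random() < 0.3 for _ in range(w)) for w in waves]

# Interaction term built from the raw predictors.
raw_product = [w * i for w, i in zip(waves, injuries)]

# Interaction term built from the mean-centered predictors.
mean_w = sum(waves) / len(waves)
mean_i = sum(injuries) / len(injuries)
waves_c = [w - mean_w for w in waves]
injuries_c = [i - mean_i for i in injuries]
centered_product = [w * i for w, i in zip(waves_c, injuries_c)]

r_raw = pearson(waves, raw_product)
r_centered = pearson(waves_c, centered_product)
print(f"corr(waves, product), raw:      {r_raw:.2f}")
print(f"corr(waves, product), centered: {r_centered:.2f}")
```

Centering changes what the main-effect coefficients mean (each becomes the effect at the mean of the other predictor), but it leaves the interaction coefficient and the overall model fit unchanged.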