Hello all, this is my first post here -- hoping to find some helpful advice, and hoping to dispense some in the future.
Here is my situation: I am building a model to attribute registrations on a web site to television advertising. I am comfortable with parsing the outcome variable (i.e., how many registrations are driven by TV as opposed to online advertising, word of mouth, etc.); where I'm struggling a bit is with the attribution "within" TV, i.e., across the individual stations.
Currently I have my data in time-series format: each 15-minute period constitutes an observation, and the registrations in those 15 minutes are the outcome. In addition to controls (day of week, etc.), the predictors are the thousands of impressions delivered on the different television stations. To allow for some latency, I'm also including several lag terms for each station, so, for example, the (much simplified) regression equation would look like:
\( \ln(\text{TV Registrations}) = \alpha + \beta_1\,\text{STA1} + \beta_2\,\text{STA1}_{\text{lag}} + \beta_3\,\text{STA2} + \beta_4\,\text{STA2}_{\text{lag}} + \epsilon \)
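For concreteness, here is a minimal sketch in Python (pandas/statsmodels) of roughly how I'm setting this up. The file name, column names (`period_start`, `tv_registrations`, `STA1`, `STA2`), and the number of lags are placeholders; the real design matrix has ~50 stations and more controls, and I use log1p below only because many 15-minute bins have zero registrations:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Assumed layout: one row per 15-minute period, one impression-count column per
# station, plus the registration count for that period.
df = pd.read_csv("quarter_hour_data.csv", parse_dates=["period_start"])

stations = ["STA1", "STA2"]   # in reality ~50 station columns
n_lags = 4                    # e.g. four 15-minute lags = one hour of latency

x_cols = []
for sta in stations:
    x_cols.append(sta)
    for k in range(1, n_lags + 1):
        col = f"{sta}_lag{k}"
        df[col] = df[sta].shift(k)   # impressions k periods earlier
        x_cols.append(col)

# Simple controls, e.g. day-of-week dummies
dow = pd.get_dummies(df["period_start"].dt.dayofweek, prefix="dow", drop_first=True)

fit_df = df.dropna(subset=x_cols)
X = sm.add_constant(
    pd.concat([fit_df[x_cols], dow.loc[fit_df.index]], axis=1)
).astype(float)
y = np.log1p(fit_df["tv_registrations"])  # log1p since many bins are zero

model = sm.OLS(y, X).fit()
print(model.summary())
```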
This produces a decent, usable model for understanding the response driven by different TV stations. However, I'd like to improve it by accounting for accumulated impressions over a longer period. For example, when we go dark on TV for a week, we still see a significant baseline of TV registrations coming in, presumably as a result of having been on air in the preceding weeks.
To incorporate this into my model, I've tried adding terms to the above equation representing each station's total cumulative impressions over the preceding two weeks, running up to (but not overlapping with) the oldest lag term. The problem I'm running into is that this introduces serious multicollinearity (VIFs in the 100-200 range), which means I can't trust my t-stats.
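Concretely, I build the cumulative terms as a trailing two-week rolling sum that ends just before the oldest lag, then check the VIFs. A sketch of that construction, continuing from the frame above (96 fifteen-minute periods per day is an assumption about my data granularity):

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

PERIODS_PER_DAY = 96                  # 24 hours of 15-minute bins
CUM_WINDOW = 14 * PERIODS_PER_DAY     # trailing two weeks

cum_cols = []
for sta in stations:
    col = f"{sta}_cum2wk"
    # Rolling two-week sum ending just before the oldest lag term,
    # so the cumulative window does not overlap the lag terms.
    df[col] = (
        df[sta]
        .shift(n_lags + 1)
        .rolling(CUM_WINDOW, min_periods=CUM_WINDOW)
        .sum()
    )
    cum_cols.append(col)

fit_df2 = df.dropna(subset=x_cols + cum_cols)
X2 = sm.add_constant(fit_df2[x_cols + cum_cols]).astype(float)

# VIF for each predictor (skipping the constant in column 0)
vifs = pd.Series(
    [variance_inflation_factor(X2.values, i) for i in range(1, X2.shape[1])],
    index=X2.columns[1:],
)
print(vifs.sort_values(ascending=False).head(10))
```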
An added complexity -- which I can't say I completely understand -- is that there are no very high pairwise correlations among these predictors. Still, multicollinearity is a clear problem.
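For what it's worth, the pairwise check I'm doing is just the correlation matrix of the same design matrix X2 from above:

```python
# Pairwise correlations among the same predictors. As I understand it, VIFs
# come from regressing each predictor on *all* the others, so they can be huge
# even when no single pair is highly correlated (a column close to a linear
# combination of several others), which may be what is happening here.
corr = X2.drop(columns="const").corr()
off_diag = corr.where(~np.eye(len(corr), dtype=bool))
print("max |pairwise correlation|:", off_diag.abs().max().max())
```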
I have thought about combining some of these stations (there are about 50) so that there are roughly 8-10 predictors instead of 50, but I'm not even sure this would solve the problem.
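If I went that route, I'd pool the station columns into a handful of groups and rerun the same construction on the pooled columns -- something like the sketch below, where the group labels and the station-to-group mapping are entirely made up:

```python
# Illustrative station-to-group mapping; the real one would cover all ~50
# stations and assign them to 8-10 groups (by network, genre, daypart, etc.).
station_groups = {
    "STA1": "grp_news",
    "STA2": "grp_news",
    # ... remaining stations mapped to ~8-10 group labels
}

pooled = pd.DataFrame(index=df.index)
for group in sorted(set(station_groups.values())):
    members = [s for s, g in station_groups.items() if g == group]
    pooled[group] = df[members].sum(axis=1)   # total impressions per group

# The group columns would then replace the individual station columns, and the
# same lag / cumulative-impression construction and VIF check would be rerun.
```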
Any ideas? I would greatly appreciate any guidance.
Thanks,
Sean