Real-world predictive model validation and testing issues

#1
I have a real-world business problem: selecting products to recommend to our contacts in an email campaign.

The data set I have contains the click-throughs on our products' ads on various websites, plus other information about the visitors to the websites that display the ads.

Some of the variables in the data set do not make sense to use when scoring for email campaigns, such as the time of day the visitor visited or the websites on which the ads are displayed. However, they need to be in the training data to model the visitors' behavior correctly, and they turn out to be strong predictors. My question is: should I use these variables when creating the gains table on my test data set to evaluate model performance?

Thank you for any input in advance!
 

bryangoodrich

Probably A Mammal
#2
If I understand your question correctly, you're asking whether you should include certain independent variables that may or may not improve predictive success. Suppose you have two models, Y and Y*, with and without those predictors, respectively. Then there is no reason not to validate and test each of them and compare the two. If, however, you're asking whether you should include them in validation and testing when you already plan to use them, then you should be testing and validating the model you're going to end up with. Even so, I would still play around with the two (or five) candidate models that may all be good predictors and test them against each other for how well they generate correct predictions (PRESS criterion?).
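To make the comparison concrete, here is a rough sketch of building a gains table for two candidate models on the same test set. Everything in it is an assumption for illustration: the data are simulated, `contact_attr` and `context_attr` are hypothetical stand-ins for a variable available at email time and a web-only variable such as time of visit, and `score_full` and `score_reduced` stand in for the predicted probabilities from Y and Y*. In practice you would plug in the scores from your own fitted models.

```python
import math
import random


def gains_table(y_true, scores, n_bins=10):
    """Rank cases by score (highest first), cut into n_bins groups,
    and report cumulative capture of responders and cumulative lift."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    total_pos = sum(y_true)
    rows, cum_pos = [], 0
    for b in range(n_bins):
        start, end = b * n // n_bins, (b + 1) * n // n_bins
        bin_pos = sum(y_true[i] for i in order[start:end])
        cum_pos += bin_pos
        rows.append({
            "bin": b + 1,
            "n": end - start,
            "responders": bin_pos,
            # share of all responders found in the top bins so far
            "cum_capture": cum_pos / total_pos if total_pos else 0.0,
            # response rate in the top bins relative to the overall rate
            "cum_lift": (cum_pos / end) / (total_pos / n) if total_pos and end else 0.0,
        })
    return rows


# Simulated test set (assumption: not real campaign data).
random.seed(0)
n = 2000
y_test, score_full, score_reduced = [], [], []
for _ in range(n):
    contact_attr = random.gauss(0, 1)  # hypothetical: known at email time
    context_attr = random.gauss(0, 1)  # hypothetical: web-only, e.g. time of visit
    p = 1 / (1 + math.exp(-(-2 + contact_attr + 1.5 * context_attr)))
    y_test.append(1 if random.random() < p else 0)
    score_full.append(p)  # stand-in for model Y (all predictors)
    score_reduced.append(1 / (1 + math.exp(-(-2 + contact_attr))))  # stand-in for Y*

table_full = gains_table(y_test, score_full)
table_reduced = gains_table(y_test, score_reduced)

print("bin  capture(Y)  capture(Y*)")
for rf, rr in zip(table_full, table_reduced):
    print(f"{rf['bin']:>3}  {rf['cum_capture']:>10.3f}  {rr['cum_capture']:>11.3f}")
```

Reading the two capture columns side by side shows how much ranking power is lost when the context-only variables are dropped, which is the comparison between Y and Y* described above.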