I just played with Exercise 10 from Chapter 4 – predicting whether the stock exchange would go up or down using weekly data. I got an interesting surprise. The exercise proposed that I split the training and test data by year: everything before 2009 was training data, and 2009 through 2010 was test data. It also proposed using only Lag2 as a predictor, out of the 5 available lags and the trade volume. I was not sure about splitting according to time: my suspicion was that if there was a trend or any other time-related pattern, it might not be captured in the test set in the same way as in the training set, leading to biased performance estimates.
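For concreteness, this is roughly what that split looks like in Python (a sketch, not the code I actually ran; I'm assuming the ISLR Weekly data set has been exported to a Weekly.csv with a Year column, the five lag columns, Volume and a Direction column):

```python
import pandas as pd

# Assumed CSV export of ISLR's "Weekly" data set: one row per week,
# 1990-2010, with Year, Lag1..Lag5, Volume and Direction ("Up"/"Down").
weekly = pd.read_csv("Weekly.csv")

# The split the exercise proposes: train on the years before 2009,
# test on 2009-2010.
train = weekly[weekly["Year"] < 2009]
test = weekly[weekly["Year"] >= 2009]
print(len(train), len(test))
```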
So, I ran the logistic regression in 3 cases. With all the available data, I got Lag2 as the significant predictor. However, if I ran the logistic regression on the training data alone, then Lag1 was significant and Lag2 was not. I guess this means that probably neither of them is significant, and all we see is some fluke in the data. I then decided to take a completely random selection of 800 points as the training data – and sure enough, there were no significant predictors there.
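In code, the three fits look something like this (a statsmodels sketch, same assumed Weekly.csv as above; the exact p-values will of course depend on the data):

```python
import pandas as pd
import statsmodels.formula.api as smf

weekly = pd.read_csv("Weekly.csv")  # assumed export, as above
weekly["Up"] = (weekly["Direction"] == "Up").astype(int)
formula = "Up ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume"

# Case 1: all available data.
print(smf.logit(formula, data=weekly).fit(disp=0).pvalues)

# Case 2: the time-based training set only (years before 2009).
train = weekly[weekly["Year"] < 2009]
print(smf.logit(formula, data=train).fit(disp=0).pvalues)

# Case 3: a completely random subset of 800 weeks.
random_train = weekly.sample(n=800, random_state=0)
print(smf.logit(formula, data=random_train).fit(disp=0).pvalues)
```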
Now, apart from the true objective of the exercise, this raises interesting questions about our use of regression and model selection. I would have accepted either Lag1 or Lag2 as a legitimate predictor in any analysis, and I guess anyone else would have accepted them as well. Given the recent discussions on the value of the p-value as a tool, this is quite sobering. Maybe one could extend p-value testing to require that train and test samples be used as well? I am thinking of something like: either build the model using a training set and validate it on a test set, or work backwards – find the model using all the data, but then require that there be some indication of the effect when refitting on a smaller random subset of the original data.
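As a crude sketch of that second idea, one could refit on many random subsets and ask how often the "significant" predictor survives (same assumptions as before; the 0.05 cutoff and the subset size of 800 are arbitrary choices on my part):

```python
import pandas as pd
import statsmodels.formula.api as smf

weekly = pd.read_csv("Weekly.csv")  # assumed export, as above
weekly["Up"] = (weekly["Direction"] == "Up").astype(int)

# Refit on many random subsets and count how often Lag2 stays
# "significant" at the usual 0.05 level.
n_reps, hits = 200, 0
for seed in range(n_reps):
    subset = weekly.sample(n=800, random_state=seed)
    fit = smf.logit("Up ~ Lag2", data=subset).fit(disp=0)
    hits += fit.pvalues["Lag2"] < 0.05
print(f"Lag2 significant in {hits} of {n_reps} random subsets")
```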
BTW, of the possible choices for a classification algorithm, logistic regression with a threshold of 0.5 behaved very poorly, and LDA was only marginally better. Both algorithms essentially bet on an upward movement – the logistic regression predicted only 7 downward movements out of a total of 289. Because there were more upward movements than downward ones, this got them an overall accuracy of around 51-52%. Surprisingly, QDA got a whopping 58.5%, with KNN at k=1 being as bad as logistic regression, but at k=5 slightly better than both logistic regression and QDA. I actually never had QDA on my radar; I guess this will change now.
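For reference, such a comparison can be run in a few lines with scikit-learn (a sketch using Lag2 as the only predictor, as the exercise suggests, and the same assumed Weekly.csv; I'm not claiming it reproduces the exact numbers above):

```python
import pandas as pd
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

weekly = pd.read_csv("Weekly.csv")  # assumed export, as above
train = weekly[weekly["Year"] < 2009]
test = weekly[weekly["Year"] >= 2009]
X_train, y_train = train[["Lag2"]], train["Direction"]
X_test, y_test = test[["Lag2"]], test["Direction"]

# predict() uses the 0.5 probability threshold for logistic regression.
models = {
    "logistic regression": LogisticRegression(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN, k=1": KNeighborsClassifier(n_neighbors=1),
    "KNN, k=5": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    accuracy = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: {accuracy:.3f}")
```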