Bookclub: ISLR


Chapter 7 is about dealing with non-linearities - building multiple regression models with polynomials, splines, step functions, etc.

Here, my biggest difficulty is picking a suitable model - is a spline better than a polynomial, or should I use a step function instead? Ex. 7 is about modelling the mileage of autos from a dataset of about 400 cars. There is a clear non-linearity in the dependence on most parameters.

Trying to pick the right model, I first used spline functions, and the results were great as far as the R-squared and p-values were concerned (great as in an R-squared of 0.99 and p-values of 0.001, roughly). So the next question was how many knots to pick.

To cut a long story short, I went with the basic idea of using cross-validation to pick the number of knots - expecting bad performance for both too few knots (too stiff a model) and too many (overfitting). My big surprise was that the spline models completely failed to produce any pattern with an increasing number of knots. It is possible that I made a mistake somewhere, of course.
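To show roughly the kind of cross-validation loop I mean, here is a sketch in Python rather than R, on synthetic data rather than the Auto set - the truncated power basis, the knot grid and all the names here are just for illustration, not my actual code:

```python
import numpy as np

def spline_basis(x, knots):
    """Truncated power basis for a cubic spline with the given interior knots."""
    cols = [np.ones_like(x), x, x**2, x**3]
    for k in knots:
        cols.append(np.clip(x - k, 0.0, None) ** 3)
    return np.column_stack(cols)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 300)
y = np.sin(x) + rng.normal(0.0, 0.3, 300)  # stand-in for a non-linear mileage curve

def cv_rmse(n_knots, n_folds=5):
    """5-fold cross-validated RMSE for a cubic spline with n_knots interior knots."""
    idx = rng.permutation(len(x))
    errs = []
    for fold in np.array_split(idx, n_folds):
        test = np.zeros(len(x), dtype=bool)
        test[fold] = True
        # place the knots at quantiles of the training x
        knots = np.quantile(x[~test], np.linspace(0, 1, n_knots + 2)[1:-1])
        beta, *_ = np.linalg.lstsq(spline_basis(x[~test], knots), y[~test], rcond=None)
        pred = spline_basis(x[test], knots) @ beta
        errs.append(np.sqrt(np.mean((y[test] - pred) ** 2)))
    return float(np.mean(errs))

scores = {k: cv_rmse(k) for k in [1, 2, 4, 8, 16]}
```

The expectation is that `scores` traces out a U-shape over the knot counts; on the Auto data my spline scores showed no such pattern.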

As a counter-example, I re-ran the cross-validation with loess, varying the span as a substitute for the number of knots, and I could find a region of the parameter that was about a factor of 5 better at prediction than any spline model I came up with. I could also clearly see a pattern - for very flexible models my RMSE was about the same as that of the spline models, while stiffening the model too far brought a large increase in the predicted RMSE, with the best predictions in between.
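The loess comparison can be sketched the same way. Since I can't reproduce R's loess here, this is a minimal local-linear smoother with a tricube kernel in Python, again on synthetic data of my own - the helper names and the span grid are mine:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, 300)
y = np.sin(x) + rng.normal(0.0, 0.3, 300)

def loess_predict(x_tr, y_tr, x0, span):
    """Predict at x0 with a tricube-weighted local linear fit on the
    fraction `span` of nearest training points (a minimal loess)."""
    k = max(int(span * len(x_tr)), 3)
    d = np.abs(x_tr - x0)
    near = np.argsort(d)[:k]
    w = (1 - (d[near] / d[near].max()) ** 3) ** 3       # tricube weights
    # polyfit minimizes sum((w_i * r_i)^2), so pass sqrt of the weights
    coef = np.polyfit(x_tr[near], y_tr[near], 1, w=np.sqrt(w))
    return np.polyval(coef, x0)

def cv_rmse(span, n_folds=5):
    """5-fold cross-validated RMSE of the local-linear smoother."""
    idx = rng.permutation(len(x))
    errs = []
    for fold in np.array_split(idx, n_folds):
        test = np.zeros(len(x), dtype=bool)
        test[fold] = True
        pred = np.array([loess_predict(x[~test], y[~test], x0, span)
                         for x0 in x[test]])
        errs.append(np.sqrt(np.mean((y[test] - pred) ** 2)))
    return float(np.mean(errs))

scores = {s: cv_rmse(s) for s in [0.05, 0.1, 0.2, 0.4, 0.8]}
```

Small spans give very flexible fits, large spans very stiff ones, so scanning `scores` over the span plays the same role as scanning over the number of knots.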

This fits nicely with a discussion about the uses of the R-squared metric. It seems to me that, as far as predictive quality goes, R-squared is as good as useless. Had I stayed with the model with the best R-squared, I would never have tried the alternatives to an arbitrary spline model, even though the loess can be five times as good at prediction.
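The point can be illustrated with made-up data (again not the Auto set): training R-squared can only improve as the model gets more flexible, while cross-validated RMSE need not - so the two metrics can rank models differently. The polynomial degrees here are an arbitrary stand-in for model flexibility:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 3.0, 60)
y = np.sin(2 * x) + rng.normal(0.0, 0.3, 60)

def train_r2(deg):
    """R-squared of a degree-`deg` polynomial on the data it was fitted to."""
    p = np.polynomial.Polynomial.fit(x, y, deg)
    resid = y - p(x)
    return 1 - resid.var() / y.var()

def cv_rmse(deg, n_folds=5):
    """5-fold cross-validated RMSE of a degree-`deg` polynomial."""
    idx = rng.permutation(len(x))
    errs = []
    for fold in np.array_split(idx, n_folds):
        test = np.zeros(len(x), dtype=bool)
        test[fold] = True
        p = np.polynomial.Polynomial.fit(x[~test], y[~test], deg)
        errs.append(np.sqrt(np.mean((y[test] - p(x[test])) ** 2)))
    return float(np.mean(errs))

# The flexible model always wins on training R-squared (nested least squares)...
r2_simple, r2_flex = train_r2(3), train_r2(12)
# ...but that says nothing about held-out prediction error.
rmse_simple, rmse_flex = cv_rmse(3), cv_rmse(12)
```

Picking the model by training R-squared alone would always push towards the most flexible fit, which is exactly the trap I nearly fell into with the 0.99 spline.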