# Correct Cross Validation. How to calculate the projected R Squared or Residual Sum Sq

#### morthill

##### New Member
Hi,
I have read into the subject of finding good estimators to determine the goodness of fit when the regression on a trainingset is projected on a testset (unseen data). I have found a lot of scientific papers but I get completely lost in terminologie and very complex equations I do not understand.
My own approach, that has suited me well so far, is like this:
1) Train the training dataset
2) Apply the equation derived from step 1 on the test dataset
The problem is that the last step is very expensive in terms of computitive power. I am trying to find another approach.

I will illustrate my dilemma in an extensive example.

We have a total of 20 observations like this:

HTML:
Subset  IQ  Income
1  82   30
1  87   75
1  93   25
1  99   45
1  105  50
1  110  40
1  122  70
1  122  65
1  126  110
1  140  95
2  83   45
2  86   30
2  86   50
2  92   45
2  95   120
2  101  75
2  111  65
2  118  85
2  119  120
2  133  200
I have split the data in two subsets. We want to predict the income based on IQ.

Now we do some queries on the datasets:

HTML:
Subset  Count Sum(IQ)    Sum(IQ^2)    Sum(Income)    sum(Iq*Income)      Variance(Income)
Both    20    2110       228578       1440           160900              1745.78947
1       10    1086       121112       605            69045               769.16667
2       10    1024       107466       835            91855               2622.50000

From this we can populate the matrices to calculate the predictors (coefficients) and some basic information of the regression:

Subset   DesignMatrix         ScoreMatrix      (Design Matrix)-1                (Design Matrix)-1 x ScoreMatrix    ESS                TSS      RSS               RMSE              R Squared

All      20     2110          1440              1.9134270886  -0.0176628160      -86.6120877281                    13500.820358279     33170   19669.179641721   983.4589820861   0.4070190039
2110   228578        160900           -0.0176628160   0.0001674201        1.5034321112

1        10     1086          605              3.8176774682   -0.0342327575      -53.9058756777                      3520.6670029      6922.5   3401.8329971     170.091649855    0.5085831712
1086   121112        69045            -0.0342327575   0.0003152188        1.0534611020

2        10     1024          835              4.1199969330   -0.0392577825     -165.8261769667                    15463.5795890201    23602.5  8138.9204109799   406.946020549   0.65516702
1024   107466        91855           -0.0392577825    0.0003833768        2.4348259469

Where:
HTML:
ESS=Explained Sum Squared. We derive this value from using the formula: CoefficientPredictor^2 * Row(1,1) From (Design Matrix)-1. In the case of Subset 1 this would be  1.0534611020^2 * 0.0003152188^-1
TSS=Total Sum Squared=Variance(Income) * (Count-1)
RMSE=Root Mean Squared Error=RSS/Count. I am not sure if this is the correct way to calculate this. Maybe we need to take the square root of this value.
R2=ESS/TSS

The outcome of all these values (except maybe not RMSE) correspond with the values that R calculates.

So for subset one and two we come up with the following equations to predict the income based on IQ:
HTML:
For subset 1:  -53.9058756777 + 1.0534611020 * IQ
For subset 2: -165.8261769667 + 2.4348259469 * IQ
What I normally do now is project equation subset 1 on dataset 2 and equation subset 2 on dataset 1, so in this case a 2-fold cross-validation.
When I query this in Sql it would look something like this:
HTML:
....
Case When SubSet=2 Then -53.9058756777199 + (1.05346110200479 * [IQ]) When SubSet=1 Then -165.826176966724 + (2.43482594694066 * [IQ]
....
I get the following Residual Sum Squares when I cross validate:
HTML:
Subset 1   23968.2996788169
SubSet 2   21837.2586074513
Total         45805.5582862682
Note that the cross-validated residuals for subset 1 (23968.2996788169) is way higher then the not cross validated residuals (3401.8329971).
Although the difference is very high this does not matter for this example.
Now comes my question.

We have used a very small dataset here of a total of 20 observations. Assume the dataset contains millions of observations. Then the cross-validated query would be very expensive. Is there a more efficient way for this last step of cross validation without the need to query the whole dataset again?