Out of sample validation

noetsi

Fortran must die
#1
By "out of sample validation" I mean holding out a portion of the data to test your regression model against. I have data I can hold out, but I am not sure, practically, how you would test the models you develop against this hold-out data.

As far as I know, SAS does not have a way to use hold-out data. In theory you could apply the parameters estimated on the initial data set to the new data and compare the results to the actual Y - but I am not sure how you would do this in practice. And how would you know from this which model is best, anyway?
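For example, something like the following is what I have in mind, with hypothetical variable names and made-up coefficients:
Code:
DATA Scored;
	SET Holdout_Data;
	/* hypothetical parameters taken from a model fit on the initial data set */
	Y_hat = 12.3 + 1.7*X1 - 0.4*X2;
	Residual = Y - Y_hat;	/* compare predicted to the actual Y */
RUN;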
 

gianmarco

TS Contributor
#2
I implemented in R a couple of methods for internal validation, tailored for logistic regression. Maybe they can give you some ideas.
Please refer to the website in my signature and look at the 'other tools for statistics' page.
 

hlsmith

Not a robit
#3
Why don't you think SAS has a way to do this? It's a juggernaut. I'm not sure if you have access, but I think PROC PLM, or something spelled close to that, is what you use. They also have a cross-validation approach; I think it uses the same procedure. Let me know how it goes. I would love to use the code, but my datasets are never big enough.

What you are looking for is whether the model performs as well in the new dataset, primarily the AUC value.

Quit making assumptions about whether something is available :)

i hope you got a chance to kick trump in his little nards today!
 

hlsmith

Not a robit
#4
So it's PROC PLM.

I was thinking it might not be available until the 9.4M3 release. There was something else I wanted to run that wasn't available until then.

A quick search on cross-validation for logistic regression looked good. I may try to replicate it. Though your outcomes are usually continuous.
 
#6
If you want to split your source dataset randomly into a development portion plus a validation portion (i.e., a holdout sample), you could also do so with a single “DATA” step:
Code:
DATA	Dev_Data Val_Data;
	SET	All_Data;
	IF (RAND("UNIFORM") <= 0.8) THEN
		OUTPUT Dev_Data;
	ELSE
		OUTPUT Val_Data;
RUN;
The above code will produce two datasets, namely “Dev_Data” for model development and “Val_Data” for model validation. Any unique observation can appear in only one of these two datasets, not in both. You can vary the proportion of observations in each by adjusting the hard-coded value in the “IF” construct. In the above example, the development portion will have about 80% of the observations and the validation portion the other 20%. (The actual proportion may vary but it becomes more accurate the more observations there are in total.)

Both “Dev_Data” and “Val_Data” will contain the same fields, including the dependent variable. Once the model has been developed using only the “Dev_Data” portion, you can assess its validity by applying it to “Val_Data” and checking, for example, whether the distributions of residuals are similar between “Dev_Data” and “Val_Data”.
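As a sketch, assuming a continuous outcome Y and predictors X1 and X2 (hypothetical names), the fit-then-compare step could look like this:
Code:
/* Fit on the development portion only */
PROC REG DATA=Dev_Data OUTEST=Est NOPRINT;
	MODEL Y = X1 X2;
RUN; QUIT;

/* Apply the estimated parameters to both portions */
PROC SCORE DATA=Dev_Data SCORE=Est OUT=Dev_Scored TYPE=PARMS;
	VAR X1 X2;
RUN;

PROC SCORE DATA=Val_Data SCORE=Est OUT=Val_Scored TYPE=PARMS;
	VAR X1 X2;
RUN;

/* PROC SCORE names the scored variable after the model label (MODEL1 by default) */
DATA Both;
	SET Dev_Scored(IN=d) Val_Scored;
	LENGTH Portion $3;
	IF d THEN Portion = 'DEV'; ELSE Portion = 'VAL';
	Residual = Y - MODEL1;
RUN;

PROC MEANS DATA=Both N MEAN STD;
	CLASS Portion;
	VAR Residual;
RUN;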
 

noetsi

Fortran must die
#7
Why don't you think SAS has a way to do this? It's a juggernaut. I'm not sure if you have access, but I think PROC PLM, or something spelled close to that, is what you use. They also have a cross-validation approach; I think it uses the same procedure. Let me know how it goes. I would love to use the code, but my datasets are never big enough.

What you are looking for is whether the model performs as well in the new dataset, primarily the AUC value.

Quit making assumptions about whether something is available :)

i hope you got a chance to kick trump in his little nards today!
Given that I have spent part of each day for years reading SAS material, after a while you think you have a general sense of what it does. And no one I encountered ever mentioned validation in this form in the context of SAS. I even downloaded several years' worth of SUGI publications....
 

noetsi

Fortran must die
#8
I guess I am confused here. I thought what you were doing is not just applying the same variables to the test and original data sets, but applying the specific slopes from the first set of data to the second and analyzing the residuals.

So if the model in the training data generated a result of Y = 1400 + 50*B1 + 47*B2..., then you would use the values from the second data set but the parameters from the first data set. I thought that is what they meant by "model". And I could not figure out how you would tell SAS to use a specific set of preexisting parameters....

Apparently the answer is you don't.
 
#9
And I could not figure how you would tell SAS to use a specific set of preexisting parameters....

Apparently the answer is you don't.
As far as I’m aware there’s no easy way to “load” an existing model into SAS other than in the form of SAS code that explicitly calculates the model. However, at least some of the PROCs (e.g., LOGISTIC) include an option called “OUTMODEL = <name>”. This option tells SAS to produce a dataset called “<name>” that specifies the model that was built—variables, coefficients and some other stuff. There may be a PROC that uses this dataset to evaluate the model for any input dataset that has the necessary variables. But if there is one, I don’t know about it.

I use this option plus some extra processing that basically produces a SAS text script of the model that can be pasted directly in as SAS code.
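For example, for the linear model discussed above, the "explicit SAS code" form of the model is just a DATA step (hypothetical variable names, coefficients pasted in from the development fit):
Code:
DATA Val_Scored;
	SET Val_Data;
	/* coefficients copied from the model estimated on the development data */
	P_Y = 1400 + 50*B1 + 47*B2;
	Residual = Y - P_Y;
RUN;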
 
#10
Maybe I need to stare at the link harder, but they do this in the above link. In particular, they use the fitted model's beta coefficients (slopes) but the observed values in the validation dataset.


As you can see, the first model fits better, since its own observations were used (along with its own particular nuances based on sampling variability). This process will also make you question whether you may have overfit the model.


I think where the confusion may come in with this approach is that many people think it should be more like k-fold cross-validation where, if I remember right and k = 10, then 10 unique models are fit and, I believe, their results averaged. Now, I have not done this in SAS. That approach is better when you don't have a lot of data to hold out for the first method.
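A minimal sketch of the fold assignment, at least (k = 10; dataset and variable names hypothetical):
Code:
/* randomly assign each observation to one of 10 folds */
data folded;
	set all_data;
	fold = ceil(10 * rand('uniform'));
run;
Each of the 10 models is then fit on the observations with fold ne i and evaluated on the observations with fold = i.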


P.S. As I said, I typically don't have enough data to do this. But I am trying harder to remember to think about it and possibly incorporate it into my protocol before starting a project.
 

Stu

New Member
#11
It depends on the procedure that you're using. Some have built-in scoring, others don't. You can generically use PROC SCORE to do so, but you can just as easily leave your dependent variable blank and achieve the same results. Personally, I don't use PROC SCORE because I find it to be a bit of a hassle for anything that doesn't output an "outest=" dataset. For all others, you'll need to take your ODS output from the ParameterEstimates dataset and transform it into what PROC SCORE is expecting.

For example:

Built-in scoring

Code:
data all;
	length Role $8.;

	set sashelp.bweight;

	if(rand('uniform') < 0.7) then Role='TRAIN';
		else Role='VALIDATE';
run;

proc glmselect data=all;
	partition rolevar=Role(train='TRAIN' validate='VALIDATE');

	class black married boy momsmoke visit momedlevel;
	
	model Weight = Black--momedlevel 
	/ showpvalues select=sl 
		      selection=stepwise(sls=0.05 sle=0.2 stop=adjrsq);

	output out=pred_all p=P_Weight ;
run;
Manual scoring

Code:
data all;
	length key 8. role $8.;
	set sashelp.bweight;

	key+1;

	Weight_Actual = Weight;

	if(rand('uniform') < 0.7) then do;
		role='TRAIN';
		call missing(weight);
	end;
		else role='VALIDATE';
run;

proc glmselect data=all;
	class black married boy momsmoke visit momedlevel;

	model Weight = Black--momedlevel 
	/ showpvalues select=sl 
                      selection=stepwise(sls=0.05 sle=0.2 stop=adjrsq);

	output out=pred_all p=P_Weight;
run;

data check;
	length key weight_actual p_weight error 8.;
	set pred_all;

	error = Weight_Actual - P_Weight;

	keep Key Weight_Actual P_Weight Error;
run;

proc sort data=pred_all;
	by role key;
run;
As far as I’m aware there’s no easy way to “load” an existing model into SAS other than in the form of SAS code that explicitly calculates the model. However, at least some of the PROCs (e.g., LOGISTIC) include an option called “OUTMODEL = <name>”. This option tells SAS to produce a dataset called “<name>” that specifies the model that was built—variables, coefficients and some other stuff. There may be a PROC that uses this dataset to evaluate the model for any input dataset that has the necessary variables. But if there is one, I don’t know about it.

I use this option plus some extra processing that basically produces a SAS text script of the model that can be pasted directly in as SAS code.
PROC MODEL can basically do this with the outmodel, outparms, model, and parms options, but that's the only one that I know of. I'm not sure if it's capable of reading a model from PROC LOGISTIC though.
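That said, for PROC LOGISTIC specifically, I believe the same procedure can read its own OUTMODEL= dataset back in via INMODEL= and score a new dataset (dataset and variable names hypothetical):

Code:
proc logistic data=train outmodel=saved_model;
	model y(event='1') = x1 x2;
run;

proc logistic inmodel=saved_model;
	score data=holdout out=scored;
run;
The SCORE statement writes predicted probabilities (P_1, P_0) to the OUT= dataset, so if the outcome is present in the hold-out data you can compare predicted to observed there.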
 
#12
Stu,


In your manual example code, you seem to use the full dataset then calculate the errors via observed minus predicted. I understand this, but how does it fit in with the out-of-sample theme? How can it be used to score a new dataset?


Thanks!
 

Stu

New Member
#13
Stu,


In your manual example code, you seem to use the full dataset then calculate the errors via observed minus predicted. I understand this, but how does it fit in with the out-of-sample theme? How can it be used to score a new dataset?


Thanks!
The exact same way :) Simply append the score dataset to your main training dataset. Any missing dependent values are not used for parameter estimation and will be predicted after estimation. For example, if you only wanted the predicted values for the unknown weights:

Code:
data predictions;
	set pred_all;
	where weight = .;
run;
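For instance, if a genuinely new dataset arrived (here called new_data, a hypothetical name, with Weight missing for every row), you would stack it with the training data before the PROC GLMSELECT step:

Code:
data all;
	/* new_data is hypothetical; its Weight values are all missing */
	set sashelp.bweight new_data;
run;
The missing Weight values are ignored during estimation, and the output dataset will contain P_Weight for every row, including the new ones.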