Developing and assessing a Cox prediction model using lasso

I wonder if anyone can comment on whether the following modelling strategy is valid, please? I have a 200-patient survival data set (actually 2 data sets: one with 40 events and one with 160 events) and roughly 100,000 candidate predictors.
I want to build a prediction model, so I plan to use LASSO and elastic net (with glmnet). I was intending to split the data into 2/3 train and 1/3 test, find the best shrinkage parameter lambda (via CV in the training data) and assess the predictions on the remaining 1/3 (by c-statistic).
I was planning on, say, 50 train/test splits to see if the same predictors were selected repeatedly and to see the variability in the best lambda and in the estimated c-statistics in the test data.
Then I was going to obtain a final model using all the data, using the shrinkage parameter found by cross-validating the entire data set. I'd then claim that the c-statistic of that final model is within the range of the c-statistics found from the 50 train/test splits (maybe around the mean). Are there any flaws in this scheme, or room for improvement?
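For anyone following along, here is a minimal sketch of the c-statistic used to score the held-out third. It is a plain-Python version of Harrell's concordance index written for illustration only; the function name `harrell_c` is made up, pairs with tied times are simply skipped (an assumption made for brevity), and in practice you would use the implementation that ships with your survival package.

```python
def harrell_c(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    A pair (i, j) is comparable when subject i had an event and a strictly
    shorter follow-up time than subject j. The pair is concordant when the
    subject who failed earlier also has the higher predicted risk.
    Ties in the risk score count as half concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the known earlier failure
        for j in range(n):
            if times[i] < times[j]:  # tied times are skipped (simplification)
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs in this data set")
    return concordant / comparable
```

With only a handful of events in a test split, the number of comparable pairs is small, which is one reason the estimate varies so much from split to split.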


Less is more. Stay pure. Stay poor.
Hmm, using the full data set at the end seems a little questionable. I can see doing the training as you described and applying the final active set to the holdout test set for predictions. It is standard practice to end there, given that the training set was used for feature selection and the results are conditional on that process and those data. I think some people would also be comfortable, depending on your agenda, with just using the features selected in the training set in a non-regularized proportional hazards model and then evaluating that on the test set. This latter approach seems the most appealing to me, though I don't fully know your context.

Thanks for your help hlsmith. I'm not sure how I'd implement your idea given I've got 50 train/test splits (due to the small data set size, the different splits give very different models). I thought I'd be OK using the full data set for the final model (but quoting the mean of the 50 test-set c-statistics as its likely performance, to avoid an optimistic estimate). I admit I'm not 100% sure what the best way to get a final model is. There seem to be links on this but I can't get to the bottom of it.
This paragraph is what I'm clutching onto to justify my approach: "So the answers to your question are (i) yes, you should use the full dataset to produce your final model as the more data you use the more likely it is to generalise well but (ii) make sure you obtain an unbiased performance estimate via nested cross-validation and potentially consider penalising the cross-validation statistic to further avoid over-fitting in model selection."
I am assessing performance via the test c-statistics in 50 train/test splits.
I know there is a train/validation/test concept, but I have very few events and would like to avoid that, as my estimate of predictive performance in a tiny test set would surely be unreliable (for the same reason I've got 50 train/test splits, to get more estimates of holdout sample performance)?
Thanks again for your advice.


Less is more. Stay pure. Stay poor.
Hmm, unfortunately I don't have the faculties to read that large CV thread right now. For clarity, you have an overall sample size of 200 patients? What is the prevalence of survival in the overall set? With that many candidate predictors, I wonder if the genomics (GWAS) community may have a better approach? I know they like q-values, etc.

So you mention performing a 160:40 split on your data. How do you get the actual 50 test sets you keep mentioning? Also, what is the end purpose of this whole modeling process? Do you want generalizable predictions, inferential statistics, etc.?
Hi hlsmith, ahh sorry, I've not been clear. I am doing a 70%:30% split on my data for the train/test sets (70% in train). I did 50 different 70%:30% splits into train and test, each time saving the selected variables from each of the 50 trainings and the 50 c-statistics from applying the model to the 50 test splits.
I have 2 data sets, one with around 40 events and the other with around 160, and I'm doing the same procedure on both.
Thanks for all your help on this.
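To make the procedure concrete, the bookkeeping for the 50 splits could be sketched roughly as below. Everything here is illustrative: `repeated_splits` and the `fit_and_score` callback are hypothetical names, and splitting events and censored patients separately (so each split keeps a similar event fraction) is an extra assumption, not something stated above. The callback stands in for the actual penalised Cox fit; it should tune lambda by cross-validation inside the training indices only and return the selected feature names plus the test-set c-statistic.

```python
import random
from collections import Counter


def repeated_splits(events, fit_and_score, n_splits=50, train_frac=0.7, seed=0):
    """Run repeated train/test splits and tally what gets selected.

    events        : list of 0/1 event indicators, one per patient
    fit_and_score : hypothetical callback (train_idx, test_idx) ->
                    (selected_feature_names, test_c_statistic)
    Returns (Counter of how often each feature was selected, list of c-statistics).
    """
    rng = random.Random(seed)
    event_idx = [i for i, e in enumerate(events) if e]
    censored_idx = [i for i, e in enumerate(events) if not e]
    selection_counts = Counter()
    c_stats = []
    for _ in range(n_splits):
        train, test = [], []
        # Split events and censored patients separately (assumed stratification).
        for group in (event_idx, censored_idx):
            shuffled = group[:]
            rng.shuffle(shuffled)
            cut = round(train_frac * len(shuffled))
            train.extend(shuffled[:cut])
            test.extend(shuffled[cut:])
        features, c_stat = fit_and_score(train, test)
        selection_counts.update(set(features))  # count each feature once per split
        c_stats.append(c_stat)
    return selection_counts, c_stats
```

Dividing each count by the number of splits gives a selection frequency per feature, and the spread of the returned c-statistics shows how unstable the holdout estimate is at this sample size.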


Less is more. Stay pure. Stay poor.

What is the purpose? Many would say that model 1: y-hat = X1 + X2 + X3 is not comparable to model 2: y-hat = X1 + X2. You can test whether adding X3 is beneficial, but they are two different models: in the first you have X1 controlling for X2 and X3, and in the latter X1 controlling for X2 only. The interpretation of X1 is not the same, so you cannot just assume they are the same model. You may get away with it if you are solely interested in prediction, but how would you interpret your predictions? Try writing out in words what you think your outcome is and how you generalize it if you are using different terms in different models.
Sorry for the delay, I was away yesterday. Thanks for your thoughts; I need to think about this. I think it's both really: inference to know which features predict, how many, and how well they predict. I know that for prediction, including non-significant features can be OK (as the study isn't powered to spot differences and they won't hurt prediction much). Is that related to what you're saying?


Less is more. Stay pure. Stay poor.
Not really, but if you have terms you want to control for in the model regardless, you should do it. I have not done that, but I would imagine that some may have their coefficient shrunk to 0, so I am not sure how you treat that, since when scoring data you would get a zero product (e.g., age = 35, coefficient = 0; 35*0 = 0). I wonder if ridge would be a better fit if you can't find an example of a person doing this.
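To illustrate the zero-product point, here is a toy sketch of the lasso shrinkage step with a per-coefficient penalty weight. This is not glmnet's implementation, just the soft-thresholding operator applied to some made-up coefficients; the function names and numbers are invented for the example. The idea of a weight of 0 meaning "never shrink this term" mirrors glmnet's `penalty.factor` argument, which may be one way to keep forced-in covariates such as age in the model.

```python
def soft_threshold(z, lam):
    """Lasso shrinkage: pull z toward zero by lam, stopping at exactly zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def shrink(raw_coefs, lam, penalty_factor):
    """Apply soft-thresholding with a separate penalty weight per coefficient.

    A weight of 0 leaves that coefficient unpenalised, so it is never set to zero
    by the penalty.
    """
    return [soft_threshold(b, lam * w) for b, w in zip(raw_coefs, penalty_factor)]


def linear_predictor(x, coefs):
    """Risk score for one patient: sum of value * coefficient."""
    return sum(xi * bi for xi, bi in zip(x, coefs))


# Made-up numbers: a small age effect and one stronger marker effect.
raw = [0.02, 0.5]
patient = [35, 1.0]  # age = 35, marker = 1.0

all_penalised = shrink(raw, lam=0.05, penalty_factor=[1, 1])
age_forced_in = shrink(raw, lam=0.05, penalty_factor=[0, 1])
```

In `all_penalised` the age coefficient is shrunk to exactly 0, so age = 35 contributes 35*0 = 0 to the score; in `age_forced_in` the age coefficient is left at its raw value and still contributes.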

Please keep your questions coming - they help me process and better understand/think about methods.