Finding correlation of independent variables with a response variable in R

#1
I have a set of independent variables X1, X2, X3, ..., Xn and a response variable Y. I want to perform some form of data analysis in R so that I select only the subset of independent variables that have the most significant effect on Y. Ideally, the results should tell me how strongly (as a percentage) each independent variable is related to Y. My data set is a bioinformatics data set consisting of molecular descriptors. I want to fit a multiple linear regression to it for cheminformatics QSAR predictions, and for that regression I want to keep only the most useful/significant independent variables out of the total.

Can somebody kindly suggest what statistical tool(s) or approaches I should look at to achieve the above task?
 

staassis

#2
Suppose the total number of predictors is n. If the sample size is at least (n+1) * 15, use backward stepwise variable selection. Otherwise, use forward stepwise variable selection.
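
For what it is worth, here is a minimal sketch of that rule in base R using step(). The data are simulated and the column names (X1..X6, Y) are placeholders for a real descriptor table, so treat this as an illustration rather than a recipe. Note that step() selects by AIC by default, not by p-values.

```r
# Simulated stand-in for a descriptor table (hypothetical names X1..X6, Y).
set.seed(1)
n_obs <- 200
X <- as.data.frame(matrix(rnorm(n_obs * 6), ncol = 6))
names(X) <- paste0("X", 1:6)
dat <- cbind(X, Y = 2 * X$X1 - 1.5 * X$X3 + rnorm(n_obs))

n_pred <- ncol(dat) - 1            # number of candidate predictors
full <- lm(Y ~ ., data = dat)      # model with every predictor
null <- lm(Y ~ 1, data = dat)      # intercept-only model

# Rule of thumb from the post: backward if sample size >= (n + 1) * 15.
if (nrow(dat) >= (n_pred + 1) * 15) {
  fit <- step(full, direction = "backward", trace = 0)
} else {
  # Forward search starts from the null model; scope gives the upper limit.
  fit <- step(null, scope = formula(full), direction = "forward", trace = 0)
}

# Names of the predictors that survived the selection.
selected <- setdiff(names(coef(fit)), "(Intercept)")
print(selected)
summary(fit)
```

With a real data set, replace dat with your own data frame and Y with the name of your response column.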

More robust / complicated variable selection methods include LASSO and cross-validation. On average, they perform better than forward / backward stepwise variable selection if the sample size is small.
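
A rough sketch of the LASSO route, assuming the glmnet package is installed (it is not part of base R). The data are again simulated with placeholder names X1..X6 and Y. cv.glmnet() chooses the penalty by cross-validation, and the predictors with non-zero coefficients at the chosen penalty are the selected ones.

```r
library(glmnet)

# Simulated stand-in for a descriptor table (hypothetical names X1..X6, Y).
set.seed(1)
n_obs <- 200
x <- matrix(rnorm(n_obs * 6), ncol = 6)
colnames(x) <- paste0("X", 1:6)
y <- 2 * x[, "X1"] - 1.5 * x[, "X3"] + rnorm(n_obs)

# alpha = 1 is the LASSO penalty; 10-fold cross-validation picks lambda.
cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)

# Coefficients at the more conservative "lambda.1se" choice.
b <- as.matrix(coef(cv_fit, s = "lambda.1se"))

# Predictors with non-zero coefficients are the ones the LASSO kept.
selected <- setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")
print(selected)
```

Using s = "lambda.min" instead usually keeps more variables. glmnet standardizes the predictors by default, which matters when descriptors are on very different scales.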
 
#4
I am not very enthusiastic about stepwise regression. I am also sceptical of that kind of rule of thumb.

But the LASSO seems to be much better, in my view.