Baseball regression analysis

#1
Good evening all,
I want to build a prediction equation for games won in a baseball season. I have 30 candidate variables and want to come up with a model.
I'm not sure how to start the analysis. Should I begin with stepwise regression, or should I first check each independent variable for multicollinearity and start deleting variables?
Any advice on how to start?

Thanks a lot
 

noetsi

Fortran must die
#2
First, forget stepwise. It's a bad idea.
Second, if you have a theory, put in variables based on that theory. If you don't, and assuming you have enough data for decent power, run all the variables. I would toss the ones that are not significant and run the model again without them.

Do the standard diagnostics: check normality of the residuals with a QQ plot, plot the residuals to look for heteroscedasticity and non-linearity, run the standard test for multicollinearity (VIF or tolerance), etc. If you find serious problems you will have to fix them and rerun the regression.
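
For the multicollinearity check, here is a rough sketch of how VIF can be computed by hand with NumPy. Each predictor is regressed on the others, and VIF = 1 / (1 - R^2). The data below is simulated and the names (runs, hits, era) are made up for illustration, not taken from the original poster's dataset.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (rows = observations)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress predictor j on all the other predictors plus an intercept.
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Simulated data (assumption): "hits" is built to be nearly collinear with "runs".
rng = np.random.default_rng(0)
n = 200
runs = rng.normal(700, 80, n)
hits = 2.0 * runs + rng.normal(0, 20, n)
era = rng.normal(4.0, 0.5, n)

X = np.column_stack([runs, hits, era])
vifs = vif(X)
for name, v in zip(["runs", "hits", "era"], vifs):
    print(f"{name}: VIF = {v:.1f}")
```

A common rule of thumb treats a VIF above roughly 10 as a sign of serious multicollinearity, and tolerance is just 1 / VIF. Here the two collinear columns should come out far above that and the independent one close to 1.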
 
#3
I agree with noetsi in some respects. Stepwise regression isn't necessarily bad. But when you build a model this way, the p-values are not what they used to be, because they do not account for the selection step. In other words, you shouldn't do inference on the same data once you have selected a model. A pairs plot of all the predictors against the response might help you spot, and get rid of, variables exhibiting curvature.
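
To illustrate the point about p-values after selection, here is a small simulation sketch. All 30 predictors are pure noise, yet the best-looking one is picked each time. The setup (30 observations, 30 predictors, 2000 repetitions) is an assumption chosen for illustration, and the critical correlation 0.361 is the rounded two-sided 5% cutoff for n = 30.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_pred, n_sims = 30, 30, 2000
r_crit = 0.361  # approx. two-sided 5% critical |r| for n = 30 (df = 28)

hits = 0
for _ in range(n_sims):
    # Response and predictors are independent noise: no real relationship.
    X = rng.normal(size=(n_obs, n_pred))
    y = rng.normal(size=n_obs)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson correlation of each predictor with the response.
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    # "Select" the predictor with the strongest correlation.
    if np.abs(r).max() > r_crit:
        hits += 1

false_positive_rate = hits / n_sims
print(f"Best noise predictor looks 'significant' in {false_positive_rate:.0%} of runs")
```

A single pre-chosen predictor would cross that threshold about 5% of the time. The selected one crosses it far more often, which is why the nominal p-value of a selected variable cannot be taken at face value.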
 

noetsi

Fortran must die
#4
A basic problem with stepwise is that coincidences or very small differences in the data can make a huge difference in your results. And if a variable you leave out of the model this way is nearly as important as one of the variables you include, and is correlated with it, you will bias the slope coefficients.
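
Here is a quick simulation sketch of that bias. Two correlated predictors both have a true slope of 1, and the model is fitted with and without the second one. The data is simulated and the numbers (correlation 0.8, 5000 observations) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Two predictors with correlation of about 0.8, both with true slope 1.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

def ols(columns, y):
    """Least-squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(y))] + columns)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

full = ols([x1, x2], y)   # both predictors included
reduced = ols([x1], y)    # x2 dropped, as a selection procedure might do

print(f"slope of x1, full model:    {full[1]:.2f}")
print(f"slope of x1, x2 left out:   {reduced[1]:.2f}")
```

With x2 left out, the slope on x1 absorbs part of the x2 effect and should land near 1 + 0.8 = 1.8 instead of 1.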