Regression strategy


Fortran must die
I should say ahead of time that in my field we almost never have developed theory to build on. So I have seen two strategies. First, throw out variables that are not statistically significant and rerun the model. Second, keep them in. Which is preferable?


Fortran must die
lol dason. What does it depend on?

Seriously, I am trying to improve my regression techniques, which have historically consisted of just putting one variable after another into a model as I guess they are important, until I end up with a final model. Of course I am not trying to build theory or, in the short term, to predict. I am trying to tell those I work for that if we do this, Y will be higher (aka better). I realize that I am doing things way too simply.

I also realize really smart people like dason don't think there is a way to get at the relative impact of variables, which is ultimately what I am interested in. :)


Active Member
I am familiar with the procedure by which there is one variable that really *needs* to be significant. Any and all means are taken to make that happen, including adding covariates, removing/adding subjects, etc. The process continues until the variable is significant, or a new project comes along.


Fortran must die
Content knowledge should always guide covariate selection.
Unfortunately, in my agency there is no content knowledge, or if there is, it is unknown to me. That is what I meant by having no theory to build on. The literature, which I spend a lot of time reading, is weak.


Fortran must die
Not in any formal sense in which empirical analysis is done. And even if they did, they don't talk to the analysts about such issues. They run the agency and work with outside individuals. Formal analysis is rarely done here, and when it is, it is primarily by me, mostly on financial models and satisfaction.

I am not certain whether they know this type of material or not. Their primary role is political and administrative, not analytical or process improvement. Like many public agencies, process improvement has traditionally not been a big part of what we do. We get the same amount of money regardless, so there is limited incentive to make such improvements. And the organization simply does not (in my observation) take much interest in statistical or financial analysis. They are smart, committed people, but there is simply little interest in such things. Or if there is, it does not get back to us.

I built a report today about trends and possible causes of economic goals we are supposed to maximize and sent it off. I got no feedback on it and likely won't.


Less is more. Stay pure. Stay poor.
But there should be general content knowledge. I work on many very diverse projects, and I can still manage to use intuition to keep ridiculous variables out of the model (unless they are a negative control), to exclude effects of the outcome or instruments, and to include relevant causes and confounders.

If the model is 100% prediction, things may change a little, but I don't run those models, since I want interpretability. You have worked at your job for a long time; I think you are undervaluing your knowledge.


TS Contributor
I would like to hear about sample sizes and number of predictors/covariates.
And whether there's a possibility of empirical validation of models, or at least cross-validation.

The main problem seems to be the significant/not-significant dichotomization, which was not developed for, and is not useful in, model building. But admittedly, I am not experienced enough in Bayesian techniques to give a clear recommendation. Have you ever considered regularization techniques like the LASSO (if the number of predictors is large)?
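To illustrate the idea: the LASSO penalizes the absolute size of the coefficients, which can shrink the coefficients of uninformative predictors exactly to zero, so variable selection falls out of the fit rather than out of p-values. Here is a minimal, hand-rolled coordinate-descent sketch in Python on made-up toy data (purely illustrative; in practice you would use a packaged implementation, e.g. in SAS or R, rather than this):

```python
import random

def soft_threshold(rho, lam):
    # Soft-thresholding operator: shrinks rho toward zero and returns
    # exactly zero when |rho| <= lam -- this is what drops variables.
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso(X, y, lam, n_iter=200):
    # Coordinate descent for (1/2)||y - Xw||^2 + lam * ||w||_1
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the partial residual
            # (the residual with feature j's contribution added back).
            rho = sum(
                X[i][j] * (y[i] - sum(w[k] * X[i][k] for k in range(p))
                           + w[j] * X[i][j])
                for i in range(n)
            )
            z = sum(X[i][j] ** 2 for i in range(n))
            w[j] = soft_threshold(rho, lam) / z
    return w

# Toy data: y depends on x1 and x2; x3 is pure noise.
random.seed(1)
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [2.0 * a + 1.0 * b + random.gauss(0, 0.5) for a, b, c in X]

w = lasso(X, y, lam=30.0)
print(w)  # x3's coefficient is typically driven to (or very near) zero
```

Note the trade-off visible in the output: the real coefficients are shrunk somewhat below their true values, which is the price paid for zeroing out the noise variable.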

With kind regards



Fortran must die
We work with whole populations not samples (I have access to the population data). There are normally thousands or tens of thousands of cases.

I could validate the model, though I am not sure how, really. I use hold-out data sets, but only for time series.
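For what it's worth, a chronological hold-out is straightforward to sketch: fit on the earlier periods, then check the error on the later ones. An illustrative Python example with made-up trend data (not the agency's data; the key point is that time-ordered data is split by time, never shuffled):

```python
import random
from statistics import mean

# Toy "chronological" data: 100 periods, outcome drifts upward over time.
random.seed(2)
data = [(t, 0.5 * t + random.gauss(0, 5)) for t in range(100)]

# Chronological hold-out: train on the first 80% of periods,
# validate on the last 20% -- do not shuffle before splitting.
split = int(len(data) * 0.8)
train, hold = data[:split], data[split:]

# Simple least-squares slope/intercept on the training window.
xs, ys = zip(*train)
xbar, ybar = mean(xs), mean(ys)
slope = sum((x - xbar) * (y - ybar) for x, y in train) / \
        sum((x - xbar) ** 2 for x in xs)
intercept = ybar - slope * xbar

# Out-of-sample error on the held-out (later) periods.
mse = mean((y - (intercept + slope * x)) ** 2 for x, y in hold)
print(f"slope={slope:.2f}, hold-out MSE={mse:.1f}")
```

If the hold-out error is much worse than the in-sample error, the model is overfitting the earlier periods (or the relationship is drifting over time).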

LASSO is something I should consider. I looked into it a year or so ago, and it exists in SAS, but I have never actually used it.

My original question, and this was not clear, is about the validity of running a model and then throwing out variables that are not statistically significant. Based on Karabiner's comments, I realize this may make no sense because I have the population (although I guess you could argue that effects vary over time, so what I really have is a chronological subsample, something I have never seen addressed).
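A quick simulation shows why significance-based pruning answers the wrong question at these sample sizes: with thousands or tens of thousands of cases, even a practically trivial effect comes out "statistically significant", so keeping or dropping by p-value says little about relative impact. An illustrative Python sketch with made-up numbers (a large-n normal approximation to the t-test, which is fine at n = 10,000):

```python
import random
from statistics import NormalDist, mean

random.seed(3)
n = 10_000
x = [random.gauss(0, 1) for _ in range(n)]
# Tiny true effect: 0.05, practically negligible next to noise with sd 1.
y = [0.05 * xi + random.gauss(0, 1) for xi in x]

# Simple-regression slope and its standard error.
xbar, ybar = mean(x), mean(y)
sxx = sum((xi - xbar) ** 2 for xi in x)
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
resid_var = mean((yi - ybar - slope * (xi - xbar)) ** 2
                 for xi, yi in zip(x, y))
se = (resid_var / sxx) ** 0.5

# Two-sided p-value via the normal approximation.
z = slope / se
p = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"slope={slope:.3f}, p={p:.2g}")  # highly significant, yet tiny
```

The variable survives any p-value filter, but its practical contribution to Y is minimal, which is why effect sizes (and penalized methods like the LASSO) are a better guide than the significant/not-significant cut.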

Hlsmith, although I have worked here a long time, I have never been a counselor and am not an SME in what we do. I have little first-hand experience in it. I largely do data queries, and sometimes analysis. I have spent a lot of time reading the literature, which is what I base my analysis on.