Is it appropriate to include intermediate outcomes in a predictive model?

It is quite clear that one should not control for post-treatment variables / intermediate outcomes when the goal is causal inference, but I'm not sure whether the same advice holds when the goal is to build a model for prediction.

Context for my question: I'm trying to build a model that predicts whether a college student will earn a bachelor's degree within 6 years of high school graduation, using a large observational data set. I have data on students' high school variables (HS GPA, test scores, number of activities participated in, etc.), some data on the students' college experiences (delayed enrollment in college, full-time / part-time status, transferred within two years of enrolling), as well as data on the characteristics of the college they attend (public/private, enrollment size, funding, etc.). In other words, I have student-level and institution-level data. I would account for the nesting of students within institutions.

Problem: Some have told me that the college information I have is an intermediate outcome and I shouldn't include it in the model. It isn't clear to me if I should / could include the college experience variables (which could be considered intermediate outcomes) in the predictive model, and if I do include them, how they should be treated.

I have comparable and consistent data from three cohorts of students spanning the 1980s, 1990s, and 2000s. My goal is to see whether the predictive ability of the model/variables has changed across cohorts. Given that these are observational data, I think the parameter estimates should only be interpreted as the predicted difference in the response between two individuals who differ by one unit on the regressor in question and have the same values on all other regressors. I am not making inferences about changes or causality. Thank you.
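
To make the comparison concrete, here is a minimal sketch on simulated data (not the actual data set). It fits one model with a pre-college variable only and one that adds a college-experience variable, then compares held-out AUC within each cohort. Everything in it is an assumption for illustration: the names `hs_gpa` and `full_time`, the simulated effect sizes, the use of AUC as the measure of predictive ability, and the plain gradient-descent logistic regression standing in for whatever model would actually be used. It also ignores the nesting of students within institutions.

```python
import math
import random

def simulate_cohort(n, seed):
    """Simulated students as (hs_gpa, full_time, graduated). Illustrative only."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        hs_gpa = rng.gauss(0.0, 1.0)  # standardized pre-college variable
        # College-experience variable that partly depends on HS GPA,
        # i.e. it sits between the pre-college variable and the outcome.
        p_ft = 1 / (1 + math.exp(-(0.5 + 0.8 * hs_gpa)))
        full_time = 1.0 if rng.random() < p_ft else 0.0
        p_grad = 1 / (1 + math.exp(-(-0.5 + 0.7 * hs_gpa + 1.5 * full_time)))
        graduated = 1 if rng.random() < p_grad else 0
        rows.append((hs_gpa, full_time, graduated))
    return rows

def fit_logistic(X, y, lr=0.1, epochs=300):
    """Batch gradient descent for logistic regression; intercept is w[0]."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = 1 / (1 + math.exp(-z)) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w

def predict(w, X):
    """Predicted graduation probabilities for each row of X."""
    return [1 / (1 + math.exp(-(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))))
            for xi in X]

def auc(scores, labels):
    """Share of (graduate, non-graduate) pairs ranked correctly; ties count half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def holdout_auc(rows, cols):
    """Fit on the first half of a cohort, score the second half."""
    half = len(rows) // 2
    train, test = rows[:half], rows[half:]
    w = fit_logistic([[r[c] for c in cols] for r in train], [r[2] for r in train])
    return auc(predict(w, [[r[c] for c in cols] for r in test]), [r[2] for r in test])

# Cohort labels are placeholders; all three use the same simulated process.
results = {}
for seed, cohort in enumerate(["1980s", "1990s", "2000s"]):
    rows = simulate_cohort(600, seed)
    results[cohort] = (holdout_auc(rows, [0]), holdout_auc(rows, [0, 1]))
    print(cohort,
          "pre-college only: %.3f" % results[cohort][0],
          "with college variable: %.3f" % results[cohort][1])
```

The two AUC values per cohort are the quantities I would compare: the first across cohorts to see whether pre-college variables predict equally well over time, and the gap between the two to see how much the college-experience variables add.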

Any thoughts/feedback/advice is appreciated.