I'm working on a classification problem in which all predictors are categorical. In some cases it might make sense to combine categories within a predictor to reduce dimensionality.

But I read in an article that "it is important not to look at the outcome variable when working out which categories should be merged". I think the argument is that we run the risk of overfitting the model.

However, how the hell do we engineer useful features if we don't look at the distribution of the outcome? To me this is like producing a histogram of a feature for the two outcome groups and saying "Nah, this isn't helpful", even though there's an obvious difference. Thoughts?
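To make the overfitting concern concrete, here's a minimal sketch with purely synthetic data (all names and numbers are my own illustration, not from the article): the outcome is independent of the predictor, yet merging categories by their observed outcome rate still manufactures an apparently predictive feature in-sample, and the effect vanishes on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 30                         # 200 rows, 30 categories (illustrative sizes)
cat = rng.integers(0, k, size=n)       # categorical predictor
y = rng.integers(0, 2, size=n)         # binary outcome, independent of cat by construction

# "Peeking" merge: collapse categories into two groups by their observed outcome rate
rates = np.array([y[cat == c].mean() if (cat == c).any() else 0.5 for c in range(k)])
group_of = (rates > 0.5).astype(int)   # category -> merged group, chosen using y
merged = group_of[cat]

# Majority-class prediction from the merged feature looks good in-sample...
majority = np.array([y[merged == g].mean() > 0.5 for g in (0, 1)]).astype(int)
acc_train = (majority[merged] == y).mean()

# ...but on fresh data from the same (pure-noise) process it collapses toward 0.5
cat_new = rng.integers(0, k, size=n)
y_new = rng.integers(0, 2, size=n)
acc_test = (majority[group_of[cat_new]] == y_new).mean()

print(f"in-sample accuracy: {acc_train:.2f}, fresh-data accuracy: {acc_test:.2f}")
```

The usual way to keep the benefit of outcome-guided merging while controlling this optimism is to do the merging inside each training fold of a cross-validation loop (or on a held-out portion of the data), so the merged categories are never evaluated on the same rows that chose them.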