Categorical Predictor Variables

#1
I'm working on a classification problem in which all predictors are categorical. In some cases, it might make sense to combine categories within a predictor and reduce dimensionality. But I read in an article that "it is important not to look at the outcome variable when working out which categories should be merged". I think the argument is that we run the risk of overfitting the model. However, how the hell do we engineer useful features if we don't look at the distribution of the outcome? To me, that's like plotting a histogram of the outcome for two groups, seeing an obvious difference, and saying, "Nah, this isn't helpful." Thoughts?
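
To make concrete what I mean by "looking at the outcome", here's a rough toy sketch (my own example, not from the article; the column names `cat` and `outcome` are made up) of merging the levels of a categorical predictor according to their observed outcome rate:

```python
import pandas as pd

# Outcome rate per category, computed on the whole dataset `df`.
rates = df.groupby("cat")["outcome"].mean()
overall = df["outcome"].mean()

# Toy merge rule: collapse levels into two groups, above vs. below the overall rate.
mapping = (rates >= overall).map({True: "high", False: "low"})
df["cat_merged"] = df["cat"].map(mapping)

# This merge rule was chosen by peeking at the outcome on the same data the
# model will later be fit to, which I take to be the overfitting risk the article flags.
```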
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
First, why do you think you need fewer features?

I haven't done principal component analysis before, just LASSO, but the idea seems interesting. I would think the best practice would be, if there were enough data, to create data splits (training, validation, and test).
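
A minimal sketch of that three-way split with scikit-learn (assuming a DataFrame `df` with the outcome in a column I'm calling `outcome`):

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns="outcome")
y = df["outcome"]

# Hold out a test set first, then split the remainder into train/validation (60/20/20).
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0
)

# Any outcome-informed merging of categories would then be worked out on the
# training set only and applied unchanged to the validation and test sets.
```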

Can you post the link to the article?
 
#3
Here's the article: https://www.displayr.com/feature-engineering-for-categorical-variables/

It's not so much that I need fewer features or categories; it's how to do feature selection or feature engineering when all I have is categorical data. The model is only as good as the data we feed it, but I imagine binning the categories in such a way that the model learns better. For instance, say state is a predictor variable, and Ohio, Michigan, and Wisconsin are the only states that pop out in LASSO. Can I keep Ohio, Michigan, and Wisconsin as their own levels and bin all the rest as "other", then use these new categories in subsequent models to get better performance?
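
Something like this is what I have in mind, as a rough sketch in pandas (the column name `state` and the one-hot encoding step are my own assumptions):

```python
import pandas as pd

# Keep the states LASSO flagged; lump everything else into "other".
keep = {"Ohio", "Michigan", "Wisconsin"}
df["state_binned"] = df["state"].where(df["state"].isin(keep), other="other")

# One-hot encode the new four-level factor for the subsequent models.
X_state = pd.get_dummies(df["state_binned"], prefix="state", drop_first=True)
```

Though I suppose, since LASSO already looked at the outcome to pick those three states, the binning should really be decided on the training split and then judged on held-out data, which loops back to the article's warning.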