Many predictors relative to cases, want to identify interesting ones

I've been helping a colleague with his research project on the trustworthiness of profile texts. He is analyzing text entries using the LIWC tool, which generates around 90 dimensions from the text input (positive/negative affect, pronoun use, etc.), most of them at the ratio or interval level. We might not use all of them, but we will probably want to use 30-70 or so.

In this exploratory research, he wants to identify important predictors of perceived trustworthiness in profile text.

Due to sampling constraints, we are likely to get no more than a few hundred (150-400) cases. These are human ratings of trustworthiness. Thus, we have a large number of predictors relative to cases (somewhere between 1:3 and 1:10), which might be a problem. It's exploratory research, so I don't think statistical tests make a lot of sense, but I do want to avoid spurious results as far as possible.

Now, my question is this: what kind of approach would be most useful for this problem? Preferably something not overly complex, as neither of us is a statistician. Ideally it should be doable in Stata, because that's what we're working with.

My current thinking is something like this:

1. Investigate and report bivariate correlations for all predictors with the outcome
2. Then build a linear regression model with some kind of variable selection, for instance forward selection, backward elimination, or LASSO (the last one might be a bit too complex).
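
To make step 1 concrete, here is a minimal sketch of ranking predictors by their bivariate correlation with the outcome. It is written in Python purely for illustration (not Stata), and everything in it is an assumption: the data are simulated, and the names `dim0`, `dim1`, ... and `trust` are made-up stand-ins for LIWC dimensions and the trustworthiness rating.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Simulated stand-in data: 40 hypothetical predictors, 200 cases.
random.seed(1)
n_cases, n_predictors = 200, 40
predictors = {
    f"dim{j}": [random.gauss(0, 1) for _ in range(n_cases)]
    for j in range(n_predictors)
}
# Assumption for the demo: only dim0 is truly related to the outcome.
trust = [0.8 * predictors["dim0"][i] + random.gauss(0, 1) for i in range(n_cases)]

# Rank all predictors by the absolute size of their correlation with the outcome.
ranked = sorted(
    ((name, pearson(col, trust)) for name, col in predictors.items()),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
for name, r in ranked[:5]:
    print(f"{name}: r = {r:+.2f}")
```

Apart from `dim0`, every predictor in this simulation is pure noise, so whatever correlations the noise predictors show give a feel for how large a correlation can get by chance alone at this sample size.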

Does that seem at all workable? I'm a bit worried about forward selection/backward elimination, since from what I've read they don't produce very stable results.

Any very different ideas would be welcome too. I greatly appreciate any input!

TS Contributor
I would try regression trees and/or principal components. A regression tree is pretty good at handling many predictors with a relatively small number of measurements; PCA would reduce the number of dimensions if you are lucky.

Thanks for your reply! Those are interesting suggestions. Not sure if PCA is a good option because from what I've seen the results tend to be quite hard to interpret, whereas we want something that is meaningful to us in the real world.

I thought about regression trees or Random Forest, since I've worked with them and they seem to deliver good results in cases like this. But I think for him this would be a bit too exotic. Probably best to stick to some kind of linear regression. But the more I read, the more things like forward selection seem like a misguided idea. Maybe LASSO would work (if I can convince him :) ). But preferably I would like something better than forward selection but similar in simplicity.
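
For what it's worth, the core of LASSO is simpler than it sounds. Below is a rough sketch in plain Python (for illustration only, not Stata) of LASSO fitted by coordinate descent on simulated data. The assumptions are mine: predictors are standardized, the outcome is centered, the penalty `lam` is fixed by hand rather than chosen by cross-validation, and only the first two of 20 made-up predictors truly affect the outcome.

```python
import random

def standardize(col):
    """Rescale a column to mean 0 and (population) standard deviation 1."""
    n = len(col)
    mean = sum(col) / n
    sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
    return [(v - mean) / sd for v in col]

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso(columns, y, lam, n_iter=200):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.

    columns: list of standardized predictor columns; y: centered outcome.
    """
    n, p = len(y), len(columns)
    beta = [0.0] * p
    resid = list(y)  # residuals for the current coefficients (all zero at start)
    for _ in range(n_iter):
        for j in range(p):
            xj = columns[j]
            # Covariance of predictor j with the residual that excludes its own fit.
            rho = sum(xj[i] * resid[i] for i in range(n)) / n + beta[j]
            new = soft_threshold(rho, lam)
            if new != beta[j]:
                delta = new - beta[j]
                for i in range(n):
                    resid[i] -= delta * xj[i]
                beta[j] = new
    return beta

# Simulated stand-in data: 20 predictors, 200 cases, two real effects.
random.seed(0)
n, p = 200, 20
raw = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y_raw = [2.0 * raw[0][i] - 1.5 * raw[1][i] + random.gauss(0, 1) for i in range(n)]

X = [standardize(col) for col in raw]
y_mean = sum(y_raw) / n
y = [v - y_mean for v in y_raw]

beta = lasso(X, y, lam=0.4)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected predictors:", selected)
```

The soft-thresholding step is what does the variable selection: a predictor whose association with the current residual is weaker than `lam` gets a coefficient of exactly zero. In real use the penalty would normally be chosen by cross-validation rather than set by hand.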