On assessing predictor importance in a small sample

#1
Good evening.
My question concerns the "best" and correct way of finding out which predictors are most useful for determining an outcome of interest.

I currently have a very small sample (45 patients), 20 continuous variables describing characteristics of the patients' hearts, and a three-level categorical variable indicating which type of condition each patient has (control, condition A, condition B). The dataset is balanced, with 15 patients per condition.


Outcome | P1  | P2  | P3  | ... | P20
--------|-----|-----|-----|-----|-----
control | 2.5 | 0.7 | 1.1 | ... | 3.5
control | 1.5 | 1.2 | 9.2 | ... | 5
cond. A | 5.5 | 2.3 | 8.2 | ... | 1.2
cond. A | 6.5 | 3.6 | 0.2 | ... | 3.1
control | 2.5 | 1.1 | 2.3 | ... | 0.05
cond. B | 3.5 | 9.8 | 3.5 | ... | 0.7

I want to determine which of these 20 variables help most in determining the condition of interest.

What I did was (a rough sketch of this screening follows below):
- Perform a one-way ANOVA per variable to determine whether a statistically significant difference exists between the means of the three groups. In other words, I ran ANOVA 20 times (once per variable), then corrected the p-values with Benjamini-Hochberg for multiple testing. This procedure reduced the set of variables from 20 to 10.
- Perform a post-hoc analysis (Tukey's HSD) for each of these 10 variables, with multiple-comparison correction, to find out which groups differ. As a side note, I'm only interested in the differences between control and condition A and between control and condition B; I'm not interested in differences between condition A and condition B.
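
For reference, a minimal R sketch of this screening step, assuming a data frame dat with a factor column Outcome and numeric predictor columns P1 through P20 (hypothetical names; the 0.05 cut-off is just the usual default):

    predictors <- paste0("P", 1:20)

    # One one-way ANOVA per predictor, keeping the overall F-test p-value
    pvals <- sapply(predictors, function(p) {
      fit <- aov(reformulate("Outcome", response = p), data = dat)
      summary(fit)[[1]][["Pr(>F)"]][1]
    })

    # Benjamini-Hochberg correction across the 20 tests
    padj <- p.adjust(pvals, method = "BH")
    keep <- predictors[padj < 0.05]

    # Tukey's HSD on the retained variables to see which pairs of groups differ
    tukey <- lapply(keep, function(p)
      TukeyHSD(aov(reformulate("Outcome", response = p), data = dat)))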

Let's suppose we focus on control vs condition A and adopt a one-vs-rest approach, where I treat the 15 condition-A patients as cases and the other 30 patients as controls.

This is my guess (a small sketch follows after this list):
  • Take a look at the correlation between each independent variable (IV) and the dependent variable.
  • Consider the subset of IVs with the highest correlations and build an additive logistic regression model (the following is an R snippet using a generalized linear model with a logit link):
    model <- glm(outcome ~ IV1 + IV2 + ... + IVn, data = data, family = binomial)
  • Take a look at the model's coefficients and keep the statistically significant ones.
Do note that I would repeat the procedure for control vs condition B.
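
A hedged sketch of that idea, again assuming a data frame dat, this time with a 0/1 column case (condition A vs rest) and predictors P1 through P20 (hypothetical names; keeping the top 3 is an arbitrary choice for illustration):

    predictors <- paste0("P", 1:20)

    # Point-biserial correlation of each predictor with the binary outcome
    cors <- sapply(dat[predictors], function(x) cor(x, dat$case))
    top  <- names(sort(abs(cors), decreasing = TRUE))[1:3]  # keep the top 3, say

    # Additive logistic regression on the retained predictors
    model <- glm(reformulate(top, response = "case"), data = dat, family = binomial)
    summary(model)  # inspect coefficients and their p-values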

I'm worried about the following points:
  • As far as I understand it, correlation helps me determine whether a linear relationship exists between two variables, along with its magnitude and direction. By basing my exploratory analysis on correlation, I risk discarding a variable that shows no correlation with the dependent variable simply because their relationship is not linear. Is that correct?
  • By building an additive model, I'm not exploring any interactions between the variables, and hence may be losing important information. Is this correct?
  • Given that my sample is so small, I don't think I can build a model with many variables, because the risk of overfitting is very high. I have read that, as a rule of thumb, I should add at most one variable to my regression model for every 10 data points (see the back-of-envelope calculation after this list).
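(Under that rule of thumb: 45 patients would support roughly 45/10 ≈ 4 predictors in total, and only 15/10 ≈ 1 if the count is taken against the 15 cases in the smallest group.)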
How would you go about exploring the "importance" of each variable in this dataset?

Thank you very much!!
Francesco
 
#2
Please close this thread, as I will reopen it in the "Statistical Research" forum since it has more to do with methodology than basic statistics.
Thank you
 

hlsmith

Less is more. Stay pure. Stay poor.
#4
The rule of thumb is that you have 10-20 subjects per predictor in your smallest outcome group. Frank Harrell has shown that this rule is not adequate and that you actually need even more subjects. Setting that aside, you have enough data to support a model with one predictor. But you have three outcome groups, meaning that if you want to model them you risk false discovery when fitting two logistic regressions. The typical remedy would be to correct your alpha. Lowering your alpha, though, will then reduce your power, meaning you need even more data to support even a single predictor.
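(For example, a simple Bonferroni correction over the two planned models would give alpha = 0.05/2 = 0.025 per model; that is one common choice, not the only one.)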

Take-home message: you don't have enough data for inference.
 
#5
That was my suspicion. Thank you very much.
This whole problem started when my supervisor asked me to determine whether using two or three predictors together yielded better predictive performance than using a single predictor.
Another request was to perform a ROC analysis to support that thesis. I computed the AUC and its 95% CI both for a model built with a single predictor and for one built with two predictors together; the latter AUC was greater than the former (although an unpaired t-test showed no significant difference).
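(For reference, a paired DeLong test is usually preferred over an unpaired t-test for comparing two AUCs computed on the same patients. A minimal sketch with the pROC package, assuming a data frame dat with a binary outcome and hypothetical predictors P1 and P2:)

    library(pROC)

    m1 <- glm(outcome ~ P1,      data = dat, family = binomial)
    m2 <- glm(outcome ~ P1 + P2, data = dat, family = binomial)

    roc1 <- roc(dat$outcome, fitted(m1))
    roc2 <- roc(dat$outcome, fitted(m2))

    ci.auc(roc1)                              # 95% CI for each AUC
    ci.auc(roc2)
    roc.test(roc1, roc2, method = "delong")   # paired test on the same cases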
To summarise: with the current data I won't be able to do any of that, am I correct?

What type of analysis can be sustained by my limited data? I ask just so that I can propose some alternatives to my supervisor.

Thanks for your help.
 

hlsmith

Less is more. Stay pure. Stay poor.
#6
You can likely fit a model with just a single predictor, chosen solely on background contextual knowledge, while staying cognizant that you have no external held-out data on which to examine its generalizability (a minimal sketch follows below). To be cautious, I would also use a smaller alpha level. Tell your boss you need a lot more data in order to do anything more!
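
A minimal sketch of that cautious approach, assuming a data frame dat with a binary outcome and a single pre-specified predictor P1 (hypothetical name), judged against a stricter alpha such as 0.01:

    m <- glm(outcome ~ P1, data = dat, family = binomial)
    summary(m)                        # compare the Wald p-value to alpha = 0.01
    confint.default(m, level = 0.99)  # Wald 99% CI for the coefficients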