Good evening.
My question regards the correct way of finding out which predictors are "best" for determining an outcome of interest.
I currently have a very small sample (45 patients), 20 continuous variables describing characteristics of the patients' hearts, and a three-level categorical variable indicating which condition each patient has (control, condition A, condition B). The dataset is balanced, with 15 patients per condition.
Outcome | P1  | P2  | P3  | ... | P20
--------|-----|-----|-----|-----|-----
control | 2.5 | 0.7 | 1.1 | ... | 3.5
control | 1.5 | 1.2 | 9.2 | ... | 5
cond. A | 5.5 | 2.3 | 8.2 | ... | 1.2
cond. A | 6.5 | 3.6 | 0.2 | ... | 3.1
control | 2.5 | 1.1 | 2.3 | ... | 0.05
cond. B | 3.5 | 9.8 | 3.5 | ... | 0.7
I want to determine which of these 20 variables help me most in determining the condition of interest.
What I did was:
- perform a one-way ANOVA for each variable to determine whether a statistically significant difference exists between the means of the three groups. In other words, I ran ANOVA 20 times (once per variable), then corrected the p-values with Benjamini-Hochberg for multiple testing. This procedure reduced the set of variables from 20 to 10.
- perform a post-hoc analysis (Tukey's HSD), with its built-in multiple-comparison correction, for each of these 10 variables to find out which groups differ (a sketch of both steps follows this list). As a side note, I'm only interested in the differences between Control and Condition A and between Control and Condition B; I'm not interested in differences between Condition A and Condition B.
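For concreteness, a minimal R sketch of these two steps; the names `data`, `outcome` (a factor), and `P1`..`P20` are assumptions matching the table above:

    # Candidate predictor names
    predictors <- paste0("P", 1:20)

    # One-way ANOVA per predictor, collecting the overall p-value of the group effect
    pvals <- sapply(predictors, function(p) {
      fit <- aov(reformulate("outcome", response = p), data = data)
      summary(fit)[[1]][["Pr(>F)"]][1]
    })

    # Benjamini-Hochberg correction across the 20 tests
    padj <- p.adjust(pvals, method = "BH")
    kept <- predictors[padj < 0.05]

    # Tukey's HSD for each retained predictor
    tukey <- lapply(kept, function(p) {
      TukeyHSD(aov(reformulate("outcome", response = p), data = data))
    })
    names(tukey) <- kept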
Let's suppose we focus on control vs. condition A and adopt a one-vs-rest approach, where I consider the 15 patients with condition A as cases and the other 30 patients as controls.
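In R, this recoding might look like the following one-liner (assuming the outcome column is called `outcome` and condition A is labelled "cond. A", as in the table above):

    # One-vs-rest coding: condition A (1) vs. everyone else (0)
    data$case <- as.integer(data$outcome == "cond. A")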
This is my guess:
- take a look at the correlation between each independent variable (IV) and the dependent variable.
- consider the subset of IVs with the highest correlation and build an additive logistic regression model (the following R snippet uses a generalized linear model with a logit link):

    model <- glm(outcome ~ IV1 + IV2 + ... + IVn, data = data, family = binomial)

- take a look at the model's coefficients and keep the statistically significant ones (an expanded sketch follows this list).
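A runnable version of this guess might look like the sketch below, under the same assumptions as above (`case` is the 0/1 outcome created earlier; keeping the top three predictors is an arbitrary placeholder, not a recommendation):

    # Pearson (point-biserial) correlation of each predictor with the 0/1 outcome
    predictors <- paste0("P", 1:20)
    cors <- sapply(predictors, function(p) cor(data[[p]], data$case))

    # Keep, say, the three predictors with the largest absolute correlation
    top <- names(sort(abs(cors), decreasing = TRUE))[1:3]

    # Additive logistic regression on the screened subset
    model <- glm(reformulate(top, response = "case"),
                 data = data, family = binomial)
    summary(model)  # inspect the coefficients and their p-values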
I'm worried about the following points:
- as far as I understand it, correlation helps me determine whether a linear relationship exists between two variables, along with its magnitude and direction. By basing my exploratory analysis on correlation, there's a risk I discard a variable that doesn't show correlation with the dependent variable simply because their relationship is not linear (a small example follows this list). Is this correct?
- By building an additive model, I'm not exploring any interactions between the variables, hence potentially losing important information. Is this correct?
- Given my sample is so small, I don't think I will be able to build a model with so many variables, because the risk of overfitting is very high. I was reading that, as a rule of thumb, I should include at most one predictor for every 10 data points (for logistic regression, the rule is often stated as 10 events in the smaller outcome class per predictor).
Thank you very much!!
Francesco