Here is some data to play with

We are interested in the color `red` in particular. We would like to understand which explanatory variables (which among `x1`,`x2`,`x3` and `x4`) affects the probability `red` "differentially that it affects the probability of outcome `blue` or `yellow`". The question might be a little unclear but I hope the following will make my point very obvious.

Below graphs show 95% bootstrap confidence intervals.

"red" seem to differ from the two other groups only in respect to `x4`. I would be tempted to group `yellow` and `blue` into a single group called `NotRed` as shown below.

Because we sampled way more `yellow` than `blue`, it seems that `Red` differ from the other colors for `x2` as well as `x4` but this is misleading. I would rather not consider `x2` as having an effect of interest when I see the first series of plots as I am looking for variables that make `red` different than both other groups.

Let's make a test with `color` as response variable.

Would it be wise to consider only the highest p.value for each explanatory variable to decide (relative to a threshold) whether a variable affects the probability of the outcome `red` differentially than both `blue` and `yellow`?

How can I go about testing what variables affect the probability of the outcome `red` "differentially than both `blue` and `yellow`"?

Similarly, when performing an LDA, I would have wished `x4` to pop up as important driver of the first axis but this is not made obvious due to `x2` again.

Code:

```
set.seed(1e4)
color = factor(c(rep("blue",20), rep("red",20), rep("yellow",150)), levels=c("red","blue","yellow"))
x1 = rnorm(20+20+150,12,3)
x2 = c(rnorm(20,5), rnorm(20,5), rnorm(150,7))
x3 = c(rnorm(20,7), rnorm(20,5), rnorm(150,5))
x4 = c(rnorm(20,5), rnorm(20,7), rnorm(150,5))
RedOrNot = ifelse(color=="red","Red","NotRed")
df = data.frame(
color = color,
RedOrNot = RedOrNot,
x1 = x1,
x2 = x2,
x3 = x3,
x4 = x4
)
```

We are interested in the color `red` in particular. We would like to understand which explanatory variables (which among `x1`,`x2`,`x3` and `x4`) affects the probability `red` "differentially that it affects the probability of outcome `blue` or `yellow`". The question might be a little unclear but I hope the following will make my point very obvious.

Below graphs show 95% bootstrap confidence intervals.

Code:

```
p1 = ggplot(df,aes(x=color,y=x1)) + stat_summary(fun.data=mean_cl_boot) + coord_flip()
p2 = ggplot(df,aes(x=color,y=x2)) + stat_summary(fun.data=mean_cl_boot) + coord_flip()
p3 = ggplot(df,aes(x=color,y=x3)) + stat_summary(fun.data=mean_cl_boot) + coord_flip()
p4 = ggplot(df,aes(x=color,y=x4)) + stat_summary(fun.data=mean_cl_boot) + coord_flip()
multiplot(p1,p2,p3,p4,layout=matrix(1:4,ncol=2))
```

"red" seem to differ from the two other groups only in respect to `x4`. I would be tempted to group `yellow` and `blue` into a single group called `NotRed` as shown below.

Code:

```
p1 = ggplot(df,aes(x=RedOrNot,y=x1)) + stat_summary(fun.data=mean_cl_boot) + coord_flip()
p2 = ggplot(df,aes(x=RedOrNot,y=x2)) + stat_summary(fun.data=mean_cl_boot) + coord_flip()
p3 = ggplot(df,aes(x=RedOrNot,y=x3)) + stat_summary(fun.data=mean_cl_boot) + coord_flip()
p4 = ggplot(df,aes(x=RedOrNot,y=x4)) + stat_summary(fun.data=mean_cl_boot) + coord_flip()
multiplot(p1,p2,p3,p4,layout=matrix(1:4,ncol=2))
```

Because we sampled way more `yellow` than `blue`, it seems that `Red` differ from the other colors for `x2` as well as `x4` but this is misleading. I would rather not consider `x2` as having an effect of interest when I see the first series of plots as I am looking for variables that make `red` different than both other groups.

Let's make a test with `color` as response variable.

Code:

```
k = length(levels(df$color))
I = diag(k-1)
J = matrix(rep(1, (k-1)^2), c(k-1, k-1))
priors = list(R = list(fix=1, V=(1/k) * (I + J), n = k - 1))
m = MCMCglmm(color ~ -1 + trait:(x1) + trait:(x2) + trait:(x3) + trait:(x4) , rcov = ~ us(trait):units, data = df, family = "categorical", prior = priors, verbose = FALSE, burnin = 10000, nitt = 60000, thin = 50)
summary(m)
post.mean l-95% CI u-95% CI eff.samp pMCMC
traitcolor.blue:x1 -0.30261 -0.60335 0.02378 72.04 0.048 *
traitcolor.yellow:x1 -0.02728 -0.24525 0.19561 129.73 0.832
traitcolor.blue:x2 -0.60401 -1.20394 0.04258 32.72 0.118
traitcolor.yellow:x2 2.05721 1.35060 2.94990 26.03 <0.001 ***
traitcolor.blue:x3 2.27062 1.36630 3.27224 30.57 <0.001 ***
traitcolor.yellow:x3 -0.30359 -0.97326 0.27461 86.95 0.306
traitcolor.blue:x4 -1.28126 -2.11725 -0.55014 41.71 <0.001 ***
traitcolor.yellow:x4 -1.41233 -2.05106 -0.80523 68.91 <0.001 ***
```

Would it be wise to consider only the highest p.value for each explanatory variable to decide (relative to a threshold) whether a variable affects the probability of the outcome `red` differentially than both `blue` and `yellow`?

How can I go about testing what variables affect the probability of the outcome `red` "differentially than both `blue` and `yellow`"?

Similarly, when performing an LDA, I would have wished `x4` to pop up as important driver of the first axis but this is not made obvious due to `x2` again.

Code:

```
lda(data=df,RedOrNot~x1+x2+x3+x4)$scaling
LD1
x1 0.004390466
x2 -0.571945258
x3 -0.241371706
x4 0.784905799
lda(data=df,color~x1+x2+x3+x4)$scaling
LD1 LD2
x1 -0.0361047 0.03607743
x2 -0.8110918 -0.09303029
x3 0.4472190 -0.69882190
x4 0.3820449 0.73839111
```

Last edited: