There are only 20 different mutation combinations among the 5 sites.

Each mutation site has only two possible amino acid values, except for one site that has three.

We also have 4 categorical environmental variables (geography, climate, etc.) with 3 or 4 possible values each.

In short, for each species we have 5 mutation variables ("dependent") and 4 environmental variables ("independent"), and we would like to explain the dependent variables in terms of the independent ones. In other words, how well does the environment explain the mutations? We joined all the mutations into a single variable, called mutation type, which takes the 20 values.
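To make the combined variable concrete, here is a minimal sketch of how the per-site values could be collapsed into one mutation type label per species. The species names, amino acid letters, and the function `assign_mutation_types` are all hypothetical, invented only for illustration; the real data would have 20 distinct combinations.

```python
# Hypothetical records: one tuple of amino acids per species, one entry per site.
species_sites = {
    "sp_A": ("A", "G", "L", "K", "S"),
    "sp_B": ("A", "G", "L", "K", "T"),
    "sp_C": ("V", "G", "L", "K", "S"),
    "sp_D": ("A", "G", "L", "K", "S"),
}


def assign_mutation_types(sites_by_species):
    """Give each distinct combination of site values an integer label (1, 2, ...)."""
    type_of_combo = {}  # combination of site values -> mutation type label
    labels = {}         # species name -> mutation type label
    for name, combo in sites_by_species.items():
        if combo not in type_of_combo:
            # First time this combination is seen: assign the next label.
            type_of_combo[combo] = len(type_of_combo) + 1
        labels[name] = type_of_combo[combo]
    return labels, type_of_combo


labels, type_of_combo = assign_mutation_types(species_sites)
```

Here `labels` maps each species to its mutation type and `type_of_combo` records which combination of site values each label stands for, so a label can always be traced back to the individual sites.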

It seems to be a textbook problem, and we tried multinomial logistic regression on the log odds with respect to some base values. But it is not natural to choose a base value for each variable. For instance, a result such as "when the geography value changes from the base value 'mediterranean' to 'temperate', the log odds of changing from mutation type 1 to 2 (out of the 20) increase by 5.2, with p-value = 10^-3" is difficult to interpret, because we do not know whether the species actually moved from one place to the other, and mutation types do not change that way. So we do not like the logistic regression. Besides, the "independent" variables are highly associated with one another, as shown by a chi-squared test.

We have also tried to explain a single mutation site variable at a time, but it seems better to combine them, though maybe not all of them.

There is also a problem with the sparsity of the contingency tables against the mutation type variable.

Chi-squared tests between all pairs of variables reveal some associations.
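For readers who want to reproduce this kind of pairwise check, the following is a rough sketch using only the Python standard library. The variable names and data values are made up, and the function `chi2_summary` is hypothetical. It computes the Pearson chi-squared statistic, the degrees of freedom, Cramér's V as an effect size, and the number of cells whose expected count falls below 5. That threshold is a common rule of thumb, assumed here as one way to quantify sparsity. No p-value is computed, since that needs the chi-squared distribution (for example via `scipy.stats.chi2_contingency`).

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def chi2_summary(xs, ys):
    """Pearson chi-squared summary for two categorical sequences of equal length."""
    n = len(xs)
    row = Counter(xs)            # marginal counts of the first variable
    col = Counter(ys)            # marginal counts of the second variable
    obs = Counter(zip(xs, ys))   # observed joint counts (missing cells count as 0)
    stat = 0.0
    sparse = 0
    for r, nr in row.items():
        for c, nc in col.items():
            expected = nr * nc / n
            stat += (obs[(r, c)] - expected) ** 2 / expected
            if expected < 5:     # rule-of-thumb sparsity threshold (assumption)
                sparse += 1
    dof = (len(row) - 1) * (len(col) - 1)
    k = min(len(row), len(col)) - 1
    cramers_v = sqrt(stat / (n * k)) if k > 0 else 0.0
    return {"chi2": stat, "dof": dof, "cramers_v": cramers_v,
            "sparse_cells": sparse, "cells": len(row) * len(col)}


# Hypothetical data: one value per species for each variable.
variables = {
    "geography": ["med", "med", "temp", "temp", "med", "temp", "med", "temp"],
    "climate": ["dry", "dry", "wet", "wet", "dry", "wet", "dry", "wet"],
    "mutation_type": [1, 2, 1, 2, 1, 2, 1, 2],
}

# One summary per unordered pair of variables.
results = {
    (a, b): chi2_summary(variables[a], variables[b])
    for a, b in combinations(variables, 2)
}
```

With 20 mutation types against 3 or 4 environmental levels, many cells will likely have small expected counts, so the `sparse_cells` figure gives a quick indication of how far the usual chi-squared approximation can be trusted for each pair.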

Is there any other strategy we should try?

Thanks a lot.

Jairo Rocha

University of the Balearic Islands