Odds Ratio and Gene Mutation Associations

#1
Hi everyone,

I have a simple (yet to me still non-trivial) problem to submit to you.

I have a dataset of patients affected by a disease, in whom the presence of mutations in several genes was inferred.
Each gene is a binary variable: 0 for negative (not mutated) and 1 for positive (mutated).
I need to assess associations between these genes, to establish whether some tend to be co-mutated while others tend to be mutually exclusive. To do this, I first analyzed all possible gene pairs in 2x2 contingency tables such as:
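[Attached image Unbenannt1.PNG: 2x2 contingency table of gene 1 vs gene 2; counts quoted in the text: 431 (0/0), 131 and 428 (discordant), 69 (1/1)]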
In this case, for example, the p-value is highly significant, so I thought it could be useful to compute the OR to characterize the relationship. Here the OR obtained from the formula OR = (A x D) / (B x C) is 0.53, which should mean that the two genes are more often discordant (0-1 or 1-0) than concordant (0-0 or 1-1). However, my concern is that this does not make clear whether the two genes are positively or negatively correlated. Should I just compare the double positives (1-1) against the total of discordant cases (0-1 and 1-0)? In this case it would be 69/(131+428) = 69/559 = 0.12. Is that useful?
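
The direction question can be read directly off the OR itself. A minimal R sketch, using the four counts quoted above; which discordant cell belongs to which gene is an assumption here (it does not change the OR), and `tab`, `or_hat` and `direction` are names chosen for illustration:

```r
# Observed counts quoted in the post. Assumption: rows are gene 1 (0/1) and
# columns are gene 2 (0/1); swapping the two discordant cells leaves the OR unchanged.
tab <- matrix(c(431, 428,
                131,  69),
              nrow = 2, byrow = TRUE,
              dimnames = list(gene1 = c("0", "1"), gene2 = c("0", "1")))

# Cross-product odds ratio: (A x D) / (B x C)
or_hat <- (tab["0", "0"] * tab["1", "1"]) / (tab["0", "1"] * tab["1", "0"])

# Fisher's exact test gives a p-value and a confidence interval for the OR
# (its point estimate is a conditional MLE, so it can differ slightly from or_hat)
ft <- fisher.test(tab)

# OR < 1: fewer double positives than independence predicts (tendency to mutual exclusivity)
# OR > 1: more double positives than independence predicts (tendency to co-mutation)
direction <- if (or_hat < 1) "negative association" else "positive association"

print(round(or_hat, 2))   # 0.53
print(ft$conf.int)
print(direction)
```

Because the OR already compares concordant with discordant cells relative to what the margins allow, a separate ratio such as 69/559 is probably not needed to decide the sign of the association.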

However, each gene has a different mutation frequency in the population: for example, gene 1 here has a 0.18 probability of being mutated whilst gene 2 has a 0.46 probability. Should I take this into account?
I played around and tried to see what these 4 combinations would look like if they were due only to the expected mutation frequencies of the two genes, and something like this came out:
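[Attached image Unbenannt.PNG: 2x2 table of "expected frequencies" for gene 1 vs gene 2; counts quoted in the text: 542 (0/0), 172 (1/1)]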
The total is the same, but the counts are redistributed according to the expected frequencies (i.e. the total number of gene 1 mutated cases is 195/1059 = 0.18, which is the expected mutation frequency of the gene). I then computed another OR for these numbers (12.76) and compared it with the previous one using Tarone's test of homogeneity between the two tables (in this case, the p-value is significant).
By dividing each cell of the "real life" table by the corresponding cell of the "expected frequencies" table I obtained a ratio (e.g. 0/0 ratio = 431/542 = 0.80: there are fewer double negatives than expected). Do you think this reasoning is correct? If so, should I use the 1/1 ratio to decide whether the relationship is positive or negative (in this case 69/172 = 0.4: there are fewer double positives than expected, so the genes are inversely correlated)?
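
For comparison, here is a hedged sketch of how expected counts under independence are usually derived from the observed margins (row total x column total / grand total). It reuses the observed counts from the post with an assumed cell layout; `expected`, `or_expected` and `ratio` are illustrative names. A table built this way has an OR of exactly 1, so an "expected" table whose OR is far from 1 would not represent "no association".

```r
# Observed counts from the post (cell layout is an assumption, see the text)
tab <- matrix(c(431, 428,
                131,  69),
              nrow = 2, byrow = TRUE,
              dimnames = list(gene1 = c("0", "1"), gene2 = c("0", "1")))

# Expected counts if the two genes were independent:
# (row total x column total) / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# By construction the expected table has an odds ratio of 1
or_expected <- (expected[1, 1] * expected[2, 2]) / (expected[1, 2] * expected[2, 1])

# Observed / expected ratio per cell: below 1 means fewer cases than independence predicts
ratio <- tab / expected

print(round(expected, 1))
print(round(ratio, 2))
```

`chisq.test(tab)$expected` returns the same expected counts, and the sign of the association read from the 1/1 ratio always agrees with the sign read from the OR.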

I thank you in advance and look forward to your help!
Best,
Luca
 

hlsmith

#2
I would find the gene combination with the lowest odds of the outcome of interest. That group becomes your overall referent group, against which all other groups are compared when calculating ORs. So at the end you will have 3 ORs. To account for group sizes, you should always present ORs with 95% CIs, since these take sample size into account. Lastly, since you are making multiple comparisons, you should make the alpha level used for the CIs smaller, which makes the intervals wider. This guards against false positives. A typical approach is to divide alpha by the number of comparisons (the Bonferroni correction), though if you were going to use, say, 0.05, I would just go with 0.01 instead.
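
As a rough illustration of the alpha adjustment, the sketch below computes a Wald confidence interval for one OR at the usual 95% level and at a Bonferroni-adjusted level. The counts are the ones quoted in post #1; the number of comparisons `m` is a made-up value, and `wald_ci` is a helper defined here, not a library function.

```r
# Counts from post #1 (cell layout assumed; it does not affect the OR or its SE)
tab <- matrix(c(431, 428,
                131,  69), nrow = 2, byrow = TRUE)

or_hat <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
se     <- sqrt(sum(1 / tab))   # standard error of log(OR)

m     <- 45     # hypothetical number of comparisons (e.g. all pairs among 10 genes)
alpha <- 0.05

# Wald interval on the log-odds scale, back-transformed to the OR scale
wald_ci <- function(level) {
  z <- qnorm(1 - (1 - level) / 2)
  exp(log(or_hat) + c(-1, 1) * z * se)
}

ci_95   <- wald_ci(1 - alpha)       # unadjusted 95% CI
ci_bonf <- wald_ci(1 - alpha / m)   # Bonferroni-adjusted CI, wider

print(round(ci_95, 2))
print(round(ci_bonf, 2))
```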

This can all be done using logistic regression or by hand. Another possibility is to enter the data into a logistic regression model along with an interaction term for the genes and examine it for additive or multiplicative interaction, but you would want to state your protocol before starting the analyses to prevent a mining expedition.
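
To show that the two routes agree, here is a minimal sketch that fits a logistic regression to the 2x2 counts from post #1 (entered as frequency weights, with an assumed cell layout). The exponentiated coefficient should reproduce the hand-computed cross-product OR; `d`, `fit`, `or_glm` and `ci_glm` are illustrative names.

```r
# One row per cell of the 2x2 table, with the cell count as a frequency weight
d <- data.frame(
  gene1 = c(0, 0, 1, 1),
  gene2 = c(0, 1, 0, 1),
  n     = c(431, 428, 131, 69)
)

# Logistic regression of gene 2 status on gene 1 status
fit <- glm(gene2 ~ gene1, family = binomial, weights = n, data = d)

or_glm <- exp(coef(fit)[["gene1"]])               # odds ratio from the model
ci_glm <- exp(confint.default(fit)["gene1", ])    # Wald 95% CI

or_hand <- (431 * 69) / (428 * 131)               # cross-product OR by hand

print(round(c(glm = or_glm, by_hand = or_hand), 3))
print(round(ci_glm, 2))
```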
 
#3
Hi hlsmith, thank you very much for the answer.
I recalculated table 2 following your advice so that it has an OR = 1 and can be used as a reference.
Now it looks like this:
[Attached image Unbenannt.PNG: recalculated reference table with OR = 1; count quoted in the text: 461 (0/0)]
The total is the same and the table respects the mutation frequencies of both gene 1 and gene 2. I did not understand exactly what I should do with this table. Should I divide each cell of my table 1 by the corresponding cell of table 2, to get an idea of whether the real data match the expected counts or not (e.g. double negatives in table 1: 431 / double negatives in table 2: 461 = 0.93; this ratio would tell me that the number of 0/0 cases is slightly lower than expected)?
How exactly can I calculate 3 ORs?

The paper I am using for reference is this:
https://www.ncbi.nlm.nih.gov/pubmed/27276561
If you look at supplemental figure S3, it nicely shows all the correlations based on OR and corrected for FDR (also taking into account the absolute numbers of cases analyzed). Since it's my first time performing this analysis, could you give me some advice on how to do it (I am learning to use R, by the way)?
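
One possible way to set up such an all-pairs analysis in R is sketched below. The data are simulated purely for illustration (300 hypothetical patients, 5 hypothetical genes), and this is not necessarily the method used in the referenced paper: each gene pair gets a Fisher's exact test, and the p-values are then adjusted with the Benjamini-Hochberg procedure via `p.adjust`.

```r
set.seed(1)

# Hypothetical data: 300 patients x 5 binary mutation indicators (simulated)
n <- 300
genes <- as.data.frame(matrix(rbinom(n * 5, 1, 0.3), nrow = n,
                              dimnames = list(NULL, paste0("gene", 1:5))))

pairs <- combn(names(genes), 2)   # every unordered gene pair, one per column
res <- data.frame(a = pairs[1, ], b = pairs[2, ], or = NA_real_, p = NA_real_)

for (i in seq_len(ncol(pairs))) {
  # Force both levels so the table is always 2x2
  x <- factor(genes[[pairs[1, i]]], levels = c(0, 1))
  y <- factor(genes[[pairs[2, i]]], levels = c(0, 1))
  ft <- fisher.test(table(x, y))
  res$or[i] <- unname(ft$estimate)   # conditional MLE of the odds ratio
  res$p[i]  <- ft$p.value
}

# Benjamini-Hochberg adjustment controls the false discovery rate
res$q <- p.adjust(res$p, method = "BH")
res$direction <- ifelse(res$or > 1, "co-mutation", "mutual exclusivity")

print(res)
```

With real data, `genes` would be replaced by the patient-by-gene 0/1 table, and pairs with a small `q` would be the candidates for co-mutation (OR > 1) or mutual exclusivity (OR < 1).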

Thank you so much for your help,
Luca
 

hlsmith

#6
For clarification, you want to look at outcome risk based on the presence of combinations of two genes, correct? That is what I assumed and the direction I have been leading you in. If so, see below. If not, please clarify.

You will have to examine the document linked above, but I don't recall whether a rare-outcome assumption (outcome < 10%) is needed when ORs are used in place of RRs in the general formula for RERI.

I grabbed this code; the package wouldn't work with my version of R, but it should give you direction. It gets at whether the presence of both genes together has an additive effect on the outcome larger than expected and/or a multiplicative effect larger than expected.

Code:
#Source: https://www.rdocumentation.org/packages/epiR/versions/0.9-96/topics/epi.interaction
#Data simulation
can <- c(rep(1, times = 231), rep(0, times = 178), rep(1, times = 11),
         rep(0, times = 38))
smk <- c(rep(1, times = 225), rep(0, times = 6), rep(1, times = 166),
         rep(0, times = 12), rep(1, times = 8), rep(0, times = 3), rep(1, times = 18),
         rep(0, times = 20))
alc <- c(rep(1, times = 409), rep(0, times = 49))
dat <- data.frame(alc, smk, can)
dat
#multiplicative interaction
dat.glm01 <- glm(can ~ alc + smk + alc:smk, family = binomial, data = dat)
summary(dat.glm01)
#additive interaction
# epi.interaction() is a function in the epiR package, not a package of its own
install.packages("epiR")
library(epiR)
dat$d <- rep(NA, times = nrow(dat))
dat$d[dat$alc == 0 & dat$smk == 0] <- 0
dat$d[dat$alc == 1 & dat$smk == 0] <- 1
dat$d[dat$alc == 0 & dat$smk == 1] <- 2
dat$d[dat$alc == 1 & dat$smk == 1] <- 3
dat$d <- factor(dat$d)
## Table 3 of Hosmer and Lemeshow (1992):
dat.glm02 <- glm(can ~ d, family = binomial, data = dat)
summary(dat.glm02)
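# The argument names below (coeff, type) follow epiR 0.9-96, the version documented at the
# source link; other epiR releases may use different arguments, so check ?epi.interaction
epi.interaction(model = dat.glm02, coeff = c(2,3,4), type = "RERI",
                conf.level = 0.95)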
epi.interaction(model = dat.glm02, coeff = c(2,3,4), type = "APAB",
                conf.level = 0.95)
epi.interaction(model = dat.glm02, coeff = c(2,3,4), type = "S",
                conf.level = 0.95)
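
If the package will not load, the point estimates can also be computed by hand in base R. The sketch below uses the cell counts obtained by tabulating the simulated vectors above (cases and controls per alcohol/smoking combination) and the standard formula RERI = OR11 - OR10 - OR01 + 1. It gives point estimates only, with no confidence intervals, and the object names are illustrative.

```r
# Cases (can = 1) and controls (can = 0) per exposure combination,
# tabulated from the simulated vectors above
cells <- data.frame(
  alc      = c(0, 1, 0, 1),
  smk      = c(0, 0, 1, 1),
  cases    = c(3, 6, 8, 225),
  controls = c(20, 12, 18, 166)
)
cells$d <- factor(0:3)   # 0 = neither, 1 = alc only, 2 = smk only, 3 = both

# Same dummy-coded model as dat.glm02, fitted to grouped counts
fit <- glm(cbind(cases, controls) ~ d, family = binomial, data = cells)
or  <- exp(coef(fit))[-1]   # ORs vs. the unexposed referent: d1, d2, d3

# Additive interaction: relative excess risk due to interaction
reri <- unname(or["d3"] - or["d1"] - or["d2"] + 1)

# Multiplicative interaction: ratio of the joint OR to the product of the single ORs
mult <- unname(or["d3"] / (or["d1"] * or["d2"]))

print(round(or, 2))
print(round(c(RERI = reri, multiplicative = mult), 2))
```

A RERI above 0 suggests a joint effect larger than additive, and a multiplicative ratio above 1 a joint effect larger than multiplicative.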
 
#7
hlsmith said: "For clarification, you want to look at outcome risk based on the presence of combinations of two genes, correct? That is what I assumed and the direction I have been leading you in. If so, see below. If not, please clarify."
I am focusing on two different disease entities, NPM1+ and NPM1- AML. As an initial analysis, I only have to compare the distribution of several gene mutations to see whether certain mutations are associated with each other within NPM1+ AML compared to NPM1- AML, where these associations may not hold or could even be inverted. Finally, the purpose is to relate these mutations to age, to check for possible associations with age groups (e.g. a mutation more frequent in young patients than in old ones).
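
One way to ask whether a gene-gene association differs between NPM1+ and NPM1- cases is a logistic regression with an interaction term, sketched below. All counts are invented for illustration; the `gene1:npm1` coefficient tests whether the OR between the two genes is the same in both strata, which is the same question a homogeneity test on the two stratum tables addresses.

```r
# Hypothetical counts (illustration only): one gene1 x gene2 table per NPM1 stratum
d <- data.frame(
  npm1  = rep(c(0, 1), each = 4),
  gene1 = rep(c(0, 0, 1, 1), times = 2),
  gene2 = rep(c(0, 1, 0, 1), times = 2),
  n     = c(200, 100, 50, 60,    # NPM1- stratum
            150, 120, 80, 30)    # NPM1+ stratum
)

# gene1 * npm1 expands to gene1 + npm1 + gene1:npm1
fit <- glm(gene2 ~ gene1 * npm1, family = binomial, weights = n, data = d)

# Stratum-specific odds ratios between gene 1 and gene 2
or_neg <- exp(coef(fit)[["gene1"]])                                # NPM1-
or_pos <- exp(coef(fit)[["gene1"]] + coef(fit)[["gene1:npm1"]])    # NPM1+

# Wald p-value for the difference between the two ORs
p_int <- coef(summary(fit))["gene1:npm1", "Pr(>|z|)"]

print(round(c(OR_NPM1_neg = or_neg, OR_NPM1_pos = or_pos), 2))
print(p_int)
```

In this made-up example the association is positive in one stratum and negative in the other, which is the "inverted" situation described above.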

Later I will have to focus on outcome as well and perform a risk analysis based on gene combinations, so your advice is going to be useful for that part.