TS Contributor
My concern is whether the p-value and confidence interval have a role in the definition of interaction
They don't have any role in the definition of interaction. Interaction is independent of that (although you can test for interaction, like other things).

Generically, interaction implies that the relationship (or effect) of X1 on Y depends on the value of X2. You can switch X1 and X2 in that sentence and it's still true. You might say something like "The odds relationship of antibody status with the Outcome variable depends on how old the person is." That is, of course, only if you have evidence to suggest there is an interaction.

My question for your project: why is there a cutoff imposed on age, and why is 58 the correct age?


Less is more. Stay pure. Stay poor.
It has been a while, but I thought you just make a 2x2 table with the relative risks for each combination of the two factors, then see whether the relative risk in the bottom-right cell (both factors present) looks additive or multiplicative.
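A minimal sketch of that comparison, assuming the three relative risks are each measured against the doubly unexposed cell (in a case-control study, odds ratios would stand in for them). The function names and the numbers are hypothetical and only illustrate the arithmetic:

```python
def expected_joint_rr(rr_a, rr_b):
    """Expected relative risk for the doubly exposed cell under each scale."""
    additive = rr_a + rr_b - 1.0       # no interaction on the additive scale
    multiplicative = rr_a * rr_b       # no interaction on the multiplicative scale
    return additive, multiplicative


def reri(rr_a, rr_b, rr_ab):
    """Relative excess risk due to interaction (0 means exactly additive)."""
    return rr_ab - rr_a - rr_b + 1.0


# Hypothetical relative risks: factor A alone, factor B alone, both together
rr_a, rr_b, rr_ab = 2.0, 3.0, 6.0
additive, multiplicative = expected_joint_rr(rr_a, rr_b)
print("expected if additive:", additive)
print("expected if multiplicative:", multiplicative)
print("observed:", rr_ab, "RERI:", reri(rr_a, rr_b, rr_ab))
```

With these made-up values the observed joint relative risk matches the multiplicative expectation and exceeds the additive one, so whether an "interaction" appears depends on the scale being examined.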
This is a case-control study. Analysing the crude data, I found a non-significant association. After stratification by age (58 years is the participants' median age), effect heterogeneity is evident. I think there is a positive multiplicative interaction. Is this interaction quantitative or qualitative, considering that in the <58 years stratum the OR was > 1 but not significant?


TS Contributor
Well, you arbitrarily created an ordered "categorical" variable for age, which isn't advisable. The median is going to maximize power between the groups, but this is an arbitrary grouping and likely doesn't represent real groups. The next study will likely have a different arbitrary "grouping", say 60 years or older, and another study might have 55 and older, for example. That is very inadvisable for replication and doesn't actually adhere to the assumptions needed to make these groups. (The same argument can apply to the "positive" and "negative" antibody status, since this is likely some continuous measure made into groups by a lab.)

I would recommend fitting a model (exact logistic regression might be good here) for your outcome as a function of age (not cut into arbitrary groups) and antibody status, including an interaction between the two.

If your specific hypothesis is that the effect of age on the outcome increases when antibody status is positive, then conduct an upper-tailed test on the interaction coefficient; alternatively, just make the CI larger (say, 97.5%, to represent a smaller alpha for a one-tailed test) for the interaction coefficient and see if the lower bound of the CI is greater than 1 on the odds-ratio scale (equivalently, greater than 0 for the coefficient itself). This assumes you code Ab status as 1 if positive and 0 if negative.
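A rough sketch of the upper-tailed check, assuming a Wald (normal approximation) test on the interaction coefficient rather than the exact method suggested above. The coefficient and standard error are hypothetical, and `z_crit` is a parameter: 1.645 gives a one-sided 5% test, and a larger value (about 2.24 for a two-sided 97.5% interval) gives a more conservative lower bound.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def upper_tailed_wald(beta, se, z_crit=1.645):
    """Upper-tailed Wald test of H0: beta <= 0 for a log-odds coefficient.

    Returns the z statistic, the one-sided p-value, and the lower
    confidence bound expressed on the odds-ratio scale.
    """
    z = beta / se
    p_upper = 1.0 - normal_cdf(z)
    lower_bound_or = math.exp(beta - z_crit * se)
    return z, p_upper, lower_bound_or


# Hypothetical interaction coefficient (log odds per year of age) and its SE
z_stat, p_upper, lower_or = upper_tailed_wald(beta=0.08, se=0.04)
print(f"z = {z_stat:.2f}, one-sided p = {p_upper:.4f}, lower OR bound = {lower_or:.3f}")
```

A lower bound above 1 on the odds-ratio scale and a small one-sided p-value point the same way, since both come from the same estimate and standard error.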
This is a real group of patients; the estimated median age is real. According to Szklo and Nieto, it is a simple and feasible means to establish whether effect heterogeneity exists. My aim is to understand what kind of interaction I am dealing with. Thank you.
I agree with what Ondan said. Even if it is very common to dichotomize it is still not OK.

Imagine that you do an experiment and you very carefully administer a substance at graded concentration levels from 20 to 85. You would then cause very big measurement errors if all values less than 58 were coded as “low” and all those above 58 were coded as “high”.

Is it really a correct biological description to say that people aged 37, 47, and 57 are all the same, but at the age of 58 a mysterious jump happens, and then everybody aged 58, 68, and 78 is the same?

It would have been better to use the actual age as a “regression” variable in a logistic regression (as Ondan suggested), or with a quadratic model like b*age+b2*age*age, or as a generalized additive model (gam) with local smoothing.
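To make the 37/47/57 point concrete, here is a small sketch comparing a continuous logistic model with a cut at 58. The coefficients `b0`, `b1`, `b2` are hypothetical and not estimated from any data; `b2` is the optional quadratic term mentioned above.

```python
import math


def risk(age, b0=-6.0, b1=0.08, b2=0.0):
    """Outcome probability from a logistic model in age (hypothetical coefficients)."""
    log_odds = b0 + b1 * age + b2 * age * age
    return 1.0 / (1.0 + math.exp(-log_odds))


young, old = [37, 47, 57], [58, 68, 78]

# Dichotomizing at 58 replaces every age in a group by a single group-level value
young_mean = sum(risk(a) for a in young) / len(young)
old_mean = sum(risk(a) for a in old) / len(old)

for a in young + old:
    group_value = young_mean if a < 58 else old_mean
    print(f"age {a}: continuous risk {risk(a):.3f}, dichotomized {group_value:.3f}")
```

Under the continuous model the step from 57 to 58 is small, while 37 and 57 differ a lot; the dichotomized version reverses that picture.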


TS Contributor
This is a real group of patients; the estimated median age is real. According to Szklo and Nieto, it is a simple and feasible means to establish whether effect heterogeneity exists. My aim is to understand what kind of interaction I am dealing with. Thank you.
I understand it is a real group of patients, but the group divided by age isn't a real group; it's arbitrary and created by sampling variation. You're not really dealing with a "qualitative interaction" because it's not a qualitative variable, despite the appearance of it.

Frank Harrell Jr. and Stephen Senn are two very well studied and respected biostatisticians (statisticians by education and practice, so they're well versed in the theory and mathematical nuance behind ideas, in addition to the application). Look up their summaries and work on this problem of "dichotomania", which is incredibly prevalent in, and deleterious to, biomedical research. Alternatively, watch this video, which gives a great deal of explanation and visualization (you can probably watch it at 1.5x speed).

Summary: If a med student on rounds said "The patient's sodium is <135, I think we should...", no clinician would say "Great, it's <135, that's all I need to know." They're clearly going to assess risks from the specific serum sodium level, and differently for a patient who is 129 vs 114. This is a good quick analogy that should demonstrate the issue. But, long form is below...or in the video

By categorizing age (or another continuous variable), you are making many assumptions that aren't really true or reasonable and you're devaluing the work you're trying to do.
1) You're assuming that the outcome is relatively homogeneous within groups and relatively heterogeneous between groups (i.e. within each group, the Y values basically fall on a straight, horizontal line, or have the same outcome if categorical). This is clearly not a reasonable assumption in most biomedical scenarios.
2) You're assuming that the continuous variable is not continuously related to the outcome (lines or curves, for example, are ruled out); you assume that the relationship has a discrete jump (think a staircase) relating Y and that variable (sometimes reasonable in finance, generally not in medicine).
3) The "cut point" or "findings" are not likely to be replicated in other research.
4) You're assuming that the cut point is optimal for every patient, which isn't true.
5) Insurance companies try to use the literature to decide what to pay for, what not to pay for, or how much, and using arbitrary variables like this can lead to improper policies being enacted by those relying on the research. A common example is when people try to relate hospital length of stay to many variables, but the conclusions are spurious and based on improper methodology; in the end, the patients are at risk of being hurt (I know a few people in public health who have said this has come up in their careers).
6) There's a lot more...

I hope this clarifies what is meant by "not a real group" and why a different approach will be more favorable for realistic and repeatable conclusions.
Thank you all. I have understood your explanations. You're right. A further question, please: could I minimize that by stratifying by smaller age groups, just to verify whether interaction occurs or not? In other words, may I use stratification to assess the presence of interaction even for continuous variables?
Could I minimize that by stratifying by smaller age groups?
You could do that. But it would be better to use the variable as a “regression variable”, as I said above. Plot the curves. If they are parallel, then there is no interaction. If they are not parallel, then there is an interaction effect.

Isn’t Aeneas the founder of Rome in Vergilius writing and a gladius the short Roman sword? You can use that gladius to do with dichotomania as Alexander did to the gordian knot! :)


Less is more. Stay pure. Stay poor.
Do you have reasonable suspicion to believe there is an interaction? And if so, what would you imagine the relationship to be? It seems like you are looking for an interaction, be it additive or multiplicative, but why? Also, it seems you are trying to see what type of interaction, but shouldn't you have a reason why either is feasible? Lastly, it seems you are familiar with the possibility of interactions being additive and/or multiplicative - and are trying to dichotomize to find your solution because that is what you think the literature is telling you to do. But as noted, doing this is arbitrary and you risk losing the true signal.

Yes, use the variable in its continuous form. You can put in the model a continuous term, a binary term, and their product: X1, X2, X1*X2. Then plot the predicted probabilities of the outcome for the two binary groups (so two lines) and see if they cross. This is how you would examine the multiplicative interaction. Look to the work of Tyler VanderWeele of Harvard to see how to examine additive interaction given your formatting of variables (continuous * categorical interaction).
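A small sketch of the two-line comparison, assuming a model with age, antibody status, and their product. The coefficients are hypothetical placeholders for fitted values. The gap between the two lines is computed on the log-odds scale, where it stays constant (parallel lines) exactly when the product coefficient is zero; on the probability scale the curves can diverge even without a product term.

```python
import math


def predicted_log_odds(age, antibody, b0, b_age, b_ab, b_int):
    """Linear predictor for a model with age, antibody (0/1), and their product."""
    return b0 + b_age * age + b_ab * antibody + b_int * age * antibody


def inv_logit(x):
    """Convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical coefficients standing in for a fitted model
coef = dict(b0=-5.0, b_age=0.04, b_ab=-2.0, b_int=0.06)

gaps = []
for age in range(25, 86, 10):
    neg = predicted_log_odds(age, 0, **coef)
    pos = predicted_log_odds(age, 1, **coef)
    gaps.append(pos - neg)  # log odds ratio for antibody status at this age
    print(f"age {age}: P(neg) = {inv_logit(neg):.3f}, "
          f"P(pos) = {inv_logit(pos):.3f}, log OR = {pos - neg:+.2f}")
```

In this made-up example the log odds ratio changes sign across the age range, so the two lines cross; with `b_int=0.0` every gap would equal `b_ab`.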

P.S. It has been a while since I reviewed the Szklo & Nieto text, but more has been written recently on the difference in approaches for dealing with interaction versus effect modification.


Less is more. Stay pure. Stay poor.
I just skimmed the Szklo & Nieto book and saw that, yes, they used binary * binary interactions. This would be frowned on nowadays, but it got me thinking. They did dichotomize many of the exposures, but perhaps they had a rationale, plus those are the easiest examples and data to simulate. The best approach is not to dichotomize. I would imagine that if you looked at splines or GAMs and saw a possible inflection point in the line, the variable could be dichotomized for ease. As noted, though, this would likely mute the effect: you are working with population approximations, and while John and Jane may be comparable, the nonlinear relationship is a blend of their influences, and splitting the variable to fit piecewise lines is an all-or-nothing approach (black and white with no gray area). Splitting (dichotomizing) should be the last resort, or reserved for cases where there really is a threshold in the underlying data-generating process. Given genes, environment, and possible epigenetics, framing the signal with a split instead of plotting the nonlinear relationship will lose information, since there is so much variability.
I suspected that age could be a confounder. In fact, in the context of the pathology I am examining, age could be associated both with the exposure (i.e. antibody positivity; I found that prevalence increased with age) and with the outcome. I am not a statistician and I am not an expert in using statistical models. I am an MD but have some knowledge of epidemiology. I have learned that stratification is a simple and easy way to verify the existence of both confounding and interaction. When I analysed the crude data, I found a strong but non-significant association between exposure and outcome, OR = 4.38 (95% CI, 0.94-20.35). Thus, I dichotomized the age data with the aim of assessing, in a preliminary although rough way (I’m now definitely aware of that!), a possible confounding effect of age. After stratifying by age (< 58 and ≥ 58), the estimated odds ratios in the two strata were very heterogeneous (OR = 1.5, 95% CI 0.26-8.48 vs OR = 11.71, 95% CI 1.56-214.05, after Haldane correction). I found the same strong heterogeneity even after stratifying into 3 smaller age groups [22-52 years (n = 34), OR = 1.09, 95% CI 0.098-12.06; 53-63 years (n = 32), OR = 3.0, 95% CI 0.27-32.45; > 64 years (n = 31), OR = 178.2, 95% CI 1.24-25336.3, after Haldane correction]. Now, without questioning the advice I received, on the basis of these stratification data, can I say with confidence that there is an interaction between age and the presence of antibodies in determining the outcome? Can I say with confidence that there is a positive multiplicative interaction? Regardless of whether the OR indicates a positive or negative association, does the significance of the 95% CI have some role in defining such an interaction (e.g. qualitative or quantitative)?
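One rough way to ask whether the two stratum-specific odds ratios above really differ is a z-test on the difference of their logs. The sketch below back-calculates each standard error from the reported 95% CI, assuming the interval is symmetric on the log scale. That assumption is only approximate here, because the intervals are Haldane-corrected and not exactly symmetric, so treat the result as indicative only. The function names are made up for this example.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def se_from_ci(lower, upper):
    """Approximate SE of log(OR), assuming a CI symmetric on the log scale."""
    return (math.log(upper) - math.log(lower)) / (2.0 * Z_95)


def compare_odds_ratios(or_a, ci_a, or_b, ci_b):
    """z-test for the difference between two independent log odds ratios."""
    diff = math.log(or_b) - math.log(or_a)
    se_diff = math.sqrt(se_from_ci(*ci_a) ** 2 + se_from_ci(*ci_b) ** 2)
    z = diff / se_diff
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return math.exp(diff), z, p_two_sided


# Stratum-specific estimates quoted in the post above (< 58 vs >= 58 years)
ratio, z, p = compare_odds_ratios(1.5, (0.26, 8.48), 11.71, (1.56, 214.05))
print(f"ratio of ORs = {ratio:.2f}, z = {z:.2f}, two-sided p = {p:.2f}")
```

With the quoted numbers the ratio of odds ratios is about 7.8, but the two-sided p-value is well above 0.05. Wide intervals in small strata can make heterogeneity look larger than the data can support, which is another reason to fit the interaction model on continuous age.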


Less is more. Stay pure. Stay poor.
You need to switch over to logistic regression, if you haven't. Run the following models, take screenshots of the results, and post them. Then we will talk through what the results are suggesting.

y is your dichotomous outcome, modeled on the log-odds (logit) scale
Beta_0 is your intercept (base case log odds)

y = Beta_0 + Beta_1(antibody)
y = Beta_0 + Beta_1(antibody) + Beta_2(age)
y = Beta_0 + Beta_1(antibody) + Beta_2(age) + Beta_3(antibody*age)
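For readers without a statistics package at hand, the three models can be sketched in plain Python. This is an illustration only: the data are simulated with hypothetical coefficients, age is expressed as decades from 58, and the fit is ordinary maximum likelihood by Newton-Raphson, not the exact logistic regression suggested earlier in the thread. In practice a routine such as R's `glm` or statsmodels' `logit` would be the usual choice.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_logistic(rows, y, iters=25):
    """Newton-Raphson ML fit; returns (coefficients, log-likelihood)."""
    p = len(rows[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for x, yi in zip(rows, y):
            mu = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
            for j in range(p):
                grad[j] += (yi - mu) * x[j]
                for k in range(p):
                    hess[j][k] += mu * (1.0 - mu) * x[j] * x[k]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    loglik = 0.0
    for x, yi in zip(rows, y):
        eta = sum(b * v for b, v in zip(beta, x))
        loglik += yi * eta - math.log(1.0 + math.exp(eta))
    return beta, loglik


# Simulated data; TRUE holds hypothetical Beta_0..Beta_3 on the log-odds scale
rng = random.Random(0)
TRUE = (-1.0, 0.5, 0.3, 0.6)
data = []
for _ in range(4000):
    ab = 1 if rng.random() < 0.5 else 0
    age = rng.uniform(-2.0, 2.0)  # (age in years - 58) / 10
    eta = TRUE[0] + TRUE[1] * ab + TRUE[2] * age + TRUE[3] * ab * age
    outcome = 1 if rng.random() < 1.0 / (1.0 + math.exp(-eta)) else 0
    data.append((ab, age, outcome))
y = [d[2] for d in data]

m1, ll1 = fit_logistic([[1.0, ab] for ab, age, _ in data], y)
m2, ll2 = fit_logistic([[1.0, ab, age] for ab, age, _ in data], y)
m3, ll3 = fit_logistic([[1.0, ab, age, ab * age] for ab, age, _ in data], y)

# Likelihood ratio test of the product term (chi-square with 1 df)
lr_stat = max(2.0 * (ll3 - ll2), 0.0)
p_value = math.erfc(math.sqrt(lr_stat / 2.0))
print("model 3 coefficients:", [round(b, 3) for b in m3])
print(f"LR statistic = {lr_stat:.2f}, p = {p_value:.3g}")
```

Comparing the antibody coefficient across models 1 and 2 speaks to confounding by age, while Beta_3 in model 3 (and the likelihood ratio test against model 2) speaks to interaction on the multiplicative scale.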