Logistic Regression interaction effect between 2 categorical variables

#1
Hi all,

I am conducting a logistic regression on whether participants had a mammogram before age 40 (coded 1) or at age 40 or older (coded 0).
The two major predictors of this DV are having a family history of breast cancer and having symptoms. However, after controlling for these two major effects, race seems to be a factor. Whites are more likely to have a breast exam at younger ages than other races, controlling for breast cancer history and symptoms. I introduced a new variable, "Born in the US". Participants who were born in the US are significantly more likely to have a mammogram before 40 (controlling for all the covariates).

I want to introduce an interaction effect between born in the US and each race (Black, White, Latino, Asian). Based on my preliminary analysis, Asians are the least likely to have a mammogram at younger ages, but when I control for the "born in the US" variable, Asians who were born in the US are much more likely to have a mammogram, even more than Whites. And Asians who are immigrants are even less likely to have a mammogram.

How can I do the interaction effect between BORN US and ALL races?

Thanks,
Marvin
 

noetsi

No cake for spunky
#2
I am not sure whether you actually have an interaction effect, or whether an apparent relationship is actually spurious, explained by another variable, which is what statistical control addresses. Regardless, doing an interaction effect is simple. You create a new variable which is the product Born US * race and include it in the model alongside the main effects. If you are using a series of dummies to represent race (for example one for white, one for asian) then you would have to create an interaction term for each of these dummies and born in the US. Commonly, if the interaction effect is not significant you will remove it from the model.
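
For what it's worth, here is a minimal Stata sketch of building those product terms by hand. It runs on the nhanes2 example dataset that ships with Stata, not on the mammogram data, so highbp, female, and race only stand in for the outcome, the born-in-US dummy, and the race variable. The names black, other, female_black, and female_other are made up for illustration.

```stata
* Illustration only: example dataset shipped with Stata, not the thread's data
webuse nhanes2, clear

* Dummies for the non-reference race levels
* (assumes nhanes2 codes race as 1 = White, 2 = Black, 3 = Other)
gen byte black = (race == 2) if !missing(race)
gen byte other = (race == 3) if !missing(race)

* One product term per race dummy
gen byte female_black = female * black
gen byte female_other = female * other

* Main effects plus the hand-built interaction terms
logistic highbp female black other female_black female_other

* Joint test of the interaction terms
testparm female_black female_other
```

The reference race level gets no dummy and no product term, which is why there is one fewer interaction term than there are race categories.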
 
#4
agelog           Odds Ratio   Std. Err.       z   P>|z|   [95% Conf. Interval]

firstdegreerc      1.318834    .1141811    3.20   0.001   1.113001   1.562734
symptomsrc         5.912727     .403699   26.03   0.000   5.172149   6.759347
racerc1            .5862488    .1686931   -1.86   0.063   .3335406   1.030422
racerc3            .5082667    .1154678   -2.98   0.003   .3256237   .7933546
racerc4            1.009993    .6244585    0.02   0.987   .3006336   3.393121
racerc5            3.237816    4.024727    0.95   0.345   .2832542   37.01074
racerc6            .3396213    .1190633   -3.08   0.002   .1708376   .6751595
racerc7            .9213184    .3274465   -0.23   0.818   .4590749   1.848996
racerc8            1.222907    .2458549    1.00   0.317   .8246449   1.813511
bornusrc           1.375953    .1410598    3.11   0.002   1.125486   1.682159
bornus_black       2.180697     .531974    3.20   0.001   1.351909   3.517574
bornus_asian       4.949464    2.072695    3.82   0.000   2.178207   11.24649
bornus_indan       .4309863    .5353906   -0.68   0.498   .0377627   4.918859
bornus_hawaiian    1.581595    1.270714    0.57   0.568   .3274973    7.63806
bornus_multi       1.890161     .676586    1.78   0.075   .9371515   3.812305
bornus_other       6.983825    3.613393    3.76   0.000   2.533298   19.25309
_cons              .1404302    .0277592   -9.93   0.000   .0953239   .2068805




Do I have to exclude one of the interaction terms from the regression (as a reference category) or not? Some of the interactions seem to be significant. However, I have census-type data. All clients of our grantees have to complete a survey in order to get a mammogram. It is not a sample. It is the universe. Any thoughts?
How can I really interpret these findings? For instance, for Asians: can I say that Asians who were born in the US are almost 3 times more likely than Latinos who were born in the US (the reference interaction) to have a mammogram before 40 years old? It is hard for me to explain this in easy language.

Thanks for any help...
 

noetsi

No cake for spunky
#5
I would be interested in the effects. Two things to remember if you find the interaction is significant (other than don't sob as I generally do when I find it; interaction is a royal pain but we all have to go through it) are:

1) Looking at plots of the DV regressed on one variable at levels of the other interacting variable is probably the simplest way to consider the impact of interaction. SAS, and I am sure all software, has specialized commands to do this. Note this is a lot easier with categorical variables like yours than with continuous ones.

2) Simple effects are probably the best way to analyze interaction. This looks at the impact of one variable on the DV at specific levels of the other interacting IV. Again, all the major software, I am sure, will do this.
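
In Stata, for instance, simple effects along these lines could be sketched as below. This is only an illustration on the nhanes2 example dataset that ships with Stata, with highbp, female, and race standing in for the mammogram outcome, the born-in-US indicator, and race.

```stata
* Illustration only: example dataset shipped with Stata, not the thread's data
webuse nhanes2, clear

* ## requests both main effects and their interaction
logistic highbp i.female##i.race

* Simple effects: difference in predicted probability of highbp
* between female = 1 and female = 0, computed separately within each race level
margins race, dydx(female)
```

If the differences reported for the race levels are far apart, that is the interaction showing up on the probability scale.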
 

noetsi

No cake for spunky
#6
"Do I have to exclude an interaction effect from the regression (as a reference variable or not)? Some of the interactions seem to be significant."
If it's statistically significant you should include it.

"However, I have census-type data. All clients of our grantees have to complete a survey in order to get a mammogram. It is not a sample. It is the universe. Any thoughts?"
Yes. You don't need tests of statistical significance when you have a population. The effects discovered are real; they can't be attributed to random sampling error. So you can simply focus on the slopes and effect sizes.


"How can I really interpret these findings? For instance, for Asians: can I say that Asians who were born in the US are almost 3 times more likely than Latinos who were born in the US (the reference interaction) to have a mammogram before 40 years old? It is hard for me to explain this in easy language."
I am not one to write simply; my thoughts usually go through several translations before they go to decision makers. That said, think about the substantive meaning of what you find: why does what you find make sense or not, and what does it entail?

Remember, and this is critical, when you have interaction effects do not interpret main effects such as Asians. Interpret the behavior of the main effect at specific levels of the other interacting variable. From your results, Asians born in the US behave very differently than those born outside the US (that is what interaction is: one main effect behaves differently on the DV at various levels of the other IV).

So why do Asians born in the US behave differently than those born outside the US? My bet is different culture, or that income and education levels vary between foreign-born and US-born Asians. That is your job as an analyst: find out why the results are occurring (or, if you can't, at least point out what you found for others to explore).
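
As a rough worked example of reading an effect at specific levels, take the odds ratios from the output table in post #4 and assume (this is not confirmed in the thread) that racerc3 is the Asian dummy and that bornus_asian is bornusrc times racerc3. Odds ratios multiply, so, holding the other covariates fixed:

```latex
\begin{align*}
\mathrm{OR}(\text{US-born vs. foreign-born} \mid \text{reference race}) &= 1.3760 \\
\mathrm{OR}(\text{US-born vs. foreign-born} \mid \text{Asian}) &= 1.3760 \times 4.9495 \approx 6.81 \\
\mathrm{OR}(\text{Asian vs. reference race} \mid \text{foreign-born}) &= 0.5083 \\
\mathrm{OR}(\text{Asian vs. reference race} \mid \text{US-born}) &= 0.5083 \times 4.9495 \approx 2.52
\end{align*}
```

These are ratios of odds, not of probabilities, so "about 2.5 times the odds" is safer wording than "2.5 times more likely".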
 
#7
This is great, especially the last part. I did not know how to interpret interactions. Thank you very much. I will improve this regression and share the results with you. Would you include or exclude the interaction terms? I am just curious. Maybe I can keep this manuscript simple and just concentrate on the main finding: race is a predictor of getting a mammogram before 40 years old vs. 40+, controlling for family history and symptoms.
 

noetsi

No cake for spunky
#8
You can't talk about just the main effects if you have meaningful interaction, because that would distort the reality. The whole point of interaction is that you cannot talk about an average effect of an IV on a DV; you have to talk about the impact of an IV on the DV at specific levels of another IV (the one it is interacting with). This is where normal statistical control breaks down. So the key question is whether the interaction is significant or not.

The problem is that, with a population, you lose the handy crutch of dropping interaction effects (which makes the analysis far simpler) whenever they are above p = .05 (i.e., not statistically significant). Instead you have to make a substantive decision about whether the interaction effects are large enough to matter.

That is beyond my expertise. I would be inclined not to include interaction effects above p = .05, simply because I don't know how to judge whether an interaction effect is substantively important (that is, whether it is large enough to include in the model). Better yet, talk to an expert on your topic (if you are not one yourself) and ask whether the effects are large enough to matter when deciding to leave the interactions in. Also look at a graph showing the impact of the interacting IV on the DV at specific levels of the other interacting IV. If it looks like it is making a lot of difference, that is, if the impact of one IV on the DV seems to be very different at levels of the other IV, then I would think you should include the interaction effect (but only then). But again, this is beyond the statistics I have seen, which look at samples and p values, not populations.
 
#9
Thank you very much. I greatly appreciate your help. Do you know a nice way to graph the interaction effects? Which visual can I produce?

Thank you... I use stata.
 

noetsi

No cake for spunky
#10
That depends on what software you are familiar with and, of course, have. There are SAS commands, for example, which do this. I am hardly an expert at this, but I can send you comments made by those who are.

In the end they all plot the regression line between the IV and the DV at different levels of the other IV. So if you have this information you can plot it even without that software.
 

bukharin

RoboStataRaptor
#11
Use -margins- followed by -marginsplot-. I have written a few examples in the Stata forum over the years - search for marginsplot and you should come across some example code.

You need to code your categorical variables using i. and create interactions using #, otherwise -margins- won't understand the model correctly.
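
A minimal sketch of that workflow, using the nhanes2 example dataset that ships with Stata rather than the mammogram data (highbp, female, and race are stand-ins for the outcome, the born-in-US indicator, and race):

```stata
* Illustration only: example dataset shipped with Stata, not the thread's data
webuse nhanes2, clear

* i. marks categorical predictors; ## adds main effects plus the interaction
logistic highbp i.female##i.race

* Predicted probability of highbp for every female-by-race combination
margins female#race

* Plot those predictions with race on the x axis, one line per level of female
marginsplot, xdimension(race)
```

Roughly parallel lines suggest little interaction on the probability scale; lines that diverge or cross suggest the effect of one variable depends on the level of the other.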
 
#13
agelog              Odds Ratio   P value
racerc1             0.5234188    0.028
racerc3             0.467678     0.001
racerc4             0.4797798    0.008
racerc5             1.161012     0.459
whereliverc2        1.14028      0.129
whereliverc3        1.277347     0
whereliverc4        1.213362     0.292
symptomsrc          4.979101     0
firstdegreerc       1.340462     0
healthinsurance5rc  0.8625822    0.02
bornusrc            2.023084     0
bornus_black        1.429971     0.243
bornus_asian        3.199009     0.011
bornus_other        1.717719     0.077
bornus_latino       0.6409995    0.049
_cons               0.1716388    0

Please... I need help creating an interaction effect chart for bornus_black, bornus_asian, bornus_other, and bornus_latino.

I tried to look at margins and marginsplot but I do not understand any of it. Please help.
 
#14
@Marvin85, please show us the exact logistic command you used and then we can try to help you formulate the margins command you need.

Also, please tell us the levels at which the variables are measured. That is, for each variable, tell us whether it is a continuous, binary, or factor (i.e., multinomial) variable.

Also, for continuous variables please tell us the minimum and maximum values.
 
#15
logistic agelog racerc1 racerc3 racerc4 racerc5 whereliverc2 whereliverc3 whereliverc4 symptomsrc firstdegreerc healthinsurance5rc bornusrc bornus_black bornus_asian bornus_other bornus_latino

agelog = categorical: had a mammogram before 40 vs. 40+
race1-5 = categorical race variables
wherelive = categorical; I generated dummies to include them in the regression
firstdegree = categorical dummy: "Have a family history of breast cancer?"
healthinsurance5rc = "Do you have health insurance?" yes / no
bornusrc = dummy: born in the US? yes / no
bornus_black = interaction effect (bornusrc*race2)
bornus_asian = (bornusrc*race3)
bornus_other ...
bornus_latino ...

I do not have any continuous variables. Is this clear? Look at my previous posts to get an idea of what is going on.

Thank you very much for your help!
 
#16
You do not need to convert the race and wherelive variables to binaries, and I would recommend that you not do that.

I would create factor variables (i.e., a multinomial categorical variable) for race and wherelive instead of using a series of dummy variables. (You may already have those variables in that form, but I'll show the code to create or recreate them from the dummy variables.)

Code:
* Start everyone at level 1, then overwrite from the dummy variables.
* Note: cases where every dummy is 0 or missing also stay at level 1.
gen race = 1
replace race = 2 if racerc2 == 1
replace race = 3 if racerc3 == 1
replace race = 4 if racerc4 == 1
replace race = 5 if racerc5 == 1
replace race = 6 if racerc6 == 1
replace race = 7 if racerc7 == 1
replace race = 8 if racerc8 == 1

gen wherelive = 1
replace wherelive = 2 if whereliverc2 == 1
replace wherelive = 3 if whereliverc3 == 1
replace wherelive = 4 if whereliverc4 == 1
I would rewrite the command as shown in the following code, using factor variable notation with the "i." prefix. The ## symbol asks Stata to include both the main and interaction effects for the factor variables bornusrc and race.

Code:
logistic agelog i.wherelive i.symptomsrc i.firstdegreerc i.healthinsurance5rc i.bornusrc##i.race
Please run that code and then give us the output table from logistic. Also, please tell us how many observations you have in the data set.

With that information, we should be able to help you formulate the margins command line and create one or more marginsplots.
 
#18
With such a large sample size, almost every predictor variable is statistically significant. (That does not necessarily mean, however, that all of the effects of the predictors are substantively important or meaningful. That is a matter for your own research judgment based on your knowledge of the domain.)

There are multiple ways you could construct margins and marginsplots from this output, so I'll just show you some examples to try below. Without the data, I could not try these myself.

You can experiment with other combinations in the margins code. You can also look at -help margins- in Stata for more options. If you need black-and-white graphs for publications, you should change "scheme(s1color)" to "scheme(s1mono)" in the code below.

Code:
margins i.bornusrc, atmeans over(i.racerc)
marginsplot, scheme(s1color)

margins i.whereliverc, atmeans over(i.racerc)
marginsplot, scheme(s1color)

margins i.whereliverc, atmeans over(i.bornusrc i.racerc)
marginsplot, scheme(s1color)

margins i.whereliverc, atmeans over(i.racerc i.healthinsurance5rc)
marginsplot, scheme(s1color)

margins i.healthinsurance5rc, atmeans over(i.whereliverc i.racerc)
marginsplot, scheme(s1color)
You will note that none of my examples explicitly include the bornusrc#racerc interaction term. That is not needed because the margins include the combined main and interaction effects from all of the terms in the model.

By the way, if any of your codes is a missing value flag (for example, 8 in racerc may be one), you should set that value to missing before running the logistic regression. See -help mvencode- for the commands that tell Stata which values are missing value indicators for specific variables (-mvdecode-, documented on the same help page, is the one that converts numeric codes to missing). General information about missing values in Stata can be obtained from -help missing-.
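
As a small sketch of that recoding step, here is -mvdecode- on a toy variable. The data are made up, and treating 8 as a missing value flag is only an assumption to check against your codebook.

```stata
* Toy data standing in for the real racerc variable
clear
input byte racerc
1
3
8
2
8
end

* Convert code 8 to system missing (assumes 8 really is a missing-value flag)
mvdecode racerc, mv(8)

* The former 8s are now listed as missing
tabulate racerc, missing
```

Observations with a missing predictor are dropped from the estimation sample by logistic, so it is worth checking the reported number of observations afterwards.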

I hope this helps.