Chi-square test for a contingency table with multiple rows

#1
Hi there!
New user here, I hope I'm posting in the right section!

I was having some issues trying to figure out if/how I can apply a chi-square test to my problem.

I know how to deal with and interpret a case where I have only 2 mutually exclusive groups, like male/female, and I want to determine whether a certain disease is more common in one of the groups. The null hypothesis is that there's no difference in the distribution of the disease between males and females.
I can organize the data this way:

Code:
        | disease   no_disease
--------------------------------
male    |   25          6
female  |    8         15
and, after running a chi-square test, I can reject the null hypothesis, since the chi2 statistic (11.686) is higher than the critical value (3.841) for 1 degree of freedom at a significance level of 0.05.
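As a quick check of the numbers above, here is a minimal sketch in plain Python that computes the Pearson chi-square statistic (without continuity correction, which is what reproduces 11.686) for the male/female table. The function name `chi_square_2x2` is just an illustrative choice, not something from a library.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction) and p-value for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the upper-tail probability has a closed form.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Rows: male, female. Columns: disease, no_disease.
observed = [[25, 6], [8, 15]]
stat, p_value = chi_square_2x2(observed)
print(f"chi2 = {stat:.3f}, p = {p_value:.5f}")
print("reject H0 at 0.05:", stat > 3.841)
```

Comparing the statistic with 3.841 and comparing the p-value with 0.05 are two views of the same decision.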

However, I'm trying to figure out how to deal with a situation where I have patients with multiple symptoms, so something like:

Code:
            | disease   no_disease |  TOT
-------------------------------------------
symptom 01  |   20          20     |   40
symptom 02  |   10           5     |   15
symptom 03  |   10          15     |   25
symptom 04  |   10           5     |   15
-------------------------------------------
TOT         |   50          45     |   95
the symptoms are mutually exclusive, and I'd like to figure out which ones are better indicators of the presence of the disease, but I'm not sure about how to approach the problem.

I don't think performing a chi2 test on the whole table as it is would make sense, right?
I was thinking I could treat them separately: for each symptom, build a 2x2 contingency table with the patients showing that symptom as the first category and all the other patients, grouped together, as the second category. So, something like:

Code:
             | disease   no_disease |  TOT
--------------------------------------------
symptom 01   |   20          20     |   40
no sympt 01  |   30          25     |   55
--------------------------------------------
TOT          |   50          45     |   95

             | disease   no_disease |  TOT
--------------------------------------------
symptom 02   |   10           5     |   15
no sympt 02  |   40          40     |   80
--------------------------------------------
TOT          |   50          45     |   95

             | disease   no_disease |  TOT
--------------------------------------------
symptom 03   |   10          15     |   25
no sympt 03  |   40          30     |   70
--------------------------------------------
TOT          |   50          45     |   95

             | disease   no_disease |  TOT
--------------------------------------------
symptom 04   |   10           5     |   15
no sympt 04  |   40          40     |   80
--------------------------------------------
TOT          |   50          45     |   95
and then perform chi2 tests on each table, to evaluate how much each symptom is an indicator of the presence of the disease.
But again, I don't know if it would be a correct approach.
Any suggestions about it, please?
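For illustration, the one-symptom-versus-the-rest idea described above could be sketched like this in plain Python. Two things here are assumptions that are not in the original question: the helper name `chi_square_2x2`, and the Bonferroni adjustment (dividing 0.05 by the number of tests), which is one common and conservative way to account for running several tests on the same data.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) and p-value for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    # Shortcut formula for the 2x2 case.
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(stat / 2))  # exact for 1 degree of freedom
    return stat, p_value

# (disease, no_disease) counts per symptom, from the made-up table.
counts = {
    "symptom 01": (20, 20),
    "symptom 02": (10, 5),
    "symptom 03": (10, 15),
    "symptom 04": (10, 5),
}
total_disease = sum(d for d, _ in counts.values())
total_healthy = sum(h for _, h in counts.values())

alpha = 0.05
bonferroni_alpha = alpha / len(counts)  # assumption: Bonferroni correction

results = {}
for name, (disease, healthy) in counts.items():
    # Collapse every other symptom into a single "no symptom" row.
    table = [[disease, healthy],
             [total_disease - disease, total_healthy - healthy]]
    stat, p_value = chi_square_2x2(table)
    results[name] = (stat, p_value)
    print(f"{name}: chi2 = {stat:.3f}, p = {p_value:.3f}, "
          f"significant = {p_value < bonferroni_alpha}")
```

Since the collapsed tables all share the same patients, the four tests are not independent of each other, so the results are best read as exploratory.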

EDIT: sorry for the formatting, I'm trying to figure out how to create these tables in a more readable way!
 

gianmarco

TS Contributor
#2
I believe that the chi-square test can be applied to the whole table. At the end of the day, what you are trying to test is whether there is a dependence between rows and columns, i.e. between symptoms and disease. If the test indicates a significant association, you may want to use standardized residuals to pinpoint which cell(s) contribute to the departure from independence.
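A rough sketch of that workflow in plain Python, using the made-up 4x2 table from the first post. The residuals computed here are the adjusted standardized residuals, (O - E) / sqrt(E * (1 - row/n) * (1 - col/n)), which is one common definition; the function name is only illustrative.

```python
import math

def chi_square_with_residuals(table):
    """Pearson chi-square, degrees of freedom and adjusted standardized residuals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
            # Variance term of the adjusted standardized residual.
            variance = expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            res_row.append((observed - expected) / math.sqrt(variance))
        residuals.append(res_row)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, residuals

# Rows: symptom 01..04. Columns: disease, no_disease.
table = [[20, 20], [10, 5], [10, 15], [10, 5]]
stat, df, residuals = chi_square_with_residuals(table)

critical_value = 7.815  # chi-square critical value for df = 3 at alpha = 0.05
print(f"chi2 = {stat:.3f}, df = {df}, reject H0: {stat > critical_value}")
for i, res_row in enumerate(residuals, start=1):
    print(f"symptom {i:02d}:", [round(r, 2) for r in res_row])
```

Residuals beyond roughly +/-1.96 are the ones usually flagged as departing from independence, and they are only worth inspecting when the overall test is significant.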

Best
Gm


p.s.
are those real frequencies, or just made up?
 
#3
Ok, thanks, that's exactly what I didn't get. So, basically, the idea in a situation like this is that if the null hypothesis is rejected, you can conclude there's an association, but you can't pinpoint exactly where the association is, i.e. between which rows and columns?

The frequencies were made up, the real data is:

data = [
    [21, 20],
    [6, 1],
    [15, 19],
    [12, 6],
    [7, 3],
    [5, 3],
    [13, 12],
    [13, 12],
]

and the chi2 test suggests there's no dependence.
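For reference, a minimal plain-Python sketch of that test on the real data could look like the following. The function name is illustrative, and the check on small expected counts is an addition based on the common rule of thumb that most expected counts should be at least 5 for the chi-square approximation to be reliable.

```python
def chi_square_test(table):
    """Pearson chi-square statistic, degrees of freedom and expected counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, expected

data = [
    [21, 20],
    [6, 1],
    [15, 19],
    [12, 6],
    [7, 3],
    [5, 3],
    [13, 12],
    [13, 12],
]
stat, df, expected = chi_square_test(data)

critical_value = 14.067  # chi-square critical value for df = 7 at alpha = 0.05
small_cells = sum(1 for row in expected for e in row if e < 5)

print(f"chi2 = {stat:.3f}, df = {df}, reject H0: {stat > critical_value}")
print(f"cells with expected count below 5: {small_cells} of {len(data) * 2}")
```

The statistic stays below the critical value, in line with the conclusion above. Because a few expected counts are below 5, an exact or simulation-based test might be worth considering as a cross-check.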

Thanks a lot for the help! :)