# Which correlation coefficient is best suited to dichotomous variables (and why do my results feel intuitively wrong)?

#### ShoM

I have collected data that shows the days on which I fasted and whether those days resulted in weight-loss.

Null hypothesis:
There is no relationship between fasting and weight-loss.

Data Analysis:
1) The chart (see further below) is a visual representation of my data. The circled points indicate prolonged periods of fasting and show a downward trend in weight (i.e. weight-loss occurred) during these periods.
2) The contingency/crosstab table (see below) shows that:
• a) on the days that I fasted, weight-loss occurred 93% of the time (100/107);
• b) on the days that I did not fast, weight-loss occurred 38% of the time (348/924).

Conclusion (layman's):
Intuitively, this suggests a high correlation between the days on which I fasted and weight-loss occurring. Although weight-loss can occur on non-fasting days, it happens a lot less often than on fasting days (93% vs. 38%).

Conclusion (statistical):
I calculated the phi-coefficient (aka Yule phi or Mean Square Contingency Coefficient) to be -0.343 (see contingency table below for numbers used).
I chose the phi-coefficient as my correlation method because my variables are dichotomous (actually dichotomized) and various sources suggest this method is suitable. Please note that I am not a statistician and only learnt what the phi-coefficient was yesterday; before this, the entirety of my experience with correlation was using the CORREL function once in Excel a long time ago.
This source suggests that a phi-coefficient of 0.3-0.39 represents a "moderate" relationship between the variables.
This source (see table 2) suggests that a phi-coefficient of >0.25 represents a "very strong" relationship between the variables.
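
For readers who want to check the figure, here is a minimal sketch of the phi calculation from the counts quoted above. The no-loss counts (7 and 576) are derived from the stated totals (107 and 924), and the table layout (fasted first, weight-loss first) is an assumption, since the original contingency table is not reproduced here.

```python
import math

# Counts from the post. Rows: fasted / did not fast.
# Columns: weight-loss / no weight-loss.
a, b = 100, 7      # fasted: 100 loss days out of 107
c, d = 348, 576    # did not fast: 348 loss days out of 924


def phi_coefficient(a, b, c, d):
    """Phi coefficient for the 2x2 table [[a, b], [c, d]]."""
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator


phi = phi_coefficient(a, b, c, d)
print(f"P(loss | fasted)     = {a / (a + b):.3f}")
print(f"P(loss | not fasted) = {c / (c + d):.3f}")
print(f"phi = {phi:.3f}")
```

With this layout phi comes out positive, at about 0.343. Swapping the two rows (or the two columns) flips the sign without changing the magnitude, so the sign depends only on how the categories are coded.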

My Questions:
1) Is my methodology, reasoning and calculation sound? If not, how would you approach testing my null hypothesis?
2) My phi-coefficient of 0.343 seems low when taking into account my layman's conclusion (weight-loss on fasting vs. non-fasting days is 93% vs. 38%). Do you agree? Why is it so far from the maximum possible value of 1?
Thanks, and please let me know if I can clarify anything further!

Contingency table & data visualization:

#### Karabiner

The value of Phi depends not only on the degree of association between
the 2 characteristics, but also on the degree of symmetry of the marginal
distributions. If the marginal distributions are asymmetric (as is the
case here for the fasting variable), then Phi is diminished.
Personally, I would just use the 93% vs. 38% here as a description of the
effect in the sample. You could also consider the odds ratio.
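
The point about marginal distributions can be illustrated with a small sketch. It takes the counts from the first post (the no-loss counts 7 and 576 are derived from the stated totals) and builds the most strongly associated table that the same row and column totals allow; the variable names are only illustrative.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi coefficient for the 2x2 table [[a, b], [c, d]]."""
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator


# Observed counts: rows = fasted / not fasted, columns = loss / no loss.
a, b, c, d = 100, 7, 348, 576
row_fast, row_no_fast = a + b, c + d    # 107 and 924 days
col_loss = a + c                        # 448 loss days

# Strongest positive association the margins allow: put as many
# fasting days as possible into the weight-loss column.
a_max = min(row_fast, col_loss)
b_max = row_fast - a_max
c_max = col_loss - a_max
d_max = row_no_fast - c_max

phi_obs = phi_coefficient(a, b, c, d)
phi_max = phi_coefficient(a_max, b_max, c_max, d_max)
print(f"observed phi           = {phi_obs:.3f}")
print(f"largest attainable phi = {phi_max:.3f}")
print(f"ratio                  = {phi_obs / phi_max:.2f}")
```

Under these assumptions, even if every fasting day had been a weight-loss day, phi could not exceed roughly 0.39, because only about 10% of days are fasting days while about 43% are weight-loss days. The observed 0.343 is close to that ceiling rather than far from 1.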

#### ShoM

Thank you @Karabiner - I really appreciate your insightful response! It took me on a bit of a journey! Anyway...

I looked into your suggestions and acted on them to find the following:

1) The odds-ratio (OR) for my variables is 23.6
2) The 95% confidence-interval (CI) for the odds-ratio is: 10.86 - 51.47
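
These two figures can be reproduced with a short sketch. It assumes the interval is the usual Wald interval on the log odds-ratio with z = 1.96; the post does not say which method was used, but this choice matches the reported numbers. The no-loss counts (7 and 576) are derived from the totals in the first post.

```python
import math

# Rows: fasted / not fasted. Columns: weight-loss / no weight-loss.
a, b, c, d = 100, 7, 348, 576

# Odds of weight-loss on fasting days divided by odds on non-fasting days.
odds_ratio = (a * d) / (b * c)

# Wald interval: standard error of log(OR) from the four cell counts.
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
z = 1.96  # approximate 97.5th percentile of the standard normal
lower = math.exp(math.log(odds_ratio) - z * se_log_or)
upper = math.exp(math.log(odds_ratio) + z * se_log_or)

print(f"OR = {odds_ratio:.1f}, 95% CI = {lower:.2f} to {upper:.2f}")
```

Note that this interval treats the 1031 days as independent observations, which is itself an assumption for data recorded daily on one person.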
I have follow-up questions for anyone who is able to share their thoughts and expertise:

1) If I had dichotomized data for "exercising vs. weight-loss" (just as I have data for "fasting vs. weight-loss" in the example above), and let's say I computed an odds-ratio (OR) of 30 for that data (with a CI similar to that of the "fasting" variable), is it acceptable to compare the OR for "exercising" to the OR for "fasting" and claim that exercise is more effective than fasting (30 vs. 23.6) for weight-loss?
• [a] i.e. is it fair to compare OR values for different predictor variables (fasting vs. exercising) given the same outcome variable (weight-loss)?
• [b] if not, what is a good statistical method to use to conduct this type of analysis?
• [c] what statistical method can I use to understand the effect of fasting vs. exercise for a given day (i.e. if I exercise and fast on the same day, how much of the weight-loss (as a percentage) do I attribute to fasting and how much to exercise)?

2) Is it accurate to draw the following conclusions from the OR and CI calculated (for the given dataset)? If not, why not?
• [a] when fasting I am 23.6 times more likely to lose weight than when I don't fast
• [b] if I lost weight then I was 23 times more likely to have fasted than not fasted
• [c] the odds are 23.6 times higher that I will lose weight when I fast compared to when I don't fast
• [d] with 95% confidence I can say that fasting is between 10.86 and 51.47 times more likely to result in weight-loss than when not fasting

Thanks again and feel free to let me know if I can clarify anything further.

#### hlsmith

Can you explain these data? I apologize that I just skimmed above, but are they just for one person, groups or each number represents an individual? Was fasting randomized? How do you define weight loss?

#### ShoM

> Can you explain these data? I apologize that I just skimmed above, but are they just for one person, groups or each number represents an individual? Was fasting randomized? How do you define weight loss?

Answers to your questions (please note that I'm not a stats student or professional so apologies if I did not understand the questions as asked):

1) Are they just for one person?
Yes - and I am that person. Data collected over approx. 3 years.
2) Was fasting randomized?
I'm afraid I don't really know what this means, but I have collected data daily for 3 years (date, weight, fast). I lagged the weight data by 1 day and dichotomized it to 1 if weight decreased from the previous day's weight or 0 if it did not. Fast is simply 1 if I fasted and 0 if I did not. This is date-ordered time-series data.
3) How do you define weight loss?
weight-loss = 1 if weight decreased from the previous day, or 0 if it did not
Please do let me know if I have not been able to answer your questions satisfactorily and I'll be happy to try again.
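
The coding described in answers 2 and 3 can be sketched as follows. This is one reading of the description, not necessarily the exact procedure used: the fast flag for a day is paired with whether the next morning's weight is lower than that day's weight. The function name and the five sample days are hypothetical and are not taken from the real dataset.

```python
from collections import Counter


def dichotomize(weights, fasted):
    """Pair each day's fast flag with a next-morning weight-loss flag.

    weights[t] is the morning weight on day t and fasted[t] is 1 if day t
    was a fasting day. The last day has no following weight, so it is dropped.
    """
    pairs = []
    for t in range(len(weights) - 1):
        loss = 1 if weights[t + 1] < weights[t] else 0
        pairs.append((fasted[t], loss))
    return pairs


# Hypothetical illustration only.
weights = [80.0, 79.4, 79.6, 79.6, 79.1]
fasted = [1, 0, 0, 1, 0]

pairs = dichotomize(weights, fasted)
table = Counter(pairs)  # keys are (fast, loss); values are day counts
print(pairs)
print(table[(1, 1)], table[(1, 0)], table[(0, 1)], table[(0, 0)])
```

Counting the four (fast, loss) combinations in this way gives the cells of the contingency table.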

#### noetsi

Weight loss would occur in the future, not at the present point, I would think. So you arguably have a time series, and I am not sure even logistic regression would work (at the least you would want to predict weight loss at a given lag rather than now, and p-values are doubtful in that case).

But that would take you into very complex places.

#### hlsmith

I guess the weight loss variable seems trivial; quantifying the amount may be important, since just not drinking as much water may result in weight loss due to dehydration rather than an actual loss of lean mass or adiposity. In addition, if a person has a non-daily bowel movement schedule, that could end up spuriously correlated with fasting, etc. This is not an easy, straightforward project. Was there a rationale behind when you fasted?

#### ShoM

> Weight loss would occur in the future, not at the present point, I would think. So you arguably have a time series, and I am not sure even logistic regression would work (at the least you would want to predict weight loss at a given lag rather than now, and p-values are doubtful in that case).
>
> But that would take you into very complex places.

Thanks for your input. Here is how the data was collected:
1) I fast on a given day (T)
2) I take my weight on T+1 (always after I wake up and have a bowel movement [yes I'm regular] but before I eat anything)

I'm not too sure what you are trying to convey in your post?

#### ShoM

> I guess the weight loss variable seems trivial; quantifying the amount may be important, since just not drinking as much water may result in weight loss due to dehydration rather than an actual loss of lean mass or adiposity. In addition, if a person has a non-daily bowel movement schedule, that could end up spuriously correlated with fasting, etc. This is not an easy, straightforward project. Was there a rationale behind when you fasted?

Hi, thanks for sharing your thoughts. Whilst the concerns you expressed are fair, they are way too nuanced for now. I'm a fairly consistent person, and whilst not drinking much water for a few days *may* impact weight-loss, I have 3 years of data here, so it would not have a significant impact on the final result. Of course there are probably millions of factors that go into weight-loss and I'm not trying to account for them all.

To answer your question, I fasted for health reasons (it's supposed to be good for you), but of course it also leads to weight-loss (or at least it did for me, and in fact it did so 100 out of 107 times!). Anyway, this is beside the point of the questions I am asking in my posts above. I am simply using this data that I collected to learn about stats, correlation etc. So if anyone can help me with the questions in my second post above, that would be much appreciated.

#### ShoM

> But how did you choose which days you would fast on

No rationale to the days I chose, but per my first post (see chart) you'll notice that I tended to fast during the same days for the last 3 years. Why is it important to know this, given what I am asking? Just curious.

#### hlsmith

Your top chart doesn't make sense to me. Just plot your weight every day for the 3 years as connected dots, fill in the dots for the days you fasted with a different color, and post it. Whether your next-day weight was different seems like a poor metric. Say I eat and do the same thing every day and my caloric expenditure is constant. The plot would be stochastic, like coin flips, given the intangibles I could not document. So 50% of the time I lose weight, just being me. But did I really lose weight? Now say I don't eat for 24 hours. Well, yes, I am going to lose weight, because I am burning off available glucose and urinating the water I retained because of the glucose; I may metabolize lean and fat tissue too and not rebuild it due to the lack of macronutrients; and I won't have as much waste in my GI tract, since it will be clearing itself. So the day after I fast I will weigh less more than 50% of the time. What are you trying to prove, then? If you are sustaining a deficit, you will lose scale weight.
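
The coin-flip baseline described above can be illustrated with a small simulation. It is only a sketch under strong assumptions: the true weight is constant, and each morning's reading adds independent, symmetric noise. The base weight, noise level and number of days are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

n_days = 10_000      # hypothetical number of simulated days
base_weight = 80.0   # hypothetical constant true weight
noise_sd = 0.5       # hypothetical day-to-day measurement noise

# Each morning's reading is the constant weight plus independent noise.
weights = [base_weight + random.gauss(0, noise_sd) for _ in range(n_days)]

# Share of days on which the reading is lower than the previous day's.
decreases = sum(1 for t in range(n_days - 1) if weights[t + 1] < weights[t])
share = decreases / (n_days - 1)
print(f"share of days with a lower reading than the day before: {share:.3f}")
```

Under these assumptions roughly half the days register as "weight-loss" even though nothing has changed, which is the sense in which a day-over-day decrease is a noisy metric.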

Weight loss is just creating a caloric deficit: fewer calories going in and/or more calories expended. I am not aware of any non-nominal health benefits from fasting that cannot be achieved via a structured diet. But whatever works for you! I do know intermittent fasting can be used in cancer treatments to try to slow its growth, and organ reserve theories have never been conclusively proven.

#### hlsmith

I would imagine people either fast on a particular day when it is easy, or post-binge. Post-binge fasting may not show the true effect of fasting. I am sure there are other issues that may come into play with non-random fasting that could be conjured up.

#### ShoM

> I would imagine people either fast on a particular day when it is easy, or post-binge. Post-binge fasting may not show the true effect of fasting. I am sure there are other issues that may come into play with non-random fasting that could be conjured up.

None of this answers the very specific questions that I asked. You can imagine what you like. I'm sure people do fast post-binge, but I'm not going to caveat my post to cover every possible thought you might have that may or may not have an impact on the data. Please assume that the data is representative of a reasonable and rational human being and that it's not been put together to catch you out. With that said, I'd still like help with the questions in my second post.

Thanks.

#### hlsmith

@noetsi mentioned that ORs may not be useful here. Ideally, you would use risk differences, if anything. But all of them (odds ratios, risk ratios, and risk differences) are moot since, as I probed, these data are all from you and are correlated with each other. Most models assume independence between observations, i.e. independent Bernoulli trials. So, to my knowledge, you can't use ORs as a metric. This is a time series project and includes time series questions, which is not my strong area.
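
For reference, the three measures named above can be computed from the counts in the first post, as in the sketch below. This is purely descriptive: it does nothing about the dependence between days raised above, and the no-loss counts (7 and 576) are derived from the stated totals.

```python
# Rows: fasted / not fasted. Columns: weight-loss / no weight-loss.
a, b, c, d = 100, 7, 348, 576

risk_fast = a / (a + b)        # share of fasting days followed by a loss
risk_no_fast = c / (c + d)     # share of non-fasting days followed by a loss

risk_difference = risk_fast - risk_no_fast
risk_ratio = risk_fast / risk_no_fast
odds_ratio = (a / b) / (c / d)

print(f"risk difference = {risk_difference:.3f}")
print(f"risk ratio      = {risk_ratio:.2f}")
print(f"odds ratio      = {odds_ratio:.1f}")
```

In this sample the risk ratio is about 2.5 while the odds ratio is about 23.6. The two differ sharply because weight-loss is a common outcome here, so an odds ratio should not be read as "times more likely".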

#### hlsmith

EDITED:

I think the first step should be to plot these data as a time series, as mentioned in post #14.

P.S. I am not questioning that these data were collected by a rational person, just pointing out why a lack of randomization in which days are fasting days may influence the data. If there is a reason why a person fasts on a certain day, that reason may also be associated with the outcome and create a confounding variable that would need to be addressed to discern the true relationship with fasting.

