Help interpreting results of categorical variable analysis

#1
Hello, I would like to confirm my analysis of two categorical variables, X and Y, both of which are binary (0, 1). Given a null hypothesis that X and Y are independent, I get the following results from my data:

Observed results:
  • X0, Y0 = 1,300,243
  • X0, Y1 = 1,140,705
  • X1, Y0 = 188
  • X1, Y1 = 11

Calculated Expected results:
  • X0, Y0 = 1,300,325
  • X0, Y1 = 1,140,623
  • X1, Y0 = 106
  • X1, Y1 = 93
The chi-square test statistic for independence is 135.72 (p-value < .001). Given a significance level of .05, I can reject the null hypothesis and conclude that X and Y are associated.
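
For anyone who wants to check the arithmetic, here is a minimal R sketch of the test above. It assumes a Pearson chi-square without a continuity correction (the post does not say which variant was used), and the object names are only illustrative.

```r
# Observed 2x2 counts from the post (rows: X = 0, 1; columns: Y = 0, 1)
obs <- matrix(c(1300243, 1140705,
                188,     11),
              nrow = 2, byrow = TRUE,
              dimnames = list(X = c("0", "1"), Y = c("0", "1")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson chi-square statistic (no continuity correction) and its p-value, df = 1
chi_sq  <- sum((obs - expected)^2 / expected)
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

round(expected)
chi_sq
p_value
```

Base R's `chisq.test(obs, correct = FALSE)` should give the same statistic; the manual version just makes the expected counts visible.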

Further, the odds ratio is .0667 (95% confidence interval: .0363 to .1225). This suggests a negative association between X and Y, such that if a subject experiences X (X=1) they are nearly 15 times more likely (1/.0667) to not have Y=1 (that is, to have Y=0).
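
The odds ratio and interval above can be reproduced with the sketch below. It assumes the usual Wald interval on the log-odds scale, which does match the quoted bounds; the variable names are illustrative. Note that the "15 times" figure (1/OR) is a ratio of odds of Y=0, which is not the same thing as a ratio of probabilities.

```r
# Cell counts from the post: n_xy = count with X = x and Y = y
n00 <- 1300243; n01 <- 1140705
n10 <- 188;     n11 <- 11

# Odds of Y = 1 in the X = 1 group divided by odds of Y = 1 in the X = 0 group
odds_ratio <- (n11 / n10) / (n01 / n00)

# Wald standard error of log(OR) and a 95% interval, back-transformed
se_log_or <- sqrt(1 / n00 + 1 / n01 + 1 / n10 + 1 / n11)
ci_or     <- exp(log(odds_ratio) + c(-1, 1) * qnorm(0.975) * se_log_or)

odds_ratio
ci_or
1 / odds_ratio   # odds of Y = 0 for X = 1 relative to X = 0
```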

Am I interpreting and stating the results correctly?
 

hlsmith

Not a robit
#2
What is this exactly for? Reporting both and also doing an H0 test may be redundant and overkill. If this is for a real-life problem and the data aren't too compromised by coming from retrospective cross-sectional sampling, I would report only the risk difference, which is more intuitive to readers:

Risk for Y=0 in the X=0 group is 41 percentage points (95% CI: 0.38, 0.44) lower than in the X=1 group.

This also calls into question whether you are answering the question you set out to address. Most of the time people want to examine the X=1 and Y=1 outcome, not X=0 and Y=0.
 
#3
I am trying to understand what impact participating in the X program (X=1) has on Y. The hope is that X will lead to an increase in Y (Y=1).
 

hlsmith

Not a robit
#4
Well, given your data, subjects participating in X (X=1) have a 41 percentage point (95% CI: 38.03, 44.38) lower rate of participating in Y (Y=1) than subjects not participating in X (X=0). You could always change the confidence level if you wanted a more conservative interval to better rule out a false positive, which doesn't seem to be an issue here. You could also run these analyses using a Bayesian model so you could make direct probability statements about the size of the effect.
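
A rough sketch of what the Bayesian version could look like, using only base R. Everything here beyond the counts is an assumption for illustration: independent uniform Beta(1, 1) priors on each group's rate, Monte Carlo draws from the resulting Beta posteriors, and an arbitrary seed and threshold.

```r
set.seed(1)        # arbitrary seed so the draws are reproducible
n_draws <- 1e5

# Posterior for P(Y = 1) in each group under a Beta(1, 1) prior:
# Beta(successes + 1, failures + 1)
p_x1 <- rbeta(n_draws, 11 + 1,      188 + 1)        # X = 1 group
p_x0 <- rbeta(n_draws, 1140705 + 1, 1300243 + 1)    # X = 0 group

diff_draws <- p_x1 - p_x0

mean(diff_draws)                        # posterior mean of the difference
quantile(diff_draws, c(0.025, 0.975))   # 95% credible interval
mean(diff_draws < 0)                    # posterior probability that the X = 1 rate is lower
mean(diff_draws < -0.3)                 # ... and lower by more than 0.3 (illustrative threshold)
```

With this much data the prior barely matters, so the credible interval should land close to the frequentist one.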

Any questions?

The take-home message is: if you tell someone that X=1 subjects have 0.07 times the odds of Y=1 compared to subjects with X=0, that will get messed up in their head almost every time. Estimates on the additive scale are more intuitive.
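
To make the contrast concrete, here is a small sketch that puts the three common effect measures side by side for the counts in this thread. The variable and function names are illustrative only.

```r
# Counts of Y = 1 and group sizes, taken from the observed table in the thread
y1_x1 <- 11;      n_x1 <- 199        # X = 1 group
y1_x0 <- 1140705; n_x0 <- 2440948    # X = 0 group

risk_x1 <- y1_x1 / n_x1   # proportion with Y = 1 among X = 1
risk_x0 <- y1_x0 / n_x0   # proportion with Y = 1 among X = 0

odds <- function(p) p / (1 - p)

measures <- c(
  odds_ratio      = odds(risk_x1) / odds(risk_x0),  # multiplicative, odds scale
  risk_ratio      = risk_x1 / risk_x0,              # multiplicative, risk scale
  risk_difference = risk_x1 - risk_x0               # additive scale
)
round(measures, 3)
```

The odds ratio (about 0.07) and the risk ratio (about 0.12) are both relative measures and differ from each other; the risk difference (about -0.41) is the one that reads directly as "41 percentage points lower".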
 
#5
Thanks for the help! What measurement or calculation are you using to determine participants in X have a 41% lower rate of Y=1 than subjects not in X? I would like to understand the measurement a little better.

At this point it seems we are getting unexpected results no matter how we measure it. Participating in X correlates negatively with having Y=1. We expected either no significant correlation or a positive one, so there must be other significant factors that we are not yet considering.
 

hlsmith

Not a robit
#7
Not sure if you code at all, but the equations are in this R code snippet. The bottom part is a function I came across last night that plots all of the CIs. The 0.01 line marks the 99% CI and the 0.05 line marks the 95% CI. You can see the estimate is clearly lower than 0.0, the null value of no difference between the groups. The 0.3 value (reference line) is there in case you assumed a positive effect of 0.3, since you hypothesized there could be an effect.


Code:
#======================================================================================
# Difference between two independent proportions: Agresti & Caffo
#======================================================================================
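# First proportion: Y = 1 among subjects with X = 1
x1 <- 11
n1 <- 199
# Second proportion: Y = 1 among subjects with X = 0
x2 <- 1140705
n2 <- 2440948  # X = 0 group size (1,300,243 + 1,140,705), not the grand total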
# Apply the Agresti-Caffo correction: add one success and one failure to each group
p1hat <- (x1 + 1)/(n1 + 2)
p2hat <- (x2 + 1)/(n2 + 2)
# The original (unadjusted) estimator
est0 <- (x1/n1) - (x2/n2)
est0
# The adjusted estimator and its standard error using the correction
est <- p1hat - p2hat
se <- sqrt(((p1hat*(1 - p1hat))/(n1 + 2)) + ((p2hat*(1 - p2hat))/(n2 + 2)))
# 95% confidence limits
UCL <- est + (qnorm(0.975)*se)
LCL <- est - (qnorm(0.975)*se)
est;UCL;LCL
# Plot the p-value function (run install.packages once if the package is missing)
# install.packages("pvaluefunctions")
library(pvaluefunctions)
res <- conf_dist(
  estimate = c(est)
  , stderr = c(se)
  , type = "general_z"
  , plot_type = "p_val"
  , n_values = 1e4L
  , log_yaxis = FALSE
  , cut_logyaxis = 0.05
  , conf_level = c(0.95, 0.99)
  , null_values = c(0, 0.3)
  , trans = "identity"
  , alternative = "two_sided"
  , xlab = "Difference of proportions"
  , together = FALSE
  , plot_p_limit = 1 - 0.9999
  , plot_counternull = FALSE
  , title = "P-value function for the difference of two independent proportions"
  , ylab = NULL
  , ylab_sec = NULL
  , inverted = FALSE
  , x_scale = "default"
)
[Attached image: plot of the P-value function for the difference of two independent proportions]
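
One note on reconciling the numbers: the Agresti-Caffo estimate from the snippet above comes out slightly different from the 41% (38.03, 44.38) quoted earlier in the thread. Those earlier figures appear to match a plain Wald interval around the unadjusted difference, which the following sketch reproduces. That is an inference from the arithmetic, not something stated in the thread, and the variable names are illustrative.

```r
# Y = 1 counts and group sizes from the observed table
x1 <- 11;      n1 <- 199        # X = 1 group
x0 <- 1140705; n0 <- 2440948    # X = 0 group

p1 <- x1 / n1
p0 <- x0 / n0

# Unadjusted risk difference with a Wald standard error
rd      <- p1 - p0
se_wald <- sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
ci_wald <- rd + c(-1, 1) * qnorm(0.975) * se_wald

round(100 * c(estimate = rd, lower = ci_wald[1], upper = ci_wald[2]), 2)
```

Either interval leads to the same conclusion here; the Agresti-Caffo adjustment mostly matters because the X=1 group is small and its proportion is close to zero.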