Odds ratio or other technique


(I have also posted this at https://stats.stackexchange.com/questions/535284/odds-ratio-or-other-technique)

Before I start: I am not well versed in posting on forums, so please be patient if I'm going against convention. I am also on a steep learning curve with statistics.

I have been presented with some data which originates from a survey filled in by workers in chemical factories (made-up data is attached in an Excel spreadsheet to illustrate the question; in reality the sample size is larger).


I am being asked to analyse whether there is a statistically significant association between a worker being diagnosed with cancer whilst working in the chemical environment and whether the workplace has designated clean and dirty areas.

1) Am I correct in saying that I can conduct an odds ratio calculation with confidence intervals and a p-value, as below? (I would have to class 'Y - not well adhered to' as 'N' in this case. Is this recommended?)

Capture odds ratio table.JPG

2) Would odds ratio be the recommended approach or are there more robust methods?

3) It may be that I have to exclude some participants who were diagnosed with cancer whilst working with chemicals, but have declined to answer. Does this need taking into account in some way?

Thank you for your time.





Less is more. Stay pure. Stay poor.
Yes, your approach would be fine. If you think Clean Place may be protective, then you may switch the ordering of the rows to get an odds ratio above 1 (switching the rows simply inverts the odds ratio).
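
As a rough sketch of the calculation being discussed (written in Python with only the standard library, and using made-up counts that are not from the attached spreadsheet), the odds ratio, a Wald 95% confidence interval and a normal-approximation p-value can be computed from the four cells of the table. The function name and the cell labels a, b, c, d are just for illustration.

```python
import math

def odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio, Wald 95% CI and two-sided p-value for a 2x2 table.

    a: exposed, outcome present     b: exposed, outcome absent
    c: unexposed, outcome present   d: unexposed, outcome absent
    Assumes no cell is zero (the Wald interval breaks down otherwise).
    """
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    z_stat = math.log(or_) / se
    p = math.erfc(abs(z_stat) / math.sqrt(2))  # two-sided normal p-value
    return or_, lo, hi, p

# Hypothetical counts: rows = designated clean/dirty areas (Y, N),
# columns = cancer diagnosis (yes, no).
or_yn, lo, hi, p = odds_ratio(10, 190, 30, 170)
# Swapping the two rows gives the reciprocal odds ratio.
or_ny, _, _, _ = odds_ratio(30, 170, 10, 190)
print(or_yn, lo, hi, p)
print(or_ny)
```

With small cell counts, an exact method (such as Fisher's exact test) is usually preferred over this approximation.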

Not sure of the origins of your data, but you could have survivor bias.

As for persons not collected, if there is reason to suspect a systematic bias (selective loss based on a certain exposure and outcome group), there are quantitative bias analyses as well as probabilistic quantitative bias analyses that you can conduct. To do these you would need a validation sample, or assumptions about the proportion of subjects that may fall into these scenarios. An additional approach may be to just play around with the numbers to show how many missed persons it would take to nullify the results, given that you have non-null results. There is an approach called E-values by Tyler VanderWeele that can be used to quantify the impact of selection bias on results similar to this.
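
One way to "play around with the numbers" is a simple tipping-point search: keep adding hypothetical missed persons to one cell of the table until the 95% confidence interval for the odds ratio includes 1. The sketch below is only an illustration under strong assumptions (all missed persons fall in a single cell, a Wald interval is adequate, and the counts are made up); the function names are hypothetical and it is not a substitute for a formal quantitative bias analysis.

```python
import math

def wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI for a 2x2 table with non-zero cells."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return or_, math.exp(math.log(or_) - z * se), math.exp(math.log(or_) + z * se)

def tipping_point(a, b, c, d, cell="a", max_added=1000):
    """Smallest number of persons added to `cell` so that the CI includes 1.

    Returns None if the interval still excludes 1 after `max_added` persons.
    """
    counts = {"a": a, "b": b, "c": c, "d": d}
    for k in range(max_added + 1):
        trial = dict(counts)
        trial[cell] += k  # assume every missed person belongs to this cell
        _, lo, hi = wald_ci(trial["a"], trial["b"], trial["c"], trial["d"])
        if lo <= 1.0 <= hi:
            return k
    return None

# Made-up table where designated areas look protective (OR < 1).
# Cell "a" = workers with designated areas who were diagnosed with cancer.
k = tipping_point(10, 190, 30, 170, cell="a")
print("missed exposed cases needed to nullify the result:", k)
```

If the number returned is implausibly large relative to the number of non-responders, that is some reassurance; if it is small, the result is fragile.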
Thank you hlsmith - that is really helpful and reassuring.

The origins of the data are a survey of all workers in that industry - so it is their choice whether to complete the survey. Does this have any bearing on the analysis?

I have a couple more questions if you don't mind:

1) Is there any reason why I shouldn't/couldn't use the Risk Ratio instead?

2) We have another field (smoker/non-smoker) not included in my example. If we were to target these workers with some further survey questions in the future, would we then need to take this into account when analysing the data returned? In effect, we would be taking a sample from within a sample.

3) I am planning on using RStudio to calculate the odds ratio as mentioned in the original post - do you happen to know which is the best package to use for this?




Less is more. Stay pure. Stay poor.
I would be transparent when reporting results about the threat of survivor or selection bias.

The ideal estimate to report is the risk difference. However, the default in survey data is to report odds ratios, since the outcome and exposure were collected at the same time (cross-sectionally), so you may not necessarily know the ordering of the exposure and outcome.
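
To make the difference between these measures concrete, here is a small sketch that computes the risk difference, risk ratio and odds ratio from the same 2x2 table. The counts are made up, the function name is hypothetical, and in a cross-sectional survey the "risks" are really prevalences.

```python
def effect_measures(a, b, c, d):
    """Risk difference, risk ratio and odds ratio from a 2x2 table.

    a: exposed, outcome present     b: exposed, outcome absent
    c: unexposed, outcome present   d: unexposed, outcome absent
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return {
        "risk_difference": risk_exposed - risk_unexposed,
        "risk_ratio": risk_exposed / risk_unexposed,
        "odds_ratio": (a * d) / (b * c),
    }

# Made-up counts: 10 of 200 exposed and 30 of 200 unexposed have the outcome.
m = effect_measures(10, 190, 30, 170)
for name, value in m.items():
    print(name, round(value, 3))
```

When the outcome is rare the odds ratio and risk ratio are close; as the outcome becomes more common, the odds ratio moves further from 1 than the risk ratio.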

You can incorporate smoking status via multiple logistic regression, or by using your contingency table approach for smokers and then for non-smokers (stratification: run two tables).
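
For the stratified route, one common way to combine the two tables into a single smoking-adjusted estimate is the Mantel-Haenszel pooled odds ratio. This is a swapped-in technique for illustration rather than something prescribed above, the counts are made up, and the function names are hypothetical.

```python
def stratum_or(a, b, c, d):
    """Odds ratio for a single 2x2 table (cells as a, b, c, d)."""
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio over a list of (a, b, c, d) tables."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Made-up tables: (exposed cases, exposed non-cases,
#                  unexposed cases, unexposed non-cases)
smokers = (6, 54, 20, 70)
non_smokers = (4, 136, 10, 100)

print("smokers OR:", stratum_or(*smokers))
print("non-smokers OR:", stratum_or(*non_smokers))
print("pooled (MH) OR:", mantel_haenszel_or([smokers, non_smokers]))
```

If the two stratum-specific odds ratios are very different from each other, pooling hides that, and it is better to report them separately.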

In R, glm() with family = binomial could be sufficient for multiple regression (controlling for both variables at the same time). The default logit link gives odds ratios; family = binomial(link = "log") would give risk ratios instead.