Lab 11: Association in 2x2 tables and odds ratios
The objective of this lab is to learn how to perform chi-square tests of association in 2x2 tables and inference for odds ratios in R.
We’ll use two datasets: the case-control study of smoking and lung cancer from lecture, and asthma prevalence among a subset of NHANES respondents.
Since you should be fairly comfortable with retrieving variables from datasets by now, we’ll construct the contingency tables needed for analysis without the step of extracting and storing values for each variable separately.
# contingency table
table(smoking$group, smoking$smoking)
Smokers NonSmokers
Cancer 83 3
Control 72 14
The asthma dataset contains observations of asthma occurrence (Y/N) and sex from a random sample of 1629 U.S. adults.
Construct a contingency table of the asthma data following the example above.
# contingency table
table(asthma$sex, asthma$asthma)
asthma no asthma
male 30 769
female 49 781
Chi-square tests
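The test proceeds by computing the expected counts under the assumption of no association, $E_{ij} = (\text{row } i \text{ total}) \times (\text{column } j \text{ total}) / n$, and then the statistic $\chi^2 = \sum_{i,j} (O_{ij} - E_{ij})^2 / E_{ij}$, where $O_{ij}$ are the actual counts.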
This measures the deviation of the actual counts from the expected counts, and if the measure is sufficiently large, then the test identifies evidence of an association between the row and column variables in the table.
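To make the computation concrete, here is a minimal sketch that works out the expected counts and the statistic by hand for the smoking table. It assumes the counts shown earlier and enters them directly as a matrix called `obs` (a name chosen here for illustration) instead of using the `smoking` data. It leaves out the continuity correction, so the statistic will be somewhat larger than the one `chisq.test()` reports by default.

```r
# observed counts from the smoking table (entered by hand for illustration)
obs <- matrix(c(83, 72, 3, 14), nrow = 2,
              dimnames = list(c("Cancer", "Control"),
                              c("Smokers", "NonSmokers")))

# expected counts under no association: (row total x column total) / n
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
expected

# chi-square statistic without continuity correction
chisq_stat <- sum((obs - expected)^2 / expected)
chisq_stat

# upper-tail p-value on (2 - 1) x (2 - 1) = 1 degree of freedom
pchisq(chisq_stat, df = 1, lower.tail = FALSE)
```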
This is all very easy to implement in R:
# chi square test
table(smoking$group, smoking$smoking) |>
  chisq.test()
Pearson's Chi-squared test with Yates' continuity correction
data: table(smoking$group, smoking$smoking)
X-squared = 6.5275, df = 1, p-value = 0.01062
The data provide evidence of an association between smoking and lung cancer ($\chi^2$ = 6.53 on 1 df, p = 0.0106).
Recall that by default, R adjusts the test statistic slightly (the “continuity correction”).
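As a rough illustration of what the correction does, the sketch below reproduces the reported statistic by hand. It assumes the usual form of Yates' correction for a 2x2 table (each absolute difference between actual and expected counts is reduced by 0.5 before squaring) and again enters the smoking counts directly as a matrix `obs`, which is an illustrative name only.

```r
# observed counts from the smoking table (entered by hand for illustration)
obs <- matrix(c(83, 72, 3, 14), nrow = 2)
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Yates' correction: shrink each |observed - expected| by 0.5 before squaring
corrected <- sum((abs(obs - expected) - 0.5)^2 / expected)
uncorrected <- sum((obs - expected)^2 / expected)

# compare with chisq.test(), with and without the correction
c(by_hand = corrected,
  chisq_test = unname(chisq.test(obs)$statistic))
c(by_hand = uncorrected,
  chisq_test = unname(chisq.test(obs, correct = FALSE)$statistic))
```

Setting `correct = FALSE` turns the adjustment off; the corrected statistic is always the smaller of the two, so the default is the more conservative choice.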
Test for an association between asthma and sex at the 5% level. Interpret the result in context.
# chi square test
table(asthma$sex, asthma$asthma) |>
  chisq.test()
Pearson's Chi-squared test with Yates' continuity correction
data: table(asthma$sex, asthma$asthma)
X-squared = 3.6217, df = 1, p-value = 0.05703
The data do not provide evidence of an association between asthma and sex at the 5% level ($\chi^2$ = 3.62 on 1 df, p = 0.057).
One nice thing about the implementation is that the test result doesn’t depend on the orientation of the table. For example:
# chi square test
table(smoking$group, smoking$smoking) |>
  chisq.test()
Pearson's Chi-squared test with Yates' continuity correction
data: table(smoking$group, smoking$smoking)
X-squared = 6.5275, df = 1, p-value = 0.01062
# flip orientation of table
table(smoking$smoking, smoking$group) |>
  chisq.test()
Pearson's Chi-squared test with Yates' continuity correction
data: table(smoking$smoking, smoking$group)
X-squared = 6.5275, df = 1, p-value = 0.01062
If we wish to recover expected counts or residuals – cell-wise normalized differences between actual and expected counts – we can store the result of the test and extract those quantities:
# store test result
test_rslt <- table(smoking$group, smoking$smoking) |>
  chisq.test()
# expected counts
test_rslt$expected
Smokers NonSmokers
Cancer 77.5 8.5
Control 77.5 8.5
# residuals
test_rslt$residuals
Smokers NonSmokers
Cancer 0.624758 -1.886484
Control -0.624758 1.886484
The residuals identify which cells deviate the most from the expected counts.
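For reference, the residuals reported above are Pearson residuals, that is, (actual - expected) / sqrt(expected) for each cell. The following sketch recomputes them by hand from the smoking counts, entered directly as a matrix `obs` (an illustrative name, not an object from the lab data).

```r
# observed counts from the smoking table (entered by hand for illustration)
obs <- matrix(c(83, 72, 3, 14), nrow = 2,
              dimnames = list(c("Cancer", "Control"),
                              c("Smokers", "NonSmokers")))
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson residuals: cell-wise normalized differences
resid_by_hand <- (obs - expected) / sqrt(expected)
resid_by_hand

# the squared residuals add up to the uncorrected chi-square statistic
sum(resid_by_hand^2)
```

Because each residual is divided by the square root of its expected count, cells with small expected counts (here the nonsmokers) can show large residuals even when the raw difference is the same in every cell.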
Compute the residuals for the test of association between asthma and sex and identify which cells deviate most from the expected counts under independence.
# store test result
test_rslt <- table(asthma$sex, asthma$asthma) |>
  chisq.test()
# residuals
test_rslt$residuals
asthma no asthma
male -1.4053932 0.3172821
female 1.3788982 -0.3113006
Asthma prevalence for men is lower than expected under independence and prevalence for women is higher.
Inference for odds ratios
Odds are the relative likelihood of an event; for instance, if the odds of winning a bet are 3, that means that you’re three times as likely to win as to lose, i.e., in terms of probabilities, $P(\text{win}) = 0.75$ and $P(\text{lose}) = 0.25$, so that $P(\text{win})/P(\text{lose}) = 3$.
An odds ratio is a multiplicative comparison of odds under two circumstances. For example, if the odds of developing cancer among smokers are 2 (twice as likely to get cancer as not), and the odds of developing cancer among nonsmokers are 0.5 (half as likely to get cancer as not), then the odds ratio is $2/0.5 = 4$: the odds of developing cancer are four times as high among smokers as among nonsmokers.
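The arithmetic behind these two definitions can be sketched in a few lines. The helper names `prob_to_odds` and `odds_to_prob` are made up for this illustration and are not part of any package used in the lab.

```r
# convert a probability to odds, and odds back to a probability
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)

# betting example: odds of 3 correspond to a win probability of 0.75
prob_to_odds(0.75)
odds_to_prob(3)

# odds ratio example: odds of 2 among smokers vs. 0.5 among nonsmokers
odds_smokers <- 2
odds_nonsmokers <- 0.5
odds_smokers / odds_nonsmokers
```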
Odds ratios can be estimated directly from a contingency table. For example, the odds that a person is a smoker are about 5.4 times higher among cancer patients than among healthy individuals:
# contingency table
table(smoking$group, smoking$smoking)
Smokers NonSmokers
Cancer 83 3
Control 72 14
# odds ratio (cancer/control) of smoking
(83/3)/(72/14)
[1] 5.37963
Somewhat miraculously, the odds ratio computed along one orientation is the same as that computed along the opposite orientation. That is, if we had a random sample rather than a case-control study, the odds ratio of cancer among smokers compared with nonsmokers would be:
# hypothetically, odds ratio (smokers/nonsmokers) of cancer
(83/72)/(3/14)
[1] 5.37963
This is exactly the same!
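The reason is algebraic: writing the four cell counts as n11, n12 (first row) and n21, n22 (second row), both orientations simplify to the same cross-product ratio, (n11 x n22) / (n12 x n21). A small check with the smoking counts, using variable names chosen only for this sketch:

```r
# cell counts from the smoking table (rows: Cancer, Control;
# columns: Smokers, NonSmokers), entered by hand for illustration
n11 <- 83; n12 <- 3
n21 <- 72; n22 <- 14

or_by_row <- (n11 / n12) / (n21 / n22)      # odds of smoking, cancer vs. control
or_by_col <- (n11 / n21) / (n12 / n22)      # odds of cancer, smokers vs. nonsmokers
cross_product <- (n11 * n22) / (n12 * n21)  # what both expressions reduce to

c(or_by_row, or_by_col, cross_product)
```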
Using the asthma data…
- compute the odds ratio of asthma among women compared with men
- compute the odds ratio of being a woman among asthmatics compared with non-asthmatics
You should find that they are the same!
# contingency table
table(asthma$sex, asthma$asthma)
asthma no asthma
male 30 769
female 49 781
# odds ratio (women/men) of asthma
(49/781)/(30/769)
[1] 1.608237
# odds ratio (asthma/no asthma) of being a woman
(49/30)/(781/769)
[1] 1.608237
Notice, however, that if you invert the comparison, or compute the odds of the complementary event, you will get different results. For the smoking data, here is an exhaustive list of all of the odds ratios we could compute:
# contingency table
table(smoking$group, smoking$smoking)
Smokers NonSmokers
Cancer 83 3
Control 72 14
# odds of cancer (smokers/nonsmokers)
(83/72)/(3/14)
[1] 5.37963
# odds of cancer (nonsmokers/smokers)
(3/14)/(83/72)
[1] 0.1858864
# odds of not getting cancer (nonsmokers/smokers)
(14/3)/(72/83)
[1] 5.37963
# odds of not getting cancer (smokers/nonsmokers)
(72/83)/(14/3)
[1] 0.1858864
# odds of smoking (cancer/control)
(83/3)/(72/14)
[1] 5.37963
# odds of smoking (control/cancer)
(72/14)/(83/3)
[1] 0.1858864
# odds of not smoking (control/cancer)
(14/72)/(3/83)
[1] 5.37963
# odds of not smoking (cancer/control)
(3/83)/(14/72)
[1] 0.1858864
You will notice that there are two odds ratios that are possible to compute, each of which may be interpreted in one of four ways, and which are reciprocals of one another.
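One way to see this pattern is to treat the odds ratio as a function of the table and look at what happens when the rows or columns are swapped. The sketch below uses a hypothetical helper `odds_ratio()` written for this illustration, with the smoking counts entered directly as a matrix.

```r
# observed counts from the smoking table (entered by hand for illustration)
obs <- matrix(c(83, 72, 3, 14), nrow = 2,
              dimnames = list(c("Cancer", "Control"),
                              c("Smokers", "NonSmokers")))

# cross-product odds ratio of a 2x2 table (illustrative helper)
odds_ratio <- function(tbl) (tbl[1, 1] * tbl[2, 2]) / (tbl[1, 2] * tbl[2, 1])

or_original <- odds_ratio(obs)
or_rows     <- odds_ratio(obs[2:1, ])     # swap rows only: reciprocal
or_cols     <- odds_ratio(obs[, 2:1])     # swap columns only: reciprocal
or_both     <- odds_ratio(obs[2:1, 2:1])  # swap both: unchanged
or_flipped  <- odds_ratio(t(obs))         # transpose: unchanged

c(or_original, or_rows, or_cols, or_both, or_flipped)
```

Swapping only the rows or only the columns inverts the odds ratio, while swapping both or transposing the table leaves it alone, which accounts for the two values in the list above.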
The epitools package has a function oddsratio(...) which takes the two variables as input and returns the estimated odds ratio, a confidence interval, and a test of association:
# inference for odds ratio
oddsratio(smoking$smoking, smoking$group, method = 'wald')
Predictor Cancer Control Total
Smokers 83 72 155
NonSmokers 3 14 17
Total 86 86 172
odds ratio with 95% C.I.
Predictor estimate lower upper
Smokers 1.00000 NA NA
NonSmokers 5.37963 1.486376 19.47045
Predictor midp.exact fisher.exact chi.square
Smokers NA NA NA
NonSmokers 0.005116319 0.008822805 0.004948149
[1] "Unconditional MLE & normal approximation (Wald) CI"
By default it will produce the odds of the second outcome in the second predictor group compared with the first predictor group. So the above estimates the odds of not having cancer among nonsmokers compared with smokers. The rev = ... argument allows you to reverse either rows, columns, or both. To orient the table correctly, we need to change the order of both rows and columns.
# inference for odds ratio
oddsratio(smoking$smoking, smoking$group, method = 'wald', rev = 'both')
Predictor Control Cancer Total
NonSmokers 14 3 17
Smokers 72 83 155
Total 86 86 172
odds ratio with 95% C.I.
Predictor estimate lower upper
NonSmokers 1.00000 NA NA
Smokers 5.37963 1.486376 19.47045
Predictor midp.exact fisher.exact chi.square
NonSmokers NA NA NA
Smokers 0.005116319 0.008822805 0.004948149
[1] "Unconditional MLE & normal approximation (Wald) CI"
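The estimate is the same as before. That is because reversing both the rows and the columns leaves the odds ratio unchanged, whereas reversing only one of them gives its reciprocal; in general there is no guarantee that the default orientation matches the comparison you want, so always check the printed table.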
The result is interpreted as follows:
The odds of developing lung cancer are estimated to be between 1.49 and 19.47 times higher among smokers as compared with nonsmokers.
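To see where such an interval comes from, here is a minimal sketch of a Wald interval computed by hand. It assumes the standard normal approximation on the log odds ratio, with standard error sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22); this reproduces the interval reported above, but it is an illustration and not the epitools source code. The function name `wald_or_ci` and its `conf_level` argument are made up for this example.

```r
# observed counts, oriented so the odds ratio compares smokers with nonsmokers
obs <- matrix(c(83, 3, 72, 14), nrow = 2,
              dimnames = list(c("Smokers", "NonSmokers"),
                              c("Cancer", "Control")))

# Wald confidence interval for the odds ratio of a 2x2 table (illustrative)
wald_or_ci <- function(tbl, conf_level = 0.95) {
  log_or <- log((tbl[1, 1] * tbl[2, 2]) / (tbl[1, 2] * tbl[2, 1]))
  se <- sqrt(sum(1 / tbl))                  # standard error of the log odds ratio
  z <- qnorm(1 - (1 - conf_level) / 2)      # normal critical value
  exp(c(estimate = log_or,
        lower = log_or - z * se,
        upper = log_or + z * se))
}

wald_or_ci(obs)
```

The interval is built on the log scale and then exponentiated, which is why it is not symmetric around the estimate of 5.38.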
Estimate the odds of asthma among women compared with men.
# inference for odds ratio
oddsratio(asthma$sex, asthma$asthma, method = 'wald', rev = 'columns')
Predictor no asthma asthma Total
male 769 30 799
female 781 49 830
Total 1550 79 1629
odds ratio with 95% C.I.
Predictor estimate lower upper
male 1.000000 NA NA
female 1.608237 1.010044 2.560708
Predictor midp.exact fisher.exact chi.square
male NA NA NA
female 0.04412095 0.04961711 0.04354632
[1] "Unconditional MLE & normal approximation (Wald) CI"
The odds of asthma are estimated to be between 1.01 and 2.56 times higher among women as compared with men.
Lastly, the confidence level can be adjusted using the conf.level = ... argument:
# adjust confidence level
oddsratio(smoking$smoking, smoking$group,
method = 'wald', rev = 'both', conf.level = 0.9)
Predictor Control Cancer Total
NonSmokers 14 3 17
Smokers 72 83 155
Total 86 86 172
odds ratio with 90% C.I.
Predictor estimate lower upper
NonSmokers 1.00000 NA NA
Smokers 5.37963 1.82785 15.83303
Predictor midp.exact fisher.exact chi.square
NonSmokers NA NA NA
Smokers 0.005116319 0.008822805 0.004948149
[1] "Unconditional MLE & normal approximation (Wald) CI"
Practice problem
The leap dataset contains observations from the LEAP study, which aimed to ascertain whether there was a causal effect of peanut consumption/avoidance on development of peanut allergies among infants with prior risk factors.
group ofc.result
1 Peanut Consumption PASS OFC
2 Peanut Consumption PASS OFC
3 Peanut Avoidance PASS OFC
4 Peanut Consumption PASS OFC
5 Peanut Avoidance PASS OFC
6 Peanut Consumption PASS OFC
- Estimate the proportion of each treatment group that developed allergies.
- Test for an association between peanut consumption and allergy development.
- Estimate the odds of developing allergies among the consumption group compared with the avoidance group.
# proportions in each group that developed allergies
table(leap$group, leap$ofc.result) |>
prop.table(margin = 1)
FAIL OFC PASS OFC
Peanut Avoidance 0.13688213 0.86311787
Peanut Consumption 0.01872659 0.98127341
# test for association between consumption/avoidance and allergy development
table(leap$group, leap$ofc.result) |>
  chisq.test()
Pearson's Chi-squared test with Yates' continuity correction
data: table(leap$group, leap$ofc.result)
X-squared = 24.286, df = 1, p-value = 8.302e-07
# estimate odds ratio of allergies among consumption group compared with avoidance group
oddsratio(leap$group, leap$ofc.result, method = 'wald', rev = 'columns')
Predictor PASS OFC FAIL OFC Total
Peanut Avoidance 227 36 263
Peanut Consumption 262 5 267
Total 489 41 530
odds ratio with 95% C.I.
Predictor estimate lower upper
Peanut Avoidance 1.000000 NA NA
Peanut Consumption 0.120335 0.04643869 0.3118202
Predictor midp.exact fisher.exact chi.square
Peanut Avoidance NA NA NA
Peanut Consumption 1.294368e-07 1.389354e-07 3.567075e-07
[1] "Unconditional MLE & normal approximation (Wald) CI"