Lab 11: Association in 2x2 tables and odds ratios

Course activity

STAT218

The objective of this lab is to learn how to perform $χ^{2}$ tests of association for two-by-two tables and inference for odds ratios in R.

We’ll use two datasets: the case-control study of smoking and lung cancer from lecture, and asthma prevalence among a subset of NHANES respondents.

library(epitools)
load('data/smoking.RData')
load('data/asthma.RData')

Since you should be fairly comfortable with retrieving variables from datasets by now, we’ll construct the contingency tables needed for analysis without the step of extracting and storing values for each variable separately.

# contingency table
table(smoking$group, smoking$smoking)

         
          Smokers NonSmokers
  Cancer       83          3
  Control      72         14

The asthma dataset contains observations of asthma occurrence (Y/N) and sex from a random sample of 1629 U.S. adults.

Your turn

Construct a contingency table of the asthma data following the example above.

Solution

# contingency table
table(asthma$sex, asthma$asthma)

        
         asthma no asthma
  male       30       769
  female     49       781

Chi-square tests

The $χ^{2}$ test of association tests the hypothesis and alternative:

$\begin{cases} H_{0} : \text{smoking is independent of cancer} \\ H_{A} : \text{smoking is not independent of cancer} \end{cases}$

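The test proceeds by computing the expected counts ${\hat{n}}_{i j}$ in each cell of the table under $H_{0}$, holding the row totals $n_{i \cdot}$ and column totals $n_{\cdot j}$ fixed: ${\hat{n}}_{i j} = \frac{n_{i \cdot} n_{\cdot j}}{n}$, where $n$ is the total sample size. The statistic is: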

$χ^{2} = \sum_{i, j} \frac{(n_{i j} - {\hat{n}}_{i j})^{2}}{{\hat{n}}_{i j}}$

This measures the deviation of the actual counts from the expected counts, and if the measure is sufficiently large, then the test identifies evidence of an association between the row and column variables in the table.
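As a rough illustration of the computation just described, the sketch below re-enters the smoking counts from the contingency table shown earlier (so it runs without the data file) and computes the expected counts and the statistic by hand. The object names (obs, expected, x2, p_value) are illustrative and not part of the lab's code, and no continuity correction is applied.

```r
# observed counts, re-entered from the smoking contingency table
obs <- matrix(c(83, 72, 3, 14), nrow = 2,
              dimnames = list(c("Cancer", "Control"),
                              c("Smokers", "NonSmokers")))

# expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
expected

# Pearson chi-square statistic (no continuity correction)
x2 <- sum((obs - expected)^2 / expected)

# upper-tail p-value on (2 - 1) * (2 - 1) = 1 degree of freedom
p_value <- pchisq(x2, df = 1, lower.tail = FALSE)
c(statistic = x2, p.value = p_value)
```

Because no continuity correction is applied here, this statistic is somewhat larger than the value chisq.test() reports by default for a two-by-two table.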

This is all very easy to implement in R:

# chi square test
table(smoking$group, smoking$smoking) |>
  chisq.test()


    Pearson's Chi-squared test with Yates' continuity correction

data:  table(smoking$group, smoking$smoking)
X-squared = 6.5275, df = 1, p-value = 0.01062

The data provide evidence of an association between smoking and lung cancer ( $χ^{2}$ = 6.53 on 1 df, p = 0.0106).

Recall that by default, R adjusts the test statistic slightly (the “continuity correction”).
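To see how much the adjustment matters, one option is to run the test with and without it. The sketch below re-enters the smoking counts by hand (so it runs without the data file) and also reproduces the corrected statistic manually by shrinking each absolute deviation by 0.5; the object names are illustrative.

```r
# observed counts, re-entered from the smoking contingency table
obs <- matrix(c(83, 72, 3, 14), nrow = 2,
              dimnames = list(c("Cancer", "Control"),
                              c("Smokers", "NonSmokers")))

# default: Yates' continuity correction is applied to 2x2 tables
with_corr <- chisq.test(obs)

# same test with the correction switched off
without_corr <- chisq.test(obs, correct = FALSE)

# corrected statistic by hand: subtract 0.5 from each |observed - expected|
expected <- without_corr$expected
x2_yates <- sum((abs(obs - expected) - 0.5)^2 / expected)

c(corrected = unname(with_corr$statistic),
  uncorrected = unname(without_corr$statistic),
  by_hand = x2_yates)
```

The correction always moves the statistic toward zero, so the corrected test is the more conservative of the two.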

Your turn

Test for an association between asthma and sex at the 5% level. Interpret the result in context.

Solution

# chi square test
table(asthma$sex, asthma$asthma) |>
  chisq.test()


    Pearson's Chi-squared test with Yates' continuity correction

data:  table(asthma$sex, asthma$asthma)
X-squared = 3.6217, df = 1, p-value = 0.05703

The data do not provide evidence of an association between asthma and sex at the 5% level ( $χ^{2}$ = 3.62 on 1 df, p = 0.057).

One nice thing about the implementation is that the test result doesn’t depend on the orientation of the table. For example:

# chi square test
table(smoking$group, smoking$smoking) |>
  chisq.test()


    Pearson's Chi-squared test with Yates' continuity correction

data:  table(smoking$group, smoking$smoking)
X-squared = 6.5275, df = 1, p-value = 0.01062

# flip orientation of table
table(smoking$smoking, smoking$group) |>
  chisq.test()


    Pearson's Chi-squared test with Yates' continuity correction

data:  table(smoking$smoking, smoking$group)
X-squared = 6.5275, df = 1, p-value = 0.01062

If we wish to recover expected counts or residuals – cell-wise normalized differences between actual and expected counts – we can store the result of the test and extract those quantities:

# store test result
test_rslt <- table(smoking$group, smoking$smoking) |>
  chisq.test()

# expected counts
test_rslt$expected

         
          Smokers NonSmokers
  Cancer     77.5        8.5
  Control    77.5        8.5

# residuals
test_rslt$residuals

         
            Smokers NonSmokers
  Cancer   0.624758  -1.886484
  Control -0.624758   1.886484

The residuals identify which cells deviate the most from the expected counts.
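For reference, these are Pearson residuals: the difference between the observed and expected count in each cell, divided by the square root of the expected count. A minimal sketch that reproduces them by hand, with the smoking counts re-entered so that it runs without the data file (object names are illustrative):

```r
# observed counts, re-entered from the smoking contingency table
obs <- matrix(c(83, 72, 3, 14), nrow = 2,
              dimnames = list(c("Cancer", "Control"),
                              c("Smokers", "NonSmokers")))

# expected counts under independence
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson_resid <- (obs - expected) / sqrt(expected)
round(pearson_resid, 6)

# the squared residuals add up to the uncorrected chi-square statistic
sum(pearson_resid^2)
```

Here the NonSmokers column has the largest residuals in absolute value, so those cells contribute most to the statistic.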

Your turn

Compute the residuals for the test of association between asthma and sex and identify which cells deviate most from the expected counts under independence.

Solution

# store test result
test_rslt <- table(asthma$sex, asthma$asthma) |>
  chisq.test()

# residuals
test_rslt$residuals

        
             asthma  no asthma
  male   -1.4053932  0.3172821
  female  1.3788982 -0.3113006

The two asthma cells deviate most from their expected counts: asthma prevalence for men is lower than expected under independence and prevalence for women is higher.

Inference for odds ratios

The odds of an event are the probability that it occurs relative to the probability that it does not; for instance, if the odds of winning a bet are 3, that means that you’re three times as likely to win as to lose, i.e., in terms of probabilities, $\frac{\Pr(\text{win})}{\Pr(\text{lose})} = 3$ .

An odds ratio is a multiplicative comparison of odds under two circumstances. For example, if the odds of developing cancer among smokers are 2 (twice as likely to get cancer as not), and the odds of developing cancer among nonsmokers are 0.5 (half as likely to get cancer as not), then the odds ratio is $\frac{2}{0.5} = 4$ . This would mean that the odds of developing cancer are four times higher among smokers compared with nonsmokers.
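The arithmetic in these two examples can be sketched in a few lines. The helper odds() below is hypothetical (it is not from epitools or base R), and the probabilities 2/3 and 1/3 are simply the values implied by odds of 2 and 0.5.

```r
# hypothetical helper: convert a probability into odds
odds <- function(p) p / (1 - p)

# a win probability of 0.75 corresponds to odds of 3
odds(0.75)

# odds of 2 and 0.5 correspond to probabilities of 2/3 and 1/3
odds_smokers <- odds(2 / 3)
odds_nonsmokers <- odds(1 / 3)

# odds ratio: multiplicative comparison of the two odds
odds_ratio <- odds_smokers / odds_nonsmokers
odds_ratio
```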

Odds ratios can be estimated directly from a contingency table. For example, the odds that a person is a smoker are about 5.4 times higher among cancer patients than among controls:

# contingency table
table(smoking$group, smoking$smoking)

         
          Smokers NonSmokers
  Cancer       83          3
  Control      72         14

# odds ratio (cancer/control) of smoking
(83/3)/(72/14)

[1] 5.37963

Somewhat miraculously, the odds ratio computed along one orientation is the same as that computed along the opposite orientation. That is, if we had a random sample rather than a case-control study, the odds ratio of cancer among smokers compared with nonsmokers would be:

# hypothetically, odds ratio (smokers/nonsmokers) of cancer
(83/72)/(3/14)

[1] 5.37963

This is exactly the same!
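The reason is algebraic: writing the first row of the table as $a, b$ and the second as $c, d$, both calculations reduce to the cross-product ratio, $\frac{a / b}{c / d} = \frac{a / c}{b / d} = \frac{a d}{b c}$ . A minimal sketch, with the smoking counts re-entered by hand and a hypothetical helper cross_product_or():

```r
# observed counts, re-entered from the smoking contingency table
obs <- matrix(c(83, 72, 3, 14), nrow = 2,
              dimnames = list(c("Cancer", "Control"),
                              c("Smokers", "NonSmokers")))

# hypothetical helper: cross-product ratio (a * d) / (b * c) of a 2x2 table
cross_product_or <- function(tab) {
  (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
}

# odds ratio with groups in rows, and again with the table transposed
or_rows <- cross_product_or(obs)
or_cols <- cross_product_or(t(obs))
c(or_rows, or_cols)
```

Transposing the table swaps $b$ and $c$, which leaves the product $b c$ unchanged.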

Your turn

Using the asthma data…

compute the odds ratio of asthma among women compared with men
compute the odds ratio of being a woman among asthmatics compared with non-asthmatics

You should find that they are the same!

Solution

# contingency table
table(asthma$sex, asthma$asthma)

        
         asthma no asthma
  male       30       769
  female     49       781

# odds ratio (women/men) of asthma
(49/781)/(30/769)

[1] 1.608237

# odds ratio (asthma/no asthma) of being a woman
(49/30)/(781/769)

[1] 1.608237

Notice, however, that if you invert the comparison, or compute the odds of the complementary event, you will get different results. For the smoking data, here is an exhaustive list of all of the odds ratios we could compute:

# contingency table
table(smoking$group, smoking$smoking)

         
          Smokers NonSmokers
  Cancer       83          3
  Control      72         14

# odds of cancer (smokers/nonsmokers)
(83/72)/(3/14)

[1] 5.37963

# odds of cancer (nonsmokers/smokers)
(3/14)/(83/72)

[1] 0.1858864

# odds of not getting cancer (nonsmokers/smokers)
(14/3)/(72/83)

[1] 5.37963

# odds of not getting cancer (smokers/nonsmokers)
(72/83)/(14/3)

[1] 0.1858864

# odds of smoking (cancer/control)
(83/3)/(72/14)

[1] 5.37963

# odds of smoking (control/cancer)
(72/14)/(83/3)

[1] 0.1858864

# odds of not smoking (control/cancer)
(14/72)/(3/83)

[1] 5.37963

# odds of not smoking (cancer/control)
(3/83)/(14/72)

[1] 0.1858864

You will notice that there are two odds ratios that are possible to compute, each of which may be interpreted in one of four ways, and which are reciprocals of one another.
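The same pattern can be checked by reordering the table directly: reversing only the rows or only the columns inverts the odds ratio, while reversing both leaves it unchanged. The sketch below re-enters the smoking counts by hand and uses a hypothetical helper cross_product_or().

```r
# observed counts, re-entered from the smoking contingency table
obs <- matrix(c(83, 72, 3, 14), nrow = 2,
              dimnames = list(c("Cancer", "Control"),
                              c("Smokers", "NonSmokers")))

# hypothetical helper: cross-product ratio (a * d) / (b * c) of a 2x2 table
cross_product_or <- function(tab) {
  (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
}

or_default  <- cross_product_or(obs)              # original ordering
or_rev_rows <- cross_product_or(obs[2:1, ])       # rows reversed
or_rev_cols <- cross_product_or(obs[, 2:1])       # columns reversed
or_rev_both <- cross_product_or(obs[2:1, 2:1])    # both reversed

c(default = or_default, rev_rows = or_rev_rows,
  rev_cols = or_rev_cols, rev_both = or_rev_both)
```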

The epitools package has a function oddsratio(...) which takes the two variables as input and returns an estimated odds ratio, a confidence interval, and tests of association:

# inference for odds ratio
oddsratio(smoking$smoking, smoking$group, method = 'wald')

$data
            Outcome
Predictor    Cancer Control Total
  Smokers        83      72   155
  NonSmokers      3      14    17
  Total          86      86   172

$measure
            odds ratio with 95% C.I.
Predictor    estimate    lower    upper
  Smokers     1.00000       NA       NA
  NonSmokers  5.37963 1.486376 19.47045

$p.value
            two-sided
Predictor     midp.exact fisher.exact  chi.square
  Smokers             NA           NA          NA
  NonSmokers 0.005116319  0.008822805 0.004948149

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

By default, it reports the odds ratio for the second outcome, comparing the second predictor group with the first. So the output above estimates the odds ratio of not having cancer among nonsmokers compared with smokers. The rev = ... argument allows you to reverse the order of the rows, the columns, or both.

To orient the table correctly, we need to change the order of both rows and columns.

# inference for odds ratio
oddsratio(smoking$smoking, smoking$group, method = 'wald', rev = 'both')

$data
            Outcome
Predictor    Control Cancer Total
  NonSmokers      14      3    17
  Smokers         72     83   155
  Total           86     86   172

$measure
            odds ratio with 95% C.I.
Predictor    estimate    lower    upper
  NonSmokers  1.00000       NA       NA
  Smokers     5.37963 1.486376 19.47045

$p.value
            two-sided
Predictor     midp.exact fisher.exact  chi.square
  NonSmokers          NA           NA          NA
  Smokers    0.005116319  0.008822805 0.004948149

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

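The estimate is the same, but only because the odds ratio we wanted happens to equal the one reported by default: reversing both rows and columns leaves the odds ratio unchanged, whereas reversing only one of them gives its reciprocal. There is no such guarantee in general, so always check the $data table to confirm that the output is oriented the way you intend.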

The result is interpreted as follows:

With 95% confidence, the odds of developing lung cancer are estimated to be between 1.49 and 19.47 times higher among smokers as compared with nonsmokers.
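The interval reported with method = 'wald' is consistent with the usual Wald construction: a normal-approximation interval for the log odds ratio with standard error $\sqrt{1 / a + 1 / b + 1 / c + 1 / d}$ , exponentiated back to the odds ratio scale. The sketch below is a base R approximation of that calculation, not the epitools source; wald_or_ci() is a hypothetical helper, and the counts are re-entered from the $data table above.

```r
# hypothetical helper: Wald interval for the odds ratio of a 2x2 table
wald_or_ci <- function(tab, conf.level = 0.95) {
  est <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
  se_log <- sqrt(sum(1 / tab))               # standard error of log(odds ratio)
  z <- qnorm(1 - (1 - conf.level) / 2)       # normal critical value
  c(estimate = est,
    lower = exp(log(est) - z * se_log),
    upper = exp(log(est) + z * se_log))
}

# counts oriented as in the reversed table: nonsmokers first, controls first
tab <- matrix(c(14, 72, 3, 83), nrow = 2,
              dimnames = list(c("NonSmokers", "Smokers"),
                              c("Control", "Cancer")))

ci95 <- wald_or_ci(tab)
ci95
```

The interval is symmetric on the log scale but not on the odds ratio scale, which is why the upper limit sits much farther from the estimate than the lower limit.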

Your turn

Estimate the odds ratio of asthma among women compared with men.

Solution

# inference for odds ratio
oddsratio(asthma$sex, asthma$asthma, method = 'wald', rev = 'columns')

$data
         Outcome
Predictor no asthma asthma Total
   male         769     30   799
   female       781     49   830
   Total       1550     79  1629

$measure
         odds ratio with 95% C.I.
Predictor estimate    lower    upper
   male   1.000000       NA       NA
   female 1.608237 1.010044 2.560708

$p.value
         two-sided
Predictor midp.exact fisher.exact chi.square
   male           NA           NA         NA
   female 0.04412095   0.04961711 0.04354632

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

With 95% confidence, the odds of asthma are estimated to be between 1.01 and 2.56 times higher among women as compared with men.

Lastly, the confidence level can be adjusted using the conf.level = ... argument:

# adjust confidence level
oddsratio(smoking$smoking, smoking$group, 
          method = 'wald', rev = 'both', conf.level = 0.9)

$data
            Outcome
Predictor    Control Cancer Total
  NonSmokers      14      3    17
  Smokers         72     83   155
  Total           86     86   172

$measure
            odds ratio with 90% C.I.
Predictor    estimate   lower    upper
  NonSmokers  1.00000      NA       NA
  Smokers     5.37963 1.82785 15.83303

$p.value
            two-sided
Predictor     midp.exact fisher.exact  chi.square
  NonSmokers          NA           NA          NA
  Smokers    0.005116319  0.008822805 0.004948149

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

Practice problem

The leap dataset contains observations from the LEAP study, which aimed to ascertain whether there was a causal effect of peanut consumption/avoidance on development of peanut allergies among infants with prior risk factors.

load('data/leap.RData')
head(leap)

               group ofc.result
1 Peanut Consumption   PASS OFC
2 Peanut Consumption   PASS OFC
3   Peanut Avoidance   PASS OFC
4 Peanut Consumption   PASS OFC
5   Peanut Avoidance   PASS OFC
6 Peanut Consumption   PASS OFC

Estimate the proportion of each treatment group that developed allergies.
Test for an association between peanut consumption and allergy development.
Estimate the odds ratio of developing allergies among the consumption group compared with the avoidance group.

Solution

# proportions in each group that developed allergies
table(leap$group, leap$ofc.result) |>
  prop.table(margin = 1)

                    
                       FAIL OFC   PASS OFC
  Peanut Avoidance   0.13688213 0.86311787
  Peanut Consumption 0.01872659 0.98127341

# test for association between consumption/avoidance and allergy development
table(leap$group, leap$ofc.result) |>
  chisq.test()


    Pearson's Chi-squared test with Yates' continuity correction

data:  table(leap$group, leap$ofc.result)
X-squared = 24.286, df = 1, p-value = 8.302e-07

# estimate odds ratio of allergies among consumption group compared with avoidance group
oddsratio(leap$group, leap$ofc.result, method = 'wald', rev = 'columns')

$data
                    Outcome
Predictor            PASS OFC FAIL OFC Total
  Peanut Avoidance        227       36   263
  Peanut Consumption      262        5   267
  Total                   489       41   530

$measure
                    odds ratio with 95% C.I.
Predictor            estimate      lower     upper
  Peanut Avoidance   1.000000         NA        NA
  Peanut Consumption 0.120335 0.04643869 0.3118202

$p.value
                    two-sided
Predictor              midp.exact fisher.exact   chi.square
  Peanut Avoidance             NA           NA           NA
  Peanut Consumption 1.294368e-07 1.389354e-07 3.567075e-07

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
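In context: with 95% confidence, the odds of developing a peanut allergy (failing the OFC) in the consumption group are estimated to be between 0.046 and 0.31 times the odds in the avoidance group. As a sanity check, the proportions and the point estimate can be reproduced by hand from the counts in the $data table; the sketch below re-enters those counts so that it runs without the data file, and the object names are illustrative.

```r
# counts re-entered from the $data table above
leap_tab <- matrix(c(227, 262, 36, 5), nrow = 2,
                   dimnames = list(c("Peanut Avoidance", "Peanut Consumption"),
                                   c("PASS OFC", "FAIL OFC")))

# proportion of each group that developed an allergy (failed the OFC)
prop_fail <- leap_tab[, "FAIL OFC"] / rowSums(leap_tab)
prop_fail

# odds of allergy in each group, then the ratio (consumption / avoidance)
odds_fail <- leap_tab[, "FAIL OFC"] / leap_tab[, "PASS OFC"]
or_consumption <- unname(odds_fail["Peanut Consumption"] /
                           odds_fail["Peanut Avoidance"])
or_consumption
```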