Lab 13: Association in contingency tables

Course activity

STAT218

The objective of this lab is to learn how to perform $χ^{2}$ and exact tests of association for two-way tables in R.

We’ll use several datasets: the case-control study of smoking and lung cancer from lecture, and asthma prevalence among a subset of NHANES respondents.

load('data/smoking.RData')
load('data/asthma.RData')
load('data/famuss.RData')
load('data/malaria.RData')
load('data/outbreak.RData')

Since you should be fairly comfortable with retrieving variables from datasets by now, we’ll construct the contingency tables needed for analysis without the step of extracting and storing values for each variable separately.

# contingency table
table(smoking$group, smoking$smoking)

         
          Smokers NonSmokers
  Cancer       83          3
  Control      72         14

The asthma dataset contains observations of asthma occurrence (Y/N) and sex from a random sample of U.S. 1629 adults.

Your turn

Construct a contingency table of the asthma data following the example above.

Solution

# contingency table
table(asthma$sex, asthma$asthma)

        
         asthma no asthma
  male       30       769
  female     49       781

Chi-square tests

The $χ^{2}$ test of association tests the hypothesis and alternative:

${\begin{cases} H_{0} : smoking is independent of cancer \\ H_{A} : smoking is not independent of cancer \end{cases}$

The test proceeds by computing the expected counts ${\hat{n}}_{i j}$ in each cell of the table under $H_{0}$ after fixing the marginal totals $n_{i}, n_{j}$ . The statistic is:

$χ^{2} = \sum_{i, j} \frac{(n_{i j} - {\hat{n}}_{i j})^{2}}{{\hat{n}}_{i j}}$

This measures the deviation of the actual counts from the expected counts, and if the measure is sufficiently large, then the test identifies evidence of an association between the row and column variables in the table.

This is all very easy to implement in R:

# chi square test
table(smoking$group, smoking$smoking) |>
  chisq.test()


    Pearson's Chi-squared test with Yates' continuity correction

data:  table(smoking$group, smoking$smoking)
X-squared = 6.5275, df = 1, p-value = 0.01062

The data provide evidence of an association between smoking and lung cancer ( $χ^{2}$ = 6.53 on 1 df, p = 0.0106).

Recall that by default, R adjusts the test statistic slightly (the “continuity correction”).

Your turn

Test for an association between asthma and sex at the 5% level. Interpret the result in context.

Solution

# chi square test
table(asthma$sex, asthma$asthma) |>
  chisq.test()


    Pearson's Chi-squared test with Yates' continuity correction

data:  table(asthma$sex, asthma$asthma)
X-squared = 3.6217, df = 1, p-value = 0.05703

The data do not provide evidence of an association between asthma and sex.

One nice thing about the implementation is that the test result doesn’t depend on the orientation of the table. For example:

# chi square test
table(smoking$group, smoking$smoking) |>
  chisq.test()


    Pearson's Chi-squared test with Yates' continuity correction

data:  table(smoking$group, smoking$smoking)
X-squared = 6.5275, df = 1, p-value = 0.01062

# flip orientation of table
table(smoking$smoking, smoking$group) |>
  chisq.test()


    Pearson's Chi-squared test with Yates' continuity correction

data:  table(smoking$smoking, smoking$group)
X-squared = 6.5275, df = 1, p-value = 0.01062

If we wish to recover expected counts or residuals – cell-wise normalized differences between actual and expected counts – we can store the result of the test and extract those quantities:

# store test result
test_rslt <- table(smoking$group, smoking$smoking) |>
  chisq.test()

# expected counts
test_rslt$expected

         
          Smokers NonSmokers
  Cancer     77.5        8.5
  Control    77.5        8.5

# residuals
test_rslt$residuals

         
            Smokers NonSmokers
  Cancer   0.624758  -1.886484
  Control -0.624758   1.886484

The residuals identify which cells deviate the most from the expected counts. In this case:

a higher-than-expected number of smokers are in the cancer group (0.625)
a lower-than-expected number of nonsmokers are in the control group (-1.886)
a lower-than-expected number of smokers are in the cancer group (-0.625)
a higher-than-expected number of nonsmokers are in the control group (1.886)

The residuals in each group have the same in absolute values because the row totals are the same – there are 86 individuals in each group. This will not always be the case.

Your turn

Compute the residuals for the test of association between asthma and sex and identify which cells deviate most from the expected counts under independence.

Solution

# store test result
test_rslt <- table(asthma$sex, asthma$asthma) |>
  chisq.test()

# residuals
test_rslt$residuals

        
             asthma  no asthma
  male   -1.4053932  0.3172821
  female  1.3788982 -0.3113006

Asthma prevalence for men is lower than expected under independence and prevalence for women is higher.

$χ^{2}$ tests in $I \times J$ tables

Chi-square tests can also be performed for general two-way tables of arbitrary dimension. The implementation is identical to the 2-by-2 case.

# contingency table of race and genotype
table(famuss$race, famuss$genotype)

            
              CC  CT  TT
  African Am  16   6   5
  Asian       21  18  16
  Caucasian  125 216 126
  Hispanic     4  10   9
  Other        7  11   5

# test whether race is associated with genotype
table(famuss$race, famuss$genotype) |> chisq.test()


    Pearson's Chi-squared test

data:  table(famuss$race, famuss$genotype)
X-squared = 19.4, df = 8, p-value = 0.01286

# check residuals to explain association
test_rslt <- table(famuss$race, famuss$genotype) |> chisq.test()
test_rslt$residuals

            
                      CC          CT          TT
  African Am  2.90863193 -1.69802497 -0.85310170
  Asian       1.25242978 -1.24720387  0.28971360
  Caucasian  -0.92538910  0.77888407 -0.03244366
  Hispanic   -1.03920927 -0.02804356  1.11294757
  Other       0.12088363  0.28678513 -0.49045147

The data provide evidence of an association between race and genotype ( $χ^{2}$ = 19.4 on 8 degrees of freedom, p = 0.01286).

Moreover, the largest absolute residuals explain inferred association:

African American and Asian populations have higher CC and lower CT frequencies than would be expected if genotype were independent of race.

Your turn

Use the famuss data to test for an association between sex and genotype. Based on the residuals, explain the association, to the extent that there is one.

Solution

# contingency table
table(famuss$sex, famuss$genotype)

        
          CC  CT  TT
  Female 106 149  98
  Male    67 112  63

# test whether race is associated with genotype
table(famuss$sex, famuss$genotype) |> chisq.test()


    Pearson's Chi-squared test

data:  table(famuss$sex, famuss$genotype)
X-squared = 0.97208, df = 2, p-value = 0.6151

# check residuals to explain association
test_rslt <- table(famuss$sex, famuss$genotype) |> chisq.test()
test_rslt$residuals

        
                 CC         CT         TT
  Female  0.3319542 -0.4697464  0.2539930
  Male   -0.4009201  0.5673397 -0.3067619

The data do not provide evidence of an association between sex and genotype. That said, the largest deviation from independence is among men with genotype CT, which occurs more often in the data than expected.

Fisher’s exact test

In lecture, we encountered an example of a study for which the assumptions for the $χ^{2}$ test of association do not hold. Notice the warning message when we try to carry out the test:

# chi square test for malaria study
test_rslt <- table(malaria) |> chisq.test()

Warning in chisq.test(table(malaria)): Chi-squared approximation may be
incorrect

This is generated because some expected counts are smaller than 5, which is unsurprising considering the group sizes (6 and 14).

# expected counts
test_rslt$expected

         outcome
treatment no infection infection
  placebo          2.7       3.3
  vaccine          6.3       7.7

As an alternative, we can use an exact test based on hypergeometric probabilities. Fisher’s exact test computes a probability for the entire table under the assumption of independence, and then tallies up the likelihood of all tables that at least as unlikely as the observed counts. (Have another look at the construction of the test from lecture if needed here.)

The implementation of the exact test is straightforward in R:

# fisher's exact test
fisher.test(malaria$treatment, malaria$outcome)


    Fisher's Exact Test for Count Data

data:  malaria$treatment and malaria$outcome
p-value = 0.01409
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.0000000 0.7471714
sample estimates:
odds ratio 
         0

The data provide evidence of an effect of the vaccine (Fisher’s exact test, p = 0.01409).

For now, you can ignore the point estimate and confidence interval; those are for the odds ratio, which we’ll discuss next time.

Your turn

Using the cyclosporiasis outbreak data (outbreak), perform an exact test for association between raspberry consumption and infection.

Solution

# contingency table
table(outbreak$exposure, outbreak$group)

                
                 case control
  no raspberries    9      56
  raspberries      21       4

# exact test of association
fisher.test(outbreak$exposure, outbreak$group)


    Fisher's Exact Test for Count Data

data:  outbreak$exposure and outbreak$group
p-value = 6.183e-10
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.006527972 0.124422348
sample estimates:
odds ratio 
0.03260011

Chi-square tests

χ2 tests in I×J tables

Fisher’s exact test

$χ^{2}$ tests in $I \times J$ tables