load('data/smoking.RData')
load('data/asthma.RData')
load('data/famuss.RData')
load('data/malaria.RData')
load('data/outbreak.RData')
Lab 13: Association in contingency tables
The objective of this lab is to learn how to perform
We’ll use several datasets: the case-control study of smoking and lung cancer from lecture, and asthma prevalence among a subset of NHANES respondents.
Since you should be fairly comfortable with retrieving variables from datasets by now, we’ll construct the contingency tables needed for analysis without the step of extracting and storing values for each variable separately.
# contingency table
table(smoking$group, smoking$smoking)
Smokers NonSmokers
Cancer 83 3
Control 72 14
The asthma
dataset contains observations of asthma occurrence (Y/N) and sex from a random sample of U.S. 1629 adults.
Construct a contingency table of the asthma data following the example above.
# contingency table
table(asthma$sex, asthma$asthma)
asthma no asthma
male 30 769
female 49 781
Chi-square tests
The
The test proceeds by computing the expected counts
This measures the deviation of the actual counts from the expected counts, and if the measure is sufficiently large, then the test identifies evidence of an association between the row and column variables in the table.
This is all very easy to implement in R:
# chi square test
table(smoking$group, smoking$smoking) |>
chisq.test()
Pearson's Chi-squared test with Yates' continuity correction
data: table(smoking$group, smoking$smoking)
X-squared = 6.5275, df = 1, p-value = 0.01062
The data provide evidence of an association between smoking and lung cancer (
= 6.53 on 1 df, p = 0.0106).
Recall that by default, R adjusts the test statistic slightly (the “continuity correction”).
Test for an association between asthma and sex at the 5% level. Interpret the result in context.
# chi square test
table(asthma$sex, asthma$asthma) |>
chisq.test()
Pearson's Chi-squared test with Yates' continuity correction
data: table(asthma$sex, asthma$asthma)
X-squared = 3.6217, df = 1, p-value = 0.05703
The data do not provide evidence of an association between asthma and sex.
One nice thing about the implementation is that the test result doesn’t depend on the orientation of the table. For example:
# chi square test
table(smoking$group, smoking$smoking) |>
chisq.test()
Pearson's Chi-squared test with Yates' continuity correction
data: table(smoking$group, smoking$smoking)
X-squared = 6.5275, df = 1, p-value = 0.01062
# flip orientation of table
table(smoking$smoking, smoking$group) |>
chisq.test()
Pearson's Chi-squared test with Yates' continuity correction
data: table(smoking$smoking, smoking$group)
X-squared = 6.5275, df = 1, p-value = 0.01062
If we wish to recover expected counts or residuals – cell-wise normalized differences between actual and expected counts – we can store the result of the test and extract those quantities:
# store test result
<- table(smoking$group, smoking$smoking) |>
test_rslt chisq.test()
# expected counts
$expected test_rslt
Smokers NonSmokers
Cancer 77.5 8.5
Control 77.5 8.5
# residuals
$residuals test_rslt
Smokers NonSmokers
Cancer 0.624758 -1.886484
Control -0.624758 1.886484
The residuals identify which cells deviate the most from the expected counts. In this case:
- a higher-than-expected number of smokers are in the cancer group (0.625)
- a lower-than-expected number of nonsmokers are in the control group (-1.886)
- a lower-than-expected number of smokers are in the cancer group (-0.625)
- a higher-than-expected number of nonsmokers are in the control group (1.886)
The residuals in each group have the same in absolute values because the row totals are the same – there are 86 individuals in each group. This will not always be the case.
Compute the residuals for the test of association between asthma and sex and identify which cells deviate most from the expected counts under independence.
# store test result
<- table(asthma$sex, asthma$asthma) |>
test_rslt chisq.test()
# residuals
$residuals test_rslt
asthma no asthma
male -1.4053932 0.3172821
female 1.3788982 -0.3113006
Asthma prevalence for men is lower than expected under independence and prevalence for women is higher.
tests in tables
Chi-square tests can also be performed for general two-way tables of arbitrary dimension. The implementation is identical to the 2-by-2 case.
# contingency table of race and genotype
table(famuss$race, famuss$genotype)
CC CT TT
African Am 16 6 5
Asian 21 18 16
Caucasian 125 216 126
Hispanic 4 10 9
Other 7 11 5
# test whether race is associated with genotype
table(famuss$race, famuss$genotype) |> chisq.test()
Pearson's Chi-squared test
data: table(famuss$race, famuss$genotype)
X-squared = 19.4, df = 8, p-value = 0.01286
# check residuals to explain association
<- table(famuss$race, famuss$genotype) |> chisq.test()
test_rslt $residuals test_rslt
CC CT TT
African Am 2.90863193 -1.69802497 -0.85310170
Asian 1.25242978 -1.24720387 0.28971360
Caucasian -0.92538910 0.77888407 -0.03244366
Hispanic -1.03920927 -0.02804356 1.11294757
Other 0.12088363 0.28678513 -0.49045147
The data provide evidence of an association between race and genotype (
= 19.4 on 8 degrees of freedom, p = 0.01286).
Moreover, the largest absolute residuals explain inferred association:
African American and Asian populations have higher CC and lower CT frequencies than would be expected if genotype were independent of race.
Use the famuss
data to test for an association between sex and genotype. Based on the residuals, explain the association, to the extent that there is one.
# contingency table
table(famuss$sex, famuss$genotype)
CC CT TT
Female 106 149 98
Male 67 112 63
# test whether race is associated with genotype
table(famuss$sex, famuss$genotype) |> chisq.test()
Pearson's Chi-squared test
data: table(famuss$sex, famuss$genotype)
X-squared = 0.97208, df = 2, p-value = 0.6151
# check residuals to explain association
<- table(famuss$sex, famuss$genotype) |> chisq.test()
test_rslt $residuals test_rslt
CC CT TT
Female 0.3319542 -0.4697464 0.2539930
Male -0.4009201 0.5673397 -0.3067619
The data do not provide evidence of an association between sex and genotype. That said, the largest deviation from independence is among men with genotype CT, which occurs more often in the data than expected.
Fisher’s exact test
In lecture, we encountered an example of a study for which the assumptions for the
# chi square test for malaria study
<- table(malaria) |> chisq.test() test_rslt
Warning in chisq.test(table(malaria)): Chi-squared approximation may be
incorrect
This is generated because some expected counts are smaller than 5, which is unsurprising considering the group sizes (6 and 14).
# expected counts
$expected test_rslt
outcome
treatment no infection infection
placebo 2.7 3.3
vaccine 6.3 7.7
As an alternative, we can use an exact test based on hypergeometric probabilities. Fisher’s exact test computes a probability for the entire table under the assumption of independence, and then tallies up the likelihood of all tables that at least as unlikely as the observed counts. (Have another look at the construction of the test from lecture if needed here.)
The implementation of the exact test is straightforward in R:
# fisher's exact test
fisher.test(malaria$treatment, malaria$outcome)
Fisher's Exact Test for Count Data
data: malaria$treatment and malaria$outcome
p-value = 0.01409
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.0000000 0.7471714
sample estimates:
odds ratio
0
The data provide evidence of an effect of the vaccine (Fisher’s exact test, p = 0.01409).
For now, you can ignore the point estimate and confidence interval; those are for the odds ratio, which we’ll discuss next time.
Using the cyclosporiasis outbreak data (outbreak
), perform an exact test for association between raspberry consumption and infection.
# contingency table
table(outbreak$exposure, outbreak$group)
case control
no raspberries 9 56
raspberries 21 4
# exact test of association
fisher.test(outbreak$exposure, outbreak$group)
Fisher's Exact Test for Count Data
data: outbreak$exposure and outbreak$group
p-value = 6.183e-10
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.006527972 0.124422348
sample estimates:
odds ratio
0.03260011