From a subsample of NHANES data:
| | asthma | no asthma |
---|---|---|
male | 30 | 769 |
female | 49 | 781 |
Test for a difference in prevalence:
2-sample test for equality of proportions with continuity correction
data: table(asthma$sex, asthma$asthma)
X-squared = 3.6217, df = 1, p-value = 0.05703
alternative hypothesis: two.sided
95 percent confidence interval:
-0.0434742223 0.0004958005
sample estimates:
prop 1 prop 2
0.03754693 0.05903614
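Without the continuity correction, the data provide evidence that asthma prevalence differs between men and women (Z = 2.018, p = 0.0436): with 95% confidence, the prevalence among women is estimated to be between 0.07 and 4.22 percentage points higher than that among men. With the continuity correction used in the output above, the evidence is weaker (p = 0.057) and the interval for the difference narrowly includes zero.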
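As a rough check, both versions of the test can be reproduced from the four table counts alone. This is only a sketch: the `asthma` data frame is not available here, so the counts are typed in by hand and the variable names are illustrative.

```r
# Counts from the NHANES subsample table (variable names are illustrative)
cases  <- c(male = 30, female = 49)               # people with asthma
totals <- c(male = 30 + 769, female = 49 + 781)   # group sizes

# R's default applies the continuity correction, as in the printed output
corrected <- prop.test(cases, totals, correct = TRUE)

# Without the correction the test is equivalent to the pooled two-sample Z test
uncorrected <- prop.test(cases, totals, correct = FALSE)
z <- sqrt(unname(uncorrected$statistic))          # |Z| is the square root of X-squared

corrected$p.value
uncorrected$p.value
z
uncorrected$conf.int
```

The interval is reported for men minus women, so negative values correspond to higher prevalence among women.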
Consider applying the inference on proportions to a case-control study:
| | Smokers | NonSmokers | n |
---|---|---|---|
Cancer | 83 | 3 | 86 |
Control | 72 | 14 | 86 |
2-sample test for equality of proportions with continuity correction
data: smoking.tbl
X-squared = 6.5275, df = 1, p-value = 0.01062
alternative hypothesis: two.sided
95 percent confidence interval:
0.029149 0.226665
sample estimates:
prop 1 prop 2
0.9651163 0.8372093
With 95% confidence, the share of smokers is estimated to be between 2.9 and 22.7 percentage points higher among cancer patients.
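The output above can be reproduced from the table counts. The name `smoking.tbl` appears in the output, but how it was built is not shown, so the construction below is an assumption. When `prop.test` receives a two-column matrix, it treats the columns as counts of successes and failures.

```r
# Rows are groups; columns are (smokers, non-smokers), i.e. successes and failures.
# The construction of smoking.tbl is an assumption; only its name is in the source.
smoking.tbl <- matrix(c(83, 72, 3, 14), nrow = 2,
                      dimnames = list(c("Cancer", "Control"),
                                      c("Smokers", "NonSmokers")))

result <- prop.test(smoking.tbl)   # continuity correction applied by default

result$estimate   # share of smokers in each group
result$conf.int   # interval for the difference (Cancer minus Control)
result$p.value
```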
This tells us how much the probability of smoking increases if one has cancer.
But we’d rather know how much the probability of cancer increases if one smokes.
Can we do inference another way?
Consider the more general hypothesis that smoking and cancer are independent:
Cell-wise and marginal proportions:
| | Smokers | NonSmokers | total |
---|---|---|---|
Cancer | 0.4826 | 0.01744 | 0.5 |
Control | 0.4186 | 0.0814 | 0.5 |
total | 0.9012 | 0.09884 | 1 |
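Under independence we’d expect each cell-wise proportion to equal the product of its marginal proportions:
$$P(\text{row } i \text{ and column } j) = P(\text{row } i) \times P(\text{column } j)$$
For example, for smokers with cancer:
$$0.5 \times 0.9012 = 0.4506$$
Expected proportions translate directly to expected counts by multiplying by the sample size:
$$172 \times 0.4506 = 77.5$$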
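Actual counts:
| | O1 | O2 | total |
---|---|---|---|
G1 | $n_{11}$ | $n_{12}$ | $n_{1\cdot}$ |
G2 | $n_{21}$ | $n_{22}$ | $n_{2\cdot}$ |
total | $n_{\cdot 1}$ | $n_{\cdot 2}$ | $n$ |
Expected counts under independence:
| | O1 | O2 | total |
---|---|---|---|
G1 | $n_{1\cdot} n_{\cdot 1} / n$ | $n_{1\cdot} n_{\cdot 2} / n$ | $n_{1\cdot}$ |
G2 | $n_{2\cdot} n_{\cdot 1} / n$ | $n_{2\cdot} n_{\cdot 2} / n$ | $n_{2\cdot}$ |
total | $n_{\cdot 1}$ | $n_{\cdot 2}$ | $n$ |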
Idea for a test: reject $H_0$ (independence) if the actual counts differ too much from the expected counts.
This is more general than inference of proportions because it doesn’t depend on which variable is treated as the grouping and which as the outcome.
For the case-control study, actual counts:
| | Smokers | NonSmokers | total |
---|---|---|---|
Cancer | 83 | 3 | 86 |
Control | 72 | 14 | 86 |
total | 155 | 17 | 172 |
Expected counts under independence:
| | Smokers | NonSmokers | total |
---|---|---|---|
Cancer | 77.5 | 8.5 | 86 |
Control | 77.5 | 8.5 | 86 |
total | 155 | 17 | 172 |
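The expected counts can be computed directly from the margins of the observed table. The sketch below types in the case-control counts by hand (object names are illustrative) and uses row total times column total divided by the grand total.

```r
# Observed case-control counts, entered by hand (names are illustrative)
observed <- matrix(c(83, 72, 3, 14), nrow = 2,
                   dimnames = list(group = c("Cancer", "Control"),
                                   smoking = c("Smokers", "NonSmokers")))

n <- sum(observed)

# Expected count for each cell: (row total) * (column total) / n
expected <- outer(rowSums(observed), colSums(observed)) / n
expected

# chisq.test() stores the same quantity in its $expected component
all.equal(unname(expected), unname(chisq.test(observed)$expected))
```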
A measure of the amount by which actual counts differ from expected counts under independence is the chi (pronounced /ˈkaɪ/) square statistic:
$$\chi^2 = \sum_{\text{cells}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
Cell-wise calculation:
| | Smokers | NonSmokers |
---|---|---|
Cancer | $(83 - 77.5)^2 / 77.5$ | $(3 - 8.5)^2 / 8.5$ |
Control | $(72 - 77.5)^2 / 77.5$ | $(14 - 8.5)^2 / 8.5$ |
Result:
| | Smokers | NonSmokers |
---|---|---|
Cancer | 0.3903 | 3.559 |
Control | 0.3903 | 3.559 |
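Chi-square statistic:
$$\chi^2 = 0.3903 + 3.559 + 0.3903 + 3.559 \approx 7.898$$
Under $H_0$ (independence), the $\chi^2$ statistic approximately follows a chi-square model with $(2 - 1) \times (2 - 1) = 1$ degree of freedom.
The model assumes no expected counts are too small.
To determine the test outcome, find the $p$-value: the probability, under the chi-square model, of a statistic at least as large as the one observed.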
The data provide evidence of an association between smoking and lung cancer ($\chi^2$ = 7.898 on 1 degree of freedom, p = 0.0049).
If smoking and cancer were independent, only 0.49% of random samples would produce a table that deviates from expected counts by more than what we observed.
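The statistic and $p$-value can be checked by hand in R. This is a minimal sketch with the counts typed in and illustrative object names; it sums the cell-wise contributions and takes the upper tail of the chi-square model.

```r
# Observed case-control counts, entered by hand (names are illustrative)
observed <- matrix(c(83, 72, 3, 14), nrow = 2,
                   dimnames = list(group = c("Cancer", "Control"),
                                   smoking = c("Smokers", "NonSmokers")))
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Cell-wise contributions and their sum
contributions <- (observed - expected)^2 / expected
chi_sq <- sum(contributions)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value: upper-tail area of the chi-square model beyond the statistic
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

round(contributions, 4)
c(chi_sq = chi_sq, df = df, p_value = p_value)
```

The same statistic is returned by `chisq.test(observed, correct = FALSE)`.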
The residual for each cell is defined as a standardized difference between the observed and expected count:
$$\text{residual} = \frac{\text{observed} - \text{expected}}{\sqrt{\text{expected}}}$$
Examining residuals can indicate the source(s) of an inferred association.
| | Smokers | NonSmokers |
---|---|---|
Cancer | 0.6248 | -1.886 |
Control | -0.6248 | 1.886 |
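Look for the largest absolute residuals to explain the result: here they are in the NonSmokers column, with fewer non-smokers among cancer patients, and more among controls, than expected under independence.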
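The residual table can be reproduced either by hand or from the `residuals` component that `chisq.test()` returns. The counts are typed in and the object names are illustrative.

```r
# Observed case-control counts, entered by hand (names are illustrative)
observed <- matrix(c(83, 72, 3, 14), nrow = 2,
                   dimnames = list(group = c("Cancer", "Control"),
                                   smoking = c("Smokers", "NonSmokers")))
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# By hand: (observed - expected) / sqrt(expected)
residuals_by_hand <- (observed - expected) / sqrt(expected)

# From chisq.test(): the residuals do not depend on the continuity correction
residuals_from_test <- chisq.test(observed)$residuals

round(residuals_by_hand, 4)
```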
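The $\chi^2$ test for $2 \times 2$ tables is often applied with Yates' continuity correction.
This consists in using a modified version of the test statistic:
$$\chi^2_{\text{Yates}} = \sum_{\text{cells}} \frac{(|\text{observed} - \text{expected}| - 0.5)^2}{\text{expected}}$$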
Implementation:
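# construct a table of group by smoking status and pass it to chisq.test;
# correct = TRUE (the default) applies Yates' continuity correction
table(smoking$group, smoking$smoking) |>
  chisq.test(correct = TRUE)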
Pearson's Chi-squared test with Yates' continuity correction
data: table(smoking$group, smoking$smoking)
X-squared = 6.5275, df = 1, p-value = 0.01062
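Note the larger $p$-value (0.0106, compared with 0.0049 without the correction): the continuity correction makes the test more conservative.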
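The corrected statistic can be reproduced by hand, which makes the modification explicit. This sketch types in the counts and uses illustrative names; it subtracts 0.5 from each absolute difference before squaring, which matches R's result here because every absolute difference exceeds 0.5.

```r
# Observed case-control counts, entered by hand (names are illustrative)
observed <- matrix(c(83, 72, 3, 14), nrow = 2,
                   dimnames = list(group = c("Cancer", "Control"),
                                   smoking = c("Smokers", "NonSmokers")))
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates-corrected statistic: shrink each |observed - expected| by 0.5
chi_sq_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_yates <- pchisq(chi_sq_yates, df = 1, lower.tail = FALSE)

# Uncorrected statistic, for comparison
chi_sq_plain <- sum((observed - expected)^2 / expected)

c(corrected = chi_sq_yates, uncorrected = chi_sq_plain, p_corrected = p_yates)
```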
FAMuSS data:
| | CC | CT | TT | total |
---|---|---|---|---|
African Am | 16 | 6 | 5 | 27 |
Asian | 21 | 18 | 16 | 55 |
Caucasian | 125 | 216 | 126 | 467 |
Hispanic | 4 | 10 | 9 | 23 |
Other | 7 | 11 | 5 | 23 |
total | 173 | 261 | 161 | 595 |
Expected counts:
| | CC | CT | TT | total |
---|---|---|---|---|
African Am | 7.85 | 11.84 | 7.31 | 27 |
Asian | 15.99 | 24.13 | 14.88 | 55 |
Caucasian | 135.8 | 204.8 | 126.4 | 467 |
Hispanic | 6.69 | 10.09 | 6.22 | 23 |
Other | 6.69 | 10.09 | 6.22 | 23 |
total | 173 | 261 | 161 | 595 |
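In detail:
| | CC | CT | TT |
---|---|---|---|
African Am | $(16 - 7.85)^2 / 7.85$ | $(6 - 11.84)^2 / 11.84$ | $(5 - 7.31)^2 / 7.31$ |
Asian | $(21 - 15.99)^2 / 15.99$ | $(18 - 24.13)^2 / 24.13$ | $(16 - 14.88)^2 / 14.88$ |
Caucasian | $(125 - 135.8)^2 / 135.8$ | $(216 - 204.8)^2 / 204.8$ | $(126 - 126.4)^2 / 126.4$ |
Hispanic | $(4 - 6.69)^2 / 6.69$ | $(10 - 10.09)^2 / 10.09$ | $(9 - 6.22)^2 / 6.22$ |
Other | $(7 - 6.69)^2 / 6.69$ | $(11 - 10.09)^2 / 10.09$ | $(5 - 6.22)^2 / 6.22$ |
Then:
$$\chi^2 = \sum_{\text{cells}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = 19.4$$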
The implementation is the same as for a $2 \times 2$ table.
The data provide evidence of an association between race and genotype ($\chi^2$ = 19.4 on 8 degrees of freedom, p = 0.01286).
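As a sketch of the implementation for a larger table, the FAMuSS counts can be entered as a matrix and passed to `chisq.test()`. The raw data frame is not available here, so the matrix and its name are assumptions built from the table above.

```r
# FAMuSS counts entered row by row from the table (object name is illustrative)
famuss <- matrix(c( 16,   6,   5,
                    21,  18,  16,
                   125, 216, 126,
                     4,  10,   9,
                     7,  11,   5),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(race = c("African Am", "Asian", "Caucasian",
                                          "Hispanic", "Other"),
                                 genotype = c("CC", "CT", "TT")))

result <- chisq.test(famuss)

round(result$expected, 2)   # expected counts under independence
result$statistic            # chi-square statistic
result$parameter            # degrees of freedom: (5 - 1) * (3 - 1)
result$p.value
```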
Which genotype/race combinations are contributing most to this inferred association?
| | CC | CT | TT |
---|---|---|---|
African Am | 2.909 | -1.698 | -0.8531 |
Asian | 1.252 | -1.247 | 0.2897 |
Caucasian | -0.9254 | 0.7789 | -0.03244 |
Hispanic | -1.039 | -0.02804 | 1.113 |
Other | 0.1209 | 0.2868 | -0.4905 |
Again, look for the largest absolute residuals to explain the inferred association.
African American and Asian populations have higher CC and lower CT frequencies than would be expected if genotype were independent of race.
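One way to pick out the largest residuals is to reshape the residual table into long form and sort by absolute value. The sketch below re-enters the FAMuSS counts by hand; the object names and the choice to show the top four cells are illustrative.

```r
# FAMuSS counts entered row by row from the table (object name is illustrative)
famuss <- matrix(c( 16,   6,   5,
                    21,  18,  16,
                   125, 216, 126,
                     4,  10,   9,
                     7,  11,   5),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(race = c("African Am", "Asian", "Caucasian",
                                          "Hispanic", "Other"),
                                 genotype = c("CC", "CT", "TT")))

# Residuals: (observed - expected) / sqrt(expected)
res <- chisq.test(famuss)$residuals

# Long format: one row per race/genotype cell, sorted by absolute residual
res_long <- as.data.frame(as.table(res), responseName = "residual")
res_long <- res_long[order(-abs(res_long$residual)), ]

head(res_long, 4)
```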
In a randomized trial for a malaria vaccine, 20 individuals were randomly allocated to receive a dose of the vaccine or a placebo.
| | no infection | infection |
---|---|---|
placebo | 0 | 6 |
vaccine | 9 | 5 |
Recall the assumption for the $\chi^2$ test: no expected counts are too small. Here R issues a warning:
Warning in chisq.test(table(malaria)): Chi-squared approximation may be incorrect
Not just a data artefact: the expected counts under independence are small.
| | no infection | infection |
---|---|---|
placebo | 2.7 | 3.3 |
vaccine | 6.3 | 7.7 |
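The expected counts above can be reproduced from the margins of the trial table. The slides do not state a cutoff for "too small"; the threshold of 5 used below is the one `chisq.test()` checks before issuing its warning. The table is entered by hand and the names are illustrative.

```r
# Malaria trial counts, entered by hand (names are illustrative)
malaria_tbl <- matrix(c(0, 9, 6, 5), nrow = 2,
                      dimnames = list(group = c("placebo", "vaccine"),
                                      outcome = c("no infection", "infection")))

# Expected counts under independence: (row total) * (column total) / n
expected <- outer(rowSums(malaria_tbl), colSums(malaria_tbl)) / sum(malaria_tbl)
expected

# Number of cells with an expected count below 5 (the cutoff R uses for its warning)
n_small <- sum(expected < 5)
n_small
```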
So what alternative do we have to test for association?
Fact 1: if you fix the margins, one table entry determines the rest.
Try it for yourself!
| | no infection | infection | total |
---|---|---|---|
placebo | | | 6 |
vaccine | | 5 | 14 |
total | 9 | 11 | 20 |
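Fact 2: under $H_0$ (independence) and with the margins fixed, the probability of each possible table is known exactly.
Writing $X$ for the number of infections in the vaccine group, the probability is simply:
$$P(X = x) = \frac{\binom{11}{x} \binom{9}{14 - x}}{\binom{20}{14}}$$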
| | no infection | infection | total |
---|---|---|---|
placebo | | | 6 |
vaccine | | X | 14 |
total | 9 | 11 | 20 |
This is the “hypergeometric” probability distribution.
The exact probability of the observed table is 0.0119.
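The exact probability can be checked with R's `dhyper()`, treating the 14 vaccinated individuals as a draw from 11 infected and 9 uninfected people. The sketch also sums the probabilities of all tables that are no more likely than the observed one, which is the usual two-sided exact $p$-value; the variable names are illustrative.

```r
# Margins of the malaria table
n_infected     <- 11   # column total: infection
n_not_infected <- 9    # column total: no infection
n_vaccine      <- 14   # row total: vaccine

# Probability of the observed table: X = 5 infections in the vaccine group
p_observed <- dhyper(5, m = n_infected, n = n_not_infected, k = n_vaccine)

# The same probability from binomial coefficients
p_by_hand <- choose(11, 5) * choose(9, 9) / choose(20, 14)

# Possible values of X given the fixed margins, and their probabilities
x_values <- max(0, n_vaccine - n_not_infected):min(n_infected, n_vaccine)
probs <- dhyper(x_values, m = n_infected, n = n_not_infected, k = n_vaccine)

# Two-sided p-value: total probability of tables no more likely than the observed one
p_two_sided <- sum(probs[probs <= p_observed * (1 + 1e-7)])

c(p_observed = p_observed, p_two_sided = p_two_sided)
```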
Fisher's Exact Test for Count Data
data: table(malaria)
p-value = 0.01409
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.0000000 0.7471714
sample estimates:
odds ratio
0
The data provide evidence of an effect of the vaccine (Fisher’s exact test, p = 0.0141). With 95% confidence, the vaccine reduces the odds of infection by between 25.3% and 100%.
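The test output above can be reproduced from the table counts. The `malaria` data frame is not available here, so the table is entered by hand with the same row and column order as shown; the object names are illustrative.

```r
# Malaria trial counts, entered by hand (names are illustrative)
malaria_tbl <- matrix(c(0, 9, 6, 5), nrow = 2,
                      dimnames = list(group = c("placebo", "vaccine"),
                                      outcome = c("no infection", "infection")))

result <- fisher.test(malaria_tbl)

result$p.value    # exact two-sided p-value
result$estimate   # conditional estimate of the odds ratio
result$conf.int   # 95% confidence interval for the odds ratio

# Percent reduction in odds implied by the upper end of the interval
reduction_lower_bound <- 100 * (1 - result$conf.int[2])
reduction_lower_bound
```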
In R the test is formulated in terms of the odds ratio. We’ll discuss this next time.
STAT218