library(tidyverse)
load('data/nhanes500.RData')
load('data/vitc.RData')
load('data/obesity.RData')
Lab 10: Inference for proportions
The goal of this lab is to learn how to implement one- and two-sample inference for population proportions.
You’ll reproduce examples from lecture, using the NHANES data to estimate diabetes prevalence and the vitamin C experiment to compare two groups, and then practice with these and other datasets.
One-sample inference
Refresher on categorical frequency distributions
Sample proportions are most easily calculated from categorical frequency distributions – tables of counts of each value.
Earlier in the quarter, you learned how to construct these using table(...):
# extract variable of interest
dia <- nhanes$diabetes
# construct table of counts
table(dia)
dia
Yes No
57 443
You also saw how to convert the counts to proportions using prop.table(...):
# compute sample proportions
table(dia) |> prop.table()
dia
Yes No
0.114 0.886
The sample proportions provide point estimates for the population proportions.
Diabetes prevalence among U.S. adults is estimated to be 11.4%.
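As a quick check, the same point estimate can be computed directly from the counts. The sketch below rebuilds a stand-in vector from the tallies shown above (57 Yes, 443 No) rather than loading the data, so the name dia_demo is illustrative only.

```r
# stand-in for nhanes$diabetes, rebuilt from the counts in the table above
dia_demo <- factor(rep(c('Yes', 'No'), times = c(57, 443)),
                   levels = c('Yes', 'No'))

# proportion of 'Yes' responses: count of interest divided by sample size
p_hat <- sum(dia_demo == 'Yes') / length(dia_demo)
p_hat

# equivalently, the mean of a logical vector is the proportion of TRUEs
mean(dia_demo == 'Yes')
```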
The sleeptrouble variable in the NHANES dataset records whether the participant experiences sleep trouble.
Estimate the proportion of U.S. adults that experience sleep trouble.
# extract variable of interest
sleep <- nhanes$sleeptrouble
# compute sample proportions
table(sleep) |> prop.table()
sleep
Yes No
0.278 0.722
An estimated 27.8% of U.S. adults experience sleep trouble.
Exact inference
Exact inference for a population proportion is based on the binomial probability distribution, which gives an exact probability for recording a count of x occurrences of an outcome of interest among n independent observations in terms of the outcome probability p:
# calculate a binomial probability
dbinom(x = 12, size = 20, prob = 0.7)
[1] 0.1143967
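The value returned by dbinom(...) can be checked against the binomial formula, choose(n, x) * p^x * (1 - p)^(n - x). A minimal sketch using the same numbers as above:

```r
# binomial probability from the formula: (n choose x) * p^x * (1 - p)^(n - x)
n <- 20
x <- 12
p <- 0.7
by_hand <- choose(n, x) * p^x * (1 - p)^(n - x)
by_hand

# agrees with the built-in function
all.equal(by_hand, dbinom(x = 12, size = 20, prob = 0.7))
```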
Similarly, exact methods based on the binomial utilize the sample size n and the tally x of the number of outcomes of interest (e.g., has diabetes, experiences sleep trouble, etc.). We can get this information from the table of counts:
# outcome counts for diabetes data
table(dia)
dia
Yes No
57 443
Then, to test whether the population proportion is 0.095, provide the outcome tally and sample size with the hypothetical proportion to binom.test(...):
# exact inference using binomial
binom.test(x = 57, n = 500, p = 0.095)
Exact binomial test
data: 57 and 500
number of successes = 57, number of trials = 500, p-value = 0.1473
alternative hypothesis: true probability of success is not equal to 0.095
95 percent confidence interval:
0.0874949 0.1451685
sample estimates:
probability of success
0.114
Notice that the default is to perform a two-sided test. The result is interpreted as follows:
The data do not provide evidence that diabetes prevalence among U.S. adults differs from 9.5% (exact binomial test, p = 0.1473). With 95% confidence, prevalence is estimated to be between 8.75% and 14.5%.
To implement directional tests, add an alternative = ... argument; to adjust the confidence level for the interval, add a conf.level = ... argument. For example:
# upper-sided test
binom.test(x = 57, n = 500, p = 0.095, alternative = 'greater', conf.level = 0.99)
Exact binomial test
data: 57 and 500
number of successes = 57, number of trials = 500, p-value = 0.08745
alternative hypothesis: true probability of success is greater than 0.095
99 percent confidence interval:
0.08312206 1.00000000
sample estimates:
probability of success
0.114
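For an upper-sided alternative, the exact p-value is the binomial probability of observing at least as many outcomes as were recorded, assuming the hypothesized proportion. The sketch below computes this tail probability directly; it should agree with the p-value reported by binom.test(...) above.

```r
# P(X >= 57) when X ~ Binomial(500, 0.095)
# pbinom(..., lower.tail = FALSE) gives P(X > q), so use q = 56
p_upper <- pbinom(56, size = 500, prob = 0.095, lower.tail = FALSE)
p_upper

# same quantity as a sum of individual binomial probabilities
sum(dbinom(57:500, size = 500, prob = 0.095))
```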
Using the NHANES data, test whether more than 1 in 5 U.S. adults experience sleep trouble. Interpret the result of the test and confidence interval in context.
# table of counts
table(sleep)
sleep
Yes No
139 361
# exact binomial test
binom.test(x = 139, n = 500, p = 0.2, alternative = 'greater')
Exact binomial test
data: 139 and 500
number of successes = 139, number of trials = 500, p-value = 1.743e-05
alternative hypothesis: true probability of success is greater than 0.2
95 percent confidence interval:
0.2450837 1.0000000
sample estimates:
probability of success
0.278
The data provide evidence that more than 1 in 5 U.S. adults experience sleep trouble (exact binomial test, p < 0.0001). With 95% confidence, the proportion is at least 0.245.
Approximate inference
Approximate methods of inference are based on large-sample models that assume the expected counts of both outcomes under the hypothesized proportion $p_0$ are at least 10: $n p_0 \geq 10$ and $n(1 - p_0) \geq 10$.
If so, the normal model provides a reasonably good approximation for the sampling distribution of the standardized sample proportion: $Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$.
The inference is then carried out using prop.test(...) in R, which takes as input the table of outcome counts:
# approximate inference
table(dia) |> prop.test(p = 0.095)
1-sample proportions test with continuity correction
data: table(dia), null probability 0.095
X-squared = 1.8843, df = 1, p-value = 0.1698
alternative hypothesis: true p is not equal to 0.095
95 percent confidence interval:
0.08814952 0.14594579
sample estimates:
p
0.114
By default, this performs a two-sided test with the continuity correction and returns a 95% confidence interval. To control this behavior exactly, add the arguments shown below:
# approximate inference
table(dia) |>
  prop.test(p = 0.095, alternative = 'two.sided', correct = TRUE, conf.level = 0.95)
1-sample proportions test with continuity correction
data: table(dia), null probability 0.095
X-squared = 1.8843, df = 1, p-value = 0.1698
alternative hypothesis: true p is not equal to 0.095
95 percent confidence interval:
0.08814952 0.14594579
sample estimates:
p
0.114
The results are then interpreted as follows:
The data do not provide evidence that diabetes prevalence among U.S. adults differs from 9.5% (Z = 1.37, p = 0.1698). With 95% confidence, prevalence is estimated to be between 8.8% and 14.6%.
Note that the test statistic Z is the square root of the value of X-squared (we’ll discuss this more later).
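Both the large-sample condition and the relationship between Z and X-squared can be checked by hand. The sketch below works from the diabetes counts and assumes the continuity-corrected form of the statistic, in which 0.5/n is subtracted from the absolute difference between the sample and hypothesized proportions; the variable names are illustrative.

```r
# counts and hypothesized proportion from the diabetes example
x <- 57
n <- 500
p0 <- 0.095

# expected counts of each outcome under the null; both should be at least 10
c(yes = n * p0, no = n * (1 - p0))

# continuity-corrected Z statistic
p_hat <- x / n
z <- (abs(p_hat - p0) - 0.5 / n) / sqrt(p0 * (1 - p0) / n)
z
z^2  # compare with X-squared reported by prop.test

# two-sided p-value from the normal model
2 * pnorm(-z)
```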
Use the large-sample approximation to test whether a quarter of U.S. adults experience sleep trouble. Interpret the results in context.
# approximate inference
table(sleep) |> prop.test(p = 0.25)
1-sample proportions test with continuity correction
data: table(sleep), null probability 0.25
X-squared = 1.944, df = 1, p-value = 0.1632
alternative hypothesis: true p is not equal to 0.25
95 percent confidence interval:
0.2395872 0.3198838
sample estimates:
p
0.278
The data do not provide evidence that the proportion of U.S. adults who experience sleep trouble differs from 0.25 (Z = 1.39, p = 0.1632). With 95% confidence, between 24.0% and 32.0% of U.S. adults experience sleep trouble.
Two-sample inference
Inference comparing two proportions proceeds from a two-way table or “contingency” table. Contingency tables are constructed by cross-tabulating two categorical variables:
# variables of interest
trt <- vitc$trt
out <- vitc$out
# construct contingency table
table(trt, out)
out
trt Cold NoCold
Placebo 335 76
VitC 302 105
The counts show how many observations had each combination of values. For instance, 76 participants were in the placebo group and had no cold. This count can be rendered as a proportion in three ways: as a share of the placebo group (76/411), as a share of those with no cold (76/181), or as a share of all participants (76/818).
Contingency tables can be confusing because of this feature. We distinguish the three by naming the “margin” used to compute the proportions: row margins, column margins, or the grand total.
# using totals (default)
table(trt, out) |> prop.table(margin = NULL)
out
trt Cold NoCold
Placebo 0.40953545 0.09290954
VitC 0.36919315 0.12836186
# using row margins
table(trt, out) |> prop.table(margin = 1)
out
trt Cold NoCold
Placebo 0.8150852 0.1849148
VitC 0.7420147 0.2579853
# using column margins
table(trt, out) |> prop.table(margin = 2)
out
trt Cold NoCold
Placebo 0.5259027 0.4198895
VitC 0.4740973 0.5801105
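To see where these numbers come from, the three versions can be reproduced with plain division. The sketch below rebuilds the table from the counts shown above (the object name tab is illustrative) and divides by the grand total, the row totals, and the column totals in turn.

```r
# rebuild the contingency table from the printed counts
tab <- matrix(c(335, 302, 76, 105), nrow = 2,
              dimnames = list(trt = c('Placebo', 'VitC'),
                              out = c('Cold', 'NoCold')))

# grand total: each cell as a share of all participants
tab / sum(tab)

# row margins: each cell as a share of its treatment group
tab / rowSums(tab)

# column margins: each cell as a share of its outcome group
# (sweep divides each column by the matching column total)
sweep(tab, 2, colSums(tab), '/')
```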
Using the NHANES data, make a contingency table of whether participants experience sleep trouble by gender. Then:
- compute the proportion of women who experience sleep trouble
- compute the proportion of those who experience sleep trouble who are women
- compute the proportion of respondents that are women who experience sleep trouble
# contingency table
gender <- nhanes$gender
sleep <- nhanes$sleeptrouble
table(gender, sleep)
sleep
gender Yes No
female 92 174
male 47 187
# proportion of women with sleep trouble
table(gender, sleep) |> prop.table(margin = 1)
sleep
gender Yes No
female 0.3458647 0.6541353
male 0.2008547 0.7991453
# proportion of those with sleep trouble who are women
table(gender, sleep) |> prop.table(margin = 2)
sleep
gender Yes No
female 0.6618705 0.4819945
male 0.3381295 0.5180055
# proportion of respondents who are women with sleep trouble
table(gender, sleep) |> prop.table(margin = NULL)
sleep
gender Yes No
female 0.184 0.348
male 0.094 0.374
For the vitamin C study, we wish to compare the proportions of those with colds between the two treatment groups, so we want to use row margins:
# proportion of each outcome by treatment group
table(trt, out) |> prop.table(margin = 1)
out
trt Cold NoCold
Placebo 0.8150852 0.1849148
VitC 0.7420147 0.2579853
The contingency table can be used directly to perform inference on a difference in proportions using prop.test(...):
# test for difference in proportions
table(trt, out) |>
  prop.test(alternative = 'two.sided',
            conf.level = 0.95)
2-sample test for equality of proportions with continuity correction
data: table(trt, out)
X-squared = 5.9196, df = 1, p-value = 0.01497
alternative hypothesis: two.sided
95 percent confidence interval:
0.01391972 0.13222111
sample estimates:
prop 1 prop 2
0.8150852 0.7420147
The data provide evidence that vitamin C affects the chance of contracting a common cold (Z = 2.43, p = 0.0150). With 95% confidence, the share of people who contract a common cold while taking vitamin C is between 1.39 and 13.22 percentage points lower than among those who do not take vitamin C.
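The interval reported above can be reproduced by hand. The sketch below assumes the usual large-sample interval for a difference in proportions (unpooled standard error) with a continuity correction of 0.5 * (1/n1 + 1/n2) added to the margin of error; the variable names are illustrative, and the counts are those from the contingency table.

```r
# colds and group sizes: placebo first, then vitamin C
x <- c(335, 302)
n <- c(411, 407)
p_hat <- x / n

# point estimate of the difference (placebo minus vitamin C)
delta <- p_hat[1] - p_hat[2]

# unpooled standard error and continuity correction
se <- sqrt(sum(p_hat * (1 - p_hat) / n))
cc <- 0.5 * sum(1 / n)

# 95% confidence interval
ci <- delta + c(-1, 1) * (qnorm(0.975) * se + cc)
ci
```

Because the difference is computed as placebo minus vitamin C, a positive interval corresponds to a lower cold rate in the vitamin C group.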
For the inference to work appropriately, the outcome must be shown in the column dimension and the groups must be shown in the row dimension. Here’s what happens when they’re switched:
# NOT CORRECT
table(out, trt) |>
  prop.test(alternative = 'two.sided',
            conf.level = 0.95)
2-sample test for equality of proportions with continuity correction
data: table(out, trt)
X-squared = 5.9196, df = 1, p-value = 0.01497
alternative hypothesis: two.sided
95 percent confidence interval:
0.02077574 0.19125059
sample estimates:
prop 1 prop 2
0.5259027 0.4198895
The test statistic and p-value are identical because the row proportions differ exactly when the column proportions differ, so the tests are equivalent. (That makes a nice algebraic exercise, if you’re interested.) However, the point estimates and confidence interval are not identical, and have different interpretations; only the first one above is correct for the comparison we wish to make.
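A quick numerical check of this equivalence, using a table rebuilt from the counts above (the name tab is illustrative): transposing the table swaps the roles of groups and outcomes, and the test statistic is unchanged while the estimates differ.

```r
# rebuild the contingency table from the printed counts
tab <- matrix(c(335, 302, 76, 105), nrow = 2,
              dimnames = list(trt = c('Placebo', 'VitC'),
                              out = c('Cold', 'NoCold')))

by_row <- prop.test(tab)     # groups in rows (orientation used above)
by_col <- prop.test(t(tab))  # groups and outcomes switched

# same test statistic and p-value in both orientations
c(by_row$statistic, by_col$statistic)
c(by_row$p.value, by_col$p.value)

# different point estimates
by_row$estimate
by_col$estimate
```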
Using the NHANES data, test whether women experience sleep trouble at a higher rate than men.
# test for difference in proportions
table(gender, sleep) |> prop.test(alternative = 'greater')
2-sample test for equality of proportions with continuity correction
data: table(gender, sleep)
X-squared = 12.329, df = 1, p-value = 0.0002229
alternative hypothesis: greater
95 percent confidence interval:
0.07651852 1.00000000
sample estimates:
prop 1 prop 2
0.3458647 0.2008547
The data provide evidence that women experience sleep trouble at a higher rate than men (Z = 3.511, p = 0.0002). With 95% confidence, the rate among women is estimated to be at least 7.65 percentage points higher than that among men.