Lab 10: Inference for proportions

Course activity

STAT218

The goal of this lab is to learn how to implement one- and two-sample inference for population proportions.

You’ll reproduce two examples from lecture, estimating diabetes prevalence from the NHANES data and comparing treatment groups in the Vitamin C experiment, and then practice with these and other datasets.

library(tidyverse)
load('data/nhanes500.RData')
load('data/vitc.RData')
load('data/obesity.RData')

One-sample inference

Refresher on categorical frequency distributions

Sample proportions are most easily calculated from categorical frequency distributions – tables of counts of each value.

Earlier in the quarter, you learned how to construct these using table(...):

# extract variable of interest
dia <- nhanes$diabetes

# construct table of counts
table(dia)

dia
Yes  No 
 57 443

You also saw how to convert the counts to proportions using prop.table(...):

# compute sample proportions
table(dia) |> prop.table()

dia
  Yes    No 
0.114 0.886

The sample proportions provide point estimates for the population proportions.

Diabetes prevalence among U.S. adults is estimated to be 11.4%.
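As a quick cross-check, the same point estimate can be computed as the mean of a TRUE/FALSE indicator. The sketch below is self-contained: `dia_demo` is a hypothetical stand-in vector rebuilt from the counts printed above (57 Yes, 443 No), not the actual `nhanes` column.

```r
# hypothetical stand-in for nhanes$diabetes, rebuilt from the printed counts
dia_demo <- factor(rep(c('Yes', 'No'), times = c(57, 443)),
                   levels = c('Yes', 'No'))

# proportion of 'Yes' responses: the mean of a TRUE/FALSE indicator
p_hat <- mean(dia_demo == 'Yes')
p_hat

# the table-based route gives the same value in its 'Yes' cell
table(dia_demo) |> prop.table()
```

Both routes should report 0.114 for the 'Yes' category.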

Your turn

The sleeptrouble variable in the NHANES dataset records whether the participant experiences sleep trouble.

Estimate the proportion of U.S. adults that experience sleep trouble.

Solution

# extract variable of interest
sleep <- nhanes$sleeptrouble

# compute sample proportions
table(sleep) |> prop.table()

sleep
  Yes    No 
0.278 0.722

An estimated 27.8% of U.S. adults experience sleep trouble.

Exact inference

Exact inference for a population proportion is based on the binomial probability distribution, which gives an exact probability for recording a count of x occurrences of an outcome of interest among n independent observations in terms of the outcome probability p:

$\Pr(x) = \binom{n}{x} p^{x} (1 - p)^{n - x}$

In R, one can calculate these probabilities. For example, the chance of observing 12 occurrences of an outcome with probability 0.7 among 20 observations is about 11.4%:

# calculate a binomial probability
dbinom(x = 12, size = 20, prob = 0.7)

[1] 0.1143967
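To connect the formula to the function call, here is a minimal sketch that evaluates the binomial formula term by term and compares it with dbinom(...). The variable names (`by_hand`, `built_in`) are illustrative only.

```r
# same setting as above: 12 occurrences in 20 observations, p = 0.7
n <- 20
x <- 12
p <- 0.7

# binomial formula, term by term: (n choose x) * p^x * (1 - p)^(n - x)
by_hand <- choose(n, x) * p^x * (1 - p)^(n - x)

# built-in binomial probability
built_in <- dbinom(x = x, size = n, prob = p)

# the two agree up to floating-point error
all.equal(by_hand, built_in)
```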

Similarly, exact methods based on the binomial utilize the sample size n and the tally x of the number of outcomes of interest (e.g., has diabetes, experiences sleep trouble, etc.). We can get this information from the table of counts:

# outcome counts for diabetes data
table(dia)

dia
Yes  No 
 57 443

Then, to test whether the population proportion is 0.095, provide the outcome tally and sample size with the hypothetical proportion to binom.test(...):

# exact inference using binomial
binom.test(x = 57, n = 500, p = 0.095)


    Exact binomial test

data:  57 and 500
number of successes = 57, number of trials = 500, p-value = 0.1473
alternative hypothesis: true probability of success is not equal to 0.095
95 percent confidence interval:
 0.0874949 0.1451685
sample estimates:
probability of success 
                 0.114

Notice that the default is to perform a two-sided test. The result is interpreted as follows:

The data do not provide evidence that diabetes prevalence among U.S. adults differs from 9.5% (exact binomial test, p = 0.1473). With 95% confidence, prevalence is estimated to be between 8.7% and 14.5%.
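For a sense of where these numbers come from, the sketch below recomputes them from binomial probabilities. It assumes the two-sided p-value sums the probabilities of all counts that are no more likely than the observed one under the null (with a small tolerance for floating-point ties), and that the interval is the Clopper-Pearson interval computed from beta quantiles. Treat it as an illustration of the idea rather than a replacement for binom.test(...).

```r
x <- 57      # observed number with diabetes
n <- 500     # sample size
p0 <- 0.095  # hypothesized proportion

# probability of every possible count 0, 1, ..., n under the null
probs <- dbinom(0:n, size = n, prob = p0)
observed <- dbinom(x, size = n, prob = p0)

# two-sided p-value: total probability of counts no more likely than observed
p_value <- sum(probs[probs <= observed * (1 + 1e-7)])

# 95% Clopper-Pearson interval from beta quantiles
ci <- c(qbeta(0.025, x, n - x + 1), qbeta(0.975, x + 1, n - x))

p_value
ci
```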

To implement directional tests, add an alternative = ... argument; to adjust the confidence level for the interval, add a conf.level = ... argument. For example:

# upper-sided test
binom.test(x = 57, n = 500, p = 0.095, alternative = 'greater', conf.level = 0.99)


    Exact binomial test

data:  57 and 500
number of successes = 57, number of trials = 500, p-value = 0.08745
alternative hypothesis: true probability of success is greater than 0.095
99 percent confidence interval:
 0.08312206 1.00000000
sample estimates:
probability of success 
                 0.114
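The upper-sided p-value has a direct reading as a tail probability: the chance of seeing 57 or more cases among 500 if the true proportion were 0.095. A minimal sketch, assuming the one-sided lower confidence bound is the corresponding Clopper-Pearson bound:

```r
x <- 57
n <- 500
p0 <- 0.095

# P(X >= 57) under the null; pbinom gives P(X <= 56), so take the upper tail
p_upper <- pbinom(x - 1, size = n, prob = p0, lower.tail = FALSE)

# 99% one-sided lower bound from a beta quantile (the upper limit is 1)
lower_99 <- qbeta(1 - 0.99, x, n - x + 1)

p_upper
lower_99
```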

Your turn

Using the NHANES data, test whether more than 1 in 5 U.S. adults experience sleep trouble. Interpret the result of the test and confidence interval in context.

Solution

# table of counts
table(sleep)

sleep
Yes  No 
139 361

# exact binomial test
binom.test(x = 139, n = 500, p = 0.2, alternative = 'greater')


    Exact binomial test

data:  139 and 500
number of successes = 139, number of trials = 500, p-value = 1.743e-05
alternative hypothesis: true probability of success is greater than 0.2
95 percent confidence interval:
 0.2450837 1.0000000
sample estimates:
probability of success 
                 0.278

The data provide evidence that more than 1 in 5 U.S. adults experience sleep trouble (exact binomial test, p < 0.0001). With 95% confidence, the proportion is at least 0.245.

Approximate inference

Approximate methods of inference are based on large-sample models that assume the expected counts of both outcomes are at least 10:

$n p_{0} \geq 10 \quad \text{and} \quad n (1 - p_{0}) \geq 10$

If so, the normal model provides a reasonably good approximation for the sampling distribution of:

$Z = \frac{\hat{p} - p}{SE(\hat{p})} \quad \text{where} \quad SE(\hat{p}) = \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}$
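Before relying on the approximation, it is worth checking the expected-count condition. A minimal sketch for the diabetes example, using n = 500 and the hypothesized proportion 0.095 (the name `expected` is illustrative):

```r
n <- 500
p0 <- 0.095

# expected counts of each outcome under the hypothesized proportion
expected <- c(yes = n * p0, no = n * (1 - p0))
expected

# both expected counts should be at least 10
all(expected >= 10)
```

Here the expected counts are 47.5 and 452.5, so the condition holds.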

The inference then proceeds exactly like the $t$ test for means. This approach is implemented by prop.test(...) in R, which takes as input the table of outcome counts:

# approximate inference
table(dia) |> prop.test(p = 0.095)


    1-sample proportions test with continuity correction

data:  table(dia), null probability 0.095
X-squared = 1.8843, df = 1, p-value = 0.1698
alternative hypothesis: true p is not equal to 0.095
95 percent confidence interval:
 0.08814952 0.14594579
sample estimates:
    p 
0.114

By default, this performs a two-sided test with the continuity correction and returns a 95% confidence interval. To control this behavior exactly, add the arguments shown below:

# approximate inference
table(dia) |> 
  prop.test(p = 0.095, alternative = 'two.sided', correct = TRUE, conf.level = 0.95)


    1-sample proportions test with continuity correction

data:  table(dia), null probability 0.095
X-squared = 1.8843, df = 1, p-value = 0.1698
alternative hypothesis: true p is not equal to 0.095
95 percent confidence interval:
 0.08814952 0.14594579
sample estimates:
    p 
0.114

The results are then interpreted as follows:

The data do not provide evidence that diabetes prevalence among U.S. adults differs from 9.5% (Z = 1.37, p = 0.1698). With 95% confidence, prevalence is estimated to be between 8.8% and 14.6%.

Note that the test statistic Z is the square root of the value of X-squared (we’ll discuss this more later).
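The relationship between Z and X-squared can be checked by hand. The sketch below assumes the continuity correction subtracts 0.5 from the absolute difference between the observed and expected counts, and that the standard error in the test statistic is computed under the null value 0.095.

```r
x <- 57
n <- 500
p0 <- 0.095

# continuity-corrected z statistic on the count scale
z <- (abs(x - n * p0) - 0.5) / sqrt(n * p0 * (1 - p0))

# squaring z gives the reported X-squared
x_squared <- z^2

# two-sided p-value from the normal model
p_value <- 2 * pnorm(-abs(z))

c(z = z, x_squared = x_squared, p_value = p_value)
```

Rounded, these should come out to about 1.37, 1.8843, and 0.1698, in line with the output above.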

Your turn

Use the large-sample approximation to test whether a quarter of U.S. adults experience sleep trouble. Interpret the results in context.

Solution

# approximate inference
table(sleep) |> prop.test(p = 0.25)


    1-sample proportions test with continuity correction

data:  table(sleep), null probability 0.25
X-squared = 1.944, df = 1, p-value = 0.1632
alternative hypothesis: true p is not equal to 0.25
95 percent confidence interval:
 0.2395872 0.3198838
sample estimates:
    p 
0.278

The data do not provide evidence that the proportion of U.S. adults who experience sleep trouble differs from 0.25 (Z = 1.39, p = 0.1632). With 95% confidence, between 24.0% and 32.0% of U.S. adults experience sleep trouble.

Two-sample inference

Inference comparing two proportions proceeds from a two-way table or “contingency” table. Contingency tables are constructed by cross-tabulating two categorical variables:

# variables of interest
trt <- vitc$trt
out <- vitc$out

# construct contingency table
table(trt, out)

         out
trt       Cold NoCold
  Placebo  335     76
  VitC     302    105

The counts show you how many observations had each combination of values. For instance, 76 participants had no cold and were in the placebo group. This count can be rendered as a proportion in three ways!

$\begin{aligned} \frac{76}{818} & = 0.0929 \quad \text{(proportion of total)} \\ \frac{76}{411} & = 0.1849 \quad \text{(proportion of placebo group)} \\ \frac{76}{181} & = 0.4199 \quad \text{(proportion of those without colds)} \end{aligned}$

Contingency tables get confusing because of this feature. We distinguish which is which by speaking of the “margin” used to figure proportions: column margins, row margins, or the grand total.

# using totals (default)
table(trt, out) |> prop.table(margin = NULL)

         out
trt             Cold     NoCold
  Placebo 0.40953545 0.09290954
  VitC    0.36919315 0.12836186

# using row margins
table(trt, out) |> prop.table(margin = 1)

         out
trt            Cold    NoCold
  Placebo 0.8150852 0.1849148
  VitC    0.7420147 0.2579853

# using column margins
table(trt, out) |> prop.table(margin = 2)

         out
trt            Cold    NoCold
  Placebo 0.5259027 0.4198895
  VitC    0.4740973 0.5801105
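To see the denominators behind each version, it can help to display the table with its totals. The sketch below is self-contained: `vitc_counts` is a hypothetical stand-in rebuilt from the counts printed above, not the `vitc` data itself.

```r
# hypothetical stand-in for table(trt, out), rebuilt from the printed counts
vitc_counts <- as.table(matrix(
  c(335, 302, 76, 105), nrow = 2,
  dimnames = list(trt = c('Placebo', 'VitC'), out = c('Cold', 'NoCold'))
))

# append row totals, column totals, and the grand total
addmargins(vitc_counts)

# the placebo/no-cold count over each of the three denominators
cell <- vitc_counts['Placebo', 'NoCold']
three_ways <- c(total  = cell / sum(vitc_counts),
                row    = unname(cell / rowSums(vitc_counts)['Placebo']),
                column = unname(cell / colSums(vitc_counts)['NoCold']))
round(three_ways, 4)
```

The three values correspond to dividing 76 by 818, 411, and 181, respectively.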

Your turn

Using the NHANES data, make a contingency table of whether participants experience sleep trouble by gender. Then:

compute the proportion of women who experience sleep trouble
compute the proportion of those who experience sleep trouble who are women
compute the proportion of respondents that are women who experience sleep trouble

Solution

# contingency table
gender <- nhanes$gender
sleep <- nhanes$sleeptrouble
table(gender, sleep)

        sleep
gender   Yes  No
  female  92 174
  male    47 187

# proportion of women with sleep trouble
table(gender, sleep) |> prop.table(margin = 1)

        sleep
gender         Yes        No
  female 0.3458647 0.6541353
  male   0.2008547 0.7991453

# proportion of those with sleep trouble who are women
table(gender, sleep) |> prop.table(margin = 2)

        sleep
gender         Yes        No
  female 0.6618705 0.4819945
  male   0.3381295 0.5180055

# proportion of respondents who are women with sleep trouble
table(gender, sleep) |> prop.table(margin = NULL)

        sleep
gender     Yes    No
  female 0.184 0.348
  male   0.094 0.374

For the vitamin C study, we wish to compare the proportions of those with colds between the two treatment groups, so we want to use row margins:

# proportion of each outcome by treatment group
table(trt, out) |> prop.table(margin = 1)

         out
trt            Cold    NoCold
  Placebo 0.8150852 0.1849148
  VitC    0.7420147 0.2579853

The contingency table can be used directly to perform inference on a difference in proportions using prop.test(...):

# test for difference in proportions
table(trt, out) |> 
  prop.test(alternative = 'two.sided', 
            conf.level = 0.95)


    2-sample test for equality of proportions with continuity correction

data:  table(trt, out)
X-squared = 5.9196, df = 1, p-value = 0.01497
alternative hypothesis: two.sided
95 percent confidence interval:
 0.01391972 0.13222111
sample estimates:
   prop 1    prop 2 
0.8150852 0.7420147

The data provide evidence that vitamin C affects the chance of contracting a common cold (Z = 2.43, p = 0.0150). With 95% confidence, the share of people who contract a common cold while taking vitamin C is between 1.39 and 13.22 percentage points lower than among those who do not take vitamin C.
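The interval can be rebuilt from the group proportions. The sketch below assumes a normal-approximation interval for the difference with an unpooled standard error, widened by a continuity correction of 0.5 * (1/n1 + 1/n2); the counts are copied from the contingency table above, and all variable names are illustrative.

```r
# colds and group sizes, from the contingency table
x <- c(placebo = 335, vitc = 302)
n <- c(placebo = 335 + 76, vitc = 302 + 105)

# sample proportions and their difference (placebo minus vitamin C)
p_hat <- x / n
delta <- unname(p_hat[1] - p_hat[2])

# unpooled standard error of the difference
se <- sqrt(sum(p_hat * (1 - p_hat) / n))

# 95% margin of error plus the assumed continuity correction
width <- qnorm(0.975) * se + 0.5 * sum(1 / n)
ci <- c(delta - width, delta + width)

delta
ci
```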

For the inference to work appropriately, the outcome must be shown in the column dimension and the groups must be shown in the row dimension. Here’s what happens when they’re switched:

# NOT CORRECT
table(out, trt) |> 
  prop.test(alternative = 'two.sided', 
            conf.level = 0.95)


    2-sample test for equality of proportions with continuity correction

data:  table(out, trt)
X-squared = 5.9196, df = 1, p-value = 0.01497
alternative hypothesis: two.sided
95 percent confidence interval:
 0.02077574 0.19125059
sample estimates:
   prop 1    prop 2 
0.5259027 0.4198895

The test statistic and p-value are identical because the row proportions differ exactly when the column proportions differ, so the tests are equivalent. (That makes a nice algebraic exercise, if you’re interested.) However, the point estimates and confidence interval are not identical, and have different interpretations; only the first one above is correct for the comparison we wish to make.
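One way to sidestep the orientation question is to pass the outcome counts and group sizes to prop.test(...) explicitly through its x and n arguments. A minimal sketch using the counts from the contingency table above (the object names are illustrative, and `vitc_counts` is a stand-in rebuilt from the printed counts):

```r
# number with colds in each group, and group sizes (Placebo, VitC)
colds <- c(335, 302)
group_sizes <- c(335 + 76, 302 + 105)

# two-sample test from counts: groups are unambiguous here
by_counts <- prop.test(x = colds, n = group_sizes)

# same analysis from a table with groups in rows and outcome in columns
vitc_counts <- as.table(matrix(
  c(335, 302, 76, 105), nrow = 2,
  dimnames = list(trt = c('Placebo', 'VitC'), out = c('Cold', 'NoCold'))
))
by_table <- prop.test(vitc_counts)

# estimates, p-value, and interval agree between the two calls
by_counts$estimate
all.equal(by_counts$p.value, by_table$p.value)
all.equal(as.numeric(by_counts$conf.int), as.numeric(by_table$conf.int))
```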

Your turn

Using the NHANES data, test whether women experience sleep trouble at a higher rate than men.

Solution

# test for difference in proportions
table(gender, sleep) |> prop.test(alternative = 'greater')


    2-sample test for equality of proportions with continuity correction

data:  table(gender, sleep)
X-squared = 12.329, df = 1, p-value = 0.0002229
alternative hypothesis: greater
95 percent confidence interval:
 0.07651852 1.00000000
sample estimates:
   prop 1    prop 2 
0.3458647 0.2008547

The data provide evidence that women experience sleep trouble at a higher rate than men (Z = 3.51, p = 0.0002). With 95% confidence, the rate among women is estimated to be at least 7.65 percentage points higher than that among men.