Inference for proportions

Binomial probabilities

The chance that a randomly selected U.S. adult is diabetic is $p$.

The chance that two of three randomly selected U.S. adults are diabetic is:

$\Pr(2) = \underbrace{3}_{\text{\# possible orderings}} \times \underbrace{p \times p}_{\text{chance of two diabetics}} \times \underbrace{(1 - p)}_{\text{chance of one nondiabetic}}$

The chance that $x$ out of $n$ randomly selected U.S. adults are diabetic is:

$\Pr(x) = \underbrace{\binom{n}{x}}_{\text{\# sample orderings}} \times \underbrace{p^{x}}_{x \text{ diabetics}} \times \underbrace{(1 - p)^{n - x}}_{n - x \text{ nondiabetics}}$

This is called a binomial probability distribution.
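As a quick check, the formula can be evaluated by hand in R and compared with the built-in `dbinom()`. This is a minimal sketch; the values `n = 3`, `x = 2`, and `p = 0.095` are assumptions chosen only for illustration.

```r
# illustrative values (assumed): 2 diabetics among 3 adults, p = 0.095
n <- 3
x <- 2
p <- 0.095

# binomial formula: (# orderings) * p^x * (1 - p)^(n - x)
by.hand <- choose(n, x) * p^x * (1 - p)^(n - x)

# built-in binomial probability
built.in <- dbinom(x, size = n, prob = p)

# the two values should agree
c(by.hand = by.hand, built.in = built.in)
```

With `n = 3` and `x = 2`, `choose(n, x)` returns 3, the number of orderings in the three-adult example.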

Upper-sided test

$\begin{cases} H_{0} : p = 0.095 \\ H_{A} : p > 0.095 \end{cases}$

The data favor $H_{A}$ when counts are larger. $\Pr(X \geq 57) = \sum_{x = 57}^{500} \Pr(x) = 0.0874$

binom.test(x = 57, n = 500, p = 0.095, alternative = 'greater')


    Exact binomial test

data:  57 and 500
number of successes = 57, number of trials = 500, p-value = 0.08745
alternative hypothesis: true probability of success is greater than 0.095
95 percent confidence interval:
 0.0913675 1.0000000
sample estimates:
probability of success 
                 0.114

The data do not provide evidence that diabetes prevalence exceeds 9.5% (p = 0.0874).
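The upper-tail sum can also be computed directly from binomial probabilities. The sketch below assumes the same counts and null value as the test above; the object names are hypothetical.

```r
n <- 500      # number of trials
p0 <- 0.095   # null value of p
x.obs <- 57   # observed count

# sum Pr(x) over x = 57, ..., 500 under the null
p.upper <- sum(dbinom(x.obs:n, size = n, prob = p0))

# equivalent upper-tail call: Pr(X > 56) = Pr(X >= 57)
p.upper.alt <- pbinom(x.obs - 1, size = n, prob = p0, lower.tail = FALSE)

# compare with the p-value reported by binom.test()
c(sum = p.upper, tail = p.upper.alt)
```

Note the `x.obs - 1`: `pbinom(..., lower.tail = FALSE)` returns a strict inequality, $\Pr(X > q)$.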

Lower-sided test

$\begin{cases} H_{0} : p = 0.095 \\ H_{A} : p < 0.095 \end{cases}$

The data favor $H_{A}$ when counts are smaller. $\Pr(X \leq 57) = \sum_{x = 0}^{57} \Pr(x) = 0.9333$

binom.test(x = 57, n = 500, p = 0.095, alternative = 'less')


    Exact binomial test

data:  57 and 500
number of successes = 57, number of trials = 500, p-value = 0.9333
alternative hypothesis: true probability of success is less than 0.095
95 percent confidence interval:
 0.0000000 0.1401005
sample estimates:
probability of success 
                 0.114

The data do not provide evidence that diabetes prevalence is less than 9.5% (p = 0.9333).
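The lower-tail sum is a cumulative binomial probability, so it can be sketched with `pbinom()`. The counts and null value are those of the test above; the object names are hypothetical.

```r
n <- 500      # number of trials
p0 <- 0.095   # null value of p
x.obs <- 57   # observed count

# Pr(X <= 57) under the null
p.lower <- pbinom(x.obs, size = n, prob = p0)

# Pr(X >= 57) under the null, for comparison
p.upper <- pbinom(x.obs - 1, size = n, prob = p0, lower.tail = FALSE)

# both tails include x = 57, so they sum to 1 + Pr(57)
c(lower = p.lower, upper = p.upper,
  overlap = p.lower + p.upper - 1,
  pr.57 = dbinom(x.obs, size = n, prob = p0))
```

Because the observed count is counted in both tails, the two one-sided p-values sum to more than 1.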

Two-sided test

$\begin{cases} H_{0} : p = 0.095 \\ H_{A} : p \neq 0.095 \end{cases}$

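The data favor $H_{A}$ when the observed count is unlikely under $H_{0}$, in either direction. The p-value sums the probabilities of all counts that are no more likely than the observed one: $\sum_{x : \Pr(x) \leq \Pr(57)} \Pr(x) = 0.1473$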

binom.test(x = 57, n = 500, p = 0.095, alternative = 'two.sided')


    Exact binomial test

data:  57 and 500
number of successes = 57, number of trials = 500, p-value = 0.1473
alternative hypothesis: true probability of success is not equal to 0.095
95 percent confidence interval:
 0.0874949 0.1451685
sample estimates:
probability of success 
                 0.114

The data do not provide evidence that diabetes prevalence differs from 9.5% (p = 0.1473).
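The two-sided sum can be sketched by listing every binomial probability under the null and keeping those no larger than the probability of the observed count. The object names are hypothetical. `binom.test()` also applies a small numerical tolerance when comparing probabilities; this sketch omits it.

```r
n <- 500      # number of trials
p0 <- 0.095   # null value of p
x.obs <- 57   # observed count

# Pr(x) for every possible count x = 0, ..., n under the null
probs <- dbinom(0:n, size = n, prob = p0)

# probability of the observed count (index x.obs + 1 because x starts at 0)
pr.obs <- probs[x.obs + 1]

# sum over all counts that are no more likely than the observed one
p.two.sided <- sum(probs[probs <= pr.obs])
p.two.sided
```

The resulting region is not symmetric about the expected count, which is why the two-sided p-value is not simply twice the upper-sided one.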

Exact confidence intervals

An exact confidence interval can be obtained by inverting the corresponding exact test.

With 90% confidence, diabetes prevalence among U.S. adults is estimated to be between 9.1% and 14.0%.

Unfortunately, this approach gives asymmetric intervals (the midpoint is not $\hat{p}$). Two options:

- choose the smallest symmetric interval with exact coverage $\geq 100 \times (1 - \alpha)\%$
- use a large-sample approximation
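A minimal sketch of the exact interval, using the diabetes counts (57 of 500): `binom.test()` reports an interval whose limits can also be written as beta quantiles (the Clopper-Pearson construction). The object names are hypothetical.

```r
x <- 57
n <- 500
alpha <- 0.10   # 90% confidence

# exact interval reported by binom.test()
ci.exact <- as.numeric(binom.test(x, n, conf.level = 1 - alpha)$conf.int)

# the same limits expressed as beta quantiles
lwr <- qbeta(alpha / 2, x, n - x + 1)
upr <- qbeta(1 - alpha / 2, x + 1, n - x)

# asymmetry: the midpoint of the interval differs from p.hat
c(lwr = lwr, upr = upr, midpoint = (lwr + upr) / 2, p.hat = x / n)
```

Each limit of the 90% interval matches the bound of a one-sided 95% interval, since $\alpha / 2 = 0.05$ is placed in each tail.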

Confidence interval for $p$

A confidence interval for a binomial proportion $p$ is:

$\hat{p} \pm c \times SE(\hat{p})$

The critical value $c$ comes from the normal model.

- empirical rule:
  - $c = 1$ gives a 68% interval
  - $c = 2$ gives a 95% interval
  - $c = 3$ gives a 99.7% interval
- for a $(1 - \alpha) \times 100\%$ confidence interval, use the $1 - \frac{\alpha}{2}$ quantile of the normal model

qnorm(1 - 0.1/2) # c for 90% interval
qnorm(1 - 0.05/2) # c for 95% interval
qnorm(1 - 0.01/2) # c for 99% interval
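The empirical rule can be checked against the normal model in both directions: `pnorm()` gives the coverage for a given $c$, and `qnorm()` gives the $c$ for a given coverage. The sketch below uses hypothetical object names.

```r
# coverage of (estimate +/- c standard errors) under the normal model
c.vals <- c(1, 2, 3)
coverage <- pnorm(c.vals) - pnorm(-c.vals)
round(coverage, 4)

# critical values for the coverage levels quoted by the empirical rule
conf <- c(0.68, 0.95, 0.997)
crit <- qnorm(1 - (1 - conf) / 2)
round(crit, 3)
```

The rule's values of $c$ are rounded; for example, the 95% critical value is slightly below 2.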

Confidence interval for $p$

Point estimate for diabetes prevalence
p.hat	se	n
0.114	0.01421	500

It is estimated that the proportion of the U.S. adult population with diagnosed diabetes is 11.4% (SE = 1.42%).

Check assumptions for the normal model:

$500 \times 0.114 = 57 \geq 10 \quad \text{and} \quad 500 \times 0.886 = 443 \geq 10$

95% confidence interval for diabetes prevalence:

$0.114 \pm 2 \times 0.01421 = (0.0856, 0.1424)$

With 95% confidence, the proportion of U.S. adults with diagnosed diabetes is estimated to be between 8.56% and 14.24%.
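The point estimate, standard error, assumption check, and interval can be computed in a few lines. This sketch assumes the usual plug-in standard error $\sqrt{\hat{p}(1 - \hat{p})/n}$, which reproduces the tabulated SE; the object names are hypothetical.

```r
x <- 57    # adults with diagnosed diabetes
n <- 500   # sample size

p.hat <- x / n
se <- sqrt(p.hat * (1 - p.hat) / n)   # assumed plug-in standard error

# conditions for the normal model: at least 10 in each category
conditions.met <- (n * p.hat >= 10) && (n * (1 - p.hat) >= 10)

# interval using the empirical-rule value c = 2
ci.rule <- p.hat + c(-1, 1) * 2 * se

# interval using the normal quantile
ci.qnorm <- p.hat + c(-1, 1) * qnorm(1 - 0.05 / 2) * se

rbind(ci.rule, ci.qnorm)
```

The two intervals differ slightly because the normal quantile is a little smaller than 2.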

Inference for a proportion in R

Inference using the normal model in R:

- Construct a table of the frequency distribution
- Pass the table to prop.test()

Remarks about output:

- X-squared gives $Z^{2}$
- correct = F performs the test without continuity correction

# variable of interest
dia <- nhanes$diabetes

# pass table to prop.test
table(dia) |> 
  prop.test(p = 0.1, alternative = 'two.sided',
            conf.level = 0.95, correct = F)


    1-sample proportions test without continuity correction

data:  table(dia), null probability 0.1
X-squared = 1.0889, df = 1, p-value = 0.2967
alternative hypothesis: true p is not equal to 0.1
95 percent confidence interval:
 0.0890369 0.1448491
sample estimates:
    p 
0.114

The data provide no evidence that diabetes prevalence among U.S. adults differs from 10%. With 95% confidence, prevalence is estimated to be between 8.90% and 14.48%, with a point estimate of 11.4% (SE = 1.42%).
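To see why X-squared equals $Z^{2}$, the test statistic can be computed by hand. The sketch below enters the counts (57 of 500) directly instead of building a table from the data, and assumes the score-test form of $Z$ with the standard error evaluated at the null value. The object names are hypothetical.

```r
x <- 57     # adults with diagnosed diabetes
n <- 500    # sample size
p0 <- 0.1   # null value used in the call above

p.hat <- x / n

# z statistic with the standard error evaluated at the null value
z <- (p.hat - p0) / sqrt(p0 * (1 - p0) / n)
p.value <- 2 * pnorm(-abs(z))

# same test via prop.test(), passing counts instead of a table
fit <- prop.test(x = x, n = n, p = p0, alternative = 'two.sided',
                 conf.level = 0.95, correct = FALSE)

c(z.squared = z^2, x.squared = unname(fit$statistic),
  p.by.hand = p.value, p.prop.test = unname(fit$p.value))
```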

Comparing exact and approximate methods
	ci.lwr	ci.upr	ci.width	p.value
exact	0.08749	0.1452	0.05767	0.1473
approximate w/ continuity correction	0.08815	0.1459	0.0578	0.1698
approximate w/o continuity correction	0.08904	0.1448	0.05581	0.1474
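The rows of this comparison can be regenerated from the three tests. This is a sketch: it assumes the counts 57 of 500 and the null value 0.095 for every row, which agrees with the reported p-values. `summarize.fit()` is a hypothetical helper, not part of R.

```r
x <- 57
n <- 500
p0 <- 0.095

# hypothetical helper: interval limits, width, and p-value from a test object
summarize.fit <- function(fit) {
  ci <- as.numeric(fit$conf.int)
  c(ci.lwr = ci[1], ci.upr = ci[2], ci.width = ci[2] - ci[1],
    p.value = unname(fit$p.value))
}

comparison <- rbind(
  exact = summarize.fit(binom.test(x, n, p = p0)),
  approx.cc = summarize.fit(prop.test(x, n, p = p0, correct = TRUE)),
  approx.no.cc = summarize.fit(prop.test(x, n, p = p0, correct = FALSE))
)

signif(comparison, 4)
```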

Vitamin C experiment
	Cold	NoCold	n
Placebo	335	76	411
VitC	302	105	407

Inference for two proportions

We can first consider inferences on the difference in proportions:

$\delta = p_{\text{placebo}} - p_{\text{vitC}}$

Inferences are based on groupwise estimates:

- point estimate: $\hat{p}_{\text{placebo}} - \hat{p}_{\text{vitC}}$
- standard error: $\sqrt{SE^{2}(\hat{p}_{\text{placebo}}) + SE^{2}(\hat{p}_{\text{vitC}})}$

When both groups meet the conditions for inference for one proportion, the statistic

$Z = \frac{(\hat{p}_{\text{placebo}} - \hat{p}_{\text{vitC}}) - \delta}{SE(\hat{p}_{\text{placebo}} - \hat{p}_{\text{vitC}})}$

has a sampling distribution well-approximated by a normal model.

Confidence interval for the difference

$\hat{p}_{\text{placebo}} - \hat{p}_{\text{vitC}} \pm c \times SE(\hat{p}_{\text{placebo}} - \hat{p}_{\text{vitC}})$

For a $(1 - \alpha) \times 100\%$ confidence interval, the critical value $c$ is chosen to be the $(1 - \frac{\alpha}{2})$ quantile of the normal model.

- point estimate: $\hat{p}_{\text{placebo}} - \hat{p}_{\text{vitC}} = 0.0731$
- standard error: $\sqrt{SE^{2}(\hat{p}_{\text{placebo}}) + SE^{2}(\hat{p}_{\text{vitC}})} = 0.0289$
- critical value for 95% interval: qnorm(1 - 0.05/2) = 1.959964

95% confidence interval: (0.0164, 0.1298)

With 95% confidence, the prevalence of the common cold is estimated to be between 1.64 and 12.98 percentage points lower among adults who take daily vitamin C supplements.
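The interval can be reproduced from the counts in the vitamin C table. This sketch assumes the plug-in standard error $\sqrt{\hat{p}(1 - \hat{p})/n}$ within each group; the object names are hypothetical.

```r
# counts from the vitamin C experiment
cold <- c(placebo = 335, vitC = 302)
n <- c(placebo = 411, vitC = 407)

# groupwise estimates and standard errors (assumed plug-in form)
p.hat <- cold / n
se <- sqrt(p.hat * (1 - p.hat) / n)

# difference in proportions and its standard error
diff.hat <- unname(p.hat["placebo"] - p.hat["vitC"])
se.diff <- sqrt(sum(se^2))

# 95% confidence interval for the difference
crit <- qnorm(1 - 0.05 / 2)
ci <- diff.hat + c(-1, 1) * crit * se.diff

round(c(estimate = diff.hat, se = se.diff, lwr = ci[1], upr = ci[2]), 4)
```

The interval excludes zero, which agrees with the interpretation that colds were less prevalent in the vitamin C group.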

	Smokers	NonSmokers	n
Cancer	83	3	86
Control	72	14	86
