Two-sample inference

Hypothesis tests and intervals for comparing two population means

Today’s agenda

[lecture] two-sample inference for means
[lab] two-sample \(t\) tests in R
[test prep] practice problems

From last time

Practice problem: test whether actual body weight exceeds desired body weight.

subject	actual	desired	difference
1	265	225	40
2	150	150	0
3	137	150	-13
4	159	125	34
5	145	125	20

weight.diffs <- brfss$weight - brfss$wtdesire
t.test(weight.diffs, 
       mu = 0, 
       alternative = 'greater')


    One Sample t-test

data:  weight.diffs
t = 4.2172, df = 59, p-value = 4.311e-05
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
 10.99824      Inf
sample estimates:
mean of x 
 18.21667

The data provide very strong evidence that, on average, U.S. adults' actual weight exceeds their desired weight (T = 4.2172 on 59 degrees of freedom, p < 0.0001).

Inference is on the mean difference: \(H_0: \delta = 0\) vs. \(H_A: \delta > 0\).
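To see why this counts as inference on a mean difference, here is a minimal sketch using only the five rows shown in the table above (not the full brfss sample, so the numbers will not match the output). The vector names actual and desired are illustrative. A one-sample test on the differences gives the same result as a paired test on the two columns.

```r
# Five rows copied from the table above; illustrative names, not the full data set
actual  <- c(265, 150, 137, 159, 145)
desired <- c(225, 150, 150, 125, 125)

# One-sample test on the within-subject differences ...
one.sample <- t.test(actual - desired, mu = 0, alternative = 'greater')

# ... matches a paired test on the two columns
paired <- t.test(actual, desired, paired = TRUE, alternative = 'greater')

c(one.sample = unname(one.sample$statistic), paired = unname(paired$statistic))
```

Both approaches rely on each actual value being matched to a desired value from the same subject.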

Can we also do inference on a difference in means?

Evolution of Darwin’s finches

Peter and Rosemary Grant caught and measured birds from more than 20 generations of finches on Daphne Major.

severe drought in 1977 limited food to large tough seeds
selection pressure favoring larger and stronger beaks
hypothesis: beak depth increased in 1978 relative to 1976

year	depth
1976	10.8
1976	7.4
1978	11.4
1978	10.6

To answer this, we need to test a hypothesis involving two means:

\[ \begin{cases} H_0: &\mu_{1976} = \mu_{1978} \\ H_A: &\mu_{1976} < \mu_{1978} \end{cases} \]

can’t do inference on a mean difference here (no pairing of observations)
treat each year as an independent sample

Two-sample inference

If \(x_1, \dots, x_{58}\) are the 1976 observations and \(y_1, \dots, y_{65}\) are the 1978 observations:

\(\bar{x}\) is a point estimate for \(\mu_{1976}\) with standard error \(SE(\bar{x}) = \frac{s_x}{\sqrt{n_x}}\), where \(n_x = 58\)
\(\bar{y}\) is a point estimate for \(\mu_{1978}\) with standard error \(SE(\bar{y}) = \frac{s_y}{\sqrt{n_y}}\), where \(n_y = 65\)

Inference uses a new \(T\) statistic:

\[ T = \frac{\bar{x} - \bar{y} - \delta_0}{SE(\bar{x} - \bar{y})} \]

\(\delta_0\) is the hypothesized difference in means
\(SE(\bar{x} - \bar{y}) = \sqrt{SE(\bar{x})^2 + SE(\bar{y})^2}\)
a \(t_\nu\) model approximates the sampling distribution of \(T\) when each sample meets the assumptions for one-sample inference and the two samples are independent
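The pieces above can be assembled by hand. The following is a sketch on simulated data (the vectors x and y are hypothetical stand-ins, not the finch measurements). The slides do not give a formula for \(\nu\); the sketch assumes the Welch-Satterthwaite approximation, which is what t.test() uses by default, and checks the hand computation against t.test().

```r
# Hypothetical data standing in for two independent samples (not the finch data)
set.seed(1)
x <- rnorm(58, mean = 9.5, sd = 1)
y <- rnorm(65, mean = 10.2, sd = 0.8)

# Standard error of each sample mean
se.x <- sd(x) / sqrt(length(x))
se.y <- sd(y) / sqrt(length(y))

# Standard error of the difference and the T statistic for delta_0 = 0
se.diff <- sqrt(se.x^2 + se.y^2)
t.stat <- (mean(x) - mean(y) - 0) / se.diff

# Welch-Satterthwaite approximation for the degrees of freedom nu
nu <- se.diff^4 / (se.x^4 / (length(x) - 1) + se.y^4 / (length(y) - 1))

# Compare with t.test(), which runs the Welch test by default
fit <- t.test(x, y, mu = 0)
c(manual.t = t.stat, r.t = unname(fit$statistic),
  manual.df = nu, r.df = unname(fit$parameter))
```

The approximation is why the degrees of freedom are usually not a whole number.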

Checking assumptions

The two-sample test is appropriate whenever two one-sample tests would be.

In other words, the test assumes that each sample either:

is sufficiently large; or
has little skew and few outliers

To check, simply inspect each histogram.

both distributions unimodal
both a bit left skewed
no extreme outliers
large sample sizes (58, 65)

Checking assumptions (alternative)

Could also check side-by-side boxplots for:

approximate symmetry of boxes
outliers far from whiskers

This is also a nice visualization of differences between samples.
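Both kinds of plot take only a few lines. This is a sketch on a simulated data frame, finch.sim, which is a hypothetical stand-in with the same column layout as finch (year, depth); the values are not the real measurements.

```r
# Hypothetical data frame with the same layout as finch; values are simulated
set.seed(2)
finch.sim <- data.frame(
  year  = rep(c('1976', '1978'), times = c(58, 65)),
  depth = c(rnorm(58, mean = 9.5, sd = 1), rnorm(65, mean = 10.2, sd = 0.9))
)

# One histogram per group, stacked for easy comparison
par(mfrow = c(2, 1))
for (yr in unique(finch.sim$year)) {
  hist(finch.sim$depth[finch.sim$year == yr],
       main = paste('Beak depth,', yr), xlab = 'depth (mm)')
}
par(mfrow = c(1, 1))

# Side-by-side boxplots using the same formula interface as t.test()
bp <- boxplot(depth ~ year, data = finch.sim, ylab = 'depth (mm)')

# Group sizes, in the order the groups are plotted
bp$n
```

The boxplot formula depth ~ year is the same one passed to t.test(), so the groups are plotted in the order the test will use.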

Interpreting outputs and results

t.test(depth ~ year, data = finch,
       mu = 0, alternative = 'less')


    Welch Two Sample t-test

data:  depth by year
t = -4.5727, df = 111.79, p-value = 6.255e-06
alternative hypothesis: true difference in means between group 1976 and group 1978 is less than 0
95 percent confidence interval:
       -Inf -0.4698812
sample estimates:
mean in group 1976 mean in group 1978 
          9.453448          10.190769

The data provide very strong evidence that mean beak depth increased following the drought (T = -4.5727 on 111.79 degrees of freedom, p < 0.0001). With 95% confidence, the mean increase is estimated to be at least 0.4699 mm, with a point estimate of 0.7373 mm (SE 0.1612).

The output is highly similar to the one-sample case, but notice:

input is a formula depth ~ year (“depth depends on year”) and data frame finch
mu now indicates hypothesized difference in means
decimal degrees of freedom
alternative is relative to the order of the groups (the factor levels, alphabetical by default)
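The standard error and confidence bound quoted in the interpretation are not printed directly, but they can be recovered from the output. A sketch, using only numbers copied from the t.test() output above (the variable names are illustrative):

```r
# Values copied from the t.test() output above
mean.1976 <- 9.453448
mean.1978 <- 10.190769
t.stat <- -4.5727
df.welch <- 111.79

# Point estimate of the difference (1976 minus 1978) and its standard error
est <- mean.1976 - mean.1978
se <- est / t.stat

# Upper bound of the one-sided 95% interval for the difference
upper <- est + qt(0.95, df.welch) * se

round(c(estimate = est, se = se, upper = upper), 4)
```

Because the difference is taken as 1976 minus 1978, an upper bound below zero on the difference corresponds to a lower bound on the increase.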

Cloud data

Does seeding clouds with silver iodide increase mean rainfall?

Data are rainfall measurements in a target area from 26 days when clouds were seeded and 26 days when clouds were not seeded.

rainfall gives volume of rainfall in acre-feet
treatment indicates whether clouds were seeded

Hypotheses to test: \[ \begin{cases} H_0: &\mu_\text{seeded} = \mu_\text{unseeded} \\ H_A: &\mu_\text{seeded} > \mu_\text{unseeded} \end{cases} \]

rainfall	treatment
334.1	seeded
489.1	seeded
200.7	seeded
40.6	seeded
21.7	unseeded
17.3	unseeded
68.5	unseeded
830.1	unseeded

Cloud data: which alternative?

Does seeding clouds with silver iodide increase mean rainfall?

t.test(rainfall ~ treatment, data = cloud, 
       mu = 0, alternative = 'less')


    Welch Two Sample t-test

data:  rainfall by treatment
t = 1.9982, df = 33.855, p-value = 0.9731
alternative hypothesis: true difference in means between group seeded and group unseeded is less than 0
95 percent confidence interval:
     -Inf 512.1582
sample estimates:
  mean in group seeded mean in group unseeded 
              441.9846               164.5885

t.test(rainfall ~ treatment, data = cloud, 
       mu = 0, alternative = 'greater')


    Welch Two Sample t-test

data:  rainfall by treatment
t = 1.9982, df = 33.855, p-value = 0.02689
alternative hypothesis: true difference in means between group seeded and group unseeded is greater than 0
95 percent confidence interval:
 42.63408      Inf
sample estimates:
  mean in group seeded mean in group unseeded 
              441.9846               164.5885

You can tell which group R considers first based on which estimate is printed first.

'greater' is interpreted as [FIRST GROUP] > [SECOND GROUP]
'less' is interpreted as [FIRST GROUP] < [SECOND GROUP]
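If the default order is inconvenient, the grouping variable can be converted to a factor with explicit levels. This is a sketch using only the eight rows shown in the table above (not the full cloud data, so the numbers differ from the output); the names cloud.ex and treatment.rev are illustrative.

```r
# Eight rows copied from the table above; illustrative only, not the full cloud data
cloud.ex <- data.frame(
  rainfall  = c(334.1, 489.1, 200.7, 40.6, 21.7, 17.3, 68.5, 830.1),
  treatment = rep(c('seeded', 'unseeded'), each = 4)
)

# Default order is alphabetical: seeded first, so 'greater' means seeded > unseeded
default.order <- t.test(rainfall ~ treatment, data = cloud.ex,
                        mu = 0, alternative = 'greater')

# Put unseeded first; the same hypothesis is now expressed with 'less'
cloud.ex$treatment.rev <- factor(cloud.ex$treatment,
                                 levels = c('unseeded', 'seeded'))
reversed.order <- t.test(rainfall ~ treatment.rev, data = cloud.ex,
                         mu = 0, alternative = 'less')

# Same p-value; the T statistic only changes sign
c(default = default.order$p.value, reversed = reversed.order$p.value)
```

Either specification tests the same hypothesis, so the choice only affects the sign of the reported difference.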

Cloud data: interpretation

Does seeding clouds with silver iodide increase mean rainfall?

The data provide moderate evidence that cloud seeding increases mean rainfall (T = 1.9982 on 33.855 degrees of freedom, p = 0.02689). With 95% confidence, seeding is estimated to increase mean rainfall by at least 42.63 acre-feet, with a point estimate of 277.4 acre-feet (SE 138.8).

Body temperatures (again)

Does mean body temperature differ between men and women?

Test \(H_0: \mu_F = \mu_M\) against \(H_A: \mu_F \neq \mu_M\)

t.test(body.temp ~ sex, data = temps, 
       mu = 0, alternative = 'two.sided')


    Welch Two Sample t-test

data:  body.temp by sex
t = 1.7118, df = 34.329, p-value = 0.09595
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 -0.09204497  1.07783444
sample estimates:
mean in group female   mean in group male 
            98.65789             98.16500

The data provide suggestive but insufficient evidence that mean body temperature differs by sex (T = 1.7118 on 34.329 degrees of freedom, p = 0.09595).

Notice: estimated difference (F - M) is 0.493 °F (SE 0.2879)

What if we had more data?

Here are estimates from two larger samples of 65 individuals each (compared with 19 and 20 before):

sex	mean.temp	se	n
female	98.39	0.09222	65
male	98.1	0.08667	65

estimated difference (F - M) is smaller: 0.2892 °F
but so is the standard error: 0.1266 (recall more data \(\longleftrightarrow\) better precision)

t.test(body.temp ~ sex, data = temps.aug, 
       mu = 0, alternative = 'two.sided')


    Welch Two Sample t-test

data:  body.temp by sex
t = 2.2854, df = 127.51, p-value = 0.02394
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 0.03881298 0.53964856
sample estimates:
mean in group female   mean in group male 
            98.39385             98.10462

The data provide moderate evidence that mean body temperature differs by sex (T = 2.2854 on 127.51 degrees of freedom, p = 0.02394).
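The standard error of the difference and the T statistic can be reproduced from the summary table and the group means printed in the output. A sketch, with illustrative variable names:

```r
# Standard errors of each group mean, copied from the summary table above
se.f <- 0.09222
se.m <- 0.08667

# Group means copied from the t.test() output above
mean.f <- 98.39385
mean.m <- 98.10462

# Standard error of the difference and the resulting T statistic (delta_0 = 0)
se.diff <- sqrt(se.f^2 + se.m^2)
t.stat <- (mean.f - mean.m - 0) / se.diff

round(c(difference = mean.f - mean.m, se = se.diff, t = t.stat), 4)
```

The smaller difference still produces a larger T statistic because the standard error shrank by more than the estimate did.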

Power calculations

How much data do you need to collect in order to detect a difference of \(\delta\)?

The statistical power of a test captures how often it detects a specified alternative.

measures how often the test correctly rejects (proportion of samples)
value depends on…
1. magnitude of difference between null value and true value of parameter, relative to the standard deviation
2. significance level
3. sample size

power.t.test(power = 0.95, 
             delta = 0.5, 
             sig.level = 0.05, 
             type = 'two.sample',
             alternative = 'two.sided')


     Two-sample t test power calculation 

              n = 104.928
          delta = 0.5
             sd = 1
      sig.level = 0.05
          power = 0.95
    alternative = two.sided

NOTE: n is number in *each* group

\(\Rightarrow\) need 105 observations in each group to detect a difference of 0.5 standard deviations for 95% of samples with a 5% significance level test
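The "proportion of samples" reading of power can be checked by simulation. This is a sketch under assumed conditions: normally distributed data with standard deviation 1 in both groups, and an arbitrary 2000 simulated studies. It uses the default Welch test, which should behave almost identically to the equal-variance test assumed by power.t.test() when group sizes and variances are equal.

```r
set.seed(3)
n <- 105        # per-group sample size from the calculation above
delta <- 0.5    # true difference in means, in standard deviation units
reps <- 2000    # number of simulated studies (arbitrary choice)

# Simulate many studies and record whether each rejects at the 5% level
rejected <- replicate(reps, {
  x <- rnorm(n, mean = delta, sd = 1)
  y <- rnorm(n, mean = 0, sd = 1)
  t.test(x, y, alternative = 'two.sided')$p.value < 0.05
})

# Proportion of simulated samples in which the test rejects
mean(rejected)
```

The result should land close to the target power of 0.95, up to simulation error.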

A statistical trap

If you collect enough data, you can detect an arbitrarily small difference in means almost always.

So keep in mind:

statistical significance \(\neq\) practical significance
always check your point estimates
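To illustrate, the sketch below computes the power to detect a difference of only 0.01 standard deviations at a few per-group sample sizes. The specific sizes and the 0.01 difference are arbitrary choices for illustration.

```r
# Power to detect a difference of 0.01 standard deviations at several sample sizes
n.per.group <- c(100, 10000, 1000000)
power <- sapply(n.per.group, function(n) {
  power.t.test(n = n, delta = 0.01, sd = 1, sig.level = 0.05,
               type = 'two.sample', alternative = 'two.sided')$power
})

data.frame(n.per.group, power)
```

With a million observations per group the test rejects almost every time, even though a difference this small would rarely matter in practice.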

Extras

The equal-variance \(t\)-test

If it is reasonable to assume the (population) standard deviations are the same in each group, one can gain a bit of power by using a different standard error:

\[SE_\text{pooled}(\bar{x} - \bar{y}) = \sqrt{\frac{\color{red}{s_p^2}}{n_x} + \frac{\color{red}{s_p^2}}{n_y}} \quad\text{where}\quad \color{red}{s_p^2} = \underbrace{\frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}}_{\text{weighted average of } s_x^2 \;\&\; s_y^2}\]

Implement by adding var.equal = TRUE as an argument to t.test().

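a larger df is used (\(n_x + n_y - 2\), at least as large as the Welch df), hence more frequent rejections
avoid unless you have small samples and equal standard deviations are plausible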
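The pooled computation can be checked by hand. This is a sketch on simulated small samples with equal population standard deviations; the vectors x and y and their sizes are hypothetical.

```r
# Hypothetical small samples (simulated, equal population standard deviations)
set.seed(4)
x <- rnorm(8, mean = 10, sd = 2)
y <- rnorm(12, mean = 12, sd = 2)
n.x <- length(x)
n.y <- length(y)

# Pooled standard deviation and standard error, following the formula above
s.p <- sqrt(((n.x - 1) * var(x) + (n.y - 1) * var(y)) / (n.x + n.y - 2))
se.pooled <- sqrt(s.p^2 / n.x + s.p^2 / n.y)
t.manual <- (mean(x) - mean(y)) / se.pooled

# Equal-variance test versus the default Welch test
pooled <- t.test(x, y, var.equal = TRUE)
welch  <- t.test(x, y)

c(manual.t = t.manual, pooled.t = unname(pooled$statistic),
  pooled.df = unname(pooled$parameter), welch.df = unname(welch$parameter))
```

The pooled test always uses \(n_x + n_y - 2\) degrees of freedom, while the Welch degrees of freedom can be no larger than that.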