Two-sample inference

Hypothesis tests and intervals for comparing two population means

Today’s agenda

  1. [lecture] two-sample inference for means
  2. [lab] two-sample \(t\) tests in R
  3. [test prep] practice problems

From last time

Practice problem: test whether actual body weight exceeds desired body weight.

subject actual desired difference
1 265 225 40
2 150 150 0
3 137 150 -13
4 159 125 34
5 145 125 20

weight.diffs <- brfss$weight - brfss$wtdesire
t.test(weight.diffs, 
       mu = 0, 
       alternative = 'greater')

    One Sample t-test

data:  weight.diffs
t = 4.2172, df = 59, p-value = 4.311e-05
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
 10.99824      Inf
sample estimates:
mean of x 
 18.21667 

The data provide very strong evidence that the average U.S. adult’s actual weight exceeds their desired weight (T = 4.2172 on 59 degrees of freedom, p < 0.0001).

Inference is on the mean difference: \(H_0: \delta = 0\) vs. \(H_A: \delta > 0\).
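
To see that this really is inference on the mean difference, here is a small sketch using only the five rows shown in the table above (not the full 60-observation sample, so the numbers will not match the output above). It compares a one-sample test on the differences with the paired = TRUE option of t.test(). The vector names are illustrative.

```r
# Only the five subjects displayed in the table above; the full brfss
# sample has 60 observations, so these results are for illustration only.
actual  <- c(265, 150, 137, 159, 145)
desired <- c(225, 150, 150, 125, 125)
diffs   <- actual - desired

# One-sample test on the differences
fit.diff <- t.test(diffs, mu = 0, alternative = 'greater')

# Paired test on the two columns: same hypotheses, same arithmetic
fit.paired <- t.test(actual, desired, paired = TRUE, alternative = 'greater')

mean(diffs)             # sample mean difference
fit.diff$statistic      # T statistic from the one-sample call
fit.paired$statistic    # T statistic from the paired call
```

Both calls should report the same statistic and p-value, because a paired test is a one-sample test on the within-subject differences.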

Can we also do inference on a difference in means?

Evolution of Darwin’s finches

Peter and Rosemary Grant caught and measured birds from more than 20 generations of finches on Daphne Major.

  • severe drought in 1977 limited food to large tough seeds

  • selection pressure favoring larger and stronger beaks

  • hypothesis: beak depth increased in 1978 relative to 1976

year depth
1976 10.8
1976 7.4
1978 11.4
1978 10.6

To answer this, we need to test a hypothesis involving two means:

\[ \begin{cases} H_0: &\mu_{1976} = \mu_{1978} \\ H_A: &\mu_{1976} < \mu_{1978} \end{cases} \]

  • can’t do inference on a mean difference here (no pairing of observations)
  • treat each year as an independent sample

Two-sample inference

If \(x_1, \dots, x_{58}\) are the 1976 observations and \(y_1, \dots, y_{65}\) are the 1978 observations:

  • \(\bar{x}\) is a point estimate for \(\mu_{1976}\) with standard error \(SE(\bar{x}) = \frac{s_x}{\sqrt{n_x}}\), where \(n_x = 58\)
  • \(\bar{y}\) is a point estimate for \(\mu_{1978}\) with standard error \(SE(\bar{y}) = \frac{s_y}{\sqrt{n_y}}\), where \(n_y = 65\)

Inference uses a new \(T\) statistic:

\[ T = \frac{\bar{x} - \bar{y} - \delta_0}{SE(\bar{x} - \bar{y})} \]

  • \(\delta_0\) is the hypothesized difference in means
  • \(SE(\bar{x} - \bar{y}) = \sqrt{SE(\bar{x})^2 + SE(\bar{y})^2}\)
  • \(t_\nu\) model approximates the sampling distribution when each sample meets assumptions for one-sample inference
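
The statistic above can be computed by hand and checked against t.test(). The sketch below uses simulated data (the means and standard deviations passed to rnorm() are made up, not the finch measurements) and assumes the Welch-Satterthwaite approximation for \(\nu\), which is what t.test() uses by default.

```r
set.seed(1)
# Simulated stand-ins for the two samples (hypothetical parameters)
x <- rnorm(58, mean = 9.5, sd = 1.0)
y <- rnorm(65, mean = 10.2, sd = 0.9)
n.x <- length(x)
n.y <- length(y)

# Standard errors of each sample mean and of the difference
se.x    <- sd(x) / sqrt(n.x)
se.y    <- sd(y) / sqrt(n.y)
se.diff <- sqrt(se.x^2 + se.y^2)

# T statistic with hypothesized difference delta0 = 0
delta0 <- 0
t.stat <- (mean(x) - mean(y) - delta0) / se.diff

# Welch-Satterthwaite degrees of freedom (generally not an integer)
nu <- se.diff^4 / (se.x^4 / (n.x - 1) + se.y^4 / (n.y - 1))

# Lower-tailed p-value for H_A: mu_x < mu_y
p.val <- pt(t.stat, df = nu)

# Compare with the built-in test
fit <- t.test(x, y, mu = 0, alternative = 'less')
c(by.hand = t.stat, t.test = unname(fit$statistic))
```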

Checking assumptions

The two-sample test is appropriate whenever two one-sample tests would be.

In other words, the test assumes that both samples are either:

  • sufficiently large; or
  • have little skew and few outliers

To check, inspect a histogram of each sample. For the finch data:

  • both distributions unimodal
  • both a bit left skewed
  • no extreme outliers
  • large sample sizes (58, 65)

Checking assumptions (alternative)

Could also check side-by-side boxplots for:

  • approximate symmetry of boxes
  • outliers far from whiskers

This is also a nice visualization of differences between samples.
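
A minimal sketch of both checks in base R follows. The data frame finch.sim is simulated for illustration (it is not the finch data), with columns named year and depth to mirror the table shown earlier.

```r
set.seed(1976)
# Simulated stand-in for the finch data (hypothetical values)
finch.sim <- data.frame(
  year  = rep(c(1976, 1978), times = c(58, 65)),
  depth = c(rnorm(58, mean = 9.5, sd = 1.0),
            rnorm(65, mean = 10.2, sd = 0.9))
)

# One histogram per sample: look for strong skew or extreme outliers
par(mfrow = c(1, 2))
hist(finch.sim$depth[finch.sim$year == 1976],
     main = '1976', xlab = 'beak depth (mm)')
hist(finch.sim$depth[finch.sim$year == 1978],
     main = '1978', xlab = 'beak depth (mm)')
par(mfrow = c(1, 1))

# Side-by-side boxplots: look at box symmetry and points beyond the whiskers
bp <- boxplot(depth ~ year, data = finch.sim, ylab = 'beak depth (mm)')
bp$n   # sample size in each group
```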

Interpreting outputs and results

t.test(depth ~ year, data = finch,
       mu = 0, alternative = 'less')

    Welch Two Sample t-test

data:  depth by year
t = -4.5727, df = 111.79, p-value = 6.255e-06
alternative hypothesis: true difference in means between group 1976 and group 1978 is less than 0
95 percent confidence interval:
       -Inf -0.4698812
sample estimates:
mean in group 1976 mean in group 1978 
          9.453448          10.190769 

The data provide very strong evidence that mean beak depth increased following the drought (T = -4.5727 on 111.79 degrees of freedom, p < 0.0001). With 95% confidence, the mean increase is estimated to be at least 0.4699 mm, with a point estimate of 0.7373 (SE 0.1612).

The output closely resembles the one-sample case, but notice:

  • input is a formula depth ~ year (“depth depends on year”) and data frame finch
  • mu now indicates hypothesized difference in means
  • decimal degrees of freedom
  • alternative is relative to the order in which groups appear
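
The point estimate, standard error, and confidence bound quoted in the interpretation above are not all printed directly, but they can be recovered from the output. The sketch below copies the printed values by hand; the variable names are illustrative.

```r
# Values copied from the t.test() output above
mean.1976 <- 9.453448
mean.1978 <- 10.190769
t.stat    <- -4.5727
df        <- 111.79

# Estimated increase in mean beak depth (1978 minus 1976)
est <- mean.1978 - mean.1976

# T = (difference - 0) / SE, so SE = |difference| / |T|
se <- est / abs(t.stat)

# One-sided 95% lower confidence bound for the increase
lower <- est - qt(0.95, df) * se

round(c(estimate = est, se = se, lower = lower), 4)
```

Because t.test() reports the difference as 1976 minus 1978, the printed upper bound of -0.4699 corresponds to a lower bound of 0.4699 on the increase.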

Cloud data

Does seeding clouds with silver iodide increase mean rainfall?

Data are rainfall measurements in a target area from 26 days when clouds were seeded and 26 days when clouds were not seeded.

  • rainfall gives volume of rainfall in acre-feet
  • treatment indicates whether clouds were seeded

Hypotheses to test: \[ \begin{cases} H_0: &\mu_\text{seeded} = \mu_\text{unseeded} \\ H_A: &\mu_\text{seeded} > \mu_\text{unseeded} \end{cases} \]

rainfall treatment
334.1 seeded
489.1 seeded
200.7 seeded
40.6 seeded
21.7 unseeded
17.3 unseeded
68.5 unseeded
830.1 unseeded

Cloud data: which alternative?

Does seeding clouds with silver iodide increase mean rainfall?

t.test(rainfall ~ treatment, data = cloud, 
       mu = 0, alternative = 'less')

    Welch Two Sample t-test

data:  rainfall by treatment
t = 1.9982, df = 33.855, p-value = 0.9731
alternative hypothesis: true difference in means between group seeded and group unseeded is less than 0
95 percent confidence interval:
     -Inf 512.1582
sample estimates:
  mean in group seeded mean in group unseeded 
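              441.9846               164.5885 

The 'less' alternative tests seeded < unseeded, which is the reverse of the hypothesis of interest; hence the large p-value. The intended alternative is 'greater':

t.test(rainfall ~ treatment, data = cloud, 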
       mu = 0, alternative = 'greater')

    Welch Two Sample t-test

data:  rainfall by treatment
t = 1.9982, df = 33.855, p-value = 0.02689
alternative hypothesis: true difference in means between group seeded and group unseeded is greater than 0
95 percent confidence interval:
 42.63408      Inf
sample estimates:
  mean in group seeded mean in group unseeded 
              441.9846               164.5885 

You can tell which group R considers first based on which estimate is printed first.

  • 'greater' is interpreted as [FIRST GROUP] > [SECOND GROUP]
  • 'less' is interpreted as [FIRST GROUP] < [SECOND GROUP]
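
The group order comes from the factor levels of the grouping variable, which are alphabetical by default for character data. If a different order is more natural, one option is to set the levels explicitly and flip the alternative to match. The sketch below uses simulated data (cloud.sim and its values are hypothetical, not the cloud seeding measurements).

```r
set.seed(2)
# Simulated stand-in for the cloud data (hypothetical values)
cloud.sim <- data.frame(
  rainfall  = c(rexp(26, rate = 1 / 440), rexp(26, rate = 1 / 165)),
  treatment = rep(c('seeded', 'unseeded'), each = 26)
)

# Default order is alphabetical: "seeded" comes before "unseeded"
levels(factor(cloud.sim$treatment))

# H_A: seeded > unseeded, with seeded as the first group
fit.default <- t.test(rainfall ~ treatment, data = cloud.sim,
                      mu = 0, alternative = 'greater')

# Same hypothesis with the group order reversed: now the first group is
# unseeded, so the alternative must be 'less'
cloud.sim$treatment.rev <- factor(cloud.sim$treatment,
                                  levels = c('unseeded', 'seeded'))
fit.reversed <- t.test(rainfall ~ treatment.rev, data = cloud.sim,
                       mu = 0, alternative = 'less')

c(default = fit.default$p.value, reversed = fit.reversed$p.value)
```

The two p-values should agree, and the T statistics should differ only in sign.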

Cloud data: interpretation

Does seeding clouds with silver iodide increase mean rainfall?

t.test(rainfall ~ treatment, data = cloud, 
       mu = 0, alternative = 'greater')

    Welch Two Sample t-test

data:  rainfall by treatment
t = 1.9982, df = 33.855, p-value = 0.02689
alternative hypothesis: true difference in means between group seeded and group unseeded is greater than 0
95 percent confidence interval:
 42.63408      Inf
sample estimates:
  mean in group seeded mean in group unseeded 
              441.9846               164.5885 

The data provide moderate evidence that cloud seeding increases mean rainfall (T = 1.9982 on 33.855 degrees of freedom, p = 0.02689). With 95% confidence, seeding is estimated to increase mean rainfall by at least 42.63 acre-feet, with a point estimate of 277.4 (SE 138.8199).

Body temperatures (again)

Does mean body temperature differ between men and women?

Test \(H_0: \mu_F = \mu_M\) against \(H_A: \mu_F \neq \mu_M\)

t.test(body.temp ~ sex, data = temps, 
       mu = 0, alternative = 'two.sided')

    Welch Two Sample t-test

data:  body.temp by sex
t = 1.7118, df = 34.329, p-value = 0.09595
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 -0.09204497  1.07783444
sample estimates:
mean in group female   mean in group male 
            98.65789             98.16500 

The data provide suggestive but insufficient evidence that mean body temperature differs by sex (T = 1.7118 on 34.329 degrees of freedom, p = 0.09595).

Notice: estimated difference (F - M) is 0.493 °F (SE 0.2879)

What if we had more data?

Here are estimates from two larger samples of 65 individuals each (compared with 19, 20):

sex mean.temp se n
female 98.39 0.09222 65
male 98.10 0.08667 65

  • the estimated difference (F - M) is smaller: 0.2892 °F
  • but so is its standard error: 0.1266 (recall more data \(\longleftrightarrow\) better precision)

t.test(body.temp ~ sex, data = temps.aug, 
       mu = 0, alternative = 'two.sided')

    Welch Two Sample t-test

data:  body.temp by sex
t = 2.2854, df = 127.51, p-value = 0.02394
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 0.03881298 0.53964856
sample estimates:
mean in group female   mean in group male 
            98.39385             98.10462 

The data provide moderate evidence that mean body temperature differs by sex (T = 2.29 on 127.51 degrees of freedom, p = 0.02394).
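
The standard error of the difference and the T statistic for the larger samples can be recovered from the group summaries alone. This sketch copies the per-group standard errors from the summary table and the group means from the output; the variable names are illustrative.

```r
# Per-group standard errors from the summary table
se.f <- 0.09222
se.m <- 0.08667

# Group means from the t.test() output
mean.f <- 98.39385
mean.m <- 98.10462

est     <- mean.f - mean.m           # estimated difference (F - M)
se.diff <- sqrt(se.f^2 + se.m^2)     # SE of the difference in means
t.stat  <- (est - 0) / se.diff       # T statistic under H_0: no difference

round(c(estimate = est, se = se.diff, t = t.stat), 4)
```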

Power calculations

How much data do you need to collect in order to detect a difference of \(\delta\)?

The statistical power of a test captures how often it detects a specified alternative.

  • measures the proportion of samples in which the test correctly rejects \(H_0\) when that alternative is true

  • value depends on…

    1. magnitude of difference between null value and true value of parameter
    2. significance level
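    3. sample size

power.t.test() solves for whichever quantity is left unspecified (here, the per-group sample size n); sd defaults to 1, so delta is in standard deviation units:

power.t.test(power = 0.95, 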
             delta = 0.5, 
             sig.level = 0.05, 
             type = 'two.sample',
             alternative = 'two.sided')

     Two-sample t test power calculation 

              n = 104.928
          delta = 0.5
             sd = 1
      sig.level = 0.05
          power = 0.95
    alternative = two.sided

NOTE: n is number in *each* group

\(\Rightarrow\) need 105 observations in each group to detect a difference of 0.5 standard deviations for 95% of samples with a 5% significance level test
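
The same function can be run in the other direction: fix n and solve for power. The sketch below does this for a few per-group sample sizes (the grid of values is arbitrary) to show how power grows with n when the difference is 0.5 standard deviations.

```r
# Arbitrary grid of per-group sample sizes
n.grid <- c(20, 50, 105, 200)

# Power of a two-sided, 5% level two-sample t test at each sample size
power.grid <- sapply(n.grid, function(n) {
  power.t.test(n = n, delta = 0.5, sd = 1, sig.level = 0.05,
               type = 'two.sample', alternative = 'two.sided')$power
})
round(setNames(power.grid, n.grid), 3)

# Solving for n instead, then rounding up to a whole observation
n.needed <- power.t.test(power = 0.95, delta = 0.5, sd = 1, sig.level = 0.05,
                         type = 'two.sample', alternative = 'two.sided')$n
ceiling(n.needed)
```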

A statistical trap

With enough data, a test will almost always detect even an arbitrarily small difference in means.

So keep in mind:

  • statistical significance \(\neq\) practical significance
  • always check your point estimates
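
A small simulation makes the point concrete. The sketch below is illustrative only: it draws two very large samples whose true means differ by 0.01 standard deviations (values chosen arbitrarily) and runs the default two-sample test.

```r
set.seed(3)
n <- 1e6   # observations per group (deliberately huge)

# True means differ by only 0.01 standard deviations
x <- rnorm(n, mean = 0.01, sd = 1)
y <- rnorm(n, mean = 0.00, sd = 1)

fit <- t.test(x, y, mu = 0, alternative = 'two.sided')

# Estimated difference in means and its p-value
est.diff <- unname(fit$estimate[1] - fit$estimate[2])
c(estimate = est.diff, p.value = fit$p.value)
```

The p-value is very small even though the estimated difference is negligible for most practical purposes, which is why the point estimate should always be reported alongside the test.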

Extras

The equal-variance \(t\)-test

If it is reasonable to assume the (population) standard deviations are the same in each group, one can gain a bit of power by using a different standard error:

\[SE_\text{pooled}(\bar{x} - \bar{y}) = \sqrt{\frac{\color{red}{s_p^2}}{n_x} + \frac{\color{red}{s_p^2}}{n_y}} \quad\text{where}\quad \color{red}{s_p} = \underbrace{\sqrt{\frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}}}_{\text{weighted average of } s_x^2 \;\&\; s_y^2}\]

Implement by adding var.equal = TRUE as an argument to t.test().

  • larger df is used, hence more frequent rejections
  • avoid unless you have a small sample
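
The pooled standard error can be computed by hand and compared with both versions of the test. The sketch below uses two small simulated samples with a common standard deviation (all values are hypothetical).

```r
set.seed(4)
# Two small simulated samples with equal population SDs (hypothetical)
x <- rnorm(12, mean = 10, sd = 2)
y <- rnorm(15, mean = 11, sd = 2)
n.x <- length(x)
n.y <- length(y)

# Pooled standard deviation: weighted average of the sample variances
s.p <- sqrt(((n.x - 1) * var(x) + (n.y - 1) * var(y)) / (n.x + n.y - 2))

# Pooled standard error and the resulting T statistic
se.pooled <- sqrt(s.p^2 / n.x + s.p^2 / n.y)
t.pooled  <- (mean(x) - mean(y)) / se.pooled

# Built-in versions: equal-variance and Welch (the default)
fit.pooled <- t.test(x, y, var.equal = TRUE)
fit.welch  <- t.test(x, y)

# Degrees of freedom: pooled uses n.x + n.y - 2; Welch is never larger
c(pooled = unname(fit.pooled$parameter), welch = unname(fit.welch$parameter))
```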