library(tidyverse)
load('data/finch.RData')
load('data/temps2.RData')
Lab 6: Two-sample inference
This lab focuses on two-sample inference for differences in population means. We’ll use two datasets that lend themselves to two-sample comparisons:
- finch: mean finch beak depths in generations before and after a drought on Daphne Major
- temps: body temperatures and heart rates for men and women
Examples will use the finch data; you’ll practice using the temps data.
Checking assumptions
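A two-sample t-test carries distributional assumptions about each sample, so we should inspect the two samples separately before testing.
To do that, we’ll need to separate the samples. This can be done by partitioning observations of beak depth by year.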
# split observations by year
f.year <- finch$year
f.depth <- finch$depth
f.split <- split(f.depth, f.year)
# retrieve depth measurements from each year
depth.1978 <- f.split$`1978`
depth.1976 <- f.split$`1976`
We could then check assumptions by comparing histograms:
# make histograms
hist(depth.1978)
hist(depth.1976)
While there is a bit of left skewness, the sample sizes are large enough that it’s not a concern.
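As a rough numerical complement to the histograms, the sketch below checks the sample size in each group and compares each group's mean to its median (a large gap between the two is one informal sign of skewness). The data frame toy and its values are simulated stand-ins, not the finch data, and the names toy.split, toy.n, and toy.summary are made up for illustration.

```r
# simulated stand-in for the finch data (values are made up for illustration)
set.seed(1)
toy <- data.frame(
  year  = rep(c(1976, 1978), times = c(60, 50)),
  depth = c(rnorm(60, mean = 9.5, sd = 1), rnorm(50, mean = 10.2, sd = 0.9))
)

# split depth by year, as in the example above
toy.split <- split(toy$depth, toy$year)

# sample size in each group
toy.n <- lengths(toy.split)
toy.n

# mean and median in each group; a large gap hints at skewness
toy.summary <- sapply(toy.split, function(x) c(mean = mean(x), median = median(x)))
toy.summary
```

The same calls should work on f.split from the finch example, assuming it was created as shown earlier.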
Partition the temps data by sex and make histograms of the heart rates. Comment on whether assumptions seem to be met.
# split observations of heart rate by sex
hr <- temps$heart.rate
sex <- temps$sex
hr.split <- split(hr, sex)
# retrieve heart rate measurements from each group
hr.m <- hr.split$male
hr.f <- hr.split$female
# make histograms and compare
hist(hr.f)
hist(hr.m)
Sample sizes are large and both distributions are unimodal, so the test assumptions seem reasonable.
Exploratory plots
When comparing histograms, it’s difficult to judge if there appears to be a difference between samples/groups. The eye has to jump back and forth, and the centers are not always visually obvious.
Side-by-side boxplots provide a nice alternative that makes it easy to compare the groups for location differences (and thus different means). This can be done with the original dataset (no need to partition):
# side-by-side boxplots
boxplot(depth ~ year, data = finch, horizontal = TRUE)
It looks like there is a clear difference – beak depths are greater in 1978 – so now the question is simply whether that difference is statistically significant relative to the sampling variation in our estimates.
As an aside, the syntax y ~ x is called a formula in R. You can read it verbally as “y depends on x”; in the above example, we would read depth ~ year as saying “depth depends on year”.
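Formulas are not specific to boxplot(...); other base R functions accept the same syntax. As a small sketch, aggregate(...) can compute a summary of the left-hand variable within each level of the right-hand variable. The data frame toy below is a made-up example, not the finch data.

```r
# small made-up data frame for illustration
toy <- data.frame(
  year  = c(1976, 1976, 1976, 1978, 1978, 1978),
  depth = c(9.1, 9.5, 9.8, 10.0, 10.3, 10.6)
)

# mean depth within each year, using the same formula syntax
toy.means <- aggregate(depth ~ year, data = toy, FUN = mean)
toy.means
```

The result is a data frame with one row per year and the group mean in the depth column.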
Make side-by-side boxplots for heart rate and reassess test assumptions.
# side-by-side boxplots
boxplot(heart.rate ~ sex, data = temps, horizontal = TRUE)
Test assumptions still seem reasonable; it’s not clear that there’s much of a difference by sex, so it would be a little surprising if a statistical test rejected the hypothesis of no difference.
Two-sample t-tests
To test whether the drought imposed selection pressure on the finch population, we want to know whether mean finch beak depth increased after the drought, i.e.,
$$H_0: \mu_{1976} - \mu_{1978} = 0 \quad \text{vs.} \quad H_A: \mu_{1976} - \mu_{1978} < 0$$
We can perform the test using t.test(...) with a formula in which the variable of interest is on the left and the grouping variable is on the right.
# perform t test
t.test(formula = depth ~ year,
       data = finch,
       mu = 0,
       alternative = 'less',
       conf.level = 0.95)
Welch Two Sample t-test
data: depth by year
t = -4.5727, df = 111.79, p-value = 6.255e-06
alternative hypothesis: true difference in means between group 1976 and group 1978 is less than 0
95 percent confidence interval:
-Inf -0.4698812
sample estimates:
mean in group 1976 mean in group 1978
9.453448 10.190769
Notice two subtleties:
- we need to supply a data argument; the formula won’t work if R doesn’t know where to find the variables of interest
- the alternative is specified as less; this is because the first group in the data is 1976, so the alternative reads “mean in 1976 is less than mean in 1978”; to determine which group comes first, look at which point estimate is printed first in the output
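To see how group order interacts with the alternative, here is a minimal sketch on simulated data (the data frame toy and the column year.rev are made up for illustration). It relies on the grouping variable being converted to a factor, so the first group is the first factor level; reversing the levels flips the sign of the difference, and the same hypothesis then calls for 'greater' instead of 'less'.

```r
# simulated stand-in data (not the finch data)
set.seed(2)
toy <- data.frame(
  year  = rep(c(1976, 1978), each = 30),
  depth = c(rnorm(30, mean = 9.5), rnorm(30, mean = 10.2))
)

# default group order: sorted levels, so 1976 comes first
levels(factor(toy$year))

# first group is 1976, so 'less' reads "mean in 1976 < mean in 1978"
tt.less <- t.test(depth ~ year, data = toy, alternative = 'less')

# reverse the level order; the same hypothesis now needs 'greater'
toy$year.rev <- factor(toy$year, levels = c(1978, 1976))
tt.greater <- t.test(depth ~ year.rev, data = toy, alternative = 'greater')

# the two p-values agree
c(less = tt.less$p.value, greater = tt.greater$p.value)
```

Checking levels(factor(...)) before running the test is an alternative to reading the group order off the printed estimates.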
The point estimates and standard error can be retrieved by storing the output of t.test(...).
# store t test result
tt.rslt <- t.test(formula = depth ~ year,
                  data = finch,
                  mu = 0,
                  alternative = 'less',
                  conf.level = 0.95)
# print results
tt.rslt
Welch Two Sample t-test
data: depth by year
t = -4.5727, df = 111.79, p-value = 6.255e-06
alternative hypothesis: true difference in means between group 1976 and group 1978 is less than 0
95 percent confidence interval:
-Inf -0.4698812
sample estimates:
mean in group 1976 mean in group 1978
9.453448 10.190769
# estimates
tt.rslt$estimate
mean in group 1976 mean in group 1978
9.453448 10.190769
# estimate for difference in means
tt.rslt$estimate |> diff()
mean in group 1978
0.737321
# standard error for estimate of difference in means
tt.rslt$stderr
[1] 0.1612445
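To connect the stored quantities to the test output, the sketch below recomputes the Welch standard error, test statistic, degrees of freedom, one-sided p-value, and upper confidence bound by hand and compares them with t.test(...). The vectors x and y are simulated stand-ins for the two groups, and all the .hand names are made up for illustration.

```r
# simulated stand-ins for the two groups (not the finch data)
set.seed(3)
x <- rnorm(40, mean = 9.5, sd = 1.0)
y <- rnorm(35, mean = 10.2, sd = 0.8)

tt.toy <- t.test(x, y, mu = 0, alternative = 'less', conf.level = 0.95)

# variance of each sample mean
v1 <- var(x) / length(x)
v2 <- var(y) / length(y)

# Welch standard error for the difference in means
se.hand <- sqrt(v1 + v2)

# test statistic: (difference in sample means - 0) / standard error
t.hand <- (mean(x) - mean(y)) / se.hand

# Welch-Satterthwaite approximate degrees of freedom
df.hand <- (v1 + v2)^2 / (v1^2 / (length(x) - 1) + v2^2 / (length(y) - 1))

# lower-tail p-value, matching alternative = 'less'
p.hand <- pt(t.hand, df = df.hand)

# upper bound of the one-sided 95% interval
upper.hand <- (mean(x) - mean(y)) + qt(0.95, df = df.hand) * se.hand

c(se = se.hand, t = t.hand, df = df.hand, p = p.hand, upper = upper.hand)
```

Each hand-computed value should match the corresponding component of tt.toy (stderr, statistic, parameter, p.value, and the second element of conf.int).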
We’d report the test result as follows:
The data provide evidence that mean beak depth increased in the generation of finches following the drought (T = -4.5727 on 111.79 degrees of freedom, p < 0.0001). With 95% confidence, the mean beak depth is estimated to have increased by at least 0.4699 mm, with a point estimate of 0.7373 mm (SE 0.1612).
Test whether mean heart rate differs between men and women at the 1% significance level. (Make sure your interval estimate is consistent with the level and alternative of your test.) Report the test result, confidence interval, and point estimate and standard error for the difference in means.
# store t test result
tt.rslt <- t.test(formula = heart.rate ~ sex,
                  data = temps,
                  mu = 0,
                  alternative = 'two.sided',
                  conf.level = 0.99)
# print results
tt.rslt
Welch Two Sample t-test
data: heart.rate by sex
t = 0.63191, df = 116.7, p-value = 0.5287
alternative hypothesis: true difference in means between group female and group male is not equal to 0
99 percent confidence interval:
-2.466825 4.036055
sample estimates:
mean in group female mean in group male
74.15385 73.36923
# estimates
tt.rslt$estimate
mean in group female mean in group male
74.15385 73.36923
# estimate for difference in means
tt.rslt$estimate |> diff()
mean in group male
-0.7846154
# standard error for estimate of difference in means
tt.rslt$stderr
[1] 1.241665
The data provide no evidence that mean heart rate differs by sex (T = 0.63 on 116.7 degrees of freedom, p = 0.5287). With 99% confidence, the difference in mean heart rate (women minus men) is estimated to be between -2.47 and 4.04 bpm, with a point estimate of 0.78 bpm (SE 1.24).
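The reminder to keep the interval consistent with the test can be checked directly: for a two-sided test at level alpha, rejecting the null is equivalent to the (1 - alpha) confidence interval excluding zero. The sketch below illustrates this on simulated data; the vectors a and b and the names alpha, reject, and ci.excludes.zero are made up for illustration and are not the temps data.

```r
# simulated stand-ins for two groups (not the temps data)
set.seed(4)
a <- rnorm(65, mean = 74, sd = 8)
b <- rnorm(65, mean = 73, sd = 6)

# significance level and matching confidence level
alpha <- 0.01
tt.toy <- t.test(a, b, mu = 0, alternative = 'two.sided', conf.level = 1 - alpha)

# decision based on the p-value
reject <- tt.toy$p.value < alpha

# decision based on whether the interval excludes zero
ci.excludes.zero <- tt.toy$conf.int[1] > 0 | tt.toy$conf.int[2] < 0

# the two decisions agree
c(reject = reject, ci.excludes.zero = ci.excludes.zero)
```

If conf.level were left at its default of 0.95 while testing at the 1% level, the two decisions would no longer be guaranteed to agree.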