[lecture] omnibus \(F\) test for the one-way ANOVA model
[lab] fitting ANOVA models in R
Statistical power
\(p\)-values and type I errors
A \(p\)-value captures how often you'd mistakenly reject if \(H_0\) were true.
t.test(rainfall ~ treatment, data = cloud, mu = 0, alternative = 'greater')
Welch Two Sample t-test
data: rainfall by treatment
t = 1.9982, df = 33.855, p-value = 0.02689
alternative hypothesis: true difference in means between group seeded and group unseeded is greater than 0
95 percent confidence interval:
42.63408 Inf
sample estimates:
mean in group seeded mean in group unseeded
441.9846 164.5885
If there is no effect of cloud seeding, then we would see \(T > 1.9982\) for 2.69% of samples.
While unlikely, our sample could have been among that 2.69%.
By rejecting here, we are willing to be wrong 2.69% of the time.
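As a quick check, that 2.69% is just the upper-tail probability of the reference \(t\) distribution, computed from the statistic and degrees of freedom in the output above:

```r
# tail probability of T > 1.9982 under H0, from the reference t distribution
pt(1.9982, df = 33.855, lower.tail = FALSE)  # about 0.0269, the p-value above
```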
Type II errors
But you can also make a mistake when \(H_0\) is false!
t.test(rainfall ~ treatment, data = cloud, mu = 0, alternative = 'greater')
Welch Two Sample t-test
data: rainfall by treatment
t = 1.9982, df = 33.855, p-value = 0.02689
alternative hypothesis: true difference in means between group seeded and group unseeded is greater than 0
95 percent confidence interval:
42.63408 Inf
sample estimates:
mean in group seeded mean in group unseeded
441.9846 164.5885
Suppose we test at the 1% level.
The test fails to reject (\(p > 0.01\)), but that doesn’t completely rule out \(H_A\).
The estimated effect, an increase of 277.4 acre-feet, may simply be too small to detect relative to the variability in rainfall.
Simulating type II errors
Summary stats for cloud data:
| treatment | mean  | sd    | n  |
|-----------|-------|-------|----|
| seeded    | 442   | 650.8 | 26 |
| unseeded  | 164.6 | 278.4 | 26 |
We can approximate the type II error rate by:
simulating datasets with matching statistics
performing upper-sided tests of no difference
computing the proportion of fail-to-reject decisions
type2sim(delta = 277, n = 26, sd = 650.8, alpha = 0.01, nsim = 1000)
If in fact the effect size is exactly 277, a level 1% test with similar data will fail to reject 80.3% of the time!
type2sim(delta = 100, n = 26, sd = 650.8, alpha = 0.01, nsim = 1000)
If in fact the effect size is exactly 100, a level 1% test with similar data will fail to reject 96.3% of the time.
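`type2sim()` is a course helper rather than a base R function; a minimal sketch of what such a simulation might look like, assuming normally distributed rainfall within each group, is:

```r
# sketch of a type II error simulation (assumes normal data within groups;
# the actual type2sim() used above may differ in details)
type2sim_sketch <- function(delta, n, sd, alpha = 0.01, nsim = 1000) {
  fails <- replicate(nsim, {
    seeded   <- rnorm(n, mean = delta, sd = sd)  # group with the assumed effect
    unseeded <- rnorm(n, mean = 0, sd = sd)      # baseline group
    # upper-sided Welch test of no difference; TRUE = fail to reject
    t.test(seeded, unseeded, alternative = "greater")$p.value > alpha
  })
  mean(fails)  # proportion of fail-to-reject decisions
}

type2sim_sketch(delta = 277, n = 26, sd = 650.8)
```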
Statistical power
The power of a test refers to its true rejection rate under an alternative and is defined as: \[\beta = \underbrace{(1 - \text{type II error rate})}_\text{correct decision rate when alternative is true}\]
Power is often interpreted as a detection rate:
high type II error \(\longrightarrow\) low power \(\longrightarrow\) low detection rate
low type II error \(\longrightarrow\) high power \(\longrightarrow\) high detection rate
Power depends on the exact alternative scenario:
low for alternatives close to the hypothetical value (e.g., \(\delta_0 = 0\))
high for alternatives far from the hypothetical value
Power curves
Power is usually represented as a curve depending on the true difference.
Power curves for the test applied to the cloud data:
Assumptions:
significance level \(\alpha = 0.05\)
population standard deviation \(\sigma = 650\) (larger of two group estimates)
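A curve like this can be traced with `power.t.test()` by evaluating power over a grid of true differences. The sketch below uses the assumptions above plus 26 observations per group (as in the cloud data); the grid of differences is chosen here for illustration:

```r
# power of the one-sided two-sample test as a function of the true difference
deltas <- seq(0, 600, by = 25)  # candidate true differences (acre-feet)
power <- sapply(deltas, function(d) {
  power.t.test(n = 26, delta = d, sd = 650, sig.level = 0.05,
               type = "two.sample", alternative = "one.sided")$power
})
plot(deltas, power, type = "l", xlab = "true difference", ylab = "power")
```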
Which sample size achieves a specific \(\beta, \delta\)?
Sample size calculation
If you were (re)designing the study, how much data should you collect to detect a specified effect size?
To detect a difference of 250 or more due to cloud seeding with power 0.9:
power.t.test(power = 0.9,        # target power level
             delta = 250,        # smallest difference
             sd = 650,           # largest population SD
             sig.level = 0.01,
             type = 'two.sample',
             alternative = 'one.sided')
Two-sample t test power calculation
n = 177.349
delta = 250
sd = 650
sig.level = 0.01
power = 0.9
alternative = one.sided
NOTE: n is number in *each* group
For a conservative estimate, use:
an overestimate of the larger of the two standard deviations
the minimum difference of interest
\(\Longrightarrow\) we need at least 177 observations in each group to detect a difference of 250 or more at least 90% of the time
Revisiting body temperatures
Does mean body temperature differ between men and women?
Test \(H_0: \mu_F = \mu_M\) against \(H_A: \mu_F \neq \mu_M\)
t.test(body.temp ~ sex, data = temps, mu = 0, alternative = 'two.sided')
Welch Two Sample t-test
data: body.temp by sex
t = 1.7118, df = 34.329, p-value = 0.09595
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
-0.09204497 1.07783444
sample estimates:
mean in group female mean in group male
98.65789 98.16500
Suggestive but insufficient evidence that mean body temperature differs by sex.
Notice: estimated difference (F - M) is 0.493 °F (SE 0.2879)
What if we had more data?
Here are estimates from two larger samples of 65 individuals each (compared with 19 and 20 in the original samples):
| sex    | mean.temp | se      | n  |
|--------|-----------|---------|----|
| female | 98.39     | 0.09222 | 65 |
| male   | 98.1      | 0.08667 | 65 |
the estimated difference (F - M) is smaller: 0.2892 °F
but so is the standard error: 0.1266 (recall: more data \(\longleftrightarrow\) better precision)
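The effect on the test statistic follows directly, since the \(t\) statistic is the estimated difference divided by its standard error (values taken from the two analyses):

```r
# t statistic = estimated difference / standard error
0.4930 / 0.2879   # original samples: about 1.71
0.2892 / 0.1266   # larger samples:   about 2.28 (matches the reported t up to rounding)
```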
t.test(body.temp ~ sex, data = temps.aug, mu = 0, alternative = 'two.sided')
Welch Two Sample t-test
data: body.temp by sex
t = 2.2854, df = 127.51, p-value = 0.02394
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
0.03881298 0.53964856
sample estimates:
mean in group female mean in group male
98.39385 98.10462
The data provide evidence that mean body temperature differs by sex (T = 2.29 on 127.51 degrees of freedom, p = 0.02394).
A statistical trap
If you collect enough data, you can detect an arbitrarily small difference in means.
We’ll base the test on the ratio \(F = \frac{\color{red}{\text{group variation}}}{\color{blue}{\text{error variation}}}\) and reject \(H_0\) if the ratio is large enough.
The \(F\) statistic: a variance ratio
The \(F\) statistic measures variability attributable to group differences relative to variability attributable to individual differences.
Notation:
\(\bar{x}\): “grand” mean of all observations
\(\bar{x}_i\): mean of observations in group \(i\)
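In this notation (writing \(x_{ij}\) for the \(j\)th observation in group \(i\) and \(n_i\) for the size of group \(i\)), the variance ratio is built from the usual one-way ANOVA sums of squares, consistent with the table later in this section:

\[
SSG = \sum_{i = 1}^k n_i(\bar{x}_i - \bar{x})^2,
\qquad
SSE = \sum_{i = 1}^k \sum_{j = 1}^{n_i} (x_{ij} - \bar{x}_i)^2,
\qquad
F = \frac{SSG / (k - 1)}{SSE / (n - k)}
\]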
F = 5.4636 means the variation in weight attributable to diets is 5.46 times the individual variation among chicks.
The \(p\)-value for the test is 0.0029:
if there is truly no difference in means, then under 1% of samples (about 3 in 1000) would produce at least as much diet-to-diet variability as we observed
so in this case we reject \(H_0\) at the 1% level
ANOVA in R
The aov(...) function fits ANOVA models using a formula/dataframe specification:
# fit anova model
fit <- aov(weight ~ diet, data = chicks)

# generate table
summary(fit)
Analysis of Variance Model

|           | Df | Sum Sq | Mean Sq | F value | Pr(>F)   |
|-----------|----|--------|---------|---------|----------|
| diet      | 3  | 55881  | 18627   | 5.464   | 0.002909 |
| Residuals | 42 | 143190 | 3409    | NA      | NA       |
Conventional interpretation style closely follows that of previous inferences for one or two means:
The data provide evidence of an effect of diet on mean weight (F = 5.464 on 3 and 42 df, p = 0.0029).
Analysis of variance table
The results of an analysis of variance are traditionally displayed in a table.
| Source | degrees of freedom | Sum of squares | Mean square | F statistic | p-value |
|--------|--------------------|----------------|-------------|-------------|---------|
| Group  | \(k - 1\) | SSG | \(MSG = \frac{SSG}{k - 1}\) | \(\frac{MSG}{MSE}\) | \(P(F > F_\text{obs})\) |
| Error  | \(n - k\) | SSE | \(MSE = \frac{SSE}{n - k}\) |  |  |
the sum of squares terms are 'raw' measures of variability from each source
the mean square terms are adjusted for the amount of data available and number of parameters
Formally, the ANOVA model says \[(n - 1)S^2 = SSG + SSE\]
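For the chicks data this decomposition can be checked numerically (sums of squares from the fitted model above; assumes the `chicks` data frame is loaded with one row per chick):

```r
# check the decomposition (n - 1) S^2 = SSG + SSE on the chicks data
SSG <- 55881.02
SSE <- 143190.31
total_ss <- (nrow(chicks) - 1) * var(chicks$weight)  # (n - 1) S^2
all.equal(total_ss, SSG + SSE)                       # should be TRUE
```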
Constructing the table “by hand”
| Source | DF | Sum sq. | Mean sq. | F statistic | p-value |
|--------|----|---------|----------|-------------|---------|
| Group  | \(k - 1\) | SSG | \(MSG = \frac{SSG}{k - 1}\) | \(\frac{MSG}{MSE}\) | \(P(F > F_\text{obs})\) |
| Error  | \(n - k\) | SSE | \(MSE = \frac{SSE}{n - k}\) |  |  |
Here is what the fitted model looks like in R:
# fitted anova model
aov(weight ~ diet, data = chicks)
Call:
aov(formula = weight ~ diet, data = chicks)
Terms:
diet Residuals
Sum of Squares 55881.02 143190.31
Deg. of Freedom 3 42
Residual standard error: 58.38915
Estimated effects may be unbalanced
| Source | DF | Sum Sq. | Mean Sq. | F statistic |
|--------|----|---------|----------|-------------|
| Diet   |    |         |          |             |
| Error  |    |         |          |             |
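Filling in the table from the printed sums of squares amounts to a few lines of arithmetic (a sketch, with the values copied from the output above):

```r
# entries of the ANOVA table, computed from the printed sums of squares
SSG <- 55881.02; SSE <- 143190.31
k <- 4; n <- 46                      # 4 diets, 46 chicks (df = 3 and 42)
MSG <- SSG / (k - 1)                 # about 18627
MSE <- SSE / (n - k)                 # about 3409
F_obs <- MSG / MSE                   # about 5.464
pf(F_obs, df1 = k - 1, df2 = n - k, lower.tail = FALSE)  # p-value, about 0.0029
```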
Rules of thumb for level \(\alpha = 0.05\) tests:
\(F > 5\) almost always rejects
\(F > 4\) usually rejects for \(k > 2\) (need \(n > 15\))
\(F > 3\) usually rejects for \(k > 3\) (need \(n > 25\))
\(F > 2\) rarely rejects unless \(k > 8\)
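These rules of thumb track the 5% critical values \(F_{0.95,\,k-1,\,n-k}\); a few illustrative values (the particular \(k\) and \(n\) below are chosen only for illustration):

```r
# 5% critical values of the F distribution for a few group/sample sizes
qf(0.95, df1 = 1, df2 = 18)   # k = 2, n = 20: about 4.4
qf(0.95, df1 = 3, df2 = 26)   # k = 4, n = 30: about 3.0
qf(0.95, df1 = 8, df2 = 50)   # k = 9, n = 59: about 2.1
```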
Measuring effect size
A common measure of effect size is \[\eta^2 = \frac{SSG}{SST} \quad\left(\frac{\text{group variation}}{\text{total variation}}\right)\] where \[SST = SSG + SSE = (n - 1)S^2_x\] This is the proportion of total variation attributed to the grouping factor.
possible values \(0 \leq \eta^2 \leq 1\)
near 1 \(\longrightarrow\) large effect
near 0 \(\longrightarrow\) small effect
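Using the sums of squares from the chicks table above, \(\eta^2\) can be computed directly; the effectsize package (used below) reports the same value along with a confidence interval:

```r
# eta squared from the ANOVA table sums of squares
SSG <- 55881.02; SSE <- 143190.31
SSG / (SSG + SSE)   # about 0.28
```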
library(effectsize)
fit <- aov(weight ~ diet, data = chicks)
eta_squared(fit, partial = FALSE)
# Effect Size for ANOVA (Type I)
Parameter | Eta2 | 95% CI
-------------------------------
diet | 0.28 | [0.08, 1.00]
- One-sided CIs: upper bound fixed at [1.00].
An estimated 28% of variation in weight is attributable to diet.
The confidence interval is for the population analogue of \(\eta^2\).
Powering for effect size
How many chicks should we measure to detect an average group difference of 20g?
Group variation on the order of 20g (squared deviations of about \(20^2 = 400\)) amounts to roughly (taking \(n_i \approx 12\), since \(n_i\) varies) \[SSG \approx (k - 1)(n_i\times 400) \approx 14400\] which is an effect size of \[\eta^2 = \frac{14400}{143190 + 14400} = 0.0914\]i.e., group variation is around 9.14% of total variation.
To power the study to detect \(\eta^2 = 0.1\) use the approximation (assumes \(k \ll n\)):
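The calculation below comes from `power.anova.test()`; the call itself isn't shown on the slide, but arguments along these lines, inferred from the printed output, reproduce it:

```r
# sample size for a balanced one-way ANOVA targeting eta^2 = 0.1
# (between.var / (between.var + within.var) = 0.1, per the k << n approximation;
#  arguments inferred from the output below)
power.anova.test(groups = 4, between.var = 0.1, within.var = 0.9,
                 sig.level = 0.05, power = 0.8)
```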
Balanced one-way analysis of variance power calculation
groups = 4
n = 33.70068
between.var = 0.1
within.var = 0.9
sig.level = 0.05
power = 0.8
NOTE: n is number in each group
We’d need 34 chicks per group.
Other examples
Diet restriction and longevity
From data on lifespans of lab mice fed calorie-restricted diets:
| diet  | mean  | sd    | n  |
|-------|-------|-------|----|
| NP    | 27.4  | 6.134 | 49 |
| N/N85 | 32.69 | 5.125 | 57 |
| N/R50 | 42.3  | 7.768 | 71 |
| N/R40 | 45.12 | 6.703 | 60 |
ANOVA omnibus test:
\[
\begin{align*}
&H_0: \mu_\text{NP} = \mu_\text{N/N85} = \mu_\text{N/R50} = \mu_\text{N/R40} \\
&H_A: \text{at least two means differ}
\end{align*}
\]
fit <- aov(lifetime ~ diet, data = longevity)
summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
diet 3 11426 3809 87.41 <2e-16 ***
Residuals 233 10152 44
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The data provide evidence that diet restriction has an effect on mean lifetime among mice (F = 87.41 on 3 and 233 degrees of freedom, p < 0.0001).
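For a sense of effect size, the same \(\eta^2\) calculation as before applies to this table's sums of squares:

```r
# eta squared for the longevity ANOVA (sums of squares from the table above)
11426 / (11426 + 10152)   # about 0.53
```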
Treating anorexia
From data on weight change among 72 young anorexic women after a period of therapy:
| treat | post - pre | sd    | n  |
|-------|------------|-------|----|
| ctrl  | -0.45      | 7.989 | 26 |
| cbt   | 3.007      | 7.309 | 29 |
| ft    | 7.265      | 7.157 | 17 |
ANOVA omnibus test: \[
\begin{align*}
&H_0: \mu_\text{ctrl} = \mu_\text{cbt} = \mu_\text{ft} \\
&H_A: \text{at least two means differ}
\end{align*}
\]
# fit anova model
fit <- aov(change ~ treat, data = anorexia)

# generate table
summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
treat 2 615 307.32 5.422 0.0065 **
Residuals 69 3911 56.68
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The data provide evidence of an effect of therapy on mean weight change among young women with anorexia (F = 5.422 on 2 and 69 degrees of freedom, p = 0.0065).