load('data/census.RData')
load('data/temps.RData')
Lab 9: nonparametric inference
The objective of this lab is to learn how to implement nonparametric alternatives to t-tests and ANOVA:
- sign test
- signed rank test
- rank sum test
- Kruskal-Wallis test
Examples utilize data from lecture and past assignments.
One-sample inference
Sign test
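The sign test is a nonparametric procedure for inference on a population median.
If the median is in fact equal to the hypothesized value, then each observation falls below that value with probability 0.5, so the number of observations below it follows a binomial distribution with success probability 0.5.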
Here’s an example of testing whether median DDT in kale samples is 3:
# ddt data
ddt <- MASS::DDT
# how many observations are less than 3?
ddt.x <- sum(ddt < 3)
ddt.n <- length(ddt)
# sign test
binom.test(x = ddt.x, n = ddt.n, p = 0.5, alternative = 'two.sided')
Exact binomial test
data: ddt.x and ddt.n
number of successes = 2, number of trials = 15, p-value = 0.007385
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.01657591 0.40460270
sample estimates:
probability of success
0.1333333
We’d interpret the result as follows:
The data provide evidence that the median DDT in kale is not 3 ppm (sign test, p = 0.0074).
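As a check on what binom.test is doing here, the two-sided p-value can be reproduced from the binomial distribution directly. The sketch below hard-codes the counts from the output above (2 of 15 observations below 3) and relies on the null distribution being symmetric when p = 0.5, so that the two-sided p-value is twice the lower tail. The names x, n, p.manual and p.binom are illustrative only.

```r
# Counts taken from the binom.test output above
x <- 2    # observations below the hypothesized median
n <- 15   # total observations

# Under the null, x ~ Binomial(n, 0.5); the distribution is symmetric,
# so the two-sided p-value is twice the lower-tail probability
p.manual <- 2 * pbinom(x, size = n, prob = 0.5)

# Same calculation via binom.test
p.binom <- binom.test(x = x, n = n, p = 0.5, alternative = 'two.sided')$p.value

p.manual
p.binom
```

Both values should agree with the p-value of 0.007385 reported above.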
Using the census data, test whether median total personal income is 25K using the sign test. Interpret the result in context.
Extra: Which alternative should you use in binom.test to test whether median income is less than 25K?
# incomes from census data
incomes <- census$total_personal_income
# how many observations are less than 25K?
incomes.x <- sum(incomes < 25000)
incomes.n <- length(incomes)
# sign test
binom.test(x = incomes.x, n = incomes.n, p = 0.5, alternative = 'two.sided')
Exact binomial test
data: incomes.x and incomes.n
number of successes = 228, number of trials = 377, p-value = 5.557e-05
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.5534322 0.6544581
sample estimates:
probability of success
0.6047745
The data provide evidence that median income is not 25K (sign test, p < 0.0001).
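For the extra question, one way to reason about the direction: the count passed to binom.test is the number of observations below 25K, so if the median were less than 25K, more than half of the observations would be expected to fall below it. That points to alternative = 'greater'. The sketch below reuses the counts from the output above (228 of 377) rather than the census data itself; the variable names are illustrative.

```r
# Counts taken from the binom.test output above
x <- 228   # incomes below 25K
n <- 377   # total observations

# Median < 25K  <=>  P(income < 25K) > 0.5, hence alternative = 'greater'
p.greater <- binom.test(x = x, n = n, p = 0.5, alternative = 'greater')$p.value

# Two-sided p-value for comparison
p.two <- binom.test(x = x, n = n, p = 0.5, alternative = 'two.sided')$p.value

p.greater
p.two
```

Because the null distribution is symmetric, the one-sided p-value here is half the two-sided one.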
Signed rank test
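The signed rank test provides inference on the “center” of a population distribution, assuming the distribution is symmetric. For example, a two-sided test would pertain to the hypothesis and alternative:
$H_0: c = c_0$ versus $H_A: c \neq c_0$, where $c$ denotes the center and $c_0$ the hypothesized value.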
Under the symmetry assumption, one can interpret the center as a median.
The implementation in R is almost identical to that of the t-test: use wilcox.test(...) in place of t.test(...). For example, to test whether median male body temperature is 98.7 degrees Fahrenheit:
# male body temperatures
temps.split <- split(temps$body.temp, temps$sex)
temps.m <- temps.split$male
# signed rank test
wilcox.test(temps.m, mu = 98.7, alternative = 'two.sided')
Wilcoxon signed rank test with continuity correction
data: temps.m
V = 39.5, p-value = 0.01505
alternative hypothesis: true location is not equal to 98.7
We’d interpret the result as follows:
The data provide evidence that median male body temperature is not 98.7 degrees Fahrenheit (signed rank test, p = 0.0151).
Test whether median female body temperature is 98.7 degrees Fahrenheit and interpret the result in context.
# female body temperatures
temps.f <- temps.split$female
# signed rank test
wilcox.test(temps.f, mu = 98.7, alternative = 'two.sided')
Wilcoxon signed rank test with continuity correction
data: temps.f
V = 79, p-value = 0.7935
alternative hypothesis: true location is not equal to 98.7
The data do not provide evidence that median female body temperature differs from 98.7 degrees Fahrenheit (signed rank test, p = 0.7935).
It is important to remember that this method relies on assuming the underlying data distribution is symmetric.
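To see what the statistic V in the wilcox.test output measures, it can be computed by hand: rank the absolute differences from the hypothesized center, then add up the ranks belonging to positive differences. The sketch below uses a small made-up vector of temperatures (hypothetical values, not the lab data) chosen to have no ties and no zero differences.

```r
# Hypothetical temperatures, made up for illustration (not the lab data)
x <- c(98.1, 98.9, 97.6, 99.2, 98.4, 97.9, 98.0)
mu0 <- 98.7   # hypothesized center

# Differences from the hypothesized center
d <- x - mu0

# Rank the absolute differences, then sum the ranks of the positive ones
r <- rank(abs(d))
V.manual <- sum(r[d > 0])

# Statistic reported by wilcox.test
V.r <- unname(wilcox.test(x, mu = mu0, alternative = 'two.sided')$statistic)

V.manual
V.r
```

A small V means the observations above the hypothesized center are few or lie close to it, while those below lie further away; under symmetry about the hypothesized center, the positive and negative differences would share the ranks roughly evenly.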
Two-sample inference
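Consider now comparing the centers of two groups. For this we can use the rank sum test, which for a two-sided alternative would assess the hypotheses:
$H_0: c_1 = c_2$ versus $H_A: c_1 \neq c_2$, where $c_1$ and $c_2$ denote the centers of the two groups.
This is the nonparametric analogue of the two-sample t-test.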
The implementation in R is straightforward. To test whether male and female body temperatures differ:
# rank sum test
wilcox.test(body.temp ~ sex, data = temps)
Wilcoxon rank sum test with continuity correction
data: body.temp by sex
W = 249.5, p-value = 0.09682
alternative hypothesis: true location shift is not equal to 0
We’d interpret the result as follows:
The data do not provide evidence that body temperatures differ by sex (rank sum test, p = 0.0968).
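The statistic W in this output can also be reproduced by hand, which may help clarify what the test responds to. In the sketch below (two small made-up samples with no ties, not the lab data), W is obtained in two equivalent ways: from the sum of the first group's ranks in the pooled sample, and as the number of pairs in which a value from the first group exceeds a value from the second.

```r
# Hypothetical samples, made up for illustration (not the lab data)
x <- c(98.2, 98.8, 99.1, 98.6)         # first group
y <- c(97.9, 98.4, 98.0, 98.5, 98.3)   # second group

# Rank the pooled sample; the first length(x) ranks belong to x
r <- rank(c(x, y))
n.x <- length(x)

# W = rank sum of the first group minus its smallest possible value
W.manual <- sum(r[seq_len(n.x)]) - n.x * (n.x + 1) / 2

# Equivalent: count the (x, y) pairs with x larger than y
W.pairs <- sum(outer(x, y, '>'))

# Statistic reported by wilcox.test
W.r <- unname(wilcox.test(x, y)$statistic)

c(W.manual, W.pairs, W.r)
```

With 4 and 5 observations there are 20 pairs, so W close to 0 or close to 20 indicates that one group's values tend to be larger than the other's.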
Using the census data, test whether incomes are higher among men than among women. Interpret your result in context.
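# rank sum test
# the shift is (first level of sex) minus (second level); assuming the female
# level comes first, 'less' tests whether female incomes tend to be lower
wilcox.test(total_personal_income ~ sex, data = census, alternative = 'less')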
Wilcoxon rank sum test with continuity correction
data: total_personal_income by sex
W = 11889, p-value = 1.363e-08
alternative hypothesis: true location shift is less than 0
The data provide evidence that incomes are higher among men than among women (rank sum test, p < 0.0001).
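With the formula interface, the direction of a one-sided alternative depends on the order of the grouping factor's levels: the location shift is the first level minus the second. The sketch below illustrates this on a made-up data frame (the name toy and all values are hypothetical) in which every male income exceeds every female income.

```r
# Hypothetical data, made up for illustration (not the census data)
toy <- data.frame(
  income = c(20, 22, 25, 27, 30, 31, 35, 38, 40, 45),
  sex = factor(rep(c('female', 'male'), each = 5),
               levels = c('female', 'male'))
)

# The first level is the reference for the direction of the shift
levels(toy$sex)

# 'less': female incomes shifted below male incomes
p.less <- wilcox.test(income ~ sex, data = toy, alternative = 'less')$p.value

# 'greater': female incomes shifted above male incomes
p.greater <- wilcox.test(income ~ sex, data = toy, alternative = 'greater')$p.value

c(less = p.less, greater = p.greater)
```

Here only alternative = 'less' gives a small p-value. Checking levels() before choosing the alternative is a cheap way to avoid testing the wrong direction.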
By including the conf.int and conf.level arguments, we can obtain an estimate and confidence interval for the magnitude of the location shift:
# rank sum test
wilcox.test(body.temp ~ sex, data = temps, conf.int = TRUE, conf.level = 0.95)
Wilcoxon rank sum test with continuity correction
data: body.temp by sex
W = 249.5, p-value = 0.09682
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
-0.1999996 1.1000358
sample estimates:
difference in location
0.5000276
This is interpreted as follows:
Body temperatures are estimated to be 0.5 degrees Fahrenheit higher among women (95% CI: -0.2 to 1.1 degrees).
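The "difference in location" is not a difference of sample medians. When wilcox.test can do an exact calculation (small samples without ties), the estimate is the median of all pairwise differences between the two groups. The sketch below checks this on two small made-up samples (hypothetical values, not the lab data). In the output above the test used a normal approximation, so the estimate reported there may differ slightly from a plain median of pairwise differences.

```r
# Hypothetical samples, made up for illustration (not the lab data)
x <- c(98.2, 98.8, 99.1, 98.6)         # first group
y <- c(97.9, 98.4, 98.0, 98.5, 98.3)   # second group

# Median of all pairwise differences x_i - y_j
hl.manual <- median(outer(x, y, '-'))

# Estimate and interval reported by wilcox.test
res <- wilcox.test(x, y, conf.int = TRUE, conf.level = 0.95)

hl.manual
res$estimate
res$conf.int
```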
As discussed in lecture, the rank sum test is designed to detect differences in location – meaning that the alternative, though written in terms of centers, is really that the values from one group tend to be uniformly larger/smaller than those from the other group. It’s important to keep this in mind.