Homework 1

With solutions

Due

June 26, 2025

Remarks: problems 1 and 2 are based on exercises in Lab 2: Descriptive statistics; problems 3 and 4 are based on exercises in Lab 3: Point and interval estimation and Lab 4: Confidence intervals; problem 5 is comprehensive.

  1. The census dataset contains a sample of responses from the 2000 U.S. census.

    1. [L1] How many variables are in the dataset, not including census year and FIPS code?
    2. [L1] How many categorical variables are in the dataset, not including FIPS code?
    3. [L1] How many respondents are included in the dataset?
    4. [L3] What are the ages of the youngest and oldest individuals in the sample?
    5. [L3] Construct a histogram of total family incomes with an appropriate amount of binning.
    6. [L3] Determine and compute appropriate measures of center and spread.
# load and inspect dataset
load('data/census.RData')

# minimum and maximum age
age <- census$age
min(age)
[1] 15
max(age)
[1] 93
# histogram of family incomes
family.income <- census$total_family_income
hist(family.income, breaks = 50)

# measure of center
median(family.income)
[1] 44000
# measure of spread
IQR(family.income)
[1] 48000
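The chunk above loads the data but does not show the inspection step. A minimal sketch of how the counts of rows, columns, and categorical columns might be obtained is given below; `toy` is a hypothetical stand-in with made-up values, not the census data, and the same calls would be applied to `census`.

```r
# toy data frame standing in for census (hypothetical values)
toy <- data.frame(
  age = c(23, 41, 67),
  sex = factor(c("Female", "Male", "Female")),
  total_family_income = c(30000, 52000, 18000)
)

n.obs <- nrow(toy)   # number of rows (one per respondent)
n.vars <- ncol(toy)  # number of columns (variables)

# TRUE for columns stored as factor or character
is.categorical <- sapply(toy, function(x) is.factor(x) || is.character(x))
n.categorical <- sum(is.categorical)

# compact overview of each column's storage type
str(toy)
```

Storage type is only a guide: an identifier such as a FIPS code may be stored as a number even though it is categorical, so the counts should be checked against what each variable means.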
  1. There are 8 columns in the dataframe, so not including year and FIPS, there are 6 variables.
  2. Not including FIPS, there are 3 categorical variables: sex, race_general, and marital_status
  3. There are 377 individuals in the sample (one per row).
  4. The youngest individual in the sample is 15 years old and the oldest is 93.
  5. The histogram is shown above. Around 50 bins captures the shape of the distribution well.
  6. The incomes are heavily right-skewed with some large outliers, so median and IQR are better choices of summary statistics. The median family income is 44K; the IQR is 48K.
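The preference for median and IQR under right skew can be illustrated with a short simulation. This is a sketch on simulated values only (the lognormal parameters and the added outlier are assumptions chosen for illustration, not estimates from the census data).

```r
# simulated right-skewed incomes (hypothetical, not the census data)
set.seed(1)
incomes <- rlnorm(377, meanlog = log(44000), sdlog = 0.8)

# the mean is pulled toward the long right tail; the median is not
c(mean = mean(incomes), median = median(incomes))

# SD is inflated by large values; IQR depends only on the middle half
c(sd = sd(incomes), iqr = IQR(incomes))

# append one extreme value and compare how far each measure of center moves
incomes.out <- c(incomes, 5e6)
mean.shift <- mean(incomes.out) - mean(incomes)
median.shift <- median(incomes.out) - median(incomes)
c(mean.shift = mean.shift, median.shift = median.shift)
```

A single extreme observation shifts the mean by thousands while the median barely moves, which is the sense in which the median (and likewise the IQR) is the more robust summary here.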
  2. The nhanes dataset contains responses from the National Health and Nutrition Examination Survey (NHANES) on a subset of demographic and health-related variables from the survey. Assume that respondents are a representative sample of U.S. adults.

    1. [L1] Which variables are numeric and which are categorical?
    2. [L1] Classify each numeric variable as discrete or continuous.
    3. [L1] How many observations and variables are in the dataset?
    4. [L3] What proportion of respondents are male?
    5. [L3] Make histograms of systolic and diastolic blood pressure (bpsys1 and bpdia1, respectively). Describe the distributions.
    6. [LX] A person is considered to have hypertension if their systolic pressure is over 130 OR their diastolic pressure is over 80. How many individuals in the dataset have hypertension? (Hint: sum(x > 5) will calculate how many values in x exceed 5.)
# load dataset
load('data/nhanes.RData')

# proportion of male respondents
table(nhanes$gender) |> proportions()

   female      male 
0.4995282 0.5004718 
# histograms of systolic and diastolic blood pressure
hist(nhanes$bpsys1)

hist(nhanes$bpdia1)

# how many individuals have hypertension (challenge)
sys.over130 <- (nhanes$bpsys1 > 130) 
dia.over80 <- (nhanes$bpdia1 > 80)
table(sys.over130 + dia.over80)

   0    1    2 
2169  770  240 
  1. All variables are numeric except for gender.
  2. All variables are discrete (integer-valued) except for totchol.
  3. There are 3179 observations of 8 variables.
  4. 50.05% of respondents were male.
  5. Histograms are above; the distribution of systolic pressure is right-skewed, and the distribution of diastolic pressure is symmetric. Both are unimodal.
  6. 1010 individuals, or approximately one third of respondents, have hypertension.
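The hypertension count can also be obtained directly with the element-wise OR operator `|`, following the hint in the problem. The sketch below uses five made-up readings (not the nhanes data) and checks that it agrees with the table-based approach used in the solution.

```r
# hypothetical readings for five people (not the nhanes data)
sys <- c(118, 135, 142, 125, 128)
dia <- c(76, 78, 85, 84, 70)

# element-wise OR: TRUE when either threshold is exceeded
hypertensive <- (sys > 130) | (dia > 80)

# sum() counts TRUE values; mean() gives the proportion
n.hyp <- sum(hypertensive)
p.hyp <- mean(hypertensive)

# same count via the table approach: one or both conditions met
n.hyp.alt <- sum((sys > 130) + (dia > 80) >= 1)
```

If a dataset contained missing readings, `sum(..., na.rm = TRUE)` would be needed; the solution's tallies add up to all 3179 observations, so that does not appear to be an issue here.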
  3. Chen, W., et al., Maternal investment increases with altitude in a frog on the Tibetan Plateau. Journal of Evolutionary Biology 26-12 (2013) includes an analysis of measurements pertaining to egg clutches of several populations of frog at breeding ponds (sites) in the eastern Tibetan Plateau: egg size (diameter in mm), clutch size (estimated number of eggs), clutch volume (volume in cubic mm), and body size (length of mother in cm). The frog dataset contains an excerpt of data from this paper along with a study site identifier and site altitude.

    1. [L1] Which site has the most observations? The least?
    2. [L1] Notice that altitude is recorded as a categorical variable. What type of categorical variable is it, and what about the study might explain why it is recorded this way rather than as numeric?
    3. [L3] Make histograms of each of the four numeric variables and describe the distributions.
    4. [L3] For each variable in (c), identify any measures of center or spread that would not be appropriate summary statistics.
    5. [L4] Use the data to estimate mean clutch size; report the point estimate and standard error following conventional style.
    6. [L4] Construct a 99% confidence interval for mean clutch size and interpret the result in context following conventional style.
    7. [L3] Would a clutch size of 200 be unusual? Explain.
# load data
load('data/frog.RData')

# tally number of observations by site
table(frog$site) |> sort()

118 030 053 019 060 069 109 063 077 105 040 
  5   6   6  10  10  14  21  23  37 127 172 
# distributions of each numeric variable
hist(frog$clutch.size, breaks = 15)

hist(frog$clutch.volume, breaks = 15)

hist(frog$egg.size, breaks = 20)

hist(frog$body.size, breaks = 10)

# point estimate and se for mean clutch size
clutch.mean <- mean(frog$clutch.size)
clutch.se <- sd(frog$clutch.size)/sqrt(length(frog$clutch.size))

# 99% confidence interval for mean clutch size
# critical value with df = n - 1 (n = 431 here, so df = 430)
cval <- qt(0.995, df = length(frog$clutch.size) - 1)
clutch.mean + c(-1, 1)*cval*clutch.se
[1] 691.7161 750.7847
# is 200 an unusual clutch size?
summary(frog$clutch.size)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  158.5   549.5   707.9   721.3   851.1  1698.2 
quantile(frog$clutch.size, probs = 0.01)
      1% 
237.7373 
  1. Site 118 has the fewest observations (5); site 040 has the most (172).
  2. Altitude is ordinal. It is recorded this way because each site is associated with exactly one altitude, so there are only 11 unique values, each of which occurs multiple times in the data. Moreover, since the spacing between observed altitudes is irregular, it is more sensible to treat it as ordinal than as discrete.
  3. Clutch size is unimodal and slightly right skewed; clutch volume is unimodal and right skewed; egg size is bimodal; and body size is ambiguous – it’s fairly uniform besides the local peak around 5.2-5.8, but with less binning appears unimodal and left skewed. Either answer for body size is acceptable.
  4. For clutch size and clutch volume, there are no major outliers (though there are a few unusually large observations) and the skewness is mild, so any of the commonly used summaries (mean, median, SD, IQR) are appropriate; similarly, any of the common summaries would be appropriate for body size. However, for egg size, none of the common summaries are appropriate because the distribution is bimodal.
  5. Mean clutch size is estimated to be 721.25 eggs (SE 11.42).
  6. With 99% confidence, mean clutch size is estimated to be between 691.7 and 750.8 eggs.
  7. Yes, a clutch size of 200 would be unusually small; the 1st percentile is 237.7, so more than 99% of observed clutches are larger than 200.
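The interval calculation in parts 5 and 6 can be wrapped in a small function so that the sample size, degrees of freedom, and critical value are derived from the data rather than typed in. This is a sketch only: `mean_ci` is a hypothetical helper, and the data are simulated rather than taken from the frog dataset.

```r
# t-based confidence interval for a mean (mean_ci is a hypothetical helper)
mean_ci <- function(x, level = 0.95) {
  n <- length(x)
  se <- sd(x)/sqrt(n)                          # standard error of the mean
  cval <- qt(1 - (1 - level)/2, df = n - 1)    # two-sided critical value
  mean(x) + c(-1, 1)*cval*se
}

# simulated clutch sizes (not the frog data)
set.seed(2)
x <- rnorm(431, mean = 720, sd = 240)

ci99 <- mean_ci(x, level = 0.99)
ci95 <- mean_ci(x, level = 0.95)

# a higher confidence level gives a wider interval
diff(ci99) > diff(ci95)
```

For a 99% interval the upper-tail probability passed to `qt` is 0.995, matching the value used in the solution. The built-in `t.test(x, conf.level = 0.99)$conf.int` returns the same interval and can serve as a check.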
  4. Vu and Harrington exercise 4.1 (note that the exercises are at the end of the chapter, not the end of the section). Additionally:

    1. [L4] Compute a 95% confidence interval for the mean BGC of nests using the empirical rule and interpret the interval in context following conventional style.
    2. [L4] Supposing a sample of 30 nests returned exactly the same summary statistics, recompute your interval in (e). Is the margin of error smaller or larger?

Note: this problem can be done entirely by hand. If you wish to use R, you can input the given summary statistics directly (for example, bgc.mean <- 0.6052) and perform the calculations as in the lab activity.

# input summary statistics directly for this problem

# point estimate for population mean is sample mean
bgc.mean <- 0.6052

# point estimate for population sd is sample sd
bgc.sd <- 0.0131

# deviation of 0.63, relative to sd
(0.63 - bgc.mean)/bgc.sd
[1] 1.89313
# interval for the population mean
bgc.mean + c(-1, 1)*2*bgc.sd/sqrt(70)
[1] 0.6020685 0.6083315
# repeat e, but suppose only 30 nests were measured
bgc.mean + c(-1, 1)*2*bgc.sd/sqrt(30)
[1] 0.6004166 0.6099834
  1. The (population) mean BGC of nests is estimated to be 0.6052.
  2. The (population) standard deviation of BGC of nests is estimated to be 0.0131.
  3. A BGC value of 0.63 would be somewhat high, but not too extreme: it lies about 1.9 standard deviations above the sample mean.
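  4. With 95% confidence, the mean BGC of nests is estimated to be between 0.6021 and 0.6083 units.
  5. If the sample size had been 30, the interval estimate would be wider (0.6004 to 0.6100): the margin of error would increase from about 0.0031 to about 0.0048.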
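The effect of sample size on the margin of error can be made explicit. The sketch below uses the summary statistics given in the problem and the empirical-rule multiplier of 2; `moe` is a hypothetical helper name.

```r
# sample standard deviation from the problem
bgc.sd <- 0.0131

# margin of error under the empirical rule: 2 standard errors
moe <- function(n) 2*bgc.sd/sqrt(n)

moe.70 <- moe(70)   # original sample size
moe.30 <- moe(30)   # hypothetical smaller sample

# the ratio depends only on the sample sizes: sqrt(70/30)
moe.30/moe.70
```

Because the margin of error scales with 1/sqrt(n), cutting the sample from 70 to 30 nests widens the interval by a factor of sqrt(70/30), a little over 1.5, regardless of the standard deviation.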
  5. The temps dataset contains measurements of body temperature and heart rate for a random sample of adults. The provided code splits the body temperature data into groups by sex.

    1. [L4] Compute a 90% confidence interval for the mean female body temperature and interpret the interval in context in conventional style.
    2. [L4] Compute a 90% confidence interval for the mean male body temperature and interpret the interval in context in conventional style.
    3. [L4] Do the intervals suggest a sex difference in mean body temperature? Why or why not?
# load data
load('data/temps2.RData')

# split observations into two groups by sex
temps.split <- split(temps$body.temp, temps$sex)
temps.m <- temps.split$male
temps.f <- temps.split$female

# confidence interval for mean body temps for female
temps.f.mean <- mean(temps.f)
temps.f.se <- sd(temps.f)/sqrt(65)
cval.f <- qt(0.95, df = 64)
temps.f.mean + c(-1, 1)*cval.f*temps.f.se
[1] 98.23993 98.54776
# confidence interval for mean body temps for male
temps.m.mean <- mean(temps.m)
temps.m.se <- sd(temps.m)/sqrt(65)
cval.m <- qt(0.95, df = 64)
temps.m.mean + c(-1, 1)*cval.m*temps.m.se
[1] 97.95996 98.24927
  1. With 90% confidence, the mean female body temperature is estimated to be between 98.24 and 98.55 degrees Fahrenheit.
  2. With 90% confidence, the mean male body temperature is estimated to be between 97.96 and 98.25 degrees Fahrenheit.
  3. The intervals overlap slightly, so there are common values that are plausible for both means at the specified confidence level, such as 98.24. The intervals therefore do not suggest a sex difference.
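The overlap can be checked numerically from the unrounded endpoints printed above. This is a small sketch; the variable names are chosen for illustration.

```r
# interval endpoints copied from the output above
ci.f <- c(98.23993, 98.54776)   # female
ci.m <- c(97.95996, 98.24927)   # male

# the shared region runs from the larger lower bound to the smaller upper bound
overlap.lower <- max(ci.f[1], ci.m[1])
overlap.upper <- min(ci.f[2], ci.m[2])

# the intervals overlap when that region is non-empty
intervals.overlap <- overlap.lower <= overlap.upper
c(overlap.lower, overlap.upper)
```

The shared region is under 0.01 degrees wide, so the overlap is only visible in the unrounded endpoints.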