Lab 2: Descriptive Statistics

Course

STAT218

The objectives of this lab are to learn to:

make basic statistical graphics for visualizing frequency distributions
compute common measures of location and spread
discern appropriate measures of location and spread based on presence of outliers and skewness

We’ll use the FAMuSS dataset, as in lecture.

library(tidyverse)

# load famuss dataset 
load('data/famuss.RData')

# inspect data frame
head(famuss)

# A tibble: 6 × 9
  ndrm.ch drm.ch sex      age race      height weight genotype   bmi
    <dbl>  <dbl> <fct>  <int> <fct>      <dbl>  <dbl> <fct>    <dbl>
1      40     40 Female    27 Caucasian   65      199 CC        33.1
2      25      0 Male      36 Caucasian   71.7    189 CT        25.8
3      40      0 Female    24 Caucasian   65      134 CT        22.3
4     125      0 Female    40 Caucasian   68      171 CT        26.0
5      40     20 Female    32 Caucasian   61      118 CC        22.3
6      75      0 Female    24 Hispanic    62.2    120 CT        21.8

As a quick refresher, you can extract a vector of the observations for any particular variable from the data frame using the $ operator: famuss$[variable name]. For example:

# extract the age variable
famuss$age

# store the age variable as a new object called "age"
age <- famuss$age

Part 1: distributions

A “frequency distribution” is a summary that shows how frequently each value of a variable occurs. This summary is computed differently depending on the variable type.

Categorical variables

For categorical variables, table(...) will tabulate counts of the number of occurrences of each unique value (category). This is the frequency distribution and it can be shown in either counts or proportions.

# retrieve genotype
genotype <- famuss$genotype

# frequency distribution (counts)
table(genotype)

genotype
 CC  CT  TT 
173 261 161

# frequency distribution (proportions)
table(genotype) |> proportions()

genotype
       CC        CT        TT 
0.2907563 0.4386555 0.2705882
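To see where the proportions come from, the sketch below rebuilds them by hand from the counts printed above: each count is divided by the total. The object names counts and props are introduced here only for illustration.

```r
# counts copied from the table(genotype) output above
counts <- c(CC = 173, CT = 261, TT = 161)

# a proportion is a count divided by the total number of observations
props <- counts / sum(counts)
props

# proportions always add up to 1
sum(props)
```

The values should agree with the output of proportions() shown above.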

Barplots provide simple visualizations of categorical frequency distributions. Barplots may show either counts or proportions; it is only a difference of scale.

# barplot (counts)
table(genotype) |> barplot()

# barplot (proportions)
table(genotype) |> proportions() |> barplot()

Your turn

Extract the sex variable from the FAMuSS data.

Construct a table showing the frequency distribution in proportions.
Construct a barplot showing the frequency distribution in proportions.

Solution

# retrieve sex variable
sex <- famuss$sex

# frequency distribution (proportions)
table(sex) |> proportions()

sex
   Female      Male 
0.5932773 0.4067227

# barplot (proportions)
table(sex) |> proportions() |> barplot()

Sometimes it’s handy to flip the orientation of the axes and/or labels. Try out the examples below:

# rotate axes
table(genotype) |> barplot(horiz = T)

# rotate labels 
table(genotype) |> barplot(las = 2)

# rotate axes and labels
table(genotype) |> barplot(horiz = T, las = 2)

Your turn

Rotate your bar plot (axes and labels) from the previous exercise.

Solution

# barplot (proportions)
table(sex) |> proportions() |> barplot(horiz = T, las = 2)

Numeric variables

Numeric variables must be “binned” into many small intervals to visualize the frequency distribution. This results in a histogram.

# retrieve age
age <- famuss$age

# construct histogram
hist(age)
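Binning can also be done by hand, which may help clarify what a histogram is counting. The sketch below uses only the six ages visible in the head(famuss) output (not the full dataset), and the bin endpoints are chosen arbitrarily for illustration. cut() assigns each value to an interval and table() counts the values in each interval.

```r
# six ages from the first rows of the data frame (illustration only)
ages <- c(27, 36, 24, 40, 32, 24)

# assign each age to an interval; intervals have the form (lower, upper]
bins <- cut(ages, breaks = c(10, 20, 24, 32, 50))

# count observations per interval: these are the heights of the histogram bars
bin_counts <- table(bins)
bin_counts
```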

Your turn

Extract the change in strength in nondominant arm variable (ndrm.ch) from the FAMuSS dataset and construct a histogram.

Discuss the shape of the distribution with your neighbor: is it skewed or symmetric? If it’s skewed, which direction?

Solution

# extract ndrm.ch variable
change <- famuss$ndrm.ch

# construct histogram
hist(change)

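You can control the binning using the breaks = ... argument. If you provide one number, R treats it as a suggestion and picks round breakpoints that give approximately that many bins; if you provide a vector, R will make bins with those values as endpoints. Note that when the bins have unequal widths, hist() plots density rather than counts on the vertical axis by default.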

# more bins
hist(age, breaks = 25)

# fewer bins
hist(age, breaks = 5)

# exact breaks
hist(age, breaks = c(10, 20, 24, 32, 50))

Selecting an appropriate number of bins is a bit subjective; you want to capture the shape of the distribution while smoothing out unnecessary detail.
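If you are curious which breakpoints R actually used, one option is to call hist() with plot = FALSE, which returns the binning without drawing anything. This is a sketch on the six ages printed in the head(famuss) output, not the full dataset, and the object names are hypothetical.

```r
# six ages from the first rows of the data frame (illustration only)
ages <- c(27, 36, 24, 40, 32, 24)

# a single number is only a suggestion; inspect the breakpoints R settled on
h_suggested <- hist(ages, breaks = 5, plot = FALSE)
h_suggested$breaks

# a vector of endpoints is used exactly as given
h_exact <- hist(ages, breaks = c(10, 20, 24, 32, 50), plot = FALSE)
h_exact$breaks
h_exact$counts
```

Either way, every observation lands in exactly one bin, so the counts add up to the number of observations.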

Your turn

Using the histogram of nondominant arm strength change from the previous “your turn” exercise, tinker with the binning until you find a setting that you feel captures the shape of the distribution well.

Solution

# adjust breaks = ... argument
hist(change, breaks = 20)

If you finish the above and have a few minutes, pick one additional numeric variable and construct a histogram. Is the distribution skewed or symmetric? How many modes does it have?

# choose another variable

# construct histogram

Part 2: summary statistics

Most common measures of center and spread have dedicated functions in R. We’ll review these below.

Measures of center

Means and medians are very straightforward to compute:

# average age
mean(age)

[1] 24.40168

# median age (middle value)
median(age)

[1] 22

Which one do you use? That depends on the shape of the distribution. In this case, age is a little right-skewed, so the median may capture the center a little better. Take a moment to locate the mean and median on the histogram below and consider whether you agree.

# plot mean and median on top of histogram
hist(age)
abline(v = median(age), col = 'red', lty = 1) # solid red vertical line at the median
abline(v = mean(age), col = 'blue', lty = 2) # dashed blue vertical line at the mean
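The reason skewness matters can be seen on a tiny example. The sketch below uses made-up numbers (not FAMuSS data): adding a single large value drags the mean toward the tail, while the median barely moves.

```r
# made-up symmetric sample (illustration only)
x <- c(20, 21, 22, 23, 24)
mean(x)    # 22
median(x)  # 22

# add one large value to create a long right tail
x_skewed <- c(x, 60)
mean(x_skewed)    # pulled upward, toward the tail
median(x_skewed)  # 22.5
```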

Your turn

Extract the bmi variable from the FAMuSS data.

compute the mean
compute the median

Discuss with your neighbor: which (if either) is the better choice?

Solution

# extract bmi
bmi <- famuss$bmi

# a. compute mean
mean(bmi)

[1] 24.40108

# b. compute median
median(bmi)

[1] 23.35

# histogram
hist(bmi)

The distribution is fairly symmetric, and both measures are quite close, so either statistic (mean or median) is a good choice.

Percentiles and min/max

Percentiles are computed using the quantile() function:

# 20th percentile of age
quantile(age, probs = 0.2)

20% 
 19

# 60th percentile of age
quantile(age, probs = 0.6)

60% 
 24

The probs = ... argument specifies which percentile R will calculate, expressed as a proportion between 0 and 1 (so 0.2 means the 20th percentile).
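The probs argument also accepts a vector, so several percentiles can be requested in one call. A minimal sketch, using only the six ages shown in the head(famuss) output rather than the full dataset:

```r
# six ages from the first rows of the data frame (illustration only)
ages <- c(27, 36, 24, 40, 32, 24)

# several percentiles in one call
q <- quantile(ages, probs = c(0.25, 0.5, 0.75))
q

# the 0th and 100th percentiles are the minimum and maximum
extremes <- quantile(ages, probs = c(0, 1))
extremes
```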

Minima and maxima can be computed with min() and max(), respectively:

# minimum age
min(age)

[1] 17

# maximum age
max(age)

[1] 40

Your turn

Using the bmi variable from the FAMuSS data, compute the minimum, maximum, and quartiles:

min (0th percentile)
1st quartile (25th percentile)
median (50th percentile)
3rd quartile (75th percentile)
max (100th percentile)

Solution

# min
min(bmi)

[1] 15.504

# quartiles
quantile(bmi, probs = 0.25)

   25% 
21.295

quantile(bmi, probs = 0.5)

  50% 
23.35

quantile(bmi, probs = 0.75)

    75% 
26.6245

# max
max(bmi)

[1] 43.758

As a shortcut, if you want to inspect all common measures of location at once (the five-number summary plus the mean), the summary(...) function will do just that.

# all common location/center measures
summary(age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   17.0    20.0    22.0    24.4    27.0    40.0
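The result of summary() is a named vector, so individual entries can be pulled out by name if you need one of them later. A small sketch on the six ages from the head(famuss) output (not the full dataset); the object name s is arbitrary:

```r
# six ages from the first rows of the data frame (illustration only)
ages <- c(27, 36, 24, 40, 32, 24)

# store the summary, then extract entries by their printed labels
s <- summary(ages)
s[["Median"]]
s[["Mean"]]
```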

Your turn

Compute the 5-number summary for bmi using summary() and compare with your answers to the previous “your turn” exercise.

Solution

# all common location/center measures
summary(bmi)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  15.50   21.30   23.35   24.40   26.62   43.76

Measures of spread

The following functions return common measures of spread for numeric variables:

range(...) returns the minimum and maximum; the range is the difference between them
IQR(...) returns the interquartile range (width of the middle 50% of the data)
var(...) returns the variance (roughly the average squared distance of observations from the mean)
sd(...) returns the standard deviation (the square root of the variance; roughly the typical distance of observations from the mean)

# age range
range(age)

[1] 17 40

# interquartile range of ages
IQR(age)

[1] 7

# variance of age
var(age)

[1] 33.79966

# standard deviation of age
sd(age)

[1] 5.813748
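To connect these functions to their definitions, the sketch below recomputes each measure by hand and checks it against the built-in function. It uses only the six ages from the head(famuss) output, so the numbers will differ from the full-data results above; the *_by_hand names are hypothetical.

```r
# six ages from the first rows of the data frame (illustration only)
ages <- c(27, 36, 24, 40, 32, 24)

# range() returns the min and max; diff() turns them into a single width
diff(range(ages))

# IQR: 3rd quartile minus 1st quartile
q <- quantile(ages, probs = c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1])

# variance: sum of squared deviations from the mean, divided by n - 1
var_by_hand <- sum((ages - mean(ages))^2) / (length(ages) - 1)

# standard deviation: square root of the variance
sd_by_hand <- sqrt(var_by_hand)

# each comparison should return TRUE
all.equal(iqr_by_hand, IQR(ages))
all.equal(var_by_hand, var(ages))
all.equal(sd_by_hand, sd(ages))
```

Dividing by n - 1 rather than n is why the variance is only "roughly" an average of squared distances.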

Which one to choose? It depends on the distribution: if there are large outliers, IQR is a better choice; otherwise, standard deviation is conventional. In this case, there are no age outliers (you can check the histogram above from Part 1), so standard deviation is the way to go.
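A quick way to convince yourself of this is to compare both measures with and without an outlier. The numbers below are made up for illustration and are not from the FAMuSS data.

```r
# made-up sample without outliers (illustration only)
x <- c(20, 21, 22, 23, 24)

# same sample with one large outlier added
x_outlier <- c(x, 60)

# standard deviation is inflated substantially by the outlier
sd(x)
sd(x_outlier)

# interquartile range changes very little
IQR(x)          # 2
IQR(x_outlier)  # 2.5
```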

Your turn

Determine and compute an appropriate measure of spread for BMI.

Solution

# histogram of bmi
hist(bmi)

# measure of spread
sd(bmi)

[1] 4.57662

There are no obvious outliers, so standard deviation is the way to go here.