Bivariate summaries

Quantitative and graphical techniques for summarizing two variables

Today’s agenda

  1. Reading quiz
  2. Loose end: robustness
  3. Bivariate numeric and graphical summaries
  4. Lab: bivariate graphics in R

Robustness

Percentile-based measures of location and spread are less sensitive to outliers

Consider adding an observation of 94 to our 12 ages from last time. This is an outlier.

# append an outlier
ages_add <- c(ages, 94)

# means
c(original = mean(ages), with.outlier = mean(ages_add))
    original with.outlier 
    24.00000     29.38462 
# medians
c(original = median(ages), with.outlier = median(ages_add))
    original with.outlier 
        23.5         25.0 
# IQR
c(original = IQR(ages), with.outlier = IQR(ages_add))
    original with.outlier 
         8.5          9.0 
# SD
c(original = sd(ages), with.outlier = sd(ages_add))
    original with.outlier 
    5.526794    20.122701 

The effect of this outlier on each statistic is:

  • mean increases by 22.44%
  • median increases by 6.38%
  • IQR increases by 5.88%
  • SD increases by 264.09%
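The percentages above are relative changes, 100 * (new - old) / old. A minimal sketch that re-enters the printed summaries rather than recomputing them from `ages`; the vector names `original` and `with_outlier` are introduced here for illustration only:

```r
# summary statistics copied from the output above
original     <- c(mean = 24,       median = 23.5, IQR = 8.5, SD = 5.526794)
with_outlier <- c(mean = 29.38462, median = 25.0, IQR = 9.0, SD = 20.122701)

# relative change in each statistic, as a percentage
pct_change <- 100 * (with_outlier - original) / original
round(pct_change, 2)
```

Putting each change on a relative scale makes the four statistics comparable even though they measure different things.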

Robustness refers to insensitivity to outliers: a robust statistic changes little when an outlier is added. Mean and SD are less robust than median and IQR.

Choosing appropriate measures

When outliers are present, use percentile-based measures; otherwise, use the mean and the standard deviation or variance.

Check your understanding: which measures are most appropriate for each variable above?

Limitations of univariate summaries

Univariate summaries aim to capture the distribution of values of a single variable.

  • both unimodal, no obvious outliers
  • heights symmetric
  • weights right-skewed
  • but these observations actually come in pairs

Univariate summaries don’t reflect how the variables might be related.

Bivariate summaries

Bivariate summaries aim to capture a relationship between two variables.

A simple example is a scatterplot:

Each point represents a pair of values \((h, w)\) for one study participant.

  • Reveals a relationship: taller participants tend to be heavier
  • But no longer shows individual distributions clearly

Notice, though, that the marginal means (dashed red lines) still capture the center well.
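A plot of this kind can be sketched in base R as follows. The study data are not reproduced here, so R's built-in `women` dataset (heights and weights of 15 women) stands in as an assumption; only the plotting pattern carries over, not the values.

```r
# stand-in data: built-in heights and weights, NOT the study data
h <- women$height
w <- women$weight

# marginal means: one univariate summary per variable
centers <- c(height = mean(h), weight = mean(w))

# scatterplot: one point per (h, w) pair
plot(h, w, xlab = "height", ylab = "weight")

# dashed red lines at the marginal means
abline(v = centers["height"], h = centers["weight"], lty = 2, col = "red")
```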

Summary types

Bivariate summary techniques differ depending on the data types of the variables.

Question                                                                                                    Comparison type
Did genotype frequencies differ by race or sex among study participants?                                    categorical/categorical
Were differential changes in arm strength observed according to genotype?                                   numeric/categorical
Did change in arm strength appear related in any way to body size among study participants?                 numeric/numeric
Did study participants experience similar or different changes in arm strength depending on arm dominance?  ??

Categorical/categorical

A contingency table is a bivariate tabular summary of two categorical variables; it shows the frequency of each pair of values. Usually the marginal totals are also shown.

         CC   CT   TT  total
Female  106  149   98    353
Male     67  112   63    242
total   173  261  161    595

There are multiple ways to convert to proportions by using different denominators, and these yield proportions with distinct interpretations:

  • grand total – frequency of genotype/sex combination
  • row total – genotype frequency by sex
  • column total – sex frequency by genotype
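The three denominators can be sketched with base R's `prop.table()`, whose `margin` argument selects row (1) or column (2) totals. The counts are re-entered from the table above; the object names `counts`, `by_sex`, and `by_genotype` are illustrative.

```r
# genotype-by-sex counts from the contingency table above
counts <- as.table(matrix(
  c(106, 149, 98,
     67, 112, 63),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("Female", "Male"), genotype = c("CC", "CT", "TT"))
))

# counts with marginal totals appended
addmargins(counts)

# grand total as denominator: frequency of each genotype/sex combination
overall <- prop.table(counts)

# row totals as denominator: genotype frequency by sex (rows sum to 1)
by_sex <- prop.table(counts, margin = 1)

# column totals as denominator: sex frequency by genotype (columns sum to 1)
by_genotype <- prop.table(counts, margin = 2)

round(by_sex, 4)
round(by_genotype, 4)
```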

Categorical/categorical

Did genotype frequencies differ by sex among study participants?

For this question, the row totals should be used to convert to proportions.

As a table:

            CC      CT      TT  total
Female  0.3003  0.4221  0.2776      1
Male    0.2769  0.4628  0.2603      1

As a stacked bar plot:

The proportions are quite close, suggesting minimal sex differences.
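One way to draw a stacked bar plot of these row proportions is base R's `barplot()`. This is a sketch under the assumption that base graphics are acceptable; the original figure may have been produced differently.

```r
# counts from the contingency table
counts <- matrix(c(106, 149, 98, 67, 112, 63), nrow = 2, byrow = TRUE,
                 dimnames = list(c("Female", "Male"), c("CC", "CT", "TT")))

# genotype proportions within each sex (row totals as denominator)
by_sex <- prop.table(counts, margin = 1)

# barplot() draws one bar per column and stacks the rows, so transpose
# to get one bar per sex with one segment per genotype
bars <- t(by_sex)
barplot(bars, ylab = "proportion", legend.text = TRUE)
```

Because each bar sums to 1, differences between sexes show up as differences in segment heights.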

Categorical/categorical

Did sex frequencies differ by genotype among study participants?

For this question, the column totals should be used to compute proportions.

As a table:

            CC      CT      TT
Female  0.6127  0.5709  0.6087
Male    0.3873  0.4291  0.3913
total        1       1       1

As a stacked bar plot:

The proportions are close, suggesting minimal genotype differences.

Numeric/categorical

Side-by-side boxplots are usually a good option. Avoid stacked histograms.
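A side-by-side boxplot can be sketched with the formula interface to `boxplot()`. The data frame `sim` below is synthetic and purely illustrative (the study measurements are not reproduced here), so the picture says nothing about the real genotype comparison.

```r
set.seed(1)  # reproducible synthetic data

# synthetic stand-in: 30 values per group, NOT the study measurements
sim <- data.frame(
  genotype = factor(rep(c("CC", "CT", "TT"), each = 30)),
  change   = rnorm(90, mean = 50, sd = 30)
)

# one box per level of the categorical variable, on a shared axis
bp <- boxplot(change ~ genotype, data = sim, ylab = "change (synthetic units)")

# group medians and IQRs to back up what the boxes show
aggregate(change ~ genotype, data = sim,
          FUN = function(v) c(median = median(v), IQR = IQR(v)))
```

A shared axis is what makes shifts in location and differences in spread easy to compare across groups.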

Were differential changes in arm strength observed according to genotype?

Look for differences:

  • location shift
  • spread
  • center

What do you think? Any notable relationships?

Numeric/numeric

Did change in arm strength appear related in any way to body size among study participants?

Pairwise scatterplots indicate no apparent relationships.

Interpreting scatterplots

Scatterplots show the presence or absence of an association.

If there is an association (i.e., discernible pattern), it can be:

  • linear or nonlinear

    • linear if scatter roughly follows a straight line, nonlinear otherwise
  • positive or negative

    • positive if scatter is increasing from left to right, negative otherwise

The plot at left is an example of a positive and (slightly) nonlinear relationship.

Practice interpreting scatterplots

Correlation

In addition to graphical techniques, there are also quantitative measures of relationship for numeric/numeric comparisons.

Correlation measures the strength of linear relationship, and is defined as: \[r_{xy} = \frac{1}{n - 1}\frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}\]

  • \(r \rightarrow 1\): strong positive linear relationship
  • \(r \rightarrow -1\): strong negative linear relationship
  • \(r \rightarrow 0\): no linear relationship
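The definition can be checked term by term against R's built-in `cor()`. A minimal sketch using small made-up vectors `x` and `y` (illustrative values, not study data):

```r
# small illustrative vectors (made up for this example)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)
n <- length(x)

# the formula above, term by term; sd() uses the n - 1 denominator
r_manual <- sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * sd(x) * sd(y))

# built-in equivalent
r_builtin <- cor(x, y)

all.equal(r_manual, r_builtin)
```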

Interpreting correlations

Did change in arm strength appear related in any way to body size among study participants?

Here are the correlations corresponding to the plots we checked earlier.

            height   weight       bmi
drm.ch     -0.1104  -0.1159  -0.07267
ndrm.ch     -0.265  -0.2529   -0.1436

So there aren’t any appreciable linear relationships here. A rule of thumb:

  • \(|r| < 0.3\): minimal or no linear relationship
  • \(0.3 \leq |r| < 0.6\): weak to moderate relationship
  • \(0.6 \leq |r| < 1\): moderate to strong relationship
  • \(|r| = 1\): either a mistake or not real data
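The rule of thumb can be applied mechanically. The helper `describe_r()` below is hypothetical (it is not part of base R or of these slides), and its labels paraphrase the bullets above:

```r
# hypothetical helper: label a correlation using the rule of thumb above
describe_r <- function(r) {
  a <- abs(r)
  ifelse(a < 0.3, "minimal or none",
  ifelse(a < 0.6, "weak to moderate",
  ifelse(a < 1,   "moderate to strong",
                  "check the data")))
}

# correlations copied from the table above
r_tab <- c(drm.height  = -0.1104, drm.weight  = -0.1159, drm.bmi  = -0.07267,
           ndrm.height = -0.265,  ndrm.weight = -0.2529, ndrm.bmi = -0.1436)

describe_r(r_tab)
```

The cutoffs are conventions rather than sharp boundaries, so a value near a threshold, such as -0.265, deserves a look at the scatterplot as well.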

Data transformations

Sometimes a simple transformation can reveal a linear relationship on an alternate scale.

# correlation coefficient
cor(height, weight)
[1] 0.5308787

# correlation coefficient after log-transforming weight
cor(height, log(weight))
[1] 0.5609356
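To see why a transformation can raise the correlation, consider an idealized case in which one variable grows exponentially in the other. The vectors below are synthetic and noise-free, chosen only to make the effect obvious; real data such as the heights and weights above show a much smaller gain.

```r
# synthetic example: y grows exponentially in x (no noise)
x <- 1:10
y <- exp(0.5 * x)

r_raw <- cor(x, y)       # curved relationship, so r falls short of 1
r_log <- cor(x, log(y))  # log(y) = 0.5 * x is exactly linear in x

c(raw = r_raw, log.scale = r_log)
```

Correlation measures only linear association, so straightening a curved pattern lets the same underlying relationship register as stronger.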

Interpretation

# correlation coefficient after log-transforming weight
cor(height, log(weight))
[1] 0.5609356

Applying these rules of thumb:

  • \(|r| < 0.3\): minimal association
  • \(0.3 \leq |r| < 0.6\): weak to moderate
  • \(0.6 \leq |r| < 1\): moderate to strong

And:

  • \(r > 0\): positive association
  • \(r < 0\): negative association

The interpretation is:

There is a moderate positive linear relationship between height and log weight.