Robustness refers to insensitivity to outliers: a robust measure changes little when a few extreme values are present. Mean and SD are less robust than median and IQR.
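A small sketch can make the contrast concrete. The values below are made up for illustration (they are not the study data), and `summaries` is a hypothetical helper name.

```r
# Hypothetical sample with no extreme values
x <- c(10, 12, 13, 14, 15, 16, 18)

# The same sample with one extreme value appended
x_out <- c(x, 100)

# Collect the four measures for a numeric vector
summaries <- function(v) {
  c(mean = mean(v), sd = sd(v), median = median(v), iqr = IQR(v))
}

# One row per version of the sample, for side-by-side comparison
res <- rbind(original = summaries(x), with_outlier = summaries(x_out))
print(res)
```

In this example the mean moves from 14 to 24.75 after adding the outlier, while the median moves only from 14 to 14.5.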
Choosing appropriate measures
When outliers are present, use percentile-based measures such as the median and IQR; otherwise, use the mean and standard deviation (or variance).
Check your understanding: which measures are most appropriate for each variable above?
Limitations of univariate summaries
Univariate summaries aim to capture the distribution of values of a single variable.
both unimodal, no obvious outliers
heights symmetric
weights right-skewed
but these observations actually come in pairs
Univariate summaries don’t reflect how the variables might be related.
Bivariate summaries
Bivariate summaries aim to capture a relationship between two variables.
A simple example is a scatterplot:
Each point represents a pair of values \((h, w)\) for one study participant.
Reveals a relationship: taller participants tend to be heavier
But no longer shows individual distributions clearly
Notice, though, that the marginal means (dashed red lines) still capture the center well.
Summary types
Bivariate summary techniques differ depending on the data types of the variables.
| Question | Comparison type |
|---|---|
| Did genotype frequencies differ by race or sex among study participants? | categorical/categorical |
| Were differential changes in arm strength observed according to genotype? | numeric/categorical |
| Did change in arm strength appear related in any way to body size among study participants? | numeric/numeric |
| Did study participants experience similar or different changes in arm strength depending on arm dominance? | ?? |
Categorical/categorical
A contingency table is a bivariate tabular summary of two categorical variables; it shows the frequency of each pair of values. Usually the marginal totals are also shown.
| | CC | CT | TT | total |
|---|---|---|---|---|
| Female | 106 | 149 | 98 | 353 |
| Male | 67 | 112 | 63 | 242 |
| total | 173 | 261 | 161 | 595 |
There are multiple ways to convert to proportions by using different denominators, and these yield proportions with distinct interpretations:
grand total – frequency of genotype/sex combination
row total – genotype frequency by sex
column total – sex frequency by genotype
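As a sketch of how the three denominators could be applied in R, the example below enters the counts from the table above by hand and uses base R's `prop.table()`. The object names (`counts`, `prop_grand`, `prop_row`, `prop_col`) are illustrative choices, not names from the original analysis.

```r
# Counts copied from the genotype-by-sex contingency table
counts <- as.table(matrix(
  c(106, 149, 98,
     67, 112, 63),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("Female", "Male"),
                  genotype = c("CC", "CT", "TT"))
))

# Append the marginal totals to the table of counts
print(addmargins(counts))

# Three denominators, three interpretations
prop_grand <- prop.table(counts)              # share of all participants
prop_row   <- prop.table(counts, margin = 1)  # genotype frequency by sex
prop_col   <- prop.table(counts, margin = 2)  # sex frequency by genotype

print(round(prop_row, 4))
print(round(prop_col, 4))
```

With `margin = 1` each row sums to 1, and with `margin = 2` each column sums to 1, which is a quick way to confirm that the intended denominator was used.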
Categorical/categorical
Did genotype frequencies differ by sex among study participants?
For this question, the row totals should be used to convert to proportions.
As a table:
| | CC | CT | TT | total |
|---|---|---|---|---|
| Female | 0.3003 | 0.4221 | 0.2776 | 1 |
| Male | 0.2769 | 0.4628 | 0.2603 | 1 |
As a stacked bar plot:
The proportions are quite close, suggesting minimal sex differences.
Categorical/categorical
Did sex frequencies differ by genotype among study participants?
For this question, the column totals should be used to compute proportions.
As a table:
| | CC | CT | TT |
|---|---|---|---|
| Female | 0.6127 | 0.5709 | 0.6087 |
| Male | 0.3873 | 0.4291 | 0.3913 |
| total | 1 | 1 | 1 |
As a stacked bar plot:
The proportions are close across genotypes, suggesting that sex composition differs minimally by genotype.
Numeric/categorical
Side-by-side boxplots are usually a good option. Avoid stacked histograms.
Were differential changes in arm strength observed according to genotype?
Look for differences:
location shift
spread
center
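A minimal sketch of side-by-side boxplots in base R follows. The data are simulated purely for illustration and are not the study data; the variable names `genotype` and `change` are assumptions.

```r
set.seed(1)  # simulated data for illustration only

# Hypothetical grouping variable and numeric response
genotype <- factor(rep(c("CC", "CT", "TT"), each = 50))
change   <- rnorm(150, mean = 50, sd = 30)

# One box per genotype on a shared vertical axis
boxplot(change ~ genotype,
        xlab = "genotype", ylab = "change in arm strength")

# Numeric companions to the plot: center and spread per group
group_medians <- tapply(change, genotype, median)
group_iqrs    <- tapply(change, genotype, IQR)
print(rbind(median = group_medians, iqr = group_iqrs))
```

Comparing the per-group medians and IQRs alongside the plot gives a rough numeric check on any apparent difference in center or spread.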
What do you think? Any notable relationships?
Numeric/numeric
Did change in arm strength appear related in any way to body size among study participants?
Pairwise scatterplots indicate no apparent relationships.
Interpreting scatterplots
Scatterplots show the presence or absence of an association.
If there is an association (i.e., discernible pattern), it can be:
linear or nonlinear
linear if scatter roughly follows a straight line, nonlinear otherwise
positive or negative
positive if scatter is increasing from left to right, negative otherwise
The plot at left is an example of a positive and (slightly) nonlinear relationship.
Practice interpreting scatterplots
Correlation
In addition to graphical techniques, there are also quantitative measures of relationship for numeric/numeric comparisons.
Correlation measures the strength of a linear relationship, and is defined as: \[r_{xy} = \frac{1}{n - 1}\frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}\] where \(s_x\) and \(s_y\) are the sample standard deviations of \(x\) and \(y\).
\(r \rightarrow 1\): positive relationship
\(r \rightarrow -1\): negative relationship
\(r \rightarrow 0\): no linear relationship
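To connect the formula to R's built-in function, the sketch below computes the correlation term by term and compares it with `cor()`. The paired values are hypothetical, not the study data.

```r
# Hypothetical paired measurements (illustration only)
x <- c(160, 165, 170, 175, 180, 185)
y <- c(55, 62, 60, 72, 70, 81)

n <- length(x)

# Sum of cross-products of deviations, scaled by (n - 1) and both sample SDs
r_manual <- sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * sd(x) * sd(y))

# Built-in Pearson correlation for comparison
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
print(all.equal(r_manual, r_builtin))
```

The two values should agree up to floating-point error, since `cor()` computes the Pearson correlation by default.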
Interpreting correlations
Did change in arm strength appear related in any way to body size among study participants?
Here are the correlations corresponding to the plots we checked earlier.
| | height | weight | bmi |
|---|---|---|---|
| drm.ch | -0.1104 | -0.1159 | -0.07267 |
| ndrm.ch | -0.265 | -0.2529 | -0.1436 |
So there aren’t any linear relationships here. A rule of thumb:
\(|r| < 0.3\): no relationship
\(0.3 \leq |r| < 0.6\): weak to moderate relationship
\(0.6 \leq |r| < 1\): moderate to strong relationship
\(|r| = 1\): either a mistake or not real data
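The rule of thumb can be written as a small helper and applied to the correlations in the table above. This is only a sketch: `describe_r` is a hypothetical function name, and the cutoffs are the ones listed here rather than any universal standard.

```r
# Map a correlation to the rule-of-thumb label used in these notes
describe_r <- function(r) {
  a <- abs(r)
  ifelse(a == 1,   "either a mistake or not real data",
  ifelse(a >= 0.6, "moderate to strong relationship",
  ifelse(a >= 0.3, "weak to moderate relationship",
                   "no relationship")))
}

# Correlations copied from the table of arm strength change vs. body size
r_values <- c(drm_height  = -0.1104, drm_weight  = -0.1159, drm_bmi  = -0.07267,
              ndrm_height = -0.265,  ndrm_weight = -0.2529, ndrm_bmi = -0.1436)

r_labels <- describe_r(r_values)
print(data.frame(r = r_values, label = r_labels))
```

All six correlations have absolute value below 0.3, so each one is labeled "no relationship".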
Data transformations
Sometimes a simple transformation can reveal a linear relationship on an alternate scale.
# correlation coefficient on the original scale
cor(height, weight)
[1] 0.5308787
# correlation coefficient after log-transforming weight
cor(height, log(weight))
[1] 0.5609356
Interpretation
# correlation coefficient
cor(height, log(weight))
[1] 0.5609356
Applying these rules of thumb:
\(|r| < 0.3\): minimal association
\(0.3 \leq |r| < 0.6\): weak to moderate
\(0.6 \leq |r| < 1\): moderate to strong
And:
\(r > 0\): positive association
\(r < 0\): negative association
The interpretation is:
There is a moderate positive linear relationship between height and log weight.