Simple linear regression

Model specification, parameter estimation, inference, and diagnostics

Today’s agenda

  1. [lecture] estimation and inference for simple linear regression
  2. course evaluations
  3. final scheduling
  4. [lab] fitting SLR models in R

PREVEND data

Ruff Figural Fluency Test (RFFT) is a cognitive assessment.

  • measures nonverbal capacity for initiation, planning, and divergent reasoning
  • scale: 0 (worst) to 175 (best)

  casenr  age  rfft
     126   37   136
      33   36    80
     145   37   102

How much does cognitive ability as measured by RFFT decline with age on average?

Best-fitting line

Previously you found the best-fitting line:

\[ \text{RFFT} = 134.098 - 1.191 \times \text{age} \]

With each year of age, RFFT decreases by 1.191 points on average.

\[ \begin{align} \text{slope}: \quad-1.191 &= \text{cor}(\text{age}, \text{RFFT})\times\frac{SD(\text{RFFT})}{SD(\text{age})} \\ \text{intercept}: \quad134.098 &= \text{mean}(\text{RFFT}) - (-1.191)\times\text{mean}(\text{age}) \end{align} \]
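
As a quick check, these relationships can be verified directly in R (a sketch, assuming the prevend data frame from above is loaded):

# slope and intercept from summary statistics
# (should reproduce -1.191 and 134.098)
b1 <- cor(prevend$age, prevend$rfft) * sd(prevend$rfft) / sd(prevend$age)
b0 <- mean(prevend$rfft) - b1 * mean(prevend$age)
c(intercept = b0, slope = b1)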

Bias and error

Recall how you found this line:

Bias and error are measured via residuals: \[ \textcolor{red}{e_i} = y_i - \textcolor{blue}{\hat{y}_i} \]

  • \(\text{bias} = -\frac{1}{n}\sum_i \textcolor{red}{e_i}\)
  • \(\text{SSE} = \sum_i \textcolor{red}{e_i}^2\)

We said that the best-fitting line achieved two conditions:

  • no bias: underestimates and overestimates equally often
  • minimal error: as close as possible to as many data points as possible
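
A sketch of these quantities in R, assuming the prevend data and the fitted line above:

# residuals from the fitted line RFFT = 134.098 - 1.191*age
e <- prevend$rfft - (134.098 - 1.191 * prevend$age)
-mean(e)    # bias: essentially zero for the best-fitting line
sum(e^2)    # SSE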

The SLR model

The simple linear regression model is:

\[ Y = \textcolor{blue}{\underbrace{\beta_0 + \beta_1 x}_\text{mean}} + \textcolor{red}{\underbrace{\epsilon}_\text{error}} \]

  • continuous response \(Y\)
  • explanatory variable \(x\)
  • regression coefficients \(\beta_0, \beta_1\)
  • model error \(\epsilon\)
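
To see what the model describes as a data-generating process, here is a small simulation sketch with made-up parameter values (beta0 = 134, beta1 = -1.2, sigma = 20; illustrative only, not estimates from any data):

# simulate n observations from Y = beta0 + beta1*x + epsilon
set.seed(1)
n <- 208
x <- runif(n, min = 35, max = 80)       # explanatory variable
y <- 134 - 1.2*x + rnorm(n, sd = 20)    # mean + error
plot(x, y)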

The values that minimize error subject to the model being unbiased are:

\[\begin{align*} \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} &\quad(\text{unbiased}) \\ \hat{\beta}_1 &= \frac{s_y}{s_x}\times r &\quad(\text{minimizes SSE}) \end{align*}\]

These are called the least squares estimates.

Least squares estimates in R

According to the model, a one-unit increment in \(x\) corresponds to a \(\beta_1\)-unit change in mean \(Y\):

# fit model
fit <- lm(formula = rfft ~ age, data = prevend)
fit

Call:
lm(formula = rfft ~ age, data = prevend)

Coefficients:
(Intercept)          age  
    134.098       -1.191  

With each additional year of age, mean RFFT score decreases by an estimated 1.191 points.

  • formula = <RESPONSE> ~ <EXPLANATORY> specifies the model
  • data = <DATAFRAME> specifies the observations
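
The fitted object can also be queried directly; for example (a sketch, using age 50 as an arbitrary illustrative value):

coef(fit)                                      # estimated coefficients
predict(fit, newdata = data.frame(age = 50))   # estimated mean RFFT at age 50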

Error variability and model fit

The residual standard deviation provides an estimate of error variability:

\[\textcolor{red}{\hat{\sigma}} = \sqrt{\frac{1}{n - 2} \sum_i e_i^2} \qquad\text{(estimated error variability)}\]

The proportion of variability explained by the model is: \[ R^2 = 1 - \frac{(n - 2)\textcolor{red}{\hat{\sigma}^2}}{(n - 1)\textcolor{darkgrey}{s_y^2}} \quad\left(1 - \frac{\text{error variability}}{\text{total variability}}\right) \]

n <- nrow(prevend)
1 - (n - 2)*sigma(fit)^2/((n - 1)*var(prevend$rfft))
[1] 0.4043103

Age explains 40.43% of variability in RFFT.
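
Both quantities are also available directly from the fitted model object:

sigma(fit)               # residual standard deviation
summary(fit)$r.squared   # R-squared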

Standard errors for the coefficients

Standard errors for the coefficients are:

\[SE\left(\hat{\beta}_0\right) = \hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{(n - 1)s_x^2}} \qquad\text{and}\qquad SE\left(\hat{\beta}_1\right) = \hat{\sigma}\sqrt{\frac{1}{(n - 1)s_x^2}}\]

While you won’t need to know these formulae, do notice that:

  • more data \(\longrightarrow\) less sampling variability
  • more spread in \(x\) \(\longrightarrow\) less sampling variability
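
If you're curious, the formulae can be checked against R's output; a sketch assuming the prevend data and fit from above:

n <- nrow(prevend)
sx2 <- var(prevend$age)
sigma(fit) * sqrt(1/n + mean(prevend$age)^2/((n - 1)*sx2))   # SE of intercept
sigma(fit) * sqrt(1/((n - 1)*sx2))                           # SE of slope
# compare with summary(fit)$coefficients[, "Std. Error"]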

Inference for the coefficients

If the errors are symmetric and unimodal, then the sampling distribution of \[ T = \frac{\hat{\beta}_1 - \beta_1}{SE\left(\hat{\beta}_1\right)} \] is well-approximated by a \(t_{n - 2}\) model.

  1. Significance test: \(\begin{cases} H_0: \beta_1 = 0 \\ H_A: \beta_1 \neq 0 \end{cases}\)

  2. Confidence interval: \(\hat{\beta}_1 \pm c\times SE\left(\hat{\beta}_1\right)\)

  • \(P(|T| > |T_\text{obs}|) \approx 0\): very strong evidence of an association (true slope is not zero)
  • confidence interval using \(t_{206}\) critical value: (-1.389, -0.992)
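
Both can be computed by hand from the slope estimate (-1.1908) and standard error (0.1007) reported in the summary output on the next slide:

# t statistic and two-sided p-value on n - 2 = 206 degrees of freedom
t_obs <- -1.1908/0.1007
2*pt(abs(t_obs), df = 206, lower.tail = FALSE)
# 95% confidence interval
-1.1908 + c(-1, 1)*qt(0.975, df = 206)*0.1007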

Inference for the PREVEND study

fit <- lm(rfft ~ age, data = prevend)
summary(fit)

Call:
lm(formula = rfft ~ age, data = prevend)

Residuals:
    Min      1Q  Median      3Q     Max 
-56.085 -14.690  -2.937  12.744  77.975 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 134.0981     6.0701   22.09   <2e-16 ***
age          -1.1908     0.1007  -11.82   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.52 on 206 degrees of freedom
Multiple R-squared:  0.4043,    Adjusted R-squared:  0.4014 
F-statistic: 139.8 on 1 and 206 DF,  p-value: < 2.2e-16

confint(fit)
                 2.5 %      97.5 %
(Intercept) 122.130647 146.0654574
age          -1.389341  -0.9922471

Fitted model: \[ \text{RFFT} = 134.098 - 1.191 \times \text{age} \]

  • Age explains an estimated 40.43% of variation in RFFT.

  • With each year of age mean RFFT declines by an estimated 1.19 points (SE 0.10).

  • There is a significant association between age and mean RFFT score (T = -11.82 on 206 degrees of freedom, p < 0.0001).

  • With 95% confidence, each additional year of age is associated with an estimated decline in mean RFFT between 0.99 and 1.39 points.

Kleiber’s law

Kleiber’s law refers to the relationship between metabolic rate and body mass.

We can estimate it via the SLR model: \[ \log(\text{metabolism}) = \beta_0 + \beta_1 \log(\text{mass}) + \epsilon \]

Fitted model: \[ \log(\text{metabolism}) = 5.64 + 0.74 \times \log(\text{mass}) \]

fit <- lm(log.metab ~ log.mass, data = kleiber)
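
(For the line above to run, kleiber must already contain log-transformed columns. A minimal sketch of creating them, assuming raw columns named metabolism and mass — hypothetical names:)

kleiber$log.metab <- log(kleiber$metabolism)   # column names are assumed
kleiber$log.mass  <- log(kleiber$mass)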

Kleiber’s law: inference

fit <- lm(log.metab ~ log.mass, data = kleiber)
summary(fit)

Call:
lm(formula = log.metab ~ log.mass, data = kleiber)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.14216 -0.26466 -0.04889  0.25308  1.37616 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.63833    0.04709  119.73   <2e-16 ***
log.mass     0.73874    0.01462   50.53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4572 on 93 degrees of freedom
Multiple R-squared:  0.9649,    Adjusted R-squared:  0.9645 
F-statistic:  2553 on 1 and 93 DF,  p-value: < 2.2e-16

  • \(\hat{\beta}_0 = 5.638\)
  • \(\hat{\beta}_1 = 0.739\)
  • \(\hat{\sigma} = 0.457\)

  • There is a significant association between body mass and metabolism (p < 0.0001).

  • Body mass explains 96.49% of variation in metabolism.

  • With 95% confidence, a unit increment in log mass is associated with an estimated increase in mean log metabolism between 0.7097 and 0.7678, with a point estimate of 0.7387.
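
The interval quoted above comes from confint(), as in the PREVEND example:

confint(fit)["log.mass", ]   # 95% CI for the slope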

Kleiber’s law: model interpretation

Exponentiating both sides of the fitted SLR model equation:

\[ \underbrace{\text{metabolism}}_{e^{\log(\text{metabolism})}} = \underbrace{280.99}_{e^{5.64}} \times \underbrace{\text{mass}^{0.74}}_{e^{0.74 \log(\text{mass})}} \]

So we’ve really estimated what’s known as a power law relationship: \(y = ax^b\).

  • multiplicative, not additive, relationship
  • doubling \(x\) corresponds to changing \(y\) by a factor of \(2^b\)

The estimate and interval for \(\beta_1\) in the SLR model can be transformed appropriately for a more direct interpretation:

With 95% confidence, every doubling of body mass is associated with an estimated 63.55-70.26% increase in median metabolism.
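
The transformed interval can be obtained directly from the model's confidence interval; a sketch, assuming fit is the Kleiber model above:

# percent change in metabolism per doubling of mass: 100*(2^b - 1)
100*(2^confint(fit)["log.mass", ] - 1)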