Least squares estimation for the simple linear model
Today’s agenda
[lab] Line fitting
[lecture] Least squares estimation
Meadowfoam flowering
Mean flowering depends on light intensity.
the “response” variable is flowers per plant
the “explanatory” variable is light intensity
In an ANOVA model, the explanatory variable is treated as categorical.
one mean estimated per unique value of the explanatory variable
values usually fixed in advance by researcher
What if light intensity instead varied continuously?
Could we estimate mean flowering as a continuous function of intensity?
PREVEND data
Ruff Figural Fluency Test (RFFT) is a cognitive assessment.
measures nonverbal fluency and executive cognitive function
scale: 0 (worst) to 175 (best)
casenr  age  rfft
   126   37   136
    33   36    80
   145   37   102
   146   37    85
Question of interest:
How much does cognitive ability as measured by RFFT decline with age on average?
PREVEND data
A straight line seems to describe the RFFT-age relationship well enough.
This suggests a model:
mean RFFT is a linear function of age
remaining variation in RFFT is random
Lines
The equation of a line in slope-intercept form is:
$y = ax + b$
$a$ is the slope (rise over run) and $b$ is the intercept:
the line crosses the $y$ axis at $b$
$y$ changes by $a$ per unit increment in $x$
There is exactly one line through any two points.
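As a small illustration of the last point, the slope and intercept of the line through two points can be computed directly. This is only a sketch: the helper name `line_through` and the two points are made up for the example.

```r
# Hypothetical helper: slope and intercept of the line through two points
line_through <- function(x1, y1, x2, y2) {
  stopifnot(x1 != x2)             # a vertical line has no slope-intercept form
  slope <- (y2 - y1) / (x2 - x1)  # rise over run
  intercept <- y1 - slope * x1    # value of y where x = 0
  c(intercept = intercept, slope = slope)
}

# Made-up points (1, 3) and (4, 9)
ab <- line_through(1, 3, 4, 9)
ab
```

Any other pair of distinct points on the same line gives back the same slope and intercept.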
Linear trends in data
We say a set of data points exhibits a linear trend if the points fall “near” a line.
Some things to keep in mind:
real data will never fall exactly on a line
there may be outliers (e.g., red point at right)
points could be pretty far spread out and still exhibit a linear trend
Linear trends in data
We can articulate two properties of linear trends:
direction: positive (A) or negative (B)
strength: concentration about line
Note that trends in each row are identical.
Correlation
Correlation is a signed measure of strength of linear relationship.
When data points are far from the mean at the same time in the same direction, the magnitude will be larger.
sign indicates direction of relationship
magnitude indicates strength (correlation is always between -1 and 1)
Common mistake: strength ≠ slope.
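One way to see both points is to compute the correlation by hand as an average product of standardized values, and then rescale one variable. The sketch below uses simulated data (not the PREVEND data); the variable names are made up for the example.

```r
set.seed(1)                      # simulated data, for illustration only
x <- rnorm(50)
y <- 2 - 0.5 * x + rnorm(50)

# Standardize each variable, then average the products (dividing by n - 1)
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)
r_manual <- sum(zx * zy) / (length(x) - 1)

# Rescaling y makes the trend ten times steeper but leaves correlation unchanged
c(manual = r_manual, builtin = cor(x, y), rescaled = cor(x, 10 * y))
```

All three numbers agree, even though the trend in `10 * y` is ten times steeper than the trend in `y`.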
Line fitting activity
Correlation between age and RFFT:
cor(age, rfft)
[1] -0.635854
negative relationship (sign)
moderate strength
But how do we find a line that describes the trend?
Best-fitting line
With each year of age, RFFT decreases by 1.191 points on average.
Bias and error
Bias and error are measured via residuals:
$e_i = y_i - \hat{y}_i$, where $\hat{y}_i$ is the value of the line at $x_i$
The best-fitting line achieves two conditions:
no bias: underestimates and overestimates equally often
minimal error: as close as possible to as many data points as possible
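These two conditions can be checked numerically for any candidate line. The sketch below uses simulated data and made-up helper names (`resid_line`, `summarize_line`); it summarizes bias as the average residual and error as the sum of squared residuals, which is one common way to make the conditions concrete.

```r
set.seed(2)                      # simulated data, for illustration only
x <- runif(40, 20, 80)
y <- 130 - 1.2 * x + rnorm(40, sd = 15)

# Residuals for a candidate line with a given intercept and slope
resid_line <- function(intercept, slope) y - (intercept + slope * x)

# Average residual (bias) and sum of squared residuals (error)
summarize_line <- function(intercept, slope) {
  e <- resid_line(intercept, slope)
  c(mean_resid = mean(e), sum_sq = sum(e^2))
}

guess <- summarize_line(120, -1)            # an arbitrary candidate line
ls_coef <- unname(coef(lm(y ~ x)))          # least squares line
best <- summarize_line(ls_coef[1], ls_coef[2])
rbind(guess, best)
```

The least squares line has an average residual of zero (up to rounding) and a sum of squares no larger than that of the arbitrary guess.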
The SLR model
The simple linear regression model is:
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
$y_i$: continuous response
$x_i$: explanatory variable
$\beta_0, \beta_1$: regression coefficients
$\varepsilon_i$: model error
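The values that minimize error subject to the model being unbiased are:
$\hat{\beta}_1 = r \cdot \frac{s_y}{s_x}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
where $r$ is the correlation, $s_x$ and $s_y$ are the sample standard deviations, and $\bar{x}$ and $\bar{y}$ are the sample means.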
These are called the least squares estimates.
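For simple linear regression, the least squares slope equals the correlation times the ratio of standard deviations, and the intercept makes the line pass through the point of means. A minimal sketch on simulated data (not the PREVEND data) compares these closed-form values with what `lm()` returns:

```r
set.seed(3)                      # simulated data, for illustration only
x <- runif(40, 20, 80)
y <- 130 - 1.2 * x + rnorm(40, sd = 15)

# Closed-form least squares estimates
slope <- cor(x, y) * sd(y) / sd(x)
intercept <- mean(y) - slope * mean(x)

manual <- c(intercept, slope)
from_lm <- unname(coef(lm(y ~ x)))
rbind(manual, from_lm)
```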
Least squares estimates in R
According to the model, a one-unit increment in $x$ corresponds to a $\beta_1$-unit change in mean $y$:
# fit model
fit <- lm(formula = rfft ~ age, data = prevend)
fit
Call:
lm(formula = rfft ~ age, data = prevend)
Coefficients:
(Intercept) age
134.098 -1.191
With each additional year of age, mean RFFT score decreases by an estimated 1.191 points.
formula = <RESPONSE> ~ <EXPLANATORY> specifies the model
data = <DATAFRAME> specifies the observations
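The fitted coefficients can be used to estimate mean RFFT at a given age. The sketch below first does the arithmetic with the rounded coefficients printed above (the ages are arbitrary choices), then shows that `predict()` performs the same calculation on a fitted model. The data frame `toy` is simulated stand-in data, not the PREVEND data.

```r
# Rounded coefficients copied from the printed output above
b0 <- 134.098
b1 <- -1.191

# Estimated mean RFFT at a few arbitrarily chosen ages
ages <- c(40, 50, 60)
est_mean <- b0 + b1 * ages
est_mean

# With a fitted model object, predict() does the same arithmetic;
# `toy` is simulated stand-in data, not the PREVEND data
set.seed(4)
toy <- data.frame(age = runif(30, 35, 85))
toy$rfft <- 134 - 1.2 * toy$age + rnorm(30, sd = 20)
toy_fit <- lm(rfft ~ age, data = toy)
pred <- predict(toy_fit, newdata = data.frame(age = ages))
pred
```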
Error variability and model fit
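The residual standard deviation provides an estimate of error variability:
$\hat{\sigma} = \sqrt{\frac{1}{n - 2} \sum_{i = 1}^n (y_i - \hat{y}_i)^2}$
The proportion of variability explained by the model is:
$R^2 = 1 - \frac{(n - 2)\hat{\sigma}^2}{(n - 1) s_y^2}$, where $s_y^2$ is the sample variance of the response
1 - (n - 2)*sigma(fit)^2/((n - 1)*var(rfft))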
[1] 0.4043103
Age explains 40.43% of variability in RFFT.
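As a check on these quantities, the sketch below computes the residual standard deviation and the proportion of variability explained by hand and compares them with R's built-in values. It uses simulated stand-in data (`toy`), not the PREVEND data; the last line squares the correlation reported earlier for age and RFFT.

```r
set.seed(5)                      # simulated stand-in data, for illustration only
toy <- data.frame(age = runif(50, 35, 85))
toy$rfft <- 134 - 1.2 * toy$age + rnorm(50, sd = 20)
fit <- lm(rfft ~ age, data = toy)
n <- nrow(toy)

# Residual standard deviation: sum of squared residuals over n - 2
sigma_hat <- sqrt(sum(resid(fit)^2) / (n - 2))
c(manual = sigma_hat, builtin = sigma(fit))

# Proportion of variability explained, three equivalent ways
r2_formula <- 1 - (n - 2) * sigma_hat^2 / ((n - 1) * var(toy$rfft))
r2_summary <- summary(fit)$r.squared
r2_cor <- cor(toy$age, toy$rfft)^2
c(formula = r2_formula, summary = r2_summary, cor_sq = r2_cor)

# Squared correlation between age and RFFT reported earlier
r2_reported <- (-0.635854)^2
r2_reported
```

In simple linear regression the proportion of variability explained equals the squared correlation, which is why the last value lines up with the 40.43% figure.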
Model specification
Is the relationship actually linear?
Two ways to check:
Inspect observed data for nonlinear trend
Inspect model residuals for “patterns”
Local smoothing is shown in blue.
The linear model is a fine approximation here – the curvature is very minor – but let’s consider an alternative model specification as a thought exercise.
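A residual check of this kind can be sketched with base R graphics. The example uses simulated stand-in data (`toy`) and `lowess()` for the local smooth; the smoother used for the original figure is not stated, so that choice is an assumption.

```r
set.seed(6)                      # simulated stand-in data, for illustration only
toy <- data.frame(age = runif(60, 35, 85))
toy$rfft <- 134 - 1.2 * toy$age + rnorm(60, sd = 20)
fit <- lm(rfft ~ age, data = toy)

# Residuals against the explanatory variable, with a local smooth in blue
plot(toy$age, resid(fit), xlab = "age", ylab = "residual")
abline(h = 0, lty = 2)           # reference line at zero
smooth <- lowess(toy$age, resid(fit))
lines(smooth, col = "blue")
```

If the linear specification is adequate, the smooth should stay close to the zero line with no systematic bend.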
An alternative model
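Suppose the log of RFFT, rather than RFFT itself, is modeled as linear in age: $\log y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$.
The model now implies that the mean RFFT score is a nonlinear function of age:
$y_i = e^{\beta_0 + \beta_1 x_i} e^{\varepsilon_i}$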
And we can still fit it using least squares:
lm(log(rfft) ~ age, data = prevend)
Call:
lm(formula = log(rfft) ~ age, data = prevend)
Coefficients:
(Intercept) age
5.21739 -0.01951
Linear models can be used to capture more than just linear relationships!
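On the log scale the age coefficient acts multiplicatively on the original scale, so it can be read as an approximate percent change per year. The sketch below uses the rounded coefficients printed above; the ages are arbitrary choices. Strictly, exponentiating the fitted line tracks a typical RFFT value rather than the arithmetic mean.

```r
# Rounded coefficients copied from the printed output above
b0 <- 5.21739
b1 <- -0.01951

# Multiplicative change in RFFT per additional year of age
per_year <- exp(b1)
pct_change <- 100 * (per_year - 1)
pct_change                       # about -1.93, i.e. roughly 1.9% lower per year

# Back-transformed fitted curve on the original RFFT scale
ages <- c(40, 60, 80)
curve_vals <- exp(b0 + b1 * ages)
curve_vals
```

Each additional year multiplies the fitted value by the same factor, so the fitted curve is not a straight line on the original scale.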
Kleiber’s law
Kleiber’s law refers to the relationship between metabolic rate and body mass.