Point estimation and sampling variability
A population distribution is a frequency distribution across all possible study units.
For simple random samples, the observed distribution of sample values resembles the population distribution:
The larger the sample, the closer the resemblance.
Population statistics are called parameters. These are fixed but unknown values.
Population mean | Population SD |
---|---|
5.067 | 1.126 |
Notation: population mean \(\mu\), population SD \(\sigma\).
Sample statistics provide point estimates of the corresponding population statistics.
Notation: sample mean \(\bar{x}\), sample SD \(s_x\), sample size \(n\).
Sample mean | Sample SD | Sample size |
---|---|---|
5.043 | 1.075 | 3179 |
So we might say: “mean total cholesterol in the study population is estimated to be 5.043 mmol/L.”
Different samples yield different estimates.
Sample means:
sample.1 | sample.2 |
---|---|
5.093 | 5.136 |
Figure: 20 random samples, with each sample mean indicated by a dashed line and the population distribution and mean overlaid in red.
We can actually measure this variability!
If we had means calculated from a much larger number of samples, we could make a frequency distribution for the values of the sample mean.
sample | 1 | 2 | \(\cdots\) | 10,000 |
---|---|---|---|---|
mean | 4.957 | 5.039 | \(\cdots\) | 5.24 |
We could then use the usual measures of center and spread to characterize the distribution of sample means.
Across 10,000 random samples of size 20, the average estimate was 5.07 and the variability of the estimates (their standard deviation) was 0.237.
What we are simulating is known as a sampling distribution: the frequency of values of a statistic across all possible random samples.
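The simulation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the study data are not reproduced here, so the population is a synthetic stand-in drawn from a normal distribution with the population mean and SD quoted earlier, and the function name `simulate_sample_means` is made up for this example. The results will therefore be close to, but not identical with, the figures reported above.

```python
import random
import statistics

# Assumption: a synthetic stand-in for the study population, modeled as a
# normal distribution with the population mean and SD quoted in the notes.
POP_MEAN, POP_SD = 5.067, 1.126

def simulate_sample_means(n_samples=10_000, sample_size=20, seed=1):
    """Return the mean of each of `n_samples` random samples of size `sample_size`."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    means = []
    for _ in range(n_samples):
        sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means

sample_means = simulate_sample_means()
center = statistics.fmean(sample_means)  # average estimate across samples
spread = statistics.stdev(sample_means)  # variability of estimates across samples
print(f"average estimate: {center:.3f}, variability: {spread:.3f}")
```

The list `sample_means` plays the role of the table of 10,000 means; a histogram of it would approximate the sampling distribution.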
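When data are from a random sample, statistical theory provides that the sample mean \(\bar{x}\) has a sampling distribution with mean \(\mu\) (the population mean) and standard deviation \(\sigma/\sqrt{n}\) (the population SD divided by the square root of the sample size),
regardless of the population distribution.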
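In other words, across all random samples of a fixed size, the sample means average out to the population mean, and their spread equals the population SD divided by the square root of the sample size.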
In practice we use an estimate of sampling variability known as a standard error: \[SE(\bar{x}) = \frac{s_x}{\sqrt{n}} \qquad \left(\frac{\text{sample SD}}{\sqrt{\text{sample size}}}\right)\]
For example:
\[SE(\bar{x}) = \frac{1.073}{\sqrt{20}} = 0.240\]
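The same arithmetic can be checked with a small helper. The function name `standard_error` is chosen here for illustration; the inputs are the sample SD and sample size from the example above.

```python
import math

def standard_error(sample_sd, sample_size):
    """Standard error of the sample mean: sample SD / sqrt(sample size)."""
    return sample_sd / math.sqrt(sample_size)

se = standard_error(1.073, 20)
print(f"{se:.3f}")  # prints 0.240
```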
It is common style to report the value of a point estimate with a standard error given parenthetically.
Statistics from full NHANES sample:
mean | sd | n |
---|---|---|
5.043 | 1.075 | 3179 |
The mean total cholesterol among the population is estimated to be 5.043 mmol/L (SE 0.019).
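This style of report communicates: the parameter of interest, the value of the point estimate, and the sampling variability of the estimate.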
There are two potential sources of variability in estimates: the variability of the population, estimated by \(s_x\), and the sample size \(n\).
For example, the estimates below are about equally precise:
\(SE(\bar{x})\) = 0.1265079
\(SE(\bar{x})\) = 0.1223712
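A brief sketch of why two quite different samples can give similarly precise estimates: the standard error depends on both the sample SD and the sample size, so a more variable sample can be offset by a larger one. The (SD, size) pairs below are hypothetical values chosen for illustration, not the ones behind the standard errors shown above.

```python
import math

def standard_error(sample_sd, sample_size):
    """Standard error of the sample mean: sample SD / sqrt(sample size)."""
    return sample_sd / math.sqrt(sample_size)

# Hypothetical (sample SD, sample size) pairs, for illustration only.
low_spread_small_n = standard_error(1.0, 100)   # less variable data, smaller sample
high_spread_large_n = standard_error(2.0, 400)  # more variable data, larger sample

print(low_spread_small_n, high_spread_large_n)
```

Doubling the SD while quadrupling the sample size leaves the standard error unchanged, because the sample size enters through its square root.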
An interval estimate is a range of plausible values for a population parameter.
The general form of an interval estimate is
\[\text{point estimate} \pm \text{margin of error}\]
where the “margin of error” reflects the sampling variability.
A common interval for the population mean is: \[\bar{x} \pm 2\times SE(\bar{x}) \qquad\text{where}\quad SE(\bar{x}) = \left(\frac{s_x}{\sqrt{n}}\right)\]
By hand: \[5.043 \pm 2\times 0.0191 = (5.005, 5.081)\]
Interpretation: the mean total cholesterol among the study population is estimated to be between 5.005 and 5.081 mmol/L.
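The hand calculation can be reproduced with a short function. The name `interval_estimate` and its `multiplier` parameter are illustrative choices; the inputs are the full-sample statistics given earlier.

```python
import math

def interval_estimate(sample_mean, sample_sd, sample_size, multiplier=2):
    """Return (lower, upper) for: point estimate +/- multiplier * SE."""
    se = sample_sd / math.sqrt(sample_size)  # standard error of the mean
    margin = multiplier * se                 # margin of error
    return sample_mean - margin, sample_mean + margin

lower, upper = interval_estimate(5.043, 1.075, 3179)
print(f"({lower:.3f}, {upper:.3f})")  # prints (5.005, 5.081)
```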
STAT218