mean | sd | n | se |
---|---|---|---|
5.043 | 1.075 | 3179 | 0.01906 |
Constructing interval estimates to attain a specified coverage rate
Under simple random sampling:
mean | sd | n | se |
---|---|---|---|
5.043 | 1.075 | 3179 | 0.01906 |
The mean total HDL cholesterol among the U.S. adult population is estimated to be 5.043 mmol/L (SE 0.0191).
A common interval estimate for the population mean is: \[\bar{x} \pm 2\times SE(\bar{x}) \qquad\text{where}\quad SE(\bar{x}) = \left(\frac{s_x}{\sqrt{n}}\right)\]
The mean total cholesterol among U.S. adults is estimated to be between 5.005 and 5.081 mmol/L.
Two related questions:
Consider the statistic:
\[ T = \frac{\bar{x} - \mu}{s_x/\sqrt{n}} \qquad\left(\frac{\text{estimation error}}{\text{standard error}}\right) \]
The sampling distribution of \(T\) is well-approximated by a \(t_{n - 1}\) model whenever either:
OR
The area under the density curve between any two values \((a, b)\) gives the proportion of random samples for which \(a < T < b\).
\[(\text{proportion of area between } a, b) = (\text{proportion of samples where } a < T < b)\]
The area under the density curve between any two values \((a, b)\) gives the proportion of random samples for which \(a < T < b\).
\[(\text{proportion of area between } a, b) = (\text{proportion of samples where } a < T < b)\]
The area under the density curve between any two values \((a, b)\) gives the proportion of random samples for which \(a < T < b\).
\[(\text{proportion of area between } a, b) = (\text{proportion of samples where } a < T < b)\]
The area under the density curve between any two values \((a, b)\) gives the proportion of random samples for which \(a < T < b\).
\[(\text{proportion of area between } a, b) = (\text{proportion of samples where } a < T < b)\]
The area under the density curve between any two values \((a, b)\) gives the proportion of random samples for which \(a < T < b\).
\[(\text{proportion of area between } a, b) = (\text{proportion of samples where } a < T < b)\]
The area under the density curve between any two values \((a, b)\) gives the proportion of random samples for which \(a < T < b\).
\[(\text{proportion of area between } a, b) = (\text{proportion of samples where } a < T < b)\]
So where did that 2 come from in the margin of error for our interval estimate?
\[ \bar{x} \pm \color{blue}{2}\times SE(\bar{x}) \]
Well:
\[ \begin{align*} 0.94 &= P(-\color{blue}{2} < T < \color{blue}{2}) \\ &= P\left(-\color{blue}{2} < \frac{\bar{x} - \mu}{s_x/\sqrt{n}} < \color{blue}{2}\right) \\ &= P(\underbrace{\bar{x} - \color{blue}{2}\times SE(\bar{x}) < \mu < \bar{x} + \color{blue}{2}\times SE(\bar{x})}_{\text{interval covers population mean}}) \end{align*} \]
For 94% of all random samples, the interval covers the population mean.
So the number 2 determines the proportion of samples for which the interval covers the mean, known as its coverage.
The sample size determines the exact shape of the \(t\) model through its ‘degrees of freedom’ \(n - 1\). This changes the areas slightly.
The exact coverage quickly converges to just over 95% as the sample size increases.
n | coverage |
---|---|
4 | 0.8607 |
8 | 0.9144 |
16 | 0.9361 |
32 | 0.9457 |
64 | 0.9502 |
128 | 0.9524 |
256 | 0.9534 |
Consider a slightly more general expression for an interval for the mean:
\[ \bar{x} \pm c\times SE(\bar{x}) \]
The number \(c\) is called a critical value. It determines the coverage.
The so-called “empirical rule” is that:
\[ P(\color{#FF6459}{-2 < T < 2}) = 1 - 2\times P(\color{blue}{T > 2}) \]
Look at how the areas add up so that: \[ P(\color{blue}{T > 2}) = 0.03 \] Moreover: \[ P(T < 2) = 1 - 0.03 = 0.97 \]
So the critical value 2 is actually the 97th percentile of the sampling distribution of \(T\).
To engineer an interval with a specific coverage, use the \(p\)th quantile where:
\[p = \left[1 - \left(\frac{1 - \text{coverage}}{2}\right)\right]\] In R:
# coverage 95% using t quantile
coverage <- 0.95
q.val <- 1 - (1 - coverage)/2
crit.val <- qt(q.val, df = 20 - 1)
crit.val
[1] 2.093024
The effect of increasing/decreasing coverage on the quantile is:
Precision refers to how wide or narrow the interval is.
Precision depends on every component of the margin of error:
By contrast, coverage depends only on the critical value used.
Interval estimates constructed to achieve a specified coverage are called “confidence intervals”; the coverage is interpreted and reported as a “confidence level”.
# ingredients
cholesterol.mean <- mean(cholesterol)
cholesterol.sd <- sd(cholesterol)
cholesterol.n <- length(cholesterol)
cholesterol.se <- cholesterol.sd/sqrt(cholesterol.n)
crit.val <- qt(1 - (1 - 0.95)/2, df = cholesterol.n - 1)
# interval
cholesterol.mean + c(-1, 1)*crit.val*cholesterol.se
[1] 5.005566 5.080310
With 95% confidence, the mean total cholesterol among U.S. adults is estimated to be between 5.0056 and 5.0803 mmol/L.
The general formula for a confidence interval for the population mean is
\[ \bar{x} \pm c\times SE(\bar{x}) \]
where \(c\) is a critical value, obtained as a quantile of the \(t_{n - 1}\) model and chosen to ensure a specific coverage.
The “common” interval estimate for the mean is actually an approximate 95% confidence interval:
\[ \bar{x} \pm 2 \times SE(\bar{x}) \]
captures the population mean \(\mu\) for roughly 95% of random samples
replacing 2 with a \(t_{n - 1}\) quantile allows the analyst to adjust coverage
the \(t_{n - 1}\) model is an approximation for the sampling distribution of \(\frac{\bar{x} - \mu}{SE(\bar{x})}\)
Interval interpretation:
With [XX]% confidence, the mean [population parameter] is estimated to be between [lower bound] and [upper bound] [units].
Artificially simulating a large number of intervals provides an empirical approximation of coverage.
This is also a handy way to remember the proper interpretation:
If I made a lot of intervals from independent samples, 95% of them would ‘get it right’.
STAT218