Foundations for statistical inference

Point estimation and sampling variability

Today’s agenda

  1. Survey: attitudes about statistics
  2. [lecture] point estimation and sampling variability
  3. [lab] calculating point and interval estimates
  4. [activity] estimating the class’s mean arm span

Point estimation

Population distributions

A population distribution is a frequency distribution across all possible study units.

For simple random samples, the observed distribution of sample values resembles the population distribution:

The larger the sample, the closer the resemblance.
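A minimal sketch of this idea in R. The `population` vector is a simulated stand-in (its size and gamma shape are assumptions, not the lecture's data), and `max_gap` is a hypothetical helper that measures how far a sample's cumulative frequency curve sits from the population's.

```r
set.seed(1)  # for reproducibility

# Hypothetical stand-in population (an assumption, not the lecture's data):
# 100,000 right-skewed values with mean about 5 and SD about 1.1
population <- rgamma(1e5, shape = 20, rate = 4)

# Largest gap between the sample and population cumulative frequency curves,
# evaluated on a grid of 200 points (smaller means a closer resemblance)
max_gap <- function(n) {
  x <- sample(population, size = n)  # simple random sample, without replacement
  grid <- seq(min(population), max(population), length.out = 200)
  max(abs(ecdf(x)(grid) - ecdf(population)(grid)))
}

# Compare increasingly large samples
gaps <- sapply(c(20, 200, 2000, 20000), max_gap)
round(gaps, 3)
```

The gap should generally shrink as the sample size grows, although any single small sample can land closer or farther by chance.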

Foundations for inference

Population statistics are called parameters. These are fixed but unknown values.

Population mean   Population SD
5.067             1.126

Notation:

  • population mean \(\mu\)
  • population standard deviation \(\sigma\)

Foundations for inference

Sample statistics provide point estimates of the corresponding population parameters.

Notation:

  • sample mean \(\bar{x}\)
  • sample standard deviation \(s_x\)

Sample mean   Sample SD   Sample size
5.043         1.075       3179

Foundations for inference

Population mean   Population SD
5.067             1.126

Sample mean   Sample SD   Sample size
5.043         1.075       3179

So we might say: “mean total cholesterol in the study population is estimated to be 5.043 mmol/L.”

A difficulty

Different samples yield different estimates.

Sample means:

sample.1   sample.2
5.093      5.136

  • estimates are close but not identical
  • the population mean can’t be both 5.093 and 5.136
  • probably neither estimate is exactly correct
  • but both estimates should have similar errors if the study design is identical between the two samples

Simulating sampling variability

These are 20 random samples with the sample mean indicated by the dashed line and the population distribution and mean overlaid in red.

  • sample size \(n = 20\)
  • frequency distributions differ a lot
  • sample means differ some

We can actually measure this variability!

Simulating sampling variability

If we had means calculated from a much larger number of samples, we could make a frequency distribution for the values of the sample mean.

sample   1       2       \(\cdots\)   10,000
mean     4.957   5.039   \(\cdots\)   5.24

We could then use the usual measures of center and spread to characterize the distribution of sample means.

  • mean of \(\bar{x}\): 5.068425
  • standard deviation of \(\bar{x}\): 0.2369404

Across 10,000 random samples of size 20, the average estimate was 5.07 and the standard deviation of the estimates was 0.237.
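A simulation along these lines could be sketched as follows. The `population` vector is again a simulated stand-in (an assumption), so the numbers it produces will not match the values reported above.

```r
set.seed(3)  # for reproducibility

# Hypothetical stand-in population (an assumption, not the lecture's data)
population <- rgamma(1e5, shape = 20, rate = 4)

# Draw 10,000 simple random samples of size 20 and record each sample mean
sample_means <- replicate(10000, mean(sample(population, size = 20)))

# Usual measures of center and spread for the distribution of sample means
mean(sample_means)
sd(sample_means)
```

Passing `sample_means` to `hist()` would display the frequency distribution of the sample mean.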

Sampling distributions

What we are simulating is known as a sampling distribution: the frequency of values of a statistic across all possible random samples.

When data are from a random sample, statistical theory provides that the sample mean \(\bar{x}\) has a sampling distribution with

  • mean \(\color{red}{\mu}\) (population mean)
  • standard deviation \(\color{red}{\frac{\sigma}{\sqrt{n}}} \; \left(\frac{\text{population SD}}{\sqrt{\text{sample size}}}\right)\)

regardless of the population distribution.

In other words, across all random samples of a fixed size…

  • [accuracy] on average, the sample mean equals the population mean
  • [precision] the typical size of the estimation error is \(\frac{\sigma}{\sqrt{n}}\)
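A short derivation sketch of these two facts, assuming the observations \(x_1, \dots, x_n\) are independent draws from a population with mean \(\mu\) and standard deviation \(\sigma\) (a reasonable approximation for a simple random sample from a large population):

\[E(\bar{x}) = \frac{1}{n}\sum_{i=1}^{n} E(x_i) = \frac{1}{n}\cdot n\mu = \mu\]

\[\text{Var}(\bar{x}) = \frac{1}{n^2}\sum_{i=1}^{n} \text{Var}(x_i) = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n} \quad\Longrightarrow\quad SD(\bar{x}) = \frac{\sigma}{\sqrt{n}}\]

Neither step uses the shape of the population distribution, only its mean and standard deviation.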

Measuring sampling variability

In practice we use an estimate of sampling variability known as a standard error: \[SE(\bar{x}) = \frac{s_x}{\sqrt{n}} \qquad \left(\frac{\text{sample SD}}{\sqrt{\text{sample size}}}\right)\]

For example:

\[SE(\bar{x}) = \frac{1.073}{\sqrt{20}} = 0.240\]
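The same arithmetic can be checked in R. The helper `std_error` is a hypothetical name introduced here for illustration; it applies the formula to any numeric vector of sample values.

```r
# Values from the example above
s_x <- 1.073  # sample SD
n   <- 20     # sample size

se <- s_x / sqrt(n)  # standard error of the sample mean
round(se, 3)

# Hypothetical helper: standard error of the mean for a numeric vector x
std_error <- function(x) sd(x) / sqrt(length(x))
```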

Reporting point estimates

It is common style to report the value of a point estimate with a standard error given parenthetically.

Statistics from full NHANES sample:

mean    sd      n
5.043   1.075   3179

The mean total cholesterol among the population is estimated to be 5.043 mmol/L (SE 0.019).

This style of report communicates:

  • parameter of interest
  • value of point estimate
  • error/variability of point estimate
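As a rough sketch, the reported value can be assembled in R from the summary statistics above. The variable names are illustrative choices.

```r
# Summary statistics from the full NHANES sample shown above
avg <- 5.043  # sample mean (mmol/L)
s   <- 1.075  # sample SD
n   <- 3179   # sample size

se <- s / sqrt(n)  # standard error of the sample mean

# Point estimate with the standard error given parenthetically
report <- sprintf("%.3f mmol/L (SE %.3f)", avg, se)
report
```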

Sources of variability

There are two potential sources of variability in estimates:

  1. population variability (\(\sigma\))
  2. sampling variability (determined by \(n\))

For example, the two estimates below are about equally precise:

\(SE(\bar{x})\) = 0.1265079

\(SE(\bar{x})\) = 0.1223712
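To illustrate how the two sources combine, here is a small sketch with made-up values (these are hypothetical, not the settings behind the two standard errors above). A population that is twice as variable needs four times the sample size to give the same precision. The function name `se_mean` is also an illustrative choice.

```r
# Theoretical standard deviation of the sample mean
se_mean <- function(sigma, n) sigma / sqrt(n)

# Hypothetical settings: less variable population, smaller sample
se_low_var  <- se_mean(sigma = 1, n = 64)

# Hypothetical settings: more variable population, larger sample
se_high_var <- se_mean(sigma = 2, n = 256)

c(se_low_var, se_high_var)  # both equal 0.125
```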

Interval estimation

Interval estimation

An interval estimate is a range of plausible values for a population parameter.

The general form of an interval estimate is

\[\text{point estimate} \pm \text{margin of error}\]

where the “margin of error” reflects the sampling variability.

  • more sampling variability ⟹ larger margin of error
  • less sampling variability ⟹ smaller margin of error

An interval for the mean

A common interval for the population mean is: \[\bar{x} \pm 2\times SE(\bar{x}) \qquad\text{where}\quad SE(\bar{x}) = \left(\frac{s_x}{\sqrt{n}}\right)\]

By hand: \[5.043 \pm 2\times 0.0191 = (5.005, 5.081)\]

In R:

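# totchol is assumed to be a numeric vector of total cholesterol values (mmol/L)
avg.totchol <- mean(totchol)                      # point estimate
se.totchol <- sd(totchol)/sqrt(length(totchol))   # standard error
avg.totchol + c(-2, 2)*se.totchol                 # estimate plus or minus 2 standard errors
[1] 5.004817 5.081059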

Interpretation: the mean total cholesterol among the study population is estimated to be between 5.005 and 5.081 mmol/L.

Questions for next time

  1. In what sense are the values in the interval estimate “plausible”?
  2. Why use \(2\times SE(\bar{x})\) for the margin of error?
  3. What do you expect would happen if the margin of error were \(3\times SE(\bar{x})\)?