STAT218 - Foundations for statistical inference

Population distributions

A population distribution is a frequency distribution across all possible study units.

For simple random samples, the observed distribution of sample values resembles the population distribution:

The larger the sample, the closer the resemblance.

Foundations for inference

Population statistics are called parameters. These are fixed but unknown values.

Population mean	Population SD
5.067	1.126

Notation:

$\mu$: population mean
$\sigma$: population standard deviation

Foundations for inference

Sample statistics provide point estimates of the corresponding population statistics.

Notation:

$\bar{x}$: sample mean
$s$: sample standard deviation

Sample mean	Sample SD	Sample size
5.043	1.075	3179

Foundations for inference

Population mean	Population SD
5.067	1.126

Sample mean	Sample SD	Sample size
5.043	1.075	3179

So we might say: “mean total cholesterol in the study population is estimated to be 5.043 mmol/L”

A difficulty

Different samples yield different estimates.

Sample means:

sample.1	sample.2
5.093	5.136

estimates are close but not identical
the population mean can’t be both 5.093 and 5.136
probably neither estimate is exactly correct
but both estimates should have similar errors if the study design is identical between the two samples
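The point can be illustrated with a small simulation. This is only a sketch: the population below is a hypothetical stand-in (normal values generated with the population mean and SD quoted earlier), not the actual study data, and the sample size of 20 and the seed are arbitrary choices.

```python
import random
import statistics

# Hypothetical stand-in population (assumption: NOT the real study data):
# 10,000 values generated from a normal distribution with mean 5.067, SD 1.126.
rng = random.Random(218)
population = [rng.gauss(5.067, 1.126) for _ in range(10_000)]
pop_mean = statistics.fmean(population)

# Two simple random samples drawn the same way from the same population.
sample_1 = rng.sample(population, 20)
sample_2 = rng.sample(population, 20)
mean_1 = statistics.fmean(sample_1)
mean_2 = statistics.fmean(sample_2)

print(f"population mean: {pop_mean:.3f}")
print(f"sample 1 mean:   {mean_1:.3f} (error {mean_1 - pop_mean:+.3f})")
print(f"sample 2 mean:   {mean_2:.3f} (error {mean_2 - pop_mean:+.3f})")
```

Both sample means miss the population mean by some amount, and the two misses differ, but because the samples were drawn by the same design the errors are of comparable size.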

Simulating sampling variability

These are 20 random samples with the sample mean indicated by the dashed line and the population distribution and mean overlaid in red.

sample size $n = 20$
frequency distributions differ a lot
sample means differ some

We can actually measure this variability!

Simulating sampling variability

If we had means calculated from a much larger number of samples, we could make a frequency distribution for the values of the sample mean.

sample	1	2	…	10,000
mean	4.957	5.039	…	5.24

We could then use the usual measures of center and spread to characterize the distribution of sample means.

mean of $\bar{x}$: 5.068425
standard deviation of $\bar{x}$: 0.2369404

Across 10,000 random samples of size 20, the average estimate was 5.07 and the variability of estimates was 0.237.
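A simulation of this kind might look like the sketch below. It assumes a hypothetical stand-in population (normal values with the mean and SD quoted earlier), so the numbers it prints will be close to, but not the same as, those shown above.

```python
import random
import statistics

# Hypothetical stand-in population (assumption: NOT the real study data).
rng = random.Random(218)
population = [rng.gauss(5.067, 1.126) for _ in range(10_000)]

n = 20              # size of each sample
n_samples = 10_000  # number of repeated samples

# Draw many simple random samples and record the mean of each one.
sample_means = [
    statistics.fmean(rng.sample(population, n)) for _ in range(n_samples)
]

# Usual measures of center and spread, applied to the sample means.
mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"SD of sample means:   {sd_of_means:.3f}")
```

The list `sample_means` plays the role of the table above: one entry per simulated sample.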

Sampling distributions

What we are simulating is known as a sampling distribution: the frequency of values of a statistic across all possible random samples.

When data are from a random sample, statistical theory provides that the sample mean has a sampling distribution with

mean $\mu$ (population mean)
standard deviation $\frac{\sigma}{\sqrt{n}}$

regardless of the population distribution.

In other words, across all random samples of a fixed size…

[accuracy] on average, the sample mean equals the population mean
[precision] on average, the estimation error is $\frac{\sigma}{\sqrt{n}}$
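To see that these two properties do not depend on the shape of the population, the sketch below draws samples from a strongly right-skewed distribution. The exponential distribution with rate 1 (mean 1, SD 1), the sample size of 25, and the number of repetitions are all illustrative assumptions.

```python
import math
import random
import statistics

rng = random.Random(218)

# Assumed population: exponential with rate 1, so mu = 1 and sigma = 1.
mu, sigma = 1.0, 1.0
n = 25             # sample size
n_samples = 5_000  # number of repeated samples

# Mean of each simulated random sample of size n.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(n_samples)
]

simulated_mean = statistics.fmean(sample_means)
simulated_sd = statistics.stdev(sample_means)
theoretical_sd = sigma / math.sqrt(n)  # sigma / sqrt(n) = 0.2

print(f"mean of sample means: {simulated_mean:.3f} (theory: {mu})")
print(f"SD of sample means:   {simulated_sd:.3f} (theory: {theoretical_sd})")
```

Even though individual observations are far from symmetric, the simulated center and spread of the sample means should land close to the theoretical values.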

Measuring sampling variability

In practice we use an estimate of sampling variability known as a standard error:

$SE(\bar{x}) = \frac{s}{\sqrt{n}}$

For example:
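As a minimal sketch, the calculation can be carried out with the full-sample statistics reported in these notes (sample SD 1.075, sample size 3179). The variable names are illustrative.

```python
import math

s = 1.075  # sample standard deviation (from the full-sample table)
n = 3179   # sample size (from the full-sample table)

# Standard error of the sample mean: sample SD divided by sqrt(n).
se = s / math.sqrt(n)

print(f"SE = {se:.3f}")  # prints "SE = 0.019"
```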

Reporting point estimates

It is common style to report the value of a point estimate with a standard error given parenthetically.

Statistics from full NHANES sample:

mean	sd	n
5.043	1.075	3179

The mean total cholesterol among the population is estimated to be 5.043 mmol/L (SE 0.019)

This style of report communicates:

parameter of interest
value of point estimate
error/variability of point estimate
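One way to produce a report in this style directly from data is sketched below. The helper `report_mean` and the three toy values are hypothetical and are used only to show the format; they are not study data.

```python
import math
import statistics


def report_mean(values, label, units):
    """Format a point estimate of a mean with its standard error in parentheses."""
    n = len(values)
    xbar = statistics.fmean(values)  # point estimate of the population mean
    s = statistics.stdev(values)     # sample SD (n - 1 denominator)
    se = s / math.sqrt(n)            # standard error of the sample mean
    return f"The mean {label} is estimated to be {xbar:.3f} {units} (SE {se:.3f})"


# Toy values for illustration only (assumption: not real measurements).
toy_values = [4.0, 5.0, 6.0]
report = report_mean(toy_values, "total cholesterol among the population", "mmol/L")
print(report)
```

The returned sentence names the parameter of interest, gives the point estimate with units, and states the standard error.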

Sources of variability

There are two potential sources of variability in estimates:

population variability ($\sigma$)
sampling variability (determined by $n$)

For example, the estimates below are about equally precise:

$SE(\bar{x}) = 0.1265079$

$SE(\bar{x}) = 0.1223712$
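The trade-off between the two sources can be sketched numerically. The values of the population SD and sample size below are hypothetical, chosen so that a more variable population sampled more heavily gives the same precision as a less variable population sampled lightly.

```python
import math


def sd_of_sample_mean(sigma, n):
    """Standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


# Hypothetical scenarios (assumptions, not taken from the data):
low_variability = sd_of_sample_mean(sigma=1.0, n=64)    # 1 / 8
high_variability = sd_of_sample_mean(sigma=2.0, n=256)  # 2 / 16

print(low_variability, high_variability)  # prints "0.125 0.125"
```

Doubling the population SD requires four times the sample size to keep the same precision, because the sample size enters through its square root.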
