Life is full of uncertainty, and this can make a lot of questions hard to answer, because similar situations do not always result in the same outcome.
Statistical thinking: uncertainty (random variability) is measurable.
What statistics can offer:
The overarching goal of 218 is to introduce you to statistics in a hands-on way that is relevant to your major.
We will learn about statistical thinking and data analysis by studying classical methods (mostly developed 1900-1940) and their application to case studies from life sciences.
Week | Topic(s) |
---|---|
1 | Data semantics, descriptive statistics |
2-6 | Statistical inference for one or more population means |
7-9 | Statistical inference for population proportions from categorical data |
10 | Simple linear regression |
Most class meetings will be a mixture of lecture and lab activities.
You will have three categories of graded assignments:
Select policies:
Assessments are averaged by learning outcome, not by assignment.
Graded questions are matched to specific learning outcomes and you receive a per-outcome score, for example:
This means, roughly, you answered 58% of study design questions (L1) correctly.
(This is after accounting for assessment weighting and corrections.)
You must have at least 6 outcome scores over 50% to pass.
Your course grade is then determined by the number of outcome scores exceeding 80%:
Grade | Number exceeding 80% |
---|---|
A | 9-10 |
B | 6-8 |
C | 3-5 |
D | 0-2 |
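As a rough illustration, the grading rule above can be sketched in Python. The function name `course_grade` and the example scores are hypothetical. The sketch assumes ten outcome scores given as percentages, reads "over" and "exceeding" as strict inequalities, and returns no letter grade when the passing requirement is not met, since that case is not spelled out above.

```python
def course_grade(outcome_scores):
    """Return (passed, letter) for a list of per-outcome scores in percent."""
    n_over_50 = sum(score > 50 for score in outcome_scores)
    n_over_80 = sum(score > 80 for score in outcome_scores)

    # Passing requires at least 6 outcome scores over 50%.
    if n_over_50 < 6:
        return False, None  # assumption: no letter grade is assigned here

    # The letter grade depends only on the number of scores exceeding 80%.
    if n_over_80 >= 9:
        letter = "A"
    elif n_over_80 >= 6:
        letter = "B"
    elif n_over_80 >= 3:
        letter = "C"
    else:
        letter = "D"
    return True, letter


# Hypothetical student: all ten scores are over 50%, seven exceed 80%.
example_scores = [58, 85, 91, 77, 83, 88, 95, 62, 81, 90]
passed, letter = course_grade(example_scores)
print(passed, letter)
```

Under this reading, raising a 77 to an 85 would not change the hypothetical student's grade, but raising two more scores above 80% would.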
A study is an effort to collect data in order to answer one or more research questions.
studies must be well-matched to research questions to provide good answers
how data are obtained is just as important as how the resulting data are analyzed
no analysis, no matter how sophisticated, will rescue a poorly conceived study
A study unit is the smallest object or entity that is measured in a study; also called experimental unit or observational unit.
Study units should be chosen so as to represent a larger collection or “population”.
A study population is a collection of all study units of interest.
A sample is a subcollection:
The gold standard is the simple random sample: all inclusion probabilities are equal.
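A simple random sample can be sketched in Python as a draw without replacement in which every subset of the chosen size is equally likely, so every unit has the same inclusion probability (sample size divided by population size). The population of 100 labelled units, the helper name `simple_random_sample`, and the seed are all hypothetical.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct units so that every subset of size n is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, n)


# Hypothetical study population of 100 units.
population = [f"unit_{i}" for i in range(1, 101)]
sample = simple_random_sample(population, 10, seed=218)

# Each unit's inclusion probability is n / N = 10 / 100.
inclusion_probability = len(sample) / len(population)
print(sample, inclusion_probability)
```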
Observational studies collect data from an existing situation without intervention.
Aim is to detect associations and patterns
Can’t be used to infer causal relationships owing to possible unmeasured confounding
Experiments collect data from a situation in which one or more interventions have been introduced by the investigator.
Learning early about peanut allergy (LEAP) study:
640 infants in UK with eczema or egg allergy but no peanut allergy enrolled
each infant randomly assigned to either the peanut consumption group or the peanut avoidance group
peanut consumption: fed 6g peanut protein daily until 5 years old
peanut avoidance: no peanut consumption until 5 years old
at 5 years old, oral food challenge (OFC) allergy test administered
13.7% of the avoidance group developed allergies, compared with 1.9% of the consumption group
Study characteristics
Study type: experiment
Study population: UK infants with eczema or egg allergy but no peanut allergy
Sample: 640 infants from population
Treatments: peanut consumption; peanut avoidance
Treatment allocation: completely randomized
Study outcome: development of peanut allergy by 5 years of age
Study results
Moderated peanut consumption causes a reduction in the likelihood of developing an allergy among infants with prior risk (eczema or egg allergies).
Randomization eliminates confounding by ensuring that study interventions are independent of all extraneous conditions.
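For example, imagine an observational version of the LEAP study in which allergy rates are compared between children who consumed peanuts as infants and those who didn’t.
In that version, families who chose to feed their infants peanuts could differ systematically from those who didn’t (for instance, in family history of food allergy), so a difference in allergy rates might reflect those differences rather than peanut consumption itself.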
Randomizing consumption regimens eliminates this possibility.
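One way to picture a completely randomized allocation is the Python sketch below. It is not the LEAP investigators' actual procedure: the infant IDs, the helper name `randomize`, the seed, and the choice to balance group sizes are all assumptions made for illustration.

```python
import random
from collections import Counter


def randomize(unit_ids, treatments, seed=None):
    """Assign each unit to a treatment using a random shuffle."""
    rng = random.Random(seed)
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    # Deal the shuffled units to treatments in turn, so group sizes
    # differ by at most one (an assumption, not a LEAP design detail).
    return {
        unit: treatments[i % len(treatments)]
        for i, unit in enumerate(shuffled)
    }


# Hypothetical IDs for 640 enrolled infants.
infant_ids = [f"infant_{i:03d}" for i in range(1, 641)]
allocation = randomize(
    infant_ids, ["Peanut Consumption", "Peanut Avoidance"], seed=218
)
group_sizes = Counter(allocation.values())
print(group_sizes)
```

Because the assignment depends only on the shuffle, it cannot be related to anything about the infants or their families.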
Data are a set of measurements.
A variable is any measured attribute of study units.
An observation is a measurement of one or more variables taken on one particular study unit.
It is usually expedient to arrange data values in a table in which each row is an observation and each column is a variable:
A table showing the observations and variables for the LEAP study would look like this:
participant.ID | treatment.group | ofc.test.result |
---|---|---|
LEAP_100522 | Peanut Consumption | PASS OFC |
LEAP_103358 | Peanut Consumption | PASS OFC |
LEAP_105069 | Peanut Avoidance | PASS OFC |
LEAP_105328 | Peanut Consumption | PASS OFC |
The table you saw in the reading was a summary of the data (not the data itself):
treatment.group | FAIL OFC | PASS OFC |
---|---|---|
Peanut Avoidance | 36 | 227 |
Peanut Consumption | 5 | 262 |
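The difference between data and a summary of data can be sketched in Python. The first part tabulates only the four observations shown earlier, so its counts are not the study totals; the second part takes the summary counts from the table above and computes the share of each group that failed the OFC. The variable names are illustrative.

```python
from collections import Counter

# Data: one row per observation (only the four rows shown above).
observations = [
    ("LEAP_100522", "Peanut Consumption", "PASS OFC"),
    ("LEAP_103358", "Peanut Consumption", "PASS OFC"),
    ("LEAP_105069", "Peanut Avoidance", "PASS OFC"),
    ("LEAP_105328", "Peanut Consumption", "PASS OFC"),
]

# Summary: count the observations in each (group, result) cell.
cell_counts = Counter((group, result) for _, group, result in observations)

# Summary counts copied from the table above.
summary = {
    "Peanut Avoidance": {"FAIL OFC": 36, "PASS OFC": 227},
    "Peanut Consumption": {"FAIL OFC": 5, "PASS OFC": 262},
}

# Proportion of each group that failed the oral food challenge.
fail_rates = {
    group: counts["FAIL OFC"] / sum(counts.values())
    for group, counts in summary.items()
}
print(cell_counts)
print(fail_rates)  # roughly 0.137 (avoidance) and 0.019 (consumption)
```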
Variables are classified according to their values. Values can be one of two different types:
For example:
sex can be male or female, so it is categorical
age (in years) can be any positive integer, so it is numeric
Further distinctions are made based on the type of number or type of category used to measure an attribute. Can you match the subtypes to the variables in the table below?
age | hispanic | grade | weight |
---|---|---|---|
15 | not | 10 | 78.02 |
18 | hispanic | 12 | 78.47 |
17 | not | 11 | 95.26 |
18 | not | 12 | 95.26 |
Variable type (or subtype) is not an inherent quality — attributes can often be measured in many different ways.
For instance, age might be measured as a discrete, continuous, or ordinal variable, depending on the situation:
Age (years) | Age (minutes) | Age (brackets) |
---|---|---|
12 | 6307518.45 | 10-18 |
8 | 4209187.18 | 5-10 |
21 | 11258103.08 | 18-30 |
Numeric variables can always be discretized into categorical variables.
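A minimal Python sketch of discretization, using the age brackets from the table above. The brackets share their endpoints (10 and 18), so the sketch assumes each bracket includes its upper endpoint but not its lower one; that boundary rule and the function name `age_bracket` are assumptions.

```python
from bisect import bisect_left

BREAKS = [5, 10, 18, 30]              # bracket endpoints, in years
LABELS = ["5-10", "10-18", "18-30"]   # one label per pair of endpoints


def age_bracket(age_years):
    """Map a numeric age to an ordinal bracket, e.g. 12 -> '10-18'."""
    if not BREAKS[0] < age_years <= BREAKS[-1]:
        raise ValueError("age falls outside the brackets defined here")
    # bisect_left finds the first endpoint >= age, which closes the bracket.
    return LABELS[bisect_left(BREAKS, age_years) - 1]


ages = [12, 8, 21]
brackets = [age_bracket(age) for age in ages]
print(brackets)
```

Going the other way is not possible: a bracket such as 10-18 does not determine the original age.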
Classify each variable as nominal, ordinal, discrete, or continuous:
ndrm.ch | genotype | sex | age | race | bmi |
---|---|---|---|---|---|
33.3 | CT | Female | 19 | Caucasian | 21.01 |
71.4 | CT | Female | 18 | Other | 23.18 |
37.5 | CC | Female | 21 | Caucasian | 28.92 |
50 | CC | Female | 28 | Asian | 21.16 |
Data are from an observational study investigating demographic, physiological, and genetic characteristics associated with muscle strength.
ndrm.ch is change in strength in nondominant arm after resistance training
genotype indicates genotype at a particular location within the ACTN3 gene
A statistic is, mathematically, a function of the values of two or more observations.
For numeric variables, the most common summary statistic is the average value:
\[\text{average} = \frac{\text{sum of values}}{\text{# observations}}\]
For example, the average percent change in nondominant arm strength was 53.291%.
For categorical variables, the most common summary statistic is a proportion:
\[\text{proportion}_i = \frac{\text{# observations in category } i}{\text{# observations}}\]
For example:
CC | CT | TT |
---|---|---|
0.2908 | 0.4387 | 0.2706 |
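Both formulas translate directly into a few lines of Python. The sketch below uses only the four rows displayed earlier, so its results differ from the full-sample values reported here; the function names `average` and `proportions` are illustrative.

```python
from collections import Counter


def average(values):
    """Sum of values divided by the number of observations."""
    return sum(values) / len(values)


def proportions(categories):
    """Share of observations falling in each category."""
    counts = Counter(categories)
    n = len(categories)
    return {category: count / n for category, count in counts.items()}


# The four observations shown in the table earlier (not the full dataset).
ndrm_ch = [33.3, 71.4, 37.5, 50.0]
genotype = ["CT", "CT", "CC", "CC"]

avg_change = average(ndrm_ch)
genotype_props = proportions(genotype)
print(avg_change, genotype_props)
```

The proportions always sum to one, because every observation falls in exactly one category.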
Typically, a set of observations is written as:
\[x_1, x_2, \dots, x_n\]
The sum of the observations is written \(\sum_i x_i\), where the symbol \(\sum\) stands for ‘summation’. This is useful for writing the formula for computing an average:
\[\bar{x} = \color{blue}{\frac{1}{n}}\color{red}{\sum_{i=1}^n x_i} \qquad \left(\text{average} = \color{blue}{\frac{1}{\text{# observations}}} \times \color{red}{\text{sum of values}}\right)\]
Sometimes, a few clever summary statistics can be used to answer a research question.
How much does the average change in arm strength differ by genotype, if at all?
Computing per-genotype group averages provides an answer:
genotype | avg.change | n.obs | prop.obs |
---|---|---|---|
TT | 58.08 | 161 | 0.2706 |
CT | 53.25 | 261 | 0.4387 |
CC | 48.89 | 173 | 0.2908 |
Number of observations and proportions are included because they provide information about genotype frequencies in the sample.
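Per-group summaries like these can be sketched in Python. The first part groups the four rows displayed earlier (not the full dataset) by genotype; the second part checks that the group averages and counts in the table above are consistent with the overall average reported earlier. The helper name `group_summary` is hypothetical.

```python
from collections import defaultdict


def group_summary(groups, values):
    """Average, count, and share of observations for each group."""
    by_group = defaultdict(list)
    for group, value in zip(groups, values):
        by_group[group].append(value)
    n_total = len(values)
    return {
        group: {
            "avg.change": sum(vals) / len(vals),
            "n.obs": len(vals),
            "prop.obs": len(vals) / n_total,
        }
        for group, vals in by_group.items()
    }


# Four displayed rows only.
genotype = ["CT", "CT", "CC", "CC"]
ndrm_ch = [33.3, 71.4, 37.5, 50.0]
small_summary = group_summary(genotype, ndrm_ch)

# Consistency check using the table above: (group average, group size).
table = {"TT": (58.08, 161), "CT": (53.25, 261), "CC": (48.89, 173)}
n_obs = sum(n for _, n in table.values())
overall_avg = sum(avg * n for avg, n in table.values()) / n_obs
print(small_summary)
print(n_obs, round(overall_avg, 2))  # the overall average is a weighted average
```

The overall average weights each group average by its group size, which is one reason the counts are worth reporting alongside the averages.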
Data summaries indicated that the average change was highest (58.08% increase) among individuals with genotype TT.
This is a descriptive finding: it’s a fact about individuals in the study.
Can we conclude that the same is true of all individuals, including those not in the study?
Statistical inference refers to drawing conclusions about a population from a sample while accounting for uncertainty, and is the focus of this course.
The primary way data are stored in R is in a data frame, which is a list of one or more variables arranged in a two-dimensional array.
By convention:
This lab will show you how to load, inspect, and use data frames. You will:
STAT218