STAT218
This lab is intended to introduce you to the basics in R that you will need for this class. Most of your statistical analyses throughout the quarter will consist of just a few simple steps:
We will illustrate this process so that you can get used to the mechanics and familiarize yourself with how different data types appear in R. You will also learn how to perform simple calculations of common summary statistics: proportions and averages.
First let’s cover a few logistics of how to use this document to complete the lab activity.
Throughout this document, you will notice that there are “code cells” that look like this:
# say hello
print("hello!")
[1] "hello!"
Any given line of code is simply an instruction for R to perform a certain task. You can “execute” or “run” the code that appears in a cell in one of two ways:
Try it with the example above. You should see the result appear directly below the code cell.
The structure of our labs in STAT218 will be simple. I’ve provided a copy of this document and the necessary data files on Posit Cloud. The text explains statistical and data analytic tasks and the code required to perform them. Code cells provide examples. “Your Turn” sections prompt you to replicate the examples in a slightly different setting. Your job is to read the activity and as you go:
You will need to run/complete cells in order — often, later tasks depend on the results of earlier tasks. So, don’t skip ahead, even if you feel confident in what you’re doing.
Remember that your goal is to understand the statistical concepts and learn to apply them to real data! I’ve done my best to write these activities with a minimum of code complexity so that you can focus on the statistical content. Try not to let the code become a distraction. Remember, it’s just a set of instructions, no different from clicking a button in an app.
You will often see activities begin with a cell that looks like this:
This command loads a software “package”: a bundle of functions, datasets, and other objects that are imported into R for use in your working environment. If you see this kind of cell, you will need to run it before proceeding.
Packages do need to be installed before they can be loaded. One of the nice things about using Posit Cloud is that I can manage all of these installs for you. However, if you ever wish to install and use a package that’s not available (or if you use R on your own machine), you can install a package using the command install.packages("<PACKAGE NAME>") after replacing <PACKAGE NAME> with the actual name of the package (but keeping the quotation marks!).
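If you do end up managing packages yourself, one common pattern is to install a package only when it is missing. Here is a minimal sketch of that idea. The helper name install_if_missing is my own invention (it is not part of R or of this lab), and the example uses the stats package, which ships with every R installation, so nothing is actually downloaded.

```r
# Hypothetical helper (not part of the lab): install a package only if
# it is not already available, then report whether it can be loaded.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" comes with R, so this call does not trigger an install.
install_if_missing("stats")

# Once installed, a package is loaded with library().
library(stats)
```

On Posit Cloud you shouldn’t need this, since the packages for our labs are already installed.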
There are several ways to load data in R. The strategy we’ll use most often is to load an .RData file, but you will encounter a few others here and there.
load('data/nhanes.RData')
This command looks for a file called nhanes.RData in a folder named data and reads the file.
Don’t be alarmed if nothing appears below the cell when you run it! However, notice that once you run the command, an object called nhanes appears in the “Environment” tab in the upper right hand panel of your RStudio window.
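If you’d like to see this “silent” behavior in isolation, here is a small self-contained sketch. It assumes a made-up data frame called toy and a temporary file, neither of which is part of the lab’s data. It saves the object to an .RData file, removes it, and then loads it back.

```r
# Toy data frame standing in for a real dataset (values are made up).
toy <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# Save it to a temporary .RData file, then remove it from the environment.
path <- tempfile(fileext = ".RData")
save(toy, file = path)
rm(toy)
exists("toy")   # FALSE: the object is gone

# load() prints nothing, but it restores the object under its original name.
loaded <- load(path)
loaded          # the names of the objects that were restored
exists("toy")   # TRUE: the object is back
```

Notice that load() decides the object’s name for you: it is whatever name was used when the file was saved.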
If you click the little blue caret next to nhanes in the environment tab, you will then see a list of variables contained in the dataset. You can also see the first few rows of the dataset using head(...):
head(nhanes)
# A tibble: 6 × 9
  subj.id gender   age poverty pulse bpsys1 bpdia1 totchol sleephrsnight
    <int> <fct>  <int>   <dbl> <int>  <int>  <int>   <dbl>         <int>
1       1 male      34    1.36    70    114     88    3.49             4
2       2 male      34    1.36    70    114     88    3.49             4
3       3 male      34    1.36    70    114     88    3.49             4
4       5 female    49    1.91    86    118     82    6.7              8
5       8 female    45    5       62    106     62    5.82             8
6       9 female    45    5       62    106     62    5.82             8
This kind of object in R is called a data frame. Data frames are displayed in a tabular layout, like a spreadsheet.
While data frames should be arranged so that observations are shown in rows and variables in columns, this is not guaranteed, so you should be in the habit of checking to make sure the layout is sensible; otherwise, you might accidentally perform bogus calculations and analyses.
Inspecting the data frame, either by expanding it in the environment panel or by previewing a few rows, will show you three key pieces of information besides the values of the first few observations of each variable:
Data dimensions: how many observations (rows) and how many variables (columns)
Variable names: subj.id, gender, age, etc.
Data types:
  int for integer (numerical data type)
  fct for factor (categorical data type)
  num for numeric (numerical data type; shown as dbl in a preview like the one above)
  chr for character (categorical data type)
So, for example, seeing that pulse is of data type int tells you that pulse is a discrete numerical variable. It also tells you what name to use to refer to the variable in subsequent R commands.
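The same three pieces of information can also be requested with commands. Below is a minimal sketch using a made-up data frame called toy (an assumption for illustration, not one of our datasets) with one column of each data type listed above.

```r
# Toy data frame with one column of each type (values are made up).
toy <- data.frame(
  id    = 1:4,                             # integer
  group = factor(c("a", "b", "a", "b")),   # factor
  value = c(1.5, 2.0, 3.25, 4.0),          # numeric
  label = c("w", "x", "y", "z"),           # character
  stringsAsFactors = FALSE
)

dim(toy)     # data dimensions: number of rows, then number of columns
names(toy)   # variable names
str(toy)     # compact listing of each variable with its data type

# Data type of every column, collected into one named vector.
var_types <- sapply(toy, class)
var_types
```

The output of str(...) is essentially what you see when you expand a data frame in the environment panel.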
There is another data file in the data directory called famuss.RData. Load this into the environment, preview the first few observations, and check the variable names and data types.
To check your understanding:
bmi?
# A tibble: 6 × 9
  ndrm.ch drm.ch sex      age race      height weight genotype   bmi
    <dbl>  <dbl> <fct>  <int> <fct>      <dbl>  <dbl> <fct>    <dbl>
1      40     40 Female    27 Caucasian   65      199 CC        33.1
2      25      0 Male      36 Caucasian   71.7    189 CT        25.8
3      40      0 Female    24 Caucasian   65      134 CT        22.3
4     125      0 Female    40 Caucasian   68      171 CT        26.0
5      40     20 Female    32 Caucasian   61      118 CC        22.3
6      75      0 Female    24 Hispanic    62.2    120 CT        21.8
There are 595 observations and 9 variables. The variables sex, race, and genotype are categorical; bmi is numeric.
The variable names in a dataset can be used to retrieve or refer to specific variables. For example, try running this command:
# extract total cholesterol
total.cholesterol <- nhanes$totchol
# preview first few values
head(total.cholesterol)
[1] 3.49 3.49 3.49 6.70 5.82 5.82
That command did the following:
Extracted the totchol column of nhanes (the nhanes$totchol part)
Stored the result under the name total.cholesterol (the <- part)
Assignment (<-) is a very important concept in R: you can store the result of any calculation as an object with a name of your choosing.
Be careful not to use the same name twice! For example, if you were to run nhanes <- nhanes$totchol, you would overwrite the data frame nhanes in your environment and no longer have the full dataset available.
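To see what overwriting looks like without risking your real data, here is a small sketch. The data frame toy and the names scores and toy_copy are made up for illustration.

```r
# Toy data frame with made-up values (a stand-in for a real dataset).
toy <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# Safe: store the extracted column under a new name.
scores <- toy$score
is.data.frame(toy)        # TRUE: the data frame is still intact

# Risky: reusing a name replaces the data frame with a single column.
toy_copy <- toy           # work on a copy so nothing is lost here
toy_copy <- toy_copy$score
is.data.frame(toy_copy)   # FALSE: the other columns are gone
```

If this ever happens to one of our datasets, simply re-run the cell that loads the data to get the full data frame back.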
You’ll notice that total.cholesterol looks a bit different than the data frame in terms of its appearance. This is because it’s not a data frame but rather a different kind of object called a vector: a collection of values of the same data type.
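If you’re curious, you can ask R directly what kind of object something is. The sketch below uses a made-up data frame toy rather than our data, and it also shows what “the same data type” means in practice: combining different kinds of values forces them into one common type.

```r
# Toy data frame with made-up values.
toy <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
scores <- toy$score

is.data.frame(scores)   # FALSE: a single extracted column is not a data frame
is.vector(scores)       # TRUE: it is a vector
length(scores)          # number of values in the vector
class(scores)           # the one data type shared by all of its values

# Mixing numbers and text in one vector converts everything to text.
mixed <- c(1, "two", 3)
class(mixed)            # "character"
```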
Extract the change in nondominant arm strength variable from the FAMuSS dataset, and store it as a vector called strength.
Extracting and storing variables as vectors isn’t strictly necessary, but does make it easier to perform many calculations. While you’re a beginner, I’d recommend using this strategy.
Most simple summary statistics can be calculated using simple functions in R that take a single vector argument. For example, to calculate the average, minimum, and maximum total cholesterol among the respondents in the sample:
mean(total.cholesterol)
min(total.cholesterol)
max(total.cholesterol)
[1] 5.042938
[1] 2.33
[1] 13.65
Run these one line at a time (i.e., don’t run the whole cell at once). Notice that if you run the entire cell, all three numbers will be printed in order.
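To connect these functions to the arithmetic behind them, here is a small sketch with a made-up vector called chol (not the NHANES values). It checks that the average is just the sum of the values divided by how many there are.

```r
# Made-up cholesterol values, for illustration only.
chol <- c(3.5, 4.0, 5.5, 6.0, 7.0)

avg <- mean(chol)   # average: 5.2
lo  <- min(chol)    # smallest value: 3.5
hi  <- max(chol)    # largest value: 7

# The average "by hand": add up the values, divide by the count.
by_hand <- sum(chol) / length(chol)
all.equal(avg, by_hand)   # TRUE
```

Storing each result under a name, as done here, is optional; running mean(chol) on its own simply prints the value.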
Find the average percent change in nondominant arm strength of participants in the FAMuSS study sample using the strength vector you created before.
Most data summaries for categorical variables proceed from counts of the number of observations in each category. These counts can be obtained by passing a vector of observations to table(...):
To obtain the proportion of observations in each category (the counts divided by the total number of observations), pass the table to the proportions(...) function:
The character string |> is a bit of syntax that you could read verbally as ‘then’: first make a table, then obtain proportions. It’s known as the pipe operator, because it ‘pipes’ the result of the command on its left into the command on its right.
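Here is a minimal sketch of both steps on a tiny made-up vector, so that you can verify the numbers by eye. The vector sex below contains four invented observations; it is not the NHANES variable.

```r
# Four made-up observations of a categorical variable.
sex <- factor(c("female", "male", "female", "female"))

# Counts: number of observations in each category.
counts <- table(sex)
counts   # female: 3, male: 1

# Proportions: first make a table, then divide by the total.
props <- table(sex) |> proportions()
props    # female: 0.75, male: 0.25
```

With 3 of the 4 observations being female, the proportion is 3/4 = 0.75, which matches the output.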
To see another example of the pipe operator in action, you could rewrite the previous command as a chain of three steps:
You could interpret this as follows: start with sex, pass that to table(), then pass the result to proportions().
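As a sketch of what the pipe is doing, the example below computes the same proportions two ways on a made-up vector: once by nesting one function call inside another, and once as a chain of three piped steps. The object names nested and piped are my own.

```r
# Four made-up observations of a categorical variable.
sex <- factor(c("female", "male", "female", "female"))

# Without the pipe: read from the inside out.
nested <- proportions(table(sex))

# With the pipe: read from left to right.
piped <- sex |> table() |> proportions()

# Both approaches give the same numbers.
all.equal(as.vector(nested), as.vector(piped))
```

Which style you use is a matter of taste; the piped version just lets you read the steps in the order they happen.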
Using the FAMuSS dataset, calculate the genotype frequencies in the sample (i.e., find the proportion of observations of each genotype).
While the analyses you’ll learn will get more complex than computing summary statistics, the mechanics of performing the computations in R will be analogous to what you just did: executing a one-line command with a vector input.
Reflect for a moment on what you just did: you wrote a few lines of code to import a dataset, extract a variable, and compute a statistic. You now have a record of the commands you executed that you can use to retrace your steps.
In fact, anyone with your document and the data files (including future you) could easily reproduce your work. Reproducibility is a pillar of data-driven science; by storing analyses in the form of executable scripts, researchers can easily create and share records of their work.
We could put the steps above together in just a few lines as if it were a short script. Typical style is to provide line-by-line comments explaining what the commands do.
# import nhanes data
load('data/nhanes.RData')
# inspect data
head(nhanes)
# extract total cholesterol
total.cholesterol <- nhanes$totchol
# compute average total cholesterol
mean(total.cholesterol)
# extract sex
sex <- nhanes$gender
# proportions of men and women in sample
table(sex) |> proportions()
Follow the example above and combine the previous exercises into a few lines of code with appropriate line comments.
# A tibble: 6 × 9
  ndrm.ch drm.ch sex      age race      height weight genotype   bmi
    <dbl>  <dbl> <fct>  <int> <fct>      <dbl>  <dbl> <fct>    <dbl>
1      40     40 Female    27 Caucasian   65      199 CC        33.1
2      25      0 Male      36 Caucasian   71.7    189 CT        25.8
3      40      0 Female    24 Caucasian   65      134 CT        22.3
4     125      0 Female    40 Caucasian   68      171 CT        26.0
5      40     20 Female    32 Caucasian   61      118 CC        22.3
6      75      0 Female    24 Hispanic    62.2    120 CT        21.8
# extract nondominant change in arm strength
strength <- famuss$ndrm.ch
# compute average change in strength
mean(strength)
[1] 53.29109
# extract genotype
gtype <- famuss$genotype
# compute genotype frequencies (proportions)
table(gtype) |> proportions()
gtype
       CC        CT        TT
0.2907563 0.4386555 0.2705882
If this was all entirely new to you, congratulations on writing your first lines of code!
While you will learn new commands going forward, we won’t go much more in depth with R than what you just saw. However, if you’re interested in understanding the above concepts in greater detail, or learning about R as a programming environment, see An Introduction to R.