Lab 1: R Basics

Course

STAT218

This lab introduces the basics of R that you will need for this class. Most of your statistical analyses throughout the quarter will consist of just a few simple steps:

  1. load a dataset
  2. identify and select variable(s) of interest
  3. perform one or more calculations using variable(s) of interest as inputs

We will illustrate this process so that you can get used to the mechanics and familiarize yourself with how different data types appear in R. You will also learn how to perform simple calculations of common summary statistics: proportions and averages.
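As a quick preview, the three steps above can be sketched with a dataset that ships with R. This is only an illustration: it uses the built-in mtcars data rather than one of the course datasets, and the name mpg.values is made up for this example.

```r
# 1. load a dataset (mtcars is bundled with R; it is not a course dataset)
data(mtcars)

# 2. select a variable of interest (miles per gallon)
mpg.values <- mtcars$mpg

# 3. perform a calculation using that variable as the input
mean(mpg.values)
```

The rest of the lab walks through each of these steps in detail using the course data.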

Preliminaries

First let’s cover a few logistics of how to use this document to complete the lab activity.

Code cells

Throughout this document, you will notice that there are “code cells” that look like this:

# say hello
print("hello!")
[1] "hello!"

Any given line of code is simply an instruction for R to perform a certain task. You can “execute” or “run” the code that appears in a cell in one of two ways:

  1. Click the little green arrow “‣” in the upper-right corner of the code cell
  2. Place your cursor on the line you wish to run and press Ctrl + Enter (Windows) or Cmd + Enter (Mac)

Try it with the example above. You should see the result appear directly below the code cell.

How to do this lab

The structure of our labs in STAT218 will be simple. I’ve provided a copy of this document and the necessary data files on Posit Cloud. The text explains statistical and data-analytic tasks and the code required to perform them. Code cells provide examples. “Your Turn” sections prompt you to replicate the examples in a slightly different setting. Your job is to read the activity and, as you go:

  1. run example code cells
  2. fill in the “your turn” code cells

You will need to run/complete cells in order — often, later tasks depend on the results of earlier tasks. So, don’t skip ahead, even if you feel confident in what you’re doing.

Remember that your goal is to understand the statistical concepts and learn to apply them to real data! I’ve done my best to write these activities with a minimum of code complexity so that you can focus on the statistical content. Try not to let the code become a distraction: it’s just a set of instructions, no different from clicking a button in an app.

Packages in R

You will often see activities begin with a cell that looks like this:

library(tidyverse)

This command loads a software “package”: a bundle of functions, datasets, and other objects that are imported into R for use in your working environment. If you see this kind of cell, you will need to run it before proceeding.

Aside: installing packages

Packages do need to be installed before they can be loaded. One of the nice things about using Posit Cloud is that I can manage all of these installs for you. However, if you ever wish to install and use a package that’s not available (or if you use R on your own machine), you can install a package using the command install.packages("<PACKAGE NAME>") after replacing <PACKAGE NAME> with the actual name of the package (but keeping the quotation marks!).
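If you do work on your own machine, one way to avoid reinstalling a package you already have is to check for it first. The sketch below is one possible approach, not part of the lab: the helper name has.package is hypothetical, and the install line is commented out so that running the cell does not download anything.

```r
# hypothetical helper: TRUE if a package is already installed, FALSE otherwise
has.package <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

# the 'stats' package ships with R, so it should always be found
has.package("stats")

# install tidyverse only if it is missing (uncomment to actually install)
# if (!has.package("tidyverse")) install.packages("tidyverse")
```

Installing only needs to happen once per machine, whereas library(...) needs to be run in every new R session.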

Data in R

There are several ways to load data in R. The strategy we’ll use most often is to load an .RData file, but you will encounter a few others here and there.

# load nhanes data
load('data/nhanes.RData')

This command looks for a file called nhanes.RData in a directory (folder) named data and reads the file.

Don’t be alarmed if nothing appears below the cell when you run it! However, notice that once you run the command, an object called nhanes appears in the “Environment” tab in the upper right hand panel of your RStudio window.
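Because load() prints nothing, it can help to know that it quietly returns the names of the objects it created. The sketch below demonstrates this with a throwaway file rather than the course data: the data frame toy and the temporary file are made up for illustration.

```r
# made-up data frame, for illustration only
toy <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))

# save it to a temporary .RData file, then remove it from the environment
path <- tempfile(fileext = ".RData")
save(toy, file = path)
rm(toy)

# load() restores the object and returns the names of what it loaded
loaded <- load(path)
loaded

# confirm that the object is back in the environment
exists("toy")
```

Storing the result of load() is optional; checking the Environment tab tells you the same thing.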

If you click the little blue caret next to nhanes in the Environment tab, you will then see a list of variables contained in the dataset. You can also see the first few rows of the dataset using head(...):

# first few rows
head(nhanes)
# A tibble: 6 × 9
  subj.id gender   age poverty pulse bpsys1 bpdia1 totchol sleephrsnight
    <int> <fct>  <int>   <dbl> <int>  <int>  <int>   <dbl>         <int>
1       1 male      34    1.36    70    114     88    3.49             4
2       2 male      34    1.36    70    114     88    3.49             4
3       3 male      34    1.36    70    114     88    3.49             4
4       5 female    49    1.91    86    118     82    6.7              8
5       8 female    45    5       62    106     62    5.82             8
6       9 female    45    5       62    106     62    5.82             8

This kind of object in R is called a data frame. Data frames are displayed in a tabular layout, like a spreadsheet.

While data frames should be arranged so that observations are shown in rows and variables in columns, this is not guaranteed, so you should be in the habit of checking to make sure the layout is sensible; otherwise, you might accidentally perform bogus calculations and analyses.

Inspecting the data frame — either by expanding it in the environment panel or by previewing a few rows — will show you three key pieces of information besides the values of the first few observations of each variable:

  1. Data dimensions: how many observations (rows) and how many variables (columns)

  2. Variable names: subj.id, gender, age, etc.

  3. Data types:

    • int for integer (numerical data type)
    • fct for factor (categorical data type)
    • num for numeric (numerical data type); this appears as dbl in the row preview
    • chr for character (categorical data type)

So, for example, seeing that pulse is of data type int tells you that pulse is a discrete numerical variable. It also tells you what name to use to refer to the variable in subsequent R commands.
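The same three pieces of information can also be retrieved with commands. The following is a minimal sketch using a small made-up data frame (toy is hypothetical and stands in for a real dataset); dim(), names(), and str() are base R functions.

```r
# made-up data frame with one variable of each kind
toy <- data.frame(
  pulse  = c(70L, 86L, 62L),                       # integer
  chol   = c(3.49, 6.70, 5.82),                    # numeric
  gender = factor(c("male", "female", "female"))   # factor
)

# dimensions: number of rows (observations), then number of columns (variables)
dim(toy)

# variable names
names(toy)

# data type of each variable
sapply(toy, class)

# compact overview: dimensions, names, types, and the first few values
str(toy)
```

The output of str(...) is essentially what you see when you expand a data frame in the Environment tab.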

Your turn

There is another data file in the data directory called famuss.RData. Load this into the environment, preview the first few observations, and check the variable names and data types.

To check your understanding:

  • how many observations and variables?
  • identify a categorical variable
  • what kind of variable is bmi?

# load famuss dataset
load('data/famuss.RData')

# preview first few rows
head(famuss)
# A tibble: 6 × 9
  ndrm.ch drm.ch sex      age race      height weight genotype   bmi
    <dbl>  <dbl> <fct>  <int> <fct>      <dbl>  <dbl> <fct>    <dbl>
1      40     40 Female    27 Caucasian   65      199 CC        33.1
2      25      0 Male      36 Caucasian   71.7    189 CT        25.8
3      40      0 Female    24 Caucasian   65      134 CT        22.3
4     125      0 Female    40 Caucasian   68      171 CT        26.0
5      40     20 Female    32 Caucasian   61      118 CC        22.3
6      75      0 Female    24 Hispanic    62.2    120 CT        21.8

There are 595 observations and 9 variables. The variables sex, race, and genotype are categorical; bmi is numeric (type dbl), so it is a continuous numerical variable.

Extracting variables

The variable names in a dataset can be used to retrieve or refer to specific variables. For example, try running this command:

# extract total cholesterol
total.cholesterol <- nhanes$totchol

# preview first few values
head(total.cholesterol)
[1] 3.49 3.49 3.49 6.70 5.82 5.82

That command did the following:

  • extracted the totchol column of nhanes (the nhanes$totchol part)
  • assigned the result a new name total.cholesterol (the <- part)

Assignment (<-) is a very important concept in R – you can store the result of any calculation as an object with a name of your choosing.

On names

Be careful not to use the same name twice! For example, if you were to run nhanes <- nhanes$totchol, you would overwrite the data frame nhanes in your environment and no longer have the full dataset available.

You’ll notice that total.cholesterol looks a bit different from the data frame. This is because it’s not a data frame but rather a different kind of object called a vector: a collection of values of the same data type.
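To make the distinction concrete, the sketch below checks what kind of object each piece is and confirms that using a new name leaves the original data intact. It uses a made-up data frame in place of nhanes; the names toy and chol are hypothetical.

```r
# made-up data frame standing in for a real dataset
toy <- data.frame(id = 1:4, totchol = c(3.49, 6.70, 5.82, 4.99))

# extract a column and store it under a new name
chol <- toy$totchol

# the original is a data frame; the extracted column is a vector
is.data.frame(toy)
is.vector(chol)

# a vector has a length (one value per observation) rather than rows and columns
length(chol)

# because a new name was used, the full data frame is still available
names(toy)
```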

Your turn

Extract the change in nondominant arm strength variable from the FAMuSS dataset, and store it as a vector called strength.

# store the change in nondominant arm strength variable as a vector called 'strength'
strength <- famuss$ndrm.ch

# preview the first few values
head(strength)
[1]  40  25  40 125  40  75

Performing calculations in R

Extracting and storing variables as vectors isn’t strictly necessary, but does make it easier to perform many calculations. While you’re a beginner, I’d recommend using this strategy.

Numeric summaries

Most common summary statistics can be calculated using simple functions in R that take a single vector argument. For example, to calculate the average, minimum, and maximum total cholesterol among the respondents in the sample:

# average total cholesterol
mean(total.cholesterol)
[1] 5.042938

# minimum
min(total.cholesterol)
[1] 2.33

# maximum
max(total.cholesterol)
[1] 13.65

Run these one line at a time (i.e., don’t run the whole cell at once). Notice that if you do run the entire cell, all three numbers will be printed in order.
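Other summaries work the same way. The sketch below uses a made-up vector x (not from the course data) to show median() and summary(). As a side note, it also shows what happens with missing values: by default mean() returns NA if any value is missing, and the na.rm = TRUE argument tells R to drop missing values first. The cholesterol calculation above did not need this.

```r
# made-up values, for illustration only
x <- c(4.2, 5.1, 3.8, 6.0)

# median, and a summary showing min, quartiles, median, mean, and max
median(x)
summary(x)

# the same values with one missing observation added
x.missing <- c(x, NA)

# the mean is NA when a value is missing...
mean(x.missing)

# ...unless missing values are removed first
mean(x.missing, na.rm = TRUE)
```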

Your turn

Find the average percent change in nondominant arm strength of participants in the FAMuSS study sample using the strength vector you created before.

# compute mean change in nondominant arm strength
mean(strength)
[1] 53.29109

Categorical summaries

Most data summaries for categorical variables proceed from counts of the number of observations in each category. These counts can be obtained by passing a vector of observations to table(...):

# retrieve sex variable
sex <- nhanes$gender

# counts
table(sex)
sex
female   male 
  1588   1591 

To obtain the proportion of observations in each category – the counts divided by the total number of observations – pass the table to the proportions(...) function:

# proportions
table(sex) |> proportions()
sex
   female      male 
0.4995282 0.5004718 

The character string |> is a bit of syntax that you could read verbally as ‘then’: first make a table, then obtain proportions. It’s known as the pipe operator, because it ‘pipes’ the result of the command on its left into the command on its right.

To see another example of the pipe operator in action, you could rewrite the previous command as a chain of three steps:

# same as above
sex |> table() |> proportions()
sex
   female      male 
0.4995282 0.5004718 

You could interpret this as follows: start with sex, pass that to table(), then pass the result to proportions().
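To see that the pipe is only a different way of writing the same computation, the sketch below compares three equivalent versions on a made-up vector (grp is hypothetical, not from the course data): the piped form, the nested form, and the definition of a proportion as counts divided by the total number of observations. Note that the |> operator requires R 4.1 or later.

```r
# made-up categorical observations
grp <- factor(c("a", "b", "a", "c", "a", "b"))

# piped: start with grp, then make a table, then obtain proportions
piped <- grp |> table() |> proportions()

# nested: the same computation written inside-out
nested <- proportions(table(grp))

# by hand: counts divided by the total number of observations
by.hand <- table(grp) / length(grp)

# all three give the same numbers
piped
all.equal(as.numeric(piped), as.numeric(nested))
all.equal(as.numeric(piped), as.numeric(by.hand))
```

Which form you use is a matter of readability; the piped version lists the steps in the order they happen.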

Your turn

Using the FAMuSS dataset, calculate the genotype frequencies in the sample (i.e., find the proportion of observations of each genotype).

# retrieve genotype
gtype <- famuss$genotype

# counts
table(gtype)
gtype
 CC  CT  TT 
173 261 161 

# proportions
table(gtype) |> proportions()
gtype
       CC        CT        TT 
0.2907563 0.4386555 0.2705882 

While the analyses you’ll learn will get more complex than computing summary statistics, the mechanics of performing the computations in R will be analogous to what you just did: executing a one-line command with a vector input.

Recap

Reflect for a moment on what you just did: you wrote a few lines of code to import a dataset, extract a variable, and compute a statistic. You now have a record of the commands you executed that you can use to retrace your steps.

In fact, anyone with your document and the data files (including future you) could easily reproduce your work. Reproducibility is a pillar of data-driven science; by storing analyses in the form of executable scripts, researchers can easily create and share records of their work.

We could put the steps above together in just a few lines as if it were a short script. Typical style is to provide line-by-line comments explaining what the commands do.

# import nhanes data
load('data/nhanes.RData')

# inspect data
head(nhanes)

# extract total cholesterol
total.cholesterol <- nhanes$totchol

# compute average total cholesterol
mean(total.cholesterol)

# extract sex
sex <- nhanes$gender

# proportions of men and women in sample
table(sex) |> proportions()

Your turn

Follow the example above and combine the previous exercises into a few lines of code with appropriate line comments.

# load famuss dataset
load('data/famuss.RData')

# inspect data
head(famuss)
# A tibble: 6 × 9
  ndrm.ch drm.ch sex      age race      height weight genotype   bmi
    <dbl>  <dbl> <fct>  <int> <fct>      <dbl>  <dbl> <fct>    <dbl>
1      40     40 Female    27 Caucasian   65      199 CC        33.1
2      25      0 Male      36 Caucasian   71.7    189 CT        25.8
3      40      0 Female    24 Caucasian   65      134 CT        22.3
4     125      0 Female    40 Caucasian   68      171 CT        26.0
5      40     20 Female    32 Caucasian   61      118 CC        22.3
6      75      0 Female    24 Hispanic    62.2    120 CT        21.8

# extract nondominant change in arm strength
strength <- famuss$ndrm.ch

# compute average change in strength
mean(strength)
[1] 53.29109

# extract genotype
gtype <- famuss$genotype

# compute genotype frequencies (proportions)
table(gtype) |> proportions()
gtype
       CC        CT        TT 
0.2907563 0.4386555 0.2705882 

If this was all entirely new to you, congratulations on writing your first lines of code!

More about R

While you will learn new commands going forward, we won’t go much more in depth with R than what you just saw. However, if you’re interested in understanding the above concepts in greater detail, or learning about R as a programming environment, see An Introduction to R.