Introduction to R

The first part of this workshop will demonstrate very basic functionality in R, including functions, functions, vectors, creating variables, getting help, filtering, data frames, plotting, and reading/writing files.

Link to slides.

Link to R Cheat Sheet.

Before coming

Note: R and RStudio are separate downloads and installations. R is the underlying statistical computing environment, but using R alone is no fun. RStudio is a graphical integrated development environment that makes using R much easier. You need R installed before you install RStudio.

Download data. Download the gapminder.csv and malebmi.csv files from bioconnector.org/data. Save them somewhere easy to find. Optionally, open them up in Excel and look around.
Install R. You'll need R version 3.1.2 or higher. Download and install R for Windows or Mac OS X (download the latest R-3.x.x.pkg file for your appropriate version of OS X).
Install RStudio. Download and install the latest stable version of RStudio Desktop. Alternatively, download the RStudio Desktop v0.99 preview release (the 0.99 preview version has many nice new features that are especially useful for this particular workshop).
Install R packages. Launch RStudio (RStudio, not R itself). Ensure that you have internet access, then enter the following commands into the Console panel (usually the lower-left panel, by default). Note that these commands are case-sensitive. At any point (especially if you've used R/Bioconductor in the past), R may ask you if you want to update any old packages by asking Update all/some/none? [a/s/n]:. If you see this, type a at the propt and hit Enter to update any old packages. If you're using a Windows machine you might get some errors about not having permission to modify the existing libraries -- don't worry about this message. You can avoid this error altogether by running RStudio as an administrator.

# Install packages from CRAN
install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")
install.packages("knitr")
install.packages("rmarkdown")

You can check that you've installed everything correctly by closing and reopening RStudio and entering the following commands at the console window:

library(dplyr)
library(ggplot2)
library(tidyr)
library(knitr)
library(rmarkdown)

These commands may produce some notes or other output, but as long as they work without an error message, you're good to go. If you get a message that says something like: Error in library(packageName) : there is no package called 'packageName', then the required packages did not install correctly. Please do not hesitate to email me prior to the course if you are still having difficulty.

RStudio

Let's start by learning about RStudio. R is the underlying statistical computing environment, but using R alone is no fun. RStudio is a graphical integrated development environment that makes using R much easier.

Panes in RStudio: I set up my window to have the editor in the top left, console top right, environment/history on the bottom left, and plots/help on the bottom right.
Code that you type into the console is code that R executes. From here forward we will use the editor window to write a script that we can save to a file and run it again whenever we want to. We usually give it a .R extension, but it's just a plain text file. If you want to send commands from your editor to the console, use CMD+Enter (Ctrl+Enter on Windows).
Anything after a # sign is a comment. Use them liberally to comment your code.

Basic operations

R can be used as a glorified calculator. Try typing this in directly into the console. Make sure you're typing into into the editor, not the console, and save your script. Use the run button, or press CMD+Enter (Ctrl+Enter on Windows).

2 + 2
5 * 4
2^3

R Knows order of operations and scientific notation.

2 + 3 * 4/(5 + 3) * 15/2^2 + 3 * 4^2
50000

However, to do useful and interesting things, we need to assign values to objects. To create objects, we need to give it a name followed by the assignment operator <- and the value we want to give it:

weight_kg <- 55

<- is the assignment operator. Assigns values on the right to objects on the left, it is like an arrow that points from the value to the object. Mostly similar to = but not always. Learn to use <- as it is good programming practice. Using = in place of <- can lead to issues down the line.

Objects can be given any name such as x, current_temperature, or subject_id. You want your object names to be explicit and not too long. They cannot start with a number (2x is not valid but x2 is). R is case sensitive (e.g., weight_kg is different from Weight_kg). There are some names that cannot be used because they represent the names of fundamental functions in R (e.g., if, else, for, see here for a complete list). In general, even if it's allowed, it's best to not use other function names, which we'll get into shortly (e.g., c, T, mean, data, df, weights). In doubt check the help to see if the name is already in use. It's also best to avoid dots (.) within a variable name as in my.dataset. It is also recommended to use nouns for variable names, and verbs for function names.

When assigning a value to an object, R does not print anything. You can force to print the value by typing the name:

weight_kg

Now that R has weight_kg in memory, we can do arithmetic with it. For instance, we may want to convert this weight in pounds (weight in pounds is 2.2 times the weight in kg):

2.2 * weight_kg

We can also change a variable's value by assigning it a new one:

weight_kg <- 57.5
2.2 * weight_kg

This means that assigning a value to one variable does not change the values of other variables. For example, let's store the animal's weight in pounds in a variable.

weight_lb <- 2.2 * weight_kg

and then change weight_kg to 100.

weight_kg <- 100

What do you think is the current content of the object weight_lb? 126.5 or 220?

You can see what objects (variables) are stored by viewing the Environment tab in Rstudio. You can also use the ls() function. You can remove objects (variables) with the rm() function. You can do this one at a time or remove several objects at once.

ls()
rm(weight_lb, weight_kg)
ls()
weight_lb  # oops! you should get an error because weight_lb no longer exists!

EXERCISE 1

What are the values after each statement in the following?

mass <- 50  # mass?
age <- 30  # age?
mass <- mass * 2  # mass?
age <- age - 10  # age?
mass_index <- mass/age  # massIndex?

Functions

R has built-in functions.

# Notice that this is a comment.  Anything behind a # is 'commented out' and
# is not run.
sqrt(144)
log(1000)

Get help by typing a question mark in front of the function's name, or help(functionname):

help(log)
?log

Note syntax highlighting when typing this into the editor. Also note how we pass arguments to functions. The base= part inside the parentheses is called an argument, and most functions use arguments. Arguments modify the behavior of the function. Functions some input (e.g., some data, an object) and other options to change what the function will return, or how to treat the data provided. Finally, see how you can next one function inside of another (here taking the square root of the log-base-10 of 1000).

log(1000)
log(1000, base = 10)
log(1000, 10)
sqrt(log(1000, base = 10))

EXERCISE 2

See ?abs and calculate the square root of the log-base-10 of the absolute value of -4*(2550-50). Answer should be 2.

Vectors and classes

Let's create some numeric vectors. Vectors (aka "arrays" in Perl, "lists" in Python) are single objects containing an ordered collection of elements. A simple vector is a numeric vector, a single object containing several numbers. Here let's display a few vectors. We can also do vector arithmetic. When printing vectors to the screen that have lots of elements, notice that the bracketed number in the gutter of the output is just a counter indexing the number of elements in the vector.

# Some simple numeric vectors:
1:5
6:10
1:5 + 6:10
1:100

We can also create arbitrary vectors with the c() function (short for "combine").

c(1, 2, 5)
c(1:5, 11:15)

What if we wanted to create a vector from 2 to 10 by 2's? What about 2 to 200 by 4's? This might be useful for setting up an experiment where every other sample is an experimental group and every other is a control.

c(2, 4, 6, 8, 10)

# Get some help with the seq() function, then create a vector from 2 to 200
# by 2s.  Notice how the seq() function works -- the `to` argument will
# never be exceeded.
help(seq)
seq(from = 2, to = 200, by = 4)

You can assign this vector of values to an object, just like you would for one item. For example we can create a vector of animal weights:

weights <- c(50, 60, 65)
weights

A vector can also contain characters:

animals <- c("mouse", "rat", "dog")
animals

There are many functions that allow you to inspect the content of a vector. length() tells you how many elements are in a particular vector:

length(weights)
length(animals)

class() indicates the class (the type of element) of an object:

class(weights)
class(animals)

EXERCISE 3

Use the c() function to create/assign a new object that combines the weights and animals vectors into a single vector called combined.
What happened to the numeric values? Hint: What's the class() of combined?
Why do you think this happens?

The function str() provides an overview of the structure of an object and the elements it contains. It is a really useful function when working with large and complex objects:

str(weights)
str(animals)

You can add elements to your vector simply by using the c() function:

weights
weights <- c(weights, 90)  # adding at the end
weights <- c(30, weights)  # adding at the beginning
weights

What happens here is that we take the original vector weights, and we are adding another item first to the end of the other ones, and then another item at the beginning. We can do this over and over again to build a vector or a dataset. When you're programming this may be useful to autoupdate results that we are collecting or calculating.

Certain functions operate only on certain classes of object. Here, weights is a numeric vector. The built-in sum() function will operate on numeric objects, but not characters.

sum(weights)
sum(animals)

EXERCISE 4

Sum the integers 1 through 100 and 501 through 600 (e.g. 1+2+...+99+100+501+502+...+599+600)

Slicing/indexing vectors

Let's create a vector of 50 integers going from 101 to 150. We can access certain elements of that vector by putting the element's index(es) in square brackets. E.g., x[1] will return the first element in vector x. Calling x[c(3,5)] will access the third and fifth elements. Calling x[1:10] will return the first ten elements of x.

Special note: R indexes vectors starting at 1. This is different from many other languages, including Perl and Python, which index starting from zero.

# Create the vector.
x <- 101:150

# Get the first element.
x[1]

# Get the 42nd element.
x[42]

# Get the 20th through the 25th elements.
x[20:25]

# If you try to access elements that don't exist, you'll return missing
# values.  Missing values are represented as NA
x[45:55]  #NA is missing value!

Data Frames

We use data frames to store heterogeneous tabular data in R: tabular, meaning that individuals or observations are typically represented in rows, while variables or features are represented as columns; heterogeneous, meaning that columns/features/variables can be different classes (on variable, e.g. age, can be numeric, while another, e.g., cause of death, can be text).

Before coming, you should have downloaded the gapminder data. If you downloaded this entire lesson repository, once you extract it you'll find it in workshops/lessons/r/data/gapminder.csv. Alternatively you can download it directly from http://bioconnector.org/data/. This dataset is an excerpt from the Gapminder data, that's already been cleaned up to a degree. This particular dataset has 1704 observations on six variables:

country a categorical variable (aka "factor") 142 levels
continent, a categorical variable with 5 levels
year: going from 1952 to 2007 in increments of 5 years
pop: population
gdpPercap: GDP per capita
lifeExp: life expectancy

Let's load the data first. There are two ways to do this. You can use RStudio's menus to select a file from your computer (tools, import dataset, from text file). But that's not reproducible. The best way to do this is to save the data and the script you're using to the same place, and read in the data in using read.csv. It's important to tell R that the file has a header, which tells R the names of the columns. We tell this to R with the header=TRUE argument.

Once we've loaded it we can type the name of the object itself (gm) to view the entire data frame. Note: doing this with large data frames can cause you trouble.

# Read in the data from a file
gm <- read.csv("data/gapminder.csv", header = TRUE)

# Alternatively, read directly from the web: gm <-
# read.csv(url('http://bioconnector.org/data/gapminder.csv'), header=TRUE)

# See what kind of data it is, and print the object to the screen
class(gm)
gm

# Set an option to only show a few elements
options(max.print = 80)
gm

Inspecting data.frame objects

There are several built-in functions that are useful for working with data frames.

Content:
- head(): shows the first 6 rows
- tail(): shows the last 6 rows
Size:
- dim(): returns a 2-element vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
- nrow(): returns the number of rows
- ncol(): returns the number of columns
Summary:
- colnames() (or just names()): returns the column names
- str(): structure of the object and information about the class, length and content of each column
- summary(): works differently depending on what kind of object you pass to it. Passing a data frame to the summary() function prints out some summary statistics about each column (min, max, median, mean, etc.)

head(gm)
tail(gm)
dim(gm)
nrow(gm)
ncol(gm)
colnames(gm)
str(gm)
summary(gm)

View(gm)

Accessing variables & subsetting data frames

We can access individual variables within a data frame using the $ operator, e.g., mydataframe$specificVariable. Let's print out the populations for every country. Then let's calculate the average life expectancy for every country for every year (using the built-in mean() function).

# display all populations
gm$pop

# we could have also taken a head of the population:
head(gm$pop)

# mean life expectancy
mean(gm$lifeExp)

Now that's not too interesting. This is the average life expectancy across all countries across all years. We might be interested in something like the life expectancy for a particular country, and how that changes over time. Or maybe the average life expectancy across all countries, separately for each year. Or maybe the average life expectancy for different continents. So the natural next step here is to filter the data according to some criterion (that is, we'll limit the rows that are returned by restricting the value of some variable stored in a column).

There's a built-in function called subset() that does this but we're going to use a function called filter() that's part of a package called dplyr, because we're going to use other stuff from the dplyr package in a more advanced class to do some advanced data manipulation and analysis. The filter() function takes two arguments.

The first argument is the data frame you want to filter, e.g. filter(gm....
The second argument is a condition you must satisfy, e.g. filter(gm, year==1982). If you want to satisfy all of multiple conditions, you can use the "and" operator, &. The "or" operator | (the pipe character, usually shift-backslash) will return a subset that meet any of the conditions.
- ==: Equal to
- !=: Not equal to
- >, >=: Greater than, greater than or equal to
- <, <=: Less than, less than or equal to

Let's try it out. But first we have to load the package that has the function we want. It's called dplyr. You should have already installed it.

# Load the dplyr package
library(dplyr)

# Show only stats for the year 1982
filter(gm, year == 1982)

# Show only stats for the US
filter(gm, country == "United States")

# Show only stats for American contries in 1997
filter(gm, continent == "Americas" & year == 1997)

# Show only stats for countries with per-capita GDP of less than 300 OR a
# life expectancy of less than 30. What happened those years?
filter(gm, gdpPercap < 300 | lifeExp < 30)

Finally, take a look at the class of what's returned by a filter() function. The filter() function takes a data.frame and returns a data.frame. You can operate on this new data.frame just as you would any other data.frame using the $ operator. E.g., print out the GDP for the two oceanic countries in 2002, and take the mean of that.

filter(gm, continent == "Oceania" & year == 2002)
filter(gm, continent == "Oceania" & year == 2002)$gdpPercap
mean(filter(gm, continent == "Oceania" & year == 2002)$gdpPercap)

EXERCISE 5

What country and what years had a low GDP (<500) but high life expectancy (>50)?
What's the average GDP for Asian countries in 2002? How does that compare to European countries in the same year? To the Americas?

BONUS: Preview to advanced class

What if we wanted to compute the mean population size and mean GDP for each country for each year? We have 12 different years, times 5 continents is 60, times 2 calculations (mean GDP and population), gives us 120 operations total. In the advanced class I'll show you how to do this in a single line.

with()

The with() function is particularly helpful. Let's say you wanted to compute some (senseless) value by computing the population size divided by the life expectancy. You could access the dataset's variables using the $ notation, or you could use with() to temporarily attach the data frame, and call the variables directly, as if they were just vectors hanging out in your workspace. The first argument to with() is the name of the data frame, and the second argument is all the stuff you'd like to do with the particular features in that data frame.

Try typing the following commands:

# Display the number of cylinders.
gm$pop
with(gm, pop)

# Compute the senseless value described above. Both return the same results.
gm$pop/gm$lifeExp
with(gm, pop/lifeExp)

EXERCISE 6

Using the with() and filter() functions:

Compute the average GDP in billions for all Asian countries in 2007.
Do the same for Europe in 2007.

Hint: GDP per capita is the GDP divided by the population size. So to get GDP, you'd multiple gdpPercap*pop. To get that in billions, divide by 1,000,000, or more easily expressed in R using scientific notation: 1e9.

Plotting basics

Plotting a single numeric variable goes down the rows and plots a value on the y-axis for each observation (index) in the data frame.

plot(gm$lifeExp)

This isn't a very useful figure. But you can start to see something in this data. Look at the diagonal pattern that pops up? Why is this?

More appropriate might be a histogram. We can try to let R decide how many breaks to insert in the histogram, or we can set that manually. We can also set the color of the bars.

hist(gm$lifeExp)
hist(gm$lifeExp, breaks = 100)
hist(gm$lifeExp, breaks = 100, col = "black")

But keep in mind, this is over all the years in the dataset. It might make more sense to look at the distribution of life expectancies and how those distributions change from year to year. I'll teach an advanced visualization class later on where we'll show how to do that. But at least this lets you look at the integrity of your data. If you saw a bump in the distribution at a negative value or at 150 years, you'd know you have a problem.

We can create a scatterplot between two variables with plot(varX, varY).

with(gm, plot(gdpPercap, lifeExp))

There are hundreds of plotting parameters you can use to make your plot look exactly like you want. Let's use a solid-filled point instead of an open circle with the pch= argument (point character), color the points red with the col= argument, give it a title by passing a character object to the main= argument, and change the x and y axis titles with the xlab= and ylab= arguments, respectively. Let's go through this one step at a time.

with(gm, plot(gdpPercap, lifeExp, pch = 16))
with(gm, plot(gdpPercap, lifeExp, pch = 16, col = "red"))
with(gm, plot(gdpPercap, lifeExp, pch = 16, col = "red", main = "Life Exp vs GDP"))
with(gm, plot(gdpPercap, lifeExp, pch = 16, col = "red", main = "Life Exp vs GDP", 
    ylab = "Life Expectancy (years)", xlab = "Per-capita GDP ($)"))

Notice how on that last line I broke the command up into two lines for better readability. I broke the command at the comma separating arguments, and indented the following line for readability.

With plotting parameters, Google is your friend.

Forget what each point character represents? Google R pch.
Forget the names of R's colors? Google R colors. Want to learn more about color schemes in R? Google RColorBrewer.
Try googling R graphical parameters.

EXERCISE 7

Plot GDP in trillions (gdpPercap*pop/1e9) on the y-axis versus population size in millions on the x-axis for all countries in the Americas. Use solid (pch=16) "blue" points, and give the plot a title and legends.

Reading in / writing out data

First, lets create a small dataset that only has the data for 1997 in it.

gm97 <- filter(gm, year == 1997)

Next, check what your working directory is with getwd() with no arguments, and look up some help for write.table() and write.csv().

getwd()
help(write.table)
help(write.csv)

Using RStudio, go to the Session menu, and select the directory (folder) you want to work from under the "Set Working Directory" menu. You can also do this manually with the setwd() command, but you should really be working in the same directory as wherever you have your script and data stored.

Now you can save the new reduced data frame to a comma-separated file called gm97.csv using the write.csv() function.

write.csv(gm97, file = "data/gm97.csv")

Later on you can load this particular dataset again either using the menus (Tools -- Import Dataset) or using read.csv().