Advanced Data Manipulation with R

Data analysis involves a large amount of janitor work - munging and cleaning data to facilitate downstream data analysis. This workshop assumes you have a basic familiarity with R and want to learn tools and techniques for advanced data manipulation. It will cover data cleaning and "tidy data," and will introduce R packages that enable data manipulation, analysis, and visualization using split-apply-combine strategies. Upon completing this lesson, you will be able to use the dplyr package in R to effectively manipulate and conditionally compute summary statistics over subsets of a "big" dataset containing many observations.

This is not a beginner course. This workshop requires basic familiarity with R: data types including vectors and data frames, importing/exporting data, and plotting. You can refresh your R knowledge with DataCamp's Intro to R or TryR from CodeSchool.

Before coming

You'll need to bring a laptop to the course with the software installed as detailed below.

Note: R and RStudio are separate downloads and installations. R is the underlying statistical computing environment, but using R alone is no fun. RStudio is a graphical integrated development environment that makes using R much easier. You need R installed before you install RStudio.

  1. Download data. Download the gapminder.csv and malebmi.csv files from bioconnector.org/data. Save them somewhere easy to find. Optionally, open them up in Excel and look around.
  2. Install R. You'll need R version 3.1.2 or higher. Download and install R for Windows or Mac OS X (download the latest R-3.x.x.pkg file for your appropriate version of OS X).
  3. Install RStudio. Download and install the latest stable version of RStudio Desktop. Alternatively, download the RStudio Desktop v0.99 preview release (the 0.99 preview version has many nice new features that are especially useful for this particular workshop).
  4. Install R packages. Launch RStudio (RStudio, not R itself). Ensure that you have internet access, then enter the following commands into the Console panel (usually the lower-left panel, by default). Note that these commands are case-sensitive. At any point (especially if you've used R/Bioconductor in the past), R may ask you if you want to update any old packages by asking Update all/some/none? [a/s/n]:. If you see this, type a at the prompt and hit Enter to update any old packages. If you're using a Windows machine you might get some errors about not having permission to modify the existing libraries -- don't worry about this message. You can avoid this error altogether by running RStudio as an administrator.
# Install packages from CRAN
install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")
install.packages("knitr")
install.packages("rmarkdown")

You can check that you've installed everything correctly by closing and reopening RStudio and entering the following commands at the console window:

library(dplyr)
library(ggplot2)
library(tidyr)
library(knitr)
library(rmarkdown)

These commands may produce some notes or other output, but as long as they work without an error message, you're good to go. If you get a message that says something like: Error in library(packageName) : there is no package called 'packageName', then the required packages did not install correctly. Please do not hesitate to email me prior to the course if you are still having difficulty.

Review

We use data frames to store heterogeneous tabular data in R: tabular, meaning that individuals or observations are typically represented in rows, while variables or features are represented as columns; heterogeneous, meaning that columns/features/variables can be different classes (one variable, e.g. age, can be numeric, while another, e.g., cause of death, can be text).
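For instance, here's a small made-up data frame mixing a numeric column and a character column (the values and names here are just for illustration):

```r
# A toy data frame: age is numeric, cause of death is text
df <- data.frame(age = c(34, 51, 29),
                 cause = c("heart disease", "stroke", "accident"),
                 stringsAsFactors = FALSE)
str(df)
```

Each row is one individual; each column is one variable, and the two columns have different classes.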

Before coming, you should have downloaded the gapminder data. If you downloaded this entire lesson repository, once you extract it you'll find it in workshops/lessons/r/data/gapminder.csv. Alternatively you can download it directly from http://bioconnector.org/data/. This dataset is an excerpt from the Gapminder data that has already been cleaned up to a degree. This particular dataset has 1704 observations on six variables:

  • country: a categorical variable (aka "factor") with 142 levels
  • continent: a categorical variable with 5 levels
  • year: ranging from 1952 to 2007 in increments of 5 years
  • pop: population
  • gdpPercap: GDP per capita
  • lifeExp: life expectancy

Let's load the data first. There are two ways to do this. You can use RStudio's menus to select a file from your computer (Tools, Import Dataset, From Text File). But that's not reproducible. The best way is to save the data and the script you're using in the same place, and read in the data using read.csv. It's important to tell R that the file has a header, which gives R the names of the columns. We do this with the header=TRUE argument.

Once we've loaded it we can type the name of the object itself (gm) to view the entire data frame. Note: doing this with large data frames can cause you trouble.

gm <- read.csv("data/gapminder.csv", header=TRUE)

# Alternatively, read directly from the web:
# gm <- read.csv(url("http://bioconnector.org/data/gapminder.csv"), header=TRUE)

class(gm)
gm

There are several built-in functions that are useful for working with data frames.

  • Content:
    • head(): shows the first 6 rows
    • tail(): shows the last 6 rows
  • Size:
    • dim(): returns a 2-element vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
    • nrow(): returns the number of rows
    • ncol(): returns the number of columns
  • Summary:
    • colnames() (or just names()): returns the column names
    • str(): structure of the object and information about the class, length and content of each column
    • summary(): works differently depending on what kind of object you pass to it. Passing a data frame to the summary() function prints out some summary statistics about each column (min, max, median, mean, etc.)
head(gm)
tail(gm)
dim(gm)
nrow(gm)
ncol(gm)
colnames(gm)
str(gm)
summary(gm)
View(gm)

The dplyr package

The dplyr package is a relatively new R package that makes data manipulation fast and easy. It imports functionality from another package called magrittr that allows you to chain commands together into a pipeline, which will completely change the way you write R code: you end up writing code the way you're thinking about the problem.

tbl_df

Let's go back to our gapminder data. It's currently stored in an object called gm. Remember that gm is a data.frame object if we look at its class(). Also, remember what happens if we try to display gm on the screen? It's too big to show us and it fills up our console.

class(gm)
gm

The first really nice thing about the dplyr package is the tbl_df class. A tbl_df is basically an improved version of the data.frame object. The main advantage to using a tbl_df over a regular data frame is the printing: tbl objects only print a few rows and all the columns that fit on one screen, describing the rest of it as text. We can turn our gm data frame into a tbl_df using the tbl_df() function. Let's do that, and reassign the result back to gm. Now, if we take a look at gm's class, we'll see that it's still a data frame, but it's also now a tbl_df. If we now type the name of the object, it will by default only print out a few lines. If this was a "wide" dataset with many columns, it would also not try to show us everything.

library(dplyr)
gm <- tbl_df(gm)
class(gm)
gm

You don't have to turn all your data.frame objects into tbl_df objects, but it does make working with large datasets a bit easier.


EXERCISE 1

Load the malebmi.csv data. This is male BMI data from 1980 through 2008 from the gapminder project. If you downloaded the zip file from the repository, it's located in workshops/lessons/r/data. Alternatively you can download it directly from http://bioconnector.org/data/.

  1. Load this into a data frame object called bmi.
  2. Turn it into a tbl_df.
  3. How many rows and columns does it have?

Don't remove it -- we're going to use it later on.
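If you get stuck, here's a sketch of one possible solution (assuming the file is saved in a data/ directory next to your script, as with the gapminder data):

```r
# 1. Load the data into a data frame called bmi
bmi <- read.csv("data/malebmi.csv", header=TRUE)
# 2. Turn it into a tbl_df
bmi <- tbl_df(bmi)
# 3. How many rows and columns?
dim(bmi)
```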


dplyr verbs

The dplyr package gives you a handful of useful verbs for managing data. On their own they don't do anything that base R can't do.

  1. filter
  2. select
  3. mutate
  4. arrange
  5. summarize
  6. group_by
  7. more later...

They all take a data.frame or tbl_df as their input for the first argument.

filter()

You saw this in the introductory lecture: if you want to filter rows of the data where some condition is true, use the filter() function. The first argument is the data frame or tbl, and the second argument is the condition you want to satisfy.

# Show only stats for the year 1982
filter(gm, year==1982)

# Show only stats for the US
filter(gm, country=="United States")

# Show only stats for countries in the Americas in 1997
filter(gm, continent=="Americas" & year==1997)

# Show only stats for countries with per-capita GDP of less than 300 OR a 
# life expectancy of less than 30. What happened those years? 
filter(gm, gdpPercap<300 | lifeExp<30)

select()

The filter() function allows you to return only certain rows matching a condition. The select() function lets you subset the data and restrict to a number of columns. The first argument is the data, and subsequent arguments are the columns you want. Let's just get the year and the population variables.

select(gm, year, pop)

mutate()

The mutate() function adds new columns to the data. Remember, the gdpPercap variable in our dataset is GDP per capita, which is the total GDP divided by the population size for that country, for that year. Let's mutate this dataset and add a column called gdp:

mutate(gm, gdp=pop*gdpPercap)

mutate() has another nice feature: it evaluates its arguments sequentially, so you can add one variable, then immediately add more variables based on the one you just created. Let's make another column that's GDP in billions.

mutate(gm, gdp=pop*gdpPercap, gdpBil=gdp/1e9)

arrange()

The arrange() function does what it sounds like. It takes a data frame or tbl and arranges (or sorts) by column(s) of interest. The first argument is the data, and subsequent arguments are columns to sort on. Use the desc() function to arrange by descending.

arrange(gm, lifeExp)
arrange(gm, year, desc(lifeExp))

summarize()

The summarize() function summarizes multiple values to a single value. On its own the summarize() function doesn't seem to be all that useful.

summarize(gm, mean(pop))
summarize(gm, meanpop=mean(pop))
summarize(gm, n())
summarize(gm, n_distinct(country))

group_by()

We saw that summarize() isn't that useful on its own. Neither is group_by(). All it does is take an existing tbl and convert it into a grouped tbl where operations are performed by group.

gm
group_by(gm, continent)
class(group_by(gm, continent))

The real power comes in where group_by() and summarize() are used together. Let's take the same grouped tbl from last time, and pass all that as an input to summarize, where we get the mean population size. We can also group by more than one variable.

summarize(group_by(gm, continent), mean(pop))

group_by(gm, continent, year)
summarize(group_by(gm, continent, year), mean(pop))

The almighty pipe: %>%

This is where things get awesome. The dplyr package imports functionality from the magrittr package that lets you pipe the output of one function to the input of another, so you can avoid nesting functions. It looks like this: %>%. You don't have to load the magrittr package to use it since dplyr imports its functionality when you load the dplyr package.

Here's the simplest way to use it. Think of the head() function. It expects a data frame as input, and the next argument is the number of lines to print. These two commands are identical:

head(gm, 5)
## Source: local data frame [5 x 6]
## 
##       country continent year lifeExp      pop gdpPercap
## 1 Afghanistan      Asia 1952    28.8  8425333       779
## 2 Afghanistan      Asia 1957    30.3  9240934       821
## 3 Afghanistan      Asia 1962    32.0 10267083       853
## 4 Afghanistan      Asia 1967    34.0 11537966       836
## 5 Afghanistan      Asia 1972    36.1 13079460       740
gm %>% head(5)
## Source: local data frame [5 x 6]
## 
##       country continent year lifeExp      pop gdpPercap
## 1 Afghanistan      Asia 1952    28.8  8425333       779
## 2 Afghanistan      Asia 1957    30.3  9240934       821
## 3 Afghanistan      Asia 1962    32.0 10267083       853
## 4 Afghanistan      Asia 1967    34.0 11537966       836
## 5 Afghanistan      Asia 1972    36.1 13079460       740

So what?

Now, think about this for a minute. What if we wanted to get the life expectancy and GDP averaged across all Asian countries for each year? (See top of image). Mentally we would do something like this:

  1. Take the gm dataset
  2. mutate() it to add raw GDP
  3. filter() it to restrict to Asian countries only
  4. group_by() year
  5. and summarize() it to get the mean life expectancy and mean GDP.

But in code, it gets ugly. First, mutate the data to add GDP.

mutate(gm, gdp=gdpPercap*pop)

Wrap that whole command with filter().

filter(mutate(gm, gdp=gdpPercap*pop), continent=="Asia")

Wrap that again with group_by():

group_by(filter(mutate(gm, gdp=gdpPercap*pop), continent=="Asia"), year)

Finally, wrap everything with summarize():

summarize(
  group_by(
    filter(
      mutate(gm, gdp=gdpPercap*pop), 
    continent=="Asia"), 
  year), 
mean(lifeExp), mean(gdp))

Now compare that with the mental process of what you're actually trying to accomplish. The way you would do this without pipes is completely inside-out and backwards from the way you express in words and in thought what you want to do. The pipe operator %>% allows you to pass data frame or tbl objects from one function to another, so long as those functions expect data frames or tables as input.

This is how we would do that in code. It's as simple as replacing the word "then" in your head with the symbol %>% in your code.

gm %>%
  mutate(gdp=gdpPercap*pop) %>%
  filter(continent=="Asia") %>%
  group_by(year) %>%
  summarize(mean(lifeExp), mean(gdp))

EXERCISE 2

Here's a warm-up round. Try the following.

What was the population of Peru in 1992? Show only the population variable. Hint: 2 pipes; use filter and select.

## Source: local data frame [1 x 1]
## 
##        pop
## 1 22430449
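Here's a sketch of one way to get there, matching the hint (this assumes gm is the tbl_df we loaded earlier):

```r
# Restrict rows to Peru in 1992, then restrict columns to pop
gm %>%
  filter(country == "Peru", year == 1992) %>%
  select(pop)
```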

Which countries and which years had the worst five GDP per capita measurements? Hint: 2 pipes; use arrange and head.

## Source: local data frame [5 x 6]
## 
##            country continent year lifeExp      pop gdpPercap
## 1 Congo, Dem. Rep.    Africa 2002    45.0 55379852       241
## 2 Congo, Dem. Rep.    Africa 2007    46.5 64606759       278
## 3          Lesotho    Africa 1952    42.1   748747       299
## 4    Guinea-Bissau    Africa 1952    32.5   580653       300
## 5 Congo, Dem. Rep.    Africa 1997    42.6 47798986       312
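One possible solution, following the hint (again assuming gm is the tbl_df loaded earlier):

```r
# Sort ascending by per-capita GDP, then take the first five rows
gm %>%
  arrange(gdpPercap) %>%
  head(5)
```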

What was the average life expectancy across all countries for each year in the dataset? Hint: 2 pipes; use group_by and summarize.

## Source: local data frame [12 x 2]
## 
##    year mean(lifeExp)
## 1  1952          49.1
## 2  1957          51.5
## 3  1962          53.6
## 4  1967          55.7
## 5  1972          57.6
## 6  1977          59.6
## 7  1982          61.5
## 8  1987          63.2
## 9  1992          64.2
## 10 1997          65.0
## 11 2002          65.7
## 12 2007          67.0
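A sketch of one solution, matching the hint (assuming gm is the tbl_df loaded earlier):

```r
# Group by year, then compute the mean life expectancy within each group
gm %>%
  group_by(year) %>%
  summarize(mean(lifeExp))
```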

That was easy, right? How about some tougher ones.


EXERCISE 3

Which five Asian countries had the highest life expectancy in 2007? Hint: 3 pipes.

## Source: local data frame [5 x 6]
## 
##            country continent year lifeExp       pop gdpPercap
## 1            Japan      Asia 2007    82.6 127467972     31656
## 2 Hong Kong, China      Asia 2007    82.2   6980412     39725
## 3           Israel      Asia 2007    80.7   6426679     25523
## 4        Singapore      Asia 2007    80.0   4553009     47143
## 5      Korea, Rep.      Asia 2007    78.6  49044790     23348
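One way to do it with three pipes (a sketch, assuming gm is the tbl_df loaded earlier):

```r
# Restrict to Asia in 2007, sort by life expectancy descending, take the top 5
gm %>%
  filter(continent == "Asia", year == 2007) %>%
  arrange(desc(lifeExp)) %>%
  head(5)
```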

How many countries are on each continent? Hint: 2 pipes.

## Source: local data frame [5 x 2]
## 
##   continent n_distinct(country)
## 1    Africa                  52
## 2  Americas                  25
## 3      Asia                  33
## 4    Europe                  30
## 5   Oceania                   2
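A sketch of one solution (assuming gm is the tbl_df loaded earlier); note that n_distinct() is needed because each country appears once per year:

```r
# Group by continent, then count the distinct countries in each group
gm %>%
  group_by(continent) %>%
  summarize(n_distinct(country))
```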

Separately for each year, compute the correlation coefficients (e.g., cor(x,y)) for life expectancy (y) against both log10 of the population size and log10 of the per capita GDP. What do these trends mean? Hint: 2 pipes.

## Source: local data frame [12 x 3]
## 
##    year cor(log10(pop), lifeExp) cor(log10(gdpPercap), lifeExp)
## 1  1952                   0.1543                          0.748
## 2  1957                   0.1584                          0.759
## 3  1962                   0.1376                          0.771
## 4  1967                   0.1482                          0.773
## 5  1972                   0.1322                          0.789
## 6  1977                   0.1142                          0.814
## 7  1982                   0.0944                          0.846
## 8  1987                   0.0732                          0.874
## 9  1992                   0.0593                          0.856
## 10 1997                   0.0636                          0.864
## 11 2002                   0.0746                          0.825
## 12 2007                   0.0653                          0.809
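One possible solution (a sketch, assuming gm is the tbl_df loaded earlier): summarize() can compute multiple summary values at once, so both correlations fit in one call.

```r
# Group by year, then compute both correlation coefficients within each group
gm %>%
  group_by(year) %>%
  summarize(cor(log10(pop), lifeExp), cor(log10(gdpPercap), lifeExp))
```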

Separately for each continent and each year, compute the average GDP (not per-capita) in billions across all countries. Which continents/years had the top 5 overall GDP? Hint: 6 pipes. If you want to arrange a dataset by a value computed on grouped data, you first have to pass that resulting dataset to a function called ungroup() before continuing to operate.

## Source: local data frame [5 x 3]
## 
##   continent year meangdp
## 1  Americas 2007     777
## 2  Americas 2002     661
## 3      Asia 2007     628
## 4  Americas 1997     583
## 5    Europe 2007     493
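Here's the shape of one six-pipe solution (a sketch, assuming gm is the tbl_df loaded earlier); the ungroup() step is what lets arrange() sort across the whole result rather than within groups:

```r
gm %>%
  mutate(gdpBil = gdpPercap * pop / 1e9) %>%   # raw GDP in billions
  group_by(continent, year) %>%
  summarize(meangdp = mean(gdpBil)) %>%        # average over countries
  ungroup() %>%                                # drop grouping before sorting
  arrange(desc(meangdp)) %>%
  head(5)
```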

Tidy data

So far we've dealt exclusively with tidy data - data that's easy to work with, manipulate, and visualize. That's because our gapminder dataset has two key properties:

  1. Each column is a variable.
  2. Each row is an observation.

You can read a lot more about tidy data in this paper. Let's load some untidy data and see if we can see the difference. This is some made-up data for five different patients (Jon, Ann, Bill, Kate, and Joe) given three different drugs (A, B, and C), and measuring their heart rate. If you didn't download the entire lesson repository from GitHub, you can download the heartrate.csv file directly from http://bioconnector.org/data/.

hr <- read.csv("data/heartrate.csv")
hr

Notice how with the gapminder data, each variable (country, continent, year, lifeExp, pop, and gdpPercap) was in its own column. In the heart rate data, we have three variables: name, drug, and heart rate. Name is in a column, but drug is in the top row, and heart rate is scattered around the entire table. If we wanted to do things like filter the dataset where drug=="a" or heartrate>=80, we couldn't, because these variables aren't in columns.

The tidyr package helps with this. There are other functions in the tidyr package but one particularly handy one is the gather() function, which takes multiple columns, and gathers them into key-value pairs: it makes "wide" data longer. We can get a little help with ?gather.

library(tidyr)
hr %>% gather(key=drug, value=heartrate, a,b,c)
hr %>% gather(key=drug, value=heartrate, -name)

If we create a new data frame that's a tidy version of hr, we can do those kinds of manipulations we talked about before.

# Create a new data.frame
hrtidy <- hr %>%
  gather(key=drug, value=heartrate, -name)
hrtidy

# filter
hrtidy %>% filter(drug=="a")
hrtidy %>% filter(heartrate>=80)

# analyze
hrtidy %>%
  group_by(drug) %>%
  summarize(mean(heartrate))

Two-table verbs

It's rare that a data analysis involves only a single table of data. You normally have many tables that contribute to an analysis, and you need flexible tools to combine them. The dplyr package has several tools that let you work with multiple tables at once. The most straightforward one is the inner_join.

First, let's read in some BMI data from the gapminder project. This is the average body mass index for men only for the years 1980-2008. Now, if we look at the data we'll immediately see that it's in wide format: we've got three variables (Country, year, and BMI), but just like the heart rate data we looked at before, the year variable is stuck in the column header while all the BMI variables are spread out throughout the table.

country 1980 1981 1982 1983 1984
Afghanistan 21.5 21.5 21.5 21.4 21.4
Albania 25.2 25.2 25.3 25.3 25.3
Algeria 22.3 22.3 22.4 22.5 22.6
Andorra 25.7 25.7 25.7 25.8 25.8
Angola 20.9 20.9 20.9 20.9 20.9

There's also another problem here. Let's read in the data to see what happens. Let's make it a tbl_df while we're at it.

bmi <- read.csv("data/malebmi.csv") %>% tbl_df()
bmi

Look at that. R puts an "X" in front of each year name because syntactically valid column names in R can't start with a number (read.csv cleans up the names with make.names() by default). You can see where we're going with this though. We want to "gather" this data into a long format, and join it together with our original gapminder data by year and by country, but to do this, we'll first need to have this in a tidy format, and the year variable can't have those X's in front.

There's a function called gsub() that takes three arguments: a pattern to replace, what you want to replace it with, and a character vector. For example:

tmp <- c("id1", "id2", "id3")
gsub("id", "", tmp)

Notice that what we're left with is a character vector instead of numbers. The last thing we'll have to do is pass that whole thing through as.numeric() to coerce it back to a numeric vector:

gsub("id", "", tmp) %>% as.numeric

So let's tidy up our BMI data.

bmi
bmitidy <- bmi %>%
  gather(key=year, value=bmi, -country) %>%
  mutate(year=gsub("X", "", year) %>% as.numeric)
bmitidy

Now, look up some help for ?inner_join. There are many different kinds of two-table joins in dplyr. Inner join will return a table with all rows from the first table where there are matching rows in the second table, and returns all columns from both tables.

gmbmi <- inner_join(gm, bmitidy, by=c("country", "year"))
gmbmi

Now we can do all sorts of things that we were doing before, but now with BMI as well.

library(ggplot2)
gmbmi %>%
  group_by(continent, year) %>%
  summarize(bmi=mean(bmi)) %>%
  ggplot(aes(year, bmi, colour=continent)) + geom_line()