(Refer back to the Advanced Data Manipulation lesson).

Key Concepts

  • dplyr verbs
  • the pipe %>%
  • variable creation
  • multiple conditions
  • properties of grouped data
  • aggregation
  • summary functions
  • window functions

Getting Started

We’re going to work with a different dataset for the homework here. It’s a cleaned-up excerpt from the Gapminder data. Download the gapminder.csv data by clicking here or using the link above. Download it, and save it in a data/ subfolder of the project directory where you can access it easily from R.

Load the dplyr and readr packages, and read the gapminder data into R using the read_csv() function (n.b. read_csv() is not the same as read.csv()). Assign the data to an object called gm. Run gm to display it.

Note, the code is available by hitting the “Code” button above each expected output, but try not to use it unless you’re stuck.

# Load required libraries
library(dplyr)
library(readr)

# Read the data
gm <- read_csv("data/gapminder.csv")

# Take a look
gm
## # A tibble: 1,704 x 6
##        country continent  year lifeExp      pop gdpPercap
##          <chr>     <chr> <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan      Asia  1952  28.801  8425333  779.4453
##  2 Afghanistan      Asia  1957  30.332  9240934  820.8530
##  3 Afghanistan      Asia  1962  31.997 10267083  853.1007
##  4 Afghanistan      Asia  1967  34.020 11537966  836.1971
##  5 Afghanistan      Asia  1972  36.088 13079460  739.9811
##  6 Afghanistan      Asia  1977  38.438 14880372  786.1134
##  7 Afghanistan      Asia  1982  39.854 12881816  978.0114
##  8 Afghanistan      Asia  1987  40.822 13867957  852.3959
##  9 Afghanistan      Asia  1992  41.674 16317921  649.3414
## 10 Afghanistan      Asia  1997  41.763 22227415  635.3414
## # ... with 1,694 more rows

Problem set

Use dplyr functions to address the following questions:

  1. How many unique countries are represented per continent? (Hint: group_by then summarize with a call to n_distinct(...)).
gm %>% 
  group_by(continent) %>% 
  summarize(n=n_distinct(country))
## # A tibble: 5 x 2
##   continent     n
##       <chr> <int>
## 1    Africa    52
## 2  Americas    25
## 3      Asia    33
## 4    Europe    30
## 5   Oceania     2
  1. Which European nation had the lowest GDP per capita in 1997? (Hint: filter, arrange, head(n=1))
gm %>%
    filter(continent == "Europe" & year == 1997) %>%
    arrange(gdpPercap) %>%
    head(1)
## # A tibble: 1 x 6
##   country continent  year lifeExp     pop gdpPercap
##     <chr>     <chr> <int>   <dbl>   <int>     <dbl>
## 1 Albania    Europe  1997   72.95 3428038  3193.055
  1. According to the data available, what was the average life expectancy across each continent in the 1980s? (Hint: filter, group_by, summarize)
gm %>%
    filter(year == 1982 | year == 1987) %>%
    group_by(continent) %>%
    summarize(mean_lifeExp = mean(lifeExp))
## # A tibble: 5 x 2
##   continent mean_lifeExp
##       <chr>        <dbl>
## 1    Africa     52.46883
## 2  Americas     67.15978
## 3      Asia     63.73456
## 4    Europe     73.22428
## 5   Oceania     74.80500
  1. What 5 countries have the highest total GDP over all years combined? (Hint: GDP per capita is simply GDP divided by the total population size. To get GDP back, you’d mutate to calculate GDP as the product of GDP per capita times the population size. Mutate, group_by, summarize, arrange, head(n=5))
gm %>%
    mutate(gdp = gdpPercap*pop) %>%
    group_by(country) %>%
    summarise(Total.GDP = sum(gdp)) %>%
    arrange(desc(Total.GDP)) %>%
    head(5)
## # A tibble: 5 x 2
##          country    Total.GDP
##            <chr>        <dbl>
## 1  United States 7.676192e+13
## 2          Japan 2.543482e+13
## 3          China 2.039549e+13
## 4        Germany 1.949689e+13
## 5 United Kingdom 1.328937e+13
  1. What countries and years had life expectancies of at least 80 years? N.b. only output the columns of interest: country, life expectancy and year (in that order).
gm %>%
    filter(lifeExp >= 80) %>%
    select(country, lifeExp, year)
## # A tibble: 22 x 3
##             country lifeExp  year
##               <chr>   <dbl> <int>
##  1        Australia  80.370  2002
##  2        Australia  81.235  2007
##  3           Canada  80.653  2007
##  4           France  80.657  2007
##  5 Hong Kong, China  80.000  1997
##  6 Hong Kong, China  81.495  2002
##  7 Hong Kong, China  82.208  2007
##  8          Iceland  80.500  2002
##  9          Iceland  81.757  2007
## 10           Israel  80.745  2007
## # ... with 12 more rows
  1. What 10 countries have the strongest correlation (in either direction) between life expectancy and per capita GDP?
gm %>%
    group_by(country) %>%
    summarise(cor = cor(lifeExp, gdpPercap)) %>%
    arrange(desc(abs(cor))) %>%
    head(10)
## # A tibble: 10 x 2
##           country       cor
##             <chr>     <dbl>
##  1         France 0.9962239
##  2        Austria 0.9929642
##  3        Belgium 0.9927496
##  4         Norway 0.9921416
##  5           Oman 0.9907526
##  6 United Kingdom 0.9898930
##  7          Italy 0.9897600
##  8         Israel 0.9884894
##  9        Denmark 0.9870896
## 10      Australia 0.9864457
  1. Which combinations of continent (besides Asia) and year have the highest average population across all countries? N.b. your output should include all results sorted by highest average population. (Hint: filter where continent != Asia, group by two variables, summarize, then arrange.)
gm %>%
    filter(continent != "Asia") %>%
    group_by(continent, year) %>%
    summarise(mean.pop = mean(pop)) %>%
    arrange(desc(mean.pop)) 
## # A tibble: 48 x 3
## # Groups:   continent [4]
##    continent  year mean.pop
##        <chr> <int>    <dbl>
##  1  Americas  2007 35954847
##  2  Americas  2002 33990910
##  3  Americas  1997 31876016
##  4  Americas  1992 29570964
##  5  Americas  1987 27310159
##  6  Americas  1982 25211637
##  7  Americas  1977 23122708
##  8  Americas  1972 21175368
##  9    Europe  2007 19536618
## 10    Europe  2002 19274129
## # ... with 38 more rows
  1. Which three countries have had the most consistent population estimates (i.e. lowest standard deviation) across the years of available data?
gm %>%
    group_by(country) %>%
    summarize(sd.pop = sd(pop)) %>%
    arrange(sd.pop) %>%
    head(3)
## # A tibble: 3 x 2
##                 country   sd.pop
##                   <chr>    <dbl>
## 1 Sao Tome and Principe 45906.14
## 2               Iceland 48541.68
## 3            Montenegro 99737.94
  1. Bonus! Which observations indicate that the population of a country has decreased from the previous year and the life expectancy has increased from the previous year? See the vignette on window functions.
gm %>% 
  arrange(country, year) %>% 
  group_by(country) %>% 
  filter(pop < lag(pop) & lifeExp > lag(lifeExp))
## # A tibble: 36 x 6
## # Groups:   country [22]
##                   country continent  year lifeExp      pop  gdpPercap
##                     <chr>     <chr> <int>   <dbl>    <int>      <dbl>
##  1            Afghanistan      Asia  1982  39.854 12881816   978.0114
##  2 Bosnia and Herzegovina    Europe  1992  72.178  4256013  2546.7814
##  3 Bosnia and Herzegovina    Europe  1997  73.244  3607000  4766.3559
##  4               Bulgaria    Europe  2002  72.140  7661799  7696.7777
##  5               Bulgaria    Europe  2007  73.005  7322858 10680.7928
##  6                Croatia    Europe  1997  73.680  4444595  9875.6045
##  7         Czech Republic    Europe  1997  74.010 10300707 16048.5142
##  8         Czech Republic    Europe  2002  75.510 10256295 17596.2102
##  9         Czech Republic    Europe  2007  76.486 10228744 22833.3085
## 10      Equatorial Guinea    Africa  1977  42.024   192675   958.5668
## # ... with 26 more rows