(Refer back to the Advanced Data Manipulation lesson).
- dplyr verbs
- the pipe
%>%- variable creation
- multiple conditions
- properties of grouped data
- aggregation
- summary functions
- window functions
We’re going to work with a different dataset for the homework here. It’s a cleaned-up excerpt from the Gapminder data. Download the gapminder.csv data by clicking here or using the link above. Download it, and save it in a data/ subfolder of the project directory where you can access it easily from R.
Load the dplyr and readr packages, and read the gapminder data into R using the read_csv() function (n.b. read_csv() is not the same as read.csv()). Assign the data to an object called gm. Run gm to display it.
Note, the code is available by hitting the “Code” button above each expected output, but try not to use it unless you’re stuck.
# Load required libraries
library(dplyr)
library(readr)
# Read the data
gm <- read_csv("data/gapminder.csv")
# Take a look
gm
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
## 7 Afghanistan Asia 1982 39.854 12881816 978.0114
## 8 Afghanistan Asia 1987 40.822 13867957 852.3959
## 9 Afghanistan Asia 1992 41.674 16317921 649.3414
## 10 Afghanistan Asia 1997 41.763 22227415 635.3414
## # ... with 1,694 more rows
Use dplyr functions to address the following questions:
group_by then summarize with a call to n_distinct(...)).gm %>%
group_by(continent) %>%
summarize(n=n_distinct(country))
## # A tibble: 5 x 2
## continent n
## <chr> <int>
## 1 Africa 52
## 2 Americas 25
## 3 Asia 33
## 4 Europe 30
## 5 Oceania 2
head(n=1))gm %>%
filter(continent == "Europe" & year == 1997) %>%
arrange(gdpPercap) %>%
head(1)
## # A tibble: 1 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 Albania Europe 1997 72.95 3428038 3193.055
gm %>%
filter(year == 1982 | year == 1987) %>%
group_by(continent) %>%
summarize(mean_lifeExp = mean(lifeExp))
## # A tibble: 5 x 2
## continent mean_lifeExp
## <chr> <dbl>
## 1 Africa 52.46883
## 2 Americas 67.15978
## 3 Asia 63.73456
## 4 Europe 73.22428
## 5 Oceania 74.80500
head(n=5))gm %>%
mutate(gdp = gdpPercap*pop) %>%
group_by(country) %>%
summarise(Total.GDP = sum(gdp)) %>%
arrange(desc(Total.GDP)) %>%
head(5)
## # A tibble: 5 x 2
## country Total.GDP
## <chr> <dbl>
## 1 United States 7.676192e+13
## 2 Japan 2.543482e+13
## 3 China 2.039549e+13
## 4 Germany 1.949689e+13
## 5 United Kingdom 1.328937e+13
gm %>%
filter(lifeExp >= 80) %>%
select(country, lifeExp, year)
## # A tibble: 22 x 3
## country lifeExp year
## <chr> <dbl> <int>
## 1 Australia 80.370 2002
## 2 Australia 81.235 2007
## 3 Canada 80.653 2007
## 4 France 80.657 2007
## 5 Hong Kong, China 80.000 1997
## 6 Hong Kong, China 81.495 2002
## 7 Hong Kong, China 82.208 2007
## 8 Iceland 80.500 2002
## 9 Iceland 81.757 2007
## 10 Israel 80.745 2007
## # ... with 12 more rows
gm %>%
group_by(country) %>%
summarise(cor = cor(lifeExp, gdpPercap)) %>%
arrange(desc(abs(cor))) %>%
head(10)
## # A tibble: 10 x 2
## country cor
## <chr> <dbl>
## 1 France 0.9962239
## 2 Austria 0.9929642
## 3 Belgium 0.9927496
## 4 Norway 0.9921416
## 5 Oman 0.9907526
## 6 United Kingdom 0.9898930
## 7 Italy 0.9897600
## 8 Israel 0.9884894
## 9 Denmark 0.9870896
## 10 Australia 0.9864457
!= Asia, group by two variables, summarize, then arrange.)gm %>%
filter(continent != "Asia") %>%
group_by(continent, year) %>%
summarise(mean.pop = mean(pop)) %>%
arrange(desc(mean.pop))
## # A tibble: 48 x 3
## # Groups: continent [4]
## continent year mean.pop
## <chr> <int> <dbl>
## 1 Americas 2007 35954847
## 2 Americas 2002 33990910
## 3 Americas 1997 31876016
## 4 Americas 1992 29570964
## 5 Americas 1987 27310159
## 6 Americas 1982 25211637
## 7 Americas 1977 23122708
## 8 Americas 1972 21175368
## 9 Europe 2007 19536618
## 10 Europe 2002 19274129
## # ... with 38 more rows
gm %>%
group_by(country) %>%
summarize(sd.pop = sd(pop)) %>%
arrange(sd.pop) %>%
head(3)
## # A tibble: 3 x 2
## country sd.pop
## <chr> <dbl>
## 1 Sao Tome and Principe 45906.14
## 2 Iceland 48541.68
## 3 Montenegro 99737.94
gm %>%
arrange(country, year) %>%
group_by(country) %>%
filter(pop < lag(pop) & lifeExp > lag(lifeExp))
## # A tibble: 36 x 6
## # Groups: country [22]
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1982 39.854 12881816 978.0114
## 2 Bosnia and Herzegovina Europe 1992 72.178 4256013 2546.7814
## 3 Bosnia and Herzegovina Europe 1997 73.244 3607000 4766.3559
## 4 Bulgaria Europe 2002 72.140 7661799 7696.7777
## 5 Bulgaria Europe 2007 73.005 7322858 10680.7928
## 6 Croatia Europe 1997 73.680 4444595 9875.6045
## 7 Czech Republic Europe 1997 74.010 10300707 16048.5142
## 8 Czech Republic Europe 2002 75.510 10256295 17596.2102
## 9 Czech Republic Europe 2007 76.486 10228744 22833.3085
## 10 Equatorial Guinea Africa 1977 42.024 192675 958.5668
## # ... with 26 more rows