(Refer back to the Advanced Data Manipulation lesson).
- dplyr verbs
- the pipe
%>%
- variable creation
- multiple conditions
- properties of grouped data
- aggregation
- summary functions
- window functions
We’re going to work with a different dataset for the homework here. It’s a cleaned-up excerpt from the Gapminder data. Download the gapminder.csv data by clicking here or using the link above. Download it, and save it in a data/
subfolder of the project directory where you can access it easily from R.
Load the dplyr and readr packages, and read the gapminder data into R using the read_csv()
function (n.b. read_csv()
is not the same as read.csv()
). Assign the data to an object called gm
. Run gm
to display it.
Note, the code is available by hitting the “Code” button above each expected output, but try not to use it unless you’re stuck.
# Load required libraries
library(dplyr)
library(readr)
# Read the data
gm <- read_csv("data/gapminder.csv")
# Take a look
gm
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
## 7 Afghanistan Asia 1982 39.854 12881816 978.0114
## 8 Afghanistan Asia 1987 40.822 13867957 852.3959
## 9 Afghanistan Asia 1992 41.674 16317921 649.3414
## 10 Afghanistan Asia 1997 41.763 22227415 635.3414
## # ... with 1,694 more rows
Use dplyr functions to address the following questions:
group_by
then summarize with a call to n_distinct(...)
).gm %>%
group_by(continent) %>%
summarize(n=n_distinct(country))
## # A tibble: 5 x 2
## continent n
## <chr> <int>
## 1 Africa 52
## 2 Americas 25
## 3 Asia 33
## 4 Europe 30
## 5 Oceania 2
head(n=1)
)gm %>%
filter(continent == "Europe" & year == 1997) %>%
arrange(gdpPercap) %>%
head(1)
## # A tibble: 1 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 Albania Europe 1997 72.95 3428038 3193.055
gm %>%
filter(year == 1982 | year == 1987) %>%
group_by(continent) %>%
summarize(mean_lifeExp = mean(lifeExp))
## # A tibble: 5 x 2
## continent mean_lifeExp
## <chr> <dbl>
## 1 Africa 52.46883
## 2 Americas 67.15978
## 3 Asia 63.73456
## 4 Europe 73.22428
## 5 Oceania 74.80500
head(n=5)
)gm %>%
mutate(gdp = gdpPercap*pop) %>%
group_by(country) %>%
summarise(Total.GDP = sum(gdp)) %>%
arrange(desc(Total.GDP)) %>%
head(5)
## # A tibble: 5 x 2
## country Total.GDP
## <chr> <dbl>
## 1 United States 7.676192e+13
## 2 Japan 2.543482e+13
## 3 China 2.039549e+13
## 4 Germany 1.949689e+13
## 5 United Kingdom 1.328937e+13
gm %>%
filter(lifeExp >= 80) %>%
select(country, lifeExp, year)
## # A tibble: 22 x 3
## country lifeExp year
## <chr> <dbl> <int>
## 1 Australia 80.370 2002
## 2 Australia 81.235 2007
## 3 Canada 80.653 2007
## 4 France 80.657 2007
## 5 Hong Kong, China 80.000 1997
## 6 Hong Kong, China 81.495 2002
## 7 Hong Kong, China 82.208 2007
## 8 Iceland 80.500 2002
## 9 Iceland 81.757 2007
## 10 Israel 80.745 2007
## # ... with 12 more rows
gm %>%
group_by(country) %>%
summarise(cor = cor(lifeExp, gdpPercap)) %>%
arrange(desc(abs(cor))) %>%
head(10)
## # A tibble: 10 x 2
## country cor
## <chr> <dbl>
## 1 France 0.9962239
## 2 Austria 0.9929642
## 3 Belgium 0.9927496
## 4 Norway 0.9921416
## 5 Oman 0.9907526
## 6 United Kingdom 0.9898930
## 7 Italy 0.9897600
## 8 Israel 0.9884894
## 9 Denmark 0.9870896
## 10 Australia 0.9864457
!=
Asia, group by two variables, summarize, then arrange.)gm %>%
filter(continent != "Asia") %>%
group_by(continent, year) %>%
summarise(mean.pop = mean(pop)) %>%
arrange(desc(mean.pop))
## # A tibble: 48 x 3
## # Groups: continent [4]
## continent year mean.pop
## <chr> <int> <dbl>
## 1 Americas 2007 35954847
## 2 Americas 2002 33990910
## 3 Americas 1997 31876016
## 4 Americas 1992 29570964
## 5 Americas 1987 27310159
## 6 Americas 1982 25211637
## 7 Americas 1977 23122708
## 8 Americas 1972 21175368
## 9 Europe 2007 19536618
## 10 Europe 2002 19274129
## # ... with 38 more rows
gm %>%
group_by(country) %>%
summarize(sd.pop = sd(pop)) %>%
arrange(sd.pop) %>%
head(3)
## # A tibble: 3 x 2
## country sd.pop
## <chr> <dbl>
## 1 Sao Tome and Principe 45906.14
## 2 Iceland 48541.68
## 3 Montenegro 99737.94
gm %>%
arrange(country, year) %>%
group_by(country) %>%
filter(pop < lag(pop) & lifeExp > lag(lifeExp))
## # A tibble: 36 x 6
## # Groups: country [22]
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1982 39.854 12881816 978.0114
## 2 Bosnia and Herzegovina Europe 1992 72.178 4256013 2546.7814
## 3 Bosnia and Herzegovina Europe 1997 73.244 3607000 4766.3559
## 4 Bulgaria Europe 2002 72.140 7661799 7696.7777
## 5 Bulgaria Europe 2007 73.005 7322858 10680.7928
## 6 Croatia Europe 1997 73.680 4444595 9875.6045
## 7 Czech Republic Europe 1997 74.010 10300707 16048.5142
## 8 Czech Republic Europe 2002 75.510 10256295 17596.2102
## 9 Czech Republic Europe 2007 76.486 10228744 22833.3085
## 10 Equatorial Guinea Africa 1977 42.024 192675 958.5668
## # ... with 26 more rows