12 Iteration with for loops and map functions

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.0     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

gapminder <- read.csv("data/gapminder_data.csv")

12.1 “For” loops for repeating operations

If you want to iterate over a set of values and perform the same operation on each, one way to do this is with a for() loop. A for() loop is particularly appropriate when the order of iteration is important.

In general, many R users would advise you to learn about for() loops but to avoid using them unless the order of iteration is important, i.e., unless the calculation at each iteration depends on the results of previous iterations.

The basic structure of a for() loop is:

for (iterator in set of values) {
  do a thing
}

For example:

for (i in 1:10) {
  print(i)
}

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
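To illustrate a case where the order of iteration matters, here is a minimal sketch of a running total, in which each iteration uses the value left behind by the previous one. The vector values and the variable running_total are made up for this illustration.

```r
# Hypothetical input values, chosen only for illustration
values <- c(3, 1, 4, 1, 5)

running_total <- 0
for (value in values) {
  # Each step depends on the total computed in the previous step
  running_total <- running_total + value
  print(running_total)
}
```

This prints 3, 4, 8, 9 and 14 in turn. Iterating over the same values in a different order would give different intermediate totals.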

To save the output of a computation in a vector, you first need to create an empty vector (e.g., result <- c()) and then fill in one value of this vector in each iteration of the for loop.

result <- c()
for (i in 1:10) {
  result[i] <- (1 + i)^10
}
result

 [1]        1024       59049     1048576     9765625    60466176   282475249
 [7]  1073741824  3486784401 10000000000 25937424601
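Growing a vector one element at a time works, but R has to enlarge the vector repeatedly as it goes. A common alternative, sketched here, is to create a vector of the final length up front with numeric() and fill it in. The variable n is introduced only for this illustration.

```r
n <- 10

# Pre-allocate a double vector of n zeros instead of starting from c()
result <- numeric(n)

# seq_len(n) is 1, 2, ..., n (and is empty when n is 0)
for (i in seq_len(n)) {
  result[i] <- (1 + i)^10
}
result
```

The values are the same as above; only the way the result vector is created differs.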

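For loops are common in programming in general, but they are used less often in R. One reason is that a loop that grows its result vector one element at a time can be slow; another is that R offers more concise functional alternatives.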

Instead, a more concise and often more readable way to iterate in R is to use the map() functions from the purrr R package. purrr is loaded as part of the tidyverse, but to load it on its own you can run the following code (if the purrr package is not installed, you will first need to run the commented install.packages() line):

# install.packages("purrr")
library(purrr)

The first argument of the map() function is the object whose elements we want to iterate over. The second argument is the function that we want to apply at each iteration.

The output of the map() function is always a list.

For example, the following code will apply the exp() function to each element in the vector 1:10 and return the results in a list:

map(1:10, exp)

[[1]]
[1] 2.718282

[[2]]
[1] 7.389056

[[3]]
[1] 20.08554

[[4]]
[1] 54.59815

[[5]]
[1] 148.4132

[[6]]
[1] 403.4288

[[7]]
[1] 1096.633

[[8]]
[1] 2980.958

[[9]]
[1] 8103.084

[[10]]
[1] 22026.47
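As a small sketch of working with this list output, individual results can be extracted with [[ ]], and the elements of a list do not need to have the same length. The name exp_list is chosen only for this illustration.

```r
library(purrr)

exp_list <- map(1:3, exp)

# Extract the second result (a length-one numeric vector)
exp_list[[2]]

# Flatten the list into an ordinary numeric vector
unlist(exp_list)

# A list can hold results of different lengths:
# here the elements are 1, 1:2 and 1:3
map(1:3, seq_len)
```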

While the list output format offers maximal flexibility, we typically want to create a vector or a data frame. This can be done using alternative versions of the map() function, such as map_dbl(), which specifies the type of your output in its name.

For instance, if you want your output to be a numeric “double” vector, you can use map_dbl():

map_dbl(1:10, exp)

 [1]     2.718282     7.389056    20.085537    54.598150   148.413159
 [6]   403.428793  1096.633158  2980.957987  8103.083928 22026.465795

and if you want it to be a character vector, you can use map_chr():

map_chr(gapminder, class)

    country        year         pop   continent     lifeExp   gdpPercap 
"character"   "integer"   "numeric" "character"   "numeric"   "numeric"

Here, recall that the gapminder data frame is a list, and the map_ function is iterating over the elements of that list, which in this case are the columns.
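purrr provides other typed variants that work the same way, such as map_lgl() for logical output and map_int() for integer output. The sketch below uses a small made-up data frame called toy (its values are illustrative only, not taken from gapminder) so that it can be run without the data file.

```r
library(purrr)

# A made-up data frame for illustration only
toy <- data.frame(
  country = c("A", "B", "C"),
  year    = c(2002L, 2002L, 2007L),
  lifeExp = c(70.5, 55.2, 81.3)
)

# Is each column numeric? Returns a named logical vector
map_lgl(toy, is.numeric)

# How many entries does each column have? Returns a named integer vector
map_int(toy, length)
```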

Note that the output of the function you are applying must match the type implied by the map_ function that you use; otherwise you will get an error:

map_dbl(1:10, class)

Error in `map_dbl()`:
ℹ In index: 1.
Caused by error:
! Can't coerce from a string to a double vector.
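The error arises because class() returns a character string, which map_dbl() cannot turn into a number. As a minimal sketch, switching to the variant that matches the output type resolves it:

```r
library(purrr)

# class() returns a string, so the character variant is the right match
map_chr(1:3, class)
```

Each element of 1:3 is an integer, so the result is a character vector containing "integer" three times.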

The true power of the map functions really comes once you learn how to write your own functions.

For example, we could conduct the following transformation to each entry in 1:10:

map_dbl(1:10, function(x) (x + 1)^10)

 [1]        1024       59049     1048576     9765625    60466176   282475249
 [7]  1073741824  3486784401 10000000000 25937424601

An even more compact way of writing a function with one argument is to use the ~ and . anonymous function shorthand:

map_dbl(1:10, ~(. + 1)^10)

 [1]        1024       59049     1048576     9765625    60466176   282475249
 [7]  1073741824  3486784401 10000000000 25937424601

This shorthand involves:

Replacing function(x) with ~
Replacing the argument in the body of the function with .
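As a sketch, the same transformation can be written in several equivalent ways. In the ~ shorthand, purrr also accepts .x in place of ., and R 4.1.0 and later provides the base \(x) shorthand for function(x). The variable names below are chosen only for this illustration.

```r
library(purrr)

full_fn    <- map_dbl(1:10, function(x) (x + 1)^10)  # full anonymous function
formula_fn <- map_dbl(1:10, ~ (.x + 1)^10)           # purrr shorthand with .x
lambda_fn  <- map_dbl(1:10, \(x) (x + 1)^10)         # base R shorthand (R >= 4.1.0)

# All three give the same result
identical(full_fn, formula_fn)
identical(full_fn, lambda_fn)
```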

The map_df() function will return a data frame (but requires that the function being applied outputs a data frame).

As an example, the following code takes each entry in the vector c(1, 4, 7), adds 10 to it, and returns a two-column data frame containing the old number and the new number:

map_df(c(1, 4, 7), function(.x) {
  return(data.frame(old_number = .x, 
                    new_number = .x + 10))
})

  old_number new_number
1          1         11
2          4         14
3          7         17

To help see how this works, set .x to be one of the values in the vector and run the body of the code:

.x <- 1
data.frame(old_number = .x, 
           new_number = .x + 10)

  old_number new_number
1          1         11

As another example, the following code takes the gapminder dataset, selects the pop, gdpPercap, and lifeExp columns, and then computes a data frame for each column/variable containing its mean and standard deviation.

gapminder |>
  select(pop, gdpPercap, lifeExp) |>
  map_df(function(.x) data.frame(mean = mean(.x),
                                 sd = sd(.x)),
         .id = "variable")

   variable         mean           sd
1       pop 2.960121e+07 1.061579e+08
2 gdpPercap 7.215327e+03 9.857455e+03
3   lifeExp 5.947444e+01 1.291711e+01

To see what is happening here, note that map_df() is computing the following for each variable:

.x <- gapminder$pop
data.frame(mean = mean(.x), 
           sd = sd(.x))

      mean        sd
1 29601212 106157897

and is then pasting the results for each variable together into a single data frame.
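Since purrr 1.0.0, map_df() is marked as superseded (it still works), and the same "compute a data frame per element, then stack them" pattern can be written as map() followed by list_rbind(). The sketch below shows the two steps explicitly on a small made-up data frame called toy (illustrative values only) so that it runs without the gapminder file.

```r
library(purrr)

# A made-up data frame for illustration only
toy <- data.frame(x = c(1, 2, 3), y = c(10, 20, 60))

summary_df <- toy |>
  # Step 1: one single-row data frame per column, returned as a named list
  map(\(col) data.frame(mean = mean(col), sd = sd(col))) |>
  # Step 2: stack the rows, storing the list names in a "variable" column
  list_rbind(names_to = "variable")

summary_df
```

Here names_to = "variable" plays the same role as .id = "variable" in map_df().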

Challenge 1

For each column in the gapminder dataset, compute the number of unique entries using the n_distinct() function. Make sure the output of your code is a numeric vector.

Do this in two different ways: using a for loop and a map_dbl() function.

Hint: n_distinct() is a dplyr function that counts the number of unique/distinct values in a vector. Try n_distinct(c(1, 1, 4, 4, 4, 4, 1, 3)) as an example of its usage.

Solution to Challenge 1

unique_gapminder <- c()
for (i in seq_len(ncol(gapminder))) {
  unique_gapminder[i] <- n_distinct(gapminder[, i])
}
unique_gapminder

[1]  142   12 1704    5 1626 1704

map_dbl(gapminder, ~n_distinct(.))

  country      year       pop continent   lifeExp gdpPercap 
      142        12      1704         5      1626      1704

Challenge 2

Use map_df() to compute the number of distinct values and the class of each variable in the gapminder dataset and store them in a data frame.

The output of your code should look like this:

   variable n_distinct     class
1   country        142 character
2      year         12   integer
3       pop       1704   numeric
4 continent          5 character
5   lifeExp       1626   numeric
6 gdpPercap       1704   numeric

Hint: the .id = "variable" argument of map_df() can be used to add the variable column automatically based on the gapminder column names.

Solution to Challenge 2

gapminder |>
  map_df(function(.x) {
    data.frame(n_distinct = n_distinct(.x),
               class = class(.x))
  }, .id = "variable")

   variable n_distinct     class
1   country        142 character
2      year         12   integer
3       pop       1704   numeric
4 continent          5 character
5   lifeExp       1626   numeric
6 gdpPercap       1704   numeric