library(tidyverse)
13 Reshaping data frames with tidyr
Researchers often want to reshape their data frames from ‘wide’ to ‘longer’ layouts, or vice-versa.
In the purely ‘long’ (or ‘longest’) format, you usually have 1 column for the observed variable and the other columns are ID variables.
For the ‘wide’ format each row is often a site/subject/patient and you have multiple observation variables containing the same type of data. These can be either repeated observations over time or observations of multiple variables (or a mix of both). You may find data input may be simpler or some other applications may prefer the ‘wide’ format.
However, many of R
‘s functions have been designed assuming you have a ’tidy’ formatted dataset, in which
Each row contains all the data for a single observational unit
Each column is a distinct type of measurement
The “wide”-labeled format in the example below is actually tidy:
13.1 Getting started
The tools we will be using in this lesson come from the “tidyr” package, which is part of the tidyverse suite:
Load the tidyverse package
First, let’s look at the structure of our original gapminder data frame:
head(gapminder)
country year pop continent lifeExp gdpPercap
1 Afghanistan 1952 8425333 Asia 28.801 779.4453
2 Afghanistan 1957 9240934 Asia 30.332 820.8530
3 Afghanistan 1962 10267083 Asia 31.997 853.1007
4 Afghanistan 1967 11537966 Asia 34.020 836.1971
5 Afghanistan 1972 13079460 Asia 36.088 739.9811
6 Afghanistan 1977 14880372 Asia 38.438 786.1134
A long format of gapminder would instead look like this:
country year continent variable value
1 Afghanistan 1952 Asia pop 8.425333e+06
2 Afghanistan 1952 Asia lifeExp 2.880100e+01
3 Afghanistan 1952 Asia gdpPercap 7.794453e+02
4 Afghanistan 1957 Asia pop 9.240934e+06
5 Afghanistan 1957 Asia lifeExp 3.033200e+01
6 Afghanistan 1957 Asia gdpPercap 8.208530e+02
7 Afghanistan 1962 Asia pop 1.026708e+07
8 Afghanistan 1962 Asia lifeExp 3.199700e+01
9 Afghanistan 1962 Asia gdpPercap 8.531007e+02
10 Afghanistan 1967 Asia pop 1.153797e+07
Can you identify the difference?
Note: Some ggplots are easier to create using long-format data, but generally you want your data to be in a tidy format.
13.2 Pivoting longer with pivot_longer()
To convert the original gapminder data to the longer format, we will use the pivot_longer()
function.
pivot_longer()
makes datasets longer by increasing the number of rows and decreasing the number of columns, or ‘lengthening’ your observation variables into a single variable.
<- gapminder |>
gapminder_long pivot_longer(cols = c(pop, lifeExp, gdpPercap),
names_to = "variable",
values_to = "value")
gapminder_long
# A tibble: 5,112 × 5
country year continent variable value
<chr> <int> <chr> <chr> <dbl>
1 Afghanistan 1952 Asia pop 8425333
2 Afghanistan 1952 Asia lifeExp 28.8
3 Afghanistan 1952 Asia gdpPercap 779.
4 Afghanistan 1957 Asia pop 9240934
5 Afghanistan 1957 Asia lifeExp 30.3
6 Afghanistan 1957 Asia gdpPercap 821.
7 Afghanistan 1962 Asia pop 10267083
8 Afghanistan 1962 Asia lifeExp 32.0
9 Afghanistan 1962 Asia gdpPercap 853.
10 Afghanistan 1967 Asia pop 11537966
# … with 5,102 more rows
We first provide to pivot_longer()
a vector of column names that will be used to pivot into a longer format.
Note: pivot_longer()
also allows the alternative syntax of using the -
symbol to identify which variables are not to be pivoted (i.e. ID variables).
The next arguments to pivot_longer()
are names_to
for naming the column that will contain the variable names (variable
) and values_to
for naming the column that will contain the values of each variable. We supply these new column names as strings.
Typically, long-format is useful when you want to plot multiple variables on the same plot, e.g., a line plot of each of the three variables (pop, lifeExp, and gdpPercap) over time.
|>
gapminder_long # filter to just the data for the US
filter(country == "United States") |>
# scale each variable's value to a common scale between 0 (min) and 1
group_by(variable) |>
mutate(value_scaled = value / sd(value)) |>
ungroup() |>
# plot a line plot of each variable over time
ggplot() +
geom_line(aes(x = year, y = value_scaled, color = variable))
13.3 Pivoting wider with pivot_wider()
Let’s convert our long format data back to the (wider) tidy format using pivot_wider()
.
pivot_wider()
is the opposite of pivot_longer()
, making a dataset wider by increasing the number of columns and decreasing the number of rows.
The pivot_wider()
function takes names_from
and values_from
arguments.
To names_from
we supply the column name whose contents will be pivoted into new output columns in the widened data frame. The corresponding values will be added from the column named in the values_from
argument.
<- gapminder_long |>
gapminder_wider pivot_wider(names_from = variable, values_from = value)
head(gapminder_wider)
# A tibble: 6 × 6
country year continent pop lifeExp gdpPercap
<chr> <int> <chr> <dbl> <dbl> <dbl>
1 Afghanistan 1952 Asia 8425333 28.8 779.
2 Afghanistan 1957 Asia 9240934 30.3 821.
3 Afghanistan 1962 Asia 10267083 32.0 853.
4 Afghanistan 1967 Asia 11537966 34.0 836.
5 Afghanistan 1972 Asia 13079460 36.1 740.
6 Afghanistan 1977 Asia 14880372 38.4 786.
dim(gapminder_wider)
[1] 1704 6
Which is the same as our original gapminder data (but with some of the columns in a different order)
head(gapminder)
country year pop continent lifeExp gdpPercap
1 Afghanistan 1952 8425333 Asia 28.801 779.4453
2 Afghanistan 1957 9240934 Asia 30.332 820.8530
3 Afghanistan 1962 10267083 Asia 31.997 853.1007
4 Afghanistan 1967 11537966 Asia 34.020 836.1971
5 Afghanistan 1972 13079460 Asia 36.088 739.9811
6 Afghanistan 1977 14880372 Asia 38.438 786.1134
dim(gapminder)
[1] 1704 6