7  Subsetting data frames

8 Data frames

Data frames represent a table of data. Recall our cats data frame:

cats <-  data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))
cats
    coat weight likes_string
1 calico    2.1            1
2  black    5.0            0
3  tabby    3.2            1

Columns of a data frame are vectors of different types, each of the same length, that are organized by belonging to the same table.

In our cats example, we have a character, a numeric, and a logical column/variable. As we have seen already, each column of data.frame is a vector.

8.1 Extracting columns from a data frame

8.1.1 The dollar sign syntax $

There are several ways to extract an individual column in a data frame, including using the $ notation for extracting columns by name:

cats$coat
[1] "calico" "black"  "tabby" 

This approach returns the column as a vector.

8.1.2 The square bracket syntax with two dimensions [,]

A column can also be accessed using the square bracket notation with two dimensions:

cats[, 1]
[1] "calico" "black"  "tabby" 

The syntax df[i, j] to extract the entry in the ith row and the jth column from the data frame called df. For example, df[3, 1] will extract the single entry in the third row and first column.

A blank i or j tells R to extract all of the rows or columns, respectively. That is, df[, 1] will extract all rows for the 1st column, and df[2, ] will extract the second row across all columns.

  • df[, j] will extract the jth column from the data frame called df as a vector.

  • df[i, ] will extract the ith row from the data frame called df as a data frame.

For example, the following code extracts the data from the second column of cats as a vector

cats[, 2]
[1] 2.1 5.0 3.2

and the following code extracts the second row of cats as a data frame:

cats[2, ]
   coat weight likes_string
2 black      5            0

8.1.3 The square bracket syntax with one dimension []

Note, to extract the \(j\)th column, you can alternatively use the single-dimension square bracket syntax: df[j].

cats[2]
  weight
1    2.1
2    5.0
3    3.2

This syntax also works with named indexing.

cats["weight"]
  weight
1    2.1
2    5.0
3    3.2

This, however, returns a one-dimensional data frame, rather than a vector.

8.1.4 The double bracket syntax with one dimension [[]]

Finally, another way to extract a column from a data frame is using the double bracket syntax [[]]:

cats[[2]]
[1] 2.1 5.0 3.2

This also works with named indexing:

cats[["weight"]]
[1] 2.1 5.0 3.2

Can you see how the output of indexing with this [[]] syntax differs from the single bracket syntax []?

The double bracket syntax [[]] returns the column as a vector, whereas the single bracket syntax [] returns the column as a single-column data frame.

Challenge 1

There are several subtly different ways to extract observations and elements from data.frames:

  • cats[1]
  • cats$coat
  • cats["coat"]
  • cats[1, 1]
  • cats[, 1]
  • cats[1, ]

Try out these examples and explain what is returned by each one.

Hint: Use the function class() to examine what is returned in each case.

cats[1]
    coat
1 calico
2  black
3  tabby

The single brace [1] returns the first column of the data frame as a one-column data frame (this will make more sense once we introduce lists).

The $ notation, however, lets us extract a column by name as a vector.

cats$coat
[1] "calico" "black"  "tabby" 

As for vectors, we can also use the single brace notation with named indexing:

cats["coat"]
    coat
1 calico
2  black
3  tabby

Here we are using a single brace ["coat"] replacing the index number with the column name. Like example 1, the returned object is a data frame.

Next,

cats[1, 1]
[1] "calico"

This example uses a single brace, but this time we provide row and column coordinates. The returned object is the value in row 1, column 1. The object is a vector of type character.

What about:

cats[, 1]
[1] "calico" "black"  "tabby" 

Like the previous example, we use single braces and provide row and column coordinates. The row coordinate is not specified, R interprets this missing value as all the elements in this column and returns them as a vector.

cats[1, ]
    coat weight likes_string
1 calico    2.1            1

Again we use the single brace with row and column coordinates. The column coordinate is not specified. The return value is a list containing all the values in the first row.

Tip: Renaming data frame columns

Like vectors, data frames have column names, which can be accessed with the names() function.

names(cats)
[1] "coat"         "weight"       "likes_string"

If you want to rename the second column of cats, you can assign a new name to the second element of names(cats).

names(cats)[2] <- "weight_kg"
cats
    coat weight_kg likes_string
1 calico       2.1            1
2  black       5.0            0
3  tabby       3.2            1
Tip: select and rename

In the dplyr lesson, you will learn another method of extracting columns of a data frame and renaming them using the select() and rename() functions from the dplyr package in the tidyverse suite.

8.2 Extracting multiple columns (and rows)

The square bracket syntaxes (but not the $ syntax) options allow us to select multiple columns at once:

Can you guess what the following code will return?

cats[1:2]
    coat weight_kg
1 calico       2.1
2  black       5.0
3  tabby       3.2
cats[c(3, 1), 3]
[1] 1 1

Like for vectors, we can also subset using logical operations:

The following code will select only the rows where the cat’s weight is at least 3

cats$weight > 3
[1] FALSE  TRUE  TRUE
cats[cats$weight > 3, ]
   coat weight_kg likes_string
2 black       5.0            0
3 tabby       3.2            1

The following code will select only the rows where the cat’s coat is tabby or calico and returns just the values in the “weight” column:

cats[cats$coat %in% c("tabby", "calico"), "weight"]
NULL

8.3 Gapminder example

Let’s move away from cats to a more interesting data example: gapminder

gapminder <- read.csv("data/gapminder_data.csv")

The gapminder dataset contains information on the population, life expectancy, GDP per capita for a range of country-year combinations:

head(gapminder)
      country year      pop continent lifeExp gdpPercap
1 Afghanistan 1952  8425333      Asia  28.801  779.4453
2 Afghanistan 1957  9240934      Asia  30.332  820.8530
3 Afghanistan 1962 10267083      Asia  31.997  853.1007
4 Afghanistan 1967 11537966      Asia  34.020  836.1971
5 Afghanistan 1972 13079460      Asia  36.088  739.9811
6 Afghanistan 1977 14880372      Asia  38.438  786.1134
dim(gapminder)
[1] 1704    6
Challenge 2

Each of the following pieces of code are incorrect. Identify the error and fix each of the following common data frame subsetting errors:

  1. Extract observations collected for the year 1957
gapminder[gapminder$year = 1957, ]
  1. Extract all columns except 1 through to 4
gapminder[, -1:4]
  1. Extract the rows where the life expectancy is longer the 80 years
gapminder[gapminder$lifeExp > 80]
  1. Extract the first row, and the fourth and fifth columns (continent and lifeExp).
gapminder[1, 4, 5]
  1. Advanced: extract rows that contain information for the years 2002 and 2007
gapminder[gapminder$year == 2002 | 2007, ]

Fix each of the following common data frame subsetting errors:

  1. Extract observations collected for the year 1957
# gapminder[gapminder$year = 1957, ]
gapminder[gapminder$year == 1957, ]
  1. Extract all columns except 1 through to 4
# gapminder[, -1:4]
gapminder[, -c(1:4)]
  1. Extract the rows where the life expectancy is longer than 80 years
# gapminder[gapminder$lifeExp > 80]
gapminder[gapminder$lifeExp > 80, ]
  1. Extract the first row, and the fourth and fifth columns (continent and lifeExp).
# gapminder[1, 4, 5]
gapminder[1, c(4, 5)]
  1. Advanced: extract rows that contain information for the years 2002 and 2007
# gapminder[gapminder$year == 2002 | 2007, ]
gapminder[gapminder$year == 2002 | gapminder$year == 2007, ]
# or
gapminder[gapminder$year %in% c(2002, 2007), ]
Challenge 3

Selecting elements of a vector that match “any of” a list of components is a very common data analysis task.

Suppose we want to pull out information from Southeast Asia, which includes the countries of Myanmar, Thailand, Cambodia, Vietnam, and Laos.

Your task is to subset the data just to the rows that include the countries in Southeast Asia

  1. Define a logical vector that is TRUE for all of the countries in southeast Asia and FALSE otherwise.

  2. Use this logical vector to extract just the rows for Southeast Asia from the gapminder dataset.

se_asia = gapminder$country %in% c("Myanmar", "Thailand", "Cambodia", "Vietnam", "Laos")
gapminder[se_asia, ]
      country year      pop continent lifeExp gdpPercap
217  Cambodia 1952  4693836      Asia  39.417  368.4693
218  Cambodia 1957  5322536      Asia  41.366  434.0383
219  Cambodia 1962  6083619      Asia  43.415  496.9136
220  Cambodia 1967  6960067      Asia  45.415  523.4323
221  Cambodia 1972  7450606      Asia  40.317  421.6240
222  Cambodia 1977  6978607      Asia  31.220  524.9722
223  Cambodia 1982  7272485      Asia  50.957  624.4755
224  Cambodia 1987  8371791      Asia  53.914  683.8956
225  Cambodia 1992 10150094      Asia  55.803  682.3032
226  Cambodia 1997 11782962      Asia  56.534  734.2852
227  Cambodia 2002 12926707      Asia  56.752  896.2260
228  Cambodia 2007 14131858      Asia  59.723 1713.7787
1045  Myanmar 1952 20092996      Asia  36.319  331.0000
1046  Myanmar 1957 21731844      Asia  41.905  350.0000
1047  Myanmar 1962 23634436      Asia  45.108  388.0000
1048  Myanmar 1967 25870271      Asia  49.379  349.0000
1049  Myanmar 1972 28466390      Asia  53.070  357.0000
1050  Myanmar 1977 31528087      Asia  56.059  371.0000
1051  Myanmar 1982 34680442      Asia  58.056  424.0000
1052  Myanmar 1987 38028578      Asia  58.339  385.0000
1053  Myanmar 1992 40546538      Asia  59.320  347.0000
1054  Myanmar 1997 43247867      Asia  60.328  415.0000
1055  Myanmar 2002 45598081      Asia  59.908  611.0000
1056  Myanmar 2007 47761980      Asia  62.069  944.0000
1525 Thailand 1952 21289402      Asia  50.848  757.7974
1526 Thailand 1957 25041917      Asia  53.630  793.5774
1527 Thailand 1962 29263397      Asia  56.061 1002.1992
1528 Thailand 1967 34024249      Asia  58.285 1295.4607
1529 Thailand 1972 39276153      Asia  60.405 1524.3589
1530 Thailand 1977 44148285      Asia  62.494 1961.2246
1531 Thailand 1982 48827160      Asia  64.597 2393.2198
1532 Thailand 1987 52910342      Asia  66.084 2982.6538
1533 Thailand 1992 56667095      Asia  67.298 4616.8965
1534 Thailand 1997 60216677      Asia  67.521 5852.6255
1535 Thailand 2002 62806748      Asia  68.564 5913.1875
1536 Thailand 2007 65068149      Asia  70.616 7458.3963
1645  Vietnam 1952 26246839      Asia  40.412  605.0665
1646  Vietnam 1957 28998543      Asia  42.887  676.2854
1647  Vietnam 1962 33796140      Asia  45.363  772.0492
1648  Vietnam 1967 39463910      Asia  47.838  637.1233
1649  Vietnam 1972 44655014      Asia  50.254  699.5016
1650  Vietnam 1977 50533506      Asia  55.764  713.5371
1651  Vietnam 1982 56142181      Asia  58.816  707.2358
1652  Vietnam 1987 62826491      Asia  62.820  820.7994
1653  Vietnam 1992 69940728      Asia  67.662  989.0231
1654  Vietnam 1997 76048996      Asia  70.672 1385.8968
1655  Vietnam 2002 80908147      Asia  73.017 1764.4567
1656  Vietnam 2007 85262356      Asia  74.249 2441.5764
Challenge 4
  1. Why does gapminder[1:20] return an error? How does it differ from gapminder[1:20, ]?

  2. Create a new data.frame called gapminder_subset that only contains rows 1 through 9 and rows 19 through 23.

  1. gapminder[1:20] tries to extract the first 20 columns, but there are not 20 columns. gapminder[1:20, ] subsets the data to give the first 20 rows and all columns.

gapminder_subset <- gapminder[c(1:9, 19:23), ]