9 Data visualization with ggplot2
Plotting our data is one of the best ways to quickly explore it and the various relationships between variables.
Base R has its own plotting system, but most people use the ggplot2 package from the tidyverse suite of packages.
ggplot2 is built on the grammar of graphics, the idea that any plot can be built from the same set of components: a data set, a coordinate system, and a set of geoms (the visual representation of data points).
The key to understanding ggplot2 is thinking about a figure in layers. This idea may be familiar to you if you have used image editing programs like Photoshop, Illustrator, or Inkscape.
Let’s start with an example:
library(tidyverse)
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
So the first thing we do is call the ggplot() function. This function lets R know that we’re creating a new plot, and any of the arguments we give the ggplot function are the global options for the plot: they apply to all layers on the plot.
We’ve passed in two arguments to ggplot.
First, we tell ggplot what data we want to show on our figure, in this example the gapminder data we read in earlier.
For the second argument, we passed in the aes function, which tells ggplot how variables in the data map to aesthetic properties of the figure, in this case the x and y locations. Here we told ggplot we want to plot the “gdpPercap” column of the gapminder data frame on the x-axis, and the “lifeExp” column on the y-axis.
Notice that we didn’t need to explicitly pass aes these columns (e.g. x = gapminder[, "gdpPercap"]). This is because ggplot is smart enough to know to look in the data for that column!
However, by itself, the call to ggplot isn’t enough to draw a figure:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
We need to tell ggplot how we want to visually represent the data, which we do by adding a new geom layer. In our example, we used geom_point, which tells ggplot we want to visually represent the relationship between x and y as a scatterplot of points:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
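To make the layering idea concrete, here is a minimal sketch on a small made-up data frame (toy is hypothetical and only stands in for gapminder). It assumes that a plot object exposes its layers as a list through p$layers, which is how ggplot2 has stored them to date. The bare ggplot() call carries the global options but no layers; adding geom_point() adds one.

```r
library(ggplot2)

# Hypothetical stand-in for the gapminder data (values are made up)
toy <- data.frame(
  gdpPercap = c(500, 2000, 8000, 30000),
  lifeExp   = c(45, 58, 70, 79)
)

# Global options only: data and aesthetic mappings, but nothing to draw yet
p_empty <- ggplot(data = toy, mapping = aes(x = gdpPercap, y = lifeExp))

# Adding a geom layer tells ggplot2 how to represent the data
p_points <- p_empty + geom_point()

length(p_empty$layers)   # number of layers before adding a geom
length(p_points$layers)  # number of layers after adding geom_point()
```

Printing p_empty would show only the axes and an empty panel, while p_points would show the scatterplot.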
9.1 Line plots
Using a scatterplot probably isn’t the best way to visualize change over time. Instead, let’s tell ggplot to visualize the data as a line plot (dropping the argument names for brevity):
ggplot(gapminder, aes(x = year, y = lifeExp, color = continent)) +
  geom_line()
Instead of adding a geom_point layer, we’ve added a geom_line layer.
However, the result doesn’t look quite as we might have expected: the line seems to jump around a lot within each continent. This is because we haven’t told ggplot2 to plot a separate line for each country, so all observations from a continent are joined into a single line. We can fix that by adding a group aesthetic inside the aes() function:
ggplot(gapminder, aes(x = year, y = lifeExp, color = continent, group = country)) +
  geom_line()
The group aesthetic tells ggplot to draw a line for each country.
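The effect of the group aesthetic can be checked without looking at the picture. The sketch below uses a made-up data frame (toy, with hypothetical countries A, B, C on continents X and Y) and the layer_data() helper from ggplot2, which returns the data computed for a layer, including a group column. It is an illustration under those assumptions, not part of the gapminder analysis.

```r
library(ggplot2)

# Hypothetical toy data: three countries on two continents (values are made up)
toy <- data.frame(
  country   = rep(c("A", "B", "C"), each = 3),
  continent = rep(c("X", "X", "Y"), each = 3),
  year      = rep(c(2000, 2005, 2010), times = 3),
  lifeExp   = c(60, 62, 64, 70, 71, 73, 50, 55, 58)
)

# Without group: observations are grouped by the discrete variable continent
p_continent <- ggplot(toy, aes(x = year, y = lifeExp, color = continent)) +
  geom_line()

# With group = country: one line per country
p_country <- ggplot(toy, aes(x = year, y = lifeExp, color = continent, group = country)) +
  geom_line()

# Count the distinct groups, i.e. the number of separate lines drawn
n_lines_continent <- length(unique(layer_data(p_continent)$group))
n_lines_country   <- length(unique(layer_data(p_country)$group))
```

With this toy data, the first plot joins countries A and B into one zig-zagging line for continent X, which is the same "jumping around" seen in the gapminder plot.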
But what if we want to visualize both lines and points in the plot? We can add another layer to the plot:
ggplot(gapminder, aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line() +
  geom_point()
It’s important to note that each layer is drawn on top of the previous layer. In this example, the points have been drawn on top of the lines. Here’s a demonstration:
ggplot(gapminder, aes(x = year, y = lifeExp, group = country)) +
  geom_line(aes(color = continent)) +
  geom_point()
In this example, the aesthetic mapping of color has been moved from the global plot options in ggplot to the geom_line layer, so it no longer applies to the points layer. Now we can clearly see that the points are drawn on top of the lines.
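As a rough illustration of global versus layer-specific mappings, the sketch below builds the same kind of plot on a made-up data frame (toy is hypothetical) and inspects the stored layers. It assumes that each layer keeps its geom and its own mapping in l$geom and l$mapping, and that aes() stores color under the name "colour"; these are internal details of ggplot2 objects, so treat the inspection code as exploratory.

```r
library(ggplot2)

# Hypothetical toy data: three countries on two continents (values are made up)
toy <- data.frame(
  country   = rep(c("A", "B", "C"), each = 3),
  continent = rep(c("X", "X", "Y"), each = 3),
  year      = rep(c(2000, 2005, 2010), times = 3),
  lifeExp   = c(60, 62, 64, 70, 71, 73, 50, 55, 58)
)

p <- ggplot(toy, aes(x = year, y = lifeExp, group = country)) +
  geom_line(aes(color = continent)) +
  geom_point()

# Layers are stored, and drawn, in the order they were added
layer_geoms <- vapply(p$layers, function(l) class(l$geom)[1], character(1))

# Only the line layer carries its own colour mapping
line_has_colour  <- "colour" %in% names(p$layers[[1]]$mapping)
point_has_colour <- "colour" %in% names(p$layers[[2]]$mapping)
```

Swapping the order of geom_line() and geom_point() in the call would swap the drawing order, putting the lines on top of the points.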
9.2 Transformations
ggplot2 also makes it easy to overlay statistical models over the data. To demonstrate, we’ll go back to our earlier example:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
Currently, it’s hard to see the relationship between the points due to some strong outliers in GDP per capita. We can change the scale of units on the x-axis using the scale functions. These control the mapping between the data values and visual values of an aesthetic.
We can also modify the transparency of the points using the alpha argument, which is especially helpful when you have a large amount of data that is very clustered.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) +
  scale_x_log10()
The scale_x_log10() function applied a log10 transformation to the x-axis scale, so that each power of 10 is evenly spaced from left to right. For example, a GDP per capita of 1,000 is the same horizontal distance from 10,000 as 10,000 is from 100,000. This helps to visualize the spread of the data along the x-axis.
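The spacing claim is plain arithmetic on base-10 logarithms, as this small base R sketch shows (the three values are the ones used in the example above):

```r
gdp <- c(1000, 10000, 100000)

# Positions on a log10 axis are the base-10 logarithms of the values
log_positions <- log10(gdp)

# Each step up by a factor of 10 covers the same distance on the log axis ...
steps_log <- diff(log_positions)

# ... whereas on a linear axis the second step is ten times longer
steps_linear <- diff(gdp)
```

Here steps_log holds two equal steps of 1, while steps_linear holds 9000 and 90000.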
We can also fit a simple relationship to the data by adding another layer, geom_smooth():
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  scale_x_log10()
`geom_smooth()` using formula = 'y ~ x'
We can make the line thicker by setting the linewidth aesthetic in the geom_smooth layer:
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", linewidth = 1.5) +
  scale_x_log10()
`geom_smooth()` using formula = 'y ~ x'
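One point worth checking is what the fitted line represents. ggplot2 applies scale transformations before it computes statistics, so with scale_x_log10() the straight line corresponds to a linear model of lifeExp on log10(gdpPercap), not on the raw values. The sketch below compares the two on a made-up data frame (toy is hypothetical); it passes formula = y ~ x explicitly only to silence the message shown above.

```r
library(ggplot2)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  gdpPercap = c(400, 900, 2500, 6000, 15000, 40000),
  lifeExp   = c(44, 52, 60, 68, 74, 80)
)

p <- ggplot(toy, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x) +
  scale_x_log10()

# Fitted line as computed by ggplot2 (layer 2); x is on the log10 scale here
smooth <- layer_data(p, 2)

# The same model fitted by hand on the log-transformed predictor
fit <- lm(lifeExp ~ log10(gdpPercap), data = toy)
by_hand <- predict(fit, newdata = data.frame(gdpPercap = 10^smooth$x))

# Largest disagreement between the two sets of fitted values
max_diff <- max(abs(smooth$y - by_hand))
```

If max_diff is effectively zero, the line on the plot and the hand-fitted model agree.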
9.3 Multi-panel figures
Earlier we visualized the change in life expectancy over time across all countries in one plot like this:
ggplot(gapminder, aes(x = year, y = lifeExp, color = continent, group = country)) +
  geom_line()
Another way to view this data is to split this out over multiple panels by adding a layer of facet panels.
Since there are a lot of countries, we will first filter just to the “Americas”:
gapminder |>
  filter(continent == "Americas") |>
  ggplot() +
  geom_line(aes(x = year, y = lifeExp)) +
  # make a separate panel for each country
  facet_wrap(~country) +
  # rotate the x-axis labels by 90 degrees
  theme(axis.text.x = element_text(angle = 90))
The facet_wrap layer took a “formula” as its argument, denoted by the tilde (~). This tells R to draw a panel for each unique value in the country column of the data passed to ggplot, here the gapminder rows for the Americas.
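The "one panel per unique value" rule can be verified on a made-up data frame (toy is hypothetical). The sketch counts panels through the PANEL column that layer_data() returns, and also shows vars(country), which ggplot2 accepts as an alternative to the formula notation.

```r
library(ggplot2)

# Hypothetical toy data: three countries (values are made up)
toy <- data.frame(
  country = rep(c("A", "B", "C"), each = 3),
  year    = rep(c(2000, 2005, 2010), times = 3),
  lifeExp = c(60, 62, 64, 70, 71, 73, 50, 55, 58)
)

# Formula notation, as used in the text
p_formula <- ggplot(toy) +
  geom_line(aes(x = year, y = lifeExp)) +
  facet_wrap(~country)

# Equivalent specification with vars()
p_vars <- ggplot(toy) +
  geom_line(aes(x = year, y = lifeExp)) +
  facet_wrap(vars(country))

# Each row of the layer data records the panel it is drawn in
n_panels_formula <- length(unique(layer_data(p_formula)$PANEL))
n_panels_vars    <- length(unique(layer_data(p_vars)$PANEL))
n_countries      <- length(unique(toy$country))
```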
9.4 Modifying labels
To clean this figure up for a publication we need to change some of the text elements. The x-axis is too cluttered, and the y-axis should read “Life expectancy”, rather than the column name in the data frame.
We can do this by adding a couple of different layers. The theme layer controls the axis text and overall text size. Labels for the axes, plot title, and any legend can be set using the labs function.
gapminder |>
  filter(continent == "Americas") |>
  ggplot() +
  geom_line(aes(x = year, y = lifeExp)) +
  # make a separate panel for each country
  facet_wrap(~country) +
  labs(
    x = "Year",
    y = "Life expectancy",
    title = "Life expectancy by year in the Americas"
  ) +
  # rotate the x-axis labels by 90 degrees and right-align them
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
9.5 Built-in themes
There are several themes for making your plots even prettier. For example, theme_classic():
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent)) +
  scale_x_log10() +
  theme_classic()
theme_minimal():
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent)) +
  scale_x_log10() +
  theme_minimal()
theme_bw():
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent)) +
  scale_x_log10() +
  theme_bw()
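Since the three plots differ only in their theme, one possible way to avoid repeating the code is to store the shared part in a variable and add a theme to it each time. The sketch below does this on a made-up data frame (toy and base_plot are hypothetical names, not part of the original example).

```r
library(ggplot2)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  gdpPercap = c(500, 2000, 8000, 30000),
  lifeExp   = c(45, 58, 70, 79),
  pop       = c(2e6, 5e7, 1e7, 3e8),
  continent = c("X", "X", "Y", "Y")
)

# Shared part of the plot: data, mappings, geom, and scale
base_plot <- ggplot(toy, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent)) +
  scale_x_log10()

# Adding a theme returns a new plot and leaves base_plot unchanged
themed <- list(
  classic = base_plot + theme_classic(),
  minimal = base_plot + theme_minimal(),
  bw      = base_plot + theme_bw()
)

# Display one of them, e.g. in an interactive session
# themed$bw
```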
9.6 Exporting the plot
The ggsave() function allows you to export a plot created with ggplot2. You can specify the dimensions and resolution of your plot by adjusting the appropriate arguments (width, height, and dpi) to create high-quality graphics for publication. To save the plot from above, we first assign it to a variable facet_plot, then tell ggsave to save that plot in png format to a directory called results. (Make sure you have a results/ folder in your working directory.)
facet_plot <- gapminder |>
  filter(continent == "Americas") |>
  ggplot() +
  geom_line(aes(x = year, y = lifeExp)) +
  # make a separate panel for each country
  facet_wrap(~country) +
  labs(
    x = "Year",
    y = "Life expectancy",
    title = "Life expectancy by year in the Americas"
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggsave(
  filename = "results/lifeExp.png", plot = facet_plot,
  width = 12, height = 10, units = "cm", dpi = 300
)
There are two nice things about ggsave. First, it defaults to the last plot, so if you omit the plot argument it will automatically save the last plot you created with ggplot. Secondly, it tries to determine the format you want to save your plot in from the file extension you provide for the filename (for example .png or .pdf). If you need to, you can specify the format explicitly in the device argument.
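A small sketch of the two ways of choosing the output format follows. It uses a made-up plot (toy_plot is hypothetical), writes to a temporary directory so that it does not depend on a results/ folder, and uses PDF rather than PNG only because the pdf device ships with base R.

```r
library(ggplot2)

# Hypothetical toy plot (values are made up)
toy <- data.frame(year = c(2000, 2005, 2010), lifeExp = c(60, 62, 64))
toy_plot <- ggplot(toy, aes(x = year, y = lifeExp)) +
  geom_line()

# Temporary output directory for this sketch
out_dir <- tempdir()

# Format inferred from the ".pdf" file extension
pdf_path <- file.path(out_dir, "toy_plot.pdf")
ggsave(filename = pdf_path, plot = toy_plot,
       width = 12, height = 10, units = "cm")

# Format given explicitly through the device argument
explicit_path <- file.path(out_dir, "toy_plot_explicit.pdf")
ggsave(filename = explicit_path, plot = toy_plot, device = "pdf",
       width = 12, height = 10, units = "cm")

# Check that both files were written
saved <- file.exists(c(pdf_path, explicit_path))
```

Passing plot explicitly, as here, avoids relying on whichever plot happened to be created last.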
This is a taste of what you can do with ggplot2.
RStudio provides a really useful cheat sheet of the different layers available, and more extensive documentation is available on the ggplot2 website. Finally, if you have no idea how to change something, a quick Google search will usually send you to a relevant question and answer on Stack Overflow with reusable code to modify!