Report ready figures using ggplot2

Author

Hana Travers-Smith, adapted by Brianne Boufford

In this workshop module, you will learn the basics of ggplot2 and how to set up and customize different types of plots (scatterplots, barplots, boxplots and multi-panel plots). We will also discuss what makes a show-stopping figure that is ready to publish. It’s one thing to have interesting results, but being able to communicate them effectively through visualizations is also a huge part of research.

Key principles for scientific figures:

Make the data the focus of the figure and reduce unused space
Avoid unnecessary elements or distractions - simplify axes, grid lines and legends where possible
Text and labels should be large enough to be read easily at a glance
Use appealing and color-blind friendly palettes (not the ggplot default colors please!)
Make sure to export figures as high resolution images (at least dpi = 300) with appropriate dimensions

ggplot2 basics

First off … Why use ggplot? “What about plotting with base R?” you may ask.

Base R can be great for quickly visualizing data and for exploratory analyses, but ggplot offers superior flexibility, allowing the creation of complex, publication-quality plots.

ggplot2 is an R package for visualizing data. It is based on a flexible framework that lets you build a graph by combining independent components, and it is designed so you can work iteratively: you start with the raw data as a layer, and then you add layers of customization and annotation. A ggplot is composed of seven key elements that contain all of the data and instructions for creating a graph. Elements marked with * are required for every plot:

data *
- the information that you want to visualize
mapping *
- description of how that data is mapped to aesthetic attributes
layer *
- collection of geometric elements (geoms, e.g. points, lines, polygons) and statistical transformations (stats, e.g. bins, counts, linear model)
scales
- maps the data values to aesthetics. This includes the use of colour, shape, or size.
coordinates
- how data coordinates are mapped to the plane of the graphic. Typically Cartesian (x, y), but polar coordinates (r, theta) and map projections can also be used
theme
- controls the finer visual details like font size and background colour
facet
- specifies how to split up the data into smaller subplots
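
As a rough sketch of how the seven elements fit together, the call below spells each one out explicitly. It uses the built-in mtcars dataset purely for illustration (an assumption, not part of the workshop data), and several of these components would be filled in by defaults if they were left out.

```r
library(ggplot2)

# One plot written so that each of the seven components is visible.
# mtcars is a built-in dataset, used here only for illustration.
p <- ggplot(data = mtcars,                                     # data
            mapping = aes(x = wt, y = mpg,
                          colour = factor(cyl))) +             # mapping
  geom_point() +                                               # layer (geom + default stat)
  scale_colour_brewer(palette = "Dark2", name = "Cylinders") + # scale
  coord_cartesian() +                                          # coordinates
  facet_wrap(~am) +                                            # facet
  theme_bw()                                                   # theme

p
```

Only the data, mapping and layer lines are strictly needed; removing the other four still gives a valid plot with default scales, coordinates, theme and a single panel.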

We can use the iris dataset to start to visualize some of these plot elements. Let’s take a look at what’s in this dataset.

library(ggplot2)
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Scatterplot

Okay, now let’s try to make a simple scatterplot of Sepal.Width vs. Sepal.Length, but first let’s use base R.

plot(x = iris$Sepal.Length, y = iris$Sepal.Width)

With one line of code we were able to quickly visualize our data, so now we know what we are working with. But we can make this a lot nicer with ggplot. Let’s do that now.

The first step is to pass information to ggplot with the data argument. This lets ggplot store the data to be used later.
The next step is to provide the mapping argument. This specifies to ggplot how parts of the data are mapped onto aesthetic attributes using the aes() function.

ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width))

So now we have a blank ggplot drawn, but the data doesn’t appear?

This is because ggplot doesn’t assume the graph type. It doesn’t know if you want a scatterplot or a barplot or another graph style. So this is where you need to add a layer on top of the blank ggplot to specify the graphic primitive.

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + 
  geom_point()
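
Because a ggplot is built up layer by layer, it can also be stored in an object and extended later. The sketch below is one way to see this; the object names base_plot and with_points are made up for illustration.

```r
library(ggplot2)

# Data and mapping only: a valid plot object, but with no layers to draw
base_plot <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))
length(base_plot$layers)      # no layers yet, which is why the panel is blank

# Adding a geom creates a new plot object with one layer
with_points <- base_plot + geom_point()
length(with_points$layers)    # one layer: the points

with_points
```

Storing the base plot this way makes it easy to try several geoms on the same data and mapping without retyping them.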

However, we aren’t limited to just the geom functions. We can also use ggplot to apply statistical transformations to the data.

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + # specify data and aesthetic mapping to ggplot  
  geom_point() + # add point geometry layer 
  geom_smooth(formula = y ~ x, method = "lm") # fit a linear model

But there is more to be desired from this data. We can use scales to help us visualize more information from this dataset. Let’s use colour to create groups by Species.

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + # The color parameter defines groups in the data and uses a different color for each group
  geom_point() + # aesthetics are "inherited" from the first ggplot() line 
  geom_smooth(formula = y ~ x, method = "lm")

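This gives each point a colour that corresponds to one of the unique Species classes. If you want to group the data by Species and assign each group a unique colour, the colour argument goes inside aes() because we are mapping a variable to an aesthetic. As an aside, we can also set the colour to a static value, but it matters where we do it. If the static value is placed inside aes(), ggplot treats "blue" as a data value to be scaled, so the points are drawn in a default colour and a legend entry labelled “blue” appears.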

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width,  colour = "blue")) + # The color parameter is still scaled since it is inside aes()
  geom_point() + # aesthetics are "inherited" from the first ggplot() line 
  geom_smooth(formula = y ~ x, method = "lm")

If we pull the static colour attribute outside of aes() this will properly assign the colour without scaling.

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + # The color parameter is now outside of aes() so "colour" is set to a fixed value
  geom_point(colour = "red") + # aesthetics are "inherited" from the first ggplot() line 
  geom_smooth(formula = y ~ x, method = "lm", color = "black")
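
Inheritance also works the other way around: an aesthetic can be mapped inside a single layer so that it applies to that layer only. The following is a small sketch of that idea, not a step from the workshop itself. Here colour is mapped for the points but not for the smoother, so one overall trend line is fitted instead of one per species.

```r
library(ggplot2)

p <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + # x and y are shared by all layers
  geom_point(aes(colour = Species)) +                               # colour is mapped for the points only
  geom_smooth(formula = y ~ x, method = "lm", colour = "black")     # ungrouped: a single fit across all species

p
```

Moving colour = Species back into the first ggplot() call would group both layers by species again.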

Using a few extra parameters we can make the plot look a lot nicer. The essentials:

xlab & ylab: Define x and y-labels instead of default variable names
alpha: Controls the transparency of points, lines or polygons. It varies between 0 and 1 - with values closer to 0 being more transparent. This is important when you have overlapping elements on a plot.

We can also use themes, which give fine control over the non-data elements of your plot. This includes:

theme elements: the non-data elements you can adjust, e.g. plot.title, axis.ticks.x, legend.key.height
element functions: each element has an element function that describes the visual properties of the element, e.g. element_text() sets the font size, colour and face of text elements like plot.title.
the theme() function: allows you to override the default theme elements by calling element functions, like theme(plot.title = element_text(colour = "red"))
complete themes: presets of common combinations of theme elements, like theme_classic()
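
One way to combine these pieces is to store a complete theme plus a few overrides in an object and reuse it across figures. This is only a sketch: the name report_theme and the specific sizes are illustrative choices, not a recommendation from the workshop.

```r
library(ggplot2)

# A hypothetical reusable theme: a complete theme plus a few element overrides
report_theme <- theme_classic() +
  theme(axis.text = element_text(size = 14),
        axis.title = element_text(size = 16),
        plot.title = element_text(face = "bold", colour = "grey20"))

p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  ggtitle("Iris sepal dimensions") +
  report_theme   # added like any other plot component

p
```

Keeping the theme in one object means every figure in a report gets the same fonts and sizes, and a change only has to be made in one place.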

So for our plot we will use the following:

theme_classic or theme_bw: Removes the default grey background from the plot and goes a long way in tidying up the figure. theme_classic removes gridlines but theme_bw does not.
theme: We can control many different aspects of how the plot looks. In particular we should make the axis and legend text bigger using element_text() and specifying a font size. We can also change the font face and type.

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(alpha = 0.5, size = 2)+ ## adjust point size and transparency 
  xlab("Sepal Length (cm)") + ## set x and y labels 
  ylab("Sepal Width (cm)") +
  geom_smooth(formula = y ~ x, method = "lm") +
  theme_classic() + ## a nice theme that removes the grey background and gridlines
  theme(axis.text = element_text(size = 14), ## font size of legend/axis 
        axis.title = element_text(size = 16),
        legend.text = element_text(size = 14),
        legend.title = element_text(size = 16),
        legend.position = "bottom")

This plot is getting closer, but there are still issues with the legend. Notice how the legend includes the grey fill and how points inherit the transparency from the alpha = 0.5 argument. The base ggplot colors are also a little tacky.

We can use scale_color_manual and scale_fill_manual to set the colors of the points and of the shaded confidence ribbons around the fitted lines. The online R Color picker allows you to easily copy and paste codes for all R colors! To clean up the legend we can also remove the line and fill information and override the point transparency.

ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species, fill = Species), data = iris) +
  geom_point(alpha = 0.5, size = 2) +
  scale_color_manual(name = "", values = c("slateblue", "skyblue1", "steelblue4")) + # manually set the point colors and remove "Species" from the legend 
  scale_fill_manual(name = "", values = c("slateblue", "skyblue1", "steelblue4")) + # the shaded confidence band around the lines is controlled by the "fill" 
  scale_x_continuous(breaks = seq(1, 8, by = 0.5))+ # set the x-axis breakpoints
  xlab("Sepal Length (cm)") +
  ylab("Sepal Width (cm)") +
  geom_smooth(alpha = 0.2, linewidth = 0.75, method = "lm", show.legend = FALSE)+ # adjust the transparency and size of the lines and remove the fill from the legend
  theme_classic() +
  guides(colour = guide_legend(override.aes = list(alpha = 1))) + # override the point transparency in the legend
  theme(axis.text = element_text(size = 14),
        axis.title = element_text(size = 16),
        legend.text = element_text(size = 14, face = "italic"), # add italics for species names 
        legend.title = element_text(size = 16),
        legend.position = "bottom")

This plot is looking pretty good! But if we are plotting multiple groups of data we might want to consider using a multi-panel plot with one panel per group. This is easy to set up using one of our key ggplot elements: facets. Faceting generates small multiples, each showing a different subset of the data. facet_wrap() is the most common function used to implement facets because of its flexibility. facet_wrap() makes a long ribbon of panels and wraps it into 2D, splitting the data by the variable that you specify. Some of the key arguments include:

scales: Do you want all panels to have the same x and y axis range? The default is scales = "fixed", but scales = "free" lets each panel use the min/max values of its own group. "free_x" and "free_y" free only one axis.
nrow and ncol: Set the number of rows and columns.
labeller: Controls the text label of each panel. You can change the size and font of the label in theme() using theme(strip.text = element_text()) .
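
As a quick sketch of these arguments working together, the example below stacks the panels in one column, frees only the y-axis, and swaps the strip text for full species names through labeller(). The lookup vector species_labels is a made-up helper for illustration.

```r
library(ggplot2)

# Hypothetical lookup table: factor level -> text shown in the panel strip
species_labels <- c(setosa = "Iris setosa",
                    versicolor = "Iris versicolor",
                    virginica = "Iris virginica")

p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~Species,
             ncol = 1,                                      # one column of panels
             scales = "free_y",                             # shared x-axis, separate y-axes
             labeller = labeller(Species = species_labels)) + # relabel the strips
  theme_classic() +
  theme(strip.text = element_text(size = 14, face = "italic"))

p
```

Keeping the x-axis fixed while stacking panels vertically makes it easier to compare sepal lengths across species at a glance.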

# Multi-panel plot 
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species, fill = Species)) +
  facet_wrap(~Species, scales = "free") + # plot by Species and let each panel have its own x and y axis ranges (alternative is scales = "fixed")
  geom_point(alpha = 0.5, size = 3) +
  scale_color_manual(name = "", values = c("slateblue", "skyblue1", "steelblue4")) +
  scale_fill_manual(name = "", values = c("slateblue", "skyblue1", "steelblue4")) +
  scale_x_continuous(breaks = seq(1, 8, by = 0.5))+
  xlab("Sepal Length (cm)") +
  ylab("Sepal Width (cm)") +
  geom_smooth(alpha = 0.2, method = "lm", linewidth = 0.75) +
  theme_classic() +
  theme(axis.text = element_text(size = 11),
        axis.title = element_text(size = 16),
        strip.text = element_text(size = 14, face = "italic"),
        legend.position = "none") # remove the legend

Figure 1: Linear relationships between sepal length and width (cm) by iris species. The shaded area shows the 95% confidence interval around the fitted trend.

Nice! We have made our first report ready figure! Let’s try a different type next.

Boxplots

Boxplots can be used to show the distribution of numeric data between different groups. We will use the diamonds dataset within the ggplot2 package to show how diamond price differs among cut and clarity categories.

head(diamonds)

# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

ggplot(aes(x = cut, y = price), data = diamonds) +
  geom_boxplot()

Using the fill argument we can color the boxplots by categories of cut. scale_fill_brewer() can be used to select predefined color palettes by name.

ggplot(aes(x = cut, y = price, fill = cut), data = diamonds) +
  geom_boxplot(alpha = 0.8, outlier.alpha = 0.05, outlier.size = 2)+ # adjust the size and transparency of the outliers - we could also hide them completely using outlier.shape = NA 
  scale_fill_brewer(palette = "Blues")+ # use "brewer" for set color palettes 
  xlab("Cut")+
  ylab("Price (USD)")+
  theme_classic()+
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 16),
        legend.position = "none")
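
To pick a palette name that also meets the color-blind friendly principle from the start of this module, the palette metadata can be queried directly. This sketch assumes the RColorBrewer package is installed (it is normally pulled in alongside ggplot2).

```r
library(RColorBrewer)

# brewer.pal.info is a data frame describing every ColorBrewer palette,
# including a logical `colorblind` column
cb_friendly <- subset(brewer.pal.info, colorblind)

# Palette names that can be passed to scale_fill_brewer(palette = ...)
rownames(cb_friendly)

# Preview the color-blind friendly palettes in the plotting window
display.brewer.all(colorblindFriendly = TRUE)
```

The maxcolors column in the same table shows how many distinct levels each palette can cover, which is worth checking before mapping it to a factor with many categories.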

If we want to create grouped boxplots we just need to use the fill argument to define a second grouping variable - in this case clarity.

ggplot(aes(x = cut, y = price, fill = clarity), data = diamonds) + # use the fill argument to group by a second category
  geom_boxplot(alpha = 0.6, outlier.alpha = 0.05) +
  scale_fill_brewer(palette = "Spectral") +
  xlab("Cut") +
  ylab("Price (USD)") +
  theme_classic() +
  theme(axis.text = element_text(size = 11),
        axis.title = element_text(size = 16),
        legend.position = "right")

Density plots

An alternative to boxplots, density plots also show the distribution of numeric variables by estimating a probability density function. In a density plot the area under the curve sums to 1, with “peaks” representing x-values that are more likely to occur. You can think of it like we are “smoothing” out the bars of a typical frequency histogram to create a continuous line. These plots are especially good for displaying data that is not normally distributed. Note that violin plots are ugly and should never be used. Sorry Evan :(

The figure below shows a frequency histogram with the probability density function plotted on top.

library(dplyr)
diamonds %>% filter(clarity == "I1") %>% # filter one level of clarity for plotting and pass the filtered dataset into ggplot
  ggplot(aes(x = price/1000, y = after_stat(density)))+ # scale price to 1000's of USD; after_stat(density) plots density instead of counts
  geom_histogram(binwidth = 1, color = "black", alpha = 0.3)+ # alpha controls the transparency of the fill 
  geom_density(color = "blue",linewidth = 0.8, alpha = 0.8)+
  xlab("Price (1000 USD)")+
  ylab("Density")+
  theme_classic()+
  theme(axis.text = element_text(size = 11),
        axis.title = element_text(size = 16))
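
The statement that the area under a density curve is 1 can be checked numerically. The sketch below uses base R's density() function on the same I1 subset and approximates the area with a simple Riemann sum. The result should come out very close to 1, though not exactly, because the curve is evaluated on a finite grid.

```r
library(ggplot2)  # only needed here for the diamonds dataset

# Prices (in 1000s of USD) for one clarity level
price_k <- diamonds$price[diamonds$clarity == "I1"] / 1000

# Kernel density estimate, evaluated on an evenly spaced grid of x-values
d <- density(price_k)

# Riemann sum: sum of curve heights times the grid spacing
grid_step <- diff(d$x)[1]
area <- sum(d$y) * grid_step
area
```

This is also why the y-axis of a density plot changes when x is rescaled: the heights adjust so that the total area stays at 1.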

Density plots can be used to compare the distribution of multiple groups. The fill parameter specifies the color of the area under the curve. Setting alpha to a relatively low value ensures that the plots do not block each other.

We can change the title and the names of the groups in the legend when we set the color scheme using scale_fill_manual() or scale_color_manual().

name: Defines the title of the legend
values: Defines the colors
labels: We can pass a vector of strings (one for each factor level) to specify the group names.

diamonds %>% filter(clarity == "I1" | clarity == "VVS2") %>% # filter two levels of clarity
  ggplot(aes(x = price/1000, fill = clarity, color = clarity))+ # scale the price in terms of 1000's of dollars
  geom_density(alpha = 0.4, linewidth = 0.1)+ # adjust the transparency and line width of the density plots
  scale_fill_manual(name = "Clarity", values = c("#FFB90F", "#00BFFF"))+ # define color and fill for each level of "clarity"
  scale_color_manual(name = "Clarity", values = c("#FFB90F", "#00BFFF"))+
  geom_vline(xintercept = 3.20, linetype = "dashed", color = "#FFB90F")+ # add vertical lines to show peaks in the data
  geom_vline(xintercept = 0.83, linetype = "dashed", color = "#00BFFF")+
  xlab("Price (1000 USD)")+
  ylab("Density")+
  theme_bw()+
  theme(axis.text = element_text(size = 11),
        axis.title = element_text(size = 16),
        legend.position = "bottom")
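
The positions of the dashed lines above are typed in by hand. As an alternative sketch, the peak of each group's density can be estimated from the data and passed to geom_vline() as its own small data frame. The helper density_peak() is hypothetical and written here for illustration. It relies on base R's density() with default settings, so the estimated peaks may differ slightly from values read off the plot.

```r
library(ggplot2)
library(dplyr)

# Hypothetical helper: x-value at which the estimated density is highest
density_peak <- function(x) {
  d <- density(x)
  d$x[which.max(d$y)]
}

two_levels <- diamonds %>%
  filter(clarity %in% c("I1", "VVS2"))

# One row per clarity level with the estimated peak (in 1000s of USD)
peaks <- two_levels %>%
  group_by(clarity) %>%
  summarise(peak = density_peak(price / 1000))

p <- ggplot(two_levels, aes(x = price / 1000, fill = clarity, color = clarity)) +
  geom_density(alpha = 0.4, linewidth = 0.1) +
  geom_vline(data = peaks, aes(xintercept = peak, color = clarity), # lines take their colour from the same scale
             linetype = "dashed", show.legend = FALSE) +
  xlab("Price (1000 USD)") +
  ylab("Density") +
  theme_bw()

p
```

Computing the positions this way keeps the annotation in sync with the data if the filter or the dataset changes.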

We can also use the ggridges package to stack multiple density plots vertically and compare the distribution of many groups at once.

library(ggridges)

ggplot(aes(x = price, y = clarity, fill = clarity), data = diamonds)+
  geom_density_ridges2(quantile_lines = TRUE, quantiles = 2, alpha = 0.5, linewidth = 0.2)+
  xlim(c(0, 10000))+
  xlab("Price (USD)")+
  ylab("Clarity")+
  theme_bw()+
  theme(legend.position = "none")+
  theme(axis.text = element_text(size = 11),
        axis.title = element_text(size = 16))

Barplots

We can use barplots to compare the frequency or counts of groups within data. When we use geom_bar() we provide an x value representing the variable we want to count. We can count the frequency of different types of cars using the mpg dataset. The height of the bar represents the number of rows within each class.

library(ggplot2)
head(mpg)

# A tibble: 6 × 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

ggplot(aes(x = class, fill = class), data = mpg)+
  geom_bar(alpha = 0.8, color = "black")+
  scale_fill_brewer(palette = 'Reds')+
  xlab("Class")+
  ylab("Count")+
  theme_bw()+
  theme(legend.position = "none")+
  theme(axis.text = element_text(size = 11),
        axis.title = element_text(size = 16))

A stacked barplot can show the frequency of groups within groups using the fill parameter and defining a second grouping variable.

We can also use the labels argument in scale_fill_brewer() to change the group labels in the legend.

ggplot(aes(x = class, fill = drv), data = mpg)+
  geom_bar(alpha = 0.8, color = "black")+
  scale_fill_brewer(palette = 'Spectral', name = "Drive Type", labels = c("4 wheel", "Front", "Rear"))+ # set the color palette and change the legend name and labels for each group 
  xlab("Class")+
  ylab("Count")+
  theme_bw()+
  theme(legend.position = "bottom")+
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 16),
        legend.text = element_text(size = 12))

Sometimes, instead of counting rows within a dataset, we already have a summary statistic or value calculated and we want the height of the bar to represent this value rather than a count. We can define the height of the bar using y in aes() and changing the statistic type to “identity” (which means the values are not aggregated into counts before plotting). This can be achieved using geom_col() or geom_bar(stat = "identity").

This example shows a Variable Importance plot from a Random Forests model displayed as a barplot. We have reordered the levels of Variable by Importance so that the variables with the greatest importance are plotted first. coord_flip() rotates the plot 90 degrees.

# make some sample variable importance data 
dt = data.frame(Variable = c("Temperature", "Precipitation", "Slope", "Aspect", "Elevation"), Importance = c(0.8, 0.2, 0.6, 0.3, 0.4))

ggplot(aes(x = reorder(Variable, Importance), y = Importance, fill = Variable), data = dt)+
  geom_bar(stat = "identity", color = "black", alpha = 0.8, width = 0.8)+
  coord_flip()+
  scale_fill_brewer(palette = "Blues")+
  ylab("Scaled Importance")+
  xlab("Random Forests Variable")+
  theme_bw()+
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 16),
        legend.position = "none")
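
The same pattern applies when the summary values come from your own data rather than a hand-made table. As a sketch, the example below first computes the mean highway mileage for each vehicle class in mpg with dplyr and then plots those means with geom_col(). The choice of statistic and the name hwy_summary are illustrative.

```r
library(ggplot2)
library(dplyr)

# Pre-compute one value per group: mean highway mileage by vehicle class
hwy_summary <- mpg %>%
  group_by(class) %>%
  summarise(mean_hwy = mean(hwy))

p <- ggplot(hwy_summary, aes(x = reorder(class, mean_hwy), y = mean_hwy)) +
  geom_col(fill = "steelblue4", color = "black", alpha = 0.8) + # equivalent to geom_bar(stat = "identity")
  coord_flip() +
  xlab("Class") +
  ylab("Mean highway mileage (mpg)") +
  theme_bw() +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 16))

p
```

Summarising first and plotting second keeps the numbers behind the figure available in a data frame, which is handy for reporting them in the text.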

Exporting figures from R

Now that you have made some beautiful ggplot figures it is time to export them as high quality images! Do not take a screenshot of the R plotting window and submit it! Two methods are shown below. The key parameters are:

res or dpi: Resolution of the image - should be at least 300 dpi (dots per inch)
height and width: Dimensions of the figure in units you define (“inches” in the example)

You should look at your figure after exporting it and experiment with different dimensions to make sure that everything is visible and scaled correctly. If parts of the figure are cut off you can increase the dimensions or make text size smaller.

# define plot
your_plot = ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species), data = iris)+
  geom_point()

## Export a high resolution .png of the plot 
## You will want to experiment with the height/width to get the best dimensions
png("filepath_here.png", height = 4, width = 8, units = "in", res = 300)
your_plot
dev.off()

## Using ggsave()
ggsave(filename = "filepath_here.png", plot = your_plot, height = 4, width = 8, units = "in", dpi = 300)
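
To experiment with dimensions, it can help to save the same plot at a couple of sizes and compare the files side by side. The sketch below writes to a temporary folder, and the file names and sizes are arbitrary choices for illustration. It also saves a PDF, a vector format that ggsave() selects from the file extension.

```r
library(ggplot2)

your_plot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()

# Hypothetical output location: a temporary folder so nothing is overwritten
out_dir <- tempdir()
wide_png  <- file.path(out_dir, "plot_wide.png")
small_png <- file.path(out_dir, "plot_small.png")
wide_pdf  <- file.path(out_dir, "plot_wide.pdf")

# Same plot, two sets of dimensions, to compare how text and points scale
ggsave(wide_png,  plot = your_plot, width = 8, height = 4, units = "in", dpi = 300)
ggsave(small_png, plot = your_plot, width = 4, height = 3, units = "in", dpi = 300)

# Vector output: stays sharp at any zoom level, so no dpi is needed
ggsave(wide_pdf, plot = your_plot, width = 8, height = 4, units = "in")
```

Text sizes are set in points, so the same plot looks more crowded at the smaller dimensions. That is usually the first thing to check when a figure looks wrong after export.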

Here are two examples of the same plot exported with different height/width dimensions:

Resources

Exercises

Try to recreate the following plots! Links to data descriptions included.

Using the midwest dataset from the ggplot2 package:

Tips: The color palette is “Dark2”

Using the msleep dataset in ggplot2:

Tips:

The color palette is “Spectral”.
You will need to manipulate the dataset before passing it to ggplot.