Visualizing Data in R with ggplot

Overview

Teaching: 15 min
Exercises: 10 min
Questions
  • Key question

Objectives
  • Produce scatter plots using ggplot.

  • Export publication quality graphics.

Plotting with ggplot2

ggplot2 is a plotting package that makes it simple to create complex plots from data in a data frame. It provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. Therefore, we only need minimal changes if the underlying data change
or if we decide to change from a bar plot to a scatterplot. This helps in creating publication quality plots with minimal amounts of adjustments and tweaking.

ggplot graphics are built step by step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.

To use ggplot2, first we need to install the package and load it into our current R session.

install.packages("ggplot2")
library("ggplot2")

Remember, we only need to install it once, but you’ll need to load the library each time you start a new R session.

To build a ggplot, we need to:

ggplot(data = por)

You’ll see a new plot appear in your Plots pane (bottom right), but the plot is empty. This is because, although we’ve told ggplot which data we want to plot, we haven’t told it how we want that data to be plotted.

ggplot(data = por, aes(x = G1, y = G3))

Now you will see a plot that still shows no data, but now the axes are labeled and show scales. We’ve told ggplot which of the columns in our data we are going to plot, so it knows how to label the plot and what the proper scale is, but it still isn’t showing any of the actual data. This is because we haven’t specified which geometry we want to plot with. A geom is a graphical representation of the data (e.g. points, lines, bars). ggplot2 offers many different geoms; we will use some common ones today, including:

To add a geom to the plot use + operator. Because we have two continuous variables,
let’s use geom_point() first to create an x-y scatter plot.

ggplot(data = por, aes(x = G1, y = G3)) +
geom_point()

The + in the ggplot2 package is particularly useful because it allows you to modify existing ggplot objects. This means you can easily set up plot “templates” and conveniently explore different types of plots, so the above plot can also be generated with code like this:

# Assign plot to a variable
por_plot = ggplot(data = por, aes(x = G1, y = G3))

# Draw the plot
por_plot + geom_point()

Notes:

# this is the correct syntax for adding layers
por_plot +
geom_point()

# this will not add the new layer and will return an error message
por_plot
+ geom_point()

There appears to be a nice linear relationship between a student’s first grade (G1) and their final grade (G3) in the course. However, we have 649 students in our dataset and not all 649 points are showing. We need to add a little bit of randomness to the positions of the plotting symbols so that they don’t all overlap each other.

por_plot +
geom_jitter()

There’s still a lot of overplotting. We can get a better picture of the data by increasing the transparency of the points, so that we can see where the dots are the most dense.

por_plot + 
geom_jitter(alpha = 0.2)

The ability to quickly adjust your geom and other layers of your plots is good reason to assign your basic plotting object to a variable.

We can also add colors for all the points:

por_plot + 
geom_jitter(alpha = 0.2, color = "blue")

Or color students from the two schools differently:

por_plot +
geom_jitter(alpha = 0.2, aes(color=school))

Exercise

Create a scatter plot of G3 vs G1 showing the students sex in different colors.

Faceting

ggplot has a special technique called faceting that allows the user to split one plot into multiple plots based on a factor included in the dataset. We can use this to create two separate plots of G1 vs G3, one for each of our two schools.

por_plot +
geom_jitter(alpha = 0.2) +
facet_wrap(~ school)

Exercise

Add coloring by sex to the faceted plot we created above.

Solution

por_plot +
geom_jitter(alpha = 0.2, aes(color = sex)) +
facet_wrap(~ school)

Other Geoms

We might be interested in looking at the relationship between grades and some non-continuous variable, like number of previous failed courses (failures). To plot a relationship between a continuous and a non-continuous variable, we should use a box-plot. There is a geom_boxplot geometry within ggplot2.

First we will create a new base plot to build on:

failures_plot = ggplot(data = por, aes(x = as.factor(failures), y = G3))

Now we can create our boxplot.

failures_plot + 
geom_boxplot()

If we want to see the data plotted on top of our boxplot, we can add another layer.

failures_plot + 
geom_boxplot() + geom_jitter(alpha = 0.2)

Now our outliers are showing up twice (once from the boxplot and once from the dot plot). We can remove them by setting the outlier.shape in geom_boxplot to NA.

failures_plot + 
geom_boxplot(outlier.shape = NA) + geom_jitter(alpha = 0.2)

Exercise

Create a boxplot showing the relationship between students’ final grades and any non-continuous variable other than failures.

Using these features we’ve learned, we can make very complex plots quite quickly. For example, we could plot the relationship between study time and final grades for for students at each of our two schools, split by school and by whether they live in a rural or urban area. We can add our datapoints on to this boxplot and color the indidivual points by students’ sex.

ggplot(data = por, aes(x = as.factor(studytime), y = G3)) + geom_boxplot(outlier.shape = NA) + facet_grid(address ~ school) + geom_jitter(alpha = 0.2, aes(color = sex))

This plot probably shows more information than is useful. It’s important to be thoughtful in creating your final graphics for present your data in a way that is interpretable to the reader. For now, let’s use this plot as a scaffold to practice what we’ve learned.

Exercise

Change the code for the complicated plot above to:

  1. Color code the data points by mother’s education level.
  2. Switch the order of the panels so that school is in the rows and rural/urban is in the columns.
  3. Make all the data points purple.
  4. Plot against the students total grade (G1 + G2 + G3) instead of just G3.
  5. Facet by mother’s education level and father’s education level instead of by school and address type.

Solution

ggplot(data = por, aes(x = as.factor(studytime), y = G3)) + geom_boxplot(outlier.shape = NA) + facet_grid(address ~ school) + geom_jitter(alpha = 0.2, aes(color = as.factor(Medu)))
ggplot(data = por, aes(x = as.factor(studytime), y = G3)) + geom_boxplot(outlier.shape = NA) + facet_grid(school ~ address) + geom_jitter(alpha = 0.2, aes(color = sex))
ggplot(data = por, aes(x = as.factor(studytime), y = G3)) + geom_boxplot(outlier.shape = NA) + facet_grid(address ~ school) + geom_jitter(alpha = 0.2, color = "purple")
ggplot(data = por, aes(x = as.factor(studytime), y = G1 + G2 + G3)) + geom_boxplot(outlier.shape = NA) + facet_grid(address ~ school) + geom_jitter(alpha = 0.2, aes(color = sex))
ggplot(data = por, aes(x = as.factor(studytime), y = G3)) + geom_boxplot(outlier.shape = NA) + facet_grid(as.factor(Medu) ~ as.factor(Fedu)) + geom_jitter(alpha = 0.2, aes(color = sex))

Exercise

Which of the plots from the exercise above are useful ways of showing the data? Which aren’t?

Usually plots with white background look more readable when printed. We can set the background to white using the function theme_bw(). Additionally, you can remove the grid:

failures_plot + 
geom_boxplot(outlier.shape = NA) + geom_jitter(alpha = 0.2, aes(color = sex)) + 
theme_bw() +
theme(panel.grid = element_blank())

Exercise

Try one of the other pre-designed themes.

After creating your plot, you can save it to a file in your favorite format. The Export tab in the Plot pane in RStudio will save your plots at low resolution, which will not be accepted by many journals and will not scale well for posters.

Instead, use the ggsave() function, which allows you change the dimension and resolution of your plot by adjusting the appropriate arguments (width, height and dpi):

ggsave("Desktop/failures_plot.pdf")

We can increase the dpi:

ggsave("Desktop/failures_plot.pdf", dpi = 600)

Or set specific size settings according to individual journal’s requirements.

ggsave("Desktop/failures_plot.pdf", height = 6, units = "cm", dpi = 600)

Key Points

  • First key point.