A Summer of RStudio and ggplot2

For those of you wondering why I haven’t been tweeting and/or blogging about mud and lakes all summer, it’s because I had the incredible opportunity to spend the summer as an RStudio intern working with Hadley Wickham on ggplot2! It was a welcome change of pace from writing articles about mud in lakes, and I’m sad the internship is coming to a close. I had the opportunity to work alongside a lot of great interns at a fantastic company, prepare tons of issues for tidy-dev-day at UseR!, and develop a few humble new features for ggplot2! Here are a few of them:


New vignette on using ggplot2 in packages

I joined as an intern just prior to the release of ggplot2 3.2.0. Before all ggplot2 releases, we run the CMD check on every reverse dependency (there are 2,622 of them as of the last revdep check) with the CRAN version and with the release candidate to make sure we don’t introduce new failures. In sifting through the failures, it became clear that there was no documentation about how to use ggplot2 in a package in a way that wouldn’t be likely to break in the future. Thus, the Using ggplot2 in packages vignette was born! It covers how to refer to ggplot2 functions, how to create a mapping without triggering a CMD check error, and best practices for common ggplot2 uses in packages (like creating a theme or visualizing an object).

coord_trans() improvements

The difference between a ggplot with scale_(x|y)_log10() and a ggplot with coord_trans((x|y) = "log10") is a common reason that issues get opened in ggplot2. In short, scale_(x|y)_log10() applies log10() to the x and/or y aesthetics before anything happens, including computing any statistics. Using coord_trans((x|y) = "log10") applies log10() after everything happens. This means that a geom_boxplot() with scale_(x|y)_log10() is going to have different outliers (say) than a geom_boxplot() with coord_trans((x|y) = "log10").

p <- ggplot(diamonds, aes(cut, price)) + geom_boxplot()

  p + scale_y_log10(),
  p + coord_trans(y = "log10"),
  nrow = 1

It’s common for an issue to be opened for cases where this is non-intuitive (stat_summary() comes to mind - it’s not intuitive that summary statistics are not calculated on the original data), and the response is often that coord_trans() should be used instead of a transformed scale. However, there were problems with the expansion of discrete scales in coord_trans() that prevented coord_trans() from being a viable solution. In the PR fixing this, I also fixed a problem with second axes in coord_trans(), and made sure that the "reverse" trans worked (it didn’t before, but it doesn’t appear that anybody noticed). Hopefully coord_trans() is now ready to serve as a drop-in replacement when scale_(x|y)_log10() gives non-intuitive results!

NA limits in coord_cartesian()

Another common source of confusion in ggplot2 is the difference between scale_(x|y)_continuous(limits = ...) and coord_cartesian((x|y)lim = ...). When setting scale limits (this includes xlim() and ylim()), data is “censored” by default, meaning values outside this range magically turn into NA and disappear; when setting the coordinate system limits, the data are still exist, but data outside the (expanded) limits are not shown.

  p + scale_y_continuous(limits = c(0, 10000)),
  p + coord_cartesian(ylim = c(0, 10000))
## Warning: Removed 5222 rows containing non-finite values (stat_boxplot).

In this example, using scale limits leads to displaying spurious information about where the min and max of the data are. When this issue comes up, the response is usually that the user should use coord_cartesian(ylim = ...) instead of scale_y_continuous(limits = ...). Scale limits have this awesome feature where you can pass NA as one or more of the limits to refer to the minimum or maximum of the data, but this previously wasn’t possible for coordinate system limits. Now it is! It’s particularly useful with facets where scales = "free":

ggplot(diamonds, aes(color, price)) +
  geom_boxplot() +
  facet_wrap(vars(cut), scales = "free_y") +
  coord_cartesian(ylim = c(0, NA))

Axis guide improvements

When I started my internship, there was a long-standing open issue about overlapping axis text. Previously, it was impossible to do any customization of axes other than change the breaks and/or labels in the scale_*() functions, which could be customized a bit using theme(), and anything else was a crazy workaround. After this pull request is merged, axes will use the same guide system that powers guide_legend() and guide_colourbar(), such that you will be able to customize how axes are drawn (and in the future create custom ones!). This feature comes with a couple improvements for dealing with overlapping text in the new guide_axis() function:

# this isn't merged yet, so you'll need to
# remotes::install_github("tidyverse/ggplot2#3398")
ggplot(mpg, aes(hwy, cty)) +
  geom_point() +
  # create closely-spaced breaks that will overlap
  scale_x_continuous(breaks = seq(10, 50, 2)) +
  facet_wrap(vars(drv)) +
  guides(x = guide_axis(check.overlap = TRUE))
Software Engineer & Geoscientist