Annotate ggplot with text labels using built-in functions and create non-overlapping labels with the ggrepel package.

Labeling all or some of your data with text can help tell a story, even when your graph is using other cues like color and size. ggplot has a couple of built-in ways of doing this, and the ggrepel package adds some more functionality to those options.

For this demo, I’ll start with a scatter plot looking at percentage of adults with at least a four-year college degree vs. known Covid-19 cases per capita in Massachusetts counties. (The theory: A college education might mean you’re more likely to have a job that lets you work safely from home. Of course there are plenty of exceptions, and many other factors affect infection rates.)

If you want to follow along, you can get the code to re-create my sample data at the end of this article.

Creating a scatter plot with ggplot

To start, the code below loads several libraries and sets scipen = 999 so I don’t get scientific notation in my graphs:

    library(ggplot2)
    library(ggrepel)
    library(dplyr)
    options(scipen = 999)

Here is the data structure for the ma_data data frame:

    head(ma_data)
                    Place AdultPop Bachelors PctBachelors CovidPer100K Positivity    Region
    1          Barnstable   165336     70795    0.4281887          7.0     0.0188 Southeast
    2           Berkshire    92946     31034    0.3338928          9.0     0.0095      West
    3             Bristol   390230    109080    0.2795275         30.8     0.0457 Southeast
    4 Dukes and Nantucket    20756      9769    0.4706591         25.3     0.0294 Southeast
    5               Essex   538981    212106    0.3935315         29.5     0.0406 Northeast
    6            Franklin    53210     19786    0.3718474          4.7     0.0052      West

The next group of code creates a ggplot scatter plot with that data, including sizing points by total county population and coloring them by region. geom_smooth() adds a linear regression line, and I also tweak a couple of ggplot design defaults. The graph is stored in a variable called ma_graph.
    ma_graph <- ggplot(ma_data, aes(x = PctBachelors, y = CovidPer100K, size = AdultPop, color = Region)) +
      geom_point() +
      scale_x_continuous(labels = scales::percent) +
      geom_smooth(method = "lm", se = FALSE, color = "#0072B2", linetype = "dotted") +
      theme_minimal() +
      guides(size = "none")  # "none" replaces the deprecated guides(size = FALSE)

That creates a basic scatter plot:

[Figure: Basic scatter plot with ggplot2. Credit: Sharon Machlis, IDG]

However, it’s currently impossible to know which points represent which counties. ggplot’s geom_text() function adds labels to all the points:

    ma_graph +
      geom_text(aes(label = Place))

[Figure: ggplot scatter plot with default text labels.]

geom_text() uses the same color and size aesthetics as the graph by default. But sizing the text based on point size makes the small points’ labels hard to read. I can stop that behavior by setting size = NULL.

It can also be a bit difficult to read labels when they’re right on top of the points. geom_text() lets you “nudge” them a bit higher with the nudge_y argument.

There’s another built-in ggplot labeling function called geom_label(), which is similar to geom_text() but adds a box around the text. The following code using geom_label() produces the graph shown below.

    ma_graph +
      geom_label(aes(label = Place, size = NULL), nudge_y = 0.7)

[Figure: ggplot scatter plot with geom_label().]

These functions work well when points are spaced out. But if data points are closer together, labels can end up on top of each other, especially in a smaller graph. I added a fake data point close to Middlesex County in the Massachusetts data. If I re-run the code with the new data, Fake blocks part of the Middlesex label.
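One way to add such a point, sketched here as an assumption rather than as the method actually used for this article, is to bind a one-row data frame onto the existing data with base R’s rbind(). To keep the block self-contained, ma_data_small is a hypothetical two-row stand-in for ma_data, and fake_row and ma_data_fake_small are illustrative names. The values are copied from the data code at the end of this article.

```r
# Hypothetical two-row stand-in for ma_data (values copied from the article's data code)
ma_data_small <- data.frame(
  stringsAsFactors = FALSE,
  Place = c("Middlesex", "Norfolk"),
  AdultPop = c(1116442L, 488612L),
  Bachelors = c(616179L, 258768L),
  PctBachelors = c(0.551913131, 0.529598127),
  CovidPer100K = c(16.7, 13.9),
  Positivity = c(0.0165, 0.0184),
  Region = c("MetroBoston", "MetroBoston")
)

# The fake point, built with the same column names and types
fake_row <- data.frame(
  stringsAsFactors = FALSE,
  Place = "Fake",
  AdultPop = 1106400L,
  Bachelors = 610100L,
  PctBachelors = 0.5394678,
  CovidPer100K = 16.3,
  Positivity = 0.0155,
  Region = "MetroBoston"
)

# rbind() matches data frame columns by name and appends the new row
ma_data_fake_small <- rbind(ma_data_small, fake_row)
ma_data_fake_small$Place
```

The full ma_data_fake data frame used in the rest of this article is defined explicitly in the data code, so this step is optional.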
The same plotting code, pointed at ma_data_fake:

    ma_graph2 <- ggplot(ma_data_fake, aes(x = PctBachelors, y = CovidPer100K, size = AdultPop, color = Region)) +
      geom_point() +
      scale_x_continuous(labels = scales::percent) +
      geom_smooth(method = "lm", se = FALSE, color = "#0072B2", linetype = "dotted") +
      theme_minimal() +
      guides(size = "none")  # "none" replaces the deprecated guides(size = FALSE)

    ma_graph2

    ma_graph2 +
      geom_label(aes(label = Place, size = NULL, color = NULL), nudge_y = 0.75)

[Figure: ggplot2 scatter plot with default geom_label() labels on top of each other.]

Enter ggrepel.

Creating non-overlapping labels with ggrepel

The ggrepel package has its own versions of ggplot’s text and label geom functions: geom_text_repel() and geom_label_repel(). Using those functions’ defaults will automatically move one of the labels below its point so it doesn’t overlap with the other one.

As with ggplot’s geom_text() and geom_label(), the ggrepel functions allow you to set color to NULL and size to NULL. You can also use the same nudge_y argument to create more space between the labels and the points.

    ma_graph2 +
      geom_label_repel(aes(label = Place, size = NULL, color = NULL), nudge_y = 0.75)

[Figure: Scatter plot with geom_label_repel().]

The graph above has the Middlesex label above the point and the Fake label below, so there’s no risk of overlap.

Focusing attention on subsets of data with ggrepel

Sometimes you may want to label only a few points of special interest and not all of your data. You can do so by specifying a subset of data in the data argument of geom_label_repel():

    ma_graph2 +
      geom_label_repel(data = subset(ma_data_fake, Region == "MetroBoston"),
                       aes(label = Place, size = NULL, color = NULL),
                       nudge_y = 2,
                       segment.size = 0.2,
                       segment.color = "grey50",
                       direction = "x")

[Figure: Scatter plot with only some points labeled.]

Customizing labels and lines with ggrepel

There is more customization you can do with ggrepel.
For example, you can set the width and color of labels’ pointer lines with segment.size and segment.color. You can even turn label lines into arrows with the arrow argument:

    ma_graph2 +
      geom_label_repel(aes(label = Place, size = NULL),
                       arrow = arrow(length = unit(0.03, "npc"), type = "closed", ends = "last"),
                       nudge_y = 3,
                       segment.size = 0.3)

[Figure: Scatter plot with ggrepel labels and arrows.]

And you can use ggrepel to label lines in a multi-series line graph as well as points in a scatter plot. For this demo, I’ll use another data frame, mydf, which has some quarterly unemployment data for four US states. The code for that data frame is also at the end of this article. mydf has three columns: Rate, State, and Quarter.

In the graph below, I find it a little hard to see which line goes with which state, because I have to look back and forth between the lines and the legend.

    graph2 <- ggplot(mydf, aes(x = Quarter, y = Rate, color = State, group = State)) +
      geom_line() +
      theme_minimal() +
      scale_y_continuous(expand = c(0, 0), limits = c(0, NA))

    graph2

[Figure: ggplot line graph.]

In the next code block, I’ll add a label for each line in the series, and I’ll have geom_label_repel() point to the second-to-last quarter and not the last quarter. The code calculates what the second-to-last quarter is and then tells geom_label_repel() to use filtered data for only that quarter. The code uses the State column as the label, “nudges” the labels 0.75 horizontally, removes all the other data points, and gets rid of the graph’s default legend.

    second_to_last_quarter <- max(mydf$Quarter[mydf$Quarter != max(mydf$Quarter)])

    graph2 +
      geom_label_repel(data = filter(mydf, Quarter == second_to_last_quarter),
                       aes(label = State),
                       nudge_x = .75,
                       na.rm = TRUE) +
      theme(legend.position = "none")

[Figure: Line graph with ggrepel labels.]

Why not label the last quarter instead of the second-to-last one?
I tried that first, and the pointer lines ended up looking like a continuation of the graph’s data:

[Figure: Line graph with confusing label pointing lines.]

The top two lines should not be starting to trend downward at the end!

If you want to find out more about ggrepel, check out the ggrepel vignette with

    vignette("ggrepel", "ggrepel")

Code to create data used in this demo:

    ma_data <- data.frame(
      stringsAsFactors = FALSE,
      Place = c("Barnstable", "Berkshire", "Bristol",
                "Dukes and Nantucket", "Essex", "Franklin",
                "Hampden", "Hampshire", "Middlesex",
                "Norfolk", "Plymouth", "Suffolk",
                "Worcester"),
      AdultPop = c(165336L, 92946L, 390230L, 20756L,
                   538981L, 53210L, 316312L, 99377L, 1116442L,
                   488612L, 355335L, 546850L, 564408L),
      Bachelors = c(70795L, 31034L, 109080L, 9769L, 212106L,
                    19786L, 85913L, 46210L, 616179L, 258768L,
                    130354L, 244827L, 202881L),
      PctBachelors = c(0.428188658, 0.333892798, 0.279527458,
                       0.470659087, 0.393531497, 0.371847397,
                       0.271608412, 0.464996931, 0.551913131,
                       0.529598127, 0.366848186, 0.447704124,
                       0.359458052),
      CovidPer100K = c(7, 9, 30.8, 25.3, 29.5, 4.7, 28.1, 10.4,
                       16.7, 13.9, 14.5, 27.4, 20),
      Positivity = c(0.0188, 0.0095, 0.0457, 0.0294, 0.0406,
                     0.0052, 0.0446, 0.0063, 0.0165, 0.0184,
                     0.0288, 0.0171, 0.0251),
      Region = c("Southeast", "West", "Southeast",
                 "Southeast", "Northeast", "West", "West",
                 "West", "MetroBoston", "MetroBoston",
                 "Southeast", "MetroBoston", "Central")
    )

    ma_data_fake <- data.frame(
      stringsAsFactors = FALSE,
      Place = c("Barnstable", "Berkshire", "Bristol",
                "Dukes and Nantucket", "Essex", "Franklin",
                "Hampden", "Hampshire", "Middlesex",
                "Norfolk", "Plymouth", "Suffolk",
                "Worcester", "Fake"),
      AdultPop = c(165336L, 92946L, 390230L, 20756L,
                   538981L, 53210L, 316312L, 99377L, 1116442L,
                   488612L, 355335L, 546850L, 564408L, 1106400L),
      Bachelors = c(70795L, 31034L, 109080L, 9769L, 212106L,
                    19786L, 85913L, 46210L, 616179L, 258768L,
                    130354L, 244827L, 202881L, 610100L),
      PctBachelors = c(0.428188658, 0.333892798, 0.279527458,
                       0.470659087, 0.393531497, 0.371847397,
                       0.271608412, 0.464996931, 0.551913131,
                       0.529598127, 0.366848186, 0.447704124,
                       0.359458052, 0.5394678),
      CovidPer100K = c(7, 9, 30.8, 25.3, 29.5, 4.7, 28.1, 10.4,
                       16.7, 13.9, 14.5, 27.4, 20, 16.3),
      Positivity = c(0.0188, 0.0095, 0.0457, 0.0294, 0.0406,
                     0.0052, 0.0446, 0.0063, 0.0165, 0.0184,
                     0.0288, 0.0171, 0.0251, 0.0155),
      Region = c("Southeast", "West", "Southeast",
                 "Southeast", "Northeast", "West", "West",
                 "West", "MetroBoston", "MetroBoston",
                 "Southeast", "MetroBoston", "Central",
                 "MetroBoston")
    )

    mydf <- structure(list(
      Rate = c(4.6, 4.5, 4.2, 4.2, 4.3, 4.1, 4.1, 4.1, 4.1, 7, 8.9,
               4.7, 4.6, 4.3, 4.1, 4, 3.9, 4, 4, 3.8, 6.6, 8.7,
               3.8, 3.6, 3.5, 3.4, 3.2, 3.1, 3, 2.9, 3, 6.1, 8.1,
               2.8, 2.9, 3, 2.8, 3.1, 3.1, 3.2, 3.3, 3.2, 4, 4.3),
      State = c("CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA",
                "NY", "NY", "NY", "NY", "NY", "NY", "NY", "NY", "NY", "NY", "NY",
                "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA",
                "NE", "NE", "NE", "NE", "NE", "NE", "NE", "NE", "NE", "NE", "NE"),
      Quarter = c(
        "2018-01-01", "2018-04-01", "2018-07-01", "2018-10-01", "2019-01-01", "2019-04-01",
        "2019-07-01", "2019-10-01", "2020-01-01", "2020-04-01", "2020-07-01",
        "2018-01-01", "2018-04-01", "2018-07-01", "2018-10-01", "2019-01-01", "2019-04-01",
        "2019-07-01", "2019-10-01", "2020-01-01", "2020-04-01", "2020-07-01",
        "2018-01-01", "2018-04-01", "2018-07-01", "2018-10-01", "2019-01-01", "2019-04-01",
        "2019-07-01", "2019-10-01", "2020-01-01", "2020-04-01", "2020-07-01",
        "2018-01-01", "2018-04-01", "2018-07-01", "2018-10-01", "2019-01-01", "2019-04-01",
        "2019-07-01", "2019-10-01", "2020-01-01", "2020-04-01", "2020-07-01")),
      row.names = c(NA, -44L), class = "data.frame")