Sharon Machlis
Executive Editor, Data & Analytics

How to count by group in R

how-to
Sep 10, 20209 mins
AnalyticsR Language

Discover quick and easy ways to count by groups in R, including reports as data frames, graphics, and ggplot graphs

Do More With R [video teaser/video series] - R Programming Guide - Tips & Tricks
Credit: Thinkstock

Counting by multiple groups — sometimes called crosstab reports — can be a useful way to look at data ranging from public opinion surveys to medical tests. For example, how did people vote by gender and age group? How many software developers who use both R and Python are men vs. women?

There are a lot of ways to do this kind of counting by categories in R. Here, I’d like to share some of my favorites.

For the demos in this article, I’ll use a subset of the Stack Overflow Developers survey, which surveys developers on dozens of topics ranging from salaries to technologies used. I’ll whittle it down with columns for languages used, gender, and if they code as a hobby. I also added my own LanguageGroup column for whether a developer reported using R, Python, both, or neither.

If you’d like to follow along, the last page of this article has instructions on how to download and wrangle the data to get the same data set I’m using.

The data has one row for each survey response, and the four columns are all characters.

str(mydata)
'data.frame':	83379 obs. of  4 variables:
 $ Gender            : chr  "Man" "Man" "Man" "Man" ...
 $ LanguageWorkedWith: chr  "HTML/CSS;Java;JavaScript;Python" "C++;HTML/CSS;Python" "HTML/CSS" "C;C++;C#;Python;SQL" ...
 $ Hobbyist          : chr  "Yes" "No" "Yes" "No" ...
 $ LanguageGroup     : chr  "Python" "Python" "Neither" "Python" ...

I filtered the raw data to make the crosstabs more manageable, including removing missing values and taking the two largest genders only, Man and Woman.

The janitor package

So, what’s the gender breakdown within each language group? For this type of reporting in a data frame, one of my go-to tools is the janitor package’s tabyl() function. 

The basic tabyl() function returns a data frame with counts. The first column name you add to a tabyl() argument becomes the row, and the second one the column

library(janitor)
tabyl(mydata, Gender, LanguageGroup)

 Gender Both Neither Python   R
    Man 3264   43908  29044 969
  Woman  374    3705   1940 175

What’s nice about tabyl() is it’s very easy to generate percents, too. If you want to see percents for each column instead of raw totals, add adorn_percentages("col"). You can then pipe those results into a formatting function such as adorn_pct_formatting().

tabyl(mydata, Gender, LanguageGroup) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1)

 Gender  Both Neither Python     R
    Man 89.7%   92.2%  93.7% 84.7%
  Woman 10.3%    7.8%   6.3% 15.3%

To see percents by row, add adorn_percentages("row")

If you want to add a third variable, such as Hobbyist, that’s easy too.

tabyl(mydata, Gender, LanguageGroup, Hobbyist) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1)

However, it gets a little harder to visually compare results in more than two levels this way. This code returns a list with one data frame for each third-level choice:

$No
 Gender  Both Neither Python     R
    Man 79.6%   86.7%  86.4% 74.6%
  Woman 20.4%   13.3%  13.6% 25.4%

$Yes
 Gender  Both Neither Python     R
    Man 91.6%   93.9%  95.0% 88.0%
  Woman  8.4%    6.1%   5.0% 12.0%

The CGPfunctions package

The CGPfunctions package is worth a look for some quick and easy ways to visualize crosstab data. Install it from CRAN with the usual install.packages("CGPfunctions").

The package has two functions of interest for examining crosstabs: PlotXTabs() and PlotXTabs2(). This code returns bar graphs of the data (first graph below):

library(CGPfunctions)
PlotXTabs(mydata) 
Bar graphs of language use by men and women Screen shot by Sharon Machlis, IDG

Result of PlotXTabs(mydata).

PlotXTabs2(mydata) creates a graph with a different look, and some statistical summaries (second graph at left).

If you don’t need or want those summaries, you can remove them with results.subtitle = FALSE, such as PlotXTabs2(mydata, LanguageGroup, Gender, results.subtitle = FALSE).

Bar graphs of language use by men and women with additional statistics Screen shot by Sharon Machlis, IDG

Result of PlotXTabs(mydata). 

PlotXTabs2() has a couple of dozen argument options, including title, caption, legends, color scheme, and one of four plot types: side, stack, mosaic, or percent. There are also options familiar to ggplot2 users, such as ggtheme and palette. You can see more details in the function’s help file.

The vtree package

The vtree package generates graphics for crosstabs as opposed to graphs. Running the main vtree() function on one variable, such as 

library(vtree)
vtree(mydata, "LanguageGroup")

gets you this basic response:

vtree graphic tree of counts by one variable Sharon Machlis, IDG

Basic vtree() function on one variable.

I’m not keen on the color defaults here, but you can swap in an RColorBrewer palette. vtree’s palette argument uses palette numbers, not names; you can see how they’re numbered in the vtree package documentation. I could choose 3 for Greens and 5 for Purples, for example. Unfortunately, those defaults give you a more intense color for lower count numbers, which doesn’t always make sense (and doesn’t work well for me in this example). I can change that default behavior with sortfill = TRUE to use the more intense color for the higher value. 

vtree(mydata, "LanguageGroup", palette = 3, sortfill = TRUE)

vtree() graphic with green palette Sharon Machlis, IDG

vtree() after changing to a new palette.

If you find the dark color makes it hard to read text, there are some options. One option is to use the plain argument, such as vtree(mydata, "LanguageGroup", plain = TRUE). Another option is to set a single fill color instead of a palette, using the fillcolor argument, such as vtree(mydata, LanguageGroup", fillcolor = "#99d8c9").

To look at two variables in a crosstab report, simply add a second column name and palette or color if you don’t want the default. You can use the plain option or specify two palettes or two colors. Below I chose specific colors instead of palettes, and I also rotated the graph to read vertically.

vtree(mydata, c("LanguageGroup", "Gender"), 
   fillcolor = c( LanguageGroup = "#e7d4e8", Gender = "#99d8c9"),
   horiz = FALSE)
vtree graphic split by two variables Sharon Machlis, IDG

vtree() for two variables.

You can add more than two categories, although it gets a bit harder to read and follow as the tree grows. If you’re only interested in some of the branches, you can specify which to display with the keep argument. Below, I set vtree() to show only people who use R without Python or who use both R and Python.

vtree(mydata, c("Gender", "LanguageGroup", "Hobbyist"), 
  horiz = FALSE, fillcolor = c(LanguageGroup = "#e7d4e8", 
  Gender = "#99d8c9", Hobbyist = "#9ecae1"), 
  keep = list(LanguageGroup = c("R", "Both")), showcount = FALSE) 

With the tree getting so busy, I think it helps to have either the count or the percent as node labels, not both. So that last argument in the code above,  showcount = FALSE, sets the graph to display only percents and not counts.

Part of a vtree graphic with only certain nodes showing Sharon Machlis, IDG

Three-level vtree graphic with a subset of nodes, displaying percents only. 

More count by group options

There are other useful ways to group and count in R, including base R, dplyr, and data.table. Base R has the xtabs() function specifically for this task. Note the formula syntax below: a tilde and then one variable plus another variable.

xtabs(~ LanguageGroup + Gender, data = mydata)
             Gender
LanguageGroup   Man Woman
      Both     3264   374
      Neither 43908  3705
      Python  29044  1940
      R         969   175

dplyr’s count() function combines “group by” and “count rows in each group” into a single function.

library(dplyr)
my_summary <- mydata %>%
  count(LanguageGroup, Gender, Hobbyist, sort = TRUE) 

my_summary
   LanguageGroup Gender Hobbyist     n
1        Neither    Man      Yes 34419
2         Python    Man      Yes 25093
3        Neither    Man       No  9489
4         Python    Man       No  3951
5           Both    Man      Yes  2807
6        Neither  Woman      Yes  2250
7        Neither  Woman       No  1455
8         Python  Woman      Yes  1317
9              R    Man      Yes   757
10        Python  Woman       No   623
11          Both    Man       No   457
12          Both  Woman      Yes   257
13             R    Man       No   212
14          Both  Woman       No   117
15             R  Woman      Yes   103
16             R  Woman       No    72

In the three lines of code below, I load the data.table package, create a data.table from my data, and then use the special .N data.table symbol that stands for number of rows in a group. 

library(data.table)
mydt <- setDT(mydata)
mydt[, .N, by = .(LanguageGroup, Gender, Hobbyist)]

Visualizing with ggplot2

As with most data, ggplot2 is a good choice to visualize summarized results. The first ggplot graph below plots LanguageGroup on the X axis and the count for each on the Y axis. Fill color represents whether someone says they code as a hobby. And, facet_wrap says: Make a separate graph for each value in the Gender column.

library(ggplot2)
ggplot(my_summary, aes(LanguageGroup, n, fill = Hobbyist)) +
  geom_bar(stat = "identity") +
  facet_wrap(facets = vars(Gender))

ggplot2 graphs of language use by gender Sharon Machlis, IDG

Using ggplot2 to compare language use by gender.

Because there are relatively few women in the sample, it’s difficult to compare percentages across genders when both graphs use the same Y-axis scale. I can change that, though, so each graph uses a separate scale, by adding the argument scales = “free_y” to the facet_wrap() function:

ggplot(my_summary, aes(LanguageGroup, n, fill = Hobbyist)) +
  geom_bar(stat = "identity") +
  facet_wrap(facets = vars(Gender), scales = "free_y") 

Now it’s easier to compare multiple variables by gender.

For more R tips, head to the “Do More With R” page on InfoWorld or check out the “Do More With R” YouTube playlist.

See the next page for info on how to download and wrangle data used in this demo.

Get the data

You can see more about the survey and download the complete data at https://insights.stackoverflow.com/survey. I used the 2019 Stack Overflow Developers Survey data for this article because I’ve worked with that data set before, but the 2020 data will likely work as well.

Here is the code I used to get the data subset I worked with in this article:

library(dplyr)
library(janitor)
library(stringr)
# Also need rio, tidyr installed on your system
raw_data <- rio::import("data/survey_results_public.csv")
mydata <- raw_data %>%
  select(Gender, LanguageWorkedWith, Hobbyist) %>%
  mutate(
    LanguageGroup = case_when(
      str_detect(LanguageWorkedWith, "bRb") & str_detect(LanguageWorkedWith, "bPythonb") ~ "Both",
      str_detect(LanguageWorkedWith, "bRb") & !str_detect(LanguageWorkedWith, "bPythonb") ~ "R",
      !str_detect(LanguageWorkedWith, "bRb") & str_detect(LanguageWorkedWith, "bPythonb") ~ "Python",
      !str_detect(LanguageWorkedWith, "bRb") & !str_detect(LanguageWorkedWith, "bPythonb") ~ "Neither"
    )
  ) %>%
  filter(Gender %in% c("Man", "Woman")) %>%
  tidyr::drop_na()
Sharon Machlis
Executive Editor, Data & Analytics

Sharon Machlis is Director of Editorial Data & Analytics at Foundry (the IDG, Inc. company that publishes websites including Computerworld and InfoWorld), where she analyzes data, codes in-house tools, and writes about data analysis tools and tips. She holds an Extra class amateur radio license and is somewhat obsessed with R. Her book Practical R for Mass Communication and Journalism was published by CRC Press.

More from this author