See how to generate weekly and monthly reports in R including month-over-month, week-over-week, and year-over-year changes. Credit: Thinkstock If you need to calculate changes such as last month versus the prior month or last month versus the same month a year earlier, R is a good choice. It’s easy to do those calculations — and you don’t have to worry whether a spreadsheet formula was properly clicked and dragged to cover all the necessary cells. Like so many things in R, there are multiple ways to do this. I’ll cover two of them. First, I’ll import some data about daily cycling trips on Bluebikes, the bicycle-share system in Boston, Cambridge, and three other nearby cities. If you want to follow along, download this zip file of CSV data and unzip it. download Bicycle Trip Data for Do More With R In the code below, I load a couple of packages and then read in the data from the daily_cycling_trips.csv file with the readr package’s read_csv() function. library(readr) library(dplyr) daily_trips <- readr::read_csv("daily_cycling_trips.csv") In the next code block, I’ll add a new column to the data, called YearMonth, using base R’s format() function to create a yyyy-mm format for each trip starting date. That is followed by typical dplyr group_by() and summarize() operations. In this case, I’m grouping by YearMonth and then, within each group, creating a column called MonthlyTrips that has the sum of all of that month’s trips. Finally, I make sure the data is arranged by YearMonth. daily_trips <- daily_trips %>% mutate( YearMonth = format(TripStartDate, "%Y-%m") ) # Usual dplyr group_by() and summarize() by month. monthly_trips <- daily_trips %>% group_by(YearMonth) %>% summarize( MonthlyTrips = sum(Trips) ) %>% arrange(YearMonth) Calculating monthly and yearly changes Now that I’ve got monthly subtotals, I can calculate the month-over-month and year-over-year changes by using dplyr’s lag() function. lag() defaults to the previous value in the data frame column as it is currently ordered. You can change the number of lag items so it’s more than one item back. To compare with the previous month, a default of one is fine. To compare with the previous year, I’ll want the lag to be 12, for 12 items back. Note that this works only if there aren’t any missing months. You might want to add some code to check for missing data if you’re not sure of the data’s integrity. With lag(), you can set the column you want to order and lag by if your data frame isn’t ordered the way you need. (I don’t need to in this case, because my data is already ordered by YearMonth.) monthly_report <- monthly_trips %>% mutate( MoM = (MonthlyTrips - lag(MonthlyTrips)) / lag(MonthlyTrips), YoY = (MonthlyTrips - lag(MonthlyTrips, 12)) / lag(MonthlyTrips, 12) ) If you look at the monthly_report object, the data is all there, although the format could be nicer. Sharon Machlis, IDG Monthly report data frame with month-over-month and year-over-year comparisons. The code below multiplies the fractions by 100 to create a percent format, and then rounds to one decimal place. monthly_report <- monthly_report %>% mutate( MoM = round(MoM * 100, 1), YoY = round(YoY * 100, 1) ) That’s the easiest example of monthly reporting — only one data point per month. Sometimes, though, you’ll have multiple categories you want to track each month, such as comparisons by city, or age group, or website. It’s easy to tweak this code for that: You just group by your category in addition to the month before running a dplyr summarize(). Here is an example: # Read in second data file daily_trips_by_usertype <- readr::read_csv("daily_cycling_trips_by_usertype.csv") # Add YearMonth column and get totals by month monthly_trips_by_usertype <- daily_trips_by_usertype %>% mutate( YearMonth = format(TripStartDate, "%Y-%m") ) %>% group_by(YearMonth, usertype) %>% summarize( MonthlyTrips = sum(Trips) ) %>% arrange(YearMonth, usertype) # Calculate MoM and YoY monthly_report_by_usertype <- monthly_trips_by_usertype %>% group_by(usertype) %>% mutate( MoM = (MonthlyTrips - lag(MonthlyTrips)) / lag(MonthlyTrips), YoY = (MonthlyTrips - lag(MonthlyTrips, 12)) / lag(MonthlyTrips, 12) ) In the code block above, I’m doing the same thing as I did before, except I’m comparing trips by month and user type. There are two types of users in this data set, Customer and Subscriber. If you run the code and then look at monthly_report_by_user_type, you will see month-over-month and year-over-year comparisons by both Subscriber and Customer user types. This data also makes it easy to graph monthly percent changes with ggplot2. Sharon Machlis, IDG ggplot2 graph of month-over-month changes in trips by customer types at Bluebikes. Instead of changing the data by multiplying by 100 and rounding, I used the scales package and scale_y_continuous() to do that for me in the ggplot2 graph code below: library(ggplot2) library(scales) ggplot(monthly_report_by_usertype, aes(x=YearMonth, y=MoM, fill=usertype)) + geom_col(position="dodge") + scale_y_continuous(labels = scales::percent) Note: If you just need “last month” for a specific report and not comparisons for every month, use dplyr’s filter() function on your report data frame and set the YearMonth column to equal whatever the maximum value is for that data. For example: filter(monthly_report_by_usertype, YearMonth == max(YearMonth)) Week-over-week calculations Week over week is slightly different from monthly. For weekly, I generally use each week’s starting date instead of a format like yyyy-ww. That’s because using numbers for weeks can get complicated when there are partial weeks at the start and end of a year. In the code below, I use the lubridate package and its floor_date() function. You could use base R’s cut.Date() function instead, but that’s a bit more complicated since it returns factors instead of dates. (With cut.Date() I usually end up running extra code to turn the factors back to dates. floor_date() is one step for what I want.) Next come the usual calculations, this time for week-over-week and year-over-year. Note that for year over year, the lag is 52 instead of 12. It’s important here to be sure that I’ve got rows for every week, too. # install.packages("lubridate") library(lubridate) weekly_trips <- daily_trips %>% mutate( WeekStarting = floor_date(TripStartDate, unit = "weeks") ) %>% group_by(WeekStarting) %>% summarize( WeeklyTrips = sum(Trips) ) %>% arrange(WeekStarting) weekly_report <- weekly_trips %>% mutate( WoW = (WeeklyTrips - lag(WeeklyTrips)) / lag(WeeklyTrips), YoY = (WeeklyTrips - lag(WeeklyTrips, 52)) / lag(WeeklyTrips, 52) ) By now you may be thinking: OK, that’s straightforward, but it’s a lot of typing for a simple weekly or monthly report. To streamline this process, make an RStudio code snippet! If you don’t know how they work, check out the “Do More With R” episode on code snippets. Here’s the snippet I made for monthly reports. Remember that every indented line needs to be indented with a tab, not just spaces, and the snippet needs to live in your RStudio snippet file, which you can access with usethis::edit_rstudio_snippets(). snippet my_monthly_reports monthly_report <- ${1:mydf} %>% mutate( YearMonth = format(${2:MyDateColumn}, "%Y-%m") ) %>% group_by(YearMonth, ${3:MyCategory}) %>% summarize( MonthlyTotal = sum(${4:MyValueColumn}) ) %>% arrange(YearMonth, ${3:MyCategory}) %>% ungroup() %>% group_by(${3:MyCategory}) %>% mutate( MoM = (MonthlyTotal - lag(MonthlyTotal)) / lag(MonthlyTotal), YoY = (MonthlyTotal - lag(MonthlyTotal, 12)) / lag(MonthlyTotal, 12) ) You can see how the snippet works in the video embedded at the top of this article. Once you’ve set up a code snippet, doing month-over-month comparisons may even be faster in R than in Excel. For more R tips, head to the “Do More With R” page at InfoWorld or the “Do More With R” playlist on YouTube. Related content analysis 7 steps to improve analytics for data-driven organizations Effective data-driven decision-making requires good tools, high-quality data, efficient processes, and prepared people. Here’s how to achieve it. By Isaac Sacolick Jul 01, 2024 10 mins Analytics news Maker of RStudio launches new R and Python IDE Posit, formerly RStudio, has released a beta of Positron, a ‘next generation’ data science development environment based on Visual Studio Code. By Sharon Machlis Jun 27, 2024 3 mins Integrated Development Environments Python R Language feature 4 highlights from EDB Postgres AI New platform product supports transactional, analytical, and AI workloads. By Aislinn Shea Wright Jun 13, 2024 6 mins PostgreSQL Generative AI Databases analysis Microsoft Fabric evolves from data lake to application platform Microsoft delivers a one-stop shop for big data applications with its latest updates to its data platform. By Simon Bisson Jun 13, 2024 7 mins Microsoft Azure Natural Language Processing Data Architecture Resources Videos