video

How to calculate month-over-month changes in R

See how to generate weekly and monthly reports in R including month-over-month, week-over-week, and year-over-year changes.

Do More With R [video teaser/video series] - R Programming Guide - Tips & Tricks
Thinkstock

If you need to calculate changes such as last month versus the prior month or last month versus the same month a year earlier, R is a good choice. It’s easy to do those calculations — and you don’t have to worry whether a spreadsheet formula was properly clicked and dragged to cover all the necessary cells.

Like so many things in R, there are multiple ways to do this. I’ll cover two of them.

First, I’ll import some data about daily cycling trips on Bluebikes, the bicycle-share system in Boston, Cambridge, and three other nearby cities. If you want to follow along, download this zip file of CSV data and unzip it.

download

 

In the code below, I load a couple of packages and then read in the data from the daily_cycling_trips.csv file with the readr package’s read_csv() function.

library(readr)
library(dplyr)
daily_trips <- readr::read_csv("daily_cycling_trips.csv")

In the next code block, I’ll add a new column to the data, called YearMonth, using base R’s format() function to create a yyyy-mm format for each trip starting date. That is followed by typical dplyr group_by() and summarize() operations. In this case, I’m grouping by YearMonth and then, within each group, creating a column called MonthlyTrips that has the sum of all of that month’s trips. Finally, I make sure the data is arranged by YearMonth.

daily_trips <- daily_trips %>%
mutate(
YearMonth = format(TripStartDate, "%Y-%m")
)
# Usual dplyr group_by() and summarize() by month.
monthly_trips <- daily_trips %>%
group_by(YearMonth) %>%
summarize(
MonthlyTrips = sum(Trips)
) %>%
arrange(YearMonth)

Calculating monthly and yearly changes

Now that I’ve got monthly subtotals, I can calculate the month-over-month and year-over-year changes by using dplyr’s lag() function. lag() defaults to the previous value in the data frame column as it is currently ordered. You can change the number of lag items so it’s more than one item back. To compare with the previous month, a default of one is fine. To compare with the previous year, I’ll want the lag to be 12, for 12 items back. Note that this works only if there aren’t any missing months. You might want to add some code to check for missing data if you’re not sure of the data’s integrity.

With lag(), you can set the column you want to order and lag by if your data frame isn’t ordered the way you need. (I don’t need to in this case, because my data is already ordered by YearMonth.) 

monthly_report <- monthly_trips %>%
mutate(
MoM = (MonthlyTrips - lag(MonthlyTrips)) / lag(MonthlyTrips),
YoY = (MonthlyTrips - lag(MonthlyTrips, 12)) / lag(MonthlyTrips, 12)
)

If you look at the monthly_report object, the data is all there, although the format could be nicer.

Monthly report data frame Sharon Machlis, IDG

Monthly report data frame with month-over-month and year-over-year comparisons.

The code below multiplies the fractions by 100 to create a percent format, and then rounds to one decimal place. 

monthly_report <- monthly_report %>%
mutate(
MoM = round(MoM * 100, 1),
YoY = round(YoY * 100, 1)
)

That’s the easiest example of monthly reporting — only one data point per month. Sometimes, though, you’ll have multiple categories you want to track each month, such as comparisons by city, or age group, or website. It’s easy to tweak this code for that: You just group by your category in addition to the month before running a dplyr summarize(). Here is an example:

# Read in second data file
daily_trips_by_usertype <- readr::read_csv("daily_cycling_trips_by_usertype.csv")
# Add YearMonth column and get totals by month
monthly_trips_by_usertype <- daily_trips_by_usertype %>%
mutate(
YearMonth = format(TripStartDate, "%Y-%m")
) %>%
group_by(YearMonth, usertype) %>%
summarize(
MonthlyTrips = sum(Trips)
) %>%
arrange(YearMonth, usertype)
# Calculate MoM and YoY
monthly_report_by_usertype <- monthly_trips_by_usertype %>%
group_by(usertype) %>%
mutate(
MoM = (MonthlyTrips - lag(MonthlyTrips)) / lag(MonthlyTrips),
YoY = (MonthlyTrips - lag(MonthlyTrips, 12)) / lag(MonthlyTrips, 12)
)

In the code block above, I’m doing the same thing as I did before, except I’m comparing trips by month and user type. There are two types of users in this data set, Customer and Subscriber. If you run the code and then look at monthly_report_by_user_type, you will see month-over-month and year-over-year comparisons by both Subscriber and Customer user types.

This data also makes it easy to graph monthly percent changes with ggplot2.

Graph of month-over-month changes in number of trips by customer type at Bluebiles. Sharon Machlis, IDG

ggplot2 graph of month-over-month changes in trips by customer types at Bluebikes.

Instead of changing the data by multiplying by 100 and rounding, I used the scales package and scale_y_continuous() to do that for me in the ggplot2 graph code below:

library(ggplot2)
library(scales)
ggplot(monthly_report_by_usertype, aes(x=YearMonth, y=MoM, fill=usertype)) +
geom_col(position="dodge") +
scale_y_continuous(labels = scales::percent)

Note: If you just need “last month” for a specific report and not comparisons for every month, use dplyr’s filter() function on your report data frame and set the YearMonth column to equal whatever the maximum value is for that data. For example:

filter(monthly_report_by_usertype, YearMonth == max(YearMonth))

Week-over-week calculations

Week over week is slightly different from monthly. For weekly, I generally use each week’s starting date instead of a format like yyyy-ww. That’s because using numbers for weeks can get complicated when there are partial weeks at the start and end of a year. 

In the code below, I use the lubridate package and its floor_date() function. You could use base R’s cut.Date() function instead, but that’s a bit more complicated since it returns factors instead of dates. (With cut.Date() I usually end up running extra code to turn the factors back to dates. floor_date() is one step for what I want.)

Next come the usual calculations, this time for week-over-week and year-over-year. Note that for year over year, the lag is 52 instead of 12. It’s important here to be sure that I’ve got rows for every week, too. 

# install.packages("lubridate")
library(lubridate)
weekly_trips <- daily_trips %>%
mutate(
WeekStarting = floor_date(TripStartDate, unit = "weeks")
) %>%
group_by(WeekStarting) %>%
summarize(
WeeklyTrips = sum(Trips)
) %>%
arrange(WeekStarting)

weekly_report <- weekly_trips %>%
mutate(
WoW = (WeeklyTrips - lag(WeeklyTrips)) / lag(WeeklyTrips),
YoY = (WeeklyTrips - lag(WeeklyTrips, 52)) / lag(WeeklyTrips, 52)
)

By now you may be thinking: OK, that’s straightforward, but it’s a lot of typing for a simple weekly or monthly report. To streamline this process, make an RStudio code snippet! If you don’t know how they work, check out the “Do More With R” episode on code snippets

Here’s the snippet I made for monthly reports. Remember that every indented line needs to be indented with a tab, not just spaces, and the snippet needs to live in your RStudio snippet file, which you can access with usethis::edit_rstudio_snippets()

snippet my_monthly_reports
monthly_report <- ${1:mydf} %>%
mutate(
YearMonth = format(${2:MyDateColumn}, "%Y-%m")
) %>%
group_by(YearMonth, ${3:MyCategory}) %>%
summarize(
MonthlyTotal = sum(${4:MyValueColumn})
) %>%
arrange(YearMonth, ${3:MyCategory}) %>%
ungroup() %>%
group_by(${3:MyCategory}) %>%
mutate(
MoM = (MonthlyTotal - lag(MonthlyTotal)) / lag(MonthlyTotal),
YoY = (MonthlyTotal - lag(MonthlyTotal, 12)) / lag(MonthlyTotal, 12)
)

You can see how the snippet works in the video embedded at the top of this article.

Once you’ve set up a code snippet, doing month-over-month comparisons may even be faster in R than in Excel.

For more R tips, head to the “Do More With R” page at InfoWorld or the “Do More With R” playlist on YouTube.

Copyright © 2019 IDG Communications, Inc.