How to merge data in R using R merge, dplyr, or data.table

See how to join two data sets by one or more common columns using base R’s merge function, dplyr join functions, and the speedy data.table package.

Executive Editor, Data & Analytics, InfoWorld |

Previous 1 2 Page 2

Page 2 of 2

Joins with data.table

The data.table package is best known for its speed, so it can be a good choice for dealing with large data sets.

In the code below, I load data.table and then use its fread() function to import the zip file. To read the zipped file, I use fread’s ability to call shell commands directly. That’s what the unzip -cq part of the argument is doing in the following fread()—you won't need that unless your file is zipped.


library(data.table)

mydt <- fread('unzip -cq 673598238_T_ONTIME_REPORTING.zip')
mylookup_dt <- fread("L_UNIQUE_CARRIERS.csv_")

fread() creates a data.table object—a data frame with extra functionality, especially within brackets after the object name. There are several ways to do joins with data.table.

One is to use the exact same merge() syntax as base R. You write it the same way, but it executes a lot faster:


joined_dt1 <- merge(mydt, mylookup_dt, 
              by.x = "OP_UNIQUE_CARRIER", by.y = "Code", 
              all.x = TRUE, all.y = FALSE)

If you want to use specific data.table syntax, you can first use the setkey() function to specify which columns you want to join on. Then, the syntax is simply mylookup_dt[mydt], as you can see in the code below:


setkey(mydt, "OP_UNIQUE_CARRIER")
setkey(mylookup_dt, "Code")
joined_dt2 <- mylookup_dt[mydt]

The creator of data.table, Matt Dowle, explained the format as “X[Y] looks up X rows using Y as an index.” But note that if you think of mylookup_dt as a lookup table, the lookup table is outside the brackets while the primary data is within the brackets.

There’s another data.table syntax that doesn’t require setkey(), and that’s adding an on argument within brackets. The syntax for the on vector is on = c(lookupColName = "dataColName"), with the lookup column name unquoted and the data column name in quotation marks:


dt3 <- mylookup_dt[mydt, on = c(Code = "OP_UNIQUE_CARRIER")]

Joins with dtplyr: dplyr syntax and data.table speed

I want to mention one other option: using dplyr syntax but with data.table on the back-end. You can do that with the dtplyr package, which is ideal for people who like dplyr syntax, or who are used to SQL database syntax, but want the speedy data.table performance.

To use dtplyr, you need to turn data frames or tibbles into special lazy data table objects. You do that with dtplyr’s lazy_dt() function.

In the code below, I use a %>% pipe to send the result of read_csv to the lazy_dt() function and then join the two objects the usual dplyr left_join() way:


my_lazy_dt <- readr::read_csv("673598238_T_ONTIME_REPORTING.zip") %>%
           dtplyr::lazy_dt() 
my_lazy_lookup <- readr::read_csv("L_UNIQUE_CARRIERS.csv_") %>%
               dtplyr::lazy_dt() 

joined_lazy_dt <- left_join(my_lazy_dt, my_lazy_lookup, 
                  by = c("OP_UNIQUE_CARRIER" = "Code"))

As of this writing, dtplyr is not yet using the new by = joinby() syntax.

That joined_lazy_dt variable is a special dtplyr step object. If you print it, you can see the data.table code that created the object—look at the Call: line in the print() results below. That can be handy! You also see the first few rows of data, and a message that you need to turn that object into a data frame, tibble, or data.table if you want to use the data in there:


print(joined_lazy_dt)
Source: local data table [658,461 x 7]
Call:   setnames(setcolorder(`_DT5`[`_DT4`, on = .(Code = OP_UNIQUE_CARRIER), 
    allow.cartesian = TRUE], c(3L, 1L, 4L, 5L, 6L, 7L, 2L)), 
    "Code", "OP_UNIQUE_CARRIER")

  FL_DATE    OP_UNIQUE_CARRIER ORIGIN DEST  DEP_DELAY_NEW ...6  Description         
  <date>     <chr>             <chr>  <chr>         <dbl> <lgl> <chr>               
1 2019-08-01 DL                ATL    DFW              31 NA    Delta Air Lines Inc.
2 2019-08-01 DL                DFW    ATL               0 NA    Delta Air Lines Inc.
3 2019-08-01 DL                IAH    ATL              40 NA    Delta Air Lines Inc.
4 2019-08-01 DL                PDX    SLC               0 NA    Delta Air Lines Inc.
5 2019-08-01 DL                SLC    PDX               0 NA    Delta Air Lines Inc.
6 2019-08-01 DL                DTW    ATL              10 NA    Delta Air Lines Inc.
# … with 658,455 more rows
# ℹ Use `print(n = ...)` to see more rows

# Use as.data.table()/as.data.frame()/as_tibble() to access results

My complete code using a base R pipe:


joined_tibble <- left_join(my_lazy_dt, my_lazy_lookup, 
                       by = c("OP_UNIQUE_CARRIER" = "Code")) |>
                 as_tibble()

After running some rather crude benchmarking several years ago, data.table code was fastest; dtplyr was almost as fast; dplyr took about twice as long; and base R was 15 or 20 times slower. Major caution here: that performance depends on the structure and size of your data and can vary wildly depending on your task. But it’s safe to say that base R isn’t a great choice for large data sets.

How to merge when you only want rows with matches

For the rest of these examples, I’m going to use two new data sets: Home values by U.S. zip code from Zillow and some population and other data by zip code from Simplemaps. If you’d like to follow along, download Median Home Values per Square Foot by Zip code from the Zillow research data page and the free basic database from Simplemaps' US Zip Codes Database page. (Because of licensing issues surrounding private data, these data sets are not included in the code and data download.)

In the code below, I’ve renamed files. If your data has different names or locations, adjust the code accordingly.

Base R and data.table

For both base R and data.table, if you want only rows that match, tell the merge() function all = FALSE. The code below reads in files and then runs merge() with all = FALSE:

# Import data
home_values_dt <- fread("Zip_MedianValuePerSqft_AllHomes.csv", 
               colClasses = c(RegionName = "character"))
pop_density_dt <- fread("simplemaps_uszips_basicv1.6/uszips.csv", 
               colClasses = c(zip = "character"))

# Merge data
matches1 <- merge(home_values_dt, pop_density_dt, 
                  by.x = "RegionName", 
                  by.y = "zip", all = FALSE)

Note that I set column classes as "character" when using fread() so that five-digit zip codes starting with 0 didn’t end up as four-digit numbers.

dplyr

With dplyr, selecting only rows that match is an inner join:

Here's the code:


# Import data
home_values <- read_csv("Zip_MedianValuePerSqft_AllHomes.csv")
pop_density <- read_csv("simplemaps_uszips_basicv1.6/uszips.csv")

# Merge data with inner join
matches2 <- inner_join(home_values, pop_density, 
                       by = join_by(RegionName == zip))
# OR with older syntax:
matches2 <- inner_join(home_values, pop_density, 
c("RegionName" = "zip"))

With readr’s read_csv(), the zip code columns automatically come in as character strings.

data.table bracket syntax

If you want to use data.table’s bracket syntax, add the argument nomatch=0 to exclude rows that don’t have a match:


matchesdt <- home_values_dt[pop_density_dt, nomatch = 0]

How to merge when you want all rows

Next, I'll show you the three ways to merge with all rows.

Base R and data.table

With base R or data.table, use merge() with all = TRUE:


all_rows1 <- merge(home_values_dt, pop_density_dt, 
                   by.x = "RegionName", by.y = "zip", all = TRUE)

dplyr

With dplyr, this is a full join:


all_rows2 <- full_join(home_values, pop_density,
                       by = join_by(RegionName == zip))
# OR
all_rows2 <- full_join(home_values, pop_density, 
                       by = c("RegionName" = "zip"))

How to view rows in a data set without a match

It can be useful to check for rows in a data set that didn’t match, since that can help you understand the limitations of your data and whether something you expected is missing.

dplyr

In dplyr, finding rows that didn’t match is an anti join:

Here, as with left joins, order matters. To see all the rows in home_values without a match in pop_density, enter:


home_values_no_match <- anti_join(home_values, pop_density, 
                         by = join_by(RegionName == zip))
# OR
home_values_no_match <- anti_join(home_values, pop_density, 
                         by = c("RegionName" = "zip"))

And, to see all rows in pop_density that don’t have a match in home_values, it would be:


pop_density_no_match <- anti_join(pop_density, home_values, 
                                  by = join_by(zip == RegionName))

data.table

While merge() syntax is fairly easy for most types of merges, in this case it gets a bit complex if you want rows without a match. I use dplyr’s anti_join() for this type of task, never merge().

If you want to use data.table bracket syntax, use this code to see all the rows in home_values_dt that don’t have a match in pop_density_dt:


home_values_no_match <- home_values_dt[!pop_density_dt]

Note that the code above assumes I’ve already run setkey() on both data sets. Pay attention to the order.

As we saw earlier, all rows in home_values, including those that have a match:

pop_density_dt[home_values_dt]

Only rows in home_values that don't have a match:


home_values_dt[!pop_density_dt]

Dowle’s full explanation to me about this syntax was: “X[Y] looks up X rows using Y as an index. X[!Y] returns all the rows in X that don’t match the Y index. Analogous to X[-c(3,6,10), ] returns all X rows except rows 3, 6, and 10.”

How to merge on multiple columns

Finally, one question I often see is, How do you combine data sets on two common columns?

merge()

To use merge() on multiple columns, the syntax is:


merge(dt1, dt2, by.x = c(”dt1_ColA", dt1_ColB"), by.y = c("dt2_cola", "dt2_colb"))

So, to merge home_values_dt and pop_density_dt on the zip code and city columns, the code is:


merge2_dt <- merge(home_values_dt, pop_density_dt, 
                   by.x = c("RegionName", "City"), 
                   by.y = c("zip", "city"), 
                   all.x = TRUE, all.y = FALSE)

setting all.x and all.y as needed.

data.table

data.table has a couple of ways to set multiple keys in a data set. There’s setkey() to refer to column names unquoted, and setkeyv() if you want the names quoted in a vector (useful for when this task is within a function).

The formats are:


setkey(pop_density_dt, ZipCode, City)
# or 
setkeyv(pop_density_dt, c("ZipCode", "City"))

And:


setkey(pop_density_dt, zip, city)
# or 
setkeyv(pop_density_dt, c("zip", "city"))

Then, use your bracket syntax as usual, such as:


pop_density_dt[home_values_dt]

dplyr join_by() syntax is

join_by(id1 == id2, colA == colB)

Want more R tips? Head to the Do More With R page!

Next read this:

Sharon Machlis is Director of Editorial Data & Analytics at Foundry, where she works on data analysis and in-house editor tools in addition to writing. Her book Practical R for Mass Communication and Journalism was published by CRC Press. You can also find her on Mastodon at masto.machlis.com/@smach

Previous 1 2 Page 2

Page 2 of 2