It’s frustrating to see your code choke part of the way through while trying to apply a function in R. You may know that something in one of those objects caused a problem, but how do you track down the offender?
The purrr package’s possibly() function is one easy way.
In this example, I’ll demo code that imports multiple CSV files. Most files’ value columns import as characters, but one of these comes in as numbers. Running a function that expects characters as input will cause an error.
For setup, the code below loads several libraries I need and then uses base R’s list.files() function to return a sorted vector with names of all the files in my data directory.
library(purrr)
library(readr)
library(rio)
library(dplyr)
my_data_files <- list.files("data_files", full.names = TRUE) %>%
  sort()
I can then import the first file and look at its structure.
x <- rio::import("data_files/file1.csv")
str(x)
'data.frame': 3 obs. of 3 variables:
 $ Category     : chr "A" "B" "C"
 $ Value        : chr "$4,256.48 " "$438.22" "$945.12"
 $ MonthStarting: chr "12/1/20" "12/1/20" "12/1/20"
Both the Value and MonthStarting columns are importing as character strings. What I ultimately want is Value as numbers and MonthStarting as dates.
I sometimes deal with issues like this by writing a small function, such as the one below, to make changes in a file after import. It uses dplyr’s transmute() to create a new Month column from MonthStarting as Date objects, and a new Total column from Value as numbers. I also make sure to keep the Category column (transmute() drops all columns not explicitly mentioned).
library(dplyr)
library(lubridate)
process_file <- function(myfile) {
  rio::import(myfile) %>%
    dplyr::transmute(
      Category = as.character(Category),
      Month = lubridate::mdy(MonthStarting),
      Total = readr::parse_number(Value)
    )
}
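To see what that transmute() call does without touching any files, here is a minimal sketch that runs the same three expressions on a small in-memory data frame. The toy data frame and its Notes column are made up for illustration; only the Value and MonthStarting strings are borrowed from file1.csv above.

```r
library(dplyr)
library(lubridate)
library(readr)

# Hypothetical stand-in for one imported file; Notes is an extra column
toy <- data.frame(
  Category = c("A", "B"),
  Value = c("$4,256.48 ", "$438.22"),
  MonthStarting = c("12/1/20", "12/1/20"),
  Notes = c("checked", "pending")
)

toy_processed <- toy %>%
  transmute(
    Category = as.character(Category),
    Month = mdy(MonthStarting),   # character -> Date
    Total = parse_number(Value)   # character -> numeric
  )

# Notes is absent because transmute() keeps only the columns it names
print(names(toy_processed))
```

Swapping transmute() for mutate() in the sketch would keep Notes, along with the original Value and MonthStarting columns.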
I like to use readr’s parse_number() function for converting values that come in as character strings because it deals with commas, dollar signs, or percent signs in numbers. However, parse_number() requires character strings as input. If a value is already a number, parse_number() will throw an error.
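A quick way to check both behaviors is the sketch below. The input strings are illustrative, and wrapping the numeric call in tryCatch() is my own addition so the script keeps running whether or not your installed readr version rejects numeric input.

```r
library(readr)

# Character input: currency symbols, percent signs, and grouping commas are stripped
parsed <- parse_number(c("$4,256.48 ", "45%", "1,200"))
print(parsed)

# Numeric input: tryCatch() records the error message if parse_number() raises one
numeric_attempt <- tryCatch(
  parse_number(3738),
  error = function(e) paste("Error:", conditionMessage(e))
)
print(numeric_attempt)
```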
My new function works fine when I test it on the first two files in my data directory using purrr’s map_df() function.
my_results <- map_df(my_data_files[1:2], process_file)
But if I try running my function on all the files, including the one where Value imports as numbers, it will choke.
all_results <- map_df(my_data_files, process_file)
Error: Problem with `mutate()` input `Total`.
x is.character(x) is not TRUE
ℹ Input `Total` is `readr::parse_number(Value)`.
Run `rlang::last_error()` to see where the error occurred.
That error tells me Value is not a character column in one of the files, but I’m not sure which one. Ideally, I’d like to run through all the files, marking the one(s) with problems as errors but still processing all of them instead of stopping at the error.
possibly() lets me do this by creating a brand new function from my original function:
safer_process_file <- possibly(process_file, otherwise = "Error in file")
The first argument for possibly() is my original function, process_file. The second argument, otherwise, tells possibly() what to return if there’s an error.
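The same pattern can be tried on a toy case that needs no files. This is a minimal sketch, with safe_log as a hypothetical name, that wraps base R’s log(), which errors on non-numeric input.

```r
library(purrr)

# safe_log is a hypothetical wrapper: same as log(), but returns NA on error
safe_log <- possibly(log, otherwise = NA_real_)

ok_value  <- safe_log(10)    # behaves like log(10)
bad_value <- safe_log("a")   # log("a") would stop with an error; this returns NA

# Mapping over mixed input no longer halts at the bad element
values <- map_dbl(list(10, "a", 100), safe_log)
print(values)
```

Because otherwise is a number here, every result has the same type and map_dbl() can return a plain numeric vector. With a character fallback such as "Error in file", the results are mixed types and need map() instead.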
To apply my new safer_process_file() function to all my files, I’ll use purrr’s map() function instead of map_df(). That’s because the results need to come back as a list, not a single data frame: if there’s an error, that file’s result won’t be a data frame. It will be the character string I told otherwise to return, and that can’t be combined row-wise with the data frames from the other files.
all_results <- map(my_data_files, safer_process_file)
str(all_results, max.level = 1)
List of 5
 $ :'data.frame': 3 obs. of 3 variables:
 $ :'data.frame': 3 obs. of 3 variables:
 $ :'data.frame': 3 obs. of 3 variables:
 $ : chr "Error in file"
 $ :'data.frame': 3 obs. of 3 variables:
You can see here that the fourth item, from my fourth file, is the one with the error. That’s easy to see with only five items, but wouldn’t be quite so easy if I had a thousand files to import and three had errors.
If I name the list with my original file names, it’s easier to identify the problem file:
names(all_results) <- my_data_files
str(all_results, max.level = 1)
List of 5
 $ data_files/file1.csv:'data.frame': 3 obs. of 3 variables:
 $ data_files/file2.csv:'data.frame': 3 obs. of 3 variables:
 $ data_files/file3.csv:'data.frame': 3 obs. of 3 variables:
 $ data_files/file4.csv: chr "Error in file"
 $ data_files/file5.csv:'data.frame': 3 obs. of 3 variables:
I can even save the results of str() to a text file for further examination.
# Passing str() directly to capture.output() avoids depending on how the pipe evaluates its left side
capture.output(str(all_results, max.level = 1), file = "results.txt")
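With hundreds of files, scanning str() output by eye gets tedious, so it may help to pull out the failed names programmatically. The sketch below is one possible approach, not part of the original workflow: it builds a small stand-in for the named all_results list (the data frames are placeholders) and keeps the names of every element that is not a data frame.

```r
library(purrr)

# Stand-in for the named list returned by map(); contents are placeholders
all_results <- list(
  "data_files/file1.csv" = data.frame(Total = 1),
  "data_files/file4.csv" = "Error in file",
  "data_files/file5.csv" = data.frame(Total = 2)
)

# TRUE for each element that came back as a data frame
is_ok <- map_lgl(all_results, is.data.frame)

failed_files <- names(all_results)[!is_ok]   # files that hit the otherwise value
good_results <- keep(all_results, is.data.frame)

print(failed_files)
```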
Now that I know file4.csv is the problem, I can import just that one and confirm what the issue is.
x4 <- rio::import(my_data_files[4])
str(x4)
'data.frame': 3 obs. of 3 variables:
 $ Category     : chr "A" "B" "C"
 $ Value        : num 3738 723 5494
 $ MonthStarting: chr "9/1/20" "9/1/20" "9/1/20"
Ah, Value is indeed coming in as numeric. I’ll revise my process_file() function to account for the possibility that Value isn’t a character string. Because is.character(Value) returns a single TRUE or FALSE for the whole column, a plain if/else is the right check; the vectorized ifelse() would return a result only as long as that one-element test, so every row in a file would get the first row’s value.
process_file2 <- function(myfile) {
  rio::import(myfile) %>%
    dplyr::transmute(
      Category = as.character(Category),
      Month = lubridate::mdy(MonthStarting),
      # Parse only when Value arrived as character; otherwise keep the numbers
      Total = if (is.character(Value)) readr::parse_number(Value) else as.numeric(Value)
    )
}
Now if I use purrr’s map_df() with my new process_file2() function, it should work and give me a single data frame.
all_results2 <- map_df(my_data_files, process_file2)
str(all_results2)
'data.frame': 15 obs. of 3 variables:
 $ Category: chr "A" "B" "C" "A" ...
 $ Month   : Date, format: "2020-12-01" "2020-12-01" "2020-12-01" ...
 $ Total   : num 4256 438 945 3156 ...
That’s just the data and format I wanted, thanks to wrapping my original function in possibly() to create a new, error-handling function.
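One variation worth knowing, offered as a sketch and not as the approach used above: if otherwise is NULL instead of a string, the failures can be dropped with purrr’s compact() and the successes combined in one step, while the failed names are still recoverable. The to_row() function and its inputs are hypothetical stand-ins for process_file() and the data files.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for process_file(): errors on non-character input
to_row <- function(x) {
  stopifnot(is.character(x))
  data.frame(value = x)
}

safer_to_row <- possibly(to_row, otherwise = NULL)

inputs  <- list(a = "one", b = 2, c = "three")
results <- map(inputs, safer_to_row)

# Names of the inputs that failed (their result is NULL)
failed <- names(results)[map_lgl(results, is.null)]

# compact() drops the NULLs; bind_rows() stacks what is left
combined <- bind_rows(compact(results), .id = "source")

print(failed)
print(combined)
```

The trade-off is that a NULL fallback makes failures silent unless you check for them, whereas a string such as "Error in file" stays visible in str() output.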