Like all functions in the data.table R package, fread is fast. Very fast. But there’s more to fread than speed. It has several helpful features and options when importing external data into R. Here are five of the most useful.
Note: If you’d like to follow along, download the New York Times CSV file of daily Covid-19 cases by U.S. county at https://github.com/nytimes/covid-19-data/raw/master/us-counties.csv.
Use fread’s nrows option
Is your file large? Would you like to examine its structure before importing the whole thing – without having to open it in a text editor or Excel? Use fread’s nrows
option to import only a portion of a file for exploration.
The code below imports just the first 10 rows of the CSV.
mydt10 <- fread("us-counties.csv", nrows = 10)
If you just want to see column names without any data at all, you can use nrows = 0
.
Use fread’s select option
Once you know the file structure, you can choose which columns to import. fread’s select
option lets you pick columns you want to keep. select
takes a vector of either column names or column-position numbers. If names, they need to be in quotation marks, like most vectors of character strings:
mydt <- fread("us-counties.csv",
select = c("date", "county", "state", "cases"))
As always, numbers don’t need quotation marks:
mydt <- fread("us-counties.csv", select = c(1,2,3,5))
You can use an R object with a vector of column names inside fread, as you can see in this next group of code. I create a vector my_cols with date, county, state, and cases; then I use that vector inside fread.
my_cols <- c("date", "county", "state", "cases")
mydt <- fread("us-counties.csv", select = my_cols)
The opposite of select
is drop
. You can choose to import all columns except the ones you specify with drop
, such as:
mydt <- fread("us-counties.csv", drop = c("fips", "deaths"))
Like with select
, drop
takes a vector of column names or numerical positions.
Use fread with grep
If you’re familiar with Unix, you can execute command-line tools right from inside fread. For example, if I just wanted California data, I could use grep to only import lines that contain the text “California.” Note that this searches each entire row as a text string, not a specific column, so your data has to be in a format where that makes sense.
ca <- fread("grep California us-counties.csv")
Unfortunately, grep doesn’t understand the original file’s column names, so you end up with default names.
head(ca) V1 V2 V3 V4 V5 V6 1: 2020-01-25 Orange California 6059 1 0 2: 2020-01-26 Los Angeles California 6037 1 0 3: 2020-01-26 Orange California 6059 1 0 4: 2020-01-27 Los Angeles California 6037 1 0 5: 2020-01-27 Orange California 6059 1 0 6: 2020-01-28 Los Angeles California 6037 1 0
However, fread lets us specify column names with the col.names
option. I can set the names based on names from mydt10 that I created above.
ca <- fread("grep California us-counties.csv", col.names = names(mydt10)) > head(ca) date county state fips cases deaths 1: 2020-01-25 Orange California 6059 1 0 2: 2020-01-26 Los Angeles California 6037 1 0 3: 2020-01-26 Orange California 6059 1 0 4: 2020-01-27 Los Angeles California 6037 1 0 5: 2020-01-27 Orange California 6059 1 0 6: 2020-01-28 Los Angeles California 6037 1 0
We can also use regular expressions, with grep’s -E
option, letting us do more complex searches, such as looking for four states at once.
states4 <- fread(cmd = "grep -E 'Texas|Arizona|Florida|South Carolina' us-counties.csv",
col.names = names(mydt10))
Once again, a reminder: This is looking for each of those state names anywhere in the row, not just in the state column. If you run the code above and check what states are included in the results with unique(states4$state)
, you’ll see Oklahoma and Missouri in the states column along with Texas, Arizona, Florida, and South Carolina. That’s because both Oklahoma and Missouri have counties named Texas.
So, grep during file import is a way to filter out a lot of data you don’t want from a very large data set; but it doesn’t guarantee you only get what you want. After this kind of import, you should still filter specifically on column data to make sure you didn’t get anything unexpected.
Use fread’s colClasses option
You can set column classes during import – for just a few columns, not every one. For example, the date column in this data is coming in as character strings, even though it’s in year-month-day format. We can set the column named date to the data type Date during import using the colClasses
option.
mydt <- fread("us-counties.csv", colClasses = c("date" = "Date"))
Now, dates are Dates.
> str(mydt) Classes ‘data.table’ and 'data.frame': 322651 obs. of 6 variables: $ date : Date, format: "2020-01-21" "2020-01-22" "2020-01-23" ... $ county: chr "Snohomish" "Snohomish" "Snohomish" "Cook" ... $ state : chr "Washington" "Washington" "Washington" "Illinois" ... $ fips : int 53061 53061 53061 17031 53061 6059 17031 53061 4013 6037 ... $ cases : int 1 1 1 1 1 1 1 1 1 1 ... $ deaths: int 0 0 0 0 0 0 0 0 0 0 ...
Use fread on zipped files
You can import a zipped file without unzipping it first. fread can import gz and bz2 files directly, such as mydt <- fread("myfile.gz")
. If you need to import a zip file, you can unzip it with the unzip
system command within fread, using the syntax mydt <- fread(cmd = 'unzip -cq myfile.zip')
.
For more R tips, head to InfoWorld’s Do More With R page.