Not just a speedy way to import data into R, fread has useful options for importing subsets of data, setting data types for columns, running system commands, and more Credit: Thinkstock Like all functions in the data.table R package, fread is fast. Very fast. But there’s more to fread than speed. It has several helpful features and options when importing external data into R. Here are five of the most useful. Note: If you’d like to follow along, download the New York Times CSV file of daily Covid-19 cases by U.S. county at https://github.com/nytimes/covid-19-data/raw/master/us-counties.csv. Use fread’s nrows option Is your file large? Would you like to examine its structure before importing the whole thing – without having to open it in a text editor or Excel? Use fread’s nrows option to import only a portion of a file for exploration. The code below imports just the first 10 rows of the CSV. mydt10 <- fread("us-counties.csv", nrows = 10) If you just want to see column names without any data at all, you can use nrows = 0. Use fread’s select option Once you know the file structure, you can choose which columns to import. fread’s select option lets you pick columns you want to keep. select takes a vector of either column names or column-position numbers. If names, they need to be in quotation marks, like most vectors of character strings: mydt <- fread("us-counties.csv", select = c("date", "county", "state", "cases")) As always, numbers don’t need quotation marks: mydt <- fread("us-counties.csv", select = c(1,2,3,5)) You can use an R object with a vector of column names inside fread, as you can see in this next group of code. I create a vector my_cols with date, county, state, and cases; then I use that vector inside fread. my_cols <- c("date", "county", "state", "cases") mydt <- fread("us-counties.csv", select = my_cols) The opposite of select is drop. You can choose to import all columns except the ones you specify with drop, such as: mydt <- fread("us-counties.csv", drop = c("fips", "deaths")) Like with select, drop takes a vector of column names or numerical positions. Use fread with grep If you’re familiar with Unix, you can execute command-line tools right from inside fread. For example, if I just wanted California data, I could use grep to only import lines that contain the text “California.” Note that this searches each entire row as a text string, not a specific column, so your data has to be in a format where that makes sense. ca <- fread("grep California us-counties.csv") Unfortunately, grep doesn’t understand the original file’s column names, so you end up with default names. head(ca) V1 V2 V3 V4 V5 V6 1: 2020-01-25 Orange California 6059 1 0 2: 2020-01-26 Los Angeles California 6037 1 0 3: 2020-01-26 Orange California 6059 1 0 4: 2020-01-27 Los Angeles California 6037 1 0 5: 2020-01-27 Orange California 6059 1 0 6: 2020-01-28 Los Angeles California 6037 1 0 However, fread lets us specify column names with the col.names option. I can set the names based on names from mydt10 that I created above. ca <- fread("grep California us-counties.csv", col.names = names(mydt10)) > head(ca) date county state fips cases deaths 1: 2020-01-25 Orange California 6059 1 0 2: 2020-01-26 Los Angeles California 6037 1 0 3: 2020-01-26 Orange California 6059 1 0 4: 2020-01-27 Los Angeles California 6037 1 0 5: 2020-01-27 Orange California 6059 1 0 6: 2020-01-28 Los Angeles California 6037 1 0 We can also use regular expressions, with grep’s -E option, letting us do more complex searches, such as looking for four states at once. states4 <- fread(cmd = "grep -E 'Texas|Arizona|Florida|South Carolina' us-counties.csv", col.names = names(mydt10)) Once again, a reminder: This is looking for each of those state names anywhere in the row, not just in the state column. If you run the code above and check what states are included in the results with unique(states4$state), you’ll see Oklahoma and Missouri in the states column along with Texas, Arizona, Florida, and South Carolina. That’s because both Oklahoma and Missouri have counties named Texas. So, grep during file import is a way to filter out a lot of data you don’t want from a very large data set; but it doesn’t guarantee you only get what you want. After this kind of import, you should still filter specifically on column data to make sure you didn’t get anything unexpected. Use fread’s colClasses option You can set column classes during import – for just a few columns, not every one. For example, the date column in this data is coming in as character strings, even though it’s in year-month-day format. We can set the column named date to the data type Date during import using the colClasses option. mydt <- fread("us-counties.csv", colClasses = c("date" = "Date")) Now, dates are Dates. > str(mydt) Classes ‘data.table’ and 'data.frame': 322651 obs. of 6 variables: $ date : Date, format: "2020-01-21" "2020-01-22" "2020-01-23" ... $ county: chr "Snohomish" "Snohomish" "Snohomish" "Cook" ... $ state : chr "Washington" "Washington" "Washington" "Illinois" ... $ fips : int 53061 53061 53061 17031 53061 6059 17031 53061 4013 6037 ... $ cases : int 1 1 1 1 1 1 1 1 1 1 ... $ deaths: int 0 0 0 0 0 0 0 0 0 0 ... Use fread on zipped files You can import a zipped file without unzipping it first. fread can import gz and bz2 files directly, such as mydt <- fread("myfile.gz"). If you need to import a zip file, you can unzip it with the unzip system command within fread, using the syntax mydt <- fread(cmd = 'unzip -cq myfile.zip'). For more R tips, head to InfoWorld’s Do More With R page. Related content analysis 7 steps to improve analytics for data-driven organizations Effective data-driven decision-making requires good tools, high-quality data, efficient processes, and prepared people. Here’s how to achieve it. By Isaac Sacolick Jul 01, 2024 10 mins Analytics news Maker of RStudio launches new R and Python IDE Posit, formerly RStudio, has released a beta of Positron, a ‘next generation’ data science development environment based on Visual Studio Code. By Sharon Machlis Jun 27, 2024 3 mins Integrated Development Environments Python R Language feature 4 highlights from EDB Postgres AI New platform product supports transactional, analytical, and AI workloads. By Aislinn Shea Wright Jun 13, 2024 6 mins PostgreSQL Generative AI Databases analysis Microsoft Fabric evolves from data lake to application platform Microsoft delivers a one-stop shop for big data applications with its latest updates to its data platform. By Simon Bisson Jun 13, 2024 7 mins Microsoft Azure Natural Language Processing Data Architecture Resources Videos