Before we can start working with data we need to import it into the R environment. Base R (the language as installed without any additional libraries) provides a number of functions for reading data in multiple formats, but the performance benefits and ease of use of some functions provided by external libraries makes them the go to choice for many users.
The following examples cover reading some of the most common types of data into R. Whilst by no means an exhaustive list of all the available options, the functions detailed are performant, stable, and easy to use.
CSV
One of the most common cross-platform methods for storing typically tabular data.
readr::read_csv()
The tidyverse package readr
offers the read_csv()
function. It’s relatively performant, it makes a good job of inferring types, and for users planning to use the tidyverse it helpfully returns a tibble.
During interactive use calling the function prints some useful information regarding the delimiter used and column types detected.
read_csv
is fast, predictable, and stable. The additional information provided can be useful during interactive analysis. The readr
package is very widely used and is actively maintained and developed, providing some confidence for developers aiming for long term reproducibility.
A full list of read_csv
’s arguments can be found by running ?readr::read_csv
.
data.table::fread()
When performance and speed is of the utmost importance nothing really comes close to the data.table
package. The fread()
function is simple to use, blazingly fast, and the 10,000 row sample it takes of equally spaced data throughout the target file allows it to make very accurate guesses at column type.
fread
returns a data.table
class object by default, however, we can easily have the function return a data.frame
by passing the argument data.table=FALSE
. data.table
’s can be used anywhere you would use a data.frame
.
The data.table
package is widely used and is actively maintained and developed, like with readr
this provides some confidence for developers aiming for long term reproducibility. Given its speed data.table
is a good option for any process going into a production environment and long running work flows.
A full list of fread
’s arguments can be found by running ?data.table::fread
.
Which should I use?
readr::read_csv()
and data.table::fread()
are both solid choices for reading a CSV file. The most pressing factors may come down to speed and the packages you intend to use for the rest of your process. If relying heavily upon the tidyverse it makes sense to lean toward readr
, whereas when working with large datasets data.table
will likely make more sense.
XLSX
The ubiquitous format for excel spreadsheets since 2007. Still pervasively popular and sometimes misused despite a wide array of alternatives. Being able to confidently read .xlsx
files will definitely be come in handy.
openxlsx::read.xlsx()
openxlsx
is a well established package, with fewer dependencies than some of the alternatives (no Java dependency). The read.xlsx()
function is straightforward to use.
Dates can sometimes be read as integers when reading .xlsx files, but this is easily overcome by supplying the argument detectDates=TRUE
. Generally, the column type detection is adequate.
When working with excel workbooks we often find notes or titles contained in the first few lines of a worksheet, followed by the tabular data we want to import. We may also have workbooks containing multiple worksheets, where only specific worksheets are of interest.
openxlsx::read.xlsx()
makes it easy for us to specify and read data from specific worksheets and cell ranges.
A full list of read.xlsx
’s arguments can be found by running ?openxlsx::read.xlsx
.
RDS
readRDS()
One of R’s own formats for storing data, RDS files store a single R object. Reading them is achieved with the base function readRDS
.
An RDS file captures the state of an object and as such each column of our data reads in with the same type and class attributes as when it was created.
A full list of readRDS
’s arguments can be found by running ?readRDS
.
ZIP
ZIP is an archive format used for data compression. A ZIP file can contain multiple files and even directories that have been compressed, however, we may often encounter a ZIP file containing a single CSV file.
We have a couple of options for unzipping and reading the CSV for which we welcome back the readr
and data.table
packages.
readr::read_csv()
readr
offers the simplest option with the read_csv
function, it detects that the target is compressed and using its internal method dispatch handles the unzip and import.
data.table::fread()
The data.table
option once again uses fread
. The usage is a little more complex but is also highly customizable. We can use the cmd
argument to pass a shell command to handle the preprocessing of the target file.
Our shell command will contain;
unzip
to extract the file-p
an option to extract files to stdoutexample.zip
the path to the target file
JSON
jsonlite::fromJSON()
JSON (JavaScript Object Notation) is a standardized file format used to store key value attributes in a format specifically designed to be human readable. It is commonly used for configuration files and transmitting data over the internet.
We can read JSON files using fromJSON
from the jsonlite
package.
A full list of fromJSON
’s arguments can be found by running ?jsonlite::fromJSON
.
YAML
yaml::yaml.load_file()
YAML (Yet Another Markup Language YAML Ain’t Markup Language) is another format commonly used for configuration files.
We can read YAML using the yaml.load_file
function from the yaml
package.
A full list of yaml.load_file
’s arguments can be found by running ?yaml::yaml.load_file
.
RData
load()
Another format specific to R, RData can be used to store a complete workspace or selected objects. Its a less widely used format, but simple enough to read using the base function load
. The load
function is a little different to the other examples given as we don’t assign its output, rather the function loads the objects contained within the RData file straight into out environment.
A full list of load
’s arguments can be found by running ?load
.
Proprietary formats
The haven
package provides functionality for reading a number of proprietary file formats, including those used by SPSS, SAS, and Stata. The full package documentation provides further details.
To read a SPSS SAV file we can use the package’s read_sav
function. Being a part of the tidyverse family, haven
functions return a tibble.
Next Steps
The above functions will allow us to read many of the most commonly encountered file types used for storing data. The R community is very active in developing solutions to common challenges and packages exist for reading data from many more sources than those covered here. Stack Overflow and the rstats subreddit are often useful resources for browsing solutions and asking questions.