This part of the course is dedicated to data manipulation in R at the RDBMS level. We review different levels of manipulation, from importing data to actual in-memory manipulation. A somewhat dated source on this topic is the official R Data Import/Export manual. A better source is the Efficient R Programming book (chapter 5 on input/output and chapter 6 on data carpentry).
The basic tool for data "manipulation" in R is the data.frame format coupled with the read.table function. The former is the standard way of representing data tables (a.k.a. rectangular grids) in R, while the latter enables users to load spreadsheet-like data from text files.
While this pair is convenient and easy to use, it has major problems, especially when dealing with large scale data (even medium scale data, in fact). In particular, users tend to confuse the need for importing data into R with the need for storing data in an R workflow. While importing data, reliance on text files is sometimes mandatory. However, once the data has been imported into R, there is no reason to rely on text files, apart from exporting data to, e.g., another program.
We focus here on the importation aspect as it is important for interaction with the non-R world, but keep in mind that using the native RDS format is the best solution for R-only processing.
read.table
The read.table function and its friends, read.csv, read.delim, etc., are inherently limited by the text format on which they rely: the format is ambiguous (it comes in numerous variants) and parsing it is both slow and memory hungry.
Notice that variations of the CSV format should be avoided (stick to the RFC 4180 format and use write.csv when exporting). However, in our context, the biggest problem is the inefficiency of the function, which can be very slow at loading even medium-sized data.
For instance, reading a medium-sized matrix as in the following code takes roughly 110 seconds (on a Xeon CPU E3-1271 v3 @ 3.60GHz (4 cores) with a standard hard drive and 16 GB of main memory).
a <- matrix(rnorm(5000*5000), 5000, 5000)        ## a 5000 x 5000 random matrix (~190 MB in memory)
write.csv(a, file="demo-a.csv", row.names=FALSE) ## export it to CSV
a <- as.matrix(read.csv("demo-a.csv"))           ## read it back: roughly 110 seconds
Notice in addition that the file itself is large: 450 MB against around 190 MB when in memory. Moreover, the function uses a very large amount of memory during its execution. For instance, when only the reading operation is performed above, the peak memory use is 2.5 GB (notice that this occupation is not a consequence of converting the result to a matrix).
As a comparison, reading from the binary representation as in the following code takes only 1 second on the same computer. In addition, the file uses 190 MB and the memory usage is only 230 MB, both as expected.
a <- matrix(rnorm(5000*5000), 5000, 5000)
saveRDS(a,file="demo-a.Rds")
a <- readRDS("demo-a.Rds")
There are ways to improve read.table. A few tricks can be used at the read.table level. For instance, one can use the colClasses parameter to specify the nature of the columns as a way to reduce memory usage. Overestimating the number of rows also helps (nrows parameter). Finally, removing the possibility of comments by setting the comment.char parameter to "" speeds up loading. However, those improvements are somewhat minor. For instance, the following code runs in 60 seconds but still uses around 2.5 GB of memory.
a <- as.matrix(read.csv("demo-a.csv", colClasses="numeric",
                        nrows=6000, comment.char=""))
The canonical solution in plain R consists in using the scan function. It is a low-level function, but it is much better suited to reading large matrices, for instance. In our context, scan can be used as follows.
## skip=1 drops the header line written by write.csv
a <- matrix(scan("demo-a.csv", skip=1, what=numeric(0), n=5000*5000, sep=","),
            5000, 5000, byrow=TRUE)
This code runs in 12 seconds and uses 430 MB of memory. While this uses quite a lot of resources compared to the binary format solution, the improvement over the naive use of read.table is significant.
Beyond read.table and scan
Fortunately, importing text files into R can be done with more efficient functions than read.table and scan. They come from external packages.
The readr package is specifically designed to read CSV files and related formats. A simple way to use it is given in the following snippet:
require(readr)
a <- read_csv("demo-a.csv")
This takes around 24 seconds to read the matrix, with a memory usage of 500 MB. As in the case of read.csv and friends, the efficiency of the function can be improved by providing hints to it such as the type of the columns and the size of the resulting data frame. For instance, the following code runs in only 10 seconds (and uses 450 MB of memory).
a <- read_csv("demo-a.csv",
              col_types=cols(.default=col_double()))
In fact, most of the overhead of the base call comes from reading a lot of lines in order to guess the structure of the columns. A very simple way of reducing the overhead is to set the guess_max parameter to a small number of lines (the default is 1000). That might fail on complex data, but on numerical matrices it works nicely. For instance, the following code runs in 10 seconds.
require(readr)
a <- read_csv("demo-a.csv",guess_max=10)
Notice that read_csv outputs an R object that is slightly richer than a data frame. For instance, a column specification can be extracted from such an object with the spec function (this can be useful for reading several files with a similar format).
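For instance, a minimal sketch of this reuse pattern (the second file, demo-b.csv, is hypothetical and assumed to have the same layout as demo-a.csv):
require(readr)
## read a first file and keep the column specification that was guessed
a <- read_csv("demo-a.csv", guess_max=10)
col_spec <- spec(a)
## reuse the specification on another file with the same layout
## ("demo-b.csv" is a hypothetical second file)
b <- read_csv("demo-b.csv", col_types=col_spec)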
Another solution is provided by the data.table package. The main goal of this package is not importing data into R, but it offers the fread function which loads CSV files efficiently. This function is efficient by default. For instance, the following code runs in 5 seconds and uses 240 MB of memory.
require(data.table)
a <- fread("demo-a.csv")
For normal CSV files, the best solution is fread from data.table, by far. On large files (several GB), fread and read_csv (from readr) are roughly equivalent, while the latter can be quite slow on very small files (with no consequences). The only limitation of those fast methods is that they might fail to detect or to report inconsistencies in the data format because they base their analysis on the first lines of the file.
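With readr, one way to double-check an import afterwards is the problems function, which reports the parsing issues recorded by read_csv (the file bad.csv below is hypothetical):
require(readr)
## read a file whose later lines may not match the guessed column types
x <- read_csv("bad.csv", guess_max=10)
## inspect the parsing problems recorded during the import (row, column, issue)
problems(x)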
Obviously, importing data into R is only the first step of data analysis. In order to perform data mining, one needs to manipulate the data, performing what is sometimes called data carpentry. The goal is basically to prepare the data for data mining, by reorganizing them, cleaning them, etc. Many of those tasks can be done via simple R scripts. However, there are much more efficient solutions provided by several packages, especially by dplyr and data.table.
One should be particularly careful about R's tendency to copy data structures, for instance columns in data frames. On small data, that's not an issue, but on large ones, that can become very problematic. As a consequence, an unwritten rule about R is to avoid working with data structures that are larger than roughly 20% to 33% of your available memory.
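As a minimal illustration of this copying behaviour, using the base tracemem function (the data frame below is a made-up example):
## a toy data frame of roughly 16 MB
d <- data.frame(x=rnorm(1e6), y=rnorm(1e6))
## ask R to report every duplication of d
tracemem(d)
## modifying one column makes R duplicate the data frame (tracemem prints a message)
d$x <- d$x * 2
untracemem(d)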
We work here with a simple data set from governmental open data, the HM Land Registry Price Paid Data, which contains land property transactions in the UK. The 2016 data is of a reasonable size, roughly 1 million observations, while the full data set starts in 1995 and currently contains almost 23 million transactions (year 2016 included).
We want to perform on those data very standard data carpentry, such as date formatting and all sorts of basic aggregate calculations. Those calculations are presented in the next section with the basic plain R implementation.
As in the case of data importation, the basic approach relies on data frames and on associated functions. Notice in particular that R includes numerous database-like functions, such as aggregate, which resembles the GROUP BY SQL statement, and merge, which is comparable to the JOIN clauses.
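As a quick, made-up illustration of merge acting as a join (the two small data frames below are hypothetical):
## average prices per post code and a post code to county lookup table
prices  <- data.frame(postcode=c("AB1", "AB2"), price=c(100000, 250000))
regions <- data.frame(postcode=c("AB1", "AB2"), county=c("X", "Y"))
## merge behaves like an (inner) join on the shared postcode column
merge(prices, regions, by="postcode")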
Surprisingly, with minimal precautions, the 2016 data file can be read rather quickly (a few seconds) in R with the base read.csv function as follows:
pp2016 <- read.csv("data sets/pp-2016.csv",
                   colClasses=rep("character",16),
                   nrows=1040000, comment.char="",
                   stringsAsFactors=FALSE,
                   header=FALSE)
In this data set, the second column gives the price paid for a property, while the fourth column gives the post code of the property. A basic data manipulation would be to ask for the average price paid per post code. In plain R this can be done via the aggregate function.
## prices are loaded as text: convert them to numbers first
pp2016$V2 <- as.integer(pp2016$V2)
## average price (V2) per post code (V4)
price.per.pc <- aggregate(V2~V4, data=pp2016, mean)
This takes roughly 16 seconds. As the post code is a rather high resolution, one can also aggregate the price over other geographical zones. For instance, variable number 14 gives the county, which is the coarsest aggregation level directly available in those data.
price.per.county <- aggregate(V2~V14,data=pp2016,mean)
Here the result is obtained almost instantaneously because of the small number of counties.
Another natural question concerns the evolution of the prices through time. The third variable contains the date of the transaction but it is loaded in a text format. So we first need to convert the text into R Date objects and then extract from those objects different variables useful for aggregating the transactions at different time resolutions.
pp2016$Date <- as.POSIXct(pp2016$V3)                 ## parse the transaction date
pp2016$Month <- months(pp2016$Date)                  ## month name
pp2016$yday <- as.integer(format(pp2016$Date,"%j"))  ## day of the year (1-366)
price.per.month <- aggregate(V2~Month, data=pp2016, mean)
price.per.yday <- aggregate(V2~yday, data=pp2016, mean)
All those operations run quickly and the whole processing time is only a few seconds.
However, the whole data set consists of more than 20 times more data than the 2016 data and thus the limitations of bare R appear:
- read.csv takes 210 seconds and uses 8 GB of memory (this is a 3.7 GB file and the internal representation uses 5 GB);
- the post code aggregation with aggregate takes around 40 seconds (and the county level one around 12 seconds);
- converting the dates with as.POSIXct exhausts the memory of the test computer.
This last problem can be circumvented by using the IDate facility of data.table, as follows (for the 2016 case):
require(data.table)
pp2016$Date <- as.IDate(pp2016$V3)
pp2016$Month <- month(pp2016$Date)
pp2016$yday <- yday(pp2016$Date)
However, the memory usage remains enormous, with peaks around 12 GB for any "complex" operation carried out on the data. When the consumption peaks slightly higher, the R process can be killed on a 16 GB computer. When things remain under control, the monthly aggregation takes around 13 seconds while the day-of-year one takes slightly less, around 12 seconds.
Thus base R is more limited here by memory issues than by pure performance problems.
The data.table package has been designed to ease data carpentry in R by introducing SQL-like facilities into the data frame model. In addition, the package is very efficient at handling large data, both during loading (as demonstrated before) and during processing.
Loading the short data set is almost instantaneous (1.5 seconds) and uses only a limited amount of memory. Notice that the loaded object is both a data frame and a data table.
require(data.table)
pp2016 <- fread("data sets/pp-2016.csv",header=FALSE)
A very efficient aspect of data.table is its ability to avoid useless data copies via a reference semantics that enables in-place modification of a data table. Here, we use it to convert the values of the second column to numbers.
pp2016[,V2:=as.numeric(V2)]
The key point is the use of the := operator, which tells R to update the column in place rather than to copy it, as would be the case with a standard assignment.
Moreover, data.table implements a general indexing/aggregating facility that is summarized by the following construct:
DT[i, j, by]
in which DT stands for a data table. In this construction, i is a selection operator that filters the data table based on some condition. Then j stands for some operation to perform on the selected rows, for instance an aggregation operation. Finally, by corresponds to a grouping of the results according to some criterion.
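For instance, a small sketch combining the three parts on our data (the 500000 threshold is arbitrary, chosen only for illustration): average price (V2) of the transactions above 500000, grouped by county (V14).
## i: keep expensive transactions, j: average price, by: group by county
pp2016[V2 > 500000, mean(V2), by=V14]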
This can be used to compute the mean transaction price aggregated over post codes as follows:
pp2016[,mean(V2),by=V4]
Here, we use no filtering/selection, then we compute averages over variable V2 while grouping values based on variable V4. This implementation is very fast as the result is obtained almost instantaneously (contrast with 40 seconds in the aggregate case).
In order to do date-based processing, we add columns to the data table using the reference semantics:
pp2016[,`:=`(Date = as.IDate(V3))]
pp2016[,`:=`(Month = month(Date), yday = yday(Date))]
Then we can use the data table's main feature to compute the aggregates:
price.per.month <- pp2016[,mean(V2),by=Month]
price.per.yday <- pp2016[,mean(V2),by=yday]
Results are obtained instantaneously. It is also very easy in this case to filter by whatever criterion is of interest (this is also possible in aggregate via the subset parameter). For instance, the Greater London county has the highest transaction prices and one might be interested in their monthly evolution. This is obtained as follows:
## data.table version
pp2016[V14=="GREATER LONDON", mean(V2), by=Month]
## equivalent plain R version with aggregate
aggregate(V2~Month, mean, data=pp2016, subset=pp2016$V14=="GREATER LONDON")
Both calculations are quite fast on this data set, but the data.table one is (10 times) faster and arguably much more readable.
Moving on to the full data set:
- loading the CSV file takes 160 seconds with fread, and uses 5.3 GB of memory;
- the post code aggregation takes about 3 seconds (versus 40 seconds with aggregate);
- the county level aggregation is almost instantaneous (versus 12 seconds with aggregate).
Like data.table, dplyr has been designed to ease data manipulation in R but also to speed it up. Dplyr follows data.table in providing filtering, aggregation and other high-level operations inspired by the SQL model. It operates on extended data frames, the tibbles. The companion package readr can be used to read tabular data directly into the tibble format. For our running example, this gives:
require(readr)
pp2016 <- read_csv("data sets/pp-2016.csv",col_names=FALSE)
Reading is quite fast for the small data set (8 seconds) and does not waste memory (461 MB). In addition, read_csv correctly guesses the type of the columns, especially dates, something that is not done by fread.
In order to aggregate with a GROUP BY like approach, one uses group_by in dplyr. In our running example, this gives:
require(dplyr)
pp2016_by_pc <- group_by(pp2016,X4)
Once this operation has been done (it takes around 7 seconds on the small data set), one can apply other dplyr operations: they will respect the grouping structure. For instance, the average value per postal code is obtained as follows:
price.per.pc <- summarise(pp2016_by_pc,price=mean(X2))
This operation is instantaneous on the small data set. The county level summary can be obtained very quickly by combining the two operations directly, as follows.
price.per.county <- summarise(group_by(pp2016,X14),price=mean(X2))
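The same computation can also be written with the pipe operator re-exported by dplyr, which many find more readable; a minimal equivalent sketch:
require(dplyr)
## group by county (X14), then average the price (X2) within each group
price.per.county <- pp2016 %>%
  group_by(X14) %>%
  summarise(price=mean(X2))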
Like data.table, dplyr can easily add columns to an existing data frame. We use it here to add the temporal columns.
pp2016 <- mutate(pp2016, Month=months(X3),
                 yday=as.integer(format(X3,"%j")))
Then we run the standard aggregation calculations.
price.per.month <- summarise(group_by(pp2016,Month),price=mean(X2))
price.per.yday <- summarise(group_by(pp2016,yday),price=mean(X2))
Moving on to the full data set, we summarize in the following table the run times of the different operations for the three approaches.
Operation | Plain R | dplyr | data.table |
---|---|---|---|
Data loading from CSV | 210s | 170s | 160s |
Post code aggregation | 40s | 19s | 3s |
County aggregation | 12s | 1s | <1s |
Date conversion | Impossible | 12s | 26s |
Month aggregation | Impossible | 5s | <1s |
Day aggregation | Impossible | 1s | <1s |
Month + County | Impossible | 7s | <1s |
Obviously, data.table and dplyr are much better solutions than plain R. data.table seems to perform better than dplyr, but the latter has advantages in other contexts. In addition, dplyr can work on top of data tables, which somehow brings together the best of both worlds.
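A minimal sketch of this combination, assuming the dtplyr package (not covered in this course) and a pp2016 object loaded with fread:
require(data.table)
require(dtplyr)
require(dplyr)
## wrap the data.table so that dplyr verbs are translated into data.table code
res <- lazy_dt(pp2016) %>%
  group_by(V14) %>%
  summarise(price=mean(V2)) %>%
  as_tibble()   ## forces the computation and returns an ordinary tibble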