Data Management and Querying in R

This part of the course is dedicated to data manipulation in R at the RDBMS level. I review different levels of manipulation from importing to actual in memory manipulation. A somewhat dated source on this topic is provided by an official R manual. A better source is the Efficient R Programming book (chapter 5 about input/output and chapter 6 about data carpentry).

Elementary level (importation)

The basic tool for data "manipulation" in R is the data.frame format coupled with the read.table function. The former is the standard way of representing data tables (a.k.a. rectangular grids) in R while the later enables users to load spreadsheet like data from text files.

While this pair is convenient and easy to use, it has major problems, especially when dealing with large scale data (even medium scale data, in fact). In particular, users tend to confuse the need for importing data into R with the need for storing data in a R workflow. While importing data, reliance on text files is sometimes mandatory. However, once the data has been imported in R, there is no reason to rely on text files, apart for exporting data to e.g. another program.

We focus here in the importation aspect as it is important for interaction with non R world, but keep in mind that using the RDS native format is the best solution for R only processing.

Limitations of `read.table`

The read.table function and its friends, read.csv, read.delim, etc. are inherently limited by the text format on which they rely. Limitations include:

non fixed data format (separator, quotes, numerical encoding, etc.);
no meta data (what is the type of a data column?);
encoding issues between platforms.

Notice than variations of the CSV format should be avoided (use the RFC and write.csv when exporting). However, in our context, the biggest problem is the inefficiency of the function which can be very slow at loading even medium size data.

For instance, reading a medium size matrix as in the following code takes roughly 110 seconds (on a Xeon CPU E3-1271 v3 @ 3.60GHz (4 cores) with a standard hard drive and 16 GB of main memory).

a <- matrix(rnorm(5000*5000), 5000, 5000)
write.csv(a,file="demo-a.csv",row.names=FALSE)
a <- as.matrix(read.csv("demo-a.csv"))

Notice in addition that the file itself is large: 450 MB against around 190 MB when in memory. Moreover, the function uses a very amount of memory during its execution. For instance, when only the reading operation is performed above, the peak use of memory is 2.5 GB (notice the occupation is not a consequence of converting to a matrix).

As a comparison reading from the binary representation as in the following code takes only 1 second on the same computer. In addition the file uses 190 MB as expected and the memory usage is only 230 MB, as expected.

a <- matrix(rnorm(5000*5000), 5000, 5000)
saveRDS(a,file="demo-a.Rds")
a <- readRDS("demo-a.Rds")

There are ways to improve read.table. A few tricks can be use at the read.table level. For instance one can use the colClasses parameter to specify the nature of the columns as a way to reduce memory usage. Overestimating the number of rows also helps (nrows parameter). Finally removing the possibility of comments by setting the comments parameter to "" speeds up loading. However, those improvements are somewhat minor. For instance the following code runs in 60 seconds but still uses around 2.5 GB of memory.

a <- as.matrix(read.csv("demo-a.csv"),colClasses="numeric",
	       nrows=6000,comment.char="")

The canonical solution in plain R consists in using the scan function. It's a low level function but it's much more adapted to reading large matrices, for instance. In our context, scan can be used as follows.

a <- matrix(scan("demo-a.csv",skip=1,what=numeric(0),n=5000*5000,
		 sep=","),
	    5000,5000,byrow=TRUE)

This code runs in 12 seconds and uses 430 MB of memory. While this uses quite a lot of resources compared to the binary format solution, the improvement over the naive use of read.table is significative.

Alternatives to `read.table` and `scan`

Fortunately, importing text files into R can be done more efficient functions than read.table and scan. They come from external packages.

Readr

The readr package is specifically designed to read CSV files and related format. A simple way to use it is given in the following snippet:

a <- read_csv("demo-a.csv")

This takes around 24 seconds to read the matrix with a memory usage of 500 MB. As in the case of read.csv and friends, the efficiency of the function can be improved by providing hints to it such as the type of the columns and the size of the resulting data frame. For instance the following code runs in only 10 seconds (and uses 450 MB of memory).

a <- read_csv("demo-a.csv",
	      col_types=cols(.default=col_double()))

In fact, most of the overhead of the base call comes from reading a lot of lines in order to guess the structure of the columns. A very simple way of reducing the overhead is to tweak the guess_max parameter to a small number of lines (default is 1000). That might fail on complex data, but on numerical matrices, it works nicely. For instance, the following code runs in 10 seconds.

require(readr)
a <- read_csv("demo-a.csv",guess_max=10)

Notice than read_csv outputs a R object that is slightly richer than a data frame. For instance a column specification can be extracted from such an object with the spec function (that can be useful for reading several files with a similar format).

Data.table

Another solution is provided by the data.table package. The main goal is this package is not importing data into R, but it offers the fread function which load efficiently CSV files. This function is efficient by default. For instance, the following code runs in 5 seconds and uses 240 MB of memory.

require(data.table)
a <- fread("demo-a.csv")

Discussion

For normal CSV files, the best solution is fread from data.table, by far. On large files (several GB), fread and read_csv (from readr) are roughly equivalent, while the latter can be quite slow on very small files (with no consequences). The only limitation of those fast methods is that they might fail to detect or to report inconsistencies in the data format because they base their analysis on the first lines of the file.

In memory manipulation

Obviously, importing data into R is only the first step of data analysis. In order to perform data mining, one needs to manipulate the data performing was it is sometimes called data carpentry. The goal is basically to prepare the data for data mining, by reorganizing them, cleaning them, etc. Many of those tasks can be done via simple R scripts. However, there are much more efficient solutions provided by several packages, especially by dplyr and data.table.

One should be particularly careful about R's tendency to copy data structures, for instance columns in data frames. On small data, that's not an issue, but on large ones, that can become very problematic. As a consequence, an unwritten rule about R is to avoid working with data structures that are larger than roughly 20% to 33% of your available memory.

Running example

We work here with a simple data set from governmental open data, the HM Land Registry Price Paid Data which contains land property transactions in the UK. The 2016 data is of a reasonable size, roughly 1 millions observations, while the full data set starts in 1995 and contains currently almost 23 millions of transactions (with year 2016 included).

We want to perform on those data very standard data carpentry such date formatting and all sorts of basic aggregate calculations. Those calculations are presented in the next section with the basic plain R implementation.

Basic approach

As in the case of data importation, the basic approach relies on data frames and on associated functions. Notice in particular that R includes numerous "data base like" functions, such as aggregate which resembles the GROUP BY SQL statement and merge which is comparable to the JOIN clauses.

Surprisingly, with minimal precautions, the 2016 data file can be read rather quickly (a few seconds) in R with the base read.csv function as follows:

pp2016 <- read.csv("data sets/pp-2016.csv",
		   colClasses=rep("character",16),
		   nrows=1040000, comment.char="",
		   stringsAsFactors=FALSE,
		   header=FALSE)

In this data set, the second column gives the price paid for a property, while the fourth column gives the post code of the property. A basic data manipulation would be to ask for the average price paid per post code. In plain R this can be done via the aggregate function.

pp2016$V2 <- as.integer(pp2016$V2)
price.per.pc <- aggregate(V2~V4,data=pp2016,mean)

This takes roughly 16 seconds. As this is a quite high resolution, one can aggregate the price over other geographical zones. For instance, variable number 14 gives the county which the coarser aggregation level directly available in those data.

price.per.county  <- aggregate(V2~V14,data=pp2016,mean)

Here the result is obtained almost instantaneously because of the small number of counties.

Another natural question concerns the evolution of the prices through time. The third variable contains the date of the transaction but its loaded in a text format. So we need first to convert the text into R Date objects and then extract from those objects different variables useful for aggregating the transactions at a different time resolutions.

pp2016$Date <- as.POSIXct(pp2016$V3)
pp2016$Month <- months(pp2016$Date)
pp2016$yday <- as.integer(format(pp2016$Date,"%j"))
price.per.month <- aggregate(V2~Month,data=pp2016,mean)
price.per.yday <- aggregate(V2~yday,data=pp2016,mean)

All those operations run quickly and the whole processing time is only of a few seconds.

However, the whole data set consists in more than 20 times more data than the 2016 data and thus the limitations of bare R appear:

reading the full file with read.csv takes 210 seconds and uses 8 GB of memory (this is a 3.7 GB file and the internal representation uses 5 GB);
aggregating over the post codes takes around 40 seconds;
aggregating over the counties takes around 12 seconds;
the date conversion is very slow (120 seconds), increases the memory consumption to 11 GB and finally breaks R during subsequent operations.

This last problem can be circumvented by using the IDate facility of data.table, as follows (for the 2016 case):

require(data.table)
pp2016$Date <- as.IDate(pp2016$V3)
pp2016$Month <- month(pp2016$Date)
pp2016$yday <- yday(pp2016$Date)

However, the memory usage remains enormous, with peaks around 12 GB for any "complex" operation carried on the data. When the consumption peaks slightly higher, the R process can be killed on a 16 GB computer. When things remain under control, monthly aggregation takes around 13 seconds while the day of year oriented one takes slightly less time around 12 seconds.

Thus base R is more limited here by memory issues than by pure performance problems.

data.table

The data.table package has been designed to ease data carpentry in R by introducing SQL like facilities in the data frame model. In addition, the package is very efficient at handling large data, both during loading (as demonstrated before) and during processing.

Loading the short data set is almost instantaneous (1.5 seconds) and uses only a limited memory. Notice that the loaded object is both a data frame and a data table.

require(data.table)
pp2016 <- fread("data sets/pp-2016.csv",header=FALSE)

A very efficient aspect of data.table is its ability to avoid useless data copies via a reference semantics that enables in place modification of a data table. Here, we use it to translate the values in the second columns to integers.

pp2016[,V2:=as.numeric(V2)]

The key point is the use of the := operator which tells R to update the column rather than to copy it as would be the case using the standard = operator.

Moreover, data.table implements a general indexing/aggregating facility that is summarize by the following construct:

DT[i, j, by]

in which DT stands for a data table. In this construction i is a selection operator that filter the data table based on some condition. Then j stands for some operation to perform on the selected rows, for instance an aggregation operation. And finally by corresponds to a grouping of the results according to some criterion.

This can be used to compute the mean transaction price aggregated over post codes as follows;

pp2016[,mean(V2),by=V4]

Here, we use no filtering/selection, then we compute averages over variable V2 while grouping values based on variable V4. This implementation is very fast as the result is obtained almost instantaneously (contrast with 40 seconds in the aggregate case).

In order to do date based processing, we add columns to the data table using the reference semantics:

pp2016[,`:=`(Date = as.IDate(V3))]
pp2016[,`:=`(Month = month(Date), yday = yday(Date))]

Then we can use the data table main feature to compute the aggregates

price.per.month <- pp2016[,mean(V2),by=Month]
price.per.yday <- pp2016[,mean(V2),by=yday]

Results are obtained instantaneously. It's also very easy in this case to filter by whatever interesting criteria (this is also possible in aggregate via the subset parameter). For instance, the Greater London county has the highest transaction price and one might be interested by their monthly evolution. This is obtained as follows:

pp2016[V14=="GREATER LONDON",mean(V2),by=Month]
aggregate(V2~Month,mean,data=pp2016,subset=pp2016$V14=="GREATER LONDON")

Both calculations are quite fast on this data set, but the data table one is (10 times) faster and arguably much more readable.

Moving on to the full data set:

the full data takes 160 seconds without any tricks (plain call to fread, and uses 5.3 GB);
aggregating over the post codes takes around 3 seconds (compare to 40 seconds with aggregate);
aggregating over counties takes less than 1 second;
date conversions are reasonably fast (26 seconds) and do not increase the peak memory usage of 8 GB (5.7 GB after garbage collection);
aggregating over month and day of the year are almost instantaneous and, more importantly, do not break R!
in addition, more complex aggregations are fast and reliable. For instance aggregating over two variables (month and county) is very fast with data table while it breaks R because of a too large memory consumption with aggregate

dplyr

As data.table, dplyr has been designed to ease data manipulation in R but also to speed them up. Dplyr follows data table in providing filtering, aggregation and other high level operations inspired by the SQL model. It operates on extended data frames, the tibbles. The companion package readr can be used to read directly tabular data into the tibble format. For our running example, this gives

require(readr)
pp2016 <- read_csv("data sets/pp-2016.csv",col_names=FALSE)

Reading is quite fast for the small data set (8 seconds) and do not waste memory (461 MB). In addition, read__csv guesses correctly the type of the columns, especially dates, something that is not done by fread.

In order to aggregate with a GROUP BY like approach, one used group_by in dplyr. In our running example, this gives

require(dplyr)
pp2016_by_pc <- group_by(pp2016,X4)

Once this operation has been done (this take around 7 seconds on the small data set), one can apply other dplyr operations: they will respect the grouping structure. For instance the average value per postal code is obtained as follows

price.per.pc <- summarise(pp2016_by_pc,price=mean(X2))

This operation is instantaneous on the small data set. The county level summary can be obtained very quickly by combining directly the two operations as follows.

price.per.county  <- summarise(group_by(pp2016,X14),price=mean(X2))

As data table, dplyr can easily add columns to an existing data frame. We use it here to add the temporal columns.

pp2016 <- mutate(pp2016,Month=months(X3),
		 yday=as.integer(format(X3,"%j")))

Then we run the standard aggregation calculations.

price.per.month <- summarise(group_by(pp2016,Month),price=mean(X2))
price.per.yday <- summarise(group_by(pp2016,yday),price=mean(X2))