This part of the course is dedicated to the use of shared memory systems in R.
The simplest way to use parallel processing in R is to rely on implicit parallelism, that is to leverage libraries/packages that have been designed to use multiple cores. The main example of this is linear algebra (matrix and vector).
Because of the importance of linear algebra in mathematics, standards have been designed in order to provide an efficient API and associated implementations. The standard low level API is BLAS. It provides the elementary operations on matrices and vectors (dot product, matrix vector product, etc.). Numerous efficient implementations of BLAS are available. Most of them are designed for multicore CPUs and can reach the theoretical peak performance of those CPUs. Two very efficient open source implementations are well known: ATLAS and OpenBLAS.
R can be configured to use any BLAS implementation instead of its integrated one. This can bring a tremendous improvement in computation time for programs that make heavy use of linear algebra. For instance, consider the following very simple code:
a <- matrix(rnorm(5000*5000), 5000, 5000)
b <- matrix(rnorm(5000*5000), 5000, 5000)
c <- a%*%b
It runs in roughly 80 seconds on a Xeon CPU E3-1271 v3 @ 3.60GHz (a 4-core CPU) using the single threaded BLAS shipped with R. With ATLAS in single threaded mode, the program runs in around 20 seconds. With parallel ATLAS, it runs in 6 seconds (with less than 3 seconds spent in the matrix multiplication itself). The theoretical peak for general purpose calculation is 28.8 \(10^9\) flops, while the computational load is 250 \(10^9\) operations (a dense matrix product costs roughly \(2n^3\) operations), leading to an observed rate of almost 84 \(10^9\) flops. Larger matrices lead to even better flop rates. It turns out that recent CPUs include specific instructions that are highly efficient for matrix computation. The CPU used in this test belongs to the Haswell family, which can perform up to 16 operations on double precision numbers per core and per cycle. Thus the theoretical peak for matrix calculation is around 230 \(10^9\) flops, which explains the observed performance. The latest Intel CPU families can perform up to 32 operations per core and per cycle.
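The flop rate reached by the installed BLAS can be estimated with a rough benchmark along the following lines (a sketch only; the absolute numbers depend entirely on which BLAS R is linked against, and `n` is kept smaller than the 5000 used above so that it runs quickly everywhere):

```r
## rough benchmark of the installed BLAS (illustrative sketch)
n <- 1000                          # smaller than 5000 above, to keep it quick
a <- matrix(rnorm(n * n), n, n)
b <- matrix(rnorm(n * n), n, n)
elapsed <- system.time(a %*% b)["elapsed"]
## a dense matrix product costs about 2*n^3 floating point operations
cat(sprintf("%.1f Gflops\n", 2 * n^3 / elapsed / 1e9))
```

Running it before and after switching the BLAS makes the gain directly visible.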
When a program does not rely heavily on linear algebra, implicit parallel programming is not sufficient and one has to rely on explicit parallel programming. The core facility for this is provided by the parallel package. However, parallel is rather low level and it is simpler to use the foreach package, which provides a programming interface that is quite different from the fork-join principle. The foreach principle belongs to the data parallelism model (same code, different data) rather than to the task parallelism model (different codes).
Let's first translate the C code into R:
## loop version, very slow
compute.pi.loop <- function(nsteps) {
  sum <- 0
  step <- 1.0/nsteps
  for(i in 1:nsteps) {
    x <- (i-0.5)*step
    sum <- sum + 4/(1+x^2)
  }
  step * sum
}
This is not idiomatic (and in addition quite slow), so we move on to a more idiomatic version.
## idiomatic version, much faster
compute.pi.idiomatic <- function(nsteps) {
  sum(4/(1+(((1:nsteps)-0.5)/nsteps)^2))/nsteps
}
This version is very fast (by R standards), but be warned that it allocates a lot of memory. We obtain the following running times.
nsteps | \(10^7\) | \(10^8\) |
---|---|---|
loop | 8s | 83s |
idiomatic | 0.11s | 1.2s |
The foreach loop construct in R is quite different from the standard for loop, in the sense that it does not iterate over things but rather applies a block of code to different values. For instance, the result of
library(foreach)
foreach(i=1:5) %do% i^2
is a list containing the values 1, 4, 9, 16, and 25.
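As a quick check (assuming the foreach package is installed), one can verify that the result is indeed a list, and that a plain vector can be obtained via the .combine argument:

```r
library(foreach)

# the result of a foreach/%do% expression is a list by default
res <- foreach(i = 1:5) %do% i^2
stopifnot(is.list(res), length(res) == 5, res[[3]] == 9)

# a combination function turns the results into a plain vector
v <- foreach(i = 1:5, .combine = c) %do% i^2
stopifnot(identical(v, c(1, 4, 9, 16, 25)))
```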
The values of variables in the block are set via the parameters of the foreach call, while the results are aggregated via a combination function, for instance as in:

foreach(i=1:5,.combine=sum) %do% i^2

which returns 55. The \(\pi\) calculation example can then be done with foreach as follows:
compute.pi.foreach.loop <- function(nsteps) {
  step <- 1.0/nsteps
  sum <- foreach(i=1:nsteps,.combine=sum) %do% {
    4/(1+((i-0.5)*step)^2)
  }
  step * sum
}
Unfortunately, this is amazingly slow: 2.4 seconds for nsteps = \(10^4\). This is because foreach is optimized for large tasks, while here each task is a very basic calculation. The correct solution consists in wrapping the idiomatic calculation into a foreach loop, for instance as follows:
compute.pi.foreach.idiomatic <- function(nsteps,ntimes) {
  foreach(i=1:ntimes,.combine=sum) %do% {
    sum(4/(1+((seq(i,nsteps,by=ntimes)-0.5)/nsteps)^2))/nsteps
  }
}
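The chunking in this version relies on seq(i, nsteps, by=ntimes) producing interleaved strides of the index range. A tiny sketch can confirm that the ntimes strides together cover each index of 1:nsteps exactly once:

```r
nsteps <- 10
ntimes <- 3

# stride i handles indices i, i+ntimes, i+2*ntimes, ...
strides <- lapply(1:ntimes, function(i) seq(i, nsteps, by = ntimes))

# the strides partition 1:nsteps: full coverage, no duplicates
stopifnot(setequal(unlist(strides), 1:nsteps),
          length(unlist(strides)) == nsteps)
```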
The overhead of foreach then remains very small as long as ntimes is also small, as shown in this timing table.
nsteps | \(10^7\) | \(10^8\) |
---|---|---|
ntimes = 1 | 0.37s | 3.76s |
ntimes = 10 | 0.37s | 4.2s |
ntimes = 100 | 0.42s | 4.1s |
ntimes = 1000 | 0.62s | 3.8s |
The main interest of foreach is that its calculation can be done in parallel. One only has to replace the %do% construct by a %dopar% construct and to load a parallel backend. The modified code is:
compute.pi.foreach.idiomatic.parallel <- function(nsteps,ntimes) {
  foreach(i=1:ntimes,.combine=sum) %dopar% {
    sum(4/(1+((seq(i,nsteps,by=ntimes)-0.5)/nsteps)^2))/nsteps
  }
}
The standard parallel backend on Linux and MacOS is doMC, which is registered as follows:
# load the library
library(doMC)
# register the backend
registerDoMC()
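A minimal sketch of registering the backend with an explicit worker count (assuming doMC is installed; it is available on Linux and MacOS only). Without an argument, registerDoMC uses a default core count; foreach can report how many workers the current backend provides:

```r
library(doMC)       # Linux / MacOS only
library(foreach)

## register an explicit number of workers instead of the default
registerDoMC(cores = 2)

## the foreach package reports the currently registered backend
getDoParWorkers()   # should return 2
```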
Then, one can test the parallel version. The natural value for ntimes is here the number of cores, but it can be interesting to use higher counts. Indeed, with nsteps = \(10^8\) and ntimes = 4 (on a 4-core CPU), we get a run time of around 1.4s, slightly larger than a quarter of the sequential time. But the sequential code cannot run with nsteps = \(10^{10}\) because it allocates too much memory (75 GB!), and thus the calculation has to be cut into several parts. This is exactly what the idiomatic foreach design provides: it can be run with nsteps = \(10^{10}\) and e.g. ntimes = 100. This takes approximately 116s (as expected).
The foreach package is very useful to bring concurrency to R in a quite simple way. It has additional features that we will study later in the course. However, it can also be misused, as shown above. In particular:

- do not use foreach for short tasks;
- do not write foreach loops without using the specialized features;
- test foreach code with a sequential backend first;
- under Windows, use doParallel rather than doMC, but be aware that performances under Windows are inferior to those under Mac OS and Linux.
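Following the advice above, here is a sketch of testing the parallel function with foreach's built-in sequential backend (registerDoSEQ) before switching to a real parallel one; with this backend, %dopar% executes exactly like %do%:

```r
library(foreach)

compute.pi.foreach.idiomatic.parallel <- function(nsteps, ntimes) {
  foreach(i = 1:ntimes, .combine = sum) %dopar% {
    sum(4/(1 + ((seq(i, nsteps, by = ntimes) - 0.5)/nsteps)^2))/nsteps
  }
}

## register the sequential backend: %dopar% then runs sequentially,
## without the "no parallel backend registered" warning
registerDoSEQ()

## the midpoint rule with 1e5 steps is accurate to far better than 1e-8
stopifnot(abs(compute.pi.foreach.idiomatic.parallel(1e5, 4) - pi) < 1e-8)
```

Once this runs correctly, registering doMC (or doParallel) makes the same function run in parallel without any further change.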