This part of the course is dedicated to the use of shared memory systems in R.
The simplest way to use parallel processing in R is to rely on implicit parallelism, that is to leverage libraries/packages that have been designed to use multiple cores. The main example of this is linear algebra (matrix and vector).
Because of the importance of linear algebra in mathematics, standards have been designed in order to provide an efficient API and associated implementations. The standard low level API is BLAS. It provides the elementary operations on matrices and vectors (dot product, matrix vector product, etc.). Numerous efficient implementations of BLAS are available. Most of them are designed for multicore CPUs and can reach the theoretical peak performance of those CPUs. Two very efficient open source implementations are well known: ATLAS and OpenBLAS.
R can be configured to use any BLAS implementation instead of its integrated one. This can bring a tremendous improvement in computation time for programs that make heavy use of linear algebra. For instance, consider the following very simple code:
a <- matrix(rnorm(5000*5000), 5000, 5000)
b <- matrix(rnorm(5000*5000), 5000, 5000)
c <- a%*%b
It runs in roughly 80 seconds on a Xeon CPU E3-1271 v3 @ 3.60GHz (a 4-core CPU) using the single threaded BLAS shipped with R. With ATLAS in single threaded mode, the program runs in around 20 seconds. With parallel ATLAS, it runs in 6 seconds (with less than 3 seconds spent in the matrix multiplication itself). The theoretical peak for general purpose calculation is 28.8 \(10^9\) flops, while the computational load is 250 \(10^9\) operations (a dense matrix product costs roughly \(2n^3\) operations), leading to an observed rate of almost 84 \(10^9\) flops. Larger matrices lead to even better flop rates. It turns out that recent CPUs include specific instructions that are highly efficient for matrix computation. The CPU used in this test belongs to the Haswell family, which can perform up to 16 operations on double precision numbers per core and per cycle. Thus the theoretical peak for matrix calculation is around 230 \(10^9\) flops, which explains the observed performance. The latest Intel CPU families can perform up to 32 operations per core and per cycle.
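The flop rate reached by the installed BLAS can be estimated with a rough benchmark along the following lines (a sketch only; the absolute numbers depend entirely on which BLAS R is linked against, and `n` is kept smaller than the 5000 used above so that it runs quickly everywhere):

```r
## rough benchmark of the installed BLAS (illustrative sketch)
n <- 1000                          # smaller than 5000 above, to keep it quick
a <- matrix(rnorm(n * n), n, n)
b <- matrix(rnorm(n * n), n, n)
elapsed <- system.time(a %*% b)["elapsed"]
## a dense matrix product costs about 2*n^3 floating point operations
cat(sprintf("%.1f Gflops\n", 2 * n^3 / elapsed / 1e9))
```

Running it before and after switching the BLAS makes the gain directly visible.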
When a program does not rely heavily on linear algebra, implicit parallel programming is not sufficient and one has to rely on explicit parallel programming. The core facility for this is provided by the parallel package. However, parallel is rather low level and it is simpler to use the foreach package, which provides a programming interface that is quite different from the fork-join principle. The foreach principle belongs to the data parallelism model (same code, different data) rather than to the task parallelism model (different codes).
Let's first translate the C code into R:
## loop version, very slow
compute.pi.loop <- function(nsteps) {
  sum <- 0
  step <- 1.0/nsteps
  for(i in 1:nsteps) {
    x <- (i-0.5)*step
    sum <- sum + 4/(1+x^2)
  }
  step * sum
}
This is not idiomatic (and in addition quite slow), so we move on to a more idiomatic version.
## idiomatic version, much faster
compute.pi.idiomatic <- function(nsteps) {
  sum(4/(1+(((1:nsteps)-0.5)/nsteps)^2))/nsteps
}
This version is very fast (by R standards), but be warned that it allocates a lot of memory. We obtain the following running times.
nsteps | \(10^7\) | \(10^8\) |
---|---|---|
loop | 8s | 83s |
idiomatic | 0.11s | 1.2s |
The foreach loop construct in R is quite different from the standard for loop, in the sense that it does not iterate over things but rather applies a block of code to different values. For instance, the result of
library(foreach)
foreach(i=1:5) %do% i^2
is a list containing the values 1, 4, 9, 16, and 25.
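As a quick check (assuming the foreach package is installed), one can verify that the result is indeed a list, and that a plain vector can be obtained via the .combine argument:

```r
library(foreach)

# the result of a foreach/%do% expression is a list by default
res <- foreach(i = 1:5) %do% i^2
stopifnot(is.list(res), length(res) == 5, res[[3]] == 9)

# a combination function turns the results into a plain vector
v <- foreach(i = 1:5, .combine = c) %do% i^2
stopifnot(identical(v, c(1, 4, 9, 16, 25)))
```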
The values of variables in the block are set via the parameters of the foreach call, while the results are aggregated via a combination function, for instance as in:

foreach(i=1:5,.combine=sum) %do% i^2

which returns 55. The \(\pi\) calculation example can then be done with foreach as follows:
compute.pi.foreach.loop <- function(nsteps) {
  step <- 1.0/nsteps
  sum <- foreach(i=1:nsteps,.combine=sum) %do% {
    4/(1+((i-0.5)*step)^2)
  }
  step * sum
}
Unfortunately, this is amazingly slow: 2.4 seconds for nsteps = \(10^4\). This is because foreach is optimized for large tasks, while here each task is a very basic calculation. The correct solution consists in wrapping the idiomatic calculation into a foreach loop, for instance as follows:
compute.pi.foreach.idiomatic <- function(nsteps,ntimes) {
  foreach(i=1:ntimes,.combine=sum) %do% {
    sum(4/(1+((seq(i,nsteps,by=ntimes)-0.5)/nsteps)^2))/nsteps
  }
}
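The chunking in this version relies on seq(i, nsteps, by=ntimes) producing interleaved strides of the index range. A tiny sketch can confirm that the ntimes strides together cover each index of 1:nsteps exactly once:

```r
nsteps <- 10
ntimes <- 3

# stride i handles indices i, i+ntimes, i+2*ntimes, ...
strides <- lapply(1:ntimes, function(i) seq(i, nsteps, by = ntimes))

# the strides partition 1:nsteps: full coverage, no duplicates
stopifnot(setequal(unlist(strides), 1:nsteps),
          length(unlist(strides)) == nsteps)
```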
The overhead of foreach then remains very small as long as ntimes is also small, as shown in this timing table.
nsteps | \(10^7\) | \(10^8\) |
---|---|---|
ntimes = 1 | 0.37s | 3.76s |
ntimes = 10 | 0.37s | 4.2s |
ntimes = 100 | 0.42s | 4.1s |
ntimes = 1000 | 0.62s | 3.8s |
The main interest of foreach is that its calculation can be done in parallel. One only has to replace the %do% construct by a %dopar% construct and to load a parallel backend. The modified code is:
compute.pi.foreach.idiomatic.parallel <- function(nsteps,ntimes) {
  foreach(i=1:ntimes,.combine=sum) %dopar% {
    sum(4/(1+((seq(i,nsteps,by=ntimes)-0.5)/nsteps)^2))/nsteps
  }
}
The standard parallel backend on Linux and MacOS is doMC, which is registered as follows:
# load the library
library(doMC)
# register the backend
registerDoMC()
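A minimal sketch of registering the backend with an explicit worker count (assuming doMC is installed; it is available on Linux and MacOS only). Without an argument, registerDoMC uses a default core count; foreach can report how many workers the current backend provides:

```r
library(doMC)       # Linux / MacOS only
library(foreach)

## register an explicit number of workers instead of the default
registerDoMC(cores = 2)

## the foreach package reports the currently registered backend
getDoParWorkers()   # should return 2
```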
Then, one can test the parallel version. The natural value for ntimes is here the number of cores, but it can be interesting to use higher counts. Indeed, with nsteps = \(10^8\) and ntimes = 4 (on a 4-core CPU), we get a run time of around 1.4s, slightly larger than a quarter of the sequential time. But the sequential code cannot run with nsteps = \(10^{10}\) because it allocates too much memory (75 GB!), and thus the calculation has to be cut into several parts. This is exactly what the idiomatic foreach design provides: it can be run with nsteps = \(10^{10}\) and e.g. ntimes = 100. This takes approximately 116s (as expected).
The foreach package is very useful to bring concurrency to R in a quite simple way. It has additional features that we will study later in the course. However, it can also be misused, as shown above. In particular:

- do not use foreach for short tasks;
- do not write foreach loops without using the specialized features;
- test foreach code with a sequential backend first;
- under Windows, use doParallel rather than doMC, but be aware that performances under Windows are inferior to those under Mac OS and Linux.
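Following the advice above, here is a sketch of testing the parallel function with foreach's built-in sequential backend (registerDoSEQ) before switching to a real parallel one; with this backend, %dopar% executes exactly like %do%:

```r
library(foreach)

compute.pi.foreach.idiomatic.parallel <- function(nsteps, ntimes) {
  foreach(i = 1:ntimes, .combine = sum) %dopar% {
    sum(4/(1 + ((seq(i, nsteps, by = ntimes) - 0.5)/nsteps)^2))/nsteps
  }
}

## register the sequential backend: %dopar% then runs sequentially,
## without the "no parallel backend registered" warning
registerDoSEQ()

## the midpoint rule with 1e5 steps is accurate to far better than 1e-8
stopifnot(abs(compute.pi.foreach.idiomatic.parallel(1e5, 4) - pi) < 1e-8)
```

Once this runs correctly, registering doMC (or doParallel) makes the same function run in parallel without any further change.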