Introduction

The only way to deal with large to big data is to use some form of parallel processing: there is simply no single computational unit able to deliver the number of flops needed for modern data handling. Unfortunately, parallel processing is difficult. I first discuss the easiest situation, the case of shared memory parallel programming (see HerlihyShavit2008ArtOfMultiprocessorProgramming for an excellent textbook on efficient concurrent data structures).

Shared memory hardware structure

A computer is made of processing units (PU) and of memory (among other things). Current computers (even phones) have multiple PUs (the cores) that all use the same memory. This is the standard shared memory structure. Sharing is done at the hardware level, which guarantees good performance, at least in theory. Software-level shared memory is also available, but it comes with a higher programming cost and with lower performance when there is no direct hardware support.

Threads, processes and memory

The simplest way to use shared memory is via the thread model. The general unit for a program is a process, which groups together several execution contexts (the threads) and a unique memory context. Each process runs in its own virtual memory space, which is mapped by the Memory Management Unit (MMU) to the physical memory. This is completely transparent for the programmer. All the threads of a process have access to its full virtual memory space, leading to a fully shared memory structure with hardware support.

MMU and caches

The MMU is responsible for the virtual-to-physical mapping. It also relies on its own cache (the Translation Lookaside Buffer, TLB). This cache and the CPU caches are crucial to reach high performance because the large, cheap main memory has an enormous latency (somewhat mitigated by a very high bandwidth). I cannot cover the details in this course (see Drepper2007Memory and my course in French), but even getting single-threaded high-performance code is difficult because of the caches. Some typical numbers for the latency of a memory request at each level of the cache hierarchy (Intel):

  • register (core): 1 cycle
  • L1 (core): 3 cycles
  • L2 (core): 12 cycles
  • L3 (CPU): 38 cycles
  • local RAM (CPU): 65 ns (130 cycles at least)
  • non-local RAM (NUMA, see below): 105 ns (40 ns from QPI)

Non-uniform memory access

There is a limit to the number of cores that can be integrated on a single CPU (18 for the current Intel architecture). When a server needs more PUs, it has to integrate several CPUs. In order to maintain the shared memory principle in this case, two solutions exist.

The simplest one consists of a single shared memory for all the processors and an interconnection bus to which the processors send their memory access requests. This does not scale well to a large number of CPUs as the bus becomes a contention point.

A more advanced solution is the Non-Uniform Memory Access (NUMA) principle. The main idea is that each CPU has its own memory. When a process uses more memory than the one attached directly to its preferred CPU, it transparently uses memory from the other CPUs (via the MMU). Of course, access to the distant memory is somewhat less efficient than access to the local memory, hence the idea of a preferred CPU.

Programming difficulties

There are essentially two issues with parallel programming on shared memory architectures:

  1. the correctness issue: does the parallel program really produce the same results as its sequential version?
  2. the efficiency issue: does the parallel program make good use of the hardware by reducing the wall clock time needed to perform a task?

The first issue is very complex and cannot be solved without hardware support, exposed through locks and related tools (in particular compare-and-swap primitives). The key point is to enable an a priori unbounded number of threads to agree on a common value.

The second issue is in some sense even more annoying: solving the first issue with obvious tools (such as locks) frequently leads to inefficient solutions. They do work, but the wall clock time is not significantly reduced compared to a sequential program, while in theory one could expect the wall clock time to be divided by the number of threads (as long as each thread maps to a core).

While those two problems are extremely important, they should not be overemphasized in the context of machine learning algorithms, for instance. Many machine learning algorithms have an "embarrassingly parallel" structure in the sense that the result is obtained by combining a series of almost independent calculations (this is the case of resampling methods, for instance). In fact, the difficulties outlined above are a consequence of coupling between the calculations, either via some shared data structure (a database) or because of important communication needs.

Let us consider for instance the basic problem of computing a distance matrix between N points in \(\mathbb{R}^P\). The squared distance between two points is the sum of squared coordinate differences and involves roughly \(3P\) operations. This has to be done for \(M=\frac{N(N-1)}{2}\) pairs, so the total sequential cost is \(\frac{3PN(N-1)}{2}\). Obviously, the distances can be computed independently. If we number the distances from 1 to \(M=\frac{N(N-1)}{2}\), we can split the calculation over \(T\) threads, each responsible for computing \(M/T\) distances. No communication is needed between the threads and we can expect the wall clock time to be divided by the number of threads.
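As a rough illustration, here is a sequential sketch of this splitting in R (the function pair.dist2, the variable names and the cyclic assignment of pairs to tasks are purely illustrative):

## squared pairwise distances, split into n.tasks independent chunks;
## each chunk could later be handled by one thread/worker
pair.dist2 <- function(X, pairs) {
    apply(pairs, 1, function(p) sum((X[p[1], ] - X[p[2], ])^2))
}
N <- 100; P <- 10; n.tasks <- 4
X <- matrix(rnorm(N * P), N, P)
pairs <- t(combn(N, 2))                       ## the M = N(N-1)/2 pairs (i, j)
chunks <- split(seq_len(nrow(pairs)),         ## roughly M/T indices per task
                rep(1:n.tasks, length.out = nrow(pairs)))
d2 <- unlist(lapply(chunks,                   ## one call per (future) thread
                    function(ix) pair.dist2(X, pairs[ix, , drop = FALSE])))
## d2 contains the squared distances, grouped by chunk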

Standard API

In order to program for shared memory systems, one needs an Application Programming Interface (API) that allows one either to manipulate threads and locks (low-level API) or to express that some parts of the program can be executed concurrently (high-level API). High-level languages such as Java and C# provide both levels, including very high-level tools that mask to some extent the difficulties outlined above.

In particular, the very popular fork-join paradigm is readily available. In this paradigm, a master thread forks into several tasks (which are mapped transparently to threads), waits for them to terminate, and collects the results in a sequential join phase. For instance, in the distance calculation above, the master thread would load the data set and then fork the pairwise distance calculations themselves.
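In R, a similar fork/join pattern can be sketched with mclapply from the parallel package (reusing the objects of the previous sketch; mclapply relies on Unix fork and works with processes rather than threads, but the structure is the same):

## fork: one task per chunk, mapped to n.tasks worker processes;
## join: the master collects and concatenates the results
library(parallel)
d2 <- unlist(mclapply(chunks,
                      function(ix) pair.dist2(X, pairs[ix, , drop = FALSE]),
                      mc.cores = n.tasks))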

For C/C++, the standard solution is OpenMP. This is an API integrated into compilers and their runtime libraries. It is based on pragmas of the form #pragma omp ... (pragmas are a standard way to introduce compiler-level extensions in C). Examples of pragmas include:

#pragma omp parallel
this creates additional threads to execute the current construct
#pragma omp parallel for
this distributes the iterations of the subsequent for loop over multiple threads
#pragma omp barrier
this corresponds to the join part of the fork-join paradigm: all threads wait for each other before passing the pragma

OpenMP is largely declarative in the sense that explicit fine-grained management of the underlying threads is not really needed (and is mostly not possible). An example from Tim Mattson follows: the program computes \(\pi\) by numerical integration.

Sequential program

#include <stdio.h>
static long num_steps = 100000;
double step;
int main ()
{
    int i; double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    for (i = 0; i < num_steps; i++){
        x = (i + 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    pi = step * sum;
    printf("%g\n", pi);
    return 0;
}

A parallel version in OpenMP

#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
#define NUM_THREADS 4
int main ()
{
    double pi = 0.0;
    int nthreads;
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
    {
        int i, id, nthrds; double x, sum;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;
        /* cyclic distribution of the integration steps over the threads */
        for (i = id, sum = 0.0; i < num_steps; i = i + nthrds){
            x = (i + 0.5) * step;
            sum += 4.0 / (1.0 + x * x);
        }
        sum = sum * step;
        /* the shared accumulator pi must be updated atomically */
#pragma omp atomic
        pi += sum;
    }
    printf("%f\n",pi);
    return 0;
}
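Such a program must be compiled with OpenMP support enabled, for instance with the -fopenmp flag of gcc (the exact option depends on the compiler).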

Implicit parallel programming in R

The simplest way to use parallel processing in R is to rely on implicit parallelism, that is, to leverage libraries/packages that have been designed to use multiple cores. The main example of this is linear algebra (matrix and vector operations).

Because of the importance of linear algebra in mathematics, standards have been designed in order to provide efficient APIs and associated implementations. The standard low-level API is BLAS. It provides the elementary operations on matrices and vectors (dot product, matrix-vector product, etc.). Numerous efficient implementations of BLAS are available. Most of them are designed for multicore CPUs and can reach the theoretical peak performance of those CPUs. Two very efficient open source implementations are well known:

  1. ATLAS
  2. GotoBLAS

R can be configured to use any BLAS implementation instead of its integrated one. This can bring tremendous improvements in computation time for programs that make heavy use of linear algebra. For instance, consider the following very simple code:

a <- matrix(rnorm(5000*5000), 5000, 5000)
b <- matrix(rnorm(5000*5000), 5000, 5000)
c <- a%*%b

It runs in roughly 80 seconds on a Xeon CPU E3-1271 v3 @ 3.60GHz using the single-threaded BLAS shipped with R (this is a 4-core CPU). With ATLAS in single-threaded mode, the program runs in around 20 seconds; with the parallel ATLAS it runs in 6 seconds. The theoretical peak is 28.8 \(10^9\) flops, while the computational load is about 250 \(10^9\) operations, leading to an observed rate of almost 42 \(10^9\) flops. Larger matrices lead to even higher flop rates.
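To check which BLAS implementation a given R installation actually uses, and to reproduce this kind of timing, one can use something along the following lines (recent versions of R report the BLAS/LAPACK libraries in the output of sessionInfo(); the exact output depends on the installation):

## illustrative check of the linear algebra backend and timing of the product
sessionInfo()               ## recent R versions list the BLAS/LAPACK libraries in use
system.time(c <- a %*% b)   ## elapsed time depends on the BLAS implementation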

Explicit parallel programming in R

When a program does not rely heavily on linear algebra, implicit parallel programming is not sufficient and one has to rely on explicit parallel programming. The core facility for this is provided by the parallel package. However, parallel is rather low level and it is simpler to use the foreach package, which provides a programming interface quite different from the fork-join principle. The foreach approach belongs to the data parallelism model (same code, different data) rather than to the task parallelism model (different code for the different tasks).

The \(\pi\) example: base R code

Let's first translate the C code into R:

## loop version, very slow
compute.pi.loop <- function(nsteps) {
    sum <- 0
    step <- 1.0/nsteps
    for(i in 1:nsteps) {
	x <- (i-0.5)*step
	sum <- sum + 4/(1+x^2)
    }
    step * sum
}

This is not idiomatic (and in addition quite slow), so we move on to a more idiomatic version.

## idiomatic version much faster
compute.pi.idiomatic <- function(nsteps) {
    sum(4/(1+(((1:nsteps)-0.5)/nsteps)^2))/nsteps
}

This version is very fast (by R standards) but be warned that it allocates a lot of memory. We obtain the following running times.

nsteps       \(10^7\)    \(10^8\)
loop         8 s         83 s
idiomatic    0.11 s      1.2 s
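As a quick sanity check, both functions return essentially the same approximation of \(\pi\):

## midpoint rule approximation of pi; both values are approximately 3.141593
compute.pi.loop(1e5)
compute.pi.idiomatic(1e5)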

Foreach version

The foreach loop construct in R is quite different from the standard for loop in the sense that it does not iterate over things but rather applies a block of code to different values. For instance, the result of

library(foreach)
foreach(i=1:5) %do% i^2

is a list containing the following values:

1 4 9 16 25

The values of the variables used in the block are set via the parameters of the foreach call, while the results are aggregated via a combination function, for instance as in:

foreach(i=1:5,.combine=sum) %do% i^2

which returns 55. The \(\pi\) calculation example can then be written with foreach as follows:

compute.pi.foreach.loop <- function(nsteps) {
    step <- 1.0/nsteps
    sum <- foreach(i=1:nsteps,.combine=sum) %do% {
	4/(1+((i-0.5)*step)^2)
    }
    step * sum
}

Unfortunately, this is amazingly slow: 2.4 seconds for nsteps = \(10^4\). This is because foreach is optimized for large tasks, while here each task is a very basic calculation. The correct solution consists of wrapping the idiomatic calculation into a foreach loop, for instance as follows:

compute.pi.foreach.idiomatic <- function(nsteps,ntimes) {
    step <- 1.0/nsteps
    foreach(i=1:ntimes,.combine=sum) %do% {
	sum(4/(1+((seq(i,nsteps,by=ntimes)-0.5)/nsteps)^2))/nsteps
    }
}
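The strided index sets used above form a partition of 1:nsteps, so each of the ntimes tasks handles a disjoint share of the terms, as this small illustrative check shows:

## each index in 1:nsteps is covered exactly once by the ntimes strided sequences
nsteps <- 20; ntimes <- 3
sort(unlist(lapply(1:ntimes, function(i) seq(i, nsteps, by = ntimes))))
## returns 1 2 3 ... 20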

Then the overhead of foreach remains very small as long as ntimes is also small, as shown in the following timing table.

nsteps           \(10^7\)   \(10^8\)
ntimes = 1       0.37 s     3.76 s
ntimes = 10      0.37 s     4.2 s
ntimes = 100     0.42 s     4.1 s
ntimes = 1000    0.62 s     3.8 s

Parallel foreach

The main interest of foreach is that its calculations can be done in parallel. One only has to replace the %do% construct by a %dopar% construct and to load and register a parallel backend. The modified code is

compute.pi.foreach.idiomatic.parallel <- function(nsteps,ntimes) {
    step <- 1.0/nsteps
    foreach(i=1:ntimes,.combine=sum) %dopar% {
	sum(4/(1+((seq(i,nsteps,by=ntimes)-0.5)/nsteps)^2))/nsteps
    }
}

The standard parallel backend on Unix-like systems is doMC, which is registered as follows:

# load the library
library(doMC)
# register the backend
registerDoMC()
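The number of workers can also be set explicitly and checked afterwards (here with 4 workers, to match the 4-core CPU used below):

## explicitly request 4 workers and check what the backend reports
registerDoMC(cores = 4)
getDoParWorkers()   ## number of workers the %dopar% loops will use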

Then, one can test the parallel version. The natural value for ntimes is here the number of cores, but it can be interesting to use higher values. Indeed, with nsteps = \(10^8\) and ntimes = 4 (on a 4-core CPU), we get a run time of around 1.4 s, slightly larger than for the sequential code. But the sequential code cannot run with nsteps = \(10^{10}\) because it allocates too much memory (75 GB!), and thus the calculation has to be cut into several parts. This is exactly what the idiomatic foreach design provides: it can be run with nsteps = \(10^{10}\) and e.g. ntimes = 100. This takes approximately 116 s (as expected).
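Timings such as the ones reported above can be obtained with system.time, for instance (the actual figures depend on the machine and on the registered backend):

## illustrative timing of the parallel version with ntimes = 4
system.time(compute.pi.foreach.idiomatic.parallel(1e8, ntimes = 4))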

Foreach do and don't

The foreach package is very useful to bring concurrency to R in a quite simple way. It has additional features that we will study later in the course. However, it can also be misused, as shown above.

  • things to avoid
    1. do not use foreach for short tasks
    2. do not nest foreach loops without using the dedicated nesting support (see the sketch after this list)
  • things to do
    1. optimize the R code before going to foreach
    2. test the foreach code with a sequential backend first
    3. test different numbers of cores
    4. leverage the combine facility when possible
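Regarding item 2 of the things to avoid, foreach provides a dedicated nesting operator %:% which merges nested loops into a single collection of tasks that the backend can distribute. A minimal sketch (the values are purely illustrative):

## nested foreach loops merged with %:%: the 3 x 4 tasks form a single
## collection for the backend; the result is a 3 x 4 matrix with entry i * j
res <- foreach(i = 1:3, .combine = rbind) %:%
    foreach(j = 1:4, .combine = c) %do% {
        i * j
    }
res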

Take home message

Even in the case of shared memory, parallel programming is very difficult: obtaining correct results quickly is very challenging. The problem revolves around communications between parts of the program: it is easy to lock everything (to get correct results), but then efficiency suffers a lot. Fortunately, many problems can be handled with simplified approaches: the fork-join principle that underlies OpenMP and the data parallelism that appears in foreach. The latter principle is particularly well adapted to many learning tasks, for example.

Bibliography