Introduction

When data cannot be handled efficiently on a single computer, one has to rely on a distributed system, which consists of several computers interconnected by a network. The main difference between a distributed system and a multi-core/CPU system is that in the former, memory is not shared directly between threads of a single process. Instead, several processes run (on different computers) and communicate with messages (even if this communication can be implicit when using a high level API). Notice that the term distributed computing is sometimes reserved for tightly coupled machines (with a very fast network, for instance) and opposed to cluster computing, where the machines can be loosely coupled. I tend to avoid this distinction.

The main advantage of distributed systems over shared memory systems is the reduced risk of inconsistent behavior, induced by the explicit aspect of sharing (which is done via message passing). However, this is probably the only advantage, as distributed systems are very difficult to master and have significant overhead compared to SMP systems. As such, numerous APIs have been proposed in order to ease programming (to some extent).

At the low level, programming for a distributed system means having some way of knowing the structure of the system (e.g., how many computing nodes are available) and some means of communication. At a higher level, one can hope to describe a complex task as a set of simple tasks with dependencies and then let a framework turn this description into a fast distributed program (this could apply to SMP as well).

Some API and programming models

  • API:
    Parallel Virtual Machine (PVM)
    PVM is the ancestor of parallel programming APIs designed for distributed systems. It was created in the late 80's and is based on message passing. It is still used but has been somewhat superseded by MPI.
    Message Passing Interface (MPI)
    MPI is a standard created in the 90's. It focuses on message passing and can be seen as a communication protocol.
  • programming paradigm:
    Map Reduce
    Map Reduce is a quite simple paradigm which was popularized by Google in 2004 (DeanGhemawat2004MapReduce). It was initially the main computational model for the Hadoop framework. It suffers from major performance problems in many situations.
    Spark
    Spark is one of the recent paradigms that address Map Reduce limitations.

Message Passing Interface

MPI (Message Passing Interface) is the current standard for high performance computing. It has commercial and open source implementations that are generally optimized for specific hardware. Interestingly, MPI implementations are sufficiently optimized to be used also on shared memory computers. The first version of MPI was limited to message passing, but starting from the second version (in the late 90's), one-sided remote memory operations are available; on shared memory computers, these can easily be implemented as shared memory operations.

The API of MPI is very rich and quite low level. It can be seen as the distributed equivalent of low level thread APIs for SMP systems. An MPI program consists of several processes that are mapped to (possibly different) computers. The API allows processes to communicate in various ways:

  • point to point direct communication
  • broadcasting (one to all)
  • reducing (all to one)
  • etc.

Communication is used to synchronize processes via rendez-vous (as part of point to point communication) and barriers (global rendez-vous). Communications can be organized in a quite convenient way with some form of virtual topology.
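
As an illustration, here is a minimal sketch of the typical structure of an MPI program, written in Scala on top of the mpiJava/MPJ Express style Java bindings (the mpi package and the Rank, Size and Reduce calls follow that binding; the exact signatures are an assumption on my part, and the classic C API is analogous). Every process runs the same code and uses its rank to differentiate its behavior; a reduce then gathers the partial results on process 0.

import mpi.MPI

object RankSum {
  def main(args: Array[String]): Unit = {
    MPI.Init(args)                    // join the MPI runtime
    val rank = MPI.COMM_WORLD.Rank()  // identity of this process
    val size = MPI.COMM_WORLD.Size()  // total number of processes
    // each process contributes a local value (here, simply its rank)
    val local = Array(rank.toDouble)
    val total = new Array[Double](1)
    // reduce (all to one): sum the local values into process 0
    MPI.COMM_WORLD.Reduce(local, 0, total, 0, 1, MPI.DOUBLE, MPI.SUM, 0)
    if (rank == 0) println(s"sum of ranks = ${total(0)} over $size processes")
    MPI.Finalize()
  }
}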

The API corresponds to a library with which some runtime tools are associated. As in the case of thread based programs (e.g., OpenMP), one starts a single program that is replicated on the different machines that form the cluster. The runtime configuration can be quite complex, as the user must have access to all the machines of the distributed system; some form of distributed identification therefore has to be in place. In addition, having a shared file system is recommended. Finally, firewalls can introduce a lot of problems and it is recommended to avoid packet filtering inside the cluster.

Map reduce

Map reduce is a simple parallel processing paradigm proposed by Google in 2004 (DeanGhemawat2004MapReduce). The novelty does not lie in the idea of separating computation between map operations and reduce operations (this is fairly standard and was available in 1995 in MPI) but rather in the underlying implementation.

High level principles

The base idea of Map Reduce is to decompose a computational task into map operations and reduce operations:

map
the map operation is applied to some local data by each computational node
reduce
the reduce operation is applied to groups of data produced by the map operations

More specifically, map reduce operates on indexed data. Each data point consists of a pair (key, value). The map operation takes one pair and outputs a list of pairs. The reduce operation takes one key and the set of values associated with that key, as produced by the map operations. It outputs a list of values (with no key).
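
In Scala notation, the shape of the two operations could be written as follows (the type aliases are purely illustrative):

type MapFn[K1, V1, K2, V2] = (K1, V1) => List[(K2, V2)]
type ReduceFn[K2, V2, V3] = (K2, List[V2]) => List[V3]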

The classical example is the word counting one:

  • the map operation takes a document as input (with the key of the document, for instance its title) and outputs a list of pairs (word, 1);
  • the reduce operation takes as input a word (a key produced by the map step) and all the associated values, and outputs the sum of these values.

This can obviously be improved by having the map operation count the words of its document directly. This is somewhat automated by the Combiner facility of Google's map reduce. The sketch below implements the basic version.
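
Here is a framework-free Scala sketch of this basic word count; wcMap, wcReduce and the sequential driver are illustrative stand-ins for what a real map reduce framework provides:

def wcMap(title: String, document: String): List[(String, Int)] =
  document.split("\\s+").toList.map(word => (word, 1))

def wcReduce(word: String, counts: List[Int]): List[Int] =
  List(counts.sum)

// toy sequential driver: group the map outputs by key, then reduce each group
val docs = List(("doc1", "to be or not to be"), ("doc2", "to do"))
val counts = docs.flatMap { case (t, d) => wcMap(t, d) }
  .groupBy(_._1)
  .map { case (w, pairs) => (w, wcReduce(w, pairs.map(_._2)).head) }
// counts: Map(to -> 3, be -> 2, or -> 1, not -> 1, do -> 1)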

Some examples

Co-buying
assume we have shopping carts (orders) given as sets of items. The goal is to count the pairs of items (for correlation analysis). Two solutions:
  1. simple pairwise:
    • map each set to a series of 1's keyed by item pairs
    • reduce on the pairs by summing the 1's
  2. hashmap based:
    • map each set to a series of hashmaps; each hashmap maps items to 1 and is keyed by another item
    • the reduce merges the hashmaps and emits the pairs with their counts
Conditional statistics
assume we have records with two variables: one integer (or nominal) variable \(X\) and one numerical variable \(Y\). The goal is to compute some statistics on \(Y\) conditionally on \(X\), for instance the average revenue by age. A possible solution (see the sketch after this list):
  • map each record to the value of \(Y\) keyed by \(X\)
  • reduce on \(X\) and compute the statistics
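
A Scala sketch of the conditional statistics solution, in the same framework-free style as the word count above (the Record type and the function names are made up for this example):

case class Record(x: Int, y: Double)

def statMap(r: Record): List[(Int, Double)] = List((r.x, r.y))

def statReduce(x: Int, ys: List[Double]): List[Double] =
  List(ys.sum / ys.size)   // conditional mean of Y given X = x

// toy sequential driver, as before
val records = List(Record(25, 30000.0), Record(25, 34000.0), Record(40, 52000.0))
val means = records.flatMap(statMap)
  .groupBy(_._1)
  .map { case (x, pairs) => (x, statReduce(x, pairs.map(_._2)).head) }
// means: Map(25 -> 32000.0, 40 -> 52000.0)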

Design principle

While Map Reduce is frequently presented as a general purpose paradigm for distributed systems (which it is not), its main interest resides in the features of its implementations, which are quite different from what is expected of a typical MPI cluster, for instance. In addition, its design is quite specific and in a way takes a very different path from the one taken by MPI. One could say that MPI targets computationally intensive tasks while map reduce is more adapted to data intensive tasks. In particular, map reduce is adapted to situations where each datum is processed once, while MPI is well suited to iterative algorithms, for instance. Map reduce is also designed to take advantage of local storage, while MPI is generally built on the idea of a shared storage.

In practice, Map reduce is designed to be run on a very large number of machines that together provide a distributed storage. Each machine is aware of what it stores locally. When a map reduce job is run on the cluster, maps are performed locally: each machine is given a copy of the map task and runs it on the data it stores. Then a shuffle operation allows partial results to be shared and transferred from machine to machine in an optimized way. Reduce operations are handled by idle machines using the shuffling facility.

It should be noted that, in theory, map reduce could be run on computers that do not implement a distributed storage. However, the main advantage of map reduce is the automated leveraging of data locality, which cannot be done if the computational backend is not aware of the fine details of the storage backend.

As work distribution and communications are handled by the map reduce framework, a typical implementation can be fault tolerant. The original Google use case targeted a loosely integrated cluster of standard computers with a standard network. In such a cluster, machine failure is common, a fact that motivated this fault tolerance feature.

Hadoop MapReduce

Hadoop is an open source framework for distributed storage and distributed processing. It provides in particular:

Hadoop Distributed File System (HDFS)
distributed storage component
Hadoop MapReduce
distributed processing

As explained above, Map reduce is only efficient if it has some knowledge of data location. This is why the execution engine of Hadoop is tightly coupled with its storage engine.

Hadoop is implemented in Java and therefore provides a native Java API for MapReduce. This is basically done by implementing a Mapper interface and a Reducer interface. The main way to use languages other than Java is the Hadoop Streaming approach, which provides a basic text based interface.
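
Since Scala also runs on the JVM, these interfaces can be used from Scala as well. Below is a sketch of the word count mapper and reducer against the org.apache.hadoop.mapreduce API (the class names TokenMapper and SumReducer are mine, and a complete job would also need a driver that configures and submits it):

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}

class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").foreach { w =>
      word.set(w)
      context.write(word, one)   // emit (word, 1)
    }
}

class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    values.forEach(v => sum += v.get())   // sum the 1's emitted for this word
    context.write(key, new IntWritable(sum))
  }
}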

Spark

Spark is an advanced processing paradigm that overcomes many of the limitations of map reduce. One of the main ideas of Spark is the introduction of the notion of a working set of data on which operations are conducted. This eases the implementation of both iterative methods and interactive analyses, while greatly improving their performance over a map reduce implementation.

Spark is implemented in Scala, a very high level functional and object oriented programming language. Spark also has a Java and a Python interface that are quite close to the Scala one, as well as an official R interface.

Resilient Distributed Datasets (RDD)

The concept of working set is implemented by the Resilient Distributed Dataset (RDD). An RDD is a read only collection of objects that are distributed over the underlying cluster. In general, a Spark program starts by getting an RDD from a distributed storage and then produces a series of RDDs via transformation operators. A very interesting aspect of RDDs is that they can be kept in the main memory of the computers of the cluster (this is called caching), which generally improves performance (a lot).

The term resilient refers to the fact that an RDD can be recomputed from its starting point (by replaying its transformations) in case of a machine failure.
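
A minimal sketch, assuming an existing SparkContext named sc as in the word count example at the end of this section:

val data = sc.textFile("data.txt")   // base RDD obtained from storage
val lengths = data.map(_.length)     // transformation: produces a new RDD
lengths.cache()                      // ask Spark to keep this RDD in memory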

Operations on RDD

A Spark program consists in performing operations on RDDs. Typical operations include:

mapping
apply a transformation to each object of the RDD to build another object (in another RDD)
flatMap
similar to map in map reduce (map one object to possibly many objects)
join
the join operation from databases (see the sketch below)
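
For instance, the join of two RDDs of (key, value) pairs can be sketched as follows (again assuming an existing SparkContext named sc):

val ages = sc.parallelize(Seq(("alice", 25), ("bob", 40)))
val cities = sc.parallelize(Seq(("alice", "Paris"), ("bob", "Lyon")))
val joined = ages.join(cities)   // RDD[(String, (Int, String))], joined on the keys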

None of those operations is actually performed on the fly during the program execution; they are only recorded. The actual computation takes place when an action is performed on an RDD. This enables optimizing the whole series of operations. Actions include:

count
the size of an RDD
collect
retrieves the content of an RDD (e.g., to iterate over it)
reduce
accumulates the content of an RDD with a user provided accumulation function

A word count example:

val lines = sc.textFile("data.txt")                // one element per line of the file
val words = lines.flatMap(line => line.split(" ")) // split each line into words
val pairs = words.map(w => (w, 1))                 // key each word by itself
val counts = pairs.reduceByKey((a, b) => a + b)    // sum the 1's for each word
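
Note that these four lines are all transformations, so nothing is computed yet; an action is needed to trigger the execution, for instance:

counts.collect()   // action: brings the (word, count) pairs back to the driver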

Bibliography

(DeanGhemawat2004MapReduce) Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), 2004.