I'm separating the data storage part from the data management/querying part, while grouping data generation and data acquisition together. This is somewhat different from some surveys, e.g. HuWenEtAl2014TowardScalable.

Data Generation and Acquisition

The Big Data "revolution" is based on massive data collection capabilities, which are needed because of the massive data generation/production that takes place daily. This is enabled by several important elements:

  1. data are produced essentially in digital format; almost nothing is analog anymore (music, images, video, text)
  2. data collection is either automated or crowdsourced
  3. mobile connected devices enable generation and collection from everywhere
  4. the mobile evolution further reinforces point 1, e.g. via voice recognition, music indexing, etc.

The ongoing evolution will lead to even more data via the "internet of things", i.e. connected devices ranging from activity trackers to "smart" thermostats.

Data acquisition is generally part of the generation process: for instance, a connection to a web site generates a log entry on the web server (ditto for e.g. e-commerce). For some applications, data acquisition can prove complex because of the transmission rates needed between distributed sensors (internet of things) and the collecting servers.

Both generation and acquisition are simplified by local processing. For instance, activity trackers automatically detect the type of activity (sitting, standing, etc.) and can therefore transmit only time intervals. Music recognition (e.g. Shazam) is based on the local computation of a compact signature of the audio.
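
To make the local summarization idea concrete, here is a minimal Python sketch of how an activity tracker could collapse a stream of per-second activity labels into intervals before transmitting them. The data and the summarize_activity function are purely illustrative assumptions, not the code of any actual device.

  # Hedged sketch: collapse a stream of (timestamp, activity) samples into
  # (activity, start, end) intervals so that only the intervals are transmitted.
  def summarize_activity(samples):
      """samples: iterable of (timestamp, activity_label) pairs, assumed sorted."""
      intervals = []
      current_label, start, last = None, None, None
      for timestamp, label in samples:
          if label != current_label:
              if current_label is not None:
                  intervals.append((current_label, start, last))
              current_label, start = label, timestamp
          last = timestamp
      if current_label is not None:
          intervals.append((current_label, start, last))
      return intervals

  # Example: six raw samples become two intervals to transmit.
  raw = [(0, "sitting"), (1, "sitting"), (2, "sitting"),
         (3, "walking"), (4, "walking"), (5, "walking")]
  print(summarize_activity(raw))
  # [('sitting', 0, 2), ('walking', 3, 5)]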

Problems in a nutshell

  1. scalable data collection (enough servers)
  2. scalable data transmission (enough pipes to the servers)

Those can be solved via e.g. cloud offerings.

Big Data Design Patterns

  1. "crowdsourcing": make sure data are produced as part of the service, preferably by the users
  2. distributed computation: summarize the local data using local resources (signature, etc.) before sending them

Data Storage

Collecting data is nice, but they must be stored somewhere. It should be noted that while the streaming model has been studied a lot, in practice data do get stored. They might not be readily usable, especially for low latency applications, but they are still stored somewhere.

As pointed out above, storage is now very cheap. There are three levels of storage, in increasing order of latency and capacity, and in decreasing order of price per byte:

  1. RAM
  2. SSD
  3. HDD

In order to give the illusion of a very large storage capacity with low latency at a decent price, a hierarchical scheme with caching is used. Perhaps more importantly, storage is abstracted from the hardware via high level file systems and/or via networked storage systems, either high level (NAS) or lower level (SAN). The key idea is to separate computational resources from storage resources and to connect them via a (local) network. This allows, among other things, adding more and more storage capacity over time (this is part of horizontal scaling).
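
The caching idea can be illustrated with a small sketch: a small, fast layer (think RAM) placed in front of a large, slow backing store (think HDD), with least-recently-used eviction. The CachedStore class, the cache size and the data below are hypothetical, only meant to show the principle.

  # Minimal sketch of the caching idea behind hierarchical storage: a small,
  # fast cache in front of a large, slow backing store.
  from collections import OrderedDict

  class CachedStore:
      def __init__(self, backing_store, cache_size=3):
          self.backing = backing_store      # large, slow layer (a dict standing in for disk)
          self.cache = OrderedDict()        # small, fast layer with LRU eviction
          self.cache_size = cache_size

      def read(self, key):
          if key in self.cache:             # cache hit: fast path
              self.cache.move_to_end(key)
              return self.cache[key]
          value = self.backing[key]         # cache miss: slow path
          self.cache[key] = value
          if len(self.cache) > self.cache_size:
              self.cache.popitem(last=False)  # evict the least recently used entry
          return value

  disk = {f"block{i}": f"data{i}" for i in range(10)}
  store = CachedStore(disk)
  print(store.read("block1"), store.read("block1"))  # the second read is served from the cache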

Price examples in 2015

  • super high end SAN, Dell PowerVault MD3820f:
    • with 24 TB of hyper fast HDD: 20 000 €
    • with 40 TB of super fast HDD: 32 000 €
  • high end SAN, Dell PowerVault MD3800i:
    • with 48 TB of hyper fast HDD: 20 000 €
  • high end NAS, Synology RS18016xs+:
    • with 48 TB of high end HDD: 8 700 €

NAS and SAN are integrated solutions which can be replaced by basic off the shelf PCs with local storage and a distributed file system. We will discuss this aspect further when we talk about the Hadoop ecosystem.

Problems in a nutshell

It is generally considered that this aspect of big data is mature enough to provide off the shelf solutions adapted to most real world situations (as in the case of collection and acquisition, via e.g. cloud computing offerings).

Data Management and Querying

Stored data are useless if they cannot be manipulated. The standard way of doing that is to use a relational database management system (RDBMS). The so-called NoSQL movement provides alternative solutions.

On the one hand, RDBMS seem to be the best solution, notably via ACID transactions: they should provide the properties needed to ensure that parallel data production, access, modification, etc. do not break the whole system. On the other hand, the CAP "theorem" says that they cannot provide everything that is needed in a distributed setting, thus alternatives should be explored.

Relational database management system

They have three major elements:

  1. they are based on the relational model which represents interlinked data in a way that both avoids redundancy in the data storage and embeds logical constraints on the data
  2. they provide ACID transactions (a minimal illustration follows this list):
    Atomicity
    all or nothing: a transaction is applied entirely or not at all
    Consistency
    valid states only: a transaction takes the database from one valid state to another
    Isolation
    concurrent transactions behave as if they were executed serially
    Durability
    completed transactions are done, like forever (they survive crashes)
  3. almost all RDBMS implement a variant of SQL
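
As an illustration of the ACID properties announced in point 2, here is a hedged sketch using Python's built-in sqlite3 module (a local, single-node RDBMS): a money transfer either commits entirely or is rolled back, so a failed transfer leaves no trace. The accounts table, the amounts and the failure condition are made up for the example.

  # Hedged illustration of atomicity with Python's built-in sqlite3 module.
  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
  conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                   [("alice", 100), ("bob", 50)])
  conn.commit()

  try:
      # A money transfer: both updates must succeed, or neither does.
      conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
      conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bobb'")  # typo: no such row
      if conn.execute("SELECT changes()").fetchone()[0] == 0:
          raise ValueError("credit side not found")
      conn.commit()                # all or nothing: commit only if both updates applied
  except Exception:
      conn.rollback()              # atomicity: the debit is undone as well

  print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
  # [('alice', 100), ('bob', 50)]  -- unchanged, the failed transfer left no trace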

The CAP "theorem"

The CAP "theorem" is a conjecture (or a general observation/principle) made by Eric Brewer in 1998-2000 about distributed systems (see Brewer2012CAP12YearsLater for a recent retrospective by Brewer himself on the principle). It states that any networked shared-data system can have at most two of the following three desirable properties:

consistency (C)
equivalent to having a single up-to-date copy of the data (this is not the C of ACID);
high availability (A)
we can access data in a reasonable time;
tolerance to network partitions (P)
if we break up the system into pieces that can't communicate, it continues to work "correctly".

Unfortunately, this cannot be a theorem as stated because of the vagueness of the properties. Some theorems have been proved by interpreting the properties in specific theoretical contexts, but the overall statement remains a general principle rather than a formal result.

That said, the principle is very simple, especially when viewed from the partition point of view (as in Brewer2012CAP12YearsLater). Indeed, if a distributed system is partitioned, it contains at least two parts that are not connected at some point in time. If a query arrives at one part, this part cannot decide whether it has the latest available version of the data. Thus, it has either to refuse the query (breaking A) or to answer and risk inconsistency (breaking C).
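
To make the dilemma concrete, here is a toy sketch (not any real system) of a replica that is cut off from its peers: in a consistency-first mode it refuses to answer, in an availability-first mode it answers with possibly stale data. The Replica class and the "CP"/"AP" modes are illustrative assumptions.

  # Toy sketch of the choice faced by a replica cut off from its peers
  # during a network partition; the modes and the data are made up.
  class Replica:
      def __init__(self, mode):
          self.mode = mode                  # "CP": favor consistency, "AP": favor availability
          self.data = {"x": 1}              # local, possibly stale, copy
          self.partitioned = True           # cannot reach the other replicas

      def read(self, key):
          if self.partitioned and self.mode == "CP":
              raise RuntimeError("unavailable: cannot confirm this is the latest value")
          return self.data[key]             # AP mode: answer, but the value may be stale

  print(Replica("AP").read("x"))            # answers (possibly stale) -> gives up C
  try:
      Replica("CP").read("x")
  except RuntimeError as e:
      print(e)                              # refuses to answer -> gives up A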

Because traditional RDBMS operate under the ACID principle and thus guarantee consistency, they cannot remain both available and partition tolerant. One of the bases of the NoSQL movement is then to move away from the RDBMS model in an attempt to be more tolerant to the partition issue. As pointed out in Brewer2012CAP12YearsLater, the actual issue is whether, during a partition event, one should favor consistency or availability, and then how to recover a fully consistent system when the partition disappears.

NoSQL

Another way to approach the consequences of the CAP "theorem" is to replace ACID with other properties. The most popular ones are the BASE properties:

Basically Available
availability is the main goal;
Soft state
no formal definition (again!). It could mean that the state can change without user intervention (to reach consistency);
Eventually consistent
if updates stop, the system will eventually become consistent (a minimal sketch of one way to achieve this follows the list).
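
As announced above, here is a minimal sketch of one common way to obtain eventual consistency, "last writer wins": each replica stores a timestamp with each value and, when replicas can exchange their state again, the most recent write is kept everywhere. This is only one strategy among many (real systems may use vector clocks, CRDTs, etc.), and the data are made up.

  # Hedged sketch of eventual consistency via "last writer wins": each replica
  # stores {key: (value, timestamp)} and keeps the newest value when replicas
  # exchange state.
  def merge(replica_a, replica_b):
      """Merge two replica states into a common state."""
      merged = dict(replica_a)
      for key, (value, ts) in replica_b.items():
          if key not in merged or ts > merged[key][1]:
              merged[key] = (value, ts)
      return merged

  # Two replicas diverged during a partition; once updates stop and they can
  # talk again, both converge to the same state.
  a = {"cart": (["book"], 10), "name": ("Ada", 5)}
  b = {"cart": (["book", "pen"], 12)}
  print(merge(a, b))   # {'cart': (['book', 'pen'], 12), 'name': ('Ada', 5)}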

BASE has been the driving design philosophy of NoSQL DBMS. However, it should be noted that the NoSQL term is well chosen but poorly understood. In fact, the common point of most NoSQL DBMS is that they do not use the relational model but other forms of organization (which are not adapted to the SQL language, hence the NoSQL name).

This does not mean that all NoSQL DBMS are distributed. In particular, the key-value database model, which is a typical non relational database structure, can be implemented as a local system or in a distributed fashion.
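
A minimal sketch of the key-value model is given below: the store only supports put/get/delete on opaque keys, with no schema, joins, or cross-key queries, which is precisely what makes it easy to implement locally or to shard across machines. The KeyValueStore class is purely illustrative.

  # Minimal sketch of the key-value model. A distributed version would simply
  # shard keys across machines (e.g. by hashing the key).
  class KeyValueStore:
      def __init__(self):
          self._data = {}

      def put(self, key, value):
          self._data[key] = value

      def get(self, key, default=None):
          return self._data.get(key, default)

      def delete(self, key):
          self._data.pop(key, None)

  store = KeyValueStore()
  store.put("user:42", {"name": "Ada", "cart": ["book"]})
  print(store.get("user:42"))   # values are opaque blobs as far as the store is concerned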

Nevertheless, there are numerous large scale distributed data stores that can be used in big data contexts, both to store the data and to query them, albeit in a limited way compared to what SQL offers. There are also data stores dedicated to specific types of data, for instance documents (XML, JSON, etc.) and graphs.

Problems in a nutshell

The database aspect of big data is a very active research area. The original movement was to move away from traditional relational DBMS towards NoSQL distributed DBMS. However, many of the latter offer little or no consistency guarantee, which can lead to e.g. data loss. Thus, many real world solutions are based on hybrid systems that combine relational DBMS for crucial data with NoSQL distributed DBMS for other data. The current trend in research seems to be the NewSQL class, exemplified by Google's Spanner: a move back to almost relational systems with an SQL interface and strong consistency guarantees. At a smaller scale, distributed versions of standard RDBMS seem to work quite well (e.g. MySQL Cluster).

Data Mining

Obviously, one of the main goals of the Big Data approach is to extract information from the pile of collected data. The application range is enormous and contains, among many others:

  • system monitoring (dashboards): this is very important for large web sites and, more generally, for understanding how users actually use the service (see, for instance, the retweet interface added by Twitter)
  • user modeling for:
    • recommendations: amazon, netflix, etc.
    • targeted advertisement: google, facebook, etc.

In addition to those obvious applications, it should be noted that the data are also used to build models that are in turn used to offer services to users (recommendation sits somewhere in between the two categories). For instance:

  • content searching (recommendation outside of the commercial context)
  • machine translation
  • face detection/recognition (in pictures)
  • automatic summarizing
  • etc.

The main difficulty of data mining in the context of big data is the need for distributed programming. Real world data mining is mostly tailor made and cannot generally be done via graphical user interfaces (GUI): some form of programming is generally needed. Thus, one of the main research efforts in big data mining has been to provide simple yet powerful data mining frameworks that hide the underlying complexity of distributed systems. The most famous one is the Hadoop ecosystem.
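
The classic illustration of such a framework is the MapReduce programming model with word count: the programmer only writes a map function and a reduce function, and the framework takes care of distribution, shuffling, and fault tolerance. Below is a hedged, purely local Python simulation of the model, not actual Hadoop code.

  # Purely local simulation of the MapReduce programming model (word count).
  # In the Hadoop ecosystem the same two functions would run distributed over
  # many machines, with the framework handling shuffling and fault tolerance.
  from collections import defaultdict

  def map_phase(document):
      # map: emit (key, value) pairs, here (word, 1)
      for word in document.split():
          yield (word.lower(), 1)

  def reduce_phase(word, counts):
      # reduce: combine all values associated with the same key
      return (word, sum(counts))

  def mapreduce(documents):
      # the "shuffle" step: group intermediate pairs by key
      groups = defaultdict(list)
      for doc in documents:
          for key, value in map_phase(doc):
              groups[key].append(value)
      return dict(reduce_phase(k, v) for k, v in groups.items())

  print(mapreduce(["big data is big", "data about data"]))
  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}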

Take home message

Handling Big Data is complex. The processing/value chain of Big Data starts with production/acquisition, followed by storage (low level aspects), then management and querying, and finally mining. The first two steps are now handled via rather mature solutions. The database step (the third one) is still under heavy research with no obvious solution, even if the NewSQL paradigm seems to combine the best of the RDBMS world with the scalability of the NoSQL world. The data mining step is also under heavy research and has to face the tension between allowing easy development of specialized data mining tasks and the utter difficulty of programming efficient and sound distributed algorithms.

Bibliography