I separate the data storage part from the data management/querying part, while grouping data generation and data acquisition together. This is somewhat different from the organization used in some surveys, e.g. HuWenEtAl2014TowardScalable.
The Big Data "revolution" is based on massive data collection capabilities, which are needed because of the massive data generation/production that takes place daily. This is enabled by several important elements:
The ongoing evolution will lead to even more data via the "internet of things", i.e. connected devices ranging from activity trackers to "smart" thermostats.
Data acquisition is generally part of the generation process: for instance, a connection to a web site generates a log entry on the web server (ditto for e.g. e-commerce). For some applications, data acquisition can nevertheless prove complex because of the transmission rates needed from distributed sensors (internet of things) and servers.
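To make the log example concrete, here is a minimal sketch of turning raw web server log lines into structured records; the Common Log Format used below is an assumption for illustration, not a claim about any particular server.

    import re

    # Minimal sketch: parsing one line of a web server access log written in the
    # (assumed) Common Log Format into a structured record.
    LOG_PATTERN = re.compile(
        r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
    )

    def parse_log_line(line):
        """Return a dictionary of fields, or None if the line does not match."""
        match = LOG_PATTERN.match(line)
        return match.groupdict() if match else None

    example = '127.0.0.1 - - [10/Oct/2024:13:55:36 +0200] "GET /index.html HTTP/1.1" 200 2326'
    print(parse_log_line(example))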
Both generation and acquisition are simplified by local processing. For instance, activity trackers automatically detect the type of activity (sitting, standing, etc.) and can therefore transmit only activity intervals rather than raw sensor readings. Music recognition (e.g. Shazam) is based on the local computation of a signature.
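The activity tracker case can be illustrated by a small sketch (the sampling scheme and labels are invented for the example): consecutive identical samples are collapsed into intervals before transmission.

    from itertools import groupby

    # Hypothetical illustration: a tracker samples an activity label at regular
    # time steps; transmitting (activity, start, end) intervals is much cheaper
    # than transmitting every sample.
    def to_intervals(samples):
        """samples: list of (timestamp, activity) pairs, sorted by timestamp."""
        intervals = []
        for activity, group in groupby(samples, key=lambda s: s[1]):
            group = list(group)
            intervals.append((activity, group[0][0], group[-1][0]))
        return intervals

    samples = [(0, "sitting"), (1, "sitting"), (2, "walking"), (3, "walking"), (4, "sitting")]
    print(to_intervals(samples))
    # [('sitting', 0, 1), ('walking', 2, 3), ('sitting', 4, 4)]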
Such acquisition and transmission difficulties can also be addressed via e.g. cloud offerings.
Collecting data is nice, but the data must be stored somewhere. It should be noted that while the streaming model has been studied a lot, in practice data get stored: they might not be readily usable, especially for low latency applications, but they are still kept somewhere.
As pointed out above, storage is now very cheap. There are 3 levels of storage, in increasing order of latency, capacity and cheapness:
In order to give the illusion of a very large storage capacity with low latency at a decent price, a hierarchical scheme with caching is used. Perhaps more importantly, storage is abstracted from the hardware via high level file systems and/or via high level systems (NAS) or lower level ones (SAN). The key idea is to separate computational resources from storage resources and to connect them via a (local) network. This makes it possible, among other things, to add more and more storage capacity (this is part of horizontal scaling).
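The caching part of the hierarchy can be sketched as follows: a small fast store sits in front of a large slow one, with eviction of the least recently used entries (here both stores are plain Python dictionaries, which is obviously only a stand-in for RAM versus disk or network storage).

    from collections import OrderedDict

    # Minimal sketch of hierarchical storage with caching: reads are served from
    # a small, fast cache when possible and fall back to the large, slow backing
    # store otherwise; the least recently used entry is evicted when full.
    class CachedStore:
        def __init__(self, backing, capacity=2):
            self.backing = backing          # large, slow storage
            self.cache = OrderedDict()      # small, fast storage (LRU order)
            self.capacity = capacity

        def read(self, key):
            if key in self.cache:
                self.cache.move_to_end(key)     # mark as recently used
                return self.cache[key]
            value = self.backing[key]           # slow path
            self.cache[key] = value
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
            return value

    store = CachedStore({"a": 1, "b": 2, "c": 3})
    print(store.read("a"), store.read("a"))  # the second read hits the cache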
NAS and SAN are integrated solutions which can be replaced by basic off-the-shelf PCs with local storage and a distributed file system. We will discuss this aspect further when we talk about the Hadoop ecosystem.
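The core idea of such a distributed file system can be sketched as follows; the block size and replication factor are assumptions loosely inspired by HDFS defaults, and the placement policy is deliberately simplistic.

    # Illustrative sketch: a file is split into fixed-size blocks and each block
    # is replicated on several commodity nodes (round-robin placement here, real
    # systems use rack-aware policies).
    BLOCK_SIZE = 128 * 1024 * 1024   # 128 MiB, an assumed default
    REPLICATION = 3

    def place_blocks(file_size, nodes):
        """Assign each block of a file to REPLICATION distinct nodes."""
        n_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division
        return {b: [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
                for b in range(n_blocks)}

    print(place_blocks(400 * 1024 * 1024, ["node1", "node2", "node3", "node4"]))
    # {0: ['node1', 'node2', 'node3'], 1: ['node2', 'node3', 'node4'], ...}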
Stored data are useless if they cannot be manipulated. The standard way of doing that is to use a relational database management system (RDBMS). The so-called NoSQL movement provides alternative solutions.
On the one hand, RDBMS seem to be the best solution, notably via ACID transactions: they should provide the properties needed to ensure that parallel data production, access, modification, etc. do not break the whole system. On the other hand, the CAP "theorem" says that they cannot provide everything that is needed in a distributed setting, thus alternatives should be explored.
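As a reminder of what ACID transactions buy us, here is a minimal sketch using Python's standard sqlite3 module (a single-machine example, of course): either both halves of a transfer are applied, or neither is.

    import sqlite3

    # Minimal illustration of atomicity (the "A" of ACID) with the standard
    # sqlite3 module: a failed transaction leaves the database unchanged.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
    conn.commit()

    def transfer(conn, amount, fail_in_the_middle=False):
        with conn:  # one transaction: commit on success, rollback on any exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'alice'", (amount,))
            if fail_in_the_middle:
                raise RuntimeError("simulated crash between the two updates")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'bob'", (amount,))

    try:
        transfer(conn, 50, fail_in_the_middle=True)
    except RuntimeError:
        pass

    print(list(conn.execute("SELECT * FROM accounts")))
    # [('alice', 100), ('bob', 0)] -- the half-done transfer was rolled back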
RDBMS have three major elements:
The CAP "theorem" is a conjecture (or a general observation/principle) made by Eric Brewer in 1998-2000 about distributed systems (see Brewer2012CAP12YearsLater for a recent point of view of Brewer himself on the principle). The CAP theorem states that any networked shared-data system can have at most two of three desirable properties:
Unfortunately, this cannot be a theorem as such because of the vagueness of the properties. Some theorems have been proved by interpreting the properties in specific theoretical contexts, but the overall statement remains a principle rather than a theorem.
That said, the principle is very simple, especially when viewed from the partition point of view (as in Brewer2012CAP12YearsLater). Indeed, if a distributed system is partitioned, it contains at least two parts that are not connected at some point in time. If a query arrives at one of these parts, that part cannot decide whether it has the latest available version of the data. Thus, it has either to refuse the query (breaking A) or to answer and risk inconsistency (breaking C).
Because traditional RDBMS operate under the ACID principle, they cannot be partition tolerant. One of the foundations of the NoSQL movement is then to move away from the RDBMS model in an attempt to be more tolerant to the partition issue. As pointed out in Brewer2012CAP12YearsLater, the actual issue is whether one should favor consistency or availability during a partition event, and then how to recover a fully consistent system when the partition disappears.
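This dilemma and the subsequent recovery can be illustrated with a toy sketch: two replicas of a single key keep accepting writes while they cannot communicate (favoring availability), then are merged with a last-write-wins rule once the partition heals; last write wins is only one possible reconciliation strategy, chosen here for simplicity.

    # Toy sketch of the partition dilemma: two replicas of the same key cannot
    # talk to each other. Favoring availability, both keep accepting writes and
    # diverge; when the partition heals, a reconciliation rule is needed (here
    # "last write wins" on a timestamp, which silently drops one of the writes).
    replica_a = {"x": (1, "old value")}   # key -> (timestamp, value)
    replica_b = {"x": (1, "old value")}

    # During the partition, each side accepts a write independently.
    replica_a["x"] = (2, "written on side A")
    replica_b["x"] = (3, "written on side B")

    def reconcile(a, b):
        """Merge two replicas with a last-write-wins rule (highest timestamp)."""
        merged = {}
        for key in a.keys() | b.keys():
            merged[key] = max(a.get(key, (0, None)), b.get(key, (0, None)))
        return merged

    replica_a = replica_b = reconcile(replica_a, replica_b)
    print(replica_a)   # {'x': (3, 'written on side B')}: side A's update is lost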
Another way to approach the consequences of the CAP "theorem" is to replace ACID with weaker properties. The most popular ones are the BASE properties: Basically Available, Soft state, and Eventual consistency.
BASE has been the driving design philosophy of NoSQL DBMS. However, it should be noted that the NoSQL term is well chosen but poorly understood. In fact, the common point of most NoSQL DBMS is that they do not use the relational model but other forms of organization, which are not suited to the SQL language (hence the NoSQL name).
This does not mean that all NoSQL DBMS are distributed. In particular, the key-value database model, which is a typical non-relational database structure, can be implemented as a local system or in a distributed fashion.
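A minimal sketch of the key-value model follows: the interface is just get/put, and the same interface can be backed by a single local dictionary or, as caricatured below, by several "nodes" chosen by hashing the key (node names and the hashing scheme are invented for the example).

    import hashlib

    # The key-value model exposes only get/put. A local implementation is a plain
    # dictionary; a (grossly simplified) distributed one hashes the key to pick
    # the node that holds it.
    class DistributedKV:
        def __init__(self, node_names):
            self.names = sorted(node_names)
            self.nodes = {name: {} for name in self.names}   # one dict per "node"

        def _node_for(self, key):
            h = int(hashlib.md5(key.encode()).hexdigest(), 16)
            return self.nodes[self.names[h % len(self.names)]]

        def put(self, key, value):
            self._node_for(key)[key] = value

        def get(self, key):
            return self._node_for(key).get(key)

    kv = DistributedKV(["node1", "node2", "node3"])
    kv.put("user:42", {"name": "Alice"})
    print(kv.get("user:42"))   # {'name': 'Alice'}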
Nevertheless, there are numerous large scale distributed data stores that can be used in big data contexts, both to store the data and to query them, albeit in a limited way compared to what SQL offers. There are also data stores dedicated to specific types of data, for instance documents (XML, JSON, etc.) and graphs.
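For the document case, the flavor of such stores can be conveyed by a tiny sketch: schema-free JSON-like documents queried by field content rather than through joins (the field names and documents are invented for the example).

    # Minimal sketch of the document model: a collection of schema-free
    # JSON-like documents queried by field content.
    documents = [
        {"_id": 1, "type": "article", "title": "CAP theorem", "tags": ["databases"]},
        {"_id": 2, "type": "article", "title": "MapReduce", "tags": ["distributed", "mining"]},
        {"_id": 3, "type": "comment", "on": 1, "text": "Nice overview"},
    ]

    def find(collection, **criteria):
        """Return the documents whose fields match all the given criteria."""
        return [doc for doc in collection
                if all(doc.get(field) == value for field, value in criteria.items())]

    print(find(documents, type="article"))   # the two articles
    print(find(documents, on=1))             # the comment attached to document 1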
The database aspect of big data is a very active research area. The original movement was to move away from traditional relational DBMS towards NoSQL distributed DBMS. However, many of them offer little or no consistency guarantee, leading to e.g. data loss. Thus, many real world solutions are based on hybrid systems that combine relational DBMS for crucial data with NoSQL distributed DBMS for other data. The current trend in research seems to be the NewSQL class, exemplified by Google's Spanner: a move back to almost relational systems with an SQL interface and strong consistency guarantees. At a smaller scale, distributed versions of standard RDBMS seem to work quite well (e.g. MySQL Cluster).
Obviously, one of the main goals of the Big Data approach is to extract information from the pile of collected data. The application range is enormous and contains, among many others:
In addition to those obvious applications, it should be noted that the data are also used to build models that are in turn used to offer services to users. For instance (recommendation sits somewhere in between the two categories):
The main difficulty of data mining in the context of big data is the need for distributed programming. Real world data mining is mostly tailor-made and cannot generally be done via graphical user interfaces (GUI): some form of programming is generally needed. Thus, one of the main research efforts in big data mining has been to provide simple yet powerful data mining frameworks that hide the underlying complexity of distributed systems. The most famous one is the Hadoop ecosystem.
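The programming model popularized by Hadoop's MapReduce illustrates this: the user writes only a map and a reduce function and the framework takes care of distribution. The sketch below runs everything locally, so it only shows the shape of the model, not an actual Hadoop job.

    from collections import defaultdict

    # Sketch of the MapReduce programming model, executed locally: the user only
    # writes map_fn() and reduce_fn(); a real framework distributes both phases
    # and shuffles intermediate (key, value) pairs between them.
    def map_fn(document):
        for word in document.split():
            yield word.lower(), 1

    def reduce_fn(word, counts):
        return word, sum(counts)

    def run_mapreduce(documents, map_fn, reduce_fn):
        groups = defaultdict(list)
        for doc in documents:                          # "map" phase
            for key, value in map_fn(doc):
                groups[key].append(value)              # "shuffle": group by key
        return [reduce_fn(k, v) for k, v in groups.items()]   # "reduce" phase

    docs = ["big data is big", "data mining on big data"]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # [('big', 3), ('data', 3), ('is', 1), ('mining', 1), ('on', 1)]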
Handling Big Data is complex. The processing/value chain of Big Data starts with production/acquisition, followed by storage (low level aspects), then management and querying, and finally mining. The first two steps are now handled via rather mature solutions. The database step (the third one) is still under heavy research with no obvious winner, even if the NewSQL paradigm seems to provide the best of the RDBMS world together with the scalability of the NoSQL world. The data mining step is also a very active research area and has to face the tension between allowing easy development of specialized data mining tasks and the utter difficulty of programming efficient and sound distributed algorithms.