Big Data?
What is Big Data for you?
Introductory interactive discussion about Big Data.
Objectives of the course
- intellectual self-defense
- clearing up the confusion around Big Data
- when you do not need a Big Data solution
- a tour of the most common solutions
The Big Data confusion
The terms "Big Data" should be used only to refer to very large data sets
and related applications. It should not be used to describe predictive
modeling in general. This confusion is damaging because solutions designed
for super large data sets are generally not adapted to normal size data
sets.
Doug Laney's definition
Doug Laney, an analyst at the META Group (now part of Gartner), proposed the 3 V's in 2001:
- Volume: data size
- Velocity: streaming context
- Variety: text, image, video, etc.
These are sometimes complemented by Veracity (data quality, confidence in the
results). The really important points are Volume and Velocity: Variety has
been around for ages and Veracity is addressed in other contexts. Clinging to
the 3 (or 4) V's is one manifestation (among others) of the confusion.
Consequences of the confusion
Very high volume data sets cannot be stored, queried and processed by standard
solutions (e.g. on a single, albeit powerful, computer). Specialized solutions
have therefore been developed. The main example is the MapReduce paradigm
introduced by Google, with Hadoop as its open source implementation.
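As a rough illustration of the paradigm (not of the Hadoop API), here is a
minimal single-machine word count sketch in Python: a map phase emits
key/value pairs, a shuffle groups them by key and a reduce phase aggregates
each group.
```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word of a document."""
    for word in document.split():
        yield word, 1

def reduce_phase(word, counts):
    """Reduce: aggregate all the counts emitted for a given word."""
    return word, sum(counts)

# Toy input; on a real cluster the documents, map tasks and reduce tasks
# would be spread over many machines.
documents = ["big data is big", "data about data"]

# Shuffle: group the intermediate values by key.
groups = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        groups[word].append(count)

print(dict(reduce_phase(w, c) for w, c in groups.items()))
# {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```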
However, these tools have a rather high entry cost, both in human terms and in
processing power. In other words, they are not suited to small or medium scale
data sets.
Standard confusion story:
- we want to do predictive modelling, but we confuse that with Big Data
- we read tutorials and books about Big Data and find discussions of
  predictive modelling, reinforcing our confusion
- we use the dominant open source solution, Hadoop
- performance is very bad, so we add more computers, reinforcing our
  belief that we are doing Big Data magic
- wash, rinse, repeat
What has changed?
Is there really a before and an after, an emergence of Big Data?
This is indeed the case, with the massive use of computers in everyday life.
Computer use produces traces (a.k.a. activity logs). The most basic example is
probably web server logs, whose entries contain (see the parsing sketch after
this list):
- the IP address of the client (your computer), which reveals facts about you
  such as your approximate location and your internet provider;
- the request date and time;
- the requested page;
- a user agent (a description of your web browser);
- the referrer (if available, this is where you come from).
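Here is a minimal parsing sketch in Python, applied to one made-up entry in
the common Apache "combined" log format, which carries exactly those fields
(the IP address, date, request, referrer and user agent below are invented
for the example):
```python
import re

# One made-up line in the Apache "combined" log format.
line = ('93.184.216.34 - - [12/Oct/2015:09:02:31 +0200] '
        '"GET /index.html HTTP/1.1" 200 5123 '
        '"http://www.example.org/" "Mozilla/5.0 (X11; Linux x86_64)"')

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<datetime>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

match = pattern.match(line)
if match:
    for field, value in match.groupdict().items():
        print(f"{field:>10}: {value}")
```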
Numerous other examples can be given. They all revolve around the following
facts:
- data collection is easy via internet services (web sites and specialized
protocols and apps)
- there is an incentive to provide accurate personal information (e.g. on
  Facebook when discussing with your family and friends, on
  WhatsApp/Telegram where you give your phone number, on Amazon where you
  provide your contact details, etc.)
- most providers are US based and do not have to comply with very
  stringent data privacy rules (contrary to e.g. Germany or
  France)
- computer resources are super cheap, for instance storage:
  - in 1956 the first HDD by IBM cost 50 k$ (equivalent to roughly 430 k$
    of 2013) and stored 5 MB
  - in 1995 a typical 1 GB HDD cost 850 $ (equivalent to 1300 $
    in 2013)
  - in 2005 a typical 200 GB HDD cost 200 $ (equivalent to 120 $
    in 2013)
  - in 2010 a typical 1 TB HDD cost 80 $ (equivalent to 85 $ in 2013)
  - in 2013 a typical 2 TB HDD cost 110 $
  - the cost has thus dropped from 10 000 $ per MB to about 5.5 cents per GB!
- computing resources are readily available, especially via cloud computing
  systems (and their ancestors, such as grid computing systems)
Notice that while the internet is a major provider of personal data, Big Data
appears in other contexts, such as:
- pre-internet services (e.g. credit/debit cards, airline frequent flyer
  programs, etc.)
- physics: the Large Hadron Collider (LHC) and its ancestors
- biology:
  - in 2001, sequencing the first full human genome was estimated to have
    cost around 2 billion $ and 15 years of work
  - in 2009, it cost 100 k$ for 1 year of work
  - in 2014, it cost 1000 $ for 24 hours of work
- smart grids: the ERDF Linky meter
- the open data movement
Big?
A.k.a. Volume.
Some orders of magnitude for storage
See the corresponding Wikipedia entry.
- one bit: 0 or 1
- one byte: 8 bits, 0 to 255 (for instance)
- four bytes (32 bits): integers, old memory address space, IPv4
address, single precision floating point number
- eight bytes (64 bits): long integers, current memory address space,
double precision floating point number
- sixteen bytes (128 bits): IPv6 address, minimum key length in
  state-of-the-art symmetric encryption algorithms (e.g. AES-128)
- 640 kB (\(640\times 2^{10}\times 8\) bits): typical size of a book written in English (500
pages, 2000 characters per page, compressed representation), maximum
memory size in the original IBM PC standard
- 4 MB (\(4\times 2^{20}\times 8\) bits): typical size of a mp3 encoded song
- 10 MB: typical size of a jpeg encoded high quality image
- 40 MB: typical size of a raw image from a high end DSLR
- 650 MB: capacity of a compact disk
- 2 GB (\(2\times 2^{30}\times 8\) bits): iPhone 6s main memory size
- 3 GB: Galaxy S6 main memory size
- 4 GB: maximum addressable memory on a
  standard 32-bit architecture
- 4.7 GB: capacity of a DVD (single layer, single side)
- 8 GB: typical main memory size of a high end laptop / medium range
desktop
- 16 GB: typical main memory size of a workstation
- 25 GB: capacity of a Blu-ray disk
- 128 GB: typical main memory size of a 16 core server
- 200 GB: maximal capacity (in 2015) of a micro SD card
- 512 GB: maximal capacity (in 2015) of a SD card, typical capacity of a
SSD
- 1 TB (\(2^{40}\times 8\) bits): typical size of a HDD
- 10 TB: maximal capacity (in 2015) of a HDD
- 16 TB: maximal capacity of an SSD (just announced)
- 300 PB (\(300\times 2^{50}\times 8\) bits): capacity of the Facebook data
  warehouse in April 2014; corresponds to 300 000 1 TB HDDs or to roughly
  12 million Blu-ray disks
- 15 EB (\(15\times 2^{60}\times 8\) bits): estimated capacity of the Google
  data warehouse in 2013 (unconfirmed but realistic); corresponds to 15
  million 1 TB HDDs or 600 million Blu-ray disks
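As a quick sanity check of those last two conversions, here is a
back-of-the-envelope computation (decimal prefixes are assumed for
simplicity; the binary prefixes used above raise the figures by roughly
10 to 15%):
```python
# Rough unit conversions, using decimal prefixes for simplicity.
GB, TB, PB, EB = 1e9, 1e12, 1e15, 1e18
BLU_RAY = 25 * GB

facebook = 300 * PB   # Facebook data warehouse, April 2014
google = 15 * EB      # estimated Google capacity, 2013 (unconfirmed)

print("Facebook: %.0f x 1 TB HDD, %.0f million Blu-ray disks"
      % (facebook / TB, facebook / BLU_RAY / 1e6))
print("Google:   %.0f million 1 TB HDD, %.0f million Blu-ray disks"
      % (google / TB / 1e6, google / BLU_RAY / 1e6))
```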
Computational capabilities
Expressed in flops: floating point operations per second. There are
essentially two ways of doing floating point calculations: with a CPU
(general purpose microprocessor) or with a GPU (graphics processing unit,
originally specialized for graphics).
Current Intel architecture (as of 2015):
- Haswell and Broadwell microarchitectures
- commercial names: Core i3, i5, i7 and M; Pentium and Celeron; Xeon
- from 2 to 18 cores (4 is the standard)
- clock frequency between 2 GHz and 3.5 GHz
- roughly two operations per clock tick per core
- around 100 Gflops per CPU for general calculations for the best ones
  (around 7000 $); see the back-of-the-envelope estimate after the GPU
  list below
- specific faster operations for linear algebra
- can address up to 1.5 TB of memory
Graphical processing units:
- harder to program
- less memory, up to 16 GB per GPU
- enormous throughput: around 1.5 Tflops per GPU for the best ones
  (e.g. Nvidia Tesla K40, around 3000 $)
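A minimal sketch of where the CPU figure above comes from, using the core
count, clock frequency and operations per cycle quoted in the list (rough
assumptions, not measured values):
```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak = cores x clock frequency x operations per cycle."""
    return cores * clock_ghz * flops_per_cycle

# High end 2015 Xeon: 18 cores at roughly 3 GHz, roughly two operations per
# clock tick per core (vector/FMA units can push this much higher).
print(peak_gflops(18, 3.0, 2), "Gflops")   # ~108 Gflops, i.e. around 100
```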
An example of theoretical cost analysis
Let us consider the case of a regression analysis (e.g. ordinary least
squares) on a data set with N observations in dimension P (P variables). For
least squares, the solution \(\hat{\beta}=(X^TX)^{-1}X^Ty\) is dominated by
the matrix multiplication (\(X^TX\)) and the inversion of the resulting
\(P\times P\) matrix. The computational cost is in \(O(NP^2+P^3)\). Memory
occupation is in \(O(NP+P^2)\). In general \(N>P\) and thus the mixed terms
(\(NP^2\) and \(NP\)) dominate the costs.
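A minimal sketch, in Python, reproducing the timings in the tables below
under the peak rates quoted above (about 100 Gflops for a CPU and 1.5 Tflops
for a GPU; these are optimistic upper bounds, treated here as assumptions):
```python
# Back-of-the-envelope timing estimates for the least squares computation.
CPU_FLOPS = 1e11    # ~100 Gflops
GPU_FLOPS = 1.5e12  # ~1.5 Tflops

def estimated_seconds(n, p, flops):
    """Time to form X^T X and invert it, counting N*P^2 + P^3 operations."""
    return (n * p ** 2 + p ** 3) / flops

# The first tables below keep only the P^3 term (pass n=0), the last one
# uses the full N*P^2 + P^3 count.
for n, p in [(0, 1000), (0, 100000), (100000, 1000), (10 ** 8, 1000)]:
    print(f"N={n:>9}, P={p:>6}: "
          f"CPU ~{estimated_seconds(n, p, CPU_FLOPS):.3g} s, "
          f"GPU ~{estimated_seconds(n, p, GPU_FLOPS):.3g} s")
```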
| P      | \(P^3\) time in seconds with CPU | \(P^3\) time in seconds with GPU |
|--------|----------------------------------|----------------------------------|
| 10     | 1e-8                             | 6.6666667e-10                    |
| 100    | 1e-5                             | 6.6666667e-7                     |
| 1000   | 0.01                             | 6.6666667e-4                     |
| 10000  | 10.                              | 0.66666667                       |
| 100000 | 10000.                           | 666.66667                        |

| P      | \(P^3\) time in hours with CPU | \(P^3\) time in hours with GPU |
|--------|--------------------------------|--------------------------------|
| 100000 | 2.7777778                      | 0.18518519                     |
| 1e6    | 2777.7778                      | 185.18519                      |

| P    | \(P^3\) time in days with CPU | \(P^3\) time in days with GPU |
|------|-------------------------------|-------------------------------|
| 1e6  | 115.74074                     | 7.7160494                     |
| 1e7  | 115740.74                     | 7716.0494                     |
Of course, one million features is quite large, so a more reasonable maximum
value for P might be 10 000. With such a high value of P, one needs many
examples to build reliable predictive models, and thus N should be at least
of the order of 100 000. Then we have the following timings.
| N      | P    | \(NP^2+P^3\) s (CPU) | \(NP^2+P^3\) s (GPU) |
|--------|------|----------------------|----------------------|
| 1000   | 10   | 1.01e-6              | 6.7333333e-8         |
| 5000   | 100  | 5.1e-4               | 3.4e-5               |
| 10000  | 1000 | 0.11                 | 7.3333333e-3         |
| 50000  | 1000 | 0.51                 | 0.034                |
| 100000 | 1000 | 1.01                 | 0.067333333          |
| 1e6    | 1000 | 10.01                | 0.66733333           |
| 1e8    | 1000 | 1000.01              | 66.667333            |
This analysis shows that the number of features is really the limiting
factor (at least in this example). A natural question is whether the number
of features can reach very high values.
Number of features
A very large number of features is natural in many contexts:
- images: high resolution DSLRs have several million pixels
- text mining: a text can be seen as a count vector in a very high
  dimensional space, with one dimension per word (see the sketch after this
  list)
- genome mining: roughly 3 billion bases, with 0.1% of variability
  between humans (so about 3 million differences…)
- log data: keys typed, web pages browsed, electricity consumption,
  position, etc.
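A minimal bag-of-words sketch in plain Python, on two made-up toy documents
(real corpora easily reach vocabularies of hundreds of thousands of words,
hence very high dimensional count vectors):
```python
from collections import Counter

# Two made-up toy documents.
documents = [
    "big data is not predictive modelling",
    "predictive modelling on small data does not need big data tools",
]

# Vocabulary: one dimension per distinct word.
vocabulary = sorted({word for doc in documents for word in doc.split()})

# Each document becomes a count vector in that space.
vectors = [[Counter(doc.split())[word] for word in vocabulary]
           for doc in documents]

print(len(vocabulary), "dimensions")
for vector in vectors:
    print(vector)
```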
A warning about parallelism and distribution
It should be noted that the computational capabilities given above are
- difficult to achieve without very complex programs which make use of
specialized instructions (from MMX in 1996 to recent AVX/FMA) and
respect the cache hierarchy;
- based on intrinsically parallel programs.
I use the term parallel program to refer to programs designed for multiple
cores on a single computer, possibly with several CPUs. I use the term
distributed program for programs that are designed to run on multiple
loosely connected machines. The main point of distinction is whether memory
(RAM) is shared in an efficient way or not.
In order to reach good performance on a high end machine, one needs to
develop parallel programs. To achieve good performance on a set of
computers, one needs to develop distributed programs. The sketch below
illustrates the parallel case.
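A minimal illustration in Python (the workload below is a made-up
placeholder): the work is split over the cores of a single machine, with all
data staying in local memory and storage; a distributed version would
instead require a framework such as Hadoop to coordinate many machines over
a network.
```python
from multiprocessing import Pool
import os

def heavy_computation(x):
    # Placeholder for a CPU-bound task (made-up workload).
    return sum(i * i for i in range(x))

if __name__ == "__main__":
    inputs = [100000] * 32
    # Parallel program: one worker process per core, all running on a
    # single machine and exchanging data through fast local memory.
    with Pool(processes=os.cpu_count()) as pool:
        results = pool.map(heavy_computation, inputs)
    print(len(results), "tasks done on", os.cpu_count(), "cores")
```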
Take home message
Big Data should be used to refer to very large data sets and to applications
dealing with them. Technological and sociological changes have allowed the
collection of enormous data sets. While modern CPUs and GPUs have enormous
processing capabilities, some standard methods do not scale to a large
number of features. Very high feature counts are found in numerous
applications. Parallel processing is mandatory to reach high performance.