Big Data?
What is Big Data for you?
Introductory interactive discussion about Big Data.
Objectives of the course
- intellectual self-defense
- clearing up the confusion around Big Data
- when you do not need a Big Data solution
- a tour of the most common solutions
The Big Data confusion
The terms "Big Data" should be used only to refer to very large data sets
and related applications. It should not be used to describe predictive
modeling in general. This confusion is damaging because solutions designed
for super large data sets are generally not adapted to normal size data
sets.
Doug Laney's definition
Doug Laney, an analyst at the META Group (now part of Gartner), proposed the 3 V's in 2001:
- Volume: data size
- Velocity: streaming context
- Variety: text, image, video, etc.
These are sometimes complemented by Veracity (data quality, confidence in the
results). The really important points are Volume and Velocity: Variety has
been around for ages and Veracity is addressed in other contexts. Clinging to
the 3 (or 4) V's is one manifestation (among others) of the confusion.
Consequences of the confusion
Very high volume data sets cannot be stored, queried and processed by standard
solutions (e.g. on a single, albeit powerful, computer). Specialized solutions
have therefore been developed. The main example is the MapReduce paradigm
introduced by Google, with Hadoop as its open source implementation.
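As a rough illustration of the paradigm (not of the Hadoop API), here is a
minimal single-machine word count sketch in Python: a map phase emits
key/value pairs, a shuffle groups them by key and a reduce phase aggregates
each group.
```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word of a document."""
    for word in document.split():
        yield word, 1

def reduce_phase(word, counts):
    """Reduce: aggregate all the counts emitted for a given word."""
    return word, sum(counts)

# Toy input; on a real cluster the documents, map tasks and reduce tasks
# would be spread over many machines.
documents = ["big data is big", "data about data"]

# Shuffle: group the intermediate values by key.
groups = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        groups[word].append(count)

print(dict(reduce_phase(w, c) for w, c in groups.items()))
# {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```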
However, these tools have a rather high entry cost, both in human terms and in
processing power. In other words, they are not suited to small or medium scale
data sets.
Standard confusion story:
- we want to do predictive modelling, but we confuse that with Big Data
- we read tutorials and books about Big Data and find discussions of
  predictive modelling, reinforcing our confusion
- we use the dominant open source solution, Hadoop
- performance is very bad, so we add more computers, reinforcing our
  belief that we are doing Big Data magic
- wash, rinse, repeat
What has changed?
Is there really a before and an after, an emergence of Big Data?
This is indeed the case, with the massive use of computers in everyday life.
Computer use produces traces (a.k.a. activity logs). The most basic example is
probably web server logs, whose entries contain (see the parsing sketch after
this list):
- the IP address of the client (your computer), which reveals facts about you
  such as your approximate location and your internet provider;
- the request date and time;
- the requested page;
- a user agent (a description of your web browser);
- the referrer (if available, this is where you come from).
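Here is a minimal parsing sketch in Python, applied to one made-up entry in
the common Apache "combined" log format, which carries exactly those fields
(the IP address, date, request, referrer and user agent below are invented
for the example):
```python
import re

# One made-up line in the Apache "combined" log format.
line = ('93.184.216.34 - - [12/Oct/2015:09:02:31 +0200] '
        '"GET /index.html HTTP/1.1" 200 5123 '
        '"http://www.example.org/" "Mozilla/5.0 (X11; Linux x86_64)"')

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<datetime>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

match = pattern.match(line)
if match:
    for field, value in match.groupdict().items():
        print(f"{field:>10}: {value}")
```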
Numerous other examples can be given. They all revolve around the following
facts:
- data collection is easy via internet services (web sites and specialized
protocols and apps)
- there is an incentive to provide accurate personal information (e.g. on
  Facebook when discussing with your family and friends, on
  WhatsApp/Telegram where you give your phone number, on Amazon where you
  provide your contact details, etc.)
- most providers are US based and do not have to comply with very
  stringent data privacy rules (contrary to e.g. Germany or
  France)
- computer resources are super cheap, for instance storage:
  - in 1956 the first HDD by IBM cost 50 k$ (equivalent to roughly 430 k$
    of 2013) and stored 5 MB
  - in 1995 a typical 1 GB HDD cost 850 $ (equivalent to 1300 $
    in 2013)
  - in 2005 a typical 200 GB HDD cost 200 $ (equivalent to 120 $
    in 2013)
  - in 2010 a typical 1 TB HDD cost 80 $ (equivalent to 85 $ in 2013)
  - in 2013 a typical 2 TB HDD cost 110 $
  - the cost has thus dropped from 10 000 $ per MB to about 5.5 cents per GB!
- computing resources are readily available, especially via cloud computing
  systems (and their ancestors, such as grid computing systems)
Notice that while the internet is a major provider of personal data, Big Data
appears in other contexts, such as:
- pre-internet services (e.g. credit/debit cards, airline frequent flyer
  programs, etc.)
- physics: the Large Hadron Collider (LHC) and its ancestors
- biology:
  - in 2001, sequencing the first full human genome was estimated to have
    cost around 2 billion $ and 15 years of work
  - in 2009, it cost 100 k$ for 1 year of work
  - in 2014, it cost 1000 $ for 24 hours of work
- smart grids: the ERDF Linky meter
- the open data movement
Big?
A.k.a. Volume.
Some orders of magnitude for storage
See the corresponding Wikipedia entry.
- one bit: 0 or 1
- one byte: 8 bits, 0 to 255 (for instance)
- four bytes (32 bits): integers, old memory address space, IPv4
address, single precision floating point number
- eight bytes (64 bits): long integers, current memory address space,
double precision floating point number
- sixteen bytes (128 bits): IPv6 address, minimum key length in
  state-of-the-art symmetric encryption algorithms (e.g. AES-128)
- 640 kB (\(640\times 2^{10}\times 8\) bits): typical size of a book written in English (500
pages, 2000 characters per page, compressed representation), maximum
memory size in the original IBM PC standard
- 4 MB (\(4\times 2^{20}\times 8\) bits): typical size of a mp3 encoded song
- 10 MB: typical size of a jpeg encoded high quality image
- 40 MB: typical size of a raw image from a high end DSLR
- 650 MB: capacity of a compact disk
- 2 GB (\(2\times 2^{30}\times 8\) bits): iPhone 6s main memory size
- 3 GB: Galaxy S6 main memory size
- 4 GB: maximum addressable memory on a
  standard 32-bit architecture
- 4.7 GB: capacity of a DVD (single layer, single side)
- 8 GB: typical main memory size of a high end laptop / medium range
desktop
- 16 GB: typical main memory size of a workstation
- 25 GB: capacity of a Blu-ray disk
- 128 GB: typical main memory size of a 16 core server
- 200 GB: maximal capacity (in 2015) of a micro SD card
- 512 GB: maximal capacity (in 2015) of a SD card, typical capacity of a
SSD
- 1 TB (\(2^{40}\times 8\) bits): typical size of a HDD
- 10 TB: maximal capacity (in 2015) of a HDD
- 16 TB: maximal capacity of an SSD (just announced)
- 300 PB (\(300\times 2^{50}\times 8\) bits): capacity of the Facebook data
  warehouse in April 2014; corresponds to 300 000 1 TB HDDs or to roughly
  12 million Blu-ray disks
- 15 EB (\(15\times 2^{60}\times 8\) bits): estimated capacity of the Google
  data warehouse in 2013 (unconfirmed but realistic); corresponds to 15
  million 1 TB HDDs or 600 million Blu-ray disks
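As a quick sanity check of those last two conversions, here is a
back-of-the-envelope computation (decimal prefixes are assumed for
simplicity; the binary prefixes used above raise the figures by roughly
10 to 15%):
```python
# Rough unit conversions, using decimal prefixes for simplicity.
GB, TB, PB, EB = 1e9, 1e12, 1e15, 1e18
BLU_RAY = 25 * GB

facebook = 300 * PB   # Facebook data warehouse, April 2014
google = 15 * EB      # estimated Google capacity, 2013 (unconfirmed)

print("Facebook: %.0f x 1 TB HDD, %.0f million Blu-ray disks"
      % (facebook / TB, facebook / BLU_RAY / 1e6))
print("Google:   %.0f million 1 TB HDD, %.0f million Blu-ray disks"
      % (google / TB / 1e6, google / BLU_RAY / 1e6))
```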
Computational capabilities
Expressed in flops: floating point operations per second. There are
essentially two ways of doing floating point calculations: with a CPU
(general purpose microprocessor) or with a GPU (graphics processing unit,
originally specialized for graphics).
Current Intel architecture (as of 2015):
- Haswell and Broadwell microarchitectures
- commercial names: Core i3, i5, i7 and M; Pentium and Celeron; Xeon
- from 2 to 18 cores (4 is the standard)
- clock frequency between 2 GHz and 3.5 GHz
- roughly two operations per clock tick per core
- around 100 Gflops per CPU for general calculations for the best ones
  (around 7000 $); see the back-of-the-envelope estimate after the GPU
  list below
- specific faster operations for linear algebra
- can address up to 1.5 TB of memory
Graphical processing units:
- harder to program
- less memory, up to 16 GB per GPU
- enormous throughput: around 1.5 Tflops per GPU for the best ones
  (e.g. Nvidia Tesla K40, around 3000 $)
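A minimal sketch of where the CPU figure above comes from, using the core
count, clock frequency and operations per cycle quoted in the list (rough
assumptions, not measured values):
```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak = cores x clock frequency x operations per cycle."""
    return cores * clock_ghz * flops_per_cycle

# High end 2015 Xeon: 18 cores at roughly 3 GHz, roughly two operations per
# clock tick per core (vector/FMA units can push this much higher).
print(peak_gflops(18, 3.0, 2), "Gflops")   # ~108 Gflops, i.e. around 100
```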
An example of theoretical cost analysis
Let us consider the case of a regression analysis (e.g. ordinary least
squares) on a data set with N observations in dimension P (P variables). For
least squares, the solution \(\hat{\beta}=(X^TX)^{-1}X^Ty\) is dominated by
the matrix multiplication (\(X^TX\)) and the inversion of the resulting
\(P\times P\) matrix. The computational cost is in \(O(NP^2+P^3)\). Memory
occupation is in \(O(NP+P^2)\). In general \(N>P\) and thus the mixed terms
(\(NP^2\) and \(NP\)) dominate the costs.
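A minimal sketch, in Python, reproducing the timings in the tables below
under the peak rates quoted above (about 100 Gflops for a CPU and 1.5 Tflops
for a GPU; these are optimistic upper bounds, treated here as assumptions):
```python
# Back-of-the-envelope timing estimates for the least squares computation.
CPU_FLOPS = 1e11    # ~100 Gflops
GPU_FLOPS = 1.5e12  # ~1.5 Tflops

def estimated_seconds(n, p, flops):
    """Time to form X^T X and invert it, counting N*P^2 + P^3 operations."""
    return (n * p ** 2 + p ** 3) / flops

# The first tables below keep only the P^3 term (pass n=0), the last one
# uses the full N*P^2 + P^3 count.
for n, p in [(0, 1000), (0, 100000), (100000, 1000), (10 ** 8, 1000)]:
    print(f"N={n:>9}, P={p:>6}: "
          f"CPU ~{estimated_seconds(n, p, CPU_FLOPS):.3g} s, "
          f"GPU ~{estimated_seconds(n, p, GPU_FLOPS):.3g} s")
```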
| P      | \(P^3\) time in seconds with CPU | \(P^3\) time in seconds with GPU |
|--------|----------------------------------|----------------------------------|
| 10     | 1e-8                             | 6.6666667e-10                    |
| 100    | 1e-5                             | 6.6666667e-7                     |
| 1000   | 0.01                             | 6.6666667e-4                     |
| 10000  | 10.                              | 0.66666667                       |
| 100000 | 10000.                           | 666.66667                        |

| P      | \(P^3\) time in hours with CPU | \(P^3\) time in hours with GPU |
|--------|--------------------------------|--------------------------------|
| 100000 | 2.7777778                      | 0.18518519                     |
| 1e6    | 2777.7778                      | 185.18519                      |

| P    | \(P^3\) time in days with CPU | \(P^3\) time in days with GPU |
|------|-------------------------------|-------------------------------|
| 1e6  | 115.74074                     | 7.7160494                     |
| 1e7  | 115740.74                     | 7716.0494                     |
Of course, one million features is quite large, so a more reasonable maximum
value for P might be 10 000. With such a high value of P, one needs many
examples to build reliable predictive models, and thus N should be at least
of the order of 100 000. Then we have the following timings.
| N      | P    | \(NP^2+P^3\) s (CPU) | \(NP^2+P^3\) s (GPU) |
|--------|------|----------------------|----------------------|
| 1000   | 10   | 1.01e-6              | 6.7333333e-8         |
| 5000   | 100  | 5.1e-4               | 3.4e-5               |
| 10000  | 1000 | 0.11                 | 7.3333333e-3         |
| 50000  | 1000 | 0.51                 | 0.034                |
| 100000 | 1000 | 1.01                 | 0.067333333          |
| 1e6    | 1000 | 10.01                | 0.66733333           |
| 1e8    | 1000 | 1000.01              | 66.667333            |
This analysis shows that the number of features is really the limiting
factor (at least in this example). A natural question is whether the number
of features can reach very high values.
Number of features
A very large number of features is natural in many contexts:
- images: high resolution DSLRs have several million pixels
- text mining: a text can be seen as a count vector in a very high
  dimensional space, with one dimension per word (see the sketch after this
  list)
- genome mining: roughly 3 billion bases, with 0.1% of variability
  between humans (so about 3 million differences…)
- log data: keys typed, web pages browsed, electricity consumption,
  position, etc.
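A minimal bag-of-words sketch in plain Python, on two made-up toy documents
(real corpora easily reach vocabularies of hundreds of thousands of words,
hence very high dimensional count vectors):
```python
from collections import Counter

# Two made-up toy documents.
documents = [
    "big data is not predictive modelling",
    "predictive modelling on small data does not need big data tools",
]

# Vocabulary: one dimension per distinct word.
vocabulary = sorted({word for doc in documents for word in doc.split()})

# Each document becomes a count vector in that space.
vectors = [[Counter(doc.split())[word] for word in vocabulary]
           for doc in documents]

print(len(vocabulary), "dimensions")
for vector in vectors:
    print(vector)
```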
A warning about parallelism and distribution
It should be noted that the computational capabilities given above are
- difficult to achieve without very complex programs which make use of
specialized instructions (from MMX in 1996 to recent AVX/FMA) and
respect the cache hierarchy;
- based on intrinsically parallel programs.
I use the term parallel program to refer to programs designed for multiple
cores on a single computer, possibly with several CPUs. I use the term
distributed program for programs that are designed to run on multiple
loosely connected machines. The main point of distinction is whether memory
(RAM) is shared in an efficient way or not.
In order to reach good performance on a high end machine, one needs to
develop parallel programs. To achieve good performance on a set of
computers, one needs to develop distributed programs. The sketch below
illustrates the parallel case.
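A minimal illustration in Python (the workload below is a made-up
placeholder): the work is split over the cores of a single machine, with all
data staying in local memory and storage; a distributed version would
instead require a framework such as Hadoop to coordinate many machines over
a network.
```python
from multiprocessing import Pool
import os

def heavy_computation(x):
    # Placeholder for a CPU-bound task (made-up workload).
    return sum(i * i for i in range(x))

if __name__ == "__main__":
    inputs = [100000] * 32
    # Parallel program: one worker process per core, all running on a
    # single machine and exchanging data through fast local memory.
    with Pool(processes=os.cpu_count()) as pool:
        results = pool.map(heavy_computation, inputs)
    print(len(results), "tasks done on", os.cpu_count(), "cores")
```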
Take home message
Big Data should be used to refer to very large data sets and to applications
dealing with them. Technological and sociological changes have allowed the
collection of enormous data sets. While modern CPUs and GPUs have enormous
processing capabilities, some standard methods do not scale to a large
number of features. Very high feature counts are found in numerous
applications. Parallel processing is mandatory to reach high performance.