Last updated in 2019.

Theme

This course on the Big Data phenomenon is given in French in the TIDE master to students who follow several data mining and statistical learning courses in parallel. As such, the following notes do not cover those aspects but focus instead on the practical and conceptual impacts of moving from "small" data to "medium" and then "large" data. The course remains rather non-technical and should therefore be accessible to master students with a reasonable background in computer science and statistics.

A short version of this course is also given in English in the MMMEF master as part of an introduction to the Big Data phenomenon in the Data Science course.

Outline and lecture notes

The lecture notes are also available as a PDF file.

  1. Introduction
  2. Big Data Needs and Applications
  3. Shared memory parallel programming
  4. Distributed systems
  5. Big Data in R

Exercises

Big Data in R

  1. Data Carpentry in R with data.table: these exercises use the small data sets provided below (see the first sketch after this list).
  2. Parallel programming in R (see the second sketch after this list).
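
A minimal data.table sketch in the spirit of the first exercise set. The file name pp-small.csv and the column names price and county are assumptions to be adapted to the actual data sets and to the documentation given below.

    ## data carpentry with data.table; the file and column names are
    ## hypothetical and must be adapted to the actual data sets
    library(data.table)

    prices <- fread("pp-small.csv")  # fread returns a data.table

    ## the [i, j, by] syntax: filter rows, compute aggregates and
    ## group by a column in a single expression
    prices[price > 500000,
           .(n = .N, median_price = median(price)),
           by = county][order(-median_price)]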
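
For the second exercise set, a minimal sketch of shared memory parallelism with the base parallel package; slow_task is a placeholder standing in for real work.

    ## shared memory parallelism with the parallel package; mclapply
    ## relies on fork and thus runs sequentially on Windows (use
    ## parLapply on a PSOCK cluster there)
    library(parallel)

    slow_task <- function(i) {
      Sys.sleep(0.1)  # stands in for a real computation
      i^2
    }

    res_seq <- lapply(1:20, slow_task)   # sequential baseline
    res_par <- mclapply(1:20, slow_task, mc.cores = detectCores())
    identical(res_seq, res_par)  # TRUE: same results, less wall time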

Data sets

Small data sets for R

I provide here a series of small data sets intended to show the limitations of data processing in R on a single computer. The data are extracted from the HM Land Registry Price Paid Data and come in two formats (compressed CSV file and Rds file) and several sizes. The files have been produced mainly with the data.table package: when reading an Rds file, one directly recovers a data.table object. It is recommended to work on files whose size is at most one fourth of the RAM of the computer used for data exploration.

It is also recommended to work directly with the Rds files to avoid the long loading and data conversion times needed when working with the CSV files.

Documentation on the content of the files is available here. Contrary to the original format, the CSV files above have headers, in which the column names are those given in the documentation.
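
The difference between the two loading routes can be seen directly in R. In the following minimal sketch, the file names pp-2018.csv.gz and pp-2018.Rds are hypothetical examples of the two formats provided above.

    library(data.table)

    ## the CSV route: fread parses the text and converts every column
    ## (reading a .gz file with fread also requires the R.utils package)
    prices_csv <- fread("pp-2018.csv.gz")

    ## the Rds route: readRDS restores the serialized object directly,
    ## recovering the data.table with no parsing or type conversion
    prices_rds <- readRDS("pp-2018.Rds")
    class(prices_rds)  # "data.table" "data.frame"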

Recommended reading/viewing

General papers

Relational databases (and SQL)

R for big data

Assessment

In order to pass the course, students are expected to study a paper (and associated papers and tutorials) and to demonstrate understanding of its contents. This is done by providing a written summary of a few pages that discusses the paper's content and by giving an oral presentation. A selection of papers is given below.

Alternatively, students who have access to high-performance computers can implement a large-scale data processing proof of concept.

Data storage

Scalable data storage

Non-relational databases

  • Bigtable: Google's distributed storage system for structured data

Execution engines

MapReduce

Spark

  • Spark SQL: how to handle relational data in Spark

Privacy

De-anonymization

Differential privacy

Data Mining

Stream Mining

Frequent Itemsets

Machine learning

Model assessment

Stochastic gradient descent

Deep learning

Use cases

Twitter