Last update in 2019.

This course on the Big Data phenomenon is given in French in the TIDE master to students who are following, in parallel, several data mining and statistical learning courses. As such, the following notes do not cover those aspects but focus on the practical and conceptual consequences of moving from "small" data to "medium" and then "large" data. The course remains rather non-technical and should therefore be accessible to master's students with a reasonable background in computer science and statistics.

A short version of this course is also given in English in the MMMEF master, as part of an introduction to the Big Data phenomenon in the Data Science course.

The lecture notes are also available as a PDF file.

- Introduction
- Big Data Needs and Applications
- Shared memory parallel programming
- Distributed systems
- Big Data in R

- Data Carpentry in R with data.table: those exercises make use of the small size data sets provided below.
- Parallel programming in R
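
The parallel programming exercises rely on R's built-in parallel package. As a minimal sketch of the idea (the Monte Carlo task and core count are illustrative, not part of the exercises), a simple estimation of pi can be distributed over several cores with mclapply():

```r
library(parallel)

# Monte Carlo estimation of pi, split across cores.
# mclapply() uses fork-based parallelism and thus works on
# Unix-like systems only; on Windows, use parLapply() with a cluster.
n_cores  <- 2L
n_points <- 1e6  # points drawn per core

hits <- mclapply(seq_len(n_cores), function(i) {
  x <- runif(n_points)
  y <- runif(n_points)
  sum(x^2 + y^2 <= 1)  # points falling inside the quarter disc
}, mc.cores = n_cores)

pi_hat <- 4 * sum(unlist(hits)) / (n_cores * n_points)
```

Because the workers are independent, the speed-up is close to linear in the number of cores for this kind of embarrassingly parallel task, which is the setting studied in the exercises.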

I provide here a series of small data sets intended to illustrate the limitations of data processing in R on a single computer. The data are extracted from the HM Land Registry Price Paid Data and come in two formats (compressed CSV file and Rds file) and several sizes. The files have been produced mainly with the data.table package: reading an Rds file directly yields a data.table object. It is recommended to work on files that are at most one fourth of the RAM of the computer used for data exploration.

- 500 MB data set: csv Rds
- 1 GB data set: csv Rds
- 1.5 GB data set: csv Rds
- 1.9 GB data set: csv Rds
- 3.7 GB data set: csv Rds

It is also recommended to work directly with the Rds files, to avoid the long loading and data conversion times needed when working with CSV files.

Documentation on the content of the files is available here. Contrary to the original format, the CSV files above have headers whose column names are the ones given in the documentation.
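
As a short sketch of the two loading paths (the file names below are placeholders, to be replaced by one of the files listed above):

```r
library(data.table)

# Rds path: readRDS() restores the saved object directly,
# here a data.table, with column types already in place.
prices <- readRDS("price-paid-500MB.Rds")

# CSV path: fread() is much faster than base read.csv() and
# returns a data.table, but it must still parse the text and
# re-infer column types on every load. Reading a .csv.gz file
# additionally requires the R.utils package to be installed.
prices_csv <- fread("price-paid-500MB.csv.gz")
```

Comparing the two load times on the same data set (for instance with system.time()) is a good first exercise and motivates the recommendation above.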

- Introduction to SQL by Ben Smith
- Video series by Jennifer Widom

The other videos in the series are also interesting and relevant; I recommend skipping the XML-related videos, and possibly the JSON-related ones.

In order to pass the course, students are expected to study a paper (and associated papers and tutorials) and to demonstrate understanding of its contents. This is done by providing a written summary of a few pages discussing the paper's content and by giving an oral presentation. A selection of papers is given below.

Alternatively, students who have access to high-performance computers can implement a large-scale data processing proof of concept.

- Scalable SQL and NoSQL Data Stores: a survey on scalable data stores

- Bigtable: Google's distributed storage system for structured data

- HaLoop: a MapReduce framework which supports iterative programs
- Distributed Computing with MapReduce and Pig Latin: the Pig high level language
- Basic MapReduce Algorithm Design: how to design efficient algorithms in MapReduce (chapter 3 of the book)

- Spark SQL: how to handle relational data in Spark

- Robust De-anonymization of Large Sparse Datasets: how to de-anonymize Netflix data using IMDb data

- The Algorithmic Foundations of Differential Privacy: how to give mathematical guarantees of anonymity (students can limit themselves to the first three chapters)
- Privacy Integrated Queries: an actual implementation of DP in a data analysis context

- Mining Data Streams: general tools such as Bloom filters

- Frequent Itemsets: algorithms for large scale data

- The Big Data Bootstrap: how to implement the bootstrap method on big data
- Automating model search for large scale machine learning: how to find good models in a fully automated way

- The Tradeoffs of Large Scale Learning: how optimization error plays a role in machine learning models
- Parallelized stochastic gradient descent: how to implement stochastic gradient descent in parallel