This course on the Big Data phenomenon is given in French in the TIDE master to students who are following several data mining and statistical learning courses in parallel. As such, the following notes do not cover those aspects but focus on the practical and conceptual impacts of moving from "small" data to "medium" and then "large" data. The course remains rather non-technical and should therefore be accessible to master students with a reasonable background in computer science and statistics.

This course is also given in English in the MMMEF master. For this master, the course includes a primer on data mining methods.

In order to pass the course, students are expected to study a paper (and associated papers and tutorials) and to demonstrate understanding of its contents. This is done by providing a written summary of a few pages discussing the paper's content and by giving an oral presentation. A selection of papers is given below.

Alternatively, students who have access to high-performance computers can implement a large-scale data processing proof of concept.

- Bigtable: Google's distributed storage system for structured data

- HaLoop: a MapReduce framework which supports iterative programs
- Distributed Computing with MapReduce and Pig Latin: the Pig high level language
- Basic MapReduce Algorithm Design: how to design efficient algorithms in MapReduce (chapter 3 of the book)

- Spark SQL: how to handle relational data in Spark

- Robust De-anonymization of Large Sparse Datasets: how to de-anonymize Netflix data using IMDb data

- The Algorithmic Foundations of Differential Privacy: how to give mathematical guarantees of anonymity (students can limit themselves to the first three chapters)

- Mining Data Streams: general tools such as Bloom filters

- Frequent Itemsets: algorithms for large scale data

- The Big Data Bootstrap: how to implement the bootstrap method on big data
- Automating model search for large scale machine learning: how to find good models in a fully automated way

- The Tradeoffs of Large Scale Learning: how optimization error plays a role in machine learning models
- Parallelized stochastic gradient descent: how to implement stochastic gradient descent in parallel
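To give a flavor of the data stream tools mentioned above, here is a toy Bloom filter sketch. It is not taken from the paper; the class name, parameters (`m` bits, `k` hashes derived by salting SHA-256), and sizing are illustrative choices for this sketch only.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: an m-bit array and k hash functions,
    derived here from SHA-256 with different salts (an
    illustrative choice, not the paper's construction)."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8 + 1)

    def _indexes(self, item):
        # One index per salted hash of the item.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all((self.bits[idx // 8] >> (idx % 8)) & 1
                   for idx in self._indexes(item))
```

Membership tests on added items always succeed; queries on absent items fail with high probability, which is the space/accuracy tradeoff the filter trades on.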
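The idea behind parallelized stochastic gradient descent can be sketched in a few lines: run plain SGD independently on each data shard, then average the resulting models. The following is a minimal one-dimensional least-squares illustration, with shards processed sequentially to stand in for parallel workers; all names and hyperparameters are assumptions of this sketch, not the paper's setup.

```python
import random

def sgd_shard(data, lr=0.01, epochs=50, w0=0.0, seed=0):
    # Plain SGD for the 1-D least-squares model y ≈ w * x on one shard.
    rng = random.Random(seed)
    w = w0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            grad = 2 * (w * x - y) * x   # gradient of (w*x - y)^2
            w -= lr * grad
    return w

def parallel_sgd(shards, **kw):
    # Each shard would be handled by its own worker; here they run
    # sequentially for illustration. The final model is the plain
    # average of the per-shard models.
    models = [sgd_shard(list(s), seed=i, **kw)
              for i, s in enumerate(shards)]
    return sum(models) / len(models)
```

Averaging makes the combination step trivially cheap, which is what makes the scheme attractive in a MapReduce-style setting where workers cannot communicate during training.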