Theme

This course on the Big Data phenomenon is given in French in the TIDE master to students who are taking several data mining and statistical learning courses in parallel. Accordingly, these notes do not cover those topics but focus on the practical and conceptual impact of moving from "small" data to "medium" and then "large" data. The course remains rather non-technical and should therefore be accessible to master's students with a reasonable background in computer science and statistics.

A short version of this course is also given in English in the MMMEF master as part of an introduction to the Big Data phenomenon in the Data Science course.

Outline and lecture notes

  1. Introduction
  2. Big Data Needs and Applications
  3. Shared memory parallel programming
  4. Distributed systems

Recommended reading/viewing

General papers

Relational databases (and SQL)

Assessment

To pass the course, students are expected to study a paper (along with associated papers and tutorials) and to demonstrate that they understand its contents. This is done by writing a summary of the paper, a few pages long, that discusses its content, and by giving an oral presentation. A selection of papers is given below.

Alternatively, students who have access to high-performance computers can implement a proof of concept for large-scale data processing.

Data storage

Non-relational databases

  • Bigtable: Google's distributed storage system for structured data

Execution engines

MapReduce

Spark

  • Spark SQL: how to handle relational data in Spark

Privacy

De-anonymization

Differential privacy

Data Mining

Stream Mining

Frequent Itemsets

Machine learning

Model assessment

Stochastic gradient descent

Deep learning

Use cases

Twitter