Last updated in 2019.

Theme

This course on the Big Data phenomenon is given in French in the TIDE master to students who follow several data mining and statistical learning courses in parallel. As such, the following notes do not cover those aspects but focus instead on the practical and conceptual impacts of moving from "small" data to "medium" and then "large" data. The course remains rather non-technical and should therefore be accessible to master students with a reasonable background in computer science and statistics.

A short version of this course is also given in English in the MMMEF master as part of an introduction to the Big Data phenomenon in the Data Science course.

Outline and lecture notes

The lecture notes are also available as a PDF file.

  1. Introduction
  2. Big Data Needs and Applications
  3. Shared memory parallel programming
  4. Distributed systems
  5. Big Data in R

Exercises

Big Data in R

  1. Data Carpentry in R with data.table: these exercises use the small data sets provided below (see the first sketch after this list).
  2. Parallel programming in R (see the second sketch after this list).
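
A minimal data.table sketch in the spirit of the first exercise set. The file name pp-small.csv and the column names price and county are assumptions to be adapted to the actual data sets and to the documentation given below.

    ## data carpentry with data.table; the file and column names are
    ## hypothetical and must be adapted to the actual data sets
    library(data.table)

    prices <- fread("pp-small.csv")  # fread returns a data.table

    ## the [i, j, by] syntax: filter rows, compute aggregates and
    ## group by a column in a single expression
    prices[price > 500000,
           .(n = .N, median_price = median(price)),
           by = county][order(-median_price)]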
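
For the second exercise set, a minimal sketch of shared memory parallelism with the base parallel package; slow_task is a placeholder standing in for real work.

    ## shared memory parallelism with the parallel package; mclapply
    ## relies on fork and thus runs sequentially on Windows (use
    ## parLapply on a PSOCK cluster there)
    library(parallel)

    slow_task <- function(i) {
      Sys.sleep(0.1)  # stands in for a real computation
      i^2
    }

    res_seq <- lapply(1:20, slow_task)   # sequential baseline
    res_par <- mclapply(1:20, slow_task, mc.cores = detectCores())
    identical(res_seq, res_par)  # TRUE: same results, less wall time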

Data sets

Small data sets for R

I provide here a series of small data sets intended to show the limitations of data processing in R on a single computer. The data are extracted from the HM Land Registry Price Paid Data and come in two formats (compressed CSV file and Rds file) and several sizes. The files have been produced mainly with the data.table package: when reading an Rds file, one directly recovers a data.table object. It is recommended to work on files whose size is at most one fourth of the RAM of the computer used for data exploration.

It is also recommended to work directly with the Rds files to avoid the long loading and data conversion times needed when working with the CSV files.

Documentation on the content of the files is available here. Contrary to the original format, the CSV files above have headers, in which the column names are those given in the documentation.
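
The difference between the two loading routes can be seen directly in R. In the following minimal sketch, the file names pp-2018.csv.gz and pp-2018.Rds are hypothetical examples of the two formats provided above.

    library(data.table)

    ## the CSV route: fread parses the text and converts every column
    ## (reading a .gz file with fread also requires the R.utils package)
    prices_csv <- fread("pp-2018.csv.gz")

    ## the Rds route: readRDS restores the serialized object directly,
    ## recovering the data.table with no parsing or type conversion
    prices_rds <- readRDS("pp-2018.Rds")
    class(prices_rds)  # "data.table" "data.frame"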

Recommended reading/viewing

General papers

Relational databases (and SQL)

R for big data

Assessment

In order to pass the course, students are expected to study a paper (and associated papers and tutorials) and to demonstrate understanding of its contents. This is done by providing a written summary of a few pages that discusses the paper's content and by giving an oral presentation. A selection of papers is given below.

Alternatively, students who have access to high-performance computers can implement a large-scale data processing proof of concept.

Data storage

Scalable data storage

Non-relational databases

  • Bigtable: Google's distributed storage system for structured data

Execution engines

MapReduce

Spark

  • Spark SQL: how to handle relational data in Spark

Privacy

De-anonymization

Differential privacy

Data Mining

Stream Mining

Frequent Itemsets

Machine learning

Model assessment

Stochastic gradient descent

Deep learning

Use cases

Twitter