This course on the Big Data phenomenon is given in French in the TIDE master to students who are following several data mining and statistical learning courses in parallel. As such, the following notes do not cover those aspects but focus on the practical and conceptual impacts of moving from "small" data to "medium" and then "large" data. The course remains rather non-technical and should therefore be accessible to master students with a reasonable background in computer science and statistics.
The lecture notes are also available as a PDF file.
I provide here a series of small data sets intended to illustrate the limitations of data processing in R on a single computer. The data are extracted from the HM Land Registry Price Paid Data and come in two formats (compressed CSV file and Rds file) and several sizes. The files have been produced mainly with the data.table package: reading an Rds file directly recovers a data.table object. It is recommended to work on files whose size is at most one fourth of the RAM of the computer used for data exploration.
It is also recommended to work directly with the Rds files, to avoid the long loading and data conversion times needed when working with CSV files.
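As a minimal sketch of the Rds-versus-CSV difference (using a small synthetic table rather than the actual Land Registry files, whose names are not assumed here), readRDS restores the serialized data.table object as-is, whereas reading a CSV with fread requires parsing the text and converting column types:

```r
library(data.table)

# Build a small synthetic data.table standing in for the real data sets.
dt <- data.table(price  = c(250000L, 180000L, 325000L),
                 county = c("KENT", "ESSEX", "SURREY"))

# Save it as an Rds file (here in a temporary location).
rds_path <- tempfile(fileext = ".Rds")
saveRDS(dt, rds_path)

# readRDS restores the serialized object directly: it is already a
# data.table, with column types preserved, and no parsing is needed.
dt2 <- readRDS(rds_path)
stopifnot(is.data.table(dt2))

# By contrast, loading a CSV means re-parsing text and inferring types:
csv_path <- tempfile(fileext = ".csv")
fwrite(dt, csv_path)
dt3 <- fread(csv_path)   # parsing + type conversion on every load
```

On large files, the parsing and conversion step of the CSV route is what makes the Rds route noticeably faster in practice.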
Documentation on the content of the files is available here. Contrary to the original format, the CSV files above have headers whose column names are the ones given in the documentation.
The other videos are also interesting and relevant. I recommend skipping everything XML-related, and possibly the JSON-related videos as well.
In order to pass the course, students are expected to study a paper (together with associated papers and tutorials) and to demonstrate understanding of its contents. This is done by providing a written summary of a few pages discussing the paper's content and by giving an oral presentation. A selection of papers is given below.
Alternatively, students who have access to high-performance computers can implement a proof of concept for large-scale data processing.