Coclustering

My Phd student Romain Guigourès worked for three years on exploratory data analysis based on co-clustering. This work was co-advised by Marc Boullé from Orange Labs and originated from Marc's work. He has been working since 2004 on a generic parameter less approach called MODL which aims at estimating densities or probability distributions with grid based techniques (a type of multidimensional histogram). Assume for instance that you have a dataset of objects described by two qualitative variables. MODL estimates the joint probability distribution of those variables via two partitions of the modalities of said variables, one per variable. Intuitively, the probability of observing a pair \((a,b)\) does then depend only on the cluster of \(a\) and of the cluster of \(b\). The theory is more complex than that, in particular because it integrates automatic choice of the number of clusters via maximum a posteriori, but at least on the surface level, it works as described in the previous sentence.

Marc has been developing MODL with supervised applications in mind, mainly classification and scoring. It was clear however that through density modelling, MODL was also doing co-clustering (and $n$-mode/$n$-way clustering in general). The goal of Romain's thesis was to investigate how useful MODL could be as an exploratory tool.

This lead to several research contributions ranging from non trivial applications of MODL to complex exploratory problems (for instance time evolving graph clustering without a priori time quantification) to the introduction of numerous exploratory tools based on MODL. This work is covered by the following publications:

  • Discovering patterns in time-varying graphs: a triclustering approach (2018) Romain Guigourès, Marc Boullé and Fabrice Rossi. Advances in Data Analysis and Classification, volume 12, number 3, pages 509-536, September 2018.
  • Country-Scale Exploratory Analysis of Call Detail Records Through the Lens of Data Grid Models (2015) Romain Guigourès, Dominique Gay, Marc Boullé, Fabrice Clérot and Fabrice Rossi. In Machine Learning and Knowledge Discovery in Databases (Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML/PKDD 2015), edited by Albert Bifet, Michael May, Bianca Zadrozny, Ricard Gavalda, Dino Pedreschi, Francesco Bonchi, Jaime Cardoso and Myra Spiliopoulou, volume 9286, pages 37-52, Porto, Portugal, September 2015.
  • Co-Clustering Network-Constrained Trajectory Data (2015) Mohamed K. El Mahrsi, Romain Guigourès, Fabrice Rossi and Marc Boullé. In Advances in Knowledge Discovery and Management, edited by Fabrice Guillet, Bruno Pinaud, Gilles Venturini and Djamel Abdelkader Zighed, volume 615, pages 19-32, September 2015.
  • A Study of the Spatio-Temporal Correlations in Mobile Calls Networks (2015) Romain Guigourès, Marc Boullé and Fabrice Rossi. In Advances in Knowledge Discovery and Management, edited by Fabrice Guillet, Bruno Pinaud, Gilles Venturini and Djamel Abdelkader Zighed, volume 615, pages 3-17, September 2015.
  • Nonparametric Hierarchical Clustering of Functional Data (2014) Marc Boullé, Romain Guigourès and Fabrice Rossi. In Advances in Knowledge Discovery and Management, edited by Fabrice Guillet, Bruno Pinaud, Gilles Venturini and Djamel Abdelkader Zighed, volume 527, pages 15-35, February 2014.
  • Étude des corrélations spatio-temporelles des appels mobiles en France (2013) Romain Guigourès, Marc Boullé and Fabrice Rossi. In Actes de la 13ème Conférence Internationale Francophone sur l'Extraction et gestion des connaissances (EGC'2013), edited by Christel Vrain, André Péninou and Florence Sedes, volume RNTI-E-24, pages 437-448, Toulouse, France, February 2013.
  • Classifications croisées de données de trajectoires contraintes par un réseau routier (2013) Mohamed Khalil El Mahrsi, Romain Guigourès, Fabrice Rossi and Marc Boullé. In Actes de 13ème Conférence Internationale Francophone sur l'Extraction et gestion des connaissances (EGC'2013), edited by Christel Vrain, André Péninou and Florence Sedes, volume RNTI-E-24, pages 341-352, Toulouse, France, February 2013.
  • A Triclustering Approach for Time Evolving Graphs (2012) Romain Guigourès, Marc Boullé and Fabrice Rossi. In Co-clustering and Applications, IEEE 12th International Conference on Data Mining Workshops (ICDMW 2012), pages 115-122, Brussels, Belgium, December 2012.
  • Triclustering pour la détection de structures temporelles dans les graphes (2012) Romain Guigourès, Marc Boullé and Fabrice Rossi. In 3ème conférence sur les modèles et l'analyse des réseaux : Approches mathématiques et informatiques (MARAMI 2012), Villetaneuse, France, October 2012.
  • Clustering hiérarchique non paramétrique de données fonctionnelles (2012) Marc Boullé, Romain Guigourès and Fabrice Rossi. In Actes de la 12ème Conférence Internationale Francophone sur l'Extraction et la Gestion des Connaissances (EGC'2012), pages 101-112, Bordeaux, France, February 2012.
  • Segmentation géographique par étude d'un journal d'appels téléphoniques (2011) Romain Guigourès, Marc Boullé and Fabrice Rossi. In 2ème Journée thématique : Fouille de grands graphes, Grenoble (France), October 2011.

The defense

took place on the 4th of December. Romain gave an excellent speech in front of the following jury:

  • Prof. Mohamed Nadif, LIPADE – Université Paris Descartes, referee
  • Prof. Gilbert Saporta, Conservatoire National des Arts et Métiers, referee
  • Prof. Emmanuel Viennet, L2TI – Université Paris 13, president of the jury
  • Dr. Gilles Bisson, LIG – Université de Grenoble
  • Prof. Vincent Blondel, Université Catholique de Louvain
  • Dr. Marc Boullé, Orange Labs Lannion, co-adviser

and myself.

The summary of the thesis follows:

Co-clustering is a clustering technique aiming at simultaneously partitioning the rows and the columns of a data matrix. Among the existing approaches, MODL is suitable for processing huge data sets with several continuous or categorical variables. We use it as the baseline approach in this thesis. We discuss the reliability of applying such an approach on data mining problems like graphs partitioning, temporal graphs segmentation or curve clustering.

MODL tracks very fine patterns in huge data sets, that makes the results difficult to study. That is why, exploratory analysis tools must be defined in order to explore them. In order to help the user in interpreting the results, we define exploratory analysis tools aiming at simplifying the results in order to make possible an overall interpretation, tracking the most interesting patterns, determining the most representative values of the clusters and visualizing the results. We investigate the asymptotic behavior of these exploratory analysis tools in order to make the connection with the existing approaches.

Finally, we highlight the value of MODL and the exploratory analysis tools owing to an application on detailed call records from the telecom operator Orange, collected in Ivory Coast.

Romain Guigourès, Utilisation des modèles de co-clustering pour l’analyse exploratoire des données

The thesis is available on TEL here (it's written French).