Co-clustering large scale mixed data

My PhD student Aichetou Bouchareb (co-advised with Marc Boullé from Orange Labs) worked on co-clustering mixed data during her thesis. Marc and I have been working for a few years on improving and extending his MODL framework, a fully automated nonparametric density estimation technique based on grids, which we extended in particular to graph clustering during the PhD thesis of Romain Guigoures.

In Aichetou's work, the main goal was to enable true mixed data co-clustering. It is indeed quite easy to extend model-based co-clustering to mixed data with both binary and numerical variables. However, the variable clusters are not mixed in that case: each one is either purely numerical or purely binary. Aichetou defined an extension of MODL in which mixed clusters are possible. The key idea is to introduce an intermediate level of clustering: rather than clustering variables, we cluster so-called "variable parts". A variable part is a cluster of values taken by a variable, for instance an interval of values for a numerical variable. Once the value space of each variable has been partitioned into variable parts, we can cluster those parts with no type constraint. Aichetou defined a generative model based on this principle. She derived a prior distribution on its parameters and an estimation strategy based on the MAP principle. The strategy optimizes everything, including the partitioning of the value spaces.
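To make the notion of variable part more concrete, here is a minimal sketch in Python. It is only an illustration and not the method of the thesis: the toy table, the variable names (age, plan), the equal-frequency cut used as a stand-in for the MAP-optimized partition, and the hand-written assignment of parts to clusters are all assumptions made for the example.

```python
from bisect import bisect_right
from collections import Counter


def equal_frequency_cuts(values, n_parts):
    """Cut points splitting a numerical variable into about n_parts intervals.

    Stand-in only: in the approach described above, the partition is
    optimized together with the rest of the model, not fixed beforehand.
    """
    ordered = sorted(values)
    cuts = {ordered[(k * len(ordered)) // n_parts] for k in range(1, n_parts)}
    return sorted(cuts)


# Hypothetical toy table with one numerical and one categorical variable.
rows = [
    {"age": 23, "plan": "prepaid"},
    {"age": 31, "plan": "prepaid"},
    {"age": 47, "plan": "contract"},
    {"age": 65, "plan": "contract"},
]

# Variable parts: intervals for "age", groups of values for "plan".
age_cuts = equal_frequency_cuts([r["age"] for r in rows], n_parts=2)
plan_groups = {"prepaid": "G0", "contract": "G1"}


def to_parts(row):
    """Replace each value of a row by the variable part that contains it."""
    return [
        f"age:I{bisect_right(age_cuts, row['age'])}",
        f"plan:{plan_groups[row['plan']]}",
    ]


# Each instance is now described by variable parts instead of raw values.
encoded = [(i, part) for i, row in enumerate(rows) for part in to_parts(row)]

# Parts can be clustered with no type constraint: here each (hand-written)
# cluster mixes an interval of "age" with a group of "plan" values.
part_cluster = {"age:I0": "A", "plan:G0": "A", "age:I1": "B", "plan:G1": "B"}

# Instances x part-clusters contingency counts, the kind of table a
# co-clustering summarizes into blocks.
counts = Counter((i, part_cluster[part]) for i, part in encoded)
```

In this toy example, the first two instances fall entirely in cluster A and the last two in cluster B, even though each cluster contains parts of both a numerical and a categorical variable.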

This work is covered by the following publications:

  • Un modèle Bayésien de co-clustering de données mixtes (2018) Aichetou Bouchareb, Marc Boullé, Fabrice Rossi and Fabrice Clérot. In Actes de la 18ème Conférence Internationale Francophone sur l'Extraction et gestion des connaissances (EGC'2018), edited by Christine Largeron, Hanane Azzag and Mustapha Lebbah, volume RNTI-E-34, pages 275-280, Paris, France, January 2018.
  • Co-clustering de données mixtes à base des modèles de mélange (2017) Aichetou Bouchareb, Marc Boullé and Fabrice Rossi. In Actes de la 17ème Conférence Internationale Francophone sur l'Extraction et gestion des connaissances (EGC'2017), edited by Fabien Gandon and Gilles Bisson, volume RNTI-E-33, pages 141-152, Grenoble, France, January 2017.
  • Application du coclustering à l'analyse exploratoire d'une table de données (2017) Aichetou Bouchareb, Marc Boullé, Fabrice Clérot and Fabrice Rossi. In Actes de la 17ème Conférence Internationale Francophone sur l'Extraction et gestion des connaissances (EGC'2017), edited by Fabien Gandon and Gilles Bisson, volume RNTI-E-33, pages 177-188, Grenoble, France, January 2017.

The defense took place on the 28th of November. Aichetou gave an inspiring speech in front of the following jury:

  • Prof. Julien Jacques, Université Lyon 1, reviewer
  • Prof. Mohamed Nadif, Université Paris Descartes, reviewer
  • Prof. Gilbert Saporta, Cnam, president of the jury
  • Dr. Gilles Bisson, CNRS, LIG
  • Mr. Fabrice Clérot, Orange Labs Lannion
  • Dr. Marc Boullé, Orange Labs Lannion, co-adviser

and myself.

The summary of the thesis follows:

Co-clustering is a class of unsupervised data analysis techniques aiming at extracting the underlying dependency structure between the rows and columns of a data table in the form of homogeneous blocks, known as co-clusters. These techniques can be distinguished into those that aim at simultaneously clustering the instances and variables, and those that aim at clustering the values of two or more variables of a data set. Most of these techniques are limited to variables of the same type, and are hardly scalable to large data sets while providing easily interpretable clusters and co-clusters. Among the existing value based co-clustering approaches, MODL is suitable for processing large data sets with several numerical or categorical variables. In this thesis, we propose a value based approach, inspired by MODL, to perform a simultaneous clustering of the instances and variables of a data set with potentially mixed-type variables. The proposed co-clustering model provides a Maximum A Posteriori based summary of the data that can be used as it is for exploratory analysis of the data. When the summary is large, exploratory analysis tools, such as model coarsening, can be used to simplify the co-clustering which facilitates the interpretation of the results. We show that the proposed co-clustering approach can handle large data and extract easily interpretable clusters from mixed data with more than 10 million observations. We also show the robustness of the approach, its capacity to extract inter-dependence between the variables, and its good behavior in extreme cases such as in the case of pattern-less data and in the case of perfectly correlated variables.

Aichetou Bouchareb, A regularized approach of instances x variables co-clustering for exploratory data analysis

The thesis is available on TEL here.