My PhD student Romain Guigourès worked for three years on exploratory data analysis based on co-clustering. This work was co-advised by Marc Boullé from Orange Labs and originated from Marc's own research. Since 2004, Marc has been working on a generic parameter-free approach called MODL, which aims at estimating densities or probability distributions with grid-based techniques (a type of multidimensional histogram). Assume for instance that you have a dataset of objects described by two qualitative variables. MODL estimates the joint probability distribution of those variables via two partitions of the modalities of said variables, one per variable. Intuitively, the probability of observing a pair \((a,b)\) then depends only on the cluster of \(a\) and the cluster of \(b\). The theory is more complex than that, in particular because it integrates an automatic choice of the number of clusters via maximum a posteriori estimation, but at least on the surface level, it works as described in the previous sentence.
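To make the intuition above concrete, here is a minimal Python sketch of a piecewise-constant joint estimate built from two partitions. It is not MODL itself: the partitions are given by hand instead of being selected by maximum a posteriori, and the function name `grid_joint_estimate`, the toy data and the choice to spread each cell's mass uniformly over its pairs are all assumptions made for illustration.

```python
from collections import Counter


def grid_joint_estimate(pairs, row_cluster, col_cluster):
    """Return a function estimating P(a, b) from two given partitions.

    Illustrative assumption: the probability is constant inside each
    cell (cluster of a, cluster of b) of the grid.
    """
    n = len(pairs)
    # Number of observations falling in each cell of the grid.
    cell_counts = Counter((row_cluster[a], col_cluster[b]) for a, b in pairs)
    # Number of modalities in each cluster, for both variables.
    row_sizes = Counter(row_cluster.values())
    col_sizes = Counter(col_cluster.values())

    def prob(a, b):
        ca, cb = row_cluster[a], col_cluster[b]
        # Empirical cell mass, shared equally by all pairs of the cell.
        return cell_counts[(ca, cb)] / (n * row_sizes[ca] * col_sizes[cb])

    return prob


# Toy data: eight observed pairs of modalities (hypothetical).
pairs = [
    ("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2"),
    ("a3", "b3"), ("a3", "b3"), ("a1", "b3"), ("a3", "b1"),
]
# Hand-picked partitions, one per variable (MODL would infer these).
row_cluster = {"a1": "A", "a2": "A", "a3": "B"}
col_cluster = {"b1": "X", "b2": "X", "b3": "Y"}

prob = grid_joint_estimate(pairs, row_cluster, col_cluster)
print(prob("a1", "b1"), prob("a2", "b2"), prob("a3", "b3"))
```

In this sketch, `a1` and `a2` share a cluster, as do `b1` and `b2`, so the pairs `(a1, b1)` and `(a2, b2)` receive the same estimated probability, and the estimates sum to one over all pairs of modalities.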
Marc has been developing MODL with supervised applications in mind, mainly classification and scoring. It was clear, however, that through density modelling MODL was also doing co-clustering (and \(n\)-mode/\(n\)-way clustering in general). The goal of Romain's thesis was to investigate how useful MODL could be as an exploratory tool.
This led to several research contributions, ranging from non-trivial applications of MODL to complex exploratory problems (for instance time-evolving graph clustering without a priori time quantization) to the introduction of numerous exploratory tools based on MODL. This work is covered by the following publications:
The defense took place on the 4th of December. Romain gave an excellent speech in front of the following jury:
The summary of the thesis follows:
Co-clustering is a clustering technique aiming at simultaneously partitioning the rows and the columns of a data matrix. Among the existing approaches, MODL is suitable for processing huge data sets with several continuous or categorical variables. We use it as the baseline approach in this thesis. We discuss the reliability of applying such an approach to data mining problems like graph partitioning, temporal graph segmentation or curve clustering.
MODL tracks very fine patterns in huge data sets, which makes the results difficult to study. To help the user interpret them, we define exploratory analysis tools that simplify the results to make an overall interpretation possible, track the most interesting patterns, determine the most representative values of the clusters and visualize the results. We investigate the asymptotic behavior of these exploratory analysis tools in order to make the connection with the existing approaches.
Finally, we highlight the value of MODL and the exploratory analysis tools through an application to detailed call records from the telecom operator Orange, collected in Ivory Coast.
Romain Guigourès, Utilisation des modèles de co-clustering pour l’analyse exploratoire des données
The thesis is available on TEL here (it's written in French).