Research Activities

I work on machine learning (in a very broad sense), with a focus on statistical learning methods, kernel machines (such as support vector machines) and artificial neural networks. Most of my publications are available online. My current research interests focus on non standard data, namely non vector data (described by dissimilarity or kernel matrices) and functional data. I'm also interested in feature selection and information visualization.

I'm mostly interested in the methodological and theoretical aspects of machine learning and data mining, but I also do more applied work. I've worked in particular on spectrometric problems, on web usage mining and, more recently, on social network analysis. I'm also very interested in bringing scalability to machine learning methods: rather than low complexity approximate methods, I focus on implementation tricks and heuristics that extend the size of the datasets that can be handled by accurate methods.

My research activities also include reviewing for conferences and journals, as well as research contracts. I've also supervised some PhD theses.

Research themes

Functional Data

An important part of my research activity focuses on functional data analysis (FDA). In this framework, data are not finite dimensional vectors but functions from an infinite dimensional space. This introduces both theoretical and practical problems. My contribution to FDA has been to show that neural networks and support vector machines are as efficient for this type of data as they are for standard vector data. I've provided both practical and theoretical evidence to support this claim.

I've recently started working on exploratory analysis of functional data: the main idea is to approximate functions with simple models that can be described by short texts.
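To illustrate the idea, here is a minimal Python sketch that approximates a sampled function by a piecewise constant model and renders that model as a short text. The piecewise constant model on equal-width segments, and the names `piecewise_constant` and `describe`, are illustrative assumptions; they are not necessarily the models studied in the publications.

```python
def piecewise_constant(values, n_segments):
    """Approximate a sampled function by its mean on equal-width segments.

    Returns a list of (start, end, mean) tuples, where `end` is exclusive.
    Assumes 1 <= n_segments <= len(values), so no segment is empty.
    """
    n = len(values)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(values)")
    # Integer arithmetic keeps the segment boundaries exact.
    bounds = [s * n // n_segments for s in range(n_segments + 1)]
    segments = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        mean = sum(values[start:end]) / (end - start)
        segments.append((start, end, mean))
    return segments


def describe(segments):
    """Turn a piecewise constant model into a short textual summary."""
    parts = [
        f"samples {start} to {end - 1}: about {mean:.2f}"
        for start, end, mean in segments
    ]
    return "; ".join(parts)


curve = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = piecewise_constant(curve, 2)
print(describe(model))
```

The number of segments controls the trade-off between the accuracy of the approximation and the length of the description.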

Main publications on functional data analysis (a complete list is available here):

Non vector data

The second important part of my research activity concerns non vector data described by similarity or dissimilarity matrices, as well as by kernels. In this framework, the only available knowledge about the N input data is an N×N matrix of pairwise (dis)similarities between the observations. With some colleagues, I've defined new versions of Prof. Kohonen's Self Organizing Map (SOM) that can handle such data (variations of the Median SOM introduced by Prof. Kohonen and Dr. Somervuo, as well as an algorithm inspired by the relational approaches introduced by Hathaway, Davenport and Bezdek for k-means and its variants). Our aim is both to improve clustering quality and to reduce the computational cost of dissimilarity based methods.
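The building block shared by median-style methods can be sketched in a few lines of Python: when only a dissimilarity matrix is available, a prototype has to be chosen among the observations themselves. The sketch below is a simplified illustration, not the published algorithms: it omits the SOM neighbourhood function, and the names `generalized_median` and `assign_to_prototypes` are hypothetical.

```python
def generalized_median(dissim, members):
    """Return the member whose summed dissimilarity to the others is smallest.

    `dissim` is an N x N matrix (list of lists); `members` lists the indices
    of the observations in the cluster.
    """
    return min(members, key=lambda c: sum(dissim[c][j] for j in members))


def assign_to_prototypes(dissim, prototypes):
    """Assign each observation to the position of its closest prototype.

    `prototypes` is a list of observation indices; the result contains, for
    each observation, an index into `prototypes`.
    """
    return [
        min(range(len(prototypes)), key=lambda p: dissim[i][prototypes[p]])
        for i in range(len(dissim))
    ]


# Toy data: four points on a line at 0, 1, 2 and 10, seen only through
# their pairwise distances.
positions = [0, 1, 2, 10]
dissim = [[abs(a - b) for b in positions] for a in positions]

print(generalized_median(dissim, [0, 1, 2]))   # index of the central point
print(assign_to_prototypes(dissim, [1, 3]))    # cluster of each observation
```

Each prototype update scans the members of a cluster against each other, which is the main source of the computational cost mentioned above.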

I've also explored some simple vector representation methods for some structured non vector data, such as interval data.

Main publications on non vector data analysis (a complete list is available here):

Feature selection

I've also investigated the very important theme of feature selection. My first work in this field used derivatives of the regression function, as estimated by a multi-layer perceptron, to assess the predictive power of a feature. More recently, I've studied the k-nn based estimators of the mutual information proposed by Kraskov, Stögbauer and Grassberger. I'm particularly interested in assessing the actual gain provided by a feature via resampling techniques. I've also studied methods to accelerate the processing of a large number of correlated features via functional approaches and variable clustering techniques.
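For readers unfamiliar with k-nn based mutual information estimation, here is a minimal pure Python sketch of the first estimator of Kraskov, Stögbauer and Grassberger for two scalar variables. It is a naive quadratic-time illustration under stated assumptions (scalar inputs, maximum norm in the joint space, no handling of heavily tied data); the function names are hypothetical and this is not the code used in the publications.

```python
EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma function at a positive integer: psi(n) = -gamma + H_{n-1}."""
    return -EULER_GAMMA + sum(1.0 / k for k in range(1, n))


def ksg_mutual_information(xs, ys, k=3):
    """Estimate I(X;Y) in nats from paired scalar samples (KSG estimator 1)."""
    n = len(xs)
    if len(ys) != n or not 1 <= k < n:
        raise ValueError("need paired samples and 1 <= k < number of samples")
    total = 0.0
    for i in range(n):
        # Distance to the k-th nearest neighbour in the joint space,
        # using the maximum norm.
        joint = sorted(
            max(abs(xs[i] - xs[j]), abs(ys[i] - ys[j]))
            for j in range(n) if j != i
        )
        eps = joint[k - 1]
        # Count the points strictly closer than eps in each marginal space.
        n_x = sum(1 for j in range(n) if j != i and abs(xs[i] - xs[j]) < eps)
        n_y = sum(1 for j in range(n) if j != i and abs(ys[i] - ys[j]) < eps)
        total += digamma_int(n_x + 1) + digamma_int(n_y + 1)
    return digamma_int(k) + digamma_int(n) - total / n
```

In a feature selection setting, such an estimate would be computed between each candidate feature (or group of features) and the target; comparing it with the values obtained on resampled or permuted data is one way to judge whether the apparent gain is real.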

Main publications on feature selection (a complete list is available here):

Application fields

Spectrometry/Chemometrics

Tecator dataset

One of the main application fields of functional data analysis is spectrometry. In this field, observations consist of spectra, which are smooth functions sampled at high resolution (for instance, 1000 samples per spectrum). A spectrum maps wavelengths to some response, such as the absorbance in near infrared spectrometry.

I've applied neural models and support vector machines to spectrometric problems, mainly using the functional approach. The results were very good and showed that the functional framework provides very satisfactory answers to the problems induced by the high dimension of the spectra.

As an alternative to functional methods, I've also worked on variable selection applied to spectrometry and combined with non linear models.

Main publications on Spectrometry/Chemometrics (a complete list is available here; many of my publications on functional data analysis use spectrometric data for experimental evaluations):

Graph mining

Graphs are a good example of non vector data: simple vector representations of the nodes of a graph generally do a poor job of capturing the essence of the graph. Fortunately, the graph itself provides a natural way to define distances, dissimilarities or kernels for pairwise comparison of the vertices. I've recently started working on exploratory analysis of graphs with the Self Organizing Map and other related methods.
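One simple instance of a dissimilarity defined by the graph itself is the shortest path length between two vertices. The Python sketch below computes it with a breadth-first search on an unweighted, undirected graph stored as an adjacency dictionary. The choice of this particular dissimilarity and the function names are assumptions made for the illustration; other choices, such as graph kernels, are equally possible.

```python
from collections import deque


def shortest_path_lengths(adjacency, source):
    """Breadth-first search: number of edges from `source` to each reachable vertex."""
    lengths = {source: 0}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        for neighbour in adjacency[vertex]:
            if neighbour not in lengths:
                lengths[neighbour] = lengths[vertex] + 1
                queue.append(neighbour)
    return lengths


def shortest_path_dissimilarities(adjacency):
    """Return the sorted vertices and the matrix of pairwise path lengths.

    Vertices in different connected components get an infinite dissimilarity.
    """
    vertices = sorted(adjacency)
    matrix = []
    for u in vertices:
        lengths = shortest_path_lengths(adjacency, u)
        matrix.append([lengths.get(v, float("inf")) for v in vertices])
    return vertices, matrix


# Toy graph: a path a - b - c and an isolated vertex d.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
vertices, dissim = shortest_path_dissimilarities(graph)
print(vertices)
for row in dissim:
    print(row)
```

The resulting matrix can be fed directly to any method that only needs pairwise dissimilarities between observations.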

I've worked in particular on social networks, including a fancy medieval peasant social network. My research on this graph was highlighted in Nature News, in the French newspaper Le Figaro and in the CNRS journal.

Main publications on graph mining (a complete list is available here):

Web mining

Web mining provides some very interesting non vector data. Web content mining focuses on the content of web sites and therefore deals with texts, images and other similar contents. I've been more interested in web usage mining. In this application field, the data consist of descriptions of user activity on a web server, obtained from the log files of this server. I've applied an adapted Self Organizing Map to this type of data, in order to cluster and visualize the content of a web site using the usage data only. The usage data are used to define a dissimilarity between the pages of the site.
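As a hedged illustration of how usage data can induce a dissimilarity between pages, the Python sketch below compares two pages through the sets of sessions in which they were visited, using the Jaccard dissimilarity. Both the Jaccard measure and the representation of a session as a list of page names are assumptions made for this example; the dissimilarities used in the publications may differ.

```python
def page_dissimilarities(sessions):
    """Jaccard dissimilarity between pages, based on shared sessions.

    `sessions` is a list of sessions, each one a list of visited page names
    (as could be extracted from server logs). Returns the sorted page names
    and a dict mapping each (page, page) pair to a value in [0, 1].
    """
    visitors = {}
    for session_id, pages in enumerate(sessions):
        for page in pages:
            visitors.setdefault(page, set()).add(session_id)
    names = sorted(visitors)
    dissim = {}
    for p in names:
        for q in names:
            shared = len(visitors[p] & visitors[q])
            either = len(visitors[p] | visitors[q])
            # Every listed page has at least one visitor, so `either` > 0.
            dissim[(p, q)] = 1.0 - shared / either
    return names, dissim


sessions = [
    ["home", "news"],
    ["home", "news", "contact"],
    ["contact"],
]
names, dissim = page_dissimilarities(sessions)
print(names)
print(dissim[("home", "news")], dissim[("home", "contact")])
```

Two pages that are always visited together get a dissimilarity of 0, and two pages that never share a session get a dissimilarity of 1, regardless of what the pages actually contain.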

As an alternative to neural models, I've also applied graph based visualization methods from bibliographic analysis to the same data. The visualizations provide complementary views on the data. I've also worked on the time dependent aspects of web usage data.

Main publications on Web usage mining (a complete list is available here):