Machine learning on the cloud

My PhD student Matthieu Durut has been studying and implementing machine learning algorithms in the cloud. More specifically, Matthieu works for Lokad, a small but very active company that specializes in data analytics delivered as software as a service. Like many small companies, Lokad leverages the cloud paradigm and simply fires up new virtual machines when more processing power is needed.

For some strange reason (at least, strange to me), Lokad standardized on Microsoft technologies (why one would do that in a software as a service setting is beyond my understanding, so please, don't ask ;-) and naturally chose to use Microsoft's cloud offering, Azure. Unfortunately for them (but, somehow, fortunately for me), Azure is a platform as a service offering and lacks even the most standard tools for distributed computing (for instance MPI). It was therefore quite a challenge to implement machine learning techniques efficiently on this crippled platform. And poor Matthieu was in charge of that…

From a scientific point of view, the goal of the thesis was then to design and implement machine learning techniques on a cloud computing platform that lacks an efficient communication API between processing units. We focused on clustering, more precisely on vector quantization, and obtained interesting results summarized in these three publications:

  • A Discussion on Parallelization Schemes for Stochastic Vector Quantization Algorithms (2012) Matthieu Durut, Benoît Patra and Fabrice Rossi. In Proceedings of the XXth European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2012), pages 477-482, Bruges (Belgium), April 2012.
  • Communication Challenges in Cloud K-means (2011) Matthieu Durut and Fabrice Rossi. In Proceedings of XIXth European Symposium on Artificial Neural Networks (ESANN 2011), pages 387-392, Bruges (Belgium), April 2011.
  • K-means on Azure (2010) Matthieu Durut and Fabrice Rossi. In LCCC: NIPS 2010 Workshop on Learning on Cores, Clusters and Clouds, Whistler (Canada), December 2010.

The bulk of the work, however, lies in the implementation itself, which is available as the open source project CloudDALVQ.

The defense

was great! Matthieu did an excellent job of summarizing his work in clear and simple terms, while still focusing on the most technical and challenging aspects of it. The jury was:

  • Prof. Frédéric Magoulès, ECP, referee
  • Prof. Michel Verleysen, UCL, referee
  • Prof. Laurent Pautet, Télécom ParisTech, president of the jury
  • Dr. Ludovic Denoyer, UPMC
  • Joannes Vermorel, founder of Lokad

and myself.

The summary of the thesis follows:

The subjects addressed in this thesis are inspired by research problems faced by the Lokad company. These problems are related to the challenge of designing efficient parallelization techniques for clustering algorithms on a Cloud Computing platform. Chapter 2 provides an introduction to Cloud Computing technologies, especially the ones devoted to intensive computations. Chapter 3 details more specifically Microsoft's Cloud Computing offering: Windows Azure. The following chapter details technical aspects of cloud application development and provides some cloud design patterns. Chapter 5 is dedicated to the parallelization of a well-known clustering algorithm: Batch K-Means. It provides insights on the challenges of a cloud implementation of distributed Batch K-Means, especially the impact of communication costs on the implementation efficiency. Chapters 6 and 7 are devoted to the parallelization of another clustering algorithm, Vector Quantization (VQ). Chapter 6 provides an analysis of different parallelization schemes of VQ and presents the various speedups to convergence provided by them. Chapter 7 provides a cloud implementation of these schemes. It highlights that it is the online nature of the VQ technique that enables an asynchronous cloud implementation, which drastically reduces the communication costs introduced in Chapter 5.

Matthieu Durut, Algorithmes de classification répartis sur le cloud
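
To give a rough feel for what "online" means in the summary above, here is a small Python sketch of stochastic vector quantization in which several simulated workers each process a few points locally and then merge their prototype displacements into a shared version. This is only an illustration under my own assumptions, not the code of CloudDALVQ: the function names, the toy data, the parameter values and the choice of summing displacements at the end of each round are all hypothetical, and the workers run one after the other in a single process. A real cloud version would let each worker push and pull the shared prototypes asynchronously through some storage service.

```python
import random


def nearest(prototypes, x):
    """Return the index of the prototype closest to x (squared Euclidean)."""
    return min(
        range(len(prototypes)),
        key=lambda k: sum((w - v) ** 2 for w, v in zip(prototypes[k], x)),
    )


def vq_step(prototypes, x, eps):
    """One online VQ update: pull the winning prototype toward x, in place."""
    k = nearest(prototypes, x)
    prototypes[k] = [w + eps * (v - w) for w, v in zip(prototypes[k], x)]


def distortion(prototypes, data):
    """Mean squared distance from each point to its nearest prototype."""
    total = 0.0
    for x in data:
        k = nearest(prototypes, x)
        total += sum((w - v) ** 2 for w, v in zip(prototypes[k], x))
    return total / len(data)


def simulated_parallel_vq(shards, init, eps=0.05, rounds=40, batch=5):
    """Hypothetical scheme: each worker starts a round from the shared
    prototypes, runs `batch` local VQ steps on its own shard, and the
    displacements of all workers are summed into the shared version."""
    shared = [list(w) for w in init]
    for r in range(rounds):
        merged = [[0.0] * len(w) for w in shared]
        for shard in shards:  # one loop iteration stands for one worker
            local = [list(w) for w in shared]
            for i in range(batch):
                vq_step(local, shard[(r * batch + i) % len(shard)], eps)
            for k, (new, old) in enumerate(zip(local, shared)):
                merged[k] = [m + n - o for m, n, o in zip(merged[k], new, old)]
        shared = [[w + d for w, d in zip(ws, ds)]
                  for ws, ds in zip(shared, merged)]
    return shared


# Toy data (made up for this sketch): two Gaussian blobs, split over 4 shards.
rng = random.Random(0)
shards = []
for _worker in range(4):
    shard = []
    for center in (0.0, 5.0):
        for _point in range(50):
            shard.append([rng.gauss(center, 0.5), rng.gauss(center, 0.5)])
    rng.shuffle(shard)
    shards.append(shard)

data = [x for shard in shards for x in shard]
init = [[1.5, 1.5], [3.5, 3.5]]
before = distortion(init, data)
final = simulated_parallel_vq(shards, init)
after = distortion(final, data)
print("distortion before:", before, "after:", after)
```

Note that a worker only needs the shared prototypes at the start of a round and only sends back a small displacement, whereas Batch K-Means has to wait for every worker before each recomputation of the centroids. That difference is what makes the asynchronous variant attractive when communication is expensive.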

The thesis is available here (exhaustive archives of Matthieu's work can be found on the TEL page of the thesis). The referees wrote that the thesis is very well written and provides, in addition to its scientific content, a very good overview of cloud computing (and I agree!).

Published

29 September 2012

Tags

research

cloud

azure

phd student

phd defense

clustering

cifre