My PhD student Matthieu Durut has been studying and implementing machine learning algorithms in the cloud. More specifically, Matthieu works for Lokad, a small but very active company that specializes in data analytics in a software as a service paradigm. Like many small companies, Lokad leverages the cloud paradigm and simply fires up new virtual machines when more processing power is needed.
For some strange reason (at least, strange to me), Lokad standardized on Microsoft technologies (why one would do that in a software as a service setting is beyond my understanding, so please, don't ask ;-) and naturally chose to use Microsoft's cloud offering, Azure. Unfortunately for them (but, somehow, fortunately for me), Azure is a platform as a service offering and lacks even the most standard tools for distributed computing (for instance MPI). It was therefore quite a challenge to implement machine learning techniques efficiently on this crippled platform. And poor Matthieu was in charge of that…
From a scientific point of view, the goal of the thesis was then to design and implement machine learning techniques on a cloud computing platform that lacks an efficient communication API between processing units. We focused on clustering, more precisely on vector quantization, and obtained interesting results summarized in these three publications:
The bulk of the work, however, lies in the implementation itself, which is available as an open source project, CloudDALVQ.
The defense was great! Matthieu did an excellent job of summarizing his work in clear and simple terms, while still focusing on the most technical and challenging aspects of it. The jury was:
and myself.
The summary of the thesis follows:
The subjects addressed in this thesis are inspired by research problems faced by the Lokad company. These problems are related to the challenge of designing efficient parallelization techniques for clustering algorithms on a Cloud Computing platform. Chapter 2 provides an introduction to Cloud Computing technologies, especially the ones devoted to intensive computations. Chapter 3 details more specifically Microsoft's Cloud Computing offer: Windows Azure. The following chapter details technical aspects of cloud application development and provides some cloud design patterns. Chapter 5 is dedicated to the parallelization of a well-known clustering algorithm: the Batch K-Means. It provides insights on the challenges of a cloud implementation of distributed Batch K-Means, especially the impact of communication costs on the implementation efficiency. Chapters 6 and 7 are devoted to the parallelization of another clustering algorithm, the Vector Quantization (VQ). Chapter 6 provides an analysis of different parallelization schemes of VQ and presents the various speedups to convergence provided by them. Chapter 7 provides a cloud implementation of these schemes. It highlights that it is the online nature of the VQ technique that enables an asynchronous cloud implementation, which drastically reduces the communication costs introduced in Chapter 5.
Matthieu Durut, Algorithmes de classification répartis sur le cloud
The thesis is available here (exhaustive archives of Matthieu's work can be found on the TEL page of the thesis). The referees wrote that the thesis is very well written and provides, in addition to its scientific content, a very good overview of cloud computing (and I agree!).
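For readers unfamiliar with the two algorithms mentioned in the summary, here is a minimal single-machine sketch of the difference between a Batch K-Means iteration and an online VQ update. It is only an illustration under my own assumptions: the function names (`nearest`, `batch_kmeans_step`, `online_vq_step`) and the fixed `step` parameter are hypothetical, and none of this is taken from CloudDALVQ or from the thesis.

```python
def nearest(prototypes, x):
    """Return the index of the prototype closest to x (squared Euclidean distance)."""
    return min(
        range(len(prototypes)),
        key=lambda k: sum((p - v) ** 2 for p, v in zip(prototypes[k], x)),
    )


def batch_kmeans_step(prototypes, data):
    """One Batch K-Means iteration: a full pass over the data, then new centroids."""
    sums = [[0.0] * len(p) for p in prototypes]
    counts = [0] * len(prototypes)
    for x in data:
        k = nearest(prototypes, x)
        counts[k] += 1
        sums[k] = [s + v for s, v in zip(sums[k], x)]
    # A cluster that received no point keeps its previous prototype.
    return [
        [s / counts[k] for s in sums[k]] if counts[k] else list(prototypes[k])
        for k in range(len(prototypes))
    ]


def online_vq_step(prototypes, x, step):
    """One online VQ update: only the winning prototype moves toward the sample x."""
    k = nearest(prototypes, x)
    updated = [list(p) for p in prototypes]
    updated[k] = [p + step * (v - p) for p, v in zip(prototypes[k], x)]
    return updated
```

The batch step cannot produce new prototypes before every point has been assigned, so a distributed version has to gather the partial sums of all workers at each iteration. The online step only needs one sample at a time, which is what leaves room for workers to keep updating their own copy of the prototypes and exchange them occasionally rather than at every iteration.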