You are here

T3.1 Scalable Algorithm Design


Essentially, this task deals with processing of large amounts of data, i.e., I/O intensive computation that is tolerant to delays. The input to define network data processing follows the use cases described in WP1, whereas the output of data processing is exposed to WP4 with a database layer. In this task we define and implement a multitude of basic operations and transformations on network data collected with the mPlane infrastructure. Batch data analysis algorithms, the core of this task, are delay tolerant (as they operate on large volumes of data) and I/O hungry (disk and network resources of the computing cluster play a crucial role). Such algorithms compute general statistics on application-specific network metrics (i.e., number of packets transmitted by a particular host, average delay experienced for a particular service, etc.) and store their output -- as explained in Task 3.3 -- such that it is available to the intelligent reasoner defined in WP4.

Today, very little is known concerning how to develop efficient and scalable algorithms that are executed on a parallel processing framework. A naive approach to algorithm design may result in complex, multi-stage jobs that are not efficient, or in jobs that do not appropriately exploit cluster resources. Our goal in this domain is to develop a series of “design patterns” limited to the kind of data analysis we target in mPlane: such design patterns can be used as a basis for job optimization, i.e., such that a particular job can output results within a given deadline. Examples of such design patterns include the ability to express parallel computation using complex keys to “route” information among machines in an efficient way and the ability to control the data flow in the execution framework such that recurring operations (e.g., average number of transmitted bytes per hour, day and month) can be executed in a single pass over the input thus avoiding multiple reads of large volumes of data.