WP3 - Large-scale data analysis

Work package title: Large-scale data analysis
Start date or starting event: M4
Activity Type: RTD
Leader: EURECOM - Pietro Michiardi (Pietro.Michiardi@eurecom.fr)
Participants: POLITO SSB TI ALBLF ENST NEC TID FW NETVISOR FTW A-LBELL

Objectives

The goal of WP3 is to build systems to store and process the data collected through the mPlane probes developed within WP2. WP3 relies on a parallel processing framework to carry out large-scale data processing: in particular, we will focus on systems akin to Hadoop, the open-source implementation of MapReduce, which is nowadays a de-facto standard for large-scale data mining. Hadoop comprises a versatile storage layer and an orchestration framework that can be used to execute data analysis on a cluster of commodity hardware. We define the following objectives:
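To make the processing model concrete, the following is a minimal, self-contained sketch of the map/reduce abstraction that Hadoop implements, applied to a toy network-measurement aggregation (the record names and values are purely illustrative, not mPlane data formats):

```python
from collections import defaultdict

# Toy records as probes might emit them: (source_ip, bytes) pairs.
records = [("10.0.0.1", 500), ("10.0.0.2", 1200), ("10.0.0.1", 300)]

def map_phase(recs):
    # Map: emit (key, value) pairs -- here, byte counts keyed by source IP.
    for ip, nbytes in recs:
        yield ip, nbytes

def reduce_phase(pairs):
    # Shuffle + reduce: group values by key, then aggregate each group.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

totals = reduce_phase(map_phase(records))
print(totals)  # {'10.0.0.1': 800, '10.0.0.2': 1200}
```

In Hadoop, the map and reduce phases run in parallel across a cluster and the shuffle is handled by the framework; the sketch only shows the programming contract.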

  • Design and implement scalable algorithms that operate on very large amounts of data. As illustrated in WP4, we expect to receive, in an asynchronous manner, input data from the multitude of probes defined in WP2. Based on the use cases defined in WP1, we will design a comprehensive set of network data analysis jobs, which provide aggregated results to be further analysed in WP4.
  • Design, implement and evaluate scheduling protocols for the efficient and fair allocation of computing resources to network data analysis jobs. Essentially, our objective is to design a new scheduling component for analytic tasks in parallel processing frameworks that takes into account the particular computational workloads generated by the mPlane infrastructure.
  • Design and deploy a distributed database system to expose aggregated and external network data to other mPlane components. The objective is to offer an interface for WP4 to access the data elaborated in WP3, as well as data available in digital repositories external to the mPlane infrastructure. Part of this objective is the design and implementation of specific indexing schemes to support a variety of queries aimed at retrieving a small subset of the data available in the repository.
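As a toy illustration of why size-based scheduling matters for the second objective, the sketch below compares mean job completion time under arrival order (FIFO) and under shortest-job-first ordering. The job sizes are invented for the example; in practice they would come from a job-length estimator:

```python
# Hypothetical estimated job sizes (processing time units), in arrival order.
fifo = [120, 5, 40]

def mean_completion(sizes):
    # Mean completion time when jobs run back-to-back in the given order:
    # each job completes at the cumulative sum of the sizes before it plus its own.
    elapsed, total = 0, 0
    for size in sizes:
        elapsed += size
        total += elapsed
    return total / len(sizes)

sjf = sorted(fifo)  # size-based (shortest-job-first) ordering

print(round(mean_completion(fifo), 1))  # 136.7
print(round(mean_completion(sjf), 1))   # 71.7
```

Serving small jobs first roughly halves the mean completion time in this example, which is the intuition behind the size-based scheduling protocols studied in T3.2; fairness towards large jobs is the competing concern a real scheduler must balance.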

Description of work

The work package contains three tasks: T3.1 (scalable data analysis algorithms), T3.2 (scheduling of data analysis jobs), and T3.3 (database layer design and deployment).

Partner contributions

  • EURECOM (WP leader) will work on T3.1 (with a particular focus on defining “design patterns” underlying network data processing patterns), T3.2 (investigating techniques to estimate analytic job execution times and size-based scheduling protocols) and T3.3 (with a focus on distributed, column-oriented databases, the implementation of ad-hoc indexing mechanisms and the implementation of glue software to ingest external data stored in existing repositories).
  • POLITO will work on T3.1, defining and studying algorithms to correlate measurements coming from different probes at the same time, or from the same probe at different times. The goal is to study methodologies to extract the behaviour of measurements by considering spatial or temporal diversity in the context of the “change detection” use case. We will also contribute to T3.3 on advanced indexing techniques to support aggregate queries. Furthermore, data mining techniques will be exploited to infer aggregation hierarchies from the analysed data.
  • SSB will support the integration and implementation of the database architecture, taking into account the access control and data protection mechanisms developed in T1.4.
  • TI will mainly contribute its expertise as an ISP to help define relevant problems to be solved in WP3.
  • ALBLF will work on data analysis algorithms to extract useful information from the large collection of data that the mPlane system will have to handle.
  • ENST will work on T3.1 (especially to stress-test the pattern-based algorithm design) and, to a lesser extent, on T3.2 (investigating whether pattern-based design can lead to efficient workflows, e.g., by exploiting the caching of intermediate results in a pattern-aware task scheduler).
  • NEC will work on T3.1 on the design of systems for the continuous processing of data, where batch and stream processing co-exist. Furthermore, NEC will contribute by comparing batch and stream processing systems (e.g., the performance of Hadoop versus packet capture and analysis frameworks in analysing network traffic). Finally, NEC will work on T3.2 to investigate effective resource scheduling schemes for jobs running on a cluster of computers.
  • TID will work on T3.2 and T3.3 to define an efficient database structure and to study the data partitioning problem.
  • FW will support the installation of the testbed developed in WP3.
  • NETVISOR will leverage its experience to support the deployment of large-scale databases in a distributed infrastructure.
  • FTW will work on T3.1, studying and defining how to develop efficient and scalable data analysis algorithms to be executed on a parallel processing framework. FTW will also contribute to T3.2, studying the resource scheduling problem and its interaction with the underlying parallel processing framework. Finally, FTW will lead T3.3, working on indexing mechanisms and efficient selective queries.
  • A-LBELL will focus on the application of on-line and sequential learning in order to perform traffic prediction at multiple timescales, the design of adaptive learning models, and means to perform cooperative monitoring by interconnecting distributed mPlane components running on individual routers.
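As a minimal illustration of the indexing idea behind T3.3 (selective queries that retrieve a small subset of the repository), the sketch below answers a range query over a sorted key column with binary search instead of a full scan. The records, keys and value labels are invented for the example:

```python
import bisect

# Hypothetical aggregated records, stored sorted by timestamp key.
records = [(t, f"agg-{t}") for t in (10, 20, 30, 40, 50)]
keys = [t for t, _ in records]  # the sorted index over the key column

def range_query(lo, hi):
    # Binary-search the index for the query bounds, then slice:
    # O(log n) lookups instead of scanning the whole repository.
    i = bisect.bisect_left(keys, lo)
    j = bisect.bisect_right(keys, hi)
    return records[i:j]

print(range_query(15, 40))  # [(20, 'agg-20'), (30, 'agg-30'), (40, 'agg-40')]
```

Real indexing schemes for the distributed, column-oriented databases considered in T3.3 are far richer (multi-attribute, partitioned, updated incrementally), but the principle is the same: touch only the fraction of the data a selective query needs.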

Deliverables

  • D3.1 – (M6; editor EURECOM): Basic Network Data Analysis. This deliverable is meant to be Public. In this deliverable we describe and map basic algorithms to perform analytic tasks, which correspond to the different use cases addressed in mPlane and defined in WP1. Essentially, we will define such algorithms as black boxes, focusing on their input and output data, which are essential to start task T3.3.
  • D3.2 – (M12; editor FTW): Database Layer Design (Including External Repositories Selection). This deliverable is meant to be Public. This is a preliminary version (i.e., query performance is not yet optimized) that allows data to be accessed by WP4 and by external users.
  • D3.3 – (M22; editor NEC): Algorithm and Scheduler Design and Implementation. This software deliverable is meant to be Public. It consists of two parts. In part 1, we will detail the algorithms developed in the first year after their definition in D3.1, including their design and performance on a parallel computing testbed. This part is also dedicated to the definition of new basic algorithms that were not devised in D3.1. The second part of the deliverable is dedicated to the design and a preliminary implementation of the job scheduler. In particular, we will use basic techniques to estimate job length, which we will improve in the second release.
  • D3.4 – (M32; editor EURECOM): Final Implementation and Evaluation of the Data Processing and Storage Layer. This software deliverable is meant to be Public. It will include the design (but not the full implementation) of a sophisticated method to infer job duration (e.g., for recurring jobs this will be based on statistical analysis, while for other jobs it will be based on a training phase). Additionally, based on the input provided by WP4, we will design the key ingredients for query optimization, including indexing and data placement. Finally, we will focus on a detailed description of the “design patterns” that emerge from the implementation of several basic algorithms for data analysis.
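As a toy illustration of the statistical estimation of recurring-job duration mentioned for D3.4, the sketch below contrasts the mean and the median of a hypothetical runtime history; a robust statistic such as the median is far less sensitive to occasional outlier runs (all numbers are invented):

```python
import statistics

# Hypothetical runtime history (seconds) for a recurring analysis job,
# including one outlier run on a congested cluster.
history = [58, 61, 340, 59, 62, 60]

mean_est = statistics.mean(history)      # pulled up by the outlier
median_est = statistics.median(history)  # robust to the outlier

print(round(mean_est, 1), median_est)  # 106.7 60.5
```

A size-based scheduler fed the mean would treat this routine job as twice as long as it usually is; feeding it the median keeps the estimate close to the typical run, which is why robust statistics are a natural starting point for recurring jobs.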