T3.3 Access to Analytic and External Data

Lead: FTW

Network data analysis results output by T3.1, that is, the results of mining the data collected by mPlane, need to be accessible to WP4 as well as to “external users”. This task deals with a substantially different workload than that of T3.1, in which delay-tolerant, bulk data analysis takes place. Instead, T3.3 targets low-latency access to the data produced by T3.1, which calls for distributed data stores supporting selective queries. For example, consider the results of a simple parallel job that computes the average daily latency from every end-host (a user machine) to every service (e.g., YouTube) monitored by mPlane. Using the interface exposed by this task, more sophisticated algorithms designed in WP4 can fetch only the data relative to a subset of the services (only YouTube) and to a selected subset of the users (only those who experience latencies larger than 300 ms) for subsequent elaboration.
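
As a rough illustration of such a selective query, the sketch below assumes an HBase-like columnar store accessed through the happybase Python client; the table name (daily_latency), column name (m:avg_latency_ms), row-key scheme (service|host) and server name are our own placeholders, not choices mandated by the task.

    import happybase  # assumes an HBase-like store reachable over Thrift

    connection = happybase.Connection('mplane-repo.example.org')
    table = connection.table('daily_latency')

    # Row keys are assumed to be "<service>|<host>", so a prefix scan
    # touches only the rows of one service (here: YouTube), and the
    # 'columns' argument restricts the read to a single attribute.
    slow_users = {}
    for row_key, cells in table.scan(row_prefix=b'youtube|',
                                     columns=[b'm:avg_latency_ms']):
        latency = float(cells[b'm:avg_latency_ms'])
        if latency > 300.0:  # keep only users above the 300 ms threshold
            host = row_key.split(b'|', 1)[1].decode()
            slow_users[host] = latency

The row-key prefix and the column restriction together ensure that only the relevant fraction of the stored results is read, which is precisely the access pattern this task is designed to serve.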

Analytic data output by task 3.1 is thus stored in such a storage system: in this project we work on columnar, distributed databases, which act as a sink for processed data. The distributed database system can be co-located with the batch processing system, or it can be deployed on a separate cluster of machines. In both cases, we adopt a parallel approach to bulk data inserts, which improves data transfer performance and relies on standard interfaces for communication between the two systems.
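
A minimal sketch of the per-worker insert path follows, under the same assumptions as above (happybase client, hypothetical table and column names); each parallel task of the batch job streams its own output partition through a client-side batch, so that many puts are grouped into few round trips.

    import happybase

    def store_partition(rows, host='mplane-repo.example.org'):
        # Runs inside one task of the batch job: each task bulk-inserts
        # its own share of the results, so inserts proceed in parallel.
        connection = happybase.Connection(host)
        table = connection.table('daily_latency')
        # Client-side batching groups many puts into few RPCs, which is
        # where the bulk-insert speedup comes from.
        with table.batch(batch_size=1000) as batch:
            for row_key, avg_latency_ms in rows:
                batch.put(row_key,
                          {b'm:avg_latency_ms': str(avg_latency_ms).encode()})
        connection.close()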

In the mPlane scenario given by the interaction of WP3 and WP4, it is important to store data according to a columnar layout for performance reasons: indeed, selective queries originating from WP4 will touch only a few of the attributes computed for each entity in the system.
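
To make the point concrete, the following sketch (again using the happybase client, with illustrative names) creates a table whose attributes are split into two column families, so that a query needing only the latency metrics never reads the bulkier raw counters:

    import happybase

    connection = happybase.Connection('mplane-repo.example.org')
    # One column family per group of attributes: scans restricted to the
    # metrics family ('m') leave the raw-counter family ('c') untouched.
    connection.create_table('daily_latency', {'m': {}, 'c': {}})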

In this context, we will work on problems related to data partitioning and placement, which are key to achieving good performance in serving results. Additionally, creating indexes (which, in contrast to traditional databases, do not come for free in the systems we consider for mPlane) to speed up the look-up procedure for aggregated data is another prominent problem that we address in this task. It is worth noting that the parallel computing framework used in task 3.1 is an ideal candidate to compute such indexes, which reinforces the need for an advanced scheduling algorithm (see task 3.2) to handle such job diversity.
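
As a toy sketch of such an index-building job, the map and reduce functions below derive a secondary index from (row key, latency) records; the 100 ms bucketing scheme is our own illustrative choice, and in practice the two phases would run on the parallel framework of task 3.1 rather than on a single machine.

    from collections import defaultdict

    def map_phase(records):
        # Emit (index key, row key) pairs; the index key is a latency bucket.
        for row_key, latency_ms in records:
            yield (int(latency_ms) // 100, row_key)  # 100 ms wide buckets

    def reduce_phase(pairs):
        # Group row keys per bucket; each group becomes one row of the
        # secondary index table, bulk-inserted like the primary data.
        index = defaultdict(list)
        for bucket, row_key in pairs:
            index[bucket].append(row_key)
        return index

With such an index, the earlier “latencies larger than 300 ms” selection reduces to reading the few index rows for buckets 3 and above, instead of scanning the whole result table.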

Another aspect addressed by this task deals with providing access to external (3rd party) data. This problem can be approached from different angles: indeed, some repositories (e.g., CAIDA) provide APIs to query their measurement data directly from external clients; others, instead, only allow the bulk download of selected portions or the totality of their data. For the first type of external sources, WP4 can directly issue queries and fetch relevant information that is not available in mPlane. For the second kind of external repositories, instead, and depending on the use cases defined in WP1, our objective is to populate the database system developed in this task at regular time intervals. In practice, given a user-defined periodicity, we develop a software tool similar in nature to a web crawler that fetches external data and stores it in the mPlane repository, such that it is readily available for queries issued by the tools developed in WP4.
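
A minimal sketch of that periodic fetcher follows, using only the Python standard library; the URL is a placeholder, and the store callable stands for the parsing and bulk-insert path sketched earlier, whose details depend on the format of each external repository.

    import time
    import urllib.request

    def crawl_external(url, period_seconds, store):
        # Fetch an external repository dump at a user-defined periodicity
        # and hand the payload to the mPlane repository insert path.
        while True:
            with urllib.request.urlopen(url) as response:
                payload = response.read()
            store(payload)  # parse + bulk-insert; format is repository-specific
            time.sleep(period_seconds)

    # Example: refresh a (hypothetical) bulk-download dataset once per day.
    # crawl_external('https://example.org/dataset.csv', 24 * 3600, my_store)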