
Anomaly detection and root cause analysis in large-scale networks

Use Case Overview

This use case targets the continuous monitoring of large-scale network traffic, aiming to detect and diagnose anomalies that potentially impact a large number of users. It focuses on the most popular web-based services (e.g., YouTube, Facebook, Google services), delivered by complex network infrastructures maintained by omnipresent Over The Top (OTT) content providers and major Content Delivery Networks (CDNs) such as Google, Akamai, Limelight, and SoftLayer. Detecting and diagnosing anomalies in such scenarios is extremely complex because of the number of components and players involved in end-to-end traffic delivery: the content provider, the CDN provider, the intermediate Autonomous Systems (ASes) of the transit Internet Service Providers (ISPs), the access ISP, and the terminals of the end users. This complexity motivates the use of mPlane to improve visibility into the traffic and all the intermediate components. More specifically, diagnosing the detected anomalies requires the coordinated guidance of the mPlane Reasoner, which decides which specific measurements and deeper analyses to perform once an anomalous event is detected. The demo focuses on the specific case of YouTube QoE-based traffic monitoring, detecting and diagnosing real anomalies occurring in the distribution of YouTube videos.

 

Requirements and components

This use case relies on traffic passively monitored at the production PoP network. Traffic is monitored at the flow level, generating a large set of flow statistics for all downlink and uplink traffic. Using the flow filtering and traffic classification capabilities of Tstat, only flows related to YouTube videos are retained for further analysis. The per-flow statistics include flow size, flow duration, average download throughput, video bit rate, server IP, and RTT, among others.
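As a rough illustration of the filtering described above, the following Python sketch keeps only the flows labelled as YouTube and derives two simple quantities from them. The `FlowRecord` type, its field names, and the `"youtube"` label are hypothetical and chosen for this example only; they do not reproduce the actual Tstat log format.

```python
from collections import Counter
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass
class FlowRecord:
    # Hypothetical field names; NOT the actual Tstat log columns.
    service: str        # label assigned by the traffic classifier
    server_ip: str      # IPv4 address of the server
    size_bytes: int     # downloaded bytes
    duration_s: float   # flow duration in seconds
    rtt_ms: float       # round-trip time in milliseconds

    @property
    def throughput_kbps(self) -> float:
        """Average download throughput in kbit/s (0.0 for zero-length flows)."""
        if self.duration_s <= 0:
            return 0.0
        return self.size_bytes * 8 / 1000 / self.duration_s


def keep_youtube(flows):
    """Retain only the flows the classifier labelled as YouTube."""
    return [f for f in flows if f.service == "youtube"]


def flows_per_subnet24(flows):
    """Count flows per /24 server subnetwork."""
    return Counter(
        str(ip_network(f"{f.server_ip}/24", strict=False)) for f in flows
    )


flows = [
    FlowRecord("youtube", "173.194.1.10", 1_000_000, 8.0, 20.0),
    FlowRecord("youtube", "173.194.1.77", 2_000_000, 8.0, 25.0),
    FlowRecord("other", "10.0.0.1", 500, 1.0, 5.0),
]
youtube_flows = keep_youtube(flows)
subnet_counts = flows_per_subnet24(youtube_flows)
```

Grouping servers by /24 prefix is used here because the per-/24 flow count is one of the KPIs monitored in this use case; the addresses in the sample data are placeholders.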

Flows captured at the passive probes are periodically exported to DBStream, which is in charge of running the Anomaly Detection analysis modules. Tstat flow measurements are combined with two other types of measurements: (i) external data coming from geo-localization services such as MaxMind (https://www.maxmind.com) and IP address analysis services such as Team Cymru Community Services (https://www.team-cymru.org), and (ii) inter-AS path performance measurements, generated through the geo-distributed active measurement framework provided by RIPE Atlas. In particular, DisNETPerf (link) is used to continuously run periodic traceroutes from the RIPE Atlas probes (topologically close to the YouTube servers) towards the PoP where the passive probe is deployed, whereas the mPlane RIPE Atlas proxy (link) is used to perform on-demand active measurements when a server IP is found to be involved in an anomaly.

 

Deployment

The demo of this use case requires components from the three layers of mPlane (i.e., the measurement layer, the large-scale data analysis and storage layer, and the advanced analysis and reasoning layer). The following list details the components used at each layer:

  • Probes: at the measurement layer, we consider both passive and active measurements. Passive measurements are performed through a Tstat passive probe attached to the interconnection links toward the Internet of the production PoP Network (i.e., the monitored traffic is real customer traffic). Active measurements are performed through the RIPE Atlas monitoring framework, relying on the mPlane RIPE Atlas proxy to instantiate and coordinate the active measurements.
  • Repositories: the use case demo relies on the DBStream repository to store the passive and active measurements obtained from the aforementioned probes, as well as to run the analysis algorithms that unveil the anomalies targeted by the use case.
  • Analysis Modules, Reasoner and Supervisor: the use case relies on the WP4 anomaly detection analysis module ADTool to detect the targeted traffic anomalies on-line; ADTool is based on the analysis of the empirical distributions of several traffic features. The mpAD_Reasoner is an extended mPlane client which orchestrates, through the mPlane Supervisor, all the tasks needed to automate the detection and diagnosis of anomalies occurring in the distribution of YouTube videos.
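To give an intuition of what detecting anomalies from empirical distributions can look like, the Python sketch below compares the histogram of a traffic feature in the current time window against a reference histogram. The Jensen-Shannon divergence, the bin edges, and the threshold are assumptions made for this example; this page does not specify the statistic or parameters that ADTool actually uses.

```python
import math
from bisect import bisect_right


def histogram(samples, edges):
    """Empirical distribution of samples over fixed bin edges.

    Values outside the edges are clamped into the first or last bin.
    """
    counts = [0] * (len(edges) - 1)
    for x in samples:
        i = min(max(bisect_right(edges, x) - 1, 0), len(counts) - 1)
        counts[i] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 if identical, 1 if disjoint."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def is_shifted(current, reference, edges, threshold=0.1):
    """Flag a window whose distribution departs from the reference.

    The threshold value is a hypothetical choice for this sketch.
    """
    p = histogram(reference, edges)
    q = histogram(current, edges)
    return js_divergence(p, q) > threshold


# Throughput samples in kbit/s (made-up values for illustration).
edges = [0, 500, 1000, 1500, 2000]
reference = [900, 1000, 1100, 1200] * 25
degraded = [100, 150, 200, 250] * 25
```

With these sample values, `is_shifted(degraded, reference, edges)` flags the window because all the mass has moved to the lowest throughput bin, while comparing the reference against itself yields a divergence of zero.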

The interactions with mPlane components and Analysis Modules are achieved through the mPlane Supervisor, using the mPlane RI protocol. Component (1) is the standard mPlane Supervisor. The use case bootstrap requires that all the components and Analysis Modules are registered to the Supervisor and running with a pre-defined configuration. The list of components/modules includes (2) Tstat, (3) DBStream, (4) MATH data-transfer protocol, (5) ADTool, and (6) RIPE Atlas (i.e., DisNETPerf and the mPlane RIPE Atlas proxy).

 

 

 

Use case execution flowchart

 The following list details the steps performed during the continuous monitoring:

  1. all traffic flows are analyzed by Tstat at the vantage point, and those belonging to YouTube are exported into DBStream.
  2. the anomaly detection algorithm runs continuously on the YouTube flows within DBStream, considering as KPIs the per-flow average download throughput (to detect performance issues) and the number of flows served per /24 CDN subnetwork (to detect Google cache selection changes).
  3. when an anomaly is detected as a major shift in the distribution of flow throughput towards lower throughput values, the diagnosis analysis is triggered. The diagnosis analysis is iterated by the Reasoner, following the diagnosis graph presented in D4.2.
  4. the first step is to verify if the detected anomaly is statistically consistent, i.e., that it is not caused by a big drop in the number of samples considered in the empirical distribution computation.
  5. then the analysis verifies if this detected anomaly is actually impacting the QoE of the users, by analyzing the QoE-based KPIs as defined in D4.3.
  6. the first diagnosis event to verify is a major drop in the time series of the empirical entropy of the operating system type of the devices downloading the captured flows. A drop in the empirical entropy would flag a major concentration in the distribution of the OS type of the devices, indicating a possible relation to the OS type.
  7. the second event to verify is the occurrence of performance degradation in the corresponding end-to-end paths carrying the impaired YouTube flows. Events tracked on the time series related to packet re-transmissions, queuing delay, etc. are checked in order to identify path congestion.
  8. if path congestion is identified, then the Reasoner instructs active measurements from geo-distributed probes (i.e., using RIPE Atlas via the corresponding mPlane proxy) to identify the specific AS or sub-path causing the performance degradation.
  9. if no path performance degradation is observed, the analysis checks for events related to load balancing and cache selection modifications in the Google CDN serving the YouTube flows.
  10. if no cache selection modification events are present in the logged events at the specific times of the detected YouTube QoE-based anomalies, the drill-down checks for the occurrence of inter-AS routing changes which might be linked to the detected anomalies.
  11. if cache selection modifications are present, then the analysis focuses on understanding if the newly selected servers are the origin of the problems. To do so, different application-level KPIs are verified on top of the monitored traffic, such as server elaboration times, TCP flags, etc.
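The diagnosis steps above can be sketched as a small decision procedure. The Python example below is a simplified illustration, not the mpAD_Reasoner implementation: the `Evidence` type, its field names, and the returned strings are hypothetical, and the early returns assume that each positive finding ends the walk, which the diagnosis graph in D4.2 may not do. The `empirical_entropy` helper shows how the OS-type entropy of step 6 can be computed from a list of labels.

```python
import math
from collections import Counter
from dataclasses import dataclass


def empirical_entropy(labels):
    """Shannon entropy (bits) of the empirical distribution of labels."""
    counts = Counter(labels)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


@dataclass
class Evidence:
    # Hypothetical flags, one per check in the list above.
    enough_samples: bool          # step 4: statistically consistent
    qoe_impacted: bool            # step 5: QoE-based KPIs degraded
    os_entropy_drop: bool         # step 6: OS-type entropy dropped
    path_congestion: bool         # step 7: retransmissions, queuing delay
    cache_selection_change: bool  # step 9: CDN cache selection changed
    new_servers_degraded: bool    # step 11: application-level KPIs degraded
    routing_change: bool          # step 10: inter-AS routing changed


def diagnose(ev):
    """Walk the checks in the order given by the list above."""
    if not ev.enough_samples:
        return "discard: not statistically consistent"
    if not ev.qoe_impacted:
        return "discard: no QoE impact"
    if ev.os_entropy_drop:
        return "possible relation to device OS type"
    if ev.path_congestion:
        return "path congestion: run active measurements (step 8)"
    if ev.cache_selection_change:
        if ev.new_servers_degraded:
            return "newly selected servers are the likely origin"
        return "cache selection changed, servers look healthy"
    if ev.routing_change:
        return "possible inter-AS routing change"
    return "undetermined"
```

For instance, an `Evidence` instance with `enough_samples`, `qoe_impacted`, and `path_congestion` set and every other flag cleared leads to the path congestion branch, which corresponds to the point where the Reasoner would request active measurements through the RIPE Atlas proxy.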

After running these steps over a long period of time, the demo will present the obtained results through a direct query of the mPlane framework. We expect that some major anomalies will be captured during the time span of the demo traffic monitoring.

 

Use-case purpose

The use case aims to demonstrate that we can detect, and provide troubleshooting support for, real large-scale anomalies occurring at the Internet level on web-based services. In particular, it shows how mPlane can detect anomalies based on its Anomaly Detection modules, using both QoS-based and QoE-based performance metrics, as well as how the iterative analysis performed by the mPlane Reasoner can correlate measurements from multiple probes and vantage points to understand the root causes of the detected anomalies.

The following is a list of the targets to demonstrate:

  • the interaction between passive and active measurements, using a common repository to store and analyze the monitored data;
  • that mPlane Anomaly Detection modules can effectively detect anomalous behaviors related to both QoS-based and QoE-based performance metrics;
  • how the mPlane Anomaly Detection modules can automatically learn and adapt to normal traffic variations to avoid false alarms and perform in a semi-autonomous fashion;
  • how the mPlane Reasoner is capable of instructing new measurements on the fly when some specific events are detected;
  • the integration of external data sources within the mPlane framework, including both external databases and external active monitoring platforms (in particular, we consider RIPE Atlas as a distributed monitoring framework, based on active measurements);
  • how the mPlane can provide elaborated analysis of multiple measurements to potentially diagnose the root causes of the detected problems;
  • how the mPlane can rely on multiple Analysis Modules to enrich the traffic monitoring and analysis process;
  • how the mPlane can store and perform historical data analysis using its repositories to better support the analysis of relevant anomalies;
  • how the mPlane can learn new specific data models from stored (off-line) and streaming (on-line) data through machine learning approaches, particularly using the DBStream repository and analysis framework.

 

How to set up and deploy the use-case

For detailed instructions on the deployment, setup, and demo of the use case, we refer interested readers to the corresponding demonstration guidelines page (link).