This is an illustrative set of Spark jobs, implemented in Scala, that can be run on an HDFS repository storing JSON files. The jobs compute summary statistics (e.g., mean, median, standard deviation) and perform basic analysis on raw data from browsing sessions recorded by the Firelog probe, for the purpose of web QoE analysis.
- Job 1: Calculate the mean, median, standard deviation, and top 5 values of the Page Load Time across all stored sessions.
- Job 2: Calculate the correlation between the Page Load Time and other properties (e.g., page size, HTTP times, and so on).
- Job 3: Find the most contacted servers, excluding the DNS-resolved one, for all stored sessions.
- Job 4: For all probes, find the longest common path in traceroutes towards the same destination.
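As a rough illustration of the statistics behind Job 1 and Job 2, the sketch below computes them on plain Scala collections. It is not the code of the actual jobs: the object name `PageLoadStats` and its function names are hypothetical, the standard deviation is assumed to be the population one, the correlation is assumed to be Pearson's coefficient, and reading the Page Load Time values out of the JSON files is left out.

```scala
// Hypothetical helper, not part of the original jobs: the same statistics
// computed locally on in-memory sequences of Page Load Time values.
object PageLoadStats {

  // Arithmetic mean; the input must be non-empty.
  def mean(xs: Seq[Double]): Double = {
    require(xs.nonEmpty, "input must not be empty")
    xs.sum / xs.size
  }

  // Median: the middle value, or the average of the two middle values.
  def median(xs: Seq[Double]): Double = {
    require(xs.nonEmpty, "input must not be empty")
    val sorted = xs.sortWith(_ < _)
    val n = sorted.size
    if (n % 2 == 1) sorted(n / 2)
    else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
  }

  // Population standard deviation (divides by n, an assumption of this sketch).
  def stdDev(xs: Seq[Double]): Double = {
    val m = mean(xs)
    math.sqrt(xs.map(x => (x - m) * (x - m)).sum / xs.size)
  }

  // The k largest values in descending order.
  def top(xs: Seq[Double], k: Int = 5): Seq[Double] =
    xs.sortWith(_ > _).take(k)

  // Pearson correlation between two equally sized samples.
  // Returns NaN if either sample has zero variance.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.nonEmpty, "samples must match in size")
    val mx = mean(xs)
    val my = mean(ys)
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    val vx = xs.map(x => (x - mx) * (x - mx)).sum
    val vy = ys.map(y => (y - my) * (y - my)).sum
    cov / math.sqrt(vx * vy)
  }
}
```

In a Spark job the same quantities would be computed over a distributed collection rather than a local `Seq`; for instance, an `RDD[Double]` already offers `stats()` for mean and standard deviation and `top(5)` for the largest values, while the median typically requires sorting the data first.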
For each job, a capability on the repository component has to be registered. For example, for Job 1:
|mplane| runcap page-load-time-stats
|when| = now + 5m
hdfs.root = /path/to/root/json/folder/
ok