


The goal of this course is to provide a comprehensive view of recent topics and trends in distributed systems and cloud computing. We will discuss the software techniques used to build and program reliable, highly scalable systems, with a particular focus on data-intensive computing systems.

Specifically, the course covers the MapReduce programming model, its connection to relational algebra, and high-level programming models that build on MapReduce. It also delves into the details of the underlying execution framework that supports and executes parallel MapReduce programs, including distributed file systems and the Hadoop implementation. The course is complemented by a series of practical, hands-on exercises run on a small-scale cluster, in which students learn the tools needed to program in MapReduce and Pig. The exercises are drawn from real-world case studies such as network-traffic analysis.
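To give a flavour of the programming model before the first session, here is a minimal sketch of WordCount written as plain Python functions. It is not Hadoop code and does not use any Hadoop API: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical, and the whole job runs in a single process. The sketch only mirrors the three logical phases (map, shuffle, reduce) that a real framework would distribute across a cluster.

```python
from collections import defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key.

    In a real framework this grouping happens across machines;
    here it is a single in-memory dictionary (an assumption of this sketch).
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce phase: sum the partial counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    grouped = shuffle(pairs)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "the fox"]
    print(word_count(sample))
```

Because the map and reduce functions are independent of one another and carry no shared state, the same logic can in principle be split over many workers, which is the property the laboratory sessions explore on the cluster.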



Wednesday 18th

Introduction to MapReduce:

  • Motivation
  • Programming models
  • Examples

Laboratory Session:

  • The basics
  • Using HDFS
  • WordCount


Thursday 19th

Hadoop Internals:

  • HDFS Design
  • Scheduling
  • Fault Tolerance
  • I/O

Laboratory Session:

  • Design patterns
  • Joins
  • Distributed Cache


Friday 20th

High-level Languages:

  • Relational Algebra
  • Pig
  • Pig Latin

Laboratory Session:

  • The basics
  • Network data analysis