DistOS 2015W Session 9

Anderson et al., "SETI@home: An Experiment in Public-Resource Computing" (CACM 2002)
Anderson, "BOINC: A System for Public-Resource Computing and Storage" (Grid Computing 2004)
Dean & Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters" (OSDI 2004)
Murray et al., "Naiad: A Timely Dataflow System" (SOSP 2013)

Session 9 is about processing large volumes of data (big data).

BOINC and SETI@home are crowdsourced (public-resource) computing systems: they spread chunks of the input data across volunteers' computers via the internet, have each chunk processed independently, and have the results returned to the sender.

MapReduce and Naiad are more technically sophisticated systems: nodes not only process individual chunks of data but also combine and fold the partial results together, so that algorithms can compute aggregate results over the whole input data set.

BOINC

Public Resource Computing Platform
Gives scientists the ability to use large amounts of computational resources.
The clients do not connect directly with each other; instead they talk to a project's central servers (SETI@home's are located at UC Berkeley)
The goals of BOINC are:

1) Reduce the barriers of entry
2) Share resources among autonomous projects
3) Support diverse applications
4) Reward participants
5) Provide screensaver graphics

Existing applications written in common languages can run as BOINC applications with little or no modification
A BOINC project is identified by a single master URL, which serves as the project's home page as well as a directory of its scheduling servers.
A project's servers perform a set of functions using:
- Scheduling servers: handle Remote Procedure Calls (RPCs) from clients
- Data servers: handle file uploads

Can only work on data that can be split into many small shards, with each shard processed entirely independently. This gives pure mappings over big data, with no larger folding (aggregation) capabilities. It has mostly been used for scientific purposes.
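
To make the "independent shards" restriction concrete, here is a minimal sketch in Python. It is not BOINC code: the function names, the shard size, and the per-shard analysis (taking the peak value) are all made up for illustration, and local threads stand in for volunteer machines.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_shards(data, shard_size):
    """Cut the input into fixed-size, non-overlapping shards (work units)."""
    return [data[i:i + shard_size] for i in range(0, len(data), shard_size)]

def process_shard(shard):
    """Hypothetical per-shard analysis: here, just the peak value in the shard."""
    return max(shard)

def run_project(data, shard_size, workers=4):
    shards = split_into_shards(data, shard_size)
    # Each shard is processed on its own; no worker needs another worker's data.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input shards.
        return list(pool.map(process_shard, shards))

results = run_project([3, 1, 4, 1, 5, 9, 2, 6, 5, 3], shard_size=4)
print(results)
```

Note that the output is one result per shard. Anything that needs to combine results across shards has to happen afterwards on the server side, which is the limitation that MapReduce-style systems address.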

SETI@Home

Uses public resource computing to analyze radio signals to find extraterrestrial intelligence
Needs a good-quality radio telescope to search for signals, and lots of computational power, which was unavailable locally
It has not yet found extraterrestrial intelligence, but it has established the credibility of public resource computing projects
Originally custom, now uses BOINC as a backbone for the project
Uses a relational database to store information on a large scale, and a multi-threaded server to distribute work to clients
Results returned by volunteer machines cannot be fully trusted. The main incentive to use this architecture, however, is that it is a cheap and easy way of scaling the work massively.
Provided social incentives to encourage users to join the system.
This computation model also still exists outside the legitimate world (e.g. botnets that harness other people's machines).
Established a good model of public resource computing and distributed computing by providing a platform-independent framework
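
Since results from volunteer machines are untrustworthy, the SETI@home paper describes using redundant computation: the same work unit is sent to several clients and their results are compared. Below is a minimal Python sketch of that idea. The `validate` function, the quorum of 2, and the result strings are assumptions for illustration, not the project's actual validator.

```python
from collections import Counter

def validate(results, quorum=2):
    """Return the agreed result if at least `quorum` replicas match, else None."""
    if not results:
        return None
    value, votes = Counter(results).most_common(1)[0]
    return value if votes >= quorum else None

# Three volunteers return a result for the same work unit; one is faulty.
accepted = validate(["peak-A", "peak-A", "garbage"])
# No two volunteers agree, so nothing is accepted and the unit would be reissued.
rejected = validate(["x", "y", "z"])
print(accepted, rejected)
```

The trade-off is that every work unit costs at least `quorum` times the computation, which is acceptable when volunteer cycles are effectively free.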

MapReduce

A programming model presented by Google for large-scale parallel computations
Uses the Map() and Reduce() functions found in functional-style programming languages

Map (Filtering)

Applies a user-supplied function to each input key/value pair to produce a set of intermediate key/value pairs

The framework hides parallelization, fault tolerance, locality optimization and load balancing from the programmer

Reduce (Summary)

Accumulates (merges) all intermediate values that share the same key using a user-supplied function
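
The two functions can be illustrated with the classic word-count example. This is a single-process Python sketch under stated assumptions: `map_reduce` is a hypothetical toy driver that groups intermediate values by key in memory, whereas the real system distributes both phases across a cluster.

```python
from collections import defaultdict

def map_fn(doc_name, text):
    """Map: emit an intermediate (word, 1) pair for every word in the document."""
    for word in text.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: fold all values emitted for one key into a single total."""
    return word, sum(counts)

def map_reduce(inputs, map_fn, reduce_fn):
    """Toy driver: run map, group intermediate values by key, then run reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

docs = [("a.txt", "the quick fox"), ("b.txt", "The lazy dog the end")]
counts = map_reduce(docs, map_fn, reduce_fn)
print(counts)
```

The user only writes `map_fn` and `reduce_fn`; everything inside the driver is what the framework takes care of.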

Very easy to use and understand, with many classic problems fitting this pattern
Otherwise quite constrained in what exactly can be done
Uses hashing (by default hash(key) mod R, where R is the number of reduce tasks) to send identical intermediate keys to the same machine while otherwise spreading the load
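
A rough Python sketch of the hash partitioning idea follows. The paper's default partitioning function is hash(key) mod R; the use of CRC32 as the hash and the names `partition` and `R` here are assumptions chosen so the example is deterministic across runs (Python's built-in `hash` on strings is randomized per process).

```python
import zlib

def partition(key, num_reducers):
    """Pick the reduce task for a key: a stable hash of the key, mod R."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

R = 4  # assumed number of reduce tasks
keys = ["apple", "banana", "apple", "cherry", "banana"]

# Group keys by the reducer they are routed to.
assignment = {}
for k in keys:
    assignment.setdefault(partition(k, R), set()).add(k)
print(assignment)
```

Every occurrence of the same key lands on the same reducer, which is what lets Reduce see all values for that key, while different keys are scattered roughly evenly.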

Naiad

A programming model similar to MapReduce but with streaming capabilities, so that results are available with low latency
A distributed system for executing data-parallel, cyclic dataflow programs, offering both high throughput and low latency
Aims to provide a general-purpose system that fulfills these requirements and also supports a wide variety of high-level programming models.
Well suited to data-parallel execution
Provides checkpoint and restore functionality
A complex framework that can serve as the backend on top of which simpler models of computation like LINQ or MapReduce are built.
Real Time Applications:

Batch iterative Machine Learning:

Vowpal Wabbit (VW), an open-source distributed machine learning library, performs each iteration in 3 phases: each process updates its local state; processes independently train on local data; and processes jointly perform a global average, which is an AllReduce.
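
The three-phase iteration can be sketched in plain Python. This is neither VW nor Naiad code: the model (fitting a single weight w in y = w * x), the learning rate, the data partitions, and the in-process `all_reduce_average` are all assumptions standing in for real workers and a real AllReduce.

```python
def local_train(w, data, lr=0.1):
    """Phase 2: one pass of gradient descent on this process's local data."""
    for x, y in data:
        w -= lr * 2 * (w * x - y) * x
    return w

def all_reduce_average(values):
    """Phase 3: every process ends up holding the global average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)

# Each inner list is the local data of one (simulated) process; true w is 2.
partitions = [[(1.0, 2.0)], [(1.0, 2.0), (2.0, 4.0)], [(0.5, 1.0)]]
weights = [0.0] * len(partitions)  # Phase 1 state: each process's current weight

for _ in range(50):
    weights = [local_train(w, part) for w, part in zip(weights, partitions)]
    weights = all_reduce_average(weights)  # becomes next iteration's local state

print(weights[0])
```

The loop over iterations is the kind of cycle that an acyclic batch system has to re-launch job by job, and that Naiad can express as a cyclic dataflow.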

Streaming Acyclic Computation

Kineograph (also done by Microsoft) is a system that processes a stream of tweets and provides counts of the occurrence of hashtags as well as links between popular tags. For comparison, an equivalent application was written using Naiad in 26 lines of code and ran close to 2X faster.
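
To give a feel for the hashtag-counting part of this workload, here is a small Python sketch of incremental counting over epochs of input. It is not the 26-line Naiad program and does not use Naiad's API (Naiad is a .NET system); the `HashtagCounter` class and its `on_epoch` method are hypothetical names for illustration.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")

class HashtagCounter:
    """Keeps running hashtag totals and reports only what changed per epoch."""

    def __init__(self):
        self.totals = Counter()

    def on_epoch(self, tweets):
        """Consume one epoch (batch) of tweets; return updated counts only."""
        delta = Counter()
        for tweet in tweets:
            delta.update(tag.lower() for tag in HASHTAG.findall(tweet))
        self.totals.update(delta)
        return {tag: self.totals[tag] for tag in delta}

counter = HashtagCounter()
first = counter.on_epoch(["Loving #SOSP talks", "#naiad at #sosp"])
second = counter.on_epoch(["more #Naiad"])
print(first, second)
```

Each epoch only touches the tags that actually appeared in it, which is the incremental behaviour that keeps latency low on a continuous stream.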

The Naiad paper won a best paper award at SOSP 2013. See the project page on the Microsoft Research website: http://research.microsoft.com/en-us/projects/naiad/ . Further down that page there are some videos that explain Naiad, including Derek Murray's presentation at SOSP 2013.