DistOS 2015W Session 9: Difference between revisions

Latest revision as of 01:49, 13 April 2015

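Anderson et al., "SETI@home: An Experiment in Public-Resource Computing" (CACM 2002) (DOI: http://dx.doi.org/10.1145/581571.581573) (Proxy: http://dl.acm.org.proxy.library.carleton.ca/citation.cfm?id=581573)
Anderson, "BOINC: A System for Public-Resource Computing and Storage" (Grid Computing 2004) (DOI: http://dx.doi.org/10.1109/GRID.2004.14) (Proxy: http://ieeexplore.ieee.org.proxy.library.carleton.ca/stamp/stamp.jsp?tp=&arnumber=1382809)
Dean & Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters" (OSDI 2004) (http://research.google.com/archive/mapreduce.html)
Murray et al., "Naiad: a timely dataflow system" (SOSP 2013) (http://dl.acm.org/citation.cfm?doid=2517349.2522738)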

Session 9 is about processing large volumes of data (big data).

BOINC and SETI@home are crowdsourced systems that distribute chunks of the input data to volunteers' computers over the internet; each chunk is processed independently and its result is returned to the project's servers.

MapReduce and Naiad are more technically challenging systems: nodes not only process individual chunks of data but also combine and fold the partial results together, so that algorithms can compute aggregate results over the whole input data set.

BOINC

Public Resource Computing Platform
Gives scientists the ability to use large amounts of computation resources.
Clients do not connect directly with each other; instead, each client talks to the servers of the project it is attached to (for SETI@home, these are located at Berkeley)
The goals of BOINC are:

1) Reduce the barriers of entry
2) Share resources among autonomous projects
3) Support diverse applications
4) Reward participants.
5) Provide screensaver graphics

Existing applications written in common languages can run as BOINC applications with little or no modification
A BOINC project is identified by a single master URL, which serves as its homepage as well as the directory of its scheduling servers.
A project's servers are divided by function:
- Scheduling servers: handle Remote Procedure Calls (RPCs) from clients
- Data servers: manage file uploads

Can only work on data that can be split into many small shards, each of which is processed entirely independently. It offers pure mappings over big data, with no larger folding (aggregation) capabilities. It has mostly been used for scientific purposes.
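The "pure mapping" restriction above can be illustrated with a minimal sketch. This is not BOINC's API: the names `split_into_work_units`, `process_work_unit`, and `run_project` are hypothetical, and a local thread pool stands in for volunteer machines. The point is only that each shard is handled without looking at any other shard, and nothing folds the results together.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_work_units(data, unit_size):
    """Cut the input into fixed-size shards (hypothetical helper)."""
    return [data[i:i + unit_size] for i in range(0, len(data), unit_size)]


def process_work_unit(unit):
    """Stand-in for the science code: report the strongest sample in a shard."""
    return max(unit)


def run_project(data, unit_size, workers=4):
    """Process every shard independently; results come back in shard order."""
    units = split_into_work_units(data, unit_size)
    # The thread pool is an assumption standing in for remote volunteers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_work_unit, units))
```

Because no shard depends on another, any shard could be handed to any machine at any time, which is what makes this style of work suitable for volunteers who come and go.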

SETI@Home

Uses public resource computing to analyze radio signals to find extraterrestrial intelligence
Needs a good-quality radio telescope to search for signals, plus a great deal of computational power, which was unavailable locally
It has not yet found extraterrestrial intelligence, but it has established the credibility of public resource computing projects
Originally a custom system; it now uses BOINC as the backbone for the project
Uses a relational database to store information on a large scale, and a multi-threaded server to distribute work to clients
Results returned in this architecture cannot be trusted on their own; the main incentive to use it, however, is that it is a cheap and easy way of scaling the work out to a very large number of machines.
Provided social incentives to encourage users to join the system.
This computation model is still in use, but mostly outside the legitimate world.
Established a workable concept of public resource computing and distributed computing by providing a platform-independent framework
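Since results from volunteer machines are untrustworthy, one common response is redundant computation: send the same work unit to several clients and accept a result only when enough of them agree. The sketch below is an illustration under assumptions, not the project's actual validator; the function name `validate_work_unit` and the `quorum` parameter are hypothetical, and results are compared for exact equality (real numerical results would likely need a tolerance).

```python
from collections import Counter


def validate_work_unit(results, quorum=2):
    """Return the agreed result if at least `quorum` replicas match, else None.

    `results` holds the values reported by different clients for one work unit.
    Exact equality is an assumption made to keep the sketch short.
    """
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count >= quorum else None
```

For example, `validate_work_unit([42, 42, 17])` accepts 42 because two replicas agree, while three mutually different answers are rejected and the work unit would have to be reissued.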

MapReduce

A programming model presented by Google to do large scale parallel computations
Uses the Map() and Reduce() functions from functional style programming languages

Map (Filtering)

Applies a user-supplied function to each input key/value pair to produce a set of intermediate key/value pairs

The runtime hides parallelization, fault tolerance, locality optimization and load balancing from the programmer

Reduce (Summary)

Merges all intermediate values that share the same key using a user-supplied function

Very easy to use and understand, with many classic problems fitting this pattern
Otherwise quite constrained in what exactly can be done
Hashes intermediate keys so that all values for the same key reach the same reduce task, while otherwise spreading the load evenly across machines
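The map, hash-partitioned shuffle, and reduce steps described above can be sketched with the classic word-count problem. This is a single-process illustration, not Google's implementation: `map_fn`, `reduce_fn`, `partition`, and `map_reduce` are hypothetical names, and `zlib.crc32` is simply one stable hash chosen for the example.

```python
import zlib
from collections import defaultdict


def map_fn(doc_name, text):
    """Map: emit an intermediate (word, 1) pair for every word in a document."""
    for word in text.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: fold all counts emitted for one word into a total."""
    return word, sum(counts)


def partition(key, num_reducers):
    """Route a key to a reducer; equal keys always land on the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_reduce(documents, num_reducers=3):
    # Shuffle: one bucket per (simulated) reduce task, grouping values by key.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            buckets[partition(key, num_reducers)][key].append(value)
    # Each bucket could be reduced on a different machine; here we loop.
    output = {}
    for bucket in buckets:
        for key in sorted(bucket):
            word, total = reduce_fn(key, bucket[key])
            output[word] = total
    return output
```

The user writes only `map_fn` and `reduce_fn`; everything inside `map_reduce` is what a real framework would run in parallel and make fault tolerant.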

Naiad

A system whose programming model is similar to MapReduce but with streaming capabilities, so that results are produced with low latency as data arrives
A distributed system for executing data-parallel, cyclic dataflow programs, offering high throughput and low latency
Aims to be a general-purpose system that meets the requirements of batch, streaming and iterative workloads and also supports a wide variety of high-level programming models.
Designed for data-parallel execution
Provides checkpoint and restore functionality
A complex framework that can serve as the backend on top of which simpler models of computation, like LINQ or MapReduce, are built.
Real Time Applications:

Batch iterative Machine Learning:

Vowpal Wabbit (VW), an open-source distributed machine learning library, performs each iteration in three phases: each process updates its local state; the processes train independently on their local data; and the processes jointly perform a global average, which is an AllReduce.
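The three phases can be sketched in a few lines. This is neither VW nor Naiad code: the workers are simulated in one process, the model is a single weight fitted to y = w * x by stochastic gradient descent, and the names `all_reduce_mean`, `local_train`, and `train` are hypothetical.

```python
def all_reduce_mean(values):
    """AllReduce with averaging: every worker ends up holding the global mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def local_train(w, shard, lr):
    """One pass of SGD on squared error for the model y = w * x."""
    for x, y in shard:
        w -= lr * 2 * (w * x - y) * x
    return w


def train(shards, iterations=50, lr=0.05):
    """Run the three-phase iteration over simulated workers, one per shard."""
    averaged = [0.0] * len(shards)
    for _ in range(iterations):
        # Phase 1: each worker updates its local state from the last average.
        weights = list(averaged)
        # Phase 2: workers train independently on their own shard.
        weights = [local_train(w, s, lr) for w, s in zip(weights, shards)]
        # Phase 3: workers jointly compute the global average (AllReduce).
        averaged = all_reduce_mean(weights)
    return averaged[0]
```

The loop in `train` is the cycle in the dataflow graph: the output of the AllReduce feeds the next iteration, which is the kind of computation an acyclic model like MapReduce cannot express within a single job.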

Streaming Acyclic Computation

Kineograph (also from Microsoft) is a system that processes Twitter data and provides counts of the occurrence of hashtags as well as links between popular tags. An equivalent application written using Naiad took 26 lines of code and ran close to 2x faster.
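To make the hashtag workload concrete, here is a minimal single-process sketch of incremental counting over batches of tweets. It does not use Naiad's API and is not the 26-line program from the paper; the class `HashtagCounter`, its `on_epoch` method, and the choice to report the top three tags are all assumptions made for illustration.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


class HashtagCounter:
    """Keeps running hashtag counts and tag co-occurrence links across batches."""

    def __init__(self):
        self.counts = Counter()  # hashtag -> number of tweets mentioning it
        self.pairs = Counter()   # (tag_a, tag_b) -> tweets containing both

    def on_epoch(self, tweets):
        """Fold one batch of tweets into the state and return the top tags."""
        for text in tweets:
            # Deduplicate within a tweet and normalise case.
            tags = sorted({t.lower() for t in HASHTAG.findall(text)})
            self.counts.update(tags)
            for i, a in enumerate(tags):
                for b in tags[i + 1:]:
                    self.pairs[(a, b)] += 1
        return self.counts.most_common(3)
```

Each call to `on_epoch` only touches the new batch, so results are refreshed as data arrives instead of recomputing over the whole history.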

The Naiad paper won a best paper award at SOSP 2013. See the project page on the Microsoft Research website: http://research.microsoft.com/en-us/projects/naiad/ . Further down that page there are videos that explain Naiad, including Derek Murray's presentation at SOSP 2013.
