DistOS 2015W Session 9

From Soma-notes
Jump to navigation Jump to search

Session 9 is about processing large volumes of data (big data).

BOINC and SETI@home are crowdsourced systems that spread the input data across computers via the internet to have each chunk individually processed and returned back to the sender.

MapReduce and Naiad are more technically challenging systems that not only let nodes process individual chunks of data but also combine and fold them together to allow algorithms to process various aggregate results from the input data set.

BOINC

  • Public Resource Computing Platform
  • Gives scientists the ability to use large amounts of computation resources.
  • The clients do not connect directly with each other but instead they talk to a central server located at Berkley
  • The goals of Boinc are:
  • 1) reduce the barriers of entry
  • 2) Share resources among autonomous projects
  • 3) Support diverse applications
  • 4) Reward participants.
  • 5) Provide screensaver graphics
  • It can run as applications in common language with no modifications
  • A BOINC application can be identified by a single master URL, which serves as the homepage as well as the directory of the servers.
  • Servers perform set of function using:
    • Scheduling servers: handles Remote Procedure Call from clients
    • Data servers:helps to manage the uploads
  • Can only work on data that can be split into many small shards and each shard processed entirely independently. Pure mappings on big data, no larger folding capabilities. Was used for scientific purposes mostly.

SETI@Home

  • Uses public resource computing to analyze radio signals to find extraterrestrial intelligence
  • Need good quality telescope to search for radio signals, and lots of computational power, which was unavailable locally
  • It has not yet found extraterrestrial intelligence, but its has established credibility of public resource computing projects
  • Originally custom, now uses BOINC as a backbone for the project
  • Uses relational database to store information on a large scale, further it uses a multi-threaded server to distribute work to clients
  • Quality of data in this architecture is untrustworthy, the main incentive to use it, however, is that it is a cheap and easy way of scaling the work exponentially.
  • Provided social incentives to encourage users to join the system.
  • This computation model still exists but not in the legitimate world.
  • Formed a good concept of public resource computing and a distributed computing by providing a platform independent framework

MapReduce

  • A programming model presented by Google to do large scale parallel computations
  • Uses the Map() and Reduce() functions from functional style programming languages
  • Map (Filtering)
  • Takes a function and applies it to a bunch of keys to produce values
  • Hides parallelization, fault tolerance, locality optimization and load balancing
  • Reduce (Summary)
  • Accumulates results from the data set using a given function
  • Very easy to use and understand, with many classic problems fitting this pattern
  • Otherwise quite constrained in what exactly can be done
  • Uses hashing to distribute similar keys to similar machines, but otherwise spread the load

Naiad

  • A programming model similar to MapReduce but with streaming capabilities so that data results are almost instantaneous
  • A distributed system for executing data parallel cyclic dataflow programs offering high throughput and low latency
  • Aims to provide a general purpose system which will fulfill the requirements and the will also support wide variety of high level programming models.
  • Highly used for parallel execution of data
  • Provides the functionality of checkpoint and restoring
  • A complex framework that can be the backend for simpler models of computation like LINQ or MapReduce to be built on top of.
  • Real Time Applications:
  • Batch iterative Machine Learning:

VW, an open source distributed machine learning performs iteration in 3 phases: each process updates local state; processes independently training on local data; and process jointly performed global average which is All Reduce.

  • Streaming Acyclic Computation

When compared to a system called Kineograph ( also done by Microsoft ), which processes twitter handles and provides counts of the occurrence of hashtags as well as links between popular tags, was written using Naiad in 26 lines of code and ran close to 2X faster.

  • Naiad paper won the best paper award in SOSP 2013, check-out this link in Microsoft Research website http://research.microsoft.com/en-us/projects/naiad/ . Down in this page you can see some videos that explains naiad including Derek's Murray presentation at SOSP 2013.