MapReduce, Globus, BOINC

Readings

Ian Foster and Carl Kesselman, "Computational Grids" (1998)

Ian Foster, "Globus Toolkit Version 4: Software for Service-Oriented Systems" (2006)

David P. Anderson, "BOINC: A System for Public-Resource Computing and Storage" (2004)

Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters" (2004): http://homeostasis.scs.carleton.ca/~soma/distos/2008-03-24/mapreduce-osdi04.pdf

Paper mentioned in class:

Krste Asanović, et al., "The Landscape of Parallel Computing Research: A View from Berkeley" (2006): http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf

Notes

Globus

  • Only in release 4 did they implement Web services.
  • It's an API.
  • With Globus you build applications on top of an existing framework. It is more like an interface to your application than something your application uses internally.
  • Seems programmer friendly, though possibly unwieldy and too complex.
    • Arguably the state of modern programming.
      • Using a complex set of APIs, not actually just a simple new language.
      • Just a new API to learn; Globus is this way too.
  • Is this ok? Is this enough? Should we be expecting more from such a network?
    • Some systems based their environment on the POSIX API, making the transition very easy.
    • There are a LOT of API calls required for this system; why not a simpler API?
  • What was NOT in this paper?
    • No example code
    • No comparison (even to previous versions!)
    • No evaluation/metrics/performance
    • Was this a marketing document?
  • Side reports?
    • AWFUL!
    • Wait a second… using XML in a grid computing environment? How SLOWWWWWWW
  • Brought together by the Globus Alliance
    • An effort to provide a standard
    • In essence done by committee… meaning that people aren’t necessarily using it as it is developed, and priorities are skewed to marketable specs rather than performance metrics.

BOINC

  • Premise? The local client on your machine downloads a 'workunit', churns the data, dumps the results and downloads a new 'workunit' (a minimal sketch of this loop follows this list)
  • Why do we care?
    • Entertainment?
    • How is this an OS paradigm? What is it useful for?
      • It isn't really an OS, just a method to have your mass computation done
      • More of a distributed scheduler?
        • Not even that; it's a central scheduler, but for mass computation
      • How many systems have we seen that have accomplished mass computation on millions of uncontrolled computers?
        • ummm... none?
      • As an OS?
        • An OS is something that is created to run programs
        • This is a special case allowing us to run specific programs (BUT IS IT AN OS?)
      • Useful for "embarrassingly parallel programs"
  • Perfect for large scale simulation?
    • But then you need LOTS of communication, and this system does not have interconnects
  • The types of problems that we most care about tend not to be THAT parallel
  • So what would a distributed OS be for?
    • Shared communication!
      • But we don't have much in that space that works well.
  • An OS typically provides a lot of services, together in one package
    • We have been seeing that there are no complete packages, just pieces and parts. Why?
      • Computers are changing too fast? Same *NIX OS, same TCP/IP stack... so more of the same, why no true solution?
      • Communication is unreliable? Yes, but that is also nothing new
  • If people found that distributed file systems were successful, they would be in use all the time, but they aren't. Reason? PERFORMANCE
  • Take away message?
  • We can't handle communication: how do you abstract access to resources when everything is reached through a network?
    • As a result, we have many many specialized solutions for particular workloads.
  • If you are willing to not have communication between nodes, you gain a HUGE amount of computation.
  • The most reliable systems are the ones that forgo communication.
    • The more your system tolerates bad network behaviour, the better it scales.
  • We don't have a general-purpose distributed OS for clusters.
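
To make the work-unit loop described above concrete, here is a minimal Python sketch of a BOINC-style client: fetch a work unit from the central server, churn on it locally, report the result, repeat. The server URL, the JSON payload format, and the compute() function are hypothetical placeholders, not BOINC's actual protocol or API.

# Sketch of the fetch/compute/report loop of a BOINC-style client.
# NOT the real BOINC client: the endpoints, payloads, and compute()
# are stand-ins for whatever a real project would define.
import json
import time
import urllib.request

SERVER = "http://example.org/project"   # hypothetical project server

def compute(workunit):
    """Placeholder for the science code; here it just sums some numbers."""
    return sum(workunit["numbers"])

def main():
    while True:
        # 1. Download a work unit from the central server.
        with urllib.request.urlopen(f"{SERVER}/workunit") as resp:
            workunit = json.load(resp)

        # 2. Churn the data locally; no talking to other clients.
        result = compute(workunit)

        # 3. Upload the result, then loop around and fetch the next unit.
        body = json.dumps({"id": workunit["id"], "result": result}).encode()
        req = urllib.request.Request(f"{SERVER}/result", data=body,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)

        time.sleep(1)   # be polite to the server

if __name__ == "__main__":
    main()

The only communication is client-to-server; clients never talk to each other, which is why the model only fits the "embarrassingly parallel" workloads mentioned above.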

MapReduce

  • The communication happens when you reduce the problem (see the word-count sketch after this list).
    • MapReduce works because there is mapping and there is reducing.
      • There are no side effects (which is what enables all of this).
  • Why is it a good fit for thousands of machines?
    • They first split the work into all these pieces, and if one of them does not reply, they just do it over :)
      • You create the algorithm to fit this model, create these pieces, and you have a combining function.
        • You have to have some back end that keeps track of which work got done. But you don't care if a machine fails in the middle of the computation.
  • Compare MapReduce to POSIX
    • The difference is in efficiency. MapReduce is an extension to POSIX.
      • Distributed OSs try to run programs that were written against other APIs. The systems that actually work are the relaxed ones.
        • Here is the model: lose compatibility, gain scalability.
  • Side effects: you can't redo and undo them. This is the functional programming model.
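
A minimal, single-machine sketch of the map/reduce model discussed above, using the usual word-count example. This is only an illustration of the programming model, not Google's implementation: the sequential driver stands in for the distributed back end that keeps track of which pieces of work got done.

# Word count in the MapReduce style.  map_fn and reduce_fn are pure
# functions with no side effects, so any piece of work that is lost
# (a machine that never replies) can simply be run again.
from collections import defaultdict

def map_fn(document):
    # Map: emit a (word, 1) pair for every word in one document.
    return [(word, 1) for word in document.split()]

def reduce_fn(word, counts):
    # Reduce: combine all values for one key into a single result.
    return word, sum(counts)

def map_reduce(documents):
    # Map phase: each document is an independent piece of work.
    intermediate = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            intermediate[key].append(value)
    # Shuffle/reduce phase: grouping values by key is where the
    # communication happens in the distributed version.
    return dict(reduce_fn(k, v) for k, v in intermediate.items())

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog jumps", "the fox"]
    print(map_reduce(docs))   # {'the': 3, 'fox': 2, 'quick': 1, ...}

Because the map and reduce functions have no side effects, re-running a failed piece gives the same answer, which is the fault-tolerance argument above: give up some compatibility and generality, gain scalability.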