Soma-notes - User contributions [en]

MapReduce, Globus, BOINC

2008-03-26T19:50:39Z

Taisia:

==Readings==

[http://homeostasis.scs.carleton.ca/~soma/distos/2008-03-24/foster-grid.pdf Ian Foster and Carl Kesselman, "Computational Grids" (1998)]

[http://homeostasis.scs.carleton.ca/~soma/distos/2008-03-24/foster-globus-intro.pdf Ian Foster, "Globus Toolkit Version 4: Software for Service-Oriented Systems" (2006)]

[http://homeostasis.scs.carleton.ca/~soma/distos/2008-03-24/anderson-boinc.pdf David P. Anderson, "BOINC: A System for Public-Resource Computing and Storage" (2004)]

[http://homeostasis.scs.carleton.ca/~soma/distos/2008-03-24/mapreduce-osdi04.pdf Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters" (2004)]

Paper mentioned in class:

[http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf Krste Asanović et al., "The Landscape of Parallel Computing Research: A View from Berkeley" (2006)]

==Notes==
===Globus===
*Web services were only implemented in release 4.
*It's an API.
*With Globus, you build applications on top of an existing framework. It is more like an interface to your application than something your application will use internally.

*Seems programmer friendly, though possibly unwieldy and too complex.
**Arguably the state of modern programming.
***Using a complex set of APIs, not actually just a simple new language.
***Just a new API to learn; Globus is this way too.

*Is this ok? Is this enough? Should we be expecting more from such a network?
**Some systems based their environment on the POSIX API – making the transition very easy.
**There are a LOT of API calls required for this system. Why not a simpler API?

*What was NOT in this paper?
**No example code
**No comparison (even to previous versions!)
**No evaluation/metrics/performance
**Was this a marketing document?

*Side reports?
**AWFUL!
**Wait a second… using XML in a grid computing environment? How SLOWWWWWWW

*Brought together by the Globus Alliance
**An effort to provide a standard
**In essence done by committee… meaning that people aren’t necessarily using it as it is developed, and priorities are skewed to marketable specs rather than performance metrics.

===BOINC===
*Premise? Local client on your machine downloads a 'workunit', churns the data, dumps the results and downloads a new 'workunit'
*Why do we care?
**Entertainment?
**How is this an OS paradigm? What is it useful for?
***It isn't really an OS, just a method to have your mass computation done
***More of a distributed scheduler?
****Not even: the scheduler is central, it is only the computation that is massive
***How many systems have we seen that have accomplished mass computation on millions of uncontrolled computers?
****ummm... none?
***As an OS?
****An OS is something that is created to run programs
****This is a special case allowing us to run specific programs (BUT IS IT AN OS?)
***Useful for "embarrassingly parallel" programs
*Perfect for large scale simulation?
**But then you need LOTS of communication, and this system does not have interconnects
*The type of problems that we most care about tend not to be THAT parallel
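
The workunit cycle described above can be sketched in a few lines. This is only an illustration of the pattern, not BOINC's actual client or server API: <code>Scheduler</code>, <code>client_step</code>, <code>compute</code> and <code>run_demo</code> are hypothetical names, and the "science" is a stand-in sum of squares. Note that the clients never talk to each other, only to the central scheduler.

```python
from collections import deque


class Scheduler:
    """Hypothetical central scheduler: hands out workunits, collects results."""

    def __init__(self, workunits):
        self.pending = deque(workunits)  # (workunit_id, data) pairs
        self.results = {}                # workunit_id -> (result, client name)

    def fetch(self):
        # Hand out the next workunit, or None when nothing is left.
        return self.pending.popleft() if self.pending else None

    def report(self, workunit_id, result, client):
        self.results[workunit_id] = (result, client)


def compute(data):
    # Stand-in for the real computation: sum of squares over one chunk.
    return sum(x * x for x in data)


def client_step(scheduler, name):
    """One cycle: download a workunit, churn the data, dump the result."""
    unit = scheduler.fetch()
    if unit is None:
        return False
    workunit_id, data = unit
    scheduler.report(workunit_id, compute(data), name)
    return True


def run_demo(num_clients=3, num_units=7):
    units = [(i, list(range(i * 10, (i + 1) * 10))) for i in range(num_units)]
    scheduler = Scheduler(units)
    clients = [f"client-{n}" for n in range(num_clients)]
    busy = True
    while busy:
        # Round-robin stands in for independent machines polling the server.
        busy = any([client_step(scheduler, name) for name in clients])
    return scheduler.results
```

Because each workunit is independent, adding clients needs no coordination beyond the scheduler; that is also why the model only suits embarrassingly parallel problems.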

*So what would a distributed OS be for?
**Shared communication!
***But we don't have much in the way that works well.
*An OS typically provides a lot of services, together in one package
**We have been seeing that there are no complete packages, just pieces and parts. Why?
***Computers are changing too fast? Same *NIX OS, same TCP/IP stack... so more of the same, why no true solution?
***Communication is unreliable? Yes, but that is also nothing new

*If people found that distributed file systems were successful, they would be in use all the time, but they aren't. Reason? PERFORMANCE

*Takeaway message?
*Can't handle communication - how do you abstract access to resources when driven through a network?
**As a result, we have many many specialized solutions for particular workloads.
*If you are willing to not have communication between nodes, you gain a HUGE amount of computation.

*The most reliable systems are the ones that forget about communication.
**The more your system tolerates bad stuff happening on the network, the better it scales.

*We don't have a general cluster distributed OS.

===MapReduce===
*The communication happens when you reduce the problem.
**MapReduce works because there is mapping and there is reducing.
***There are no side effects (enabling things).
*Why is it a good fit for thousands of machines?
**They first had all these pieces, and if one of them does not reply, then they just do it over :)
***You design the algorithm to fit this model: you create these pieces and you provide a combining function.
****You have to have some back end that keeps track of who got work done. But you don't care if any machine fails in the middle of the computation.
*Compare MapReduce to POSIX
**The difference is in efficiency. MapReduce is an extension to POSIX.
***Distributed OSs try to run programs written for different APIs. The systems that work are the relaxed ones.
****Here is the model: lose compatibility to gain scalability.
*With side effects, you can't redo and undo. Hence the functional programming model.
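
A minimal sketch of the idea, assuming a word-count job: it is not Google's implementation, and <code>run_task</code>, <code>map_reduce</code>, the failure rate and the retry limit are all invented for illustration. Because the map and reduce functions have no side effects, a task whose worker never replies can simply be run again, and the final answer is unaffected.

```python
import random
from collections import defaultdict


def map_words(chunk):
    # Map: emit (word, 1) for every word; no side effects.
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_counts(word, counts):
    # Reduce: the combining function for all values emitted under one key.
    return sum(counts)


def run_task(func, args, rng, failure_rate, max_attempts=50):
    """Back end for one task: if the simulated worker fails, do it over."""
    for _ in range(max_attempts):
        if rng.random() < failure_rate:
            continue  # simulated worker that never replied
        return func(*args)
    raise RuntimeError("task failed on every attempt")


def map_reduce(chunks, failure_rate=0.3, seed=0):
    rng = random.Random(seed)
    grouped = defaultdict(list)
    # Map phase: one independent task per input piece.
    for chunk in chunks:
        for key, value in run_task(map_words, (chunk,), rng, failure_rate):
            grouped[key].append(value)  # group intermediate values by key
    # Reduce phase: one independent task per key.
    return {key: run_task(reduce_counts, (key, values), rng, failure_rate)
            for key, values in grouped.items()}
```

Calling <code>map_reduce(["the cat", "the dog the end"])</code> gives the same counts whether or not any simulated worker fails, which is the property the notes attribute to the absence of side effects.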

Distributed OS Overview

2008-01-13T21:07:02Z

Taisia:

== Distributed Operating Systems ==

[[Image:OS4000_Distributed.png|A distributed operating system.]]

At what level do you want to start the distribution?
'''Hardware Layer''' 
If memory is shared, communication is trivial, i.e. parallel computers (multi-core).

'''Kernel Layer''' 
Distributing at the kernel layer will avoid API and user space changes. So why don't we share at the lower kernel layer? The main reasons are security and performance. Programs assume that memory is fast (low latency), yet distributing at this layer means sharing memory across every computer in the distribution, which presents a challenge. So why isn't virtual memory enough? Typically because of contention over memory. Virtual memory shared this way is slow: duplicating memory pages and synchronizing them across the network takes too much time.

'''Process Layer''' 
At the process layer, we can perform the distribution over the network using TCP/IP. Unfortunately, TCP/IP has high latency. We can use Ethernet to reduce the latency, but that is only a viable solution for LAN-based systems. Therefore, latency will always be present.
We can deal with latency by caching a local copy of data, which effectively reduces the amount of communication required. The downside is that caching introduces a need for synchronization.
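
The trade-off between caching and synchronization can be made concrete with a small sketch. Everything here is hypothetical: <code>RemoteStore</code> stands in for a remote node whose every call counts as one network round trip, and <code>CachingClient</code> uses a simple version number to check freshness, which is only one of many possible synchronization schemes.

```python
class RemoteStore:
    """Hypothetical remote node; each method call counts as a round trip."""

    def __init__(self):
        self._data = {}       # key -> (version, value)
        self.round_trips = 0

    def write(self, key, value):
        self.round_trips += 1
        version = self._data.get(key, (0, None))[0] + 1
        self._data[key] = (version, value)

    def read(self, key):
        self.round_trips += 1
        return self._data[key]

    def version(self, key):
        # Validation request: returns only the version number.
        self.round_trips += 1
        return self._data[key][0]


class CachingClient:
    """Keeps local copies; optionally validates them against the store."""

    def __init__(self, store):
        self.store = store
        self._cache = {}      # key -> (version, value)

    def read(self, key, validate=False):
        cached = self._cache.get(key)
        if cached is not None:
            # Without validation: no communication, but possibly stale data.
            if not validate or self.store.version(key) == cached[0]:
                return cached[1]
        # Miss or stale copy: fetch the current version from the store.
        self._cache[key] = self.store.read(key)
        return self._cache[key][1]
```

An unvalidated read costs no round trips but keeps returning the old value after someone else writes; a validated read is always current but pays for communication again, which is the synchronization cost mentioned above.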

We can also deal with latency by compressing the data and splitting up the computation "wisely". Computers are not wise enough to do this effectively, so it is left up to the programmers. Programmers split up such computations by introducing client/server architectures, using web applications and web services, and using distributed file systems. For example, spam uses Internet resources to communicate by email with large numbers of users. Spam is a feature of the Internet: everyone should be able to send an email to everyone, and spam uses that resource. Distributed operating systems on the scale of the Internet, capable of wise resource management, do not yet exist.
 
== What Makes a Good Distributed Operating System? ==

A good distributed OS should provide:
* Reliable, dynamically scalable storage.
* More processing power (CPU that scales linearly).
* Manageability (it should be as easy to manage as a single computer).
* An easy way to write programs.
* Support for single sign-on.
* A single system image.
* Reliability (fault tolerance to software and hardware errors).
* Dynamic reconfiguration.
* The ability to exploit local resources.

File:OS4000 Distributed.png

2008-01-13T19:29:28Z

Taisia: