DistOS 2015W Session 5: Difference between revisions
Formatted and added to the GFS section. |
|||
(7 intermediate revisions by 3 users not shown) | |||
Line 40: | Line 40: | ||
The goals of this system were: | The goals of this system were: | ||
# To | # To build a distributed system that can be centrally administered. | ||
# | # To be cost effective using cheap, modern microcomputers. | ||
The distribution itself is transparent to most programs. This | The distribution itself is transparent to most programs. This is made possible by two properties: | ||
# A per process group namespace. | # A per process group namespace. | ||
# Uniform access to most resources by representing them as a file. | # Uniform access to most resources by representing them as a file. | ||
Line 50: | Line 50: | ||
The commands, libraries and system calls are similar to that of Unix and therefore a casual user cannot distinguish between these two. The problems in UNIX were too deep to fix but still the various ideas were brought along. The problems addressed badly by UNIX were improved. Old tools were dropped and others were polished and reused. | The commands, libraries and system calls are similar to that of Unix and therefore a casual user cannot distinguish between these two. The problems in UNIX were too deep to fix but still the various ideas were brought along. The problems addressed badly by UNIX were improved. Old tools were dropped and others were polished and reused. | ||
== Similaritieis with the UNIX == | |||
<ul> | |||
<li>shell</li> | |||
<li>Various C compilers</li> | |||
</ul> | |||
== Unique Features == | == Unique Features == | ||
Line 71: | Line 78: | ||
* There is also '''unbind''' which undoes the effects of the other two calls. | * There is also '''unbind''' which undoes the effects of the other two calls. | ||
Namespaces in Plan 9 are on a per-process basis. While everything had a way reference resources with a unique name, using mount and bind | Namespaces in Plan 9 are on a per-process basis. While everything had a way reference resources with a unique name, using mount and bind every process could build a custom namespace as they saw fit. | ||
Since most resources are in the form of files (and folders), the term ''namespace'' really only refers to the filesystem layout. | Since most resources are in the form of files (and folders), the term ''namespace'' really only refers to the filesystem layout. | ||
=== Parallel Programming === | === Parallel Programming === | ||
Parallel programming was supported in two ways: | |||
* Kernel provides simple process model and carefully designed system calls for synchronization. | * Kernel provides simple process model and carefully designed system calls for synchronization. | ||
* Programming language supports concurrent programming. | * Programming language supports concurrent programming. | ||
Line 82: | Line 89: | ||
== Legacy == | == Legacy == | ||
Even though Plan 9 is no longer developed, the good ideas from the system still exist today. For example, the ''/proc'' virtual filesystem which displays current process information in the form of files exists in | Even though Plan 9 is no longer developed, the good ideas from the system still exist today. For example, the ''/proc'' virtual filesystem which displays current process information in the form of files exists in modern Linux kernels. | ||
= Google File System = | = Google File System = | ||
Line 90: | Line 96: | ||
Unlike most filesystems, GFS must be implemented by individual applications and is not part of the kernel. While this introduces some technical overhead, it gives the system more freedom to implement or not implement certain non-standard features. | Unlike most filesystems, GFS must be implemented by individual applications and is not part of the kernel. While this introduces some technical overhead, it gives the system more freedom to implement or not implement certain non-standard features. | ||
Link to an explanation on how GFS works | |||
[http://computer.howstuffworks.com/internet/basics/google-file-system1.htm] | |||
== Architecture == | == Architecture == | ||
Line 100: | Line 109: | ||
Master and Chunk server communication consists of | Master and Chunk server communication consists of | ||
# checking whether there | # checking whether there any chunk-server is down | ||
# checking if any file is corrupted | # checking if any file is corrupted | ||
# deleting stale chunks | # deleting stale chunks | ||
When a client | When a client wants to do some operations on the chunks | ||
# it first asks the master server for the list of servers that store the parts of a file it wants to access | # it first asks the master server for the list of servers that store the parts of a file it wants to access | ||
# it receives a list of chunk servers, with multiple servers for each chunk | # it receives a list of chunk servers, with multiple servers for each chunk | ||
# it finally communicates with the the chunk servers to perform the operation | # it finally communicates with the the chunk servers to perform the operation | ||
The system is geared towards appends and sequential reads. This is why the master server responds with multiple server addresses for each chunk - the client can then request a small piece from each server, increasing the data throughput linearly with the number of servers. Writes, in general, are in the form of a special ''append'' system call. When appending, there is no chance that two clients will want to write to the same location at the same time. This helps avoid any potential synchronization issues. If there are multiple appends to the same file at the same time, the chunk servers are free to order them as they wish (chunks on each server are not guaranteed to be byte-for-byte identical). While a problem in the general sense, this is good enough for Google's needs. | The system is geared towards appends and sequential reads. This is why the master server responds with multiple server addresses for each chunk - the client can then request a small piece from each server, increasing the data throughput linearly with the number of servers. Writes, in general, are in the form of a special ''append'' system call. When appending, there is no chance that two clients will want to write to the same location at the same time. This helps avoid any potential synchronization issues. If there are multiple appends to the same file at the same time, the chunk servers are free to order them as they wish (chunks on each server are not guaranteed to be byte-for-byte identical). Changes may also be applied multiple times. These issues are left for the application using GFS to resolve themselves. While a problem in the general sense, this is good enough for Google's needs. | ||
== Redundancy == | == Redundancy == | ||
Line 122: | Line 131: | ||
For efficiency, there is only a single live master server at a time. While not making the system completely distributed, it avoid many synchronization problems and suits Google's needs. At any point in time, there are multiple read-only master servers that copy metadata from the currently live master. Should it go down, they will serve any read operations from the clients, until one of the hot spares is promoted to being the new live master server. | For efficiency, there is only a single live master server at a time. While not making the system completely distributed, it avoid many synchronization problems and suits Google's needs. At any point in time, there are multiple read-only master servers that copy metadata from the currently live master. Should it go down, they will serve any read operations from the clients, until one of the hot spares is promoted to being the new live master server. | ||
=== Server:Stateless === | |||
*the servers does not store states about clients | |||
*no caching at client either | |||
**since most program only cares about the output | |||
**if client wants up-to-date result, rerun the program | |||
*use heartbeat messages to monitor servers | |||
**good for system with assumption that changes (or failures) are often |
Latest revision as of 10:03, 20 April 2015
The Clouds Distributed Operating System
It is a distributed OS running on a set of computers that are interconnected by a group of network. It basically unifies different computers into a single component.
The OS is based on 2 patterns:
- Message Based OS
- Object Based OS
Object Thread Model
The structure of this is based on object thread model. It has set of objects which are defined by the class. Objects respond to messages. Sending message to object causes object to execute the method and then reply back.
The system has active objects and passive objects.
- Active objects are the objects which have one or more processes associated with them and further they can communicate with the external environment.
- Passive objects are those that currently do not have an active thread executing in them.
The content of the Clouds data is long lived. Since the memory is implemented as a single-level store, the data exists forever and can survive system crashes and shut downs.
Threads
The threads are the logical path of execution that traverse objects and executes code in them. The Clouds thread is not bound to a single address space. Several threads can enter an object simultaneously and execute concurrently. The nature of the Clouds object prohibits a thread from accessing any data outside the current address space in which it is executing.
Interaction Between Objects and Threads
- Inter object interfaces are procedural
- Invocations work across machine boundaries
- Objects in clouds unify concept of persistent storage and memory to create address space, thus making the programming simpler.
- Control flow achieved by threads invoking objects.
Clouds Environment
- Integrates set of homogeneous machines into one seamless environment
- There are three logical categories of machines- Compute Server, User Workstation and Data server.
Plan 9
Plan 9 is a general purpose, multi-user and mobile computing environment physically distributed across machines. Development of the system began in the late 1980s. The system was built at Bell Labs - the birth place of Unix. The original Unix OS had no support for networking, and there were many attempts over the years by others to create distributed systems with Unix compatibility. Plan 9, however, is distributed systems done following the original Unix philosophy.
The goals of this system were:
- To build a distributed system that can be centrally administered.
- To be cost effective using cheap, modern microcomputers.
The distribution itself is transparent to most programs. This is made possible by two properties:
- A per process group namespace.
- Uniform access to most resources by representing them as a file.
Unix Compatibility
The commands, libraries and system calls are similar to that of Unix and therefore a casual user cannot distinguish between these two. The problems in UNIX were too deep to fix but still the various ideas were brought along. The problems addressed badly by UNIX were improved. Old tools were dropped and others were polished and reused.
Similaritieis with the UNIX
- shell
- Various C compilers
Unique Features
What actually distinguishes Plan 9 is its organization. Plan 9 is divided along the lines of service function.
- CPU services and terminals use same kernel.
- Users may choose to run programs locally or remotely on CPU servers.
- It lets the user choose whether they want a distributed or centralized system.
The design of Plan 9 is based on 3 principles:
- Resources are named and accessed like files in hierarchical file system.
- Standard protocol 9P.
- Disjoint hierarchical provided by different services are joined together into single private hierarchical file name space.
Virtual Namespaces
In a virtual namespace, a user boots a terminal or connects to a CPU server and then a new process group is created. Processes in group can either add to or rearrange their name space using two system calls - mount and bind.
- Mount is used to attach new file system to a point in name space.
- Bind is used to attach a kernel resident (existing, mounted) file system to name space and also arrange pieces of name space.
- There is also unbind which undoes the effects of the other two calls.
Namespaces in Plan 9 are on a per-process basis. While everything had a way reference resources with a unique name, using mount and bind every process could build a custom namespace as they saw fit.
Since most resources are in the form of files (and folders), the term namespace really only refers to the filesystem layout.
Parallel Programming
Parallel programming was supported in two ways:
- Kernel provides simple process model and carefully designed system calls for synchronization.
- Programming language supports concurrent programming.
Legacy
Even though Plan 9 is no longer developed, the good ideas from the system still exist today. For example, the /proc virtual filesystem which displays current process information in the form of files exists in modern Linux kernels.
Google File System
It is scalable, distributed file system for large, data intensive applications. It is crafted to Google's unique needs as a search engine company.
Unlike most filesystems, GFS must be implemented by individual applications and is not part of the kernel. While this introduces some technical overhead, it gives the system more freedom to implement or not implement certain non-standard features.
Link to an explanation on how GFS works [1]
Architecture
The architecture of the Google file system consists of a single master, multiple chunk-servers and multiple clients. Chunk servers are used to store the data in uniformly sized chunks. Each chunk is identified by globally unique 64 bit handle assigned by master at the time of creation. The chunks are split into 64KB blocks, each with its own hashsum for data integrity checks. The chunks are replicated between servers, 3 by default. The master maintains all the file system metadata which includes the namespace and chunk location.
Each of the chunks is 64 MB large (contrast this to the typical filesystem sectors of 512 or 4096 bytes) as the system is meant to hold enormous amount of data - namely the internet. The large chunk size is also important for the scalability of the system - the larger the chunk size, the less metadata the master server has to store for any given amount of data. With the current size, the master server is able to store the entirety of the metadata in memory, increasing performance by a significant margin.
Operation
Master and Chunk server communication consists of
- checking whether there any chunk-server is down
- checking if any file is corrupted
- deleting stale chunks
When a client wants to do some operations on the chunks
- it first asks the master server for the list of servers that store the parts of a file it wants to access
- it receives a list of chunk servers, with multiple servers for each chunk
- it finally communicates with the the chunk servers to perform the operation
The system is geared towards appends and sequential reads. This is why the master server responds with multiple server addresses for each chunk - the client can then request a small piece from each server, increasing the data throughput linearly with the number of servers. Writes, in general, are in the form of a special append system call. When appending, there is no chance that two clients will want to write to the same location at the same time. This helps avoid any potential synchronization issues. If there are multiple appends to the same file at the same time, the chunk servers are free to order them as they wish (chunks on each server are not guaranteed to be byte-for-byte identical). Changes may also be applied multiple times. These issues are left for the application using GFS to resolve themselves. While a problem in the general sense, this is good enough for Google's needs.
Redundancy
GFS is built with failure in mind. The system expects that at any time, there is some server or disk that is malfunctioning. The system deals with the failures as follows.
Chunk Servers
By default, chunks are replicated to three servers. This exact number depends on the application in question doing the write. When a chunk server finds that some of its data is corrupt, it grabs the data from other servers to repair itselfTemplate:Citation needed.
Master Server
For efficiency, there is only a single live master server at a time. While not making the system completely distributed, it avoid many synchronization problems and suits Google's needs. At any point in time, there are multiple read-only master servers that copy metadata from the currently live master. Should it go down, they will serve any read operations from the clients, until one of the hot spares is promoted to being the new live master server.
Server:Stateless
- the servers does not store states about clients
- no caching at client either
- since most program only cares about the output
- if client wants up-to-date result, rerun the program
- use heartbeat messages to monitor servers
- good for system with assumption that changes (or failures) are often