<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://homeostasis.scs.carleton.ca/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Cdelahou</id>
	<title>Soma-notes - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://homeostasis.scs.carleton.ca/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Cdelahou"/>
	<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php/Special:Contributions/Cdelahou"/>
	<updated>2026-04-24T11:59:54Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.42.1</generator>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_8&amp;diff=19045</id>
		<title>DistOS 2014W Lecture 8</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_8&amp;diff=19045"/>
		<updated>2014-04-22T18:58:22Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: /* Class Discussion: */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==NFS and AFS (Jan 30)==&lt;br /&gt;
&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/2008-02-11/sandberg-nfs.pdf Russel Sandberg et al., &amp;quot;Design and Implementation of the Sun Network Filesystem&amp;quot; (1985)]&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/2008-02-11/howard-afs.pdf John H. Howard et al., &amp;quot;Scale and Performance in a Distributed File System&amp;quot; (1988)]&lt;br /&gt;
&lt;br /&gt;
==NFS==&lt;br /&gt;
Group 1:&lt;br /&gt;
&lt;br /&gt;
1) generates network traffic on every operation.&lt;br /&gt;
&lt;br /&gt;
2) RPC-based: easy to program with, but a very [http://www.joelonsoftware.com/articles/LeakyAbstractions.html leaky abstraction].&lt;br /&gt;
&lt;br /&gt;
3) unreliable&lt;br /&gt;
&lt;br /&gt;
Group 2:&lt;br /&gt;
&lt;br /&gt;
1) designed to share disks over a network, not files&lt;br /&gt;
&lt;br /&gt;
2) more UNIX-like: they tried to maintain UNIX file semantics on both the client and the server side.&lt;br /&gt;
&lt;br /&gt;
3) portable. It was meant to work (as a server) across many FS types.&lt;br /&gt;
&lt;br /&gt;
4) used UDP: if a request is dropped, just send it again.&lt;br /&gt;
&lt;br /&gt;
5) it does not minimize network traffic.&lt;br /&gt;
&lt;br /&gt;
6) used the vnode/VFS layer as a transparent interface to local disks.&lt;br /&gt;
&lt;br /&gt;
7) did not require much hardware&lt;br /&gt;
&lt;br /&gt;
8) later versions took on features of AFS&lt;br /&gt;
&lt;br /&gt;
9) stateless protocol conflicts with files being stateful by nature.&lt;br /&gt;
&lt;br /&gt;
Group 3:&lt;br /&gt;
&lt;br /&gt;
1) cache assumption invalid.&lt;br /&gt;
&lt;br /&gt;
2) no dedicated locking mechanism. They couldn&#039;t decide which locking strategy to use, so they left it to the users of NFS to run their own separate locking service.&lt;br /&gt;
&lt;br /&gt;
3) bad security&lt;br /&gt;
&lt;br /&gt;
Other:&lt;br /&gt;
* Client mounts full FS. No common namespace.&lt;br /&gt;
* Hostname lookup and address binding happens at mount&lt;br /&gt;
&lt;br /&gt;
==AFS==&lt;br /&gt;
&lt;br /&gt;
Group 1&lt;br /&gt;
&lt;br /&gt;
1) designed for 5,000 to 10,000 clients&lt;br /&gt;
&lt;br /&gt;
2) high integrity.&lt;br /&gt;
&lt;br /&gt;
Group 2&lt;br /&gt;
&lt;br /&gt;
1) designed to share files over a network, not disks. It is one FS.&lt;br /&gt;
&lt;br /&gt;
2) better scalability&lt;br /&gt;
&lt;br /&gt;
3) better security (Kerberos).&lt;br /&gt;
&lt;br /&gt;
4) minimize network traffic.&lt;br /&gt;
&lt;br /&gt;
5) less UNIX-like&lt;br /&gt;
&lt;br /&gt;
6) plugin authentication&lt;br /&gt;
&lt;br /&gt;
7) needs more kernel storage due to complex commands&lt;br /&gt;
&lt;br /&gt;
8) inode concept replaced with fid&lt;br /&gt;
&lt;br /&gt;
Group 3&lt;br /&gt;
&lt;br /&gt;
1) cache assumption valid&lt;br /&gt;
&lt;br /&gt;
2) locking&lt;br /&gt;
&lt;br /&gt;
3) good security.&lt;br /&gt;
&lt;br /&gt;
Other:&lt;br /&gt;
* Caches full files locally on open. Writes modifications back on close.&lt;br /&gt;
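A minimal sketch of this whole-file open/close caching model (all names are illustrative, not the real AFS interfaces):&lt;br /&gt;

```python
# Hypothetical sketch of AFS-style whole-file caching; class and method
# names are illustrative, not the real AFS interfaces.

class Server:
    def __init__(self):
        self.files = {}      # maps filename to the bytes stored server-side

    def fetch(self, name):
        return self.files.get(name, b"")

    def store(self, name, data):
        self.files[name] = data


class WholeFileCache:
    """Client side: open() pulls the whole file, close() pushes it back."""

    def __init__(self, server):
        self.server = server
        self.local = {}      # maps filename to the locally cached copy

    def open(self, name):
        # One round trip: the entire file is copied to the client.
        self.local[name] = self.server.fetch(name)

    def write(self, name, data):
        # Edits touch only the cached copy; no network traffic.
        self.local[name] = data

    def close(self, name):
        # Acts like a commit: changes become visible to others only now.
        self.server.store(name, self.local.pop(name))
```

The key property is that the network is touched only on open and close, never on individual reads and writes.&lt;br /&gt;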
&lt;br /&gt;
==Class Discussion:== &lt;br /&gt;
&lt;br /&gt;
Capturing some of Anil&#039;s Observations about NFS and AFS: &lt;br /&gt;
* The reason NFS shares at the file level rather than the block level is that block-level sharing is complicated from an implementation point of view. &lt;br /&gt;
* NFS uses UDP as the transport protocol since UDP, being stateless, is in line with the NFS design philosophy of not maintaining state information. &lt;br /&gt;
* Security and unreliability issues in NFS are a consequence of using RPC. &lt;br /&gt;
** RPC is a nice programming model, but it was not designed for networks, where flakiness is inherent: from a programming point of view, you never expect a function call to fail (or simply never return) because of a communication error. &lt;br /&gt;
* AFS&#039;s designers considered the network a bottleneck and tried to reduce chatter over the network by caching.&lt;br /&gt;
** &#039;open&#039; and &#039;close&#039; operations in AFS were critical&lt;br /&gt;
** the &#039;close&#039; operation is as important as a &#039;commit&#039; operation in a well-designed database system. &lt;br /&gt;
* The security model of AFS is interesting: rather than the UNIX access-list-based approach, AFS used a single sign-on system based on Kerberos. &lt;br /&gt;
** a cool thing about Kerberos is the idea of using tickets to gain access.&lt;br /&gt;
* Despite having better features than NFS, AFS was not widely adopted. The reason was that its administrative mechanism was complex: it required highly trained/skilled people and took many days&#039; effort to set up and maintain.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_21&amp;diff=19033</id>
		<title>DistOS 2014W Lecture 21</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_21&amp;diff=19033"/>
		<updated>2014-04-20T02:57:39Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: /* Naiad */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Presentation ==&lt;br /&gt;
&lt;br /&gt;
=== Marking ===&lt;br /&gt;
&lt;br /&gt;
* marked mostly on presentation, not content&lt;br /&gt;
* basically we want to communicate the basic structure of the paper, and do so in a way that isn&#039;t boring&lt;br /&gt;
&lt;br /&gt;
=== Content ===&lt;br /&gt;
&lt;br /&gt;
* concrete, not &amp;quot;head in the clouds&amp;quot;&lt;br /&gt;
* present the area&lt;br /&gt;
* compare and contrast the papers&lt;br /&gt;
* 10 minutes talk, 5 minutes feedback&lt;br /&gt;
* basic argument&lt;br /&gt;
* basic references&lt;br /&gt;
&lt;br /&gt;
=== Form ===&lt;br /&gt;
&lt;br /&gt;
* show the work we&#039;ve done on paper&lt;br /&gt;
* try to get feedback&lt;br /&gt;
* think of it as a rough draft&lt;br /&gt;
* try to get people to read the paper&lt;br /&gt;
* enthusiasm&lt;br /&gt;
* powerpoints are easier&lt;br /&gt;
* don&#039;t read slides&lt;br /&gt;
* no whole sentences on slides&lt;br /&gt;
* look at talks by Mark Shuttleworth&lt;br /&gt;
&lt;br /&gt;
== MapReduce ==&lt;br /&gt;
&lt;br /&gt;
A clever observation that a simple solution could solve most distributed problems.  It&#039;s all about programming to an abstraction that is efficiently parallelizable.  Note that it&#039;s not actually a simple solution, because it sits atop a mountain of code.  It requires something like BigTable which requires something like GFS, which requires something like Chubby. Despite this, it allows programmers to easily do distributed computation using a simple framework that hides the messy details of parallelization.&lt;br /&gt;
&lt;br /&gt;
* Restricted programming model&lt;br /&gt;
* Interestingly, large-scale problems can be implemented with this&lt;br /&gt;
* Easy to program, powerful for certain classes of problems, it scales well.&lt;br /&gt;
* The MapReduce job model is VERY limited though. You can&#039;t do things like simulations.&lt;br /&gt;
* MapReduce is problem specific. &lt;br /&gt;
** Naiad is less problem specific and allows you to do more.&lt;br /&gt;
&lt;br /&gt;
Programming to an abstraction that is efficiently parallel. Until now we have learned all about infrastructure. &lt;br /&gt;
Classic OS abstractions were about files; now we use programming abstractions.&lt;br /&gt;
&lt;br /&gt;
Example: word frequency in a document.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== How does it work? ===&lt;br /&gt;
&lt;br /&gt;
* Two steps, Map and Reduce. The user writes these.&lt;br /&gt;
** Map takes a single input key-value pair (e.g. a named document) and converts it to an intermediate (k,v) representation: a list of new key-value pairs.&lt;br /&gt;
** Reduce: Take the intermediate representation and merge the values.&lt;br /&gt;
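The word-frequency example can be sketched as a toy, single-process MapReduce (the function names are made up; the real framework distributes these calls across machines):&lt;br /&gt;

```python
# Toy word-frequency job in the MapReduce style: single-process, just to
# show the shape of the user-supplied functions. The real framework
# shards these calls across many machines.

def map_fn(doc_name, text):
    # Emit an intermediate (word, 1) pair for every word in the document.
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # Merge all intermediate values for one key into a single total.
    return word, sum(counts)

def map_reduce(documents):
    intermediate = {}
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            intermediate.setdefault(key, []).append(value)
    return dict(reduce_fn(k, v) for k, v in intermediate.items())
```

The grouping of intermediate values by key is exactly the shuffle step the framework performs between the two user-written functions.&lt;br /&gt;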
&lt;br /&gt;
=== Implementation ===&lt;br /&gt;
&lt;br /&gt;
* Uses commodity HW and GFS.&lt;br /&gt;
* Master/Slave relationship amongst machines. Master delegates tasks to slaves.&lt;br /&gt;
* Intermediate representation saved as files.&lt;br /&gt;
* Many MapReduce jobs can happen in sequence.&lt;br /&gt;
&lt;br /&gt;
== Naiad ==&lt;br /&gt;
&lt;br /&gt;
Where MapReduce was suited for a specific family of solutions, Naiad tries to generalize the solution to apply parallelization to a much wider family.  Naiad supports MapReduce style solutions, but also many other solutions.  However, the tradeoff was simplicity.  It&#039;s like we took MapReduce and took away its low barrier to entry.  The idea is to create a constrained graph that can easily be parallelized.&lt;br /&gt;
&lt;br /&gt;
* More complicated than MapReduce&lt;br /&gt;
* Talks about Timely dataflow graphs &lt;br /&gt;
* It&#039;s all about graph algorithms - a graph abstraction&lt;br /&gt;
* Restrictions on graphs so that they can be mapped to parallel computation&lt;br /&gt;
* How to fit anything to this model is a big question. &lt;br /&gt;
* More general than MapReduce.&lt;br /&gt;
&lt;br /&gt;
* After reading the MapReduce paper, you could easily write a MapReduce job. After reading the Naiad paper, you can&#039;t: Naiad is super complicated. &lt;br /&gt;
* Their model is super complicated. It doesn&#039;t minimize our cognitive load.&lt;br /&gt;
* Doesn&#039;t scale well: after about 40 nodes there is no improvement in performance, while MapReduce can scale to thousands of nodes.&lt;br /&gt;
* Nobody wants to use it because the abstraction is complicated.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_21&amp;diff=19032</id>
		<title>DistOS 2014W Lecture 21</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_21&amp;diff=19032"/>
		<updated>2014-04-20T02:50:40Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Niad&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Presentation ==&lt;br /&gt;
&lt;br /&gt;
=== Marking ===&lt;br /&gt;
&lt;br /&gt;
* marked mostly on presentation, not content&lt;br /&gt;
* basically we want to communicate the basic structure of the paper, and do so in a way that isn&#039;t boring&lt;br /&gt;
&lt;br /&gt;
=== Content ===&lt;br /&gt;
&lt;br /&gt;
* concrete, not &amp;quot;head in the clouds&amp;quot;&lt;br /&gt;
* present the area&lt;br /&gt;
* compare and contrast the papers&lt;br /&gt;
* 10 minutes talk, 5 minutes feedback&lt;br /&gt;
* basic argument&lt;br /&gt;
* basic references&lt;br /&gt;
&lt;br /&gt;
=== Form ===&lt;br /&gt;
&lt;br /&gt;
* show the work we&#039;ve done on paper&lt;br /&gt;
* try to get feedback&lt;br /&gt;
* think of it as a rough draft&lt;br /&gt;
* try to get people to read the paper&lt;br /&gt;
* enthusiasm&lt;br /&gt;
* powerpoints are easier&lt;br /&gt;
* don&#039;t read slides&lt;br /&gt;
* no whole sentences on slides&lt;br /&gt;
* look at talks by Mark Shuttleworth&lt;br /&gt;
&lt;br /&gt;
== MapReduce ==&lt;br /&gt;
&lt;br /&gt;
A clever observation that a simple solution could solve most distributed problems.  It&#039;s all about programming to an abstraction that is efficiently parallelizable.  Note that it&#039;s not actually a simple solution, because it sits atop a mountain of code.  It requires something like BigTable which requires something like GFS, which requires something like Chubby. Despite this, it allows programmers to easily do distributed computation using a simple framework that hides the messy details of parallelization.&lt;br /&gt;
&lt;br /&gt;
* Restricted programming model&lt;br /&gt;
* Interestingly, large-scale problems can be implemented with this&lt;br /&gt;
* Easy to program, powerful for certain classes of problems, it scales well.&lt;br /&gt;
* The MapReduce job model is VERY limited though. You can&#039;t do things like simulations.&lt;br /&gt;
* MapReduce is problem specific. &lt;br /&gt;
** Naiad is less problem specific and allows you to do more.&lt;br /&gt;
&lt;br /&gt;
Programming to an abstraction that is efficiently parallel. Until now we have learned all about infrastructure. &lt;br /&gt;
Classic OS abstractions were about files; now we use programming abstractions.&lt;br /&gt;
&lt;br /&gt;
Example: word frequency in a document.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== How does it work? ===&lt;br /&gt;
&lt;br /&gt;
* Two steps, Map and Reduce. The user writes these.&lt;br /&gt;
** Map takes a single input key-value pair (e.g. a named document) and converts it to an intermediate (k,v) representation: a list of new key-value pairs.&lt;br /&gt;
** Reduce: Take the intermediate representation and merge the values.&lt;br /&gt;
&lt;br /&gt;
=== Implementation ===&lt;br /&gt;
&lt;br /&gt;
* Uses commodity HW and GFS.&lt;br /&gt;
* Master/Slave relationship amongst machines. Master delegates tasks to slaves.&lt;br /&gt;
* Intermediate representation saved as files.&lt;br /&gt;
* Many MapReduce jobs can happen in sequence.&lt;br /&gt;
&lt;br /&gt;
== Naiad ==&lt;br /&gt;
&lt;br /&gt;
Where MapReduce was suited for a specific family of solutions, Naiad tries to generalize the solution to apply parallelization to a much wider family.  Naiad supports MapReduce style solutions, but also many other solutions.  However, the tradeoff was simplicity.  It&#039;s like we took MapReduce and took away its low barrier to entry.  The idea is to create a constrained graph that can easily be parallelized.&lt;br /&gt;
&lt;br /&gt;
* More complicated than MapReduce&lt;br /&gt;
* Talks about Timely dataflow graphs &lt;br /&gt;
* It&#039;s all about graph algorithms - a graph abstraction&lt;br /&gt;
* Restrictions on graphs so that they can be mapped to parallel computation&lt;br /&gt;
* How to fit anything to this model is a big question. &lt;br /&gt;
* More general than MapReduce.&lt;br /&gt;
&lt;br /&gt;
* After reading the MapReduce paper, you could easily write a MapReduce job. After reading the Naiad paper, you can&#039;t: Naiad is super complicated. &lt;br /&gt;
* Their model is super complicated. It doesn&#039;t minimize our cognitive load.&lt;br /&gt;
* Doesn&#039;t scale well: after about 40 nodes there is no improvement in performance, while MapReduce can scale to thousands of nodes.&lt;br /&gt;
* Nobody wants to use it because the abstraction sucks.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_21&amp;diff=19031</id>
		<title>DistOS 2014W Lecture 21</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_21&amp;diff=19031"/>
		<updated>2014-04-20T02:43:08Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: More on M/R&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Presentation ==&lt;br /&gt;
&lt;br /&gt;
=== Marking ===&lt;br /&gt;
&lt;br /&gt;
* marked mostly on presentation, not content&lt;br /&gt;
* basically we want to communicate the basic structure of the paper, and do so in a way that isn&#039;t boring&lt;br /&gt;
&lt;br /&gt;
=== Content ===&lt;br /&gt;
&lt;br /&gt;
* concrete, not &amp;quot;head in the clouds&amp;quot;&lt;br /&gt;
* present the area&lt;br /&gt;
* compare and contrast the papers&lt;br /&gt;
* 10 minutes talk, 5 minutes feedback&lt;br /&gt;
* basic argument&lt;br /&gt;
* basic references&lt;br /&gt;
&lt;br /&gt;
=== Form ===&lt;br /&gt;
&lt;br /&gt;
* show the work we&#039;ve done on paper&lt;br /&gt;
* try to get feedback&lt;br /&gt;
* think of it as a rough draft&lt;br /&gt;
* try to get people to read the paper&lt;br /&gt;
* enthusiasm&lt;br /&gt;
* powerpoints are easier&lt;br /&gt;
* don&#039;t read slides&lt;br /&gt;
* no whole sentences on slides&lt;br /&gt;
* look at talks by Mark Shuttleworth&lt;br /&gt;
&lt;br /&gt;
== MapReduce ==&lt;br /&gt;
&lt;br /&gt;
A clever observation that a simple solution could solve most distributed problems.  It&#039;s all about programming to an abstraction that is efficiently parallelizable.  Note that it&#039;s not actually a simple solution, because it sits atop a mountain of code.  It requires something like BigTable which requires something like GFS, which requires something like Chubby. Despite this, it allows programmers to easily do distributed computation using a simple framework that hides the messy details of parallelization.&lt;br /&gt;
&lt;br /&gt;
* Restricted programming model&lt;br /&gt;
* Interestingly, large-scale problems can be implemented with this&lt;br /&gt;
* Easy to program, powerful for certain classes of problems, it scales well.&lt;br /&gt;
* The MapReduce job model is VERY limited though. You can&#039;t do things like simulations.&lt;br /&gt;
* MapReduce is problem specific. &lt;br /&gt;
** Naiad is less problem specific and allows you to do more.&lt;br /&gt;
&lt;br /&gt;
Programming to an abstraction that is efficiently parallel. Until now we have learned all about infrastructure. &lt;br /&gt;
Classic OS abstractions were about files; now we use programming abstractions.&lt;br /&gt;
&lt;br /&gt;
Example: word frequency in a document.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== How does it work? ===&lt;br /&gt;
&lt;br /&gt;
* Two steps, Map and Reduce. The user writes these.&lt;br /&gt;
** Map takes a single input key-value pair (e.g. a named document) and converts it to an intermediate (k,v) representation: a list of new key-value pairs.&lt;br /&gt;
** Reduce: Take the intermediate representation and merge the values.&lt;br /&gt;
&lt;br /&gt;
=== Implementation ===&lt;br /&gt;
&lt;br /&gt;
* Uses commodity HW and GFS.&lt;br /&gt;
* Master/Slave relationship amongst machines. Master delegates tasks to slaves.&lt;br /&gt;
* Intermediate representation saved as files.&lt;br /&gt;
* Many MapReduce jobs can happen in sequence.&lt;br /&gt;
&lt;br /&gt;
== Naiad ==&lt;br /&gt;
&lt;br /&gt;
Where MapReduce was suited for a specific family of solutions, Naiad tries to generalize the solution to apply parallelization to a much wider family.  Naiad supports MapReduce style solutions, but also many other solutions.  However, the tradeoff was simplicity.  It&#039;s like we took MapReduce and took away its low barrier to entry.  The idea is to create a constrained graph that can easily be parallelized.&lt;br /&gt;
* More complicated than MapReduce&lt;br /&gt;
* Talks about Timely dataflow graphs &lt;br /&gt;
* It&#039;s all about graph algorithms - a graph abstraction&lt;br /&gt;
* Restrictions on graphs so that they can be mapped to parallel computation&lt;br /&gt;
* How to fit anything to this model is a big question. &lt;br /&gt;
* More general than MapReduce &lt;br /&gt;
* Not very useful.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_21&amp;diff=19030</id>
		<title>DistOS 2014W Lecture 21</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_21&amp;diff=19030"/>
		<updated>2014-04-20T02:40:45Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: MapReduce stuff&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Presentation ==&lt;br /&gt;
&lt;br /&gt;
=== Marking ===&lt;br /&gt;
&lt;br /&gt;
* marked mostly on presentation, not content&lt;br /&gt;
* basically we want to communicate the basic structure of the paper, and do so in a way that isn&#039;t boring&lt;br /&gt;
&lt;br /&gt;
=== Content ===&lt;br /&gt;
&lt;br /&gt;
* concrete, not &amp;quot;head in the clouds&amp;quot;&lt;br /&gt;
* present the area&lt;br /&gt;
* compare and contrast the papers&lt;br /&gt;
* 10 minutes talk, 5 minutes feedback&lt;br /&gt;
* basic argument&lt;br /&gt;
* basic references&lt;br /&gt;
&lt;br /&gt;
=== Form ===&lt;br /&gt;
&lt;br /&gt;
* show the work we&#039;ve done on paper&lt;br /&gt;
* try to get feedback&lt;br /&gt;
* think of it as a rough draft&lt;br /&gt;
* try to get people to read the paper&lt;br /&gt;
* enthusiasm&lt;br /&gt;
* powerpoints are easier&lt;br /&gt;
* don&#039;t read slides&lt;br /&gt;
* no whole sentences on slides&lt;br /&gt;
* look at talks by Mark Shuttleworth&lt;br /&gt;
&lt;br /&gt;
== MapReduce ==&lt;br /&gt;
&lt;br /&gt;
A clever observation that a simple solution could solve most distributed problems.  It&#039;s all about programming to an abstraction that is efficiently parallelizable.  Note that it&#039;s not actually a simple solution, because it sits atop a mountain of code.  It requires something like BigTable which requires something like GFS, which requires something like Chubby. Despite this, it allows programmers to easily do distributed computation using a simple framework that hides the messy details of parallelization.&lt;br /&gt;
&lt;br /&gt;
* Restricted programming model&lt;br /&gt;
* Interestingly, large-scale problems can be implemented with this&lt;br /&gt;
* Easy to program, powerful for certain classes of problems, it scales like nobody&#039;s business. &lt;br /&gt;
* Kind of an empowering model&lt;br /&gt;
Programming to an abstraction that is efficiently parallel. Until now we have learned all about infrastructure. &lt;br /&gt;
Classic OS abstractions were about files; now we use programming abstractions.&lt;br /&gt;
&lt;br /&gt;
Example: word frequency in a document.&lt;br /&gt;
&lt;br /&gt;
=== How does it work? ===&lt;br /&gt;
&lt;br /&gt;
* Two steps, Map and Reduce. The user writes these.&lt;br /&gt;
** Map takes a single input key-value pair (e.g. a named document) and converts it to an intermediate (k,v) representation: a list of new key-value pairs.&lt;br /&gt;
** Reduce: Take the intermediate representation and merge the values.&lt;br /&gt;
&lt;br /&gt;
=== Implementation ===&lt;br /&gt;
&lt;br /&gt;
* Uses commodity HW and GFS.&lt;br /&gt;
* Master/Slave relationship amongst machines. Master delegates tasks to slaves.&lt;br /&gt;
* Intermediate representation saved as files.&lt;br /&gt;
* Many MapReduce jobs can happen in sequence.&lt;br /&gt;
&lt;br /&gt;
== Naiad ==&lt;br /&gt;
&lt;br /&gt;
Where MapReduce was suited for a specific family of solutions, Naiad tries to generalize the solution to apply parallelization to a much wider family.  Naiad supports MapReduce style solutions, but also many other solutions.  However, the tradeoff was simplicity.  It&#039;s like we took MapReduce and took away its low barrier to entry.  The idea is to create a constrained graph that can easily be parallelized.&lt;br /&gt;
* More complicated than MapReduce&lt;br /&gt;
* Talks about Timely dataflow graphs &lt;br /&gt;
* It&#039;s all about graph algorithms - a graph abstraction&lt;br /&gt;
* Restrictions on graphs so that they can be mapped to parallel computation&lt;br /&gt;
* How to fit anything to this model is a big question. &lt;br /&gt;
* More general than MapReduce &lt;br /&gt;
* Not very useful.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_20&amp;diff=19029</id>
		<title>DistOS 2014W Lecture 20</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_20&amp;diff=19029"/>
		<updated>2014-04-20T02:04:52Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Added a few other points for Comet.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Cassandra ==&lt;br /&gt;
&lt;br /&gt;
Cassandra is essentially running a BigTable interface on top of a Dynamo infrastructure.  BigTable uses GFS&#039; built-in replication and Chubby for locking.  Cassandra uses gossip algorithms (similar to Dynamo): [http://dl.acm.org/citation.cfm?id=1529983 Scuttlebutt].  &lt;br /&gt;
&lt;br /&gt;
=== A brief look at Open Source ===&lt;br /&gt;
&lt;br /&gt;
Initially, Anil talked about Google&#039;s versus Facebook&#039;s approaches to technology. &lt;br /&gt;
* Google developed its technology internally and used it for competitive advantage. &lt;br /&gt;
* Facebook developed its technology in an open source manner; they needed to create an open source community to keep up.&lt;br /&gt;
* He talked a little bit about licences. With GPLv3 you have to provide source code with the binary; with the AGPL, source code must also be provided when the software is offered as a network service.&lt;br /&gt;
&lt;br /&gt;
While discussing HBase versus Cassandra, we discussed why two projects with the same notion are supported; Apache works as a community. For any tool in CS, particularly software tools, it&#039;s actually important to have more than one good implementation; the only time this doesn&#039;t happen is because of market realities. &lt;br /&gt;
&lt;br /&gt;
Hadoop is a set of technologies that represent the open source equivalent of&lt;br /&gt;
Google&#039;s infrastructure&lt;br /&gt;
* Cassandra -&amp;gt; ???&lt;br /&gt;
* HBase -&amp;gt; BigTable&lt;br /&gt;
* HDFS -&amp;gt; GFS&lt;br /&gt;
* Zookeeper -&amp;gt; Chubby&lt;br /&gt;
&lt;br /&gt;
=== Back to Cassandra ===&lt;br /&gt;
&lt;br /&gt;
* Cassandra basically takes a key-value store like Dynamo and extends it to look like BigTable.&lt;br /&gt;
* Not just a key-value store: it is a multi-dimensional map. You can look up different columns, etc. The data is more structured than in a key-value store.&lt;br /&gt;
* In a key-value store, you can only look up the key. Cassandra is much richer than this.&lt;br /&gt;
&lt;br /&gt;
Bigtable vs. Cassandra:&lt;br /&gt;
* BigTable and Cassandra expose similar APIs.&lt;br /&gt;
* Cassandra seems to be lighter weight.&lt;br /&gt;
* BigTable depends on GFS; Cassandra depends on the server&#039;s file system. Anil feels a Cassandra cluster is easier to set up. &lt;br /&gt;
* BigTable is designed for stream-oriented batch processing. Cassandra is for handling online/realtime/high-speed workloads.&lt;br /&gt;
&lt;br /&gt;
Schema design is explained through the inbox example, but it does not make clear how the table will actually look. Anil thinks they store a lot of data with the messages, which makes the table messy.&lt;br /&gt;
	&lt;br /&gt;
Apache Zookeeper is used for distributed configuration. It will also bootstrap and configure a new node. It is similar to Chubby. Zookeeper is for node level information. The Gossip protocol is more about key partitioning information and distributing that information amongst nodes. &lt;br /&gt;
&lt;br /&gt;
Cassandra uses a modified version of the Accrual Failure Detector. The idea of accrual failure detection is that the failure-detection module emits a value representing a suspicion level for each monitored node. The value of phi is expressed on a scale that is dynamically adjusted to reflect network and load conditions at the monitored nodes.&lt;br /&gt;
&lt;br /&gt;
Files are written to disk sequentially and are never mutated. This way, reading a file does not require locks. Garbage collection takes care of deletion.&lt;br /&gt;
&lt;br /&gt;
Cassandra writes in an immutable way, like functional programming: there is no assignment in functional programming, which eliminates side effects. Data is just bound; you associate a name with a value. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Cassandra - &lt;br /&gt;
* Uses consistent hashing (like most DHTs)&lt;br /&gt;
* Lighter weight &lt;br /&gt;
* Almost all of the readings are part of Apache&lt;br /&gt;
* More designed for online updates and interactive, lower-latency use &lt;br /&gt;
* Once they write to disk they only read back&lt;br /&gt;
* Scalable multi master database with no single point of failure&lt;br /&gt;
* Reason for not giving out the complete detail on the table schema&lt;br /&gt;
* Probably not just inbox search&lt;br /&gt;
* All data in one row of a table &lt;br /&gt;
* It&#039;s not just a key-value store with one big blob of data. &lt;br /&gt;
* Gossip-based protocol - Scuttlebutt. Every node is aware of every other.&lt;br /&gt;
* Fixed circular ring &lt;br /&gt;
* The consistency issue is not addressed at all; writes are done in an immutable way and never changed. &lt;br /&gt;
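A toy sketch of consistent hashing on a fixed ring, simplified from what Dynamo-style systems do (one token per node, no virtual nodes or replication; all names here are made up):&lt;br /&gt;

```python
import bisect
import hashlib

# Toy consistent-hashing ring, simplified from what Dynamo-style systems
# such as Cassandra do: one token per node, no virtual nodes, no replicas.

def token(s):
    # Hash a string to a position on the ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Sort nodes by token; each node owns the arc that ends at its token.
        self.tokens = sorted((token(n), n) for n in nodes)

    def node_for(self, key):
        # Walk clockwise from the key to the next node token, wrapping around.
        positions = [t for t, _ in self.tokens]
        i = bisect.bisect(positions, token(key)) % len(positions)
        return self.tokens[i][1]
```

Adding or removing a node only remaps the keys on the arcs adjacent to it, which is the point of the technique.&lt;br /&gt;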
&lt;br /&gt;
Older style network protocol - token rings&lt;br /&gt;
What sort of computational systems avoid changing data?&lt;br /&gt;
Systems talking about implementing functional-like semantics.&lt;br /&gt;
&lt;br /&gt;
== Comet ==&lt;br /&gt;
&lt;br /&gt;
The major idea behind Comet is triggers/callbacks.  There is an extensive literature on extensible operating systems: basically, adding code to the operating system to better suit a particular application.  &amp;quot;Generally, extensible systems suck.&amp;quot; -[[User:Soma]] This was popular before operating systems were open source.&lt;br /&gt;
&lt;br /&gt;
[https://www.usenix.org/conference/osdi10/comet-active-distributed-key-value-store The presentation video of Comet]&lt;br /&gt;
&lt;br /&gt;
Comet seeks to greatly expand the application space for key-value storage systems through application-specific customization. A Comet storage object is a &amp;lt;key,value&amp;gt; pair. Each Comet node stores a collection of active storage objects (ASOs) that consist of a key, a value, and a set of handlers. Comet handlers run as a result of timers or storage operations, such as get or put, allowing an ASO to take dynamic, application-specific actions to customize its behaviour. Handlers are written in a simple sandboxed extension language, providing safety and isolation. An ASO can modify its environment, monitor its execution, and make dynamic decisions about its state.&lt;br /&gt;
&lt;br /&gt;
The researchers try to provide the ability to extend a DHT without requiring a substantial investment of effort to modify its implementation. They implement isolation and safety by restricting system access, resource consumption, and within-Comet communication.&lt;br /&gt;
&lt;br /&gt;
* Provides callbacks (a.k.a. database triggers)&lt;br /&gt;
* Provides DHT platform that is extensible at the application level&lt;br /&gt;
* Uses Lua&lt;br /&gt;
* Provided extensibility in an untrusted environment. Dynamo, by contrast, was extensible but only in a trusted environment.&lt;br /&gt;
* Why do we care? We don&#039;t really. Why would you want this extensibility? You wouldn&#039;t; it isn&#039;t worth the cost, and current systems already allow for tuneability.&lt;br /&gt;
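An illustrative sketch of the trigger idea (real Comet runs handlers in sandboxed Lua; this Python toy, with made-up names, only shows a get handler customizing a storage object):&lt;br /&gt;

```python
# Illustrative sketch of a Comet-style active storage object: a value plus
# handlers that run on storage operations. Real Comet runs handlers in
# sandboxed Lua; this Python toy only shows the trigger idea.

class ActiveStore:
    def __init__(self):
        self.objects = {}    # maps key to a (value, handlers) pair

    def put(self, key, value, handlers=None):
        self.objects[key] = (value, handlers or {})

    def get(self, key):
        value, handlers = self.objects[key]
        if "on_get" in handlers:
            # A handler can rewrite the value it returns, letting each
            # object customize its behaviour per application.
            return handlers["on_get"](value)
        return value
```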
&lt;br /&gt;
&lt;br /&gt;
== Other ==&lt;br /&gt;
&lt;br /&gt;
* if someone wants to understand consistent hashing in detail, here is a blog which explains it really well; this blog has other great posts in the field of distributed systems as well:&lt;br /&gt;
http://loveforprogramming.quora.com/Distributed-Systems-Part-1-A-peek-into-consistent-hashing&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_20&amp;diff=19028</id>
		<title>DistOS 2014W Lecture 20</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_20&amp;diff=19028"/>
		<updated>2014-04-20T01:44:18Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Removed line breaks&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Cassandra ==&lt;br /&gt;
&lt;br /&gt;
Cassandra is essentially running a BigTable interface on top of a Dynamo infrastructure.  BigTable uses GFS&#039; built-in replication and Chubby for locking.  Cassandra uses gossip algorithms (similar to Dynamo): [http://dl.acm.org/citation.cfm?id=1529983 Scuttlebutt].  &lt;br /&gt;
&lt;br /&gt;
=== A brief look at Open Source ===&lt;br /&gt;
&lt;br /&gt;
Initially, Anil talked about Google&#039;s versus Facebook&#039;s approaches to technology. &lt;br /&gt;
* Google developed its technology internally and used it for competitive advantage. &lt;br /&gt;
* Facebook developed its technology in an open source manner; it needed to create an open source community to keep up.&lt;br /&gt;
* He also talked a little about licences. Under the GPLv3 you have to provide source code with the binary; under the AGPL, source must also be provided when the software is offered as a network service.&lt;br /&gt;
&lt;br /&gt;
While discussing HBase versus Cassandra, we discussed why two projects with the same goal are supported by Apache as a community. For any tool in CS, particularly software tools, it is actually important to have more than one good implementation; the only time this doesn&#039;t happen is because of market realities. &lt;br /&gt;
&lt;br /&gt;
Hadoop is a set of technologies that represents the open source equivalent of Google&#039;s infrastructure:&lt;br /&gt;
* Cassandra -&amp;gt; ???&lt;br /&gt;
* HBase -&amp;gt; BigTable&lt;br /&gt;
* HDFS -&amp;gt; GFS&lt;br /&gt;
* Zookeeper -&amp;gt; Chubby&lt;br /&gt;
&lt;br /&gt;
=== Back to Cassandra ===&lt;br /&gt;
&lt;br /&gt;
* Cassandra is basically what you get if you take a key-value store like Dynamo and extend it to look like BigTable.&lt;br /&gt;
* It is not just a key-value store; it is a multi-dimensional map. You can look up different columns, etc. The data is more structured than in a key-value store.&lt;br /&gt;
* In a key-value store you can only look up by key. Cassandra is much richer than this.&lt;br /&gt;
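The multi-dimensional map can be pictured as a nested dictionary (a sketch only; the real Cassandra data model also carries timestamps, and the helper names below are invented):

```python
from collections import defaultdict

# Cassandra-style data model as a nested map:
# row key -> column family -> column name -> value.
table = defaultdict(lambda: defaultdict(dict))


def put(row, family, column, value):
    table[row][family][column] = value


def get_columns(row, family):
    # Unlike a plain key-value store, we can read back a slice of
    # columns for one row, not just a single opaque value.
    return dict(table[row][family])


put("user:42", "inbox", "msg-001", "hello")
put("user:42", "inbox", "msg-002", "world")
cols = get_columns("user:42", "inbox")
```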
&lt;br /&gt;
Bigtable vs. Cassandra:&lt;br /&gt;
* BigTable and Cassandra expose similar APIs.&lt;br /&gt;
* Cassandra seems to be lighter weight.&lt;br /&gt;
* BigTable depends on GFS; Cassandra depends on the server&#039;s local file system. Anil feels a Cassandra cluster is easier to set up. &lt;br /&gt;
* BigTable is designed for stream-oriented batch processing. Cassandra is for handling online/realtime/high-speed workloads.&lt;br /&gt;
&lt;br /&gt;
Schema design is explained with the inbox-search example, but the paper does not make clear how the table will actually look. Anil thinks they store a lot of data along with the messages, which makes the table messy.&lt;br /&gt;
	&lt;br /&gt;
Apache Zookeeper is used for distributed configuration. It will also bootstrap and configure a new node. It is similar to Chubby. Zookeeper is for node level information. The Gossip protocol is more about key partitioning information and distributing that information amongst nodes. &lt;br /&gt;
&lt;br /&gt;
Cassandra uses a modified version of the Accrual Failure Detector. The idea of accrual failure detection is that the failure-detection module emits a value representing a suspicion level for each monitored node. This value, phi, is expressed on a scale that is dynamically adjusted to react to network and load conditions at the monitored nodes.&lt;br /&gt;
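A toy version of the accrual idea, assuming heartbeat gaps follow an exponential distribution (the real detector estimates the distribution from observed arrival times; all names here are illustrative):

```python
import math


class AccrualDetector:
    """Emit a suspicion level phi instead of a binary alive/dead verdict."""

    def __init__(self):
        self.intervals = []       # observed gaps between heartbeats
        self.last_heartbeat = None

    def heartbeat(self, now: float):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        # phi = -log10(P(a heartbeat still arrives after this long a silence)).
        # Under an exponential model the survival probability is e^(-t/mean),
        # so suspicion grows with the silence and adapts to the observed
        # heartbeat rate (and hence to network and load conditions).
        mean = sum(self.intervals) / len(self.intervals)
        t = now - self.last_heartbeat
        return -math.log10(math.exp(-t / mean))


d = AccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):   # steady heartbeats every second
    d.heartbeat(t)
low = d.phi(3.5)    # short silence -> low suspicion
high = d.phi(13.0)  # long silence -> high suspicion
```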
&lt;br /&gt;
Files are written to disk sequentially and are never mutated. This way, reading a file does not require locks. Garbage collection takes care of deletion.&lt;br /&gt;
&lt;br /&gt;
Cassandra writes in an immutable way, like functional programming. There is no assignment in functional programming; it tries to eliminate side effects. Data is just bound: you associate a name with a value. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Cassandra - &lt;br /&gt;
* Uses consistent hashing (like most DHTs)&lt;br /&gt;
* Lighter weight &lt;br /&gt;
* Almost all of the readings are part of Apache&lt;br /&gt;
* More designed for online updates and interactive, lower-latency use &lt;br /&gt;
* Once they write to disk, they only read it back&lt;br /&gt;
* Scalable multi-master database with no single point of failure&lt;br /&gt;
* There is a reason for not giving out complete detail on the table schema&lt;br /&gt;
* Probably used for more than just inbox search&lt;br /&gt;
* All data for an item is in one row of a table &lt;br /&gt;
* It&#039;s not just a key-value store: a row can hold a big blob of data. &lt;br /&gt;
* Gossip-based protocol - Scuttlebutt. Every node is aware of every other.&lt;br /&gt;
* Fixed circular ring &lt;br /&gt;
* Consistency issues are not really addressed: writes are immutable and never changed. &lt;br /&gt;
&lt;br /&gt;
Token rings, an older style of network protocol, used a similar fixed ring.&lt;br /&gt;
What sort of computational systems avoid changing data?&lt;br /&gt;
Systems like this are, in effect, implementing functional-like semantics.&lt;br /&gt;
&lt;br /&gt;
== Comet ==&lt;br /&gt;
&lt;br /&gt;
The major idea behind Comet is triggers/callbacks.  There is an extensive literature in extensible operating systems, basically adding code to the operating system to better suit my application.  &amp;quot;Generally, extensible systems suck.&amp;quot; -[[User:Soma]]&lt;br /&gt;
&lt;br /&gt;
[https://www.usenix.org/conference/osdi10/comet-active-distributed-key-value-store The presentation video of Comet]&lt;br /&gt;
&lt;br /&gt;
Comet seeks to greatly expand the application space for key-value storage systems through application-specific customization. A Comet storage object is a &amp;lt;key,value&amp;gt; pair. Each Comet node stores a collection of active storage objects (ASOs), each consisting of a key, a value, and a set of handlers. Comet handlers run as a result of timers or storage operations, such as get or put, allowing an ASO to take dynamic, application-specific actions to customize its behaviour. Handlers are written in a simple sandboxed extension language, which provides safety and isolation. An ASO can modify its environment, monitor its execution, and make dynamic decisions about its state.&lt;br /&gt;
&lt;br /&gt;
The researchers try to provide the ability to extend a DHT without requiring a substantial investment of effort to modify its implementation. They enforce isolation and safety by restricting system access, restricting resource consumption, and restricting within-Comet communication.&lt;br /&gt;
&lt;br /&gt;
* For a detailed explanation of consistent hashing, see the following blog post; the blog has other good posts on distributed systems as well:&lt;br /&gt;
http://loveforprogramming.quora.com/Distributed-Systems-Part-1-A-peek-into-consistent-hashing&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_20&amp;diff=19027</id>
		<title>DistOS 2014W Lecture 20</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_20&amp;diff=19027"/>
		<updated>2014-04-20T01:43:11Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Added Cassandra stuff. Fixed and cleaned up some english.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Cassandra ==&lt;br /&gt;
&lt;br /&gt;
Cassandra is essentially running a BigTable interface on top of a Dynamo infrastructure.  BigTable uses GFS&#039; built-in replication and Chubby for locking.  Cassandra uses gossip algorithms (similar to Dynamo): [http://dl.acm.org/citation.cfm?id=1529983 Scuttlebutt].  &lt;br /&gt;
&lt;br /&gt;
=== A brief look at Open Source ===&lt;br /&gt;
&lt;br /&gt;
Initially, Anil talked about Google&#039;s versus Facebook&#039;s approaches to technology. &lt;br /&gt;
* Google developed its technology internally and used it for competitive advantage. &lt;br /&gt;
* Facebook developed its technology in an open source manner; it needed to create an open source community to keep up.&lt;br /&gt;
* He also talked a little about licences. Under the GPLv3 you have to provide source code with the binary; under the AGPL, source must also be provided when the software is offered as a network service.&lt;br /&gt;
&lt;br /&gt;
While discussing HBase versus Cassandra, we discussed why two projects with the same goal are supported by Apache as a community. For any tool in CS, particularly software tools, it is actually important to have more than one good implementation; the only time this doesn&#039;t happen is because of market realities. &lt;br /&gt;
&lt;br /&gt;
Hadoop is a set of technologies that represents the open source equivalent of Google&#039;s infrastructure:&lt;br /&gt;
* Cassandra -&amp;gt; ???&lt;br /&gt;
* HBase -&amp;gt; BigTable&lt;br /&gt;
* HDFS -&amp;gt; GFS&lt;br /&gt;
* Zookeeper -&amp;gt; Chubby&lt;br /&gt;
&lt;br /&gt;
=== Back to Cassandra ===&lt;br /&gt;
&lt;br /&gt;
* Cassandra is basically what you get if you take a key-value store like Dynamo and extend it to look like BigTable.&lt;br /&gt;
* It is not just a key-value store; it is a multi-dimensional map. You can look up different columns, etc. The data is more structured than in a key-value store.&lt;br /&gt;
* In a key-value store you can only look up by key. Cassandra is much richer than this.&lt;br /&gt;
&lt;br /&gt;
Bigtable vs. Cassandra:&lt;br /&gt;
* BigTable and Cassandra expose similar APIs.&lt;br /&gt;
* Cassandra seems to be lighter weight.&lt;br /&gt;
* BigTable depends on GFS; Cassandra depends on the server&#039;s local file system. Anil feels a Cassandra cluster is easier to set up. &lt;br /&gt;
* BigTable is designed for stream-oriented batch processing. Cassandra is for handling online/realtime/high-speed workloads.&lt;br /&gt;
&lt;br /&gt;
Schema design is explained with the inbox-search example, but the paper does not make clear how the table will actually look. Anil thinks they store a lot of data along with the messages, which makes the table messy.&lt;br /&gt;
	&lt;br /&gt;
Apache Zookeeper is used for distributed configuration. It will also bootstrap and configure a new node. It is similar to Chubby. Zookeeper is for node level information. The Gossip protocol is more about key partitioning information and distributing that information amongst nodes. &lt;br /&gt;
&lt;br /&gt;
Cassandra uses a modified version of the Accrual Failure Detector. The idea of accrual failure detection is that the failure-detection module emits a value representing a suspicion level for each monitored node. This value, phi, is expressed on a scale that is dynamically adjusted to react to network and load conditions at the monitored nodes.&lt;br /&gt;
&lt;br /&gt;
Files are written to disk sequentially and are never mutated. This way, reading a file does not require locks. Garbage collection takes care of deletion.&lt;br /&gt;
&lt;br /&gt;
Cassandra writes in an immutable way, like functional programming. There is no assignment in functional programming; it tries to eliminate side effects. Data is just bound: you associate a name with a value. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Cassandra - &lt;br /&gt;
* Uses consistent hashing (like most DHTs)&lt;br /&gt;
* Lighter weight &lt;br /&gt;
* Almost all of the readings are part of Apache&lt;br /&gt;
* More designed for online updates and interactive, lower-latency use &lt;br /&gt;
* Once they write to disk, they only read it back&lt;br /&gt;
* Scalable multi-master database with no single point of failure&lt;br /&gt;
* There is a reason for not giving out complete detail on the table schema&lt;br /&gt;
* Probably used for more than just inbox search&lt;br /&gt;
* All data for an item is in one row of a table &lt;br /&gt;
* It&#039;s not just a key-value store: a row can hold a big blob of data. &lt;br /&gt;
* Gossip-based protocol - Scuttlebutt. Every node is aware of every other.&lt;br /&gt;
* Fixed circular ring &lt;br /&gt;
* Consistency issues are not really addressed: writes are immutable and never changed. &lt;br /&gt;
&lt;br /&gt;
Token rings, an older style of network protocol, used a similar fixed ring.&lt;br /&gt;
What sort of computational systems avoid changing data?&lt;br /&gt;
Systems like this are, in effect, implementing functional-like semantics.&lt;br /&gt;
&lt;br /&gt;
== Comet ==&lt;br /&gt;
&lt;br /&gt;
The major idea behind Comet is triggers/callbacks.  There is an extensive literature in extensible operating systems, basically adding code to the operating system to better suit my application.  &amp;quot;Generally, extensible systems suck.&amp;quot; -[[User:Soma]]&lt;br /&gt;
&lt;br /&gt;
[https://www.usenix.org/conference/osdi10/comet-active-distributed-key-value-store The presentation video of Comet]&lt;br /&gt;
&lt;br /&gt;
Comet seeks to greatly expand the application space for key-value storage systems through application-specific customization. A Comet storage object is a &amp;lt;key,value&amp;gt; pair. Each Comet node stores a collection of active storage objects (ASOs), each consisting of a key, a value, and a set of handlers. Comet handlers run as a result of timers or storage operations, such as get or put, allowing an ASO to take dynamic, application-specific actions to customize its behaviour. Handlers are written in a simple sandboxed extension language, which provides safety and isolation. An ASO can modify its environment, monitor its execution, and make dynamic decisions about its state.&lt;br /&gt;
&lt;br /&gt;
The researchers try to provide the ability to extend a DHT without requiring a substantial investment of effort to modify its implementation. They enforce isolation and safety by restricting system access, restricting resource consumption, and restricting within-Comet communication.&lt;br /&gt;
&lt;br /&gt;
* For a detailed explanation of consistent hashing, see the following blog post; the blog has other good posts on distributed systems as well:&lt;br /&gt;
http://loveforprogramming.quora.com/Distributed-Systems-Part-1-A-peek-into-consistent-hashing&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_20&amp;diff=19026</id>
		<title>DistOS 2014W Lecture 20</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_20&amp;diff=19026"/>
		<updated>2014-04-19T22:52:50Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Added periods, bullets, and spaces&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Cassandra ==&lt;br /&gt;
&lt;br /&gt;
Cassandra is essentially running a BigTable interface on top of a Dynamo infrastructure.  BigTable uses GFS&#039; built-in replication and Chubby for locking.  Cassandra uses gossip algorithms: [http://dl.acm.org/citation.cfm?id=1529983 Scuttlebutt].  &lt;br /&gt;
&lt;br /&gt;
Initially, Anil talked about Google&#039;s versus Facebook&#039;s approaches to technology. Google developed its technology internally and used it for competitive advantage; Facebook developed its technology in an open source manner. He also talked a little about licences. Under the GPLv3 you have to provide source code with the binary; under the AGPL, source must also be provided when the software is offered as a network service.&lt;br /&gt;
&lt;br /&gt;
While discussing HBase versus Cassandra, we discussed why two projects with the same goal are supported by Apache as a community. For any tool in CS, particularly software tools, it is actually important to have more than one good implementation; the only time this doesn&#039;t happen is because of market realities. &lt;br /&gt;
&lt;br /&gt;
BigTable and Cassandra expose similar APIs. BigTable needs GFS; Cassandra depends on the server&#039;s local file system. Anil feels a Cassandra cluster is easier to set up. BigTable is designed for batch updates; Cassandra is for handling realtime workloads.&lt;br /&gt;
	&lt;br /&gt;
Schema design is explained with the inbox-search example, but the paper does not make clear how the table will actually look. Anil thinks they store a lot of data along with the messages, which makes the table messy.&lt;br /&gt;
	&lt;br /&gt;
Cassandra is designed for high-speed access and online operation.&lt;br /&gt;
	&lt;br /&gt;
Apache Zookeeper is used for distributed configuration; it is similar to Chubby. Zookeeper holds node-level information and is used to configure new nodes, while gossip is more about key-partitioning information.&lt;br /&gt;
&lt;br /&gt;
Cassandra uses a modified version of the Accrual Failure Detector. The idea of accrual failure detection is that the failure-detection module emits a value representing a suspicion level for each monitored node. This value, phi, is expressed on a scale that is dynamically adjusted to react to network and load conditions at the monitored nodes.&lt;br /&gt;
&lt;br /&gt;
Cassandra writes in an immutable way, like functional programming. There is no assignment in functional programming; it tries to eliminate side effects. Data is just bound: you associate a name with a value. Deletion is handled by garbage collection.&lt;br /&gt;
&lt;br /&gt;
Cassandra - &lt;br /&gt;
* No GFS-type cluster needed (BigTable depends on one) &lt;br /&gt;
* Lighter weight &lt;br /&gt;
* Almost all of the readings are part of Apache&lt;br /&gt;
* More designed for online updates and interactive, lower-latency use &lt;br /&gt;
* Once they write to disk, they only read it back&lt;br /&gt;
* Scalable multi-master database with no single point of failure&lt;br /&gt;
* There is a reason for not giving out complete detail on the table schema&lt;br /&gt;
* Probably used for more than just inbox search&lt;br /&gt;
* All data for an item is in one row of a table &lt;br /&gt;
* It&#039;s not just a key-value store: a row can hold a big blob of data. &lt;br /&gt;
* Gossip-based protocol - Scuttlebutt&lt;br /&gt;
* Fixed circular ring &lt;br /&gt;
* Consistency issues are not really addressed: writes are immutable and never changed. &lt;br /&gt;
&lt;br /&gt;
Older style network protocol - token rings&lt;br /&gt;
What sort of computational systems avoid changing data?&lt;br /&gt;
Systems talking about implementing functional like semantics.&lt;br /&gt;
&lt;br /&gt;
== Comet ==&lt;br /&gt;
&lt;br /&gt;
The major idea behind Comet is triggers/callbacks.  There is an extensive literature in extensible operating systems, basically adding code to the operating system to better suit my application.  &amp;quot;Generally, extensible systems suck.&amp;quot; -[[User:Soma]]&lt;br /&gt;
&lt;br /&gt;
[https://www.usenix.org/conference/osdi10/comet-active-distributed-key-value-store The presentation video of Comet]&lt;br /&gt;
&lt;br /&gt;
Comet seeks to greatly expand the application space for key-value storage systems through application-specific customization. A Comet storage object is a &amp;lt;key,value&amp;gt; pair. Each Comet node stores a collection of active storage objects (ASOs), each consisting of a key, a value, and a set of handlers. Comet handlers run as a result of timers or storage operations, such as get or put, allowing an ASO to take dynamic, application-specific actions to customize its behaviour. Handlers are written in a simple sandboxed extension language, which provides safety and isolation. An ASO can modify its environment, monitor its execution, and make dynamic decisions about its state.&lt;br /&gt;
&lt;br /&gt;
The researchers try to provide the ability to extend a DHT without requiring a substantial investment of effort to modify its implementation. They enforce isolation and safety by restricting system access, restricting resource consumption, and restricting within-Comet communication.&lt;br /&gt;
&lt;br /&gt;
* For a detailed explanation of consistent hashing, see the following blog post; the blog has other good posts on distributed systems as well:&lt;br /&gt;
http://loveforprogramming.quora.com/Distributed-Systems-Part-1-A-peek-into-consistent-hashing&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_19&amp;diff=19025</id>
		<title>DistOS 2014W Lecture 19</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_19&amp;diff=19025"/>
		<updated>2014-04-19T22:45:41Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: /* Consider the following */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Dynamo ==&lt;br /&gt;
&lt;br /&gt;
* Key value-store.&lt;br /&gt;
* Query model: key-value only&lt;br /&gt;
* Highly available, always writable.&lt;br /&gt;
* Guarantee Service Level Agreements (SLA).&lt;br /&gt;
* 0-hop DHT: each node has a direct link to the destination and a complete view of the system locally. No dynamic routing.&lt;br /&gt;
* Dynamo sacrifices consistency under certain failure scenarios.&lt;br /&gt;
* Consistent hashing to partition key-space: the output range of a hash function is treated as a fixed circular space or “ring”.&lt;br /&gt;
* Key-space is linear and the nodes partition it.&lt;br /&gt;
* ”Virtual Nodes”: Each server can be responsible for more than one virtual node.&lt;br /&gt;
* Each data item is replicated at N hosts.&lt;br /&gt;
* “preference list”: The list of nodes that is responsible for storing a particular key.&lt;br /&gt;
* Sacrifice strong consistency for availability.&lt;br /&gt;
** Eventual consistency.&lt;br /&gt;
* Decentralized, P2P, limited administration.&lt;br /&gt;
* It works at the scale of roughly 100 servers; it is not much bigger than that.&lt;br /&gt;
* Application/client specific conflict resolution.&lt;br /&gt;
* Designed to be flexible&lt;br /&gt;
** &amp;quot;Tuneable consistency&amp;quot;&lt;br /&gt;
** Pluggable local persistence: BDB, MySQL.&lt;br /&gt;
&lt;br /&gt;
Amazon&#039;s motivating use case is that at no point, in a customer&#039;s shopping cart, should any newly added item be dropped. Dynamo should be highly available and always writeable.&lt;br /&gt;
&lt;br /&gt;
Amazon has a service-oriented architecture. A response to a client is a composite of many services, so SLAs were a HUGE consideration when designing Dynamo. Amazon needed low latency and high availability to ensure a good user experience when aggregating all the services together.&lt;br /&gt;
&lt;br /&gt;
Traditional RDBMSs emphasise ACID compliance. Amazon found that ACID compliance leads to systems with far less availability; it&#039;s hard to have both consistency and availability at the same time. See [http://en.wikipedia.org/wiki/CAP_theorem CAP Theorem]. Dynamo can, and usually does, sacrifice consistency for availability. They use the terms &amp;quot;eventual consistency&amp;quot; and &amp;quot;tunable consistency&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
The key range is partitioned according to a consistent hashing algorithm, which treats the output range of the hash function as a fixed circular space or “ring”. Any time a new node joins, it takes a token which decides its position on the ring. Every node becomes the owner of the key range between itself and the previous node on the ring, so any time a node joins or leaves it only affects its neighbour nodes. Dynamo also has a notion of virtual nodes, where a machine can host more than one node, allowing the load to be adjusted according to each machine&#039;s capability. &lt;br /&gt;
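The ring described above can be sketched as follows (illustrative only: the hash function, virtual-node count, and class names are assumptions for the sketch, not details taken from the paper):

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Map a key onto the fixed circular space (here, a 128-bit MD5 ring).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, vnodes=4):
        # Each physical node hosts several virtual nodes ("tokens"),
        # which lets a more capable machine take a larger share of the ring.
        self._tokens = sorted(
            (_hash(f"{n}#{v}"), n) for n in nodes for v in range(vnodes)
        )
        self._positions = [t for t, _ in self._tokens]

    def owner(self, key: str) -> str:
        # The owner is the first token clockwise from the key's position;
        # wrap around to the start of the ring if we fall off the end.
        i = bisect.bisect_right(self._positions, _hash(key)) % len(self._tokens)
        return self._tokens[i][1]


ring = Ring(["node-a", "node-b", "node-c"])
owner = ring.owner("shopping-cart:42")
```

Because only the tokens adjacent to a new node change hands, a node joining or leaving moves only the keys in its neighbours' ranges.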
&lt;br /&gt;
Dynamo uses replication to provide availability: each key-value pair is replicated to N-1 additional nodes (N can be configured by the application that uses Dynamo).&lt;br /&gt;
&lt;br /&gt;
Each node has a complete view of the network: a node knows the key range that every other node supports. Any time a node joins, gossip-based protocols inform every node about the key-range changes. This allows Dynamo to be a 0-hop network. 0-hop means it is logically a 0-hop network; IP routing is still required to actually, physically get to the node. This 0-hop approach is different from typical distributed hash tables, where routing and hops are used to find the node responsible for a key (e.g. Tapestry). Dynamo can do this because the system is deployed on trusted, fully known networks.&lt;br /&gt;
&lt;br /&gt;
Dynamo is deployed on trusted networks (i.e. for Amazon&#039;s internal applications), so it doesn&#039;t have to worry about making the system secure. Compare this to OceanStore.&lt;br /&gt;
&lt;br /&gt;
When compared to BigTable, Dynamo typically scales to hundreds of servers, not thousands. That is not to say that Dynamo cannot scale; we need to understand the difference between the use cases for BigTable and Dynamo.&lt;br /&gt;
&lt;br /&gt;
A &amp;quot;write&amp;quot; done on any replica is never held off in order to serialize updates and maintain consistency. Dynamo will eventually try to reconcile the differences between divergent versions (based on their logs). If it cannot do so, conflict resolution is left to the client application that reads the data from Dynamo: when there is more than one version of a replica, all the versions, along with their logs, are passed to the client, and the client must reconcile the changes.&lt;br /&gt;
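One way to picture the reconciliation step is with vector clocks as the version logs (Dynamo does use vector clocks, but the function names and data below are invented for this sketch):

```python
def descends(a: dict, b: dict) -> bool:
    # Clock a descends from clock b if it has seen at least as many
    # updates from every node that b has seen.
    return all(a.get(node, 0) >= count for node, count in b.items())


def reconcile(versions):
    # Keep only versions no other version descends from; if more than
    # one survives, they are concurrent siblings and the conflict is
    # handed back to the application to merge.
    return [
        (clock, value) for clock, value in versions
        if not any(descends(other, clock) and other != clock
                   for other, _ in versions)
    ]


v1 = ({"a": 1}, "cart:[book]")
v2 = ({"a": 1, "b": 1}, "cart:[book,pen]")   # descends from v1
v3 = ({"a": 2}, "cart:[book,mug]")           # concurrent with v2
merged = reconcile([v1, v2, v3])
```

Here v1 is superseded by both later versions, while v2 and v3 are concurrent, so both survive and the client (e.g. the shopping-cart service) must merge them.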
&lt;br /&gt;
== Bigtable ==&lt;br /&gt;
&lt;br /&gt;
* BigTable is a distributed storage system for managing structured data.&lt;br /&gt;
* Designed to scale to a very large size&lt;br /&gt;
* More focused on consistency than Dynamo.&lt;br /&gt;
&lt;br /&gt;
* A BigTable is a sparse, distributed persistent multi-dimensional sorted map.&lt;br /&gt;
* Column oriented DB.&lt;br /&gt;
** Streaming chunks of columns is easier than streaming entire rows.&lt;br /&gt;
&lt;br /&gt;
* Data Model: rows made up of column families.&lt;br /&gt;
** Eg. Row: the page URL. Column families would either be the content, or the set of inbound links.&lt;br /&gt;
** Each column in a column family has copies. Timestamped.&lt;br /&gt;
&lt;br /&gt;
* Tablets: large tables are broken into tablets at row boundaries, and each tablet holds a contiguous range of sorted rows.&lt;br /&gt;
** Immutable b/c of GFS. Deletion happens via garbage collection.&lt;br /&gt;
&lt;br /&gt;
* An SSTable provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings.&lt;br /&gt;
* Metadata operations: Create/delete tables, column families, change metadata.&lt;br /&gt;
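The SSTable contract can be sketched as a toy in-memory version (the real SSTable is an on-disk file with a block index; the class and method names below are illustrative):

```python
import bisect


class SSTable:
    """Ordered, immutable map from byte-string keys to values."""

    def __init__(self, items):
        # Sorted once at construction and never mutated afterwards,
        # so concurrent lookups need no locks.
        pairs = sorted(items)
        self._keys = [k for k, _ in pairs]
        self._values = [v for _, v in pairs]

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def scan(self, start, end):
        # Ordered keys make contiguous range scans cheap, which is
        # what tablets (contiguous row ranges) rely on.
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return list(zip(self._keys[lo:hi], self._values[lo:hi]))


t = SSTable([(b"b", 2), (b"a", 1), (b"c", 3)])
```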
&lt;br /&gt;
===Implementation:===&lt;br /&gt;
&lt;br /&gt;
* Centralized, hierarchical design.&lt;br /&gt;
* Three major components: client library, one master server, many tablet servers.&lt;br /&gt;
&lt;br /&gt;
* Master server&lt;br /&gt;
** Assigns tablets to tablet server.&lt;br /&gt;
** Detects tablet additions and removals&lt;br /&gt;
** garbage collection on GFS.&lt;br /&gt;
&lt;br /&gt;
* Tablet Servers&lt;br /&gt;
** holds tablet locations.&lt;br /&gt;
** Manages multiple tablets (thousands per tablet server)&lt;br /&gt;
** Handles I/O.&lt;br /&gt;
&lt;br /&gt;
* Client Library&lt;br /&gt;
** What devs use.&lt;br /&gt;
** Caches tablet locations&lt;br /&gt;
&lt;br /&gt;
=== Consider the following ===&lt;br /&gt;
&lt;br /&gt;
Can BigTable be used in a shopping-cart type of scenario, where low latency and availability are the main focus? Can it be used like Dynamo? Yes it can, but not as well. BigTable would have more latency because it was designed for data processing, not for such a scenario; Dynamo was designed for different use cases. There is no one solution that solves all the problems in the world of distributed file systems: no silver bullet, no one-size-fits-all. Systems are usually designed for specific use cases and work best for them. They can later be molded to work in other scenarios, and may provide good enough performance for those added goals, but they will still work best for the use cases they originally targeted.&lt;br /&gt;
&lt;br /&gt;
* BigTable -&amp;gt; Highly consistent, Data Processing, Map Reduce, semi structured store&lt;br /&gt;
* Dynamo -&amp;gt; High availability, low latency, key-value store&lt;br /&gt;
&lt;br /&gt;
== General talk ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Read the introduction and conclusion for each paper and think about cases in the paper more than look to how the author solve the problem.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_19&amp;diff=19024</id>
		<title>DistOS 2014W Lecture 19</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_19&amp;diff=19024"/>
		<updated>2014-04-19T22:45:08Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Added big table stuff, cleaned everything up.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Dynamo ==&lt;br /&gt;
&lt;br /&gt;
* Key value-store.&lt;br /&gt;
* Query model: key-value only&lt;br /&gt;
* Highly available, always writable.&lt;br /&gt;
* Guarantee Service Level Agreements (SLA).&lt;br /&gt;
* 0-hop DHT: each node has a direct link to the destination and a complete view of the system locally. No dynamic routing.&lt;br /&gt;
* Dynamo sacrifices consistency under certain failure scenarios.&lt;br /&gt;
* Consistent hashing to partition key-space: the output range of a hash function is treated as a fixed circular space or “ring”.&lt;br /&gt;
* Key-space is linear and the nodes partition it.&lt;br /&gt;
* ”Virtual Nodes”: Each server can be responsible for more than one virtual node.&lt;br /&gt;
* Each data item is replicated at N hosts.&lt;br /&gt;
* “preference list”: The list of nodes that is responsible for storing a particular key.&lt;br /&gt;
* Sacrifice strong consistency for availability.&lt;br /&gt;
** Eventual consistency.&lt;br /&gt;
* Decentralized, P2P, limited administration.&lt;br /&gt;
* It works at the scale of roughly 100 servers; it is not much bigger than that.&lt;br /&gt;
* Application/client specific conflict resolution.&lt;br /&gt;
* Designed to be flexible&lt;br /&gt;
** &amp;quot;Tuneable consistency&amp;quot;&lt;br /&gt;
** Pluggable local persistence: BDB, MySQL.&lt;br /&gt;
&lt;br /&gt;
Amazon&#039;s motivating use case is that at no point, in a customer&#039;s shopping cart, should any newly added item be dropped. Dynamo should be highly available and always writeable.&lt;br /&gt;
&lt;br /&gt;
Amazon has a service-oriented architecture. A response to a client is a composite of many services, so SLAs were a HUGE consideration when designing Dynamo. Amazon needed low latency and high availability to ensure a good user experience when aggregating all the services together.&lt;br /&gt;
&lt;br /&gt;
Traditional RDBMSs emphasise ACID compliance. Amazon found that ACID compliance leads to systems with far less availability; it&#039;s hard to have both consistency and availability at the same time. See [http://en.wikipedia.org/wiki/CAP_theorem CAP Theorem]. Dynamo can, and usually does, sacrifice consistency for availability. They use the terms &amp;quot;eventual consistency&amp;quot; and &amp;quot;tunable consistency&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
The key range is partitioned according to a consistent hashing algorithm, which treats the output range of the hash function as a fixed circular space or “ring”. Any time a new node joins, it takes a token which decides its position on the ring. Every node becomes the owner of the key range between itself and the previous node on the ring, so any time a node joins or leaves it only affects its neighbour nodes. Dynamo also has a notion of virtual nodes, where a machine can host more than one node, allowing the load to be adjusted according to each machine&#039;s capability. &lt;br /&gt;
&lt;br /&gt;
Dynamo uses replication to provide availability: each key-value pair is replicated to N-1 additional nodes (N can be configured by the application that uses Dynamo).&lt;br /&gt;
&lt;br /&gt;
Each node has a complete view of the network: a node knows the key range that every other node supports. Any time a node joins, gossip-based protocols inform every node about the key-range changes. This allows Dynamo to be a 0-hop network, meaning logically 0-hop; IP routing is still required to physically reach the node. This 0-hop approach differs from typical distributed hash tables, where routing over multiple hops is used to find the node responsible for a key (e.g., Tapestry). Dynamo can do this because the system is deployed on trusted, fully known networks.&lt;br /&gt;
&lt;br /&gt;
Dynamo is deployed on trusted networks (i.e., for Amazon&#039;s internal applications), so it doesn&#039;t have to worry about making the system secure. Compare this to OceanStore.&lt;br /&gt;
&lt;br /&gt;
When compared to BigTable, Dynamo typically scales to hundreds of servers, not thousands. That is not to say that Dynamo cannot scale; we need to understand the difference between the use cases for BigTable and Dynamo.&lt;br /&gt;
&lt;br /&gt;
A &amp;quot;write&amp;quot; on any replica is never held off to serialize updates for the sake of consistency. Dynamo will eventually try to reconcile the differences between divergent versions (based on their logs); if it cannot, conflict resolution is left to the client application that reads the data from Dynamo. If more than one version of a replica exists, all the versions, along with their logs, are passed to the client, and the client must reconcile the changes.&lt;br /&gt;
&lt;br /&gt;
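The client-side conflict resolution just described can be illustrated with the shopping-cart example. This is a hypothetical sketch: real Dynamo clients use vector clocks to detect which versions are actually concurrent; here we simply merge divergent carts by set union so that no added item is ever dropped.

```python
def reconcile_carts(versions):
    """Merge the divergent cart versions a read may return.

    `versions` is a list of carts (each a list of item names) handed
    back by the store when replicas disagree.  Taking the union keeps
    every item any replica accepted -- Amazon's "never drop an added
    item" requirement.  (Illustrative only; deletions would need the
    version history to resolve correctly.)
    """
    merged = set()
    for cart in versions:
        merged |= set(cart)
    return sorted(merged)
```

Note that union-merge is a business-specific policy: it is exactly the kind of application-level resolution Dynamo delegates to its clients.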
== Bigtable ==&lt;br /&gt;
&lt;br /&gt;
* BigTable is a distributed storage system for managing structured data.&lt;br /&gt;
* Designed to scale to a very large size&lt;br /&gt;
* More focused on consistency than Dynamo.&lt;br /&gt;
&lt;br /&gt;
* A BigTable is a sparse, distributed persistent multi-dimensional sorted map.&lt;br /&gt;
* Column oriented DB.&lt;br /&gt;
** Streaming chunks of columns is easier than streaming entire rows.&lt;br /&gt;
&lt;br /&gt;
* Data Model: rows made up of column families.&lt;br /&gt;
** Eg. Row: the page URL. Column families would either be the content, or the set of inbound links.&lt;br /&gt;
** Each cell in a column family can hold multiple copies, distinguished by timestamp.&lt;br /&gt;
&lt;br /&gt;
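The data model above, a sparse sorted map keyed by row, column family, and timestamp, can be modelled with a toy class. `TinyBigtable` and its method names are invented for illustration; the real system is distributed and backed by SSTables on GFS.

```python
class TinyBigtable:
    """Toy model of BigTable's data model: a sparse map of
    (row, "family:qualifier", timestamp) -> value."""

    def __init__(self):
        self.cells = {}  # sparse: only written cells exist

    def put(self, row, family, qualifier, timestamp, value):
        # Writes never overwrite: each write adds a timestamped copy.
        self.cells[(row, f"{family}:{qualifier}", timestamp)] = value

    def get(self, row, family, qualifier):
        # Return the most recent timestamped copy of the cell, if any.
        col = f"{family}:{qualifier}"
        matches = [(ts, v) for (r, c, ts), v in self.cells.items()
                   if r == row and c == col]
        return max(matches)[1] if matches else None
```

For example, a row might be a page URL, with one column family for its contents and another for its inbound links, mirroring the webtable example from the lecture.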
* Tablets: large tables are broken into tablets at row boundaries; each tablet holds a contiguous range of sorted rows.&lt;br /&gt;
** Immutable b/c of GFS. Deletion happens via garbage collection.&lt;br /&gt;
&lt;br /&gt;
* An SSTable provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings.&lt;br /&gt;
* Metadata operations: Create/delete tables, column families, change metadata.&lt;br /&gt;
&lt;br /&gt;
===Implementation:===&lt;br /&gt;
&lt;br /&gt;
* Centralized hierarchy.&lt;br /&gt;
* Three major components: client library, one master server, many tablet servers.&lt;br /&gt;
&lt;br /&gt;
* Master server&lt;br /&gt;
** Assigns tablets to tablet servers.&lt;br /&gt;
** Detects tablet additions and removals&lt;br /&gt;
** garbage collection on GFS.&lt;br /&gt;
&lt;br /&gt;
* Tablet Servers&lt;br /&gt;
** holds tablet locations.&lt;br /&gt;
** Manages multiple tablets (thousands per tablet server)&lt;br /&gt;
** Handles I/O.&lt;br /&gt;
&lt;br /&gt;
* Client Library&lt;br /&gt;
** What devs use.&lt;br /&gt;
** Caches tablet locations&lt;br /&gt;
&lt;br /&gt;
=== Consider the following ===&lt;br /&gt;
&lt;br /&gt;
Can BigTable be used in a shopping-cart type of scenario, where low latency and availability are the main focus? Can it be used like Dynamo? Yes, it can, but not as well. BigTable would have more latency because it was designed for data processing, not for such a scenario; Dynamo was designed for different use cases. There is no one solution that solves all the problems in the world of distributed file systems: no silver bullet, no one-size-fits-all. Systems are usually designed for specific use cases and work best for them. If need be, they can later be molded to other scenarios, and they may provide good enough performance for those later goals as well, but they will work best for the use cases that were the original targets.&lt;br /&gt;
&lt;br /&gt;
BigTable -&amp;gt; highly consistent, data processing, MapReduce, semi-structured store&lt;br /&gt;
Dynamo -&amp;gt; High availability, low latency, key-value store&lt;br /&gt;
&lt;br /&gt;
== General talk ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Read the introduction and conclusion for each paper, and think about the use cases in the paper more than about how the authors solve the problem.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_19&amp;diff=19023</id>
		<title>DistOS 2014W Lecture 19</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_19&amp;diff=19023"/>
		<updated>2014-04-19T18:21:21Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Added a line about P2P and decentrlization.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Dynamo ==&lt;br /&gt;
&lt;br /&gt;
* Key value-store.&lt;br /&gt;
* Query model: key-value only&lt;br /&gt;
* Highly available, always writable.&lt;br /&gt;
* Guarantee Service Level Agreements (SLA).&lt;br /&gt;
* 0-hop DHT: it has direct link to the destination. Has complete view of system locally. No dynamic routing.&lt;br /&gt;
* Dynamo sacrifices consistency under certain failure scenarios.&lt;br /&gt;
* Consistent hashing to partition key-space: the output range of a hash function is treated as a fixed circular space or “ring”.&lt;br /&gt;
* Key-space is linear and the nodes partition it.&lt;br /&gt;
* ”Virtual Nodes”: Each server can be responsible for more than one virtual node.&lt;br /&gt;
* Each data item is replicated at N hosts.&lt;br /&gt;
* “preference list”: The list of nodes that is responsible for storing a particular key.&lt;br /&gt;
* Sacrifice strong consistency for availability.&lt;br /&gt;
** Eventual consistency.&lt;br /&gt;
* Decentralized, P2P, limited administration.&lt;br /&gt;
* It works at the scale of about 100 servers; it is not designed to grow much larger.&lt;br /&gt;
* Application/client specific conflict resolution.&lt;br /&gt;
* Designed to be flexible&lt;br /&gt;
** &amp;quot;Tuneable consistency&amp;quot;&lt;br /&gt;
** Pluggable local persistence: DBD, MySQL.&lt;br /&gt;
&lt;br /&gt;
Amazon&#039;s motivating use case is that at no point, in a customer&#039;s shopping cart, should any newly added item be dropped. Dynamo should be highly available and always writeable.&lt;br /&gt;
&lt;br /&gt;
Amazon has a service-oriented architecture. A response to a client is a composite of many services, so SLAs were a HUGE consideration when designing Dynamo. Amazon needed low latency and high availability to ensure a good user experience when aggregating all the services together.&lt;br /&gt;
&lt;br /&gt;
Traditional RDBMSs emphasise ACID compliance. Amazon found that ACID compliance led to systems with far less availability. It&#039;s hard to have both consistency and availability at the same time; see the [http://en.wikipedia.org/wiki/CAP_theorem CAP Theorem]. Dynamo can, and usually does, sacrifice consistency for availability. They use the terms &amp;quot;eventual consistency&amp;quot; and &amp;quot;tunable consistency&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
The key range is partitioned according to a consistent hashing algorithm, which treats the output range of the hash function as a fixed circular space or “ring”. Any time a new node joins, it takes a token that decides its position on the ring. Each node becomes the owner of the key range between itself and the previous node on the ring, so a node joining or leaving only affects its neighbouring nodes. Dynamo also has the notion of a virtual node, where one machine can host more than one node, which allows the load to be adjusted according to the machine&#039;s capability. &lt;br /&gt;
&lt;br /&gt;
Dynamo uses replication to provide availability: each key-value pair is replicated at N nodes, the node responsible for the key plus the next N-1 nodes on the ring (N can be configured by the application that uses Dynamo).&lt;br /&gt;
&lt;br /&gt;
Each node has a complete view of the network: a node knows the key range that every other node supports. Any time a node joins, gossip-based protocols inform every node about the key-range changes. This allows Dynamo to be a 0-hop network, meaning logically 0-hop; IP routing is still required to physically reach the node. This 0-hop approach differs from typical distributed hash tables, where routing over multiple hops is used to find the node responsible for a key (e.g., Tapestry). Dynamo can do this because the system is deployed on trusted, fully known networks.&lt;br /&gt;
&lt;br /&gt;
Dynamo is deployed on trusted networks (i.e., for Amazon&#039;s internal applications), so it doesn&#039;t have to worry about making the system secure. Compare this to OceanStore.&lt;br /&gt;
&lt;br /&gt;
When compared to BigTable, Dynamo typically scales to hundreds of servers, not thousands. That is not to say that Dynamo cannot scale; we need to understand the difference between the use cases for BigTable and Dynamo.&lt;br /&gt;
&lt;br /&gt;
A &amp;quot;write&amp;quot; on any replica is never held off to serialize updates for the sake of consistency. Dynamo will eventually try to reconcile the differences between divergent versions (based on their logs); if it cannot, conflict resolution is left to the client application that reads the data from Dynamo. If more than one version of a replica exists, all the versions, along with their logs, are passed to the client, and the client must reconcile the changes.&lt;br /&gt;
&lt;br /&gt;
== Bigtable ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* BigTable is a distributed storage system for managing structured data.&lt;br /&gt;
* Designed to scale to a very large size&lt;br /&gt;
* It stores columns together; e.g., the rows are web pages and the columns are the contents.&lt;br /&gt;
* Each page has incoming links.&lt;br /&gt;
* A BigTable is a sparse, distributed persistent multi-dimensional sorted map.&lt;br /&gt;
* It can have many columns and looks like a table.&lt;br /&gt;
* Each row can have an arbitrary set of columns.&lt;br /&gt;
* It is a multi-dimensional map.&lt;br /&gt;
* An SSTable provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings.&lt;br /&gt;
* Large tables are broken into tablets at row boundaries; each tablet holds a contiguous range of rows.&lt;br /&gt;
* Metadata operations: Create/delete tables, column families, change metadata.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The question to consider is: can BigTable be used in a shopping-cart type of scenario, where latency and availability are the main focus (or, to rephrase, can BigTable be used in place of Dynamo, and vice versa)? The answer is that it can be, but it wouldn&#039;t be as good as Dynamo on latency. Dynamo would probably do a lot better, but only because BigTable was not designed to work under such a scenario; its use cases were different. There is no one solution that solves all the problems in the world of distributed file systems: no silver bullet, no one-size-fits-all. Systems are usually designed for specific use cases and work best for them. If need be, they can later be molded to other scenarios, and they may provide good enough performance for those later goals as well, but they will work best for the use cases that were the original targets.&lt;br /&gt;
&lt;br /&gt;
== General talk ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Read the introduction and conclusion for each paper, and think about the use cases in the paper more than about how the authors solve the problem.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_19&amp;diff=19022</id>
		<title>DistOS 2014W Lecture 19</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_19&amp;diff=19022"/>
		<updated>2014-04-19T18:19:04Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Added a ton of stuff to Dynamo. Rewrote a lot of paragraphs to proper english.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Dynamo ==&lt;br /&gt;
&lt;br /&gt;
* Key value-store.&lt;br /&gt;
* Query model: key-value only&lt;br /&gt;
* Highly available, always writable.&lt;br /&gt;
* Guarantee Service Level Agreements (SLA).&lt;br /&gt;
* 0-hop DHT: it has direct link to the destination. Has complete view of system locally. No dynamic routing.&lt;br /&gt;
* Dynamo sacrifices consistency under certain failure scenarios.&lt;br /&gt;
* Consistent hashing to partition key-space: the output range of a hash function is treated as a fixed circular space or “ring”.&lt;br /&gt;
* Key-space is linear and the nodes partition it.&lt;br /&gt;
* ”Virtual Nodes”: Each server can be responsible for more than one virtual node.&lt;br /&gt;
* Each data item is replicated at N hosts.&lt;br /&gt;
* “preference list”: The list of nodes that is responsible for storing a particular key.&lt;br /&gt;
* Sacrifice strong consistency for availability.&lt;br /&gt;
** Eventual consistency.&lt;br /&gt;
* It works at the scale of about 100 servers; it is not designed to grow much larger.&lt;br /&gt;
* Application/client specific conflict resolution.&lt;br /&gt;
* Designed to be flexible&lt;br /&gt;
** &amp;quot;Tuneable consistency&amp;quot;&lt;br /&gt;
** Pluggable local persistence: DBD, MySQL.&lt;br /&gt;
&lt;br /&gt;
Amazon&#039;s motivating use case is that at no point, in a customer&#039;s shopping cart, should any newly added item be dropped. Dynamo should be highly available and always writeable.&lt;br /&gt;
&lt;br /&gt;
Amazon has a service-oriented architecture. A response to a client is a composite of many services, so SLAs were a HUGE consideration when designing Dynamo. Amazon needed low latency and high availability to ensure a good user experience when aggregating all the services together.&lt;br /&gt;
&lt;br /&gt;
Traditional RDBMSs emphasise ACID compliance. Amazon found that ACID compliance led to systems with far less availability. It&#039;s hard to have both consistency and availability at the same time; see the [http://en.wikipedia.org/wiki/CAP_theorem CAP Theorem]. Dynamo can, and usually does, sacrifice consistency for availability. They use the terms &amp;quot;eventual consistency&amp;quot; and &amp;quot;tunable consistency&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
The key range is partitioned according to a consistent hashing algorithm, which treats the output range of the hash function as a fixed circular space or “ring”. Any time a new node joins, it takes a token that decides its position on the ring. Each node becomes the owner of the key range between itself and the previous node on the ring, so a node joining or leaving only affects its neighbouring nodes. Dynamo also has the notion of a virtual node, where one machine can host more than one node, which allows the load to be adjusted according to the machine&#039;s capability. &lt;br /&gt;
&lt;br /&gt;
Dynamo uses replication to provide availability: each key-value pair is replicated at N nodes, the node responsible for the key plus the next N-1 nodes on the ring (N can be configured by the application that uses Dynamo).&lt;br /&gt;
&lt;br /&gt;
Each node has a complete view of the network: a node knows the key range that every other node supports. Any time a node joins, gossip-based protocols inform every node about the key-range changes. This allows Dynamo to be a 0-hop network, meaning logically 0-hop; IP routing is still required to physically reach the node. This 0-hop approach differs from typical distributed hash tables, where routing over multiple hops is used to find the node responsible for a key (e.g., Tapestry). Dynamo can do this because the system is deployed on trusted, fully known networks.&lt;br /&gt;
&lt;br /&gt;
Dynamo is deployed on trusted networks (i.e., for Amazon&#039;s internal applications), so it doesn&#039;t have to worry about making the system secure. Compare this to OceanStore.&lt;br /&gt;
&lt;br /&gt;
When compared to BigTable, Dynamo typically scales to hundreds of servers, not thousands. That is not to say that Dynamo cannot scale; we need to understand the difference between the use cases for BigTable and Dynamo.&lt;br /&gt;
&lt;br /&gt;
A &amp;quot;write&amp;quot; on any replica is never held off to serialize updates for the sake of consistency. Dynamo will eventually try to reconcile the differences between divergent versions (based on their logs); if it cannot, conflict resolution is left to the client application that reads the data from Dynamo. If more than one version of a replica exists, all the versions, along with their logs, are passed to the client, and the client must reconcile the changes.&lt;br /&gt;
&lt;br /&gt;
== Bigtable ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* BigTable is a distributed storage system for managing structured data.&lt;br /&gt;
* Designed to scale to a very large size&lt;br /&gt;
* It stores columns together; e.g., the rows are web pages and the columns are the contents.&lt;br /&gt;
* Each page has incoming links.&lt;br /&gt;
* A BigTable is a sparse, distributed persistent multi-dimensional sorted map.&lt;br /&gt;
* It can have many columns and looks like a table.&lt;br /&gt;
* Each row can have an arbitrary set of columns.&lt;br /&gt;
* It is a multi-dimensional map.&lt;br /&gt;
* An SSTable provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings.&lt;br /&gt;
* Large tables are broken into tablets at row boundaries; each tablet holds a contiguous range of rows.&lt;br /&gt;
* Metadata operations: Create/delete tables, column families, change metadata.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The question to consider is: can BigTable be used in a shopping-cart type of scenario, where latency and availability are the main focus (or, to rephrase, can BigTable be used in place of Dynamo, and vice versa)? The answer is that it can be, but it wouldn&#039;t be as good as Dynamo on latency. Dynamo would probably do a lot better, but only because BigTable was not designed to work under such a scenario; its use cases were different. There is no one solution that solves all the problems in the world of distributed file systems: no silver bullet, no one-size-fits-all. Systems are usually designed for specific use cases and work best for them. If need be, they can later be molded to other scenarios, and they may provide good enough performance for those later goals as well, but they will work best for the use cases that were the original targets.&lt;br /&gt;
&lt;br /&gt;
== General talk ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Read the introduction and conclusion for each paper, and think about the use cases in the paper more than about how the authors solve the problem.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_18&amp;diff=19018</id>
		<title>DistOS 2014W Lecture 18</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_18&amp;diff=19018"/>
		<updated>2014-04-19T02:01:39Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Added more Tapestry stuff. Cleaned up the page.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Distributed Hash Tables (March 18)==&lt;br /&gt;
&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Distributed_hash_table Wikipedia&#039;s article on Distributed Hash Tables]&lt;br /&gt;
* [http://pdos.csail.mit.edu/~strib/docs/tapestry/tapestry_jsac03.pdf Zhao et al, &amp;quot;Tapestry: A Resilient Global-Scale Overlay for Service Deployment&amp;quot; (JSAC 2003)]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Distributed Hash Table Overview ==&lt;br /&gt;
&lt;br /&gt;
A Distributed Hash Table (DHT) is a fast lookup structure of &amp;lt;key,value&amp;gt; pairs,&lt;br /&gt;
distributed across many nodes in a network.  Keys are hashed to generate the &lt;br /&gt;
index at which the value can be found.  Depending on the nature of the hash &lt;br /&gt;
function, typically, only exact queries may be returned.  &lt;br /&gt;
&lt;br /&gt;
Usually, each node has a partial view of &lt;br /&gt;
the hash table, as opposed to a full replica. They don&#039;t know exactly which other node is responsible for a given key.  This has given rise to a number &lt;br /&gt;
of different routing techniques:&lt;br /&gt;
* A centralized server may maintain a list of all keys and associated nodes at which the value can be found.  This method involves a single point of failure.&lt;br /&gt;
** eg. Napster&lt;br /&gt;
* Flooding: Each node may query all connected nodes.  This method has performance and scalability shortcomings but had the benefit of being decentralized.&lt;br /&gt;
** eg. Gnutella&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Consistent_hashing Consistent Hashing] The keyspace can be partitioned such that nodes will maintain the values for keys that hash to similar indices (e.g., within a certain hamming distance). Given a query, nodes do not know specifically on which node a key is located, but they do know a few nodes (a proper subset of the network) located &amp;quot;closer&amp;quot; to the key. The query then continues onto the closest node. This seems to be the most popular technique for DHTs. It&#039;s biggest benefit is that nodes can be added and removed without notifying every other node on the network.&lt;br /&gt;
** eg. Tapestry&lt;br /&gt;
&lt;br /&gt;
==Tapestry:==&lt;br /&gt;
Tapestry is an overlay network which makes use of a DHT to provide routing for&lt;br /&gt;
distributed applications.  Similar to IP routing, not all nodes need to be &lt;br /&gt;
directly connected to each other: they can query a subset of neighbours for&lt;br /&gt;
information about which nodes are responsible for certain parts of the keyspace.&lt;br /&gt;
Routing is performed in such a way that nodes are aware of their &#039;&#039;distance&#039;&#039;&lt;br /&gt;
to the object being queried.  Hence objects can be located with low latency&lt;br /&gt;
without the need to migrate actual object data between nodes. &lt;br /&gt;
&lt;br /&gt;
Tapestry was built for Oceanstore. Oceanstore was built for the open internet. Nodes would be constantly added and removed. Chances are, the network topology would change. That&#039;s why you&#039;d need a dynamic routing system.&lt;br /&gt;
&lt;br /&gt;
* Decentralized and P2P. Self organizing.&lt;br /&gt;
* Distributed.&lt;br /&gt;
* Simple key-value store.&lt;br /&gt;
* The lookup table contains keys and values.&lt;br /&gt;
* DNS is structured as a tree, but Tapestry is hierarchically structured.&lt;br /&gt;
* How does the information flow? Each node has a neighbour table that contains its neighbours&#039; IDs.&lt;br /&gt;
** From initialization, each node has a locally optimal routing table that it maintains&lt;br /&gt;
** Routing happens digit by digit&lt;br /&gt;
&lt;br /&gt;
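The digit-by-digit routing in the bullets above can be sketched as a single routing step. The routing-table layout used here (`table[prefix_len][next_digit] -> neighbour ID`) is a simplification of Tapestry's neighbour maps, invented for the example.

```python
def next_hop(local_id: str, dest_id: str, routing_table):
    """One step of Tapestry-style prefix routing.

    IDs are strings of digits.  The table at level i lists, for each
    possible next digit d, a neighbour whose ID shares i digits of
    prefix with ours and has digit d in position i.  Each hop resolves
    one more digit of the destination ID, so a route takes at most
    len(dest_id) hops.
    """
    for i, (a, b) in enumerate(zip(local_id, dest_id)):
        if a != b:
            # First digit where we diverge: forward to a neighbour
            # that matches the destination one digit further.
            return routing_table[i][b]
    return local_id  # IDs match fully: we are the destination
```

Because every node only keeps neighbours for each (prefix length, digit) slot, routing state stays logarithmic in the ID space, unlike Dynamo's full-membership 0-hop approach.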
* Tapestry API:&lt;br /&gt;
** It has four operations: PublishObject, UnpublishObject, RouteToObject, and RouteToNode.&lt;br /&gt;
** Each node has an ID and each endpoint object has a GUID (Globally Unique Identifier).&lt;br /&gt;
&lt;br /&gt;
* Tapestry looks like an operating system.&lt;br /&gt;
** It has two variants: one built on the UDP protocol and the other on TCP.&lt;br /&gt;
&lt;br /&gt;
Fun fact, it is now called [http://current.cs.ucsb.edu/projects/chimera/ Chimera].&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_18&amp;diff=19017</id>
		<title>DistOS 2014W Lecture 18</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_18&amp;diff=19017"/>
		<updated>2014-04-19T01:31:05Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Added info about DHT&amp;#039;s and tapestry&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Distributed Hash Tables (March 18)==&lt;br /&gt;
&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Distributed_hash_table Wikipedia&#039;s article on Distributed Hash Tables]&lt;br /&gt;
* [http://pdos.csail.mit.edu/~strib/docs/tapestry/tapestry_jsac03.pdf Zhao et al, &amp;quot;Tapestry: A Resilient Global-Scale Overlay for Service Deployment&amp;quot; (JSAC 2003)]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Distributed Hash Table Overview ==&lt;br /&gt;
&lt;br /&gt;
A Distributed Hash Table (DHT) is a fast lookup structure of &amp;lt;key,value&amp;gt; pairs,&lt;br /&gt;
distributed across many nodes in a network.  Keys are hashed to generate the &lt;br /&gt;
index at which the value can be found.  Depending on the nature of the hash &lt;br /&gt;
function, typically, only exact queries may be returned.  &lt;br /&gt;
&lt;br /&gt;
Usually, each node has a partial view of &lt;br /&gt;
the hash table, as opposed to a full replica. They don&#039;t know exactly which other node is responsible for a given key.  This has given rise to a number &lt;br /&gt;
of different routing techniques:&lt;br /&gt;
* A centralized server may maintain a list of all keys and associated nodes at which the value can be found.  This method involves a single point of failure. &lt;br /&gt;
** eg. Napster&lt;br /&gt;
* Flooding: Each node may query all connected nodes.  This method has performance and scalability shortcomings but had the benefit of being decentralized.&lt;br /&gt;
** eg. Gnutella&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Consistent_hashing Consistent Hashing] The keyspace can be partitioned such that nodes will maintain the values for keys that hash to similar indices (e.g., within a certain hamming distance). Given a query, nodes do not know specifically on which node a key is located, but they do know a few nodes (a proper subset of the network) located &amp;quot;closer&amp;quot; to the key. The query then continues onto the closest node.&lt;br /&gt;
** eg. Tapestry&lt;br /&gt;
&lt;br /&gt;
==Tapestry:==&lt;br /&gt;
Tapestry is an overlay network which makes use of a DHT to provide routing for&lt;br /&gt;
distributed applications.  Similar to IP routing, not all nodes need to be &lt;br /&gt;
directly connected to each other: they can query a subset of neighbours for&lt;br /&gt;
information about which nodes are responsible for certain parts of the keyspace.&lt;br /&gt;
Routing is performed in such a way that nodes are aware of their &#039;&#039;distance&#039;&#039;&lt;br /&gt;
to the object being queried.  Hence objects can be located with low latency&lt;br /&gt;
without the need to migrate actual object data between nodes.  Tapestry has been used in some academic applications such as OceanStore.&lt;br /&gt;
&lt;br /&gt;
* Tapestry:&lt;br /&gt;
** Distributed.&lt;br /&gt;
** Simple key-value store.&lt;br /&gt;
** Uses a DHT (distributed hash table).&lt;br /&gt;
** The lookup table contains keys and values.&lt;br /&gt;
** DNS is structured as a tree, but Tapestry is hierarchically structured.&lt;br /&gt;
&lt;br /&gt;
* More details about Tapestry:&lt;br /&gt;
** How does the information flow?&lt;br /&gt;
** Each node has a neighbour table that contains its neighbours&#039; IDs.&lt;br /&gt;
&lt;br /&gt;
* Tapestry API:&lt;br /&gt;
** It has four operations: PublishObject, UnpublishObject, RouteToObject, and RouteToNode.&lt;br /&gt;
** Each node has an ID and each endpoint has a GUID (Globally Unique Identifier).&lt;br /&gt;
&lt;br /&gt;
* Tapestry looks like an operating system.&lt;br /&gt;
** It has two variants: one built on the UDP protocol and the other on TCP.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_16&amp;diff=19016</id>
		<title>DistOS 2014W Lecture 16</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_16&amp;diff=19016"/>
		<updated>2014-04-19T00:33:43Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: /* Keywords */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Public Resource Computing&lt;br /&gt;
&lt;br /&gt;
== Outline for upcoming lectures ==&lt;br /&gt;
&lt;br /&gt;
All the papers to be covered in upcoming lectures have been posted on the wiki. These papers will be more difficult than the papers we have covered so far, so we should be prepared to allot more time to studying them and come to class prepared. We may abandon the group discussion format; instead, everyone will ask questions about what they did not understand from the paper, which should allow us to discuss the technical details better.&lt;br /&gt;
The professor will not be teaching the next class; instead, our TA will discuss two papers on how to conduct a literature survey, which should help with our projects. &lt;br /&gt;
The rest of the papers will deal with many closely related systems. In particular, we will be looking at distributed hash tables and systems that use distributed hash tables.&lt;br /&gt;
&lt;br /&gt;
After looking at the material from today, we will also be looking at how we can get the kind of distribution that we get with public resource computing, but with greater flexibility.&lt;br /&gt;
&lt;br /&gt;
== Project proposal==&lt;br /&gt;
There were 11 proposals, of which the professor found 4 to be in a state worth accepting and graded them 10/10. The professor has emailed everyone with feedback on their project proposals so that we can incorporate those comments and resubmit by the coming Saturday (the extended deadline). The deadline was extended so that everyone can work out the flaws in their proposals and get the best grade (10/10).&lt;br /&gt;
Project presentations are to be held on April 1st and 3rd. People who got 10/10 should be ready to present on Tuesday, as they are ahead and better prepared; there should be 6 presentations on Tuesday and the rest on Thursday.&lt;br /&gt;
Undergrads will have their final exam on April 24th, which is also the date to turn in the final project report.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Public Resource Computing (March 11)==&lt;br /&gt;
&lt;br /&gt;
* Anderson et al., &amp;quot;SETI@home: An Experiment in Public-Resource Computing&amp;quot; (CACM 2002) [http://dx.doi.org/10.1145/581571.581573 (DOI)] [http://dl.acm.org.proxy.library.carleton.ca/citation.cfm?id=581573 (Proxy)]&lt;br /&gt;
* Anderson, &amp;quot;BOINC: A System for Public-Resource Computing and Storage&amp;quot; (Grid Computing 2004) [http://dx.doi.org/10.1109/GRID.2004.14 (DOI)] [http://ieeexplore.ieee.org.proxy.library.carleton.ca/stamp/stamp.jsp?tp=&amp;amp;arnumber=1382809 (Proxy)]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Keywords ===&lt;br /&gt;
BOINC &amp;amp; SETI@home: lowered entry barriers, master/slave relationship, work units, [http://en.wikipedia.org/wiki/Embarrassingly_parallel embarrassingly parallel], inverted use cases, gamification, redundant computing, consensual botnets, centralized authority, untrusted clients, replication as reliability, exponential backoff, limited server reliability engineering.&lt;br /&gt;
&lt;br /&gt;
Embarrassingly parallel: ease of parallelization, communication-to-computation ratio, discrete work units, abstractions help, MapReduce.&lt;br /&gt;
&lt;br /&gt;
=== Introduction ===&lt;br /&gt;
&lt;br /&gt;
The papers assigned for reading were on SETI@home and BOINC. BOINC is the system SETI@home is built upon; other projects, such as Folding@home, run on the same system. In particular, we want to discuss the following:&lt;br /&gt;
What is public resource computing? How does public resource computing relate to the various computational models and systems that we have seen this semester? How are they similar in design, purpose, and technologies? How is it different?&lt;br /&gt;
 &lt;br /&gt;
The main purpose of public resource computing was to have a universally accessible, easy-to-use way of sharing resources. This is interesting as it differs from some of the systems we have looked at, which deal with the sharing of information rather than resources. &lt;br /&gt;
&lt;br /&gt;
For computational parallelism, you need a highly parallel problem. SETI@home and folding@home give examples of such problems. In public resource computing, particularly with the BOINC system, you divide the problem into work units. People voluntarily install the clients on their machines, running the program to work on work units that are sent to their clients in return for credits. &lt;br /&gt;
&lt;br /&gt;
In the past, it has been institutions, such as universities, running services with other people connecting in to use said service. Public resource computing turns this use case on its head, having the institution (e.g., the university) be the one using the service while other people contribute to said service voluntarily. In the file systems we have covered so far, people want access to the files stored in a network system; here, a system wants access to people&#039;s machines to utilize their processing power. &lt;br /&gt;
&lt;br /&gt;
Since they are contributing voluntarily, how do you make these users care about the system if something were to happen? The gamification of the system causes many users to become invested in the system. People are doing work for credits and those with the most credits are showcased as major contributors. They can also see the amount of resources (e.g., process cycles) they have devoted to the cause on the GUI of the installed client. When the client produces results for the work unit it was processing, it sends the result to the server.&lt;br /&gt;
&lt;br /&gt;
Important to the design of the BOINC platform is that it can be easily deployed by scientists (i.e., non-IT specialists). It was meant to lower the entry barrier for the types of scientific computing that lend themselves to being embarrassingly parallel. The platform used a simple design with commodity software (PHP, Python, MySQL).&lt;br /&gt;
&lt;br /&gt;
For fault tolerance against problems such as malicious clients or faulty processors, redundant computing is done: work units are processed multiple times.&lt;br /&gt;
Work units are later retired from the clients in one of two cases:&lt;br /&gt;
# The server receives the expected number of results, &#039;&#039;&#039;n&#039;&#039;&#039;, for a work unit, and takes the answer that the majority gave.&lt;br /&gt;
# The server has transmitted a work unit &#039;&#039;&#039;m&#039;&#039;&#039; times and has not gotten back the &#039;&#039;&#039;n&#039;&#039;&#039; expected responses. &lt;br /&gt;
It should be noted that, in doing this, it is possible that some work units are never processed. The probability of this happening can be reduced by increasing the value of &#039;&#039;&#039;m&#039;&#039;&#039;, though.&lt;br /&gt;
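As a hypothetical sketch of the retirement rule above (the function name validate_results and the strict-majority rule are illustrative assumptions, not BOINC&#039;s actual implementation):&lt;br /&gt;

```python
from collections import Counter

def validate_results(results, n):
    # Hypothetical sketch of redundant computing: once n results arrive
    # for a work unit, accept the strict-majority answer.
    if len(results) != n:
        return None  # still waiting on replicas
    answer, votes = Counter(results).most_common(1)[0]
    if votes * 2 > n:  # strict majority, so one faulty client cannot win
        return answer
    return None  # no consensus; the work unit must be reissued
```

With three replicas, validate_results([42, 42, 7], 3) accepts 42, while validate_results([1, 2, 3], 3) finds no majority, so that work unit would be reissued.&lt;br /&gt;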
&lt;br /&gt;
In the case of SETI@Home, the amount of available work units is fixed. The system scales by increasing the amount of redundant computing. If more clients join the system, they just end up getting the same work units.&lt;br /&gt;
&lt;br /&gt;
=== Comparison to Botnets ===&lt;br /&gt;
So, given all this, how would we generally define public resource computing/public interest computing? It is essentially using the public as a resource--you are voluntarily giving up your extra compute cycles for projects (this is a little like donating blood--public resource computing is a vampire). Looking at public resource computing like this, we can contrast it with a botnet. What is the difference? Both systems are utilizing client machines to perform or aid in some task. &lt;br /&gt;
&lt;br /&gt;
The answer: consent.&lt;br /&gt;
&lt;br /&gt;
You are consensually contributing to a project rather than being (unknowingly) forced to. Other differences are the ends/resources that you want as well as reliability. With a botnet, you can trust that a higher proportion of your users are following your commands exactly (as they have no idea they are performing them). Whereas, in public resource computing, how can you guarantee that clients are doing what you want? You can&#039;t. You can only verify the results.&lt;br /&gt;
  &lt;br /&gt;
=== General Comparisons ===&lt;br /&gt;
A basic comparison with the other file systems we have covered so far:&lt;br /&gt;
&lt;br /&gt;
# Inverted use cases. In the file systems we have covered so far, clients want access to the files stored in a network system; here, a system wants access to clients&#039; machines to utilize their processing power. There is an inverted flow.&lt;br /&gt;
# Other file systems were about many clients sharing data; here it is more about sharing processing power. In Folding@home, the system can store some of its data on clients&#039; storage, but that is not public resource computing&#039;s main focus.&lt;br /&gt;
# It is nothing like systems such as OceanStore, where there is no centralized authority. In BOINC, the master/slave relationship between the centralized server and the clients installed across users&#039; machines is still visible; in that sense it is more like GFS, which also had a centralized metadata server.&lt;br /&gt;
# Public resource systems are like botnets, but people install these clients with consent, and there is no need for communication between the clients (it is not a peer-to-peer network). The clients could be made to communicate peer-to-peer, but that would risk security, as clients are not trusted in the network.&lt;br /&gt;
# Skype was modelled much like a public resource computing network (before Microsoft took over). The whole model of Skype was that the infrastructure just ran on the computers of those who had downloaded the clients (like a consensual botnet). Once a person downloaded the client, they would be part of this system. As with public resource computing, you would donate some of your resources in order to support the distributed infrastructure. It was not assumed that everyone was reliable, but rather that some people are reliable some of the time. The network would choose super nodes to act as routers; these would be the machines with higher reliability and better processing power. After Microsoft&#039;s takeover, the supernodes were centralized and supernode election was removed from the system.&lt;br /&gt;
&lt;br /&gt;
=== Trust Model and Fault Tolerance ===&lt;br /&gt;
&lt;br /&gt;
In this central model, you have a central resource and distribute work to clients who process the work and send back results. Once they do, you can send them more work. In this model, can you trust the client to complete the computation successfully? The answer is not necessarily--there could be untrustworthy clients sending back rubbish answers.&lt;br /&gt;
&lt;br /&gt;
So, how does SETI@home address the question of fault tolerance? It uses replication for reliability and redundant computing. Work units are assigned to multiple clients, and the results returned to the server can be analyzed to find outliers in order to detect malicious users--but that only addresses fault tolerance from the client&#039;s perspective. &lt;br /&gt;
&lt;br /&gt;
However, SETI@home has a centralized server, which can go down; when it does, it uses exponential backoff to push back the clients and ask them to wait before sending their results again. Otherwise, whenever the server came back up, many clients might try to access it at once and crash it again--essentially, the server would have manufactured its own DDoS attack due to its own inadequacies. This exponential backoff approach is similar to the one used for TCP congestion control.  &lt;br /&gt;
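A minimal sketch of such client-side exponential backoff with jitter (hypothetical; the names backoff_delays, base, and cap are illustrative, not taken from the BOINC client):&lt;br /&gt;

```python
import random

def backoff_delays(attempts, base=1.0, cap=3600.0):
    # Hypothetical sketch: after each failed upload the waiting ceiling
    # roughly doubles, and the actual delay is drawn uniformly below it
    # (jitter), so a recovering server is not hit by every client at once.
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0, ceiling))
    return delays
```

Capping the ceiling keeps long outages from producing absurd waits, while the jitter spreads the retry storm out in time.&lt;br /&gt;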
&lt;br /&gt;
It can be noted that there is almost no reliability engineering here, though. These are just standard servers running with one backup that is manually failed over to. This can give an idea of how asymmetric the relationship is. &lt;br /&gt;
&lt;br /&gt;
One way to explain this is to look at the actual service and who is running it. Reliability for a service matters most when a large number of people use the service and, hence, would be upset were the service to go down. In this case, it&#039;s the university using the service, and clients are helping out by providing resources. If the service goes down, it is the university&#039;s fault, and they can deal with it themselves. It is interesting to compare this strategy to highly reliable systems like Ceph or OceanStore, which could recover the data in case a node crashes.&lt;br /&gt;
&lt;br /&gt;
The idea of redundancy relates to OceanStore a little, but how would OceanStore map onto this idea of public resource computing? In place of the OceanStore metadata cluster, there is a central server. In place of the data store, there are machines doing computation. What maps onto public resource computing specifically is the notion of having one central node and a bunch of outlying nodes. This is very much a master/slave relationship, though it is a voluntary one. In this relationship, CPU cycles are cheap but bandwidth is expensive, which is why work units are sent infrequently. Storage is in between--sometimes data is pushed to the clients. When this is done, the resemblance of public resource computing to OceanStore is stronger.&lt;br /&gt;
&lt;br /&gt;
=== Embarrassingly Parallel ===&lt;br /&gt;
&lt;br /&gt;
When you are doing parallel computations, you have to do a mixture of computation and communication. You&#039;re doing computation separately, but you always have to do some communication. But, how much communication do you have to do for every unit of computation? In some cases, there are many dependencies meaning that a high amount of communication is required (e.g., weather system simulations).&lt;br /&gt;
&lt;br /&gt;
Embarrassingly parallel means that a given problem requires a minimum of communication between the pieces of work. This typically means that you have a bunch of data that you want to analyze, and it&#039;s all independent. Because of this, you can just split up and distribute the work for analysis. In an embarrassingly parallel problem, speedup is trivial due to the minimal communication: the more processors you add, the faster the system will run. For problems that are not embarrassingly parallel, however, the system can actually slow down when more processors are added, as more communication is required. With distributed systems, you either need to accept communication costs or modify abstractions to get closer to an embarrassingly parallel system. Since speedup is trivial when the problem is embarrassingly parallel, you don&#039;t get much praise for achieving it.&lt;br /&gt;
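As a toy sketch of why embarrassingly parallel work splits so cleanly (the names analyze, run, and unit_size are hypothetical; a real project would add scheduling, credits, and result validation):&lt;br /&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def analyze(chunk):
    # Stand-in for the per-work-unit analysis (e.g., signal processing).
    return sum(x * x for x in chunk)

def run(data, workers=4, unit_size=1000):
    # Hypothetical sketch: split independent data into work units and map
    # them across workers. The only communication is handing out units and
    # collecting results, which is what makes it embarrassingly parallel.
    units = [data[i:i + unit_size] for i in range(0, len(data), unit_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(analyze, units))
```

Because the units are independent, adding workers shortens the run without adding any communication between them.&lt;br /&gt;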
&lt;br /&gt;
SETI@home is an example of an &amp;quot;embarrassingly parallel&amp;quot; workload. The inherent nature of the problem lends itself to being divided into work units and computed in parallel without any need to consolidate intermediate results. It is called &amp;quot;embarrassingly parallel&amp;quot; because there is little to no effort required to distribute the workload in parallel.  &lt;br /&gt;
&lt;br /&gt;
One more example of an &amp;quot;embarrassingly parallel&amp;quot; workload among what we have covered so far could be web indexing on GFS. Any file system we have discussed so far that doesn&#039;t trust its clients could be modelled to work as a public sharing system.&lt;br /&gt;
&lt;br /&gt;
Note: Public resource computing is also very similar to MapReduce, which we will be discussing later in the course. Make sure to keep public resource computing in mind when we reach it.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_16&amp;diff=19015</id>
		<title>DistOS 2014W Lecture 16</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_16&amp;diff=19015"/>
		<updated>2014-04-18T22:06:39Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Keywords. Added a few paragraphs.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Public Resource Computing&lt;br /&gt;
&lt;br /&gt;
== Outline for upcoming lectures ==&lt;br /&gt;
&lt;br /&gt;
All the papers to be covered in upcoming lectures have been posted on the wiki. These papers will be more difficult than the papers we have covered so far, so we should be prepared to allot more time to studying them and come prepared to class. We may abandon the group discussion format; instead, everyone would ask questions about what they did not understand from the paper, which would allow us to discuss the technical details better.&lt;br /&gt;
The professor will not be teaching the next class; instead, our TA will discuss two papers on how to conduct a literature survey, which should help with our projects. &lt;br /&gt;
The rest of the papers will deal with many closely related systems. In particular, we will be looking at distributed hash tables and systems that use distributed hash tables.&lt;br /&gt;
&lt;br /&gt;
After looking at the material from today, we will also be looking at how we can get the kind of distribution that we get with public resource computing, but with greater flexibility.&lt;br /&gt;
&lt;br /&gt;
== Project proposal==&lt;br /&gt;
There were 11 proposals, of which the professor found 4 ready to be accepted and graded them 10/10. The professor has emailed everyone feedback on their project proposal so that the comments can be incorporated and the proposals resubmitted by the coming Saturday (the extended deadline). The deadline has been extended so that everyone can work out the flaws in their proposal and get the best grade (10/10).&lt;br /&gt;
Project presentations are to be held on April 1st and 3rd. Those who got 10/10 should be ready to present on Tuesday, as they are ahead and better prepared; there should be 6 presentations on Tuesday and the rest on Thursday.&lt;br /&gt;
Undergraduates will have their final exam on April 24th, which is also the date to turn in the final project report.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Public Resource Computing (March 11)==&lt;br /&gt;
&lt;br /&gt;
* Anderson et al., &amp;quot;SETI@home: An Experiment in Public-Resource Computing&amp;quot; (CACM 2002) [http://dx.doi.org/10.1145/581571.581573 (DOI)] [http://dl.acm.org.proxy.library.carleton.ca/citation.cfm?id=581573 (Proxy)]&lt;br /&gt;
* Anderson, &amp;quot;BOINC: A System for Public-Resource Computing and Storage&amp;quot; (Grid Computing 2004) [http://dx.doi.org/10.1109/GRID.2004.14 (DOI)] [http://ieeexplore.ieee.org.proxy.library.carleton.ca/stamp/stamp.jsp?tp=&amp;amp;arnumber=1382809 (Proxy)]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Keywords ===&lt;br /&gt;
BOINC &amp;amp; SETI@Home: Lowered entry barriers, master/slave relationship, work units, [http://en.wikipedia.org/wiki/Embarrassingly_parallel embarrassingly parallel], inverted use-cases, gamification, redundant computing, consensual botnets, centralized authority, untrusted clients, replication as reliability, exponential backoff, limited server reliability engineering.&lt;br /&gt;
&lt;br /&gt;
Embarrassingly parallel: ease of parallelization, communication-to-computation ratio, discrete work units, abstractions help, MapReduce.&lt;br /&gt;
&lt;br /&gt;
=== Introduction ===&lt;br /&gt;
&lt;br /&gt;
The papers assigned for reading were on SETI@home and BOINC. BOINC is the system SETI@home is built upon; other projects, such as Folding@home, run on the same system. In particular, we want to discuss the following:&lt;br /&gt;
What is public resource computing? How does it relate to the various computational models and systems that we have seen this semester? How is it similar in design, purpose, and technologies? How is it different?&lt;br /&gt;
 &lt;br /&gt;
The main purpose of public resource computing was to have a universally accessible, easy-to-use way of sharing resources. This is interesting as it differs from some of the systems we have looked at, which deal with the sharing of information rather than resources. &lt;br /&gt;
&lt;br /&gt;
For computational parallelism, you need a highly parallel problem. SETI@home and folding@home give examples of such problems. In public resource computing, particularly with the BOINC system, you divide the problem into work units. People voluntarily install the clients on their machines, running the program to work on work units that are sent to their clients in return for credits. &lt;br /&gt;
&lt;br /&gt;
In the past, it has been institutions, such as universities, running services with other people connecting in to use said service. Public resource computing turns this use case on its head, having the institution (e.g., the university) be the one using the service while other people contribute to said service voluntarily. In the file systems we have covered so far, people want access to the files stored in a network system; here, a system wants access to people&#039;s machines to utilize their processing power. &lt;br /&gt;
&lt;br /&gt;
Since they are contributing voluntarily, how do you make these users care about the system if something were to happen? The gamification of the system causes many users to become invested in the system. People are doing work for credits and those with the most credits are showcased as major contributors. They can also see the amount of resources (e.g., process cycles) they have devoted to the cause on the GUI of the installed client. When the client produces results for the work unit it was processing, it sends the result to the server.&lt;br /&gt;
&lt;br /&gt;
Important to the design of the BOINC platform is that it can be easily deployed by scientists (i.e., non-IT specialists). It was meant to lower the entry barrier for the types of scientific computing that lend themselves to being embarrassingly parallel. The platform used a simple design with commodity software (PHP, Python, MySQL).&lt;br /&gt;
&lt;br /&gt;
For fault tolerance against problems such as malicious clients or faulty processors, redundant computing is done: work units are processed multiple times.&lt;br /&gt;
Work units are later retired from the clients in one of two cases:&lt;br /&gt;
# The server receives the expected number of results, &#039;&#039;&#039;n&#039;&#039;&#039;, for a work unit, and takes the answer that the majority gave.&lt;br /&gt;
# The server has transmitted a work unit &#039;&#039;&#039;m&#039;&#039;&#039; times and has not gotten back the &#039;&#039;&#039;n&#039;&#039;&#039; expected responses. &lt;br /&gt;
It should be noted that, in doing this, it is possible that some work units are never processed. The probability of this happening can be reduced by increasing the value of &#039;&#039;&#039;m&#039;&#039;&#039;, though.&lt;br /&gt;
&lt;br /&gt;
In the case of SETI@Home, the amount of available work units is fixed. The system scales by increasing the amount of redundant computing. If more clients join the system, they just end up getting the same work units.&lt;br /&gt;
&lt;br /&gt;
=== Comparison to Botnets ===&lt;br /&gt;
So, given all this, how would we generally define public resource computing/public interest computing? It is essentially using the public as a resource--you are voluntarily giving up your extra compute cycles for projects (this is a little like donating blood--public resource computing is a vampire). Looking at public resource computing like this, we can contrast it with a botnet. What is the difference? Both systems are utilizing client machines to perform or aid in some task. &lt;br /&gt;
&lt;br /&gt;
The answer: consent.&lt;br /&gt;
&lt;br /&gt;
You are consensually contributing to a project rather than being (unknowingly) forced to. Other differences are the ends/resources that you want as well as reliability. With a botnet, you can trust that a higher proportion of your users are following your commands exactly (as they have no idea they are performing them). Whereas, in public resource computing, how can you guarantee that clients are doing what you want? You can&#039;t. You can only verify the results.&lt;br /&gt;
  &lt;br /&gt;
=== General Comparisons ===&lt;br /&gt;
A basic comparison with the other file systems we have covered so far:&lt;br /&gt;
&lt;br /&gt;
# Inverted use cases. In the file systems we have covered so far, clients want access to the files stored in a network system; here, a system wants access to clients&#039; machines to utilize their processing power. There is an inverted flow.&lt;br /&gt;
# Other file systems were about many clients sharing data; here it is more about sharing processing power. In Folding@home, the system can store some of its data on clients&#039; storage, but that is not public resource computing&#039;s main focus.&lt;br /&gt;
# It is nothing like systems such as OceanStore, where there is no centralized authority. In BOINC, the master/slave relationship between the centralized server and the clients installed across users&#039; machines is still visible; in that sense it is more like GFS, which also had a centralized metadata server.&lt;br /&gt;
# Public resource systems are like botnets, but people install these clients with consent, and there is no need for communication between the clients (it is not a peer-to-peer network). The clients could be made to communicate peer-to-peer, but that would risk security, as clients are not trusted in the network.&lt;br /&gt;
# Skype was modelled much like a public resource computing network (before Microsoft took over). The whole model of Skype was that the infrastructure just ran on the computers of those who had downloaded the clients (like a consensual botnet). Once a person downloaded the client, they would be part of this system. As with public resource computing, you would donate some of your resources in order to support the distributed infrastructure. It was not assumed that everyone was reliable, but rather that some people are reliable some of the time. The network would choose super nodes to act as routers; these would be the machines with higher reliability and better processing power. After Microsoft&#039;s takeover, the supernodes were centralized and supernode election was removed from the system.&lt;br /&gt;
&lt;br /&gt;
=== Trust Model and Fault Tolerance ===&lt;br /&gt;
&lt;br /&gt;
In this central model, you have a central resource and distribute work to clients who process the work and send back results. Once they do, you can send them more work. In this model, can you trust the client to complete the computation successfully? The answer is not necessarily--there could be untrustworthy clients sending back rubbish answers.&lt;br /&gt;
&lt;br /&gt;
So, how does SETI@home address the question of fault tolerance? It uses replication for reliability and redundant computing. Work units are assigned to multiple clients, and the results returned to the server can be analyzed to find outliers in order to detect malicious users--but that only addresses fault tolerance from the client&#039;s perspective. &lt;br /&gt;
&lt;br /&gt;
However, SETI@home has a centralized server, which can go down; when it does, it uses exponential backoff to push back the clients and ask them to wait before sending their results again. Otherwise, whenever the server came back up, many clients might try to access it at once and crash it again--essentially, the server would have manufactured its own DDoS attack due to its own inadequacies. This exponential backoff approach is similar to the one used for TCP congestion control.  &lt;br /&gt;
&lt;br /&gt;
It can be noted that there is almost no reliability engineering here, though. These are just standard servers running with one backup that is manually failed over to. This can give an idea of how asymmetric the relationship is. &lt;br /&gt;
&lt;br /&gt;
One way to explain this is to look at the actual service and who is running it. Reliability for a service matters most when a large number of people use the service and, hence, would be upset were the service to go down. In this case, it&#039;s the university using the service, and clients are helping out by providing resources. If the service goes down, it is the university&#039;s fault, and they can deal with it themselves. It is interesting to compare this strategy to highly reliable systems like Ceph or OceanStore, which could recover the data in case a node crashes.&lt;br /&gt;
&lt;br /&gt;
The idea of redundancy relates to OceanStore a little, but how would OceanStore map onto this idea of public resource computing? In place of the OceanStore metadata cluster, there is a central server. In place of the data store, there are machines doing computation. What maps onto public resource computing specifically is the notion of having one central node and a bunch of outlying nodes. This is very much a master/slave relationship, though it is a voluntary one. In this relationship, CPU cycles are cheap but bandwidth is expensive, which is why work units are sent infrequently. Storage is in between--sometimes data is pushed to the clients. When this is done, the resemblance of public resource computing to OceanStore is stronger.&lt;br /&gt;
&lt;br /&gt;
=== Embarrassingly Parallel ===&lt;br /&gt;
&lt;br /&gt;
When you are doing parallel computations, you have to do a mixture of computation and communication. You&#039;re doing computation separately, but you always have to do some communication. But, how much communication do you have to do for every unit of computation? In some cases, there are many dependencies meaning that a high amount of communication is required (e.g., weather system simulations).&lt;br /&gt;
&lt;br /&gt;
Embarrassingly parallel means that a given problem requires a minimum of communication between the pieces of work. This typically means that you have a bunch of data that you want to analyze, and it&#039;s all independent. Because of this, you can just split up and distribute the work for analysis. In an embarrassingly parallel problem, speedup is trivial due to the minimal communication: the more processors you add, the faster the system will run. For problems that are not embarrassingly parallel, however, the system can actually slow down when more processors are added, as more communication is required. With distributed systems, you either need to accept communication costs or modify abstractions to get closer to an embarrassingly parallel system. Since speedup is trivial when the problem is embarrassingly parallel, you don&#039;t get much praise for achieving it.&lt;br /&gt;
&lt;br /&gt;
SETI@home is an example of an &amp;quot;embarrassingly parallel&amp;quot; workload. The inherent nature of the problem lends itself to being divided into work units and computed in parallel without any need to consolidate intermediate results. It is called &amp;quot;embarrassingly parallel&amp;quot; because there is little to no effort required to distribute the workload in parallel.  &lt;br /&gt;
&lt;br /&gt;
One more example of an &amp;quot;embarrassingly parallel&amp;quot; workload among what we have covered so far could be web indexing on GFS. Any file system we have discussed so far that doesn&#039;t trust its clients could be modelled to work as a public sharing system.&lt;br /&gt;
&lt;br /&gt;
Note: Public resource computing is also very similar to MapReduce, which we will be discussing later in the course. Make sure to keep public resource computing in mind when we reach it.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_16&amp;diff=19014</id>
		<title>DistOS 2014W Lecture 16</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_16&amp;diff=19014"/>
		<updated>2014-04-18T21:29:07Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Typos&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Public Resource Computing&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Outline for upcoming lectures ==&lt;br /&gt;
&lt;br /&gt;
All the papers to be covered in upcoming lectures have been posted on the wiki. These papers will be more difficult than the papers we have covered so far, so we should be prepared to allot more time to studying them and come prepared to class. We may abandon the group discussion format; instead, everyone would ask questions about what they did not understand from the paper, which would allow us to discuss the technical details better.&lt;br /&gt;
The professor will not be teaching the next class; instead, our TA will discuss two papers on how to conduct a literature survey, which should help with our projects. &lt;br /&gt;
The rest of the papers will deal with many closely related systems. In particular, we will be looking at distributed hash tables and systems that use distributed hash tables.&lt;br /&gt;
&lt;br /&gt;
After looking at the material from today, we will also be looking at how we can get the kind of distribution that we get with public resource computing, but with greater flexibility.&lt;br /&gt;
&lt;br /&gt;
== Project proposal==&lt;br /&gt;
There were 11 proposals, of which the professor found 4 ready to be accepted and graded them 10/10. The professor has emailed everyone feedback on their project proposal so that the comments can be incorporated and the proposals resubmitted by the coming Saturday (the extended deadline). The deadline has been extended so that everyone can work out the flaws in their proposal and get the best grade (10/10).&lt;br /&gt;
Project presentations are to be held on April 1st and 3rd. Those who got 10/10 should be ready to present on Tuesday, as they are ahead and better prepared; there should be 6 presentations on Tuesday and the rest on Thursday.&lt;br /&gt;
Undergraduates will have their final exam on April 24th, which is also the date to turn in the final project report.&lt;br /&gt;
&lt;br /&gt;
== Public Resource Computing  ==&lt;br /&gt;
&lt;br /&gt;
The papers assigned for reading were on SETI@home and BOINC. BOINC is the system SETI@home is built upon; other projects, such as Folding@home, run on the same system. In particular, we want to discuss the following:&lt;br /&gt;
What is public resource computing? How does it relate to the various computational models and systems that we have seen this semester? How is it similar in design, purpose, and technologies? How is it different?&lt;br /&gt;
 &lt;br /&gt;
The main purpose of public resource computing was to have a universally accessible, easy-to-use way of sharing resources. This is interesting as it differs from some of the systems we have looked at, which deal with the sharing of information rather than resources. &lt;br /&gt;
&lt;br /&gt;
For computational parallelism, you need a highly parallel problem. SETI@home and Folding@home give examples of such problems. In public resource computing, particularly with the BOINC system, you divide the problem into work units. People voluntarily install the clients on their machines, running the program to work on work units that are sent to their clients in return for credits. In the past, it has been institutions, such as universities, running services with other people connecting in to use said service. Public resource computing turns this use case on its head, having the institution (e.g., the university) be the one using the service while other people contribute to said service voluntarily. In the file systems we have covered so far, people want access to the files stored in a network system; here, a system wants access to people&#039;s machines to utilize their processing power. &lt;br /&gt;
&lt;br /&gt;
Since they contribute voluntarily, how do you make these users care about the system if something were to happen? Gamification causes many users to become invested: people are doing work for credits, and those with the most credits are showcased as major contributors. They can also see the amount of resources (e.g., processor cycles) they have devoted to the cause in the GUI of the installed client. When the client produces results for the work unit it was processing, it sends the results to the server.&lt;br /&gt;
&lt;br /&gt;
For fault tolerance, against malicious clients or faulty processors, redundant computing is done: work units are processed multiple times.&lt;br /&gt;
A work unit is later retired from the clients in one of two cases:&lt;br /&gt;
# The server receives the expected number of results, &#039;&#039;&#039;n&#039;&#039;&#039;, for the work unit, and takes the answer that the majority gave.&lt;br /&gt;
# The server has transmitted the work unit &#039;&#039;&#039;m&#039;&#039;&#039; times without getting back the &#039;&#039;&#039;n&#039;&#039;&#039; expected responses. &lt;br /&gt;
It should be noted that, in doing this, it is possible that some work units are never processed; the probability of this can be reduced by increasing &#039;&#039;&#039;m&#039;&#039;&#039;, though.&lt;br /&gt;
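A minimal sketch of this retirement logic (illustrative names, not the real BOINC API):&lt;br /&gt;

```python
from collections import Counter

def validate(results, n):
    """Given the results received so far for one work unit, return the
    majority answer once n results are in, else None (keep waiting)."""
    if len(results) < n:
        return None
    answer, votes = Counter(results).most_common(1)[0]
    return answer

def should_retire(results, transmissions, n, m):
    """Retire the work unit if n results have arrived, or if it has
    already been transmitted m times without collecting them."""
    return len(results) >= n or transmissions >= m
```

For example, with n = 3 and results ["a", "a", "b"], the majority answer "a" is accepted; a unit sent out m times with too few responses is dropped, which is exactly why some work units may never be processed.&lt;br /&gt;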
&lt;br /&gt;
=== General Discussion ===&lt;br /&gt;
So, given all this, how would we generally define public resource computing/public interest computing? It is essentially using the public as a resource--you are voluntarily giving up your extra compute cycles for projects (this is a little like donating blood--public resource computing is a vampire). Looking at public resource computing like this, we can contrast it with a botnet. What is the difference? Both systems utilize client machines to perform or aid in some task. The answer: consent. You are consensually contributing to a project rather than being (unknowingly) forced to. Other differences are the ends/resources that you want, as well as reliability. With a botnet, you can trust that a higher proportion of your users are following your commands exactly (as they have no idea they are performing them); whereas, in public resource computing, how can you guarantee that clients are doing what you want?&lt;br /&gt;
  &lt;br /&gt;
=== Comparisons ===&lt;br /&gt;
A basic comparison with the other file systems we have covered so far:&lt;br /&gt;
&lt;br /&gt;
# Use cases have been turned on their head. In the file systems we have covered so far, people would want access to the files stored in a network system; here, a system wants access to people&#039;s machines to utilize their processing power.&lt;br /&gt;
# In other file systems it was about many clients sharing data; here it is more about sharing processing power. In Folding@home, the system can store some of its data on clients&#039; storage, but that is not public resource computing&#039;s main focus.&lt;br /&gt;
# It is nothing like systems such as OceanStore, where there is no centralized authority. In BOINC, the master/slave relationship between the centralized server and the clients installed across users&#039; machines is still visible; in that sense it is more like GFS, which also had a centralized metadata server.&lt;br /&gt;
# Public resource systems are like botnets, but people install these clients with consent and there is no need for communication between the clients (it is not a peer-to-peer network). Clients could be made to communicate peer to peer, but that would risk security, as clients are not trusted in the network.&lt;br /&gt;
# Skype was modelled much like a public resource computing network (before Microsoft took over). The whole model of Skype was that the infrastructure just ran on the computers of those who had downloaded the client (like a consensual botnet). Once a person downloaded the client, they became part of the system. As with public resource computing, you would donate some of your resources in order to support the distributed infrastructure. It was also not assumed that everyone was reliable, but rather that some people are reliable some of the time. The network would choose supernodes to act as routers; these supernodes would be the machines with higher reliability and better processing power. After Microsoft&#039;s takeover, the supernodes were centralized and the supernode-election functionality was removed from the system.&lt;br /&gt;
&lt;br /&gt;
=== Trust Model and Fault Tolerance ===&lt;br /&gt;
&lt;br /&gt;
In this central model, you have a central resource and distribute work to clients, who process the work and send back results. Once they do, you can send them more work. In this model, can you trust the client to complete the computation successfully? Not necessarily--there could be untrustworthy clients sending back rubbish answers.&lt;br /&gt;
&lt;br /&gt;
So, how does SETI address the question of fault tolerance? It uses replication and redundant computing for reliability. Work units are assigned to multiple clients, and the results returned to the server can be analyzed to find outliers in order to detect malicious users; but that only addresses fault tolerance from the client perspective. &lt;br /&gt;
&lt;br /&gt;
However, SETI has a centralized server, which can go down; when it does, it uses exponential back-off to push back the clients, asking them to wait before sending their results again. Otherwise, whenever the server came back up, many clients might try to access it at once and crash it again--essentially, the server would have manufactured its own DDoS attack due to its own inadequacies. This exponential back-off approach is similar to the one adopted to resolve TCP congestion.  &lt;br /&gt;
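A sketch of the back-off idea, assuming a simple doubling delay with random jitter (illustrative only, not SETI@home&#039;s actual code); the jitter spreads clients out so they do not all reconnect at the same instant:&lt;br /&gt;

```python
import random

def backoff_delay(attempt, base=1.0, cap=3600.0):
    """Seconds to wait before retry number `attempt` (0-based): each
    failed attempt doubles the delay, capped at `cap`, then jittered
    downward so clients desynchronize."""
    delay = min(cap, base * (2 ** attempt))
    return delay * random.uniform(0.5, 1.0)
```

After the first failure the client waits up to 1 second, after the fourth up to 8 seconds, and so on up to the cap--the same shape as TCP&#039;s congestion back-off.&lt;br /&gt;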
&lt;br /&gt;
It can be noted that there is almost no reliability engineering here, though. These are just standard servers running with one backup that is manually failed over to. This can give an idea of how asymmetric the relationship is. &lt;br /&gt;
&lt;br /&gt;
One reason for this can be seen by looking at the actual service and who is running it. Reliability matters most when many people use a service and would be upset were it to go down. In this case, it&#039;s the university using the service, and clients are helping out by providing resources. If the service goes down, it is the university&#039;s fault, and the university can deal with it on its own. It is interesting to compare this strategy to highly reliable systems like Ceph or OceanStore, which can recover the data in case a node crashes.&lt;br /&gt;
&lt;br /&gt;
The idea of redundancy relates to OceanStore a little, but how would OceanStore map onto this idea of public resource computing? In place of the OceanStore metadata cluster, there is a central server; in place of the data store, there are machines doing computation. What maps specifically onto this model of public resource computing is the notion of having one central thing and a bunch of outlying nodes. This is very much a master/slave relationship, though a voluntary one. In this relationship, CPU cycles are cheap but bandwidth is expensive, which is why work units are sent infrequently. Storage is in between--sometimes data is pushed to the clients, and when this is done, the resemblance of public resource computing to OceanStore is stronger.&lt;br /&gt;
&lt;br /&gt;
=== Embarrassingly Parallel ===&lt;br /&gt;
&lt;br /&gt;
When you are doing parallel computations, you have to do a mixture of computation and communication. You&#039;re doing computation separately, but you always have to do some communication. But, how much communication do you have to do for every unit of computation? In some cases, there are many dependencies meaning that a high amount of communication is required (e.g., weather system simulations).&lt;br /&gt;
&lt;br /&gt;
Embarrassingly parallel means that a given problem requires a minimum of communication between the pieces of work. This typically means that you have a bunch of data that you want to analyze, and it&#039;s all independent, so you can simply split up and distribute the work for analysis. For an embarrassingly parallel problem, speedup is trivial: because communication is minimal, the more processors you add, the faster the system runs. For problems that are not embarrassingly parallel, however, the system can actually slow down when more processors are added, as more communication is required. With distributed systems, you either need to accept communication costs or modify your abstractions to get closer to an embarrassingly parallel system. Since speedup is trivial when the problem is embarrassingly parallel, you don&#039;t get much praise for achieving it.&lt;br /&gt;
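The shape of an embarrassingly parallel job can be sketched like this (the `analyze` function is a hypothetical stand-in for whatever per-work-unit computation is done): the data is split into independent chunks, each chunk is processed with no communication between workers, and only the final combine step touches all the results.&lt;br /&gt;

```python
def analyze(chunk):
    # stand-in for the per-work-unit computation; fully independent
    return sum(x * x for x in chunk)

def split(data, n_chunks):
    # carve the data into roughly equal, independent work units
    k = max(1, len(data) // n_chunks)
    return [data[i:i + k] for i in range(0, len(data), k)]

def run(data, n_chunks=4):
    # each chunk could go to a different client; no communication is
    # needed between them, only this final one-line combine step
    return sum(analyze(c) for c in split(data, n_chunks))
```

Because the chunks never talk to each other, handing them to more processors (or more volunteer clients) speeds things up almost linearly--which is exactly what makes the speedup "trivial".&lt;br /&gt;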
&lt;br /&gt;
SETI is an example of an &amp;quot;embarrassingly parallel&amp;quot; workload. The inherent nature of the problem lends itself to being divided into work units and computed in parallel without any need to consolidate the results. It is called &amp;quot;embarrassingly parallel&amp;quot; because there is little to no effort required to distribute the workload in parallel.  &lt;br /&gt;
&lt;br /&gt;
One more example of an &amp;quot;embarrassingly parallel&amp;quot; workload in what we have covered so far could be web indexing in GFS. Any file system we have discussed so far that doesn&#039;t trust its clients could be modelled to work as a public sharing system.&lt;br /&gt;
&lt;br /&gt;
Note: Public resource computing is also very similar to MapReduce, which we will be discussing later in the course. Make sure to keep public resource computing in mind when we reach it.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=User:Cdelahou&amp;diff=19013</id>
		<title>User:Cdelahou</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=User:Cdelahou&amp;diff=19013"/>
		<updated>2014-04-18T20:18:08Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;quot;I have no idea where I am or what I&#039;m doing here. What&#039;s my name? Are you my mommy?&amp;quot; - Christian Delahousse&lt;br /&gt;
&lt;br /&gt;
Fourth year CS student. Took [[Operating Systems (Fall 2012)]] and [[Distributed_OS:_Winter_2014]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[http://christian.delahousse.ca Christian Delahousse]&lt;br /&gt;
&lt;br /&gt;
^^^^^^^^^^^^^&lt;br /&gt;
We call that SEO juice.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_15&amp;diff=19012</id>
		<title>DistOS 2014W Lecture 15</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_15&amp;diff=19012"/>
		<updated>2014-04-18T20:16:56Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: added linebreak&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Designing Exercise&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Can we do any kind of distributed system without crypto? We can&#039;t trust crypto...&lt;br /&gt;
&lt;br /&gt;
What are the main features we need to consider for such a system?&lt;br /&gt;
*Limited Sharing&lt;br /&gt;
*Integrity&lt;br /&gt;
*Availability&lt;br /&gt;
&lt;br /&gt;
Perhaps probabilistically...&lt;br /&gt;
&lt;br /&gt;
Want to be able to put data in, have it distributed, and be able to get it out on some other machine. This kind of sharing would need an identification or authentication process.&lt;br /&gt;
&lt;br /&gt;
Availability: &amp;quot;distribute the crap out of it&amp;quot;, doesn&#039;t need crypto. No corruption of data. &lt;br /&gt;
&lt;br /&gt;
Integrity: hashing, but we assume hashes can be forged. If we want to know that we got the same file, then simply send each other the file and compare.&lt;br /&gt;
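A sketch of this crypto-free integrity check (assumed byte-string inputs; the probabilistic variant spot-checks random positions instead of exchanging the whole file):&lt;br /&gt;

```python
import random

def same_file(a: bytes, b: bytes) -> bool:
    # full byte-for-byte comparison: no hashes, nothing to forge
    return a == b

def probably_same(a: bytes, b: bytes, samples=16) -> bool:
    # probabilistic variant: compare `samples` random positions, so
    # less than the whole file has to be sent
    if len(a) != len(b):
        return False
    if not a:
        return True
    for _ in range(samples):
        i = random.randrange(len(a))
        if a[i] != b[i]:
            return False
    return True
```

The probabilistic version can miss a small corruption, but that fits the theme of the exercise: weaker guarantees from weaker assumptions, no crypto required.&lt;br /&gt;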
&lt;br /&gt;
&#039;&#039;&#039;Big Takeaway&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Everything you do with crypto is a refinement of what you can already do in&lt;br /&gt;
weaker forms with weaker assumptions.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note on Project Proposal&#039;&#039;&#039; &lt;br /&gt;
* The date has been extended until next week; as the prof said, some of the proposals are not completely up to the mark.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_15&amp;diff=19011</id>
		<title>DistOS 2014W Lecture 15</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_15&amp;diff=19011"/>
		<updated>2014-04-18T20:16:36Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Added big takeaway&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Designing Exercise&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Can we do any kind of distributed system without crypto? We can&#039;t trust crypto...&lt;br /&gt;
&lt;br /&gt;
What are the main features we need to consider for such a system?&lt;br /&gt;
*Limited Sharing&lt;br /&gt;
*Integrity&lt;br /&gt;
*Availability&lt;br /&gt;
&lt;br /&gt;
Perhaps probabilistically...&lt;br /&gt;
&lt;br /&gt;
Want to be able to put data in, have it distributed, and be able to get it out on some other machine. This kind of sharing would need an identification or authentication process.&lt;br /&gt;
&lt;br /&gt;
Availability: &amp;quot;distribute the crap out of it&amp;quot;, doesn&#039;t need crypto. No corruption of data. &lt;br /&gt;
&lt;br /&gt;
Integrity: hashing, but we assume hashes can be forged. If we want to know that we got the same file, then simply send each other the file and compare.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Big Takeaway&#039;&#039;&#039;&lt;br /&gt;
Everything you do with crypto is a refinement of what you can already do in&lt;br /&gt;
weaker forms with weaker assumptions.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note on Project Proposal&#039;&#039;&#039; &lt;br /&gt;
* The date has been extended until next week; as the prof said, some of the proposals are not completely up to the mark.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_14&amp;diff=19010</id>
		<title>DistOS 2014W Lecture 14</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_14&amp;diff=19010"/>
		<updated>2014-04-18T20:13:30Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: added stuff, move stuff around.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=OceanStore=&lt;br /&gt;
&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/fall2008/oceanstore-sigplan.pdf John Kubiatowicz et al., &amp;quot;OceanStore: An Architecture for Global-Scale Persistent Storage&amp;quot; (2000)]&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/fall2008/fast2003-pond.pdf Sean Rhea et al., &amp;quot;Pond: the OceanStore Prototype&amp;quot; (2003)]&lt;br /&gt;
* [http://oceanstore.cs.berkeley.edu/info/overview.html Project Overview]&lt;br /&gt;
&lt;br /&gt;
==Keywords==&lt;br /&gt;
Highly available, universally available, utility business model, untrusted servers, nomadic data, promiscuous caching, immutable version-based archival storage, highly persistent, pond, tapestry (DHT), broken dreams.&lt;br /&gt;
&lt;br /&gt;
==What is the dream?==&lt;br /&gt;
The dream was to create a persistent storage system that had high availability and was universally accessible--a global, ubiquitous, persistent data storage solution. OceanStore was meant to be a utility managed by multiple parties, with no one party having total control of (or a monopoly over) the system. &lt;br /&gt;
&lt;br /&gt;
The basic assumption made by the designers of OceanStore, however, was that none of the servers could be trusted. It would be built over the open internet. To support this, the system held only opaque/encrypted data. As such, the system could be used for more than files (e.g., for whole databases). &lt;br /&gt;
&lt;br /&gt;
The second basic assumption was nomadic data: information divorced from any physical location. Information was stored and replicated everywhere, using promiscuous caching to keep information near its users. This is unlike NFS and AFS, where only specific servers cache the data.&lt;br /&gt;
&lt;br /&gt;
To support the goal of high availability, there was a high amount of redundancy and fault-tolerance. For high persistence, everything was archived--nothing was ever truly deleted. This can be likened to working in version control with &amp;quot;Commits&amp;quot;. This is possibly due to the realization that the easier it is to delete things, the easier it is to lose things.&lt;br /&gt;
&lt;br /&gt;
==Why did the dream die?==&lt;br /&gt;
&lt;br /&gt;
The biggest reason that caused the OceanStore dream to die was the assumption of mistrusting all the actors--everything else they did was right. This assumption, however, caused the system to become needlessly complicated as they had to rebuild &#039;&#039;everything&#039;&#039; to accommodate this assumption. This was also unrealistic as this is not an assumption that is generally made (i.e., it is normally assumed that at least some of the actors can be trusted). &lt;br /&gt;
&lt;br /&gt;
Other successful distributed systems are built on a more trusted model--every node in Dynamo, BigTable, etc. is trusted. In short, the solution that accommodates the untrusted-actors assumption is just too expensive.&lt;br /&gt;
&lt;br /&gt;
=== Technology ===&lt;br /&gt;
As outlined above, the trust model (read: fundamentally untrusted model) is the most attractive feature, and ultimately what killed it. The untrusted assumption introduced a huge burden on the system, forcing technical limitations that made OceanStore uncompetitive compared to other solutions. It is just much easier and more convenient to trust a given system. It should be noted that every system is compromisable, despite this mistrust. &lt;br /&gt;
&lt;br /&gt;
The public key system also reduces usability--if users lose their key, they are completely out of luck and would need to acquire a new one. It also means that, if you wanted to revoke a user&#039;s access to an object, you would have to re-encrypt the object with a new key and provide that new key to everyone who should still have access.&lt;br /&gt;
&lt;br /&gt;
With regards to security, there is no security mechanism on the server side: the server cannot know who is accessing the data. On the economic side, the model is unconvincing as defined. The authors suggest that a collection of companies will host OceanStore servers and consumers will buy capacity (not unlike web hosting today).&lt;br /&gt;
&lt;br /&gt;
===Use Cases===&lt;br /&gt;
A subset of the features outlined for OceanStore already exists. For example, Blackberry and Google offer similar groupware services (e.g., email, contact lists, etc.). These current services are owned by one company, however, not many providers. You also cannot sell back your resources as a user (e.g., you can&#039;t sell your extra storage back to the utility).&lt;br /&gt;
&lt;br /&gt;
==Pond: What insights?==&lt;br /&gt;
In short: they actually built it! However, due to the untrusted assumption, they can&#039;t assume the use of any infrastructure, causing them to rebuild &#039;&#039;everything&#039;&#039;! It was built over the internet with Tapestry (dynamic routing) and GUID for object identification (object naming scheme).&lt;br /&gt;
&lt;br /&gt;
==Benchmarks==&lt;br /&gt;
In short: the system had really good read speed, really bad write speed. Absolutely everything is expensive and there is high latency.&lt;br /&gt;
&lt;br /&gt;
===Storage overhead===&lt;br /&gt;
One general question was how much they are increasing the storage needed to implement their storage model. The answer: a factor of 4.8x the space is needed (you&#039;ll have roughly 1/5th the storage). While this is expensive, it does have good value, as your data is backed up, replicated, etc. However, it does make one consider how important each update is, since more storage is burned with every update made. &lt;br /&gt;
&lt;br /&gt;
===Update performance===&lt;br /&gt;
None of the data is mutated--it is diffed and archived (i.e., a &amp;quot;commit&amp;quot;). You are essentially creating a new version of an object and then distributing that object to all nodes.&lt;br /&gt;
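A toy sketch of this update-as-new-version idea (illustrative only, not Pond&#039;s real data structures): nothing is overwritten in place; every update appends a new immutable version, like a commit, so old versions stay recoverable--which is also why storage overhead grows with every update.&lt;br /&gt;

```python
class VersionedObject:
    def __init__(self, data: bytes):
        self.versions = [data]           # version 0; the archive keeps all

    def update(self, new_data: bytes) -> int:
        self.versions.append(new_data)   # archive, never overwrite
        return len(self.versions) - 1    # new version number

    def read(self, version=-1) -> bytes:
        return self.versions[version]    # default: latest version
```
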
&lt;br /&gt;
==Other stuff==&lt;br /&gt;
&#039;&#039;&#039;Byzantine fault tolerance&#039;&#039;&#039;&lt;br /&gt;
* Byzantine fault tolerance has the assumption that there are malicious actors in your system.&lt;br /&gt;
* A Byzantine fault tolerant network replicates the data in such a way that even if m nodes out of the total n nodes in the network fail, you would still be able to recover all the data. But as you increase m, the number of network messages to be exchanged also increases, so there is a tradeoff.&lt;br /&gt;
* You are assuming certain actors are malicious.&lt;br /&gt;
&#039;&#039;&#039;Bitcoin&#039;&#039;&#039;&lt;br /&gt;
* Trusted vs Untrusted.&lt;br /&gt;
* It is considered to be untrusted, but a huge amount of trust is required when exchanges are made.&lt;br /&gt;
&lt;br /&gt;
==What&#039;s worth salvaging from the dream?==&lt;br /&gt;
Some of the good things we can salvage are the use of spare resources in other locations. It can also be noted that similar routing systems are used in large peer-to-peer systems.&lt;br /&gt;
&lt;br /&gt;
==How to read a research paper==&lt;br /&gt;
# Start with the Introduction to figure out what the problem is.&lt;br /&gt;
# See/read through the related work/background for context of the paper.&lt;br /&gt;
# Go to the conclusion and focus on the results (i.e., figure out what they actually did).&lt;br /&gt;
# Fill in the gaps by reading specific parts of the body.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_14&amp;diff=18715</id>
		<title>DistOS 2014W Lecture 14</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_14&amp;diff=18715"/>
		<updated>2014-03-04T16:24:33Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: ADded more info for rest of class&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=OceanStore=&lt;br /&gt;
&lt;br /&gt;
==What is the dream?==&lt;br /&gt;
* High availability, universally accessible.&lt;br /&gt;
* Utility managed by multiple parties.&lt;br /&gt;
* Highly redundant, fault tolerant&lt;br /&gt;
* Basic assumption was that servers would NOT be trusted.&lt;br /&gt;
&lt;br /&gt;
* Highly persistent&lt;br /&gt;
** Everything archived&lt;br /&gt;
** Everything saved, nothing deleted. &amp;quot;Commits&amp;quot;&lt;br /&gt;
&lt;br /&gt;
* Service was untrusted&lt;br /&gt;
** Held opaque/encrypted data.&lt;br /&gt;
* Would have been used for more than files. (eg. DB&#039;s, etc.)&lt;br /&gt;
&lt;br /&gt;
==Why did the dream die?==&lt;br /&gt;
&lt;br /&gt;
* Biggest reason it died was it&#039;s assumption of mistrusting the actors.&lt;br /&gt;
** Everything else they did was right.&lt;br /&gt;
* Other successful distributed systems are built on a more trusted model.&lt;br /&gt;
&lt;br /&gt;
=== Technology ===&lt;br /&gt;
* The trust model is the most attractive feature which ultimately killed it.&lt;br /&gt;
** The untrusted assumption was a huge burden on the system. Forced technical limitations made them uncompetitive.&lt;br /&gt;
** It is just easier to trust a given system. More convenient.&lt;br /&gt;
** Every system is compromisable despite this mistrust&lt;br /&gt;
* Pub key system reduces usability&lt;br /&gt;
** If you lose your key, you&#039;re S.O.L.&lt;br /&gt;
*security&lt;br /&gt;
**there is no security mechanism on the server side.&lt;br /&gt;
**can not know who accesses the data&lt;br /&gt;
*economic side&lt;br /&gt;
**The economic model is unconvincing as defined.  The authors suggest that a collection of companies will host OceanStore servers, and consumers will buy capacity (not unlike web-hosting of today).&lt;br /&gt;
&lt;br /&gt;
===Use Cases===&lt;br /&gt;
* Subset of the features already exist&lt;br /&gt;
** Blackberry and Google offer similar services.&lt;br /&gt;
** These current services owned by one company, not many providers.&lt;br /&gt;
** Can not sell back your services as a user.&lt;br /&gt;
*** ex. Can not sell your extra storage back to the utility.&lt;br /&gt;
&lt;br /&gt;
==Pond: What insights?==&lt;br /&gt;
&lt;br /&gt;
* They actually built it.&lt;br /&gt;
* Can&#039;t assume the use of any infrastructure, so they rebuild everything!&lt;br /&gt;
** Built over the internet.&lt;br /&gt;
** Tapestry (routing).&lt;br /&gt;
** GUID for object identification. Object naming scheme.&lt;br /&gt;
&lt;br /&gt;
==Benchmarks==&lt;br /&gt;
* Really good read speed, really bad write speed.&lt;br /&gt;
&lt;br /&gt;
===Storage overhead===&lt;br /&gt;
* How much are they increasing the storage needed to implement their storage model.&lt;br /&gt;
* Factor of 4.8x the space needed (you&#039;ll have 1/5th the  storage)&lt;br /&gt;
* Expensive, but good value (data is backed up, replicated, etc..)&lt;br /&gt;
&lt;br /&gt;
===Update performance===&lt;br /&gt;
* No data is mutated. It is diffed and archived.&lt;br /&gt;
* Creating a new version of an object and distributing that object.&lt;br /&gt;
&lt;br /&gt;
===Benchmarks in a nutshell===&lt;br /&gt;
* Everything is expensive!&lt;br /&gt;
* High latency&lt;br /&gt;
&lt;br /&gt;
==Other stuff==&lt;br /&gt;
* Byzantine fault tolerance&lt;br /&gt;
** Assuming certain actors are malicious&lt;br /&gt;
&lt;br /&gt;
==What&#039;s worth salvaging from the dream?==&lt;br /&gt;
* Using spare resources in other locations.&lt;br /&gt;
&lt;br /&gt;
==How to read a research paper==&lt;br /&gt;
* Start with Intro&lt;br /&gt;
** Figure out what the problem is&lt;br /&gt;
* then see the related work for context&lt;br /&gt;
* then go to conclusion. Focus on results.&lt;br /&gt;
* then fill in the gaps by reading specific parts of the body&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_14&amp;diff=18714</id>
		<title>DistOS 2014W Lecture 14</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_14&amp;diff=18714"/>
		<updated>2014-03-04T16:15:42Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Cleaned up some stuff a bit.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=OceanStore=&lt;br /&gt;
&lt;br /&gt;
==What is the dream?==&lt;br /&gt;
* High availability, universally accessible.&lt;br /&gt;
* Utility managed by multiple parties.&lt;br /&gt;
* Highly redundant, fault tolerant&lt;br /&gt;
* Basic assumption was that servers would NOT be trusted.&lt;br /&gt;
&lt;br /&gt;
* Highly persistent&lt;br /&gt;
** Everything archived&lt;br /&gt;
** Everything saved, nothing deleted. &amp;quot;Commits&amp;quot;&lt;br /&gt;
&lt;br /&gt;
* Service was untrusted&lt;br /&gt;
** Held opaque/encrypted data.&lt;br /&gt;
* Would have been used for more than files. (eg. DB&#039;s, etc.)&lt;br /&gt;
&lt;br /&gt;
==Why did the dream die?==&lt;br /&gt;
&lt;br /&gt;
=== Technology ===&lt;br /&gt;
* The trust model is the most attractive feature which ultimately killed it.&lt;br /&gt;
** The untrusted assumption was a huge burden on the system. Forced technical limitations made them uncompetitive.&lt;br /&gt;
** It is just easier to trust a given system. More convenient.&lt;br /&gt;
** Every system is compromisable despite this mistrust&lt;br /&gt;
* Pub key system reduces usability&lt;br /&gt;
** If you lose your key, you&#039;re S.O.L.&lt;br /&gt;
*security&lt;br /&gt;
**there is no security mechanism on the server side.&lt;br /&gt;
**can not know who accesses the data&lt;br /&gt;
*economic side&lt;br /&gt;
**The economic model is unconvincing as defined.  The authors suggest that a collection of companies will host OceanStore servers, and consumers will buy capacity (not unlike web-hosting of today).&lt;br /&gt;
&lt;br /&gt;
===Use Cases===&lt;br /&gt;
* Subset of the features already exist&lt;br /&gt;
** Blackberry and Google offer similar services.&lt;br /&gt;
** These current services owned by one company, not many providers.&lt;br /&gt;
** Can not sell back your services as a user.&lt;br /&gt;
*** ex. Can not sell your extra storage back to the utility.&lt;br /&gt;
&lt;br /&gt;
==Pond: What insights?==&lt;br /&gt;
&lt;br /&gt;
* They actually built it.&lt;br /&gt;
* Can&#039;t assume the use of any infrastructure, so they rebuild everything!&lt;br /&gt;
** Built over the internet.&lt;br /&gt;
** Tapestry (routing).&lt;br /&gt;
** GUID for object identification. Object naming scheme.&lt;br /&gt;
&lt;br /&gt;
==Benchmarks==&lt;br /&gt;
* Really good read speed, really bad write speed.&lt;br /&gt;
&lt;br /&gt;
===Storage overhead===&lt;br /&gt;
* How much are they increasing the storage needed to implement their storage model.&lt;br /&gt;
* Factor of 4.8x the space needed (you&#039;ll have 1/5th the  storage)&lt;br /&gt;
* Expensive, but good value (data is backed up, replicated, etc..)&lt;br /&gt;
&lt;br /&gt;
===Update performance===&lt;br /&gt;
* No data is mutated. It is diffed and archived.&lt;br /&gt;
* Creating a new version of an object and distributing that object.&lt;br /&gt;
&lt;br /&gt;
===Benchmarks in a nutshell===&lt;br /&gt;
* Everything is expensive!&lt;br /&gt;
* High latency&lt;br /&gt;
&lt;br /&gt;
==Other stuff==&lt;br /&gt;
* Byzantine fault tolerance&lt;br /&gt;
** Assuming certain actors are malicious&lt;br /&gt;
&lt;br /&gt;
==What&#039;s worth salvaging from the dream?==&lt;br /&gt;
* Using spare resources in other locations.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_14&amp;diff=18711</id>
		<title>DistOS 2014W Lecture 14</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_14&amp;diff=18711"/>
		<updated>2014-03-04T16:05:38Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Initial dump&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=OceanStore=&lt;br /&gt;
&lt;br /&gt;
==What is the dream?==&lt;br /&gt;
* High availability, universally available.&lt;br /&gt;
* Utility managed by multiple parties&lt;br /&gt;
* Highly redundant, fault tolerant&lt;br /&gt;
* Basic assumption was that servers would NOT be trusted.&lt;br /&gt;
&lt;br /&gt;
* Highly persistent&lt;br /&gt;
** Effective archival&lt;br /&gt;
** Everything saved, nothing deleted. &amp;quot;Commits&amp;quot;&lt;br /&gt;
&lt;br /&gt;
* Service was untrusted&lt;br /&gt;
** Held opaque data.&lt;br /&gt;
* Would be used for more than files. DB&#039;s, etc.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Why did the dream die?==&lt;br /&gt;
&lt;br /&gt;
=== Technology ===&lt;br /&gt;
* The trust model is the most attractive feature which ultimately killed it.&lt;br /&gt;
** The untrusted assumption was a huge burden on the system. Forced technical limitations made them uncompetitive.&lt;br /&gt;
** It is just easier to trust a given system&lt;br /&gt;
** Every system is compromisable despite this mistrust&lt;br /&gt;
* The public-key system reduces usability&lt;br /&gt;
** If you lose your key, you&#039;re S.O.L.&lt;br /&gt;
&lt;br /&gt;
===Use Cases===&lt;br /&gt;
* A subset of the features already exists&lt;br /&gt;
** BlackBerry, Google.&lt;br /&gt;
** Current services owned by one company, not many providers.&lt;br /&gt;
** Cannot sell back your services as a user.&lt;br /&gt;
*** e.g., you cannot sell your extra storage back to the utility.&lt;br /&gt;
&lt;br /&gt;
==Pond: What insights?==&lt;br /&gt;
&lt;br /&gt;
* They actually built it.&lt;br /&gt;
&lt;br /&gt;
* Can&#039;t assume the use of any infrastructure, so they rebuilt everything!&lt;br /&gt;
** Built over the internet.&lt;br /&gt;
** Tapestry&lt;br /&gt;
** GUID for object identification. Object naming scheme.&lt;br /&gt;
&lt;br /&gt;
==Benchmarks==&lt;br /&gt;
* Really good read speed, really bad write speed.&lt;br /&gt;
&lt;br /&gt;
===Storage overhead===&lt;br /&gt;
* How much do they increase the storage needed to implement their storage model?&lt;br /&gt;
* A factor of 4.8x the space needed (you effectively get about 1/5 of the raw storage)&lt;br /&gt;
* Expensive, but good value (data is backed up, replicated, etc..)&lt;br /&gt;
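The arithmetic behind the factor above, as a quick back-of-envelope sketch (assuming the 4.8x overhead applies uniformly):&lt;br /&gt;

```python
# Pond reports roughly 4.8x storage overhead (replication plus archival
# fragments and indexing), so raw capacity buys only about a fifth of it
# in user-visible data.
OVERHEAD_FACTOR = 4.8

def usable_storage(raw):
    """Usable (user-visible) storage for a given raw capacity."""
    return raw / OVERHEAD_FACTOR

print(usable_storage(1000))  # 1000 GB raw -> ~208.3 GB usable
```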
&lt;br /&gt;
==Update performance==&lt;br /&gt;
* No data is mutated. It is diffed and archived.&lt;br /&gt;
* Creating a new version of an object and distributing that object.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Other stuff==&lt;br /&gt;
* Byzantine fault tolerance&lt;br /&gt;
** Assuming certain actors are malicious&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_5&amp;diff=18670</id>
		<title>DistOS 2014W Lecture 5</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_5&amp;diff=18670"/>
		<updated>2014-02-24T02:16:43Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Added stuff about the alto compared to the NLS system&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=The Mother of all Demos (Jan. 21)=&lt;br /&gt;
&lt;br /&gt;
* [http://www.dougengelbart.org/firsts/dougs-1968-demo.html Doug Engelbart Institute, &amp;quot;Doug&#039;s 1968 Demo&amp;quot;]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/The_Mother_of_All_Demos Wikipedia&#039;s page on &amp;quot;The Mother of all Demos&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
= Introduction =&lt;br /&gt;
&lt;br /&gt;
Anil set the theme of the discussion for the week: to try to understand what the early visionaries and researchers wanted the computer to be, and what it has become. In other words, what was considered fundamental in those days, and where those ideas stand today. It is worth noting that features that were easy to implement using simple mechanisms were carried forward, whereas those that demanded more complex systems, or that did not appear to add much value in the near future, were pushed down the priority order. In this context, the following observations were made: (1) a truly distributed computational infrastructure only makes sense when we have something to distribute; (2) use cases drive large distributed systems, a good example being the Web. Another key observation from Anil was that there was always a utopian aspect to the early systems, be it NLS, ARPANET, or the Alto. One good example is that security was never considered essential in those systems, since they were assumed to operate in a trusted environment. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
; Operating system&lt;br /&gt;
: The software that turns the computer you have into the one you want (Anil)&lt;br /&gt;
&lt;br /&gt;
* What sort of computer did we want to have?&lt;br /&gt;
* What sort of abstractions did they want to be easy? Hard?&lt;br /&gt;
* What could we build with the internet (not just WAN, but also LAN)?&lt;br /&gt;
* Most dreams people had of their computers smacked into the wall of reality.&lt;br /&gt;
&lt;br /&gt;
= MOAD review in groups =&lt;br /&gt;
&lt;br /&gt;
* The chorded keyboard is unfortunately obscure, partly because attendees questioned the long-term investment of training the user.&lt;br /&gt;
* View control → hyperlinking system, but in a lightweight (more like nanoweight) markup language.&lt;br /&gt;
* Ad-hoc ticketing system&lt;br /&gt;
* Ad-hoc messaging system&lt;br /&gt;
** Used on a time-sharing system with shared storage.&lt;br /&gt;
* Primitive revision control system&lt;br /&gt;
* Different vocabulary:&lt;br /&gt;
** Bug and bug smear (mouse and trail)&lt;br /&gt;
** Point rather than click&lt;br /&gt;
&lt;br /&gt;
= Class review =&lt;br /&gt;
&lt;br /&gt;
* Doug died Jul 2 2013&lt;br /&gt;
* Doug himself called it an “online system”, rather than offline composition of code using card punchers as was common in the day.&lt;br /&gt;
* What became of the tech:&lt;br /&gt;
** Chorded keyboards:&lt;br /&gt;
*** Exist but obscure&lt;br /&gt;
** Pre-ARPANET network:&lt;br /&gt;
*** Time-sharing mainframe&lt;br /&gt;
*** 13 workstations&lt;br /&gt;
*** Telephone and television circuit&lt;br /&gt;
** Mouse&lt;br /&gt;
*** “I sometimes apologize for calling it a mouse”&lt;br /&gt;
** Collaborative document editing integrated with screen sharing&lt;br /&gt;
** Videoconferencing&lt;br /&gt;
*** Part of the vision, but more for the demo at the time,&lt;br /&gt;
** Hyperlinks&lt;br /&gt;
*** The web on a mainframe&lt;br /&gt;
** Languages&lt;br /&gt;
*** Metalanguages&lt;br /&gt;
**** “Part and parcel of their entire vision of augmenting human intelligence.”&lt;br /&gt;
**** You must teach the computer about the language you are using.&lt;br /&gt;
**** They were the use case. It was almost designed more for augmenting programmer intelligence rather than human intelligence.&lt;br /&gt;
*** It was normal for the time to build new languages (domain-specific) for new systems. Nowadays, we standardize on one but develop large APIs, at the expense of conciseness. We look for short-term benefits; we minimize programmer effort.&lt;br /&gt;
*** Compiler compiler&lt;br /&gt;
** Freeze-pane&lt;br /&gt;
** Folding—Zoomable UI (ZUI)&lt;br /&gt;
*** Lots of systems do it, but not the default&lt;br /&gt;
*** Much easier to just present everything.&lt;br /&gt;
** Technologies that required further investment got left behind.&lt;br /&gt;
* The NLS had little to no security&lt;br /&gt;
** There was a minimal notion of a user&lt;br /&gt;
** There was a utopian aspect. Meanwhile, the Mac had no utopian aspect. Data exchange was through floppies. Any network was small, local, ad-hoc, and among trusted peers.&lt;br /&gt;
** The system wasn&#039;t envisioned to scale up to masses of people who didn&#039;t trust each other.&lt;br /&gt;
** How do you enforce secrecy?&lt;br /&gt;
* Part of the reason for lack of adoption of some of the tech was hardware. We can posit that a bigger reason would be infrastructure.&lt;br /&gt;
* Differentiate usability of system from usability of vision&lt;br /&gt;
** What was missing was the polish, the ‘sexiness’, and the intuitiveness of later systems like the Apple II and the Lisa.&lt;br /&gt;
** The usability of the later Alto is still less than commercial systems.&lt;br /&gt;
*** The word processor was modal, which is apt to confuse unmotivated and untrained users.&lt;br /&gt;
* In the context of the Mother of All Demos, the Alto doesn&#039;t seem entirely revolutionary. Xerox PARC raided his team. They almost had a GUI; rather, they had what we would today call a virtual console, with a few extras on top.&lt;br /&gt;
* What happens with visionaries that present a big vision is that the spectators latch onto specific aspects.&lt;br /&gt;
* To be comfortable with not adopting the vision, one must ostracize the visionary. People pay attention to things that fit into their world view.&lt;br /&gt;
* Use cases of networking have changed little, though the means have&lt;br /&gt;
* Fundamentally a resource-sharing system; everything is shared, unlike later systems where you would need to do so explicitly. The resources shared are ones it fundamentally makes sense to share: documents, printers, etc.&lt;br /&gt;
* Resource sharing was never enough. &#039;&#039;&#039;Information-sharing&#039;&#039;&#039; was the focus.&lt;br /&gt;
&lt;br /&gt;
“Mother of all demos” is the nickname for Engelbart&#039;s demonstration of how computers could help humans become smarter. &lt;br /&gt;
&lt;br /&gt;
* More interesting in this work:&lt;br /&gt;
His idea included seeing computing devices as a means to communicate and retrieve information, rather than just crunch numbers. This idea is represented in NLS, the “oN-Line System”.&lt;br /&gt;
&lt;br /&gt;
*Some information about  NLS system:&lt;br /&gt;
1) NLS was a revolutionary computer collaboration system from the 1960s. &lt;br /&gt;
2) Designed by Douglas Engelbart and implemented by researchers at the Augmentation Research Center (ARC) at the Stanford Research Institute (SRI). &lt;br /&gt;
3) The NLS system was the first to employ the practical use of :&lt;br /&gt;
  a) hypertext links,&lt;br /&gt;
  b) the mouse, &lt;br /&gt;
  c) raster-scan video monitors, &lt;br /&gt;
  d) information organized by relevance, &lt;br /&gt;
  e) screen windowing, &lt;br /&gt;
  f) presentation programs, &lt;br /&gt;
  g) and other modern computing concepts.&lt;br /&gt;
&lt;br /&gt;
= Alto review =&lt;br /&gt;
&lt;br /&gt;
* Fundamentally a personal computer&lt;br /&gt;
* Applications:&lt;br /&gt;
** Drawing program with curves and arcs for drawing&lt;br /&gt;
** Hardware design tools (mostly logic boards)&lt;br /&gt;
** Time server&lt;br /&gt;
* Less designed for reading than the NLS; more designed around paper. Xerox had a laser printer, and you would read what you printed. Hypertext was deprioritized, whereas the NLS vision had focused on what could not be expressed on paper.&lt;br /&gt;
* Xerox had almost an obsession with making documents print beautifully.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Alto vs NLS =&lt;br /&gt;
NLS and Alto both had text processing, drawing, programming environments, some form of email (communication). Alto had WYSIWYG everything.&lt;br /&gt;
&lt;br /&gt;
The Alto was not built on a mainframe. NLS &#039;resource sharing&#039; was based around the mainframe. The Alto had the idea of sharing via the network (e.g., a printer server).&lt;br /&gt;
&lt;br /&gt;
Alto focused a lot less on &#039;hypertext&#039;. Less about navigating deep information.  It used the paper metaphor. It implemented existing metaphors and adapted them to the PC. Alto people came from a culture that really valued printed paper.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_7&amp;diff=18669</id>
		<title>DistOS 2014W Lecture 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_7&amp;diff=18669"/>
		<updated>2014-02-24T00:43:29Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Merged Simon&amp;#039;s and others notes together. Significant change&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Project ==&lt;br /&gt;
&lt;br /&gt;
We discussed moving the proposal due date back a week. We also discussed spending the class prior to that date discussing the primary papers people had chosen in order to provide preliminary feedback. Anil spent some time going through the papers from OSDI12 and discussing which ones would make good projects and why.&lt;br /&gt;
&lt;br /&gt;
* Pick a primary paper.&lt;br /&gt;
* Find papers that cite that paper, papers it cites, etc. to collect a body of related work.&lt;br /&gt;
* Don&#039;t just give a history, tell a story!&lt;br /&gt;
* Do not try to summarize papers.&lt;br /&gt;
* Try to identify a pattern, a common ground between the papers&lt;br /&gt;
* Tell a story that connects several papers in the topic you choose&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Pick a conference (USENIX is pretty systems-oriented, maybe LISA), go through its papers and find something interesting.&lt;br /&gt;
&lt;br /&gt;
Examples from OSDI 2012:&lt;br /&gt;
* datacenter (filesystems for doing X, heat management, etc...)&lt;br /&gt;
* web stuff&lt;br /&gt;
* distributed shared memory&lt;br /&gt;
* distributed network I/O infrastructure&lt;br /&gt;
* distributed databases (potentially)&lt;br /&gt;
* anonymity systems&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==UNIX and Plan 9 (Jan. 28)==&lt;br /&gt;
&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/fall2008/unix.pdf Dennis M. Ritchie and Ken Thompson, &amp;quot;The UNIX Time-Sharing System&amp;quot; (1974)]&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/2014w/presotto-plan9.pdf Presotto et. al, Plan 9, A Distributed System (1991)]&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/2014w/pike-plan9.pdf Pike et al., Plan 9 from Bell Labs (1995)]&lt;br /&gt;
&lt;br /&gt;
== Unix and Plan 9 ==&lt;br /&gt;
&lt;br /&gt;
* Multics was a complex system, which hurt it: it was slower, less widely used, etc.&lt;br /&gt;
* Multics was not for end users, it was designed to support &amp;quot;utility computing&amp;quot; wherein computation was a service to be charged for&lt;br /&gt;
&lt;br /&gt;
UNIX was built as &amp;quot;a castrated version of Multics&amp;quot;, which was a very complex system. Multics was, arguably, so far ahead of its time that we are only just achieving their ambitions now. Unix was much more modest, and therefore much more achievable and successful. Just enough infrastructure to avoid reinventing the wheel. Just a couple of programmers making something for their own use.&lt;br /&gt;
&lt;br /&gt;
* Just enough infrastructure to run my programs&lt;br /&gt;
* It was really just supposed to be used by programmers&lt;br /&gt;
* &amp;quot;By programmers for programmers&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Unix was not designed as a product or commercial entity at all. It was licensed out because AT&amp;amp;T was under severe antitrust scrutiny at the time.&lt;br /&gt;
&lt;br /&gt;
They wanted few, simple abstractions so they made everything a file. The only difference amongst most files was that you could use seek on some and not on others. Berkeley promptly broke this abstraction by introducing sockets for networking.&lt;br /&gt;
&lt;br /&gt;
Plan 9 finally introduced networking using the right abstractions, but it was too late. Sun Microsystems licensed Berkeley Unix and commercialized it. Arguably, the reason the BSD folks didn&#039;t use the file abstraction was the difference in reliability: files are generally reliable, and failures with them are catastrophic, so many applications simply didn&#039;t include logic to handle such I/O errors. Networks are much less reliable, and applications have to deal gracefully with timeouts and other errors.&lt;br /&gt;
&lt;br /&gt;
In Anil&#039;s opinion, Plan 9&#039;s design of using the file abstraction to represent the network wasn&#039;t a good idea. File I/O breaking is uncommon, but the network is inherently flaky: loss of connectivity is normal. File system abstractions don&#039;t properly account for that flakiness. Put another way, the network doesn&#039;t have the reliability characteristics of mass storage, and how to deal with this fact while using the file abstraction for the network was a major question the Plan 9 designers left unanswered. Things that have different failure modes require different APIs. Anil also added that Plan 9 was an elegant attempt at representing everything with the file abstraction, but they were trying too hard with this approach, as pointed out above. &lt;br /&gt;
&lt;br /&gt;
In distributed systems, the best approach is this: if things have different semantics, they should have abstractions and APIs that reflect their characteristics, rather than hiding them away and pretending they behave like something else in pursuit of too much generalization.&lt;br /&gt;
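A minimal, hypothetical sketch of this point (illustrative Python, not Plan 9 code; the function names are invented): a local-file API can let rare failures propagate as exceptions, while a network API must expose timeouts and retries as part of its normal contract.&lt;br /&gt;

```python
import socket

def read_local(path):
    # Disk I/O: failure is rare and usually unrecoverable, so callers
    # typically just let the exception propagate.
    with open(path, "rb") as f:
        return f.read()

def read_remote(host, port, timeout=1.0, retries=3):
    # Network I/O: timeouts and transient failures are normal, so the
    # API surfaces them explicitly and retrying is part of the contract.
    last_err = None
    for _ in range(retries):
        try:
            with socket.create_connection((host, port), timeout=timeout) as s:
                s.settimeout(timeout)
                chunks = []
                while True:
                    data = s.recv(4096)
                    if not data:
                        return b"".join(chunks)
                    chunks.append(data)
        except OSError as err:  # includes timeouts and refused connections
            last_err = err
    raise ConnectionError("gave up after retries") from last_err
```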
&lt;br /&gt;
Plan 9 implemented procfs, a directory that listed all processes as files. This was later adopted by Linux.&lt;br /&gt;
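On Linux, the procfs idea looks like this in practice: each process appears as a numbered directory under /proc. A small sketch (Linux-specific; returns an empty list where procfs is absent):&lt;br /&gt;

```python
import os

def list_pids(proc="/proc"):
    """List process IDs by reading the numeric directory names under
    /proc; on systems without procfs this returns an empty list."""
    if not os.path.isdir(proc):
        return []
    return sorted(int(name) for name in os.listdir(proc) if name.isdigit())

pids = list_pids()
```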
&lt;br /&gt;
In Anil&#039;s opinion, another reason why Plan 9 was not widely adopted was that it was a bit late to the scene: by the time Plan 9 came out in the 90s, systems running UNIX with networking were already widely adopted, driven by the success of the Internet.&lt;br /&gt;
&lt;br /&gt;
Another valuable point Anil mentioned was that for a technology to get adopted and become successful, it should address a niche for which there are no successful incumbents; there should be a champion use for the technology. No technology keeps existing just because it is cool.&lt;br /&gt;
&lt;br /&gt;
Tangent about programming languages: C was for system programming. Java was for enterprise programming.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_8&amp;diff=18668</id>
		<title>DistOS 2014W Lecture 8</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_8&amp;diff=18668"/>
		<updated>2014-02-23T23:48:12Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: More NFS details.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==NFS and AFS (Jan 30)==&lt;br /&gt;
&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/2008-02-11/sandberg-nfs.pdf Russel Sandberg et al., &amp;quot;Design and Implementation of the Sun Network Filesystem&amp;quot; (1985)]&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/2008-02-11/howard-afs.pdf John H. Howard et al., &amp;quot;Scale and Performance in a Distributed File System&amp;quot; (1988)]&lt;br /&gt;
&lt;br /&gt;
==NFS==&lt;br /&gt;
Group 1:&lt;br /&gt;
&lt;br /&gt;
1) per operation traffic.&lt;br /&gt;
&lt;br /&gt;
2) RPC-based. Easy to program with, but a very [http://www.joelonsoftware.com/articles/LeakyAbstractions.html leaky abstraction].&lt;br /&gt;
&lt;br /&gt;
3) unreliable&lt;br /&gt;
&lt;br /&gt;
Group 2:&lt;br /&gt;
&lt;br /&gt;
1) designed to share disks over a network, not files&lt;br /&gt;
&lt;br /&gt;
2) more UNIX like. They tried to maintain unix file semantics on the client and server side.&lt;br /&gt;
&lt;br /&gt;
3) portable. It was meant to work (as a server) across many FS types.&lt;br /&gt;
&lt;br /&gt;
4) used UDP: if request dropped, just request again.&lt;br /&gt;
&lt;br /&gt;
5) it does not minimize network traffic.&lt;br /&gt;
&lt;br /&gt;
6) used VNODE, VFS as transparent interfaces to local disks.&lt;br /&gt;
&lt;br /&gt;
7) did not require much hardware&lt;br /&gt;
&lt;br /&gt;
8) later versions took on features of AFS&lt;br /&gt;
&lt;br /&gt;
9) stateless protocol conflicts with files being stateful by nature.&lt;br /&gt;
&lt;br /&gt;
Group 3:&lt;br /&gt;
&lt;br /&gt;
1) cache assumption invalid.&lt;br /&gt;
&lt;br /&gt;
2) no dedicated locking mechanism. They couldn&#039;t decide on which locking strategy to use, so they left it up to the users of NFS to use their own separate locking service.&lt;br /&gt;
&lt;br /&gt;
3) bad security&lt;br /&gt;
&lt;br /&gt;
Other:&lt;br /&gt;
* Client mounts full FS. No common namespace.&lt;br /&gt;
* Hostname lookup and address binding happens at mount&lt;br /&gt;
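Point 4 above (if a request is dropped, just request again) works because NFS requests are idempotent and the server is stateless, so blind retransmission is safe. A toy sketch of that retry loop over UDP, using an invented echo server rather than the real NFS protocol:&lt;br /&gt;

```python
import socket
import threading

def run_echo_server(sock):
    # Stateless server: each datagram is answered independently, so
    # handling a retransmitted request twice is harmless.
    while True:
        data, addr = sock.recvfrom(1024)
        sock.sendto(b"reply:" + data, addr)

def rpc_call(request, server_addr, timeout=0.5, retries=5):
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        client.settimeout(timeout)
        for _ in range(retries):
            client.sendto(request, server_addr)  # (re)send the same request
            try:
                reply, _ = client.recvfrom(1024)
                return reply
            except socket.timeout:
                continue  # datagram lost? just ask again
    raise TimeoutError("server unreachable after retries")

server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
addr = server.getsockname()
threading.Thread(target=run_echo_server, args=(server,), daemon=True).start()
```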
&lt;br /&gt;
==AFS==&lt;br /&gt;
&lt;br /&gt;
Group 1&lt;br /&gt;
&lt;br /&gt;
1) designed for 5000 to 10000 clients&lt;br /&gt;
&lt;br /&gt;
2) high integrity.&lt;br /&gt;
&lt;br /&gt;
Group 2&lt;br /&gt;
&lt;br /&gt;
1) designed to share files over a network, not disks. It is one FS.&lt;br /&gt;
&lt;br /&gt;
2) better scalability&lt;br /&gt;
&lt;br /&gt;
3) better security (Kerberos).&lt;br /&gt;
&lt;br /&gt;
4) minimize network traffic.&lt;br /&gt;
&lt;br /&gt;
5) less UNIX like&lt;br /&gt;
&lt;br /&gt;
6) plugin authentication&lt;br /&gt;
&lt;br /&gt;
7) needs more kernel storage due to complex commands&lt;br /&gt;
&lt;br /&gt;
8) inode concept replaced with fid&lt;br /&gt;
&lt;br /&gt;
Group 3&lt;br /&gt;
&lt;br /&gt;
1) cache assumption valid&lt;br /&gt;
&lt;br /&gt;
2) locking&lt;br /&gt;
&lt;br /&gt;
3) good security.&lt;br /&gt;
&lt;br /&gt;
Other:&lt;br /&gt;
* Caches full files locally on open. Sends diffs on close.&lt;br /&gt;
&lt;br /&gt;
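The open/close behaviour above can be sketched as a toy whole-file-caching client (illustrative only; early AFS actually wrote whole files back on close, with diff-shipping coming later):&lt;br /&gt;

```python
import shutil
import tempfile
from pathlib import Path

class WholeFileCacheClient:
    """Toy AFS-style client: whole-file fetch on open, write-back on close."""
    def __init__(self, server_dir, cache_dir):
        self.server = Path(server_dir)
        self.cache = Path(cache_dir)

    def open(self, name):
        local = self.cache / name
        shutil.copyfile(self.server / name, local)  # fetch the whole file
        return local  # reads and writes are now purely local

    def close(self, name):
        # close acts like a commit: changes reach the server only here
        shutil.copyfile(self.cache / name, self.server / name)

server_dir = Path(tempfile.mkdtemp())
cache_dir = Path(tempfile.mkdtemp())
(server_dir / "notes.txt").write_text("v1")
client = WholeFileCacheClient(server_dir, cache_dir)
local = client.open("notes.txt")
local.write_text("v2")     # server still sees v1
client.close("notes.txt")  # now the server sees v2
```

Note how close acts like a commit: the server sees no changes until it runs.&lt;br /&gt;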
==Class Discussion:== &lt;br /&gt;
&lt;br /&gt;
Capturing some of Anil&#039;s Observations about NFS and AFS: &lt;br /&gt;
* NFS shares at the file level rather than the block level because block-level sharing is complicated to implement. &lt;br /&gt;
* NFS uses UDP as the transport protocol: UDP, being stateless, is in line with the NFS design philosophy of not maintaining state. &lt;br /&gt;
* Security and unreliability issues in NFS are an implication of using RPC. &lt;br /&gt;
** RPC is a nice programming model, but it is not designed for networks, where flakiness is inherent: from a programming point of view, you never expect a function call to fail to return because of a communication error.&lt;br /&gt;
* AFS designers considered the network a bottleneck and tried to reduce chatter over the network by caching.&lt;br /&gt;
** The &#039;open&#039; and &#039;close&#039; operations in AFS were critical.&lt;br /&gt;
** The &#039;close&#039; operation is as important as a &#039;commit&#039; operation in a well-designed database system. &lt;br /&gt;
* The security model of AFS is interesting: rather than a UNIX access-list-based implementation, AFS used a single sign-on system based on Kerberos. &lt;br /&gt;
** The cool thing about Kerberos is the idea of using tickets to get access.&lt;br /&gt;
* Despite having better features than NFS, AFS was not widely adopted: its administrative machinery was complex, and it required highly trained people and many days of effort to set up and maintain.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_8&amp;diff=18667</id>
		<title>DistOS 2014W Lecture 8</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_8&amp;diff=18667"/>
		<updated>2014-02-23T23:44:56Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Added a few details.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==NFS and AFS (Jan 30)==&lt;br /&gt;
&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/2008-02-11/sandberg-nfs.pdf Russel Sandberg et al., &amp;quot;Design and Implementation of the Sun Network Filesystem&amp;quot; (1985)]&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/2008-02-11/howard-afs.pdf John H. Howard et al., &amp;quot;Scale and Performance in a Distributed File System&amp;quot; (1988)]&lt;br /&gt;
&lt;br /&gt;
==NFS==&lt;br /&gt;
Group 1:&lt;br /&gt;
&lt;br /&gt;
1) per operation traffic.&lt;br /&gt;
&lt;br /&gt;
2) RPC-based. Easy to program with, but a very [http://www.joelonsoftware.com/articles/LeakyAbstractions.html leaky abstraction].&lt;br /&gt;
&lt;br /&gt;
3) unreliable&lt;br /&gt;
&lt;br /&gt;
Group 2:&lt;br /&gt;
&lt;br /&gt;
1) designed to share disks over a network, not files&lt;br /&gt;
&lt;br /&gt;
2) more UNIX like. They tried to maintain unix file semantics on the client and server side.&lt;br /&gt;
&lt;br /&gt;
3) portable. It was meant to work (as a server) across many FS types.&lt;br /&gt;
&lt;br /&gt;
4) used UDP: if request dropped, just request again.&lt;br /&gt;
&lt;br /&gt;
5) it does not minimize network traffic.&lt;br /&gt;
&lt;br /&gt;
6) used VNODE, VFS as transparent interfaces to local disks.&lt;br /&gt;
&lt;br /&gt;
7) did not require much hardware&lt;br /&gt;
&lt;br /&gt;
8) later versions took on features of AFS&lt;br /&gt;
&lt;br /&gt;
9) stateless protocol conflicts with files being stateful by nature.&lt;br /&gt;
&lt;br /&gt;
Group 3:&lt;br /&gt;
&lt;br /&gt;
1) cache assumption invalid.&lt;br /&gt;
&lt;br /&gt;
2) no dedicated locking mechanism. They couldn&#039;t decide on which locking strategy to use, so they left it up to the users of NFS to use their own separate locking service.&lt;br /&gt;
&lt;br /&gt;
3) bad security&lt;br /&gt;
&lt;br /&gt;
==AFS==&lt;br /&gt;
&lt;br /&gt;
Group 1&lt;br /&gt;
&lt;br /&gt;
1) designed for 5000 to 10000 clients&lt;br /&gt;
&lt;br /&gt;
2) high integrity.&lt;br /&gt;
&lt;br /&gt;
Group 2&lt;br /&gt;
&lt;br /&gt;
1) designed to share files over a network, not disks. It is one FS.&lt;br /&gt;
&lt;br /&gt;
2) better scalability&lt;br /&gt;
&lt;br /&gt;
3) better security (Kerberos).&lt;br /&gt;
&lt;br /&gt;
4) minimize network traffic.&lt;br /&gt;
&lt;br /&gt;
5) less UNIX like&lt;br /&gt;
&lt;br /&gt;
6) plugin authentication&lt;br /&gt;
&lt;br /&gt;
7) needs more kernel storage due to complex commands&lt;br /&gt;
&lt;br /&gt;
8) inode concept replaced with fid&lt;br /&gt;
&lt;br /&gt;
Group 3&lt;br /&gt;
&lt;br /&gt;
1) cache assumption valid&lt;br /&gt;
&lt;br /&gt;
2) locking&lt;br /&gt;
&lt;br /&gt;
3) good security.&lt;br /&gt;
&lt;br /&gt;
Other:&lt;br /&gt;
* Caches full files locally on open. Sends diffs on close.&lt;br /&gt;
&lt;br /&gt;
==Class Discussion:== &lt;br /&gt;
&lt;br /&gt;
Capturing some of Anil&#039;s Observations about NFS and AFS: &lt;br /&gt;
* NFS shares at the file level rather than the block level because block-level sharing is complicated to implement. &lt;br /&gt;
* NFS uses UDP as the transport protocol: UDP, being stateless, is in line with the NFS design philosophy of not maintaining state. &lt;br /&gt;
* Security and unreliability issues in NFS are an implication of using RPC. &lt;br /&gt;
** RPC is a nice programming model, but it is not designed for networks, where flakiness is inherent: from a programming point of view, you never expect a function call to fail to return because of a communication error.&lt;br /&gt;
* AFS designers considered the network a bottleneck and tried to reduce chatter over the network by caching.&lt;br /&gt;
** The &#039;open&#039; and &#039;close&#039; operations in AFS were critical.&lt;br /&gt;
** The &#039;close&#039; operation is as important as a &#039;commit&#039; operation in a well-designed database system. &lt;br /&gt;
* The security model of AFS is interesting: rather than a UNIX access-list-based implementation, AFS used a single sign-on system based on Kerberos. &lt;br /&gt;
** The cool thing about Kerberos is the idea of using tickets to get access.&lt;br /&gt;
* Despite having better features than NFS, AFS was not widely adopted: its administrative machinery was complex, and it required highly trained people and many days of effort to set up and maintain.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_8&amp;diff=18666</id>
		<title>DistOS 2014W Lecture 8</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_8&amp;diff=18666"/>
		<updated>2014-02-23T23:40:06Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Filled in a buncha information about NFS&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==NFS and AFS (Jan 30)==&lt;br /&gt;
&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/2008-02-11/sandberg-nfs.pdf Russel Sandberg et al., &amp;quot;Design and Implementation of the Sun Network Filesystem&amp;quot; (1985)]&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/2008-02-11/howard-afs.pdf John H. Howard et al., &amp;quot;Scale and Performance in a Distributed File System&amp;quot; (1988)]&lt;br /&gt;
&lt;br /&gt;
==NFS==&lt;br /&gt;
Group 1:&lt;br /&gt;
&lt;br /&gt;
1) per operation traffic.&lt;br /&gt;
&lt;br /&gt;
2) RPC-based. Easy to program with, but a very [http://www.joelonsoftware.com/articles/LeakyAbstractions.html leaky abstraction].&lt;br /&gt;
&lt;br /&gt;
3) unreliable&lt;br /&gt;
&lt;br /&gt;
Group 2:&lt;br /&gt;
&lt;br /&gt;
1) designed to share disks over a network, not files&lt;br /&gt;
&lt;br /&gt;
2) more UNIX like. They tried to maintain unix file semantics on the client and server side.&lt;br /&gt;
&lt;br /&gt;
3) portable. It was meant to work (as a server) across many FS types.&lt;br /&gt;
&lt;br /&gt;
4) used UDP: if request dropped, just request again.&lt;br /&gt;
&lt;br /&gt;
5) it does not minimize network traffic.&lt;br /&gt;
&lt;br /&gt;
6) used VNODE, VFS as transparent interfaces to local disks.&lt;br /&gt;
&lt;br /&gt;
7) did not require much hardware&lt;br /&gt;
&lt;br /&gt;
8) later versions took on features of AFS&lt;br /&gt;
&lt;br /&gt;
9) stateless protocol conflicts with files being stateful by nature.&lt;br /&gt;
&lt;br /&gt;
Group 3:&lt;br /&gt;
&lt;br /&gt;
1) cache assumption invalid.&lt;br /&gt;
&lt;br /&gt;
2) no dedicated locking mechanism. They couldn&#039;t decide on which locking strategy to use, so they left it up to the users of NFS to use their own separate locking service.&lt;br /&gt;
&lt;br /&gt;
3) bad security&lt;br /&gt;
&lt;br /&gt;
==AFS==&lt;br /&gt;
&lt;br /&gt;
Group 1&lt;br /&gt;
&lt;br /&gt;
1) designed for 5000 clients&lt;br /&gt;
&lt;br /&gt;
2) high integrity.&lt;br /&gt;
&lt;br /&gt;
Group 2&lt;br /&gt;
&lt;br /&gt;
1) designed to share files over a network, not disks&lt;br /&gt;
&lt;br /&gt;
2) better scalability&lt;br /&gt;
&lt;br /&gt;
3) better security.&lt;br /&gt;
&lt;br /&gt;
4) minimize network traffic.&lt;br /&gt;
&lt;br /&gt;
5) less UNIX like&lt;br /&gt;
&lt;br /&gt;
6) plugin authentication&lt;br /&gt;
&lt;br /&gt;
7) needs more kernel storage due to complex commands&lt;br /&gt;
&lt;br /&gt;
8) inode concept replaced with fid&lt;br /&gt;
&lt;br /&gt;
Group 3&lt;br /&gt;
&lt;br /&gt;
1) cache assumption valid&lt;br /&gt;
&lt;br /&gt;
2) locking&lt;br /&gt;
&lt;br /&gt;
3) good security.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Class Discussion:== &lt;br /&gt;
&lt;br /&gt;
Capturing some of Anil&#039;s Observations about NFS and AFS: &lt;br /&gt;
* NFS shares at the file level rather than the block level because block-level sharing is complicated to implement. &lt;br /&gt;
* NFS uses UDP as the transport protocol: UDP, being stateless, is in line with the NFS design philosophy of not maintaining state. &lt;br /&gt;
* Security and unreliability issues in NFS are an implication of using RPC. &lt;br /&gt;
** RPC is a nice programming model, but it is not designed for networks, where flakiness is inherent: from a programming point of view, you never expect a function call to fail to return because of a communication error.&lt;br /&gt;
* AFS designers considered the network a bottleneck and tried to reduce chatter over the network by caching.&lt;br /&gt;
** The &#039;open&#039; and &#039;close&#039; operations in AFS were critical.&lt;br /&gt;
** The &#039;close&#039; operation is as important as a &#039;commit&#039; operation in a well-designed database system. &lt;br /&gt;
* The security model of AFS is interesting: rather than a UNIX access-list-based implementation, AFS used a single sign-on system based on Kerberos. &lt;br /&gt;
** The cool thing about Kerberos is the idea of using tickets to get access.&lt;br /&gt;
* Despite having better features than NFS, AFS was not widely adopted: its administrative machinery was complex, and it required highly trained people and many days of effort to set up and maintain.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_8&amp;diff=18665</id>
		<title>DistOS 2014W Lecture 8</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_8&amp;diff=18665"/>
		<updated>2014-02-23T23:28:27Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Group 1 no longer exists&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==NFS and AFS (Jan 30)==&lt;br /&gt;
&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/2008-02-11/sandberg-nfs.pdf Russel Sandberg et al., &amp;quot;Design and Implementation of the Sun Network Filesystem&amp;quot; (1985)]&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/2008-02-11/howard-afs.pdf John H. Howard et al., &amp;quot;Scale and Performance in a Distributed File System&amp;quot; (1988)]&lt;br /&gt;
&lt;br /&gt;
==NFS==&lt;br /&gt;
Group 1:&lt;br /&gt;
&lt;br /&gt;
1) per operation traffic&lt;br /&gt;
&lt;br /&gt;
2) rpc based&lt;br /&gt;
&lt;br /&gt;
3) unreliable&lt;br /&gt;
&lt;br /&gt;
Group 2:&lt;br /&gt;
&lt;br /&gt;
1) designed to share disks over a network, not files&lt;br /&gt;
&lt;br /&gt;
2) more UNIX like&lt;br /&gt;
&lt;br /&gt;
3) portable&lt;br /&gt;
&lt;br /&gt;
4) use UDP&lt;br /&gt;
&lt;br /&gt;
5) does not minimize network traffic.&lt;br /&gt;
&lt;br /&gt;
6) used VNODE&lt;br /&gt;
&lt;br /&gt;
7) does not require much hardware&lt;br /&gt;
&lt;br /&gt;
8) later versions took on features of AFS&lt;br /&gt;
&lt;br /&gt;
9) stateless protocol conflicts with files being stateful by nature.&lt;br /&gt;
&lt;br /&gt;
Group 3:&lt;br /&gt;
&lt;br /&gt;
1) cache assumption invalid.&lt;br /&gt;
&lt;br /&gt;
2) no locking&lt;br /&gt;
&lt;br /&gt;
3) bad security&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==AFS==&lt;br /&gt;
&lt;br /&gt;
Group 1&lt;br /&gt;
&lt;br /&gt;
1) designed for 5000 clients&lt;br /&gt;
&lt;br /&gt;
2) high integrity.&lt;br /&gt;
&lt;br /&gt;
Group 2&lt;br /&gt;
&lt;br /&gt;
1) designed to share files over a network, not disks&lt;br /&gt;
&lt;br /&gt;
2) better scalability&lt;br /&gt;
&lt;br /&gt;
3) better security.&lt;br /&gt;
&lt;br /&gt;
4) minimize network traffic.&lt;br /&gt;
&lt;br /&gt;
5) less UNIX like&lt;br /&gt;
&lt;br /&gt;
6) plugin authentication&lt;br /&gt;
&lt;br /&gt;
7) needs more kernel storage due to complex commands&lt;br /&gt;
&lt;br /&gt;
8) inode concept replaced with fid&lt;br /&gt;
&lt;br /&gt;
Group 3&lt;br /&gt;
&lt;br /&gt;
1) cache assumption valid&lt;br /&gt;
&lt;br /&gt;
2) locking&lt;br /&gt;
&lt;br /&gt;
3) good security.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Class Discussion:== &lt;br /&gt;
&lt;br /&gt;
Capturing some of Anil&#039;s Observations about NFS and AFS: &lt;br /&gt;
* NFS shares at the file level rather than the block level because block-level sharing is complicated from an implementation point of view. &lt;br /&gt;
* NFS uses UDP as the transport protocol: UDP is stateless, which is in line with the NFS design philosophy of not maintaining state information. &lt;br /&gt;
* The security and unreliability issues in NFS are an implication of using RPC. &lt;br /&gt;
** RPC is a nice programming model, but it was not designed for networks, where flakiness is inherent: from a programming point of view, you never expect a function call to fail (never return) because of a communication error.&lt;br /&gt;
* AFS designers considered the network a bottleneck and tried to reduce chatter over the network through caching.&lt;br /&gt;
** &#039;open&#039; and &#039;close&#039; operations in AFS were critical&lt;br /&gt;
** the &#039;close&#039; operation is as important as a &#039;commit&#039; operation in a well-designed database system. &lt;br /&gt;
* The security model of AFS is interesting: rather than a UNIX access-list-based implementation, AFS used a single sign-on system based on Kerberos. &lt;br /&gt;
** the cool thing about Kerberos is the idea of using tickets to gain access.&lt;br /&gt;
* Despite having better features than NFS, AFS was not widely adopted: its administration was complex, requiring highly trained/skilled people and several days&#039; effort to set up and maintain.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_8&amp;diff=18664</id>
		<title>DistOS 2014W Lecture 8</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_8&amp;diff=18664"/>
		<updated>2014-02-23T23:28:00Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Combined all AFS and NFS parts together. Groups don&amp;#039;t matter.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==NFS and AFS (Jan 30)==&lt;br /&gt;
&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/2008-02-11/sandberg-nfs.pdf Russel Sandberg et al., &amp;quot;Design and Implementation of the Sun Network Filesystem&amp;quot; (1985)]&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/2008-02-11/howard-afs.pdf John H. Howard et al., &amp;quot;Scale and Performance in a Distributed File System&amp;quot; (1988)]&lt;br /&gt;
&lt;br /&gt;
==Group 1==&lt;br /&gt;
&lt;br /&gt;
==NFS==&lt;br /&gt;
Group 1:&lt;br /&gt;
&lt;br /&gt;
1) per operation traffic&lt;br /&gt;
&lt;br /&gt;
2) rpc based&lt;br /&gt;
&lt;br /&gt;
3) unreliable&lt;br /&gt;
&lt;br /&gt;
Group 2:&lt;br /&gt;
&lt;br /&gt;
1) designed to share disks over a network, not files&lt;br /&gt;
&lt;br /&gt;
2) more UNIX like&lt;br /&gt;
&lt;br /&gt;
3) portable&lt;br /&gt;
&lt;br /&gt;
4) use UDP&lt;br /&gt;
&lt;br /&gt;
5) does not minimize network traffic.&lt;br /&gt;
&lt;br /&gt;
6) used VNODE&lt;br /&gt;
&lt;br /&gt;
7) does not require much hardware&lt;br /&gt;
&lt;br /&gt;
8) later versions took on features of AFS&lt;br /&gt;
&lt;br /&gt;
9) stateless protocol conflicts with files being stateful by nature.&lt;br /&gt;
&lt;br /&gt;
Group 3:&lt;br /&gt;
&lt;br /&gt;
1) cache assumption invalid.&lt;br /&gt;
&lt;br /&gt;
2) no locking&lt;br /&gt;
&lt;br /&gt;
3) bad security&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==AFS==&lt;br /&gt;
&lt;br /&gt;
Group 1&lt;br /&gt;
&lt;br /&gt;
1) designed for 5000 clients&lt;br /&gt;
&lt;br /&gt;
2) high integrity.&lt;br /&gt;
&lt;br /&gt;
Group 2&lt;br /&gt;
&lt;br /&gt;
1) designed to share files over a network, not disks&lt;br /&gt;
&lt;br /&gt;
2) better scalability&lt;br /&gt;
&lt;br /&gt;
3) better security.&lt;br /&gt;
&lt;br /&gt;
4) minimize network traffic.&lt;br /&gt;
&lt;br /&gt;
5) less UNIX like&lt;br /&gt;
&lt;br /&gt;
6) plugin authentication&lt;br /&gt;
&lt;br /&gt;
7) needs more kernel storage due to complex commands&lt;br /&gt;
&lt;br /&gt;
8) inode concept replaced with fid&lt;br /&gt;
&lt;br /&gt;
Group 3&lt;br /&gt;
&lt;br /&gt;
1) cache assumption valid&lt;br /&gt;
&lt;br /&gt;
2) locking&lt;br /&gt;
&lt;br /&gt;
3) good security.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Class Discussion:== &lt;br /&gt;
&lt;br /&gt;
Capturing some of Anil&#039;s Observations about NFS and AFS: &lt;br /&gt;
* NFS shares at the file level rather than the block level because block-level sharing is complicated from an implementation point of view. &lt;br /&gt;
* NFS uses UDP as the transport protocol: UDP is stateless, which is in line with the NFS design philosophy of not maintaining state information. &lt;br /&gt;
* The security and unreliability issues in NFS are an implication of using RPC. &lt;br /&gt;
** RPC is a nice programming model, but it was not designed for networks, where flakiness is inherent: from a programming point of view, you never expect a function call to fail (never return) because of a communication error.&lt;br /&gt;
* AFS designers considered the network a bottleneck and tried to reduce chatter over the network through caching.&lt;br /&gt;
** &#039;open&#039; and &#039;close&#039; operations in AFS were critical&lt;br /&gt;
** the &#039;close&#039; operation is as important as a &#039;commit&#039; operation in a well-designed database system. &lt;br /&gt;
* The security model of AFS is interesting: rather than a UNIX access-list-based implementation, AFS used a single sign-on system based on Kerberos. &lt;br /&gt;
** the cool thing about Kerberos is the idea of using tickets to gain access.&lt;br /&gt;
* Despite having better features than NFS, AFS was not widely adopted: its administration was complex, requiring highly trained/skilled people and several days&#039; effort to set up and maintain.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_8&amp;diff=18663</id>
		<title>DistOS 2014W Lecture 8</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_8&amp;diff=18663"/>
		<updated>2014-02-23T23:24:16Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Removed unused group 4. Formatted discussion&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==NFS and AFS (Jan 30)==&lt;br /&gt;
&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/2008-02-11/sandberg-nfs.pdf Russel Sandberg et al., &amp;quot;Design and Implementation of the Sun Network Filesystem&amp;quot; (1985)]&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/2008-02-11/howard-afs.pdf John H. Howard et al., &amp;quot;Scale and Performance in a Distributed File System&amp;quot; (1988)]&lt;br /&gt;
&lt;br /&gt;
==Group 1==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NFS:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
1) per operation traffic&lt;br /&gt;
&lt;br /&gt;
2) rpc based&lt;br /&gt;
&lt;br /&gt;
3) unreliable&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;AFS:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
1) designed for 5000 clients&lt;br /&gt;
&lt;br /&gt;
2) high integrity.&lt;br /&gt;
&lt;br /&gt;
==Group 2==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NFS:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
1) designed to share disks over a network, not files&lt;br /&gt;
&lt;br /&gt;
2) more UNIX like&lt;br /&gt;
&lt;br /&gt;
3) portable&lt;br /&gt;
&lt;br /&gt;
4) use UDP&lt;br /&gt;
&lt;br /&gt;
5) does not minimize network traffic.&lt;br /&gt;
&lt;br /&gt;
6) used VNODE&lt;br /&gt;
&lt;br /&gt;
7) does not require much hardware&lt;br /&gt;
&lt;br /&gt;
8) later versions took on features of AFS&lt;br /&gt;
&lt;br /&gt;
9) stateless protocol conflicts with files being stateful by nature.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;AFS:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
1) designed to share files over a network, not disks&lt;br /&gt;
&lt;br /&gt;
2) better scalability&lt;br /&gt;
&lt;br /&gt;
3) better security.&lt;br /&gt;
&lt;br /&gt;
4) minimize network traffic.&lt;br /&gt;
&lt;br /&gt;
5) less UNIX like&lt;br /&gt;
&lt;br /&gt;
6) plugin authentication&lt;br /&gt;
&lt;br /&gt;
7) needs more kernel storage due to complex commands&lt;br /&gt;
&lt;br /&gt;
8) inode concept replaced with fid&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Group 3==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NFS:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
1) cache assumption invalid.&lt;br /&gt;
&lt;br /&gt;
2) no locking&lt;br /&gt;
&lt;br /&gt;
3) bad security&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;AFS:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
1) cache assumption valid&lt;br /&gt;
&lt;br /&gt;
2) locking&lt;br /&gt;
&lt;br /&gt;
3) good security.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Class Discussion:== &lt;br /&gt;
&lt;br /&gt;
Capturing some of Anil&#039;s Observations about NFS and AFS: &lt;br /&gt;
* NFS shares at the file level rather than the block level because block-level sharing is complicated from an implementation point of view. &lt;br /&gt;
* NFS uses UDP as the transport protocol: UDP is stateless, which is in line with the NFS design philosophy of not maintaining state information. &lt;br /&gt;
* The security and unreliability issues in NFS are an implication of using RPC. &lt;br /&gt;
** RPC is a nice programming model, but it was not designed for networks, where flakiness is inherent: from a programming point of view, you never expect a function call to fail (never return) because of a communication error.&lt;br /&gt;
* AFS designers considered the network a bottleneck and tried to reduce chatter over the network through caching.&lt;br /&gt;
** &#039;open&#039; and &#039;close&#039; operations in AFS were critical&lt;br /&gt;
** the &#039;close&#039; operation is as important as a &#039;commit&#039; operation in a well-designed database system. &lt;br /&gt;
* The security model of AFS is interesting: rather than a UNIX access-list-based implementation, AFS used a single sign-on system based on Kerberos. &lt;br /&gt;
** the cool thing about Kerberos is the idea of using tickets to gain access.&lt;br /&gt;
* Despite having better features than NFS, AFS was not widely adopted: its administration was complex, requiring highly trained/skilled people and several days&#039; effort to set up and maintain.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_10&amp;diff=18662</id>
		<title>DistOS 2014W Lecture 10</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_10&amp;diff=18662"/>
		<updated>2014-02-23T19:50:56Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Clean up some more and added info about the interface and random accesss stuff&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==GFS and Ceph (Feb. 4)==&lt;br /&gt;
* [http://research.google.com/archive/gfs-sosp2003.pdf Sanjay Ghemawat et al., &amp;quot;The Google File System&amp;quot; (SOSP 2003)]&lt;br /&gt;
* [http://www.usenix.org/events/osdi06/tech/weil.html Weil et al., Ceph: A Scalable, High-Performance Distributed File System (OSDI 2006)].&lt;br /&gt;
&lt;br /&gt;
== GFS ==&lt;br /&gt;
GFS is a distributed file system designed specifically for Google&#039;s needs. Two assumptions drove its design:&lt;br /&gt;
&lt;br /&gt;
# Most data is written as appends (writes at the end of a file).&lt;br /&gt;
# Data is read in a streaming fashion (large amounts of sequential access).&lt;br /&gt;
&lt;br /&gt;
Because of this, they emphasized performance for sequential access. These two assumptions are also why they chose such a huge chunk size (64 MB): large blocks are easy to read once random access is ruled out. Once data is written, it is rarely written &#039;&#039;&#039;over&#039;&#039;&#039; using random access.&lt;br /&gt;
&lt;br /&gt;
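The 64 MB chunk size can be made concrete with a little arithmetic. A minimal sketch (illustrative only; not Google&#039;s actual client API) of how a byte offset maps to a chunk:

```python
# Illustrative sketch (not Google's actual API): with 64 MB chunks,
# a file byte offset maps to a chunk index plus an offset within that
# chunk. Sequential reads touch few chunks, so few master lookups.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as in the GFS paper

def chunk_address(byte_offset):
    """Return (chunk_index, offset_within_chunk) for a byte offset."""
    return divmod(byte_offset, CHUNK_SIZE)

# A 1 GB sequential read spans only 16 chunks:
first = chunk_address(0)[0]
last = chunk_address(1024 * 1024 * 1024 - 1)[0]
print(last - first + 1)  # 16
```

A streaming client therefore contacts the master once per 64 MB, not once per read.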
* Very different design because of the workload that it is designed for:&lt;br /&gt;
** Because of the number of small files that have to be indexed for the web, it is no longer practical to have a file system that stores these individually. Too much overhead. Easier to store millions of objects as large files. Punts problem to userspace, incl. record delimitation.&lt;br /&gt;
* Don&#039;t care about latency&lt;br /&gt;
** surprising considering it&#039;s Google, the guys who change the TCP IW standard recommendations for latency.&lt;br /&gt;
* Mostly seeking (sequentially) through entire file.&lt;br /&gt;
* Paper from 2003, mentions still using 100BASE-T links.&lt;br /&gt;
* Data-heavy, metadata light. Contacting the metadata server is a rare event.&lt;br /&gt;
* Consider hardware failures as normal operating conditions:&lt;br /&gt;
** uses commodity hardware&lt;br /&gt;
** All the replication (!)&lt;br /&gt;
** Data checksumming&lt;br /&gt;
* Performance degrades for small random access workload; use other filesystem.&lt;br /&gt;
* Path of least resistance to scale, not to do something super CS-smart.&lt;br /&gt;
* Google used to re-index every month, swapping out indexes. Now, it&#039;s much more online. GFS is now just a layer to support a more dynamic layer.&lt;br /&gt;
* The paper seems to lack any mention of security. This FS probably could only exist on a trusted network.&lt;br /&gt;
* Implements interface similar to POSIX, but not the full standard.&lt;br /&gt;
** &#039;&#039;&#039;create, delete, open, close, read, write&#039;&#039;&#039;&lt;br /&gt;
** Unique operations too: &#039;&#039;&#039;snapshot&#039;&#039;&#039; which is low cost file duplication and &#039;&#039;&#039;record append&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== How other filesystems compare to GFS and Ceph ==&lt;br /&gt;
&lt;br /&gt;
* Other File Systems: AFS, NFS, Plan 9, traditional Unix&lt;br /&gt;
&lt;br /&gt;
* Data and metadata are held together.&lt;br /&gt;
** They did not optimize for different access patterns:&lt;br /&gt;
*** Data → big, long transfers&lt;br /&gt;
*** Metadata → small, low latency&lt;br /&gt;
** Can&#039;t scale separately&lt;br /&gt;
&lt;br /&gt;
* Designed for lower latency&lt;br /&gt;
&lt;br /&gt;
* (Mostly) designed for POSIX semantics&lt;br /&gt;
** how the requirements that lead to the ‘standard’ evolved&lt;br /&gt;
&lt;br /&gt;
* Assumed that a file is a fraction of the size of a server&lt;br /&gt;
** eg. files on a Unix system were meant to be text files.&lt;br /&gt;
** Huge files spread over many servers not even in the cards for NFS&lt;br /&gt;
** Meant for small problems, not web-scale&lt;br /&gt;
*** Google has a copy of the publicly accessible internet&lt;br /&gt;
**** Their strategy is to copy the internet to index it&lt;br /&gt;
**** Insane → insane filesystem&lt;br /&gt;
**** One file may span multiple servers&lt;br /&gt;
&lt;br /&gt;
* Even mainframes, scale-up solutions, ultra-reliable systems, with data sets bigger than RAM don&#039;t have the scale of GFS or CEPH.&lt;br /&gt;
&lt;br /&gt;
* Point-to-point access; much less load-balancing, even in AFS&lt;br /&gt;
** One server to service multiple clients.&lt;br /&gt;
** Single point of entry, single point of failure, bottleneck&lt;br /&gt;
&lt;br /&gt;
* Less focus on fault tolerance&lt;br /&gt;
** No notion of data replication.&lt;br /&gt;
&lt;br /&gt;
* Reliability was a property of the host, not the network&lt;br /&gt;
&lt;br /&gt;
==Ceph==&lt;br /&gt;
&lt;br /&gt;
* Ceph is crazy and tries to do everything&lt;br /&gt;
* GFS was very specifically designed to work in a limited scenario, under certain specific conditions, whereas Ceph is a more generic solution for how to build a scalable distributed file system&lt;br /&gt;
&lt;br /&gt;
* Achieves high performance, reliability, and availability through three design features: decoupled data and metadata, dynamically distributed metadata, and reliable autonomic distributed object storage.&lt;br /&gt;
** Decoupled data and metadata: metadata operations (open, close) go to the metadata cluster; clients interact directly with OSDs for I/O.&lt;br /&gt;
** Distributed metadata: metadata operations make up a large fraction of the workload. Ceph distributes this workload across many metadata servers (MDSes), which maintain the file hierarchy.&lt;br /&gt;
** Autonomic object storage: OSDs organise amongst themselves, taking advantage of their onboard CPU and memory. Ceph delegates data migration, replication, failure detection, and recovery to the cluster of OSDs.&lt;br /&gt;
&lt;br /&gt;
* Distributed Meta Data&lt;br /&gt;
** Unlike GFS&lt;br /&gt;
** Clusters of MDSes.&lt;br /&gt;
** Utilizes Dynamic Subtree partitioning: Dynamically mapped subtrees of directories to MDSes. Workloads for every subtree are monitored. Subtrees assigned to MDSes accordingly, in a coarse way.&lt;br /&gt;
&lt;br /&gt;
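The dynamic subtree partitioning above can be sketched as a greedy balancer (hypothetical load numbers; the real Ceph MDS balancer is far more sophisticated):

```python
# Illustrative sketch of dynamic subtree partitioning (hypothetical load
# numbers; the real Ceph MDS balancer is far more sophisticated).
# Whole directory subtrees are assigned to metadata servers (MDSes)
# based on observed load, coarsely balancing the metadata workload.

subtree_load = {"/home": 900, "/var/log": 400, "/usr": 150, "/tmp": 50}
mdses = {"mds.a": 0, "mds.b": 0}
assignment = {}

# Greedy: hand each subtree (heaviest first) to the least-loaded MDS.
for subtree, load in sorted(subtree_load.items(), key=lambda kv: -kv[1]):
    target = min(mdses, key=mdses.get)
    assignment[subtree] = target
    mdses[target] += load

print(assignment)
```

Because whole subtrees move, the mapping stays coarse: a client only needs to know which MDS owns a subtree, not a per-file table.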
* Near-Posix like interface: selectively extend interface while relaxing consistency semantics.&lt;br /&gt;
** ex: &#039;&#039;readdirplus&#039;&#039; is an extension which optimizes for a common sequence of operations: &#039;&#039;readdir&#039;&#039; followed by multiple &#039;&#039;stats&#039;&#039;. It relies on brief caching to improve performance, which may let small concurrent changes go unnoticed.&lt;br /&gt;
* Object Storage Devices (OSDs) have some intelligence (unlike GFS), and autonomously distribute the data, rather than being controlled by a master.&lt;br /&gt;
** Uses EBOFS (instead of ext3). Implemented in user space to avoid dealing with kernel issues. Aggressively schedules disk writes.&lt;br /&gt;
** Uses hashing in the distribution process to &#039;&#039;&#039;uniformly&#039;&#039;&#039; distribute data&lt;br /&gt;
** The actual algorithm for distributing data is as follows:&lt;br /&gt;
*** file + offset → hash(object ID) → CRUSH(placement group) → OSD&lt;br /&gt;
** Each client has knowledge of the entire storage network&lt;br /&gt;
** Tracks failure groups (same breaker, switch, etc.), hot data, etc.&lt;br /&gt;
** Number of replicas is changeable on the fly, but the placement group is not&lt;br /&gt;
*** For example, if every client on the planet is accessing the same file, you can scale out for that data.&lt;br /&gt;
** You don&#039;t ask where to go, you just go, which makes this very scalable&lt;br /&gt;
&lt;br /&gt;
Any distributed file system that aims to be scalable needs to cut down on the number of messages floating around (as opposed to the actual data transfer), which is what Ceph does with the CRUSH function. A client or OSD just needs to know the CRUSH algorithm (function) and can find the location of a file on its own (instead of asking a master server), eliminating the traditional file-allocation-list approach. &lt;br /&gt;
&lt;br /&gt;
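The placement pipeline above can be sketched with plain hashing (a toy stand-in for CRUSH, not the real algorithm; OSD names and counts are made up). The point is that any client computes an object&#039;s OSDs locally, with no master lookup:

```python
# Toy stand-in for CRUSH (not the real algorithm; OSD names are made up).
# Any client can compute an object's replica locations from the object
# name alone, so the data path never consults a master server.
import hashlib

NUM_PGS = 128                              # placement groups (fixed)
OSDS = ["osd.%d" % i for i in range(12)]   # hypothetical cluster

def placement(object_id, replicas=3):
    """object ID to placement group to an ordered list of OSDs."""
    h = int(hashlib.sha1(object_id.encode()).hexdigest(), 16)
    pg = h % NUM_PGS                       # object hashes to a placement group
    # Rank OSDs deterministically per placement group (pseudo-random order).
    key = lambda osd: hashlib.sha1(("%d/%s" % (pg, osd)).encode()).hexdigest()
    return sorted(OSDS, key=key)[:replicas]

# Every client computes the same answer, with no lookup round-trip:
assert placement("file1/chunk0") == placement("file1/chunk0")
```

Unlike this toy, real CRUSH also respects failure domains (breaker, switch, rack) and adjusts placement smoothly when OSDs join or leave.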
* CRUSH is sufficiently advanced to be called magic.&lt;br /&gt;
** O(log n) of the size of the data&lt;br /&gt;
** CPUs stupidly fast, so the above is of minimal overhead&lt;br /&gt;
*** the network, despite being fast, has latency, etc. &lt;br /&gt;
*** Computation scales much better than communication.&lt;br /&gt;
&lt;br /&gt;
* Storage is composed of variable-length atoms&lt;br /&gt;
&lt;br /&gt;
= Class Discussion = &lt;br /&gt;
&lt;br /&gt;
== File Size ==&lt;br /&gt;
In Anil’s opinion, “how does the file system size compare to the server storage size?” is a key question that distinguishes the GFS and Ceph designs from the early file systems NFS, AFS, and Plan 9. In the early designs, the file system was a fraction of the server storage size, whereas in GFS and Ceph the file system can be several orders of magnitude larger than any one server. &lt;br /&gt;
&lt;br /&gt;
== Segue on drives and sequential access following GFS section ==&lt;br /&gt;
&lt;br /&gt;
* Structure of GFS does match some other modern systems:&lt;br /&gt;
** Hard drives are like parallel tapes, very suited for streaming.&lt;br /&gt;
** Flash devices are log-structured too, but have an abstracting firmware.&lt;br /&gt;
*** They do erasure in bulk, in the &#039;&#039;&#039;background&#039;&#039;&#039;. &lt;br /&gt;
*** Used to be we needed specialized FS for [http://en.wikipedia.org/wiki/Memory_Technology_Device MTDs] to get better performance; though now we have better micro-controllers in some embedded systems to abstract away the hardware.&lt;br /&gt;
* Architectures that start big, often end up in the smallest things.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Lookups vs hashing ==&lt;br /&gt;
One key aspect in the Ceph design is the attempt to replace communication with computation by using hashing based mechanism CRUSH. Following line from Anil epitomizes the general approach that is followed in the field of Computer Science “If one abstraction does not work stick another one in”.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_10&amp;diff=18661</id>
		<title>DistOS 2014W Lecture 10</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_10&amp;diff=18661"/>
		<updated>2014-02-23T19:39:36Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: woops. sequential =&amp;gt; random&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==GFS and Ceph (Feb. 4)==&lt;br /&gt;
* [http://research.google.com/archive/gfs-sosp2003.pdf Sanjay Ghemawat et al., &amp;quot;The Google File System&amp;quot; (SOSP 2003)]&lt;br /&gt;
* [http://www.usenix.org/events/osdi06/tech/weil.html Weil et al., Ceph: A Scalable, High-Performance Distributed File System (OSDI 2006)].&lt;br /&gt;
&lt;br /&gt;
== GFS ==&lt;br /&gt;
GFS is a distributed file system designed specifically for Google&#039;s needs. Two assumptions drove its design:&lt;br /&gt;
&lt;br /&gt;
# Most data is written as appends (writes at the end of a file).&lt;br /&gt;
# Data is read in a streaming fashion (large amounts of sequential access).&lt;br /&gt;
&lt;br /&gt;
Because of this, they emphasized performance for sequential access. These two assumptions are also why they chose such a huge chunk size (64 MB): large blocks are easy to read once random access is ruled out.&lt;br /&gt;
&lt;br /&gt;
* Very different design because of the workload that it is designed for:&lt;br /&gt;
** Because of the number of small files that have to be indexed for the web, it is no longer practical to have a file system that stores these individually. Too much overhead. Punts problem to userspace, incl. record delimitation.&lt;br /&gt;
* Don&#039;t care about latency, surprising considering it&#039;s Google, the guys who change the TCP IW standard recommendations for latency.&lt;br /&gt;
* Mostly seeking (sequentially) through entire file.&lt;br /&gt;
* Paper from 2003, mentions still using 100BASE-T links.&lt;br /&gt;
* Data-heavy, metadata light. Contacting the metadata server is a rare event.&lt;br /&gt;
* Consider hardware failures as normal operating conditions:&lt;br /&gt;
** uses commodity hardware&lt;br /&gt;
** All the replication (!)&lt;br /&gt;
** Data checksumming&lt;br /&gt;
* Performance degrades for small random access workload; use other filesystem.&lt;br /&gt;
* Path of least resistance to scale, not to do something super CS-smart.&lt;br /&gt;
* Google used to re-index every month, swapping out indexes. Now, it&#039;s much more online. GFS is now just a layer to support a more dynamic layer.&lt;br /&gt;
* The paper seems to lack any mention of security. This FS probably could only exist on a trusted network.&lt;br /&gt;
&lt;br /&gt;
== How other filesystems compare to GFS and Ceph ==&lt;br /&gt;
&lt;br /&gt;
* Other File Systems: AFS, NFS, Plan 9, traditional Unix&lt;br /&gt;
&lt;br /&gt;
* Data and metadata are held together.&lt;br /&gt;
** They did not optimize for different access patterns:&lt;br /&gt;
*** Data → big, long transfers&lt;br /&gt;
*** Metadata → small, low latency&lt;br /&gt;
** Can&#039;t scale separately&lt;br /&gt;
&lt;br /&gt;
* Designed for lower latency&lt;br /&gt;
&lt;br /&gt;
* (Mostly) designed for POSIX semantics&lt;br /&gt;
** how the requirements that lead to the ‘standard’ evolved&lt;br /&gt;
&lt;br /&gt;
* Assumed that a file is a fraction of the size of a server&lt;br /&gt;
** eg. files on a Unix system were meant to be text files.&lt;br /&gt;
** Huge files spread over many servers not even in the cards for NFS&lt;br /&gt;
** Meant for small problems, not web-scale&lt;br /&gt;
*** Google has a copy of the publicly accessible internet&lt;br /&gt;
**** Their strategy is to copy the internet to index it&lt;br /&gt;
**** Insane → insane filesystem&lt;br /&gt;
**** One file may span multiple servers&lt;br /&gt;
&lt;br /&gt;
* Even mainframes, scale-up solutions, ultra-reliable systems, with data sets bigger than RAM don&#039;t have the scale of GFS or CEPH.&lt;br /&gt;
&lt;br /&gt;
* Point-to-point access; much less load-balancing, even in AFS&lt;br /&gt;
** One server to service multiple clients.&lt;br /&gt;
** Single point of entry, single point of failure, bottleneck&lt;br /&gt;
&lt;br /&gt;
* Less focus on fault tolerance&lt;br /&gt;
** No notion of data replication.&lt;br /&gt;
&lt;br /&gt;
* Reliability was a property of the host, not the network&lt;br /&gt;
&lt;br /&gt;
==Ceph==&lt;br /&gt;
&lt;br /&gt;
* Ceph is crazy and tries to do everything&lt;br /&gt;
* GFS was very specifically designed to work in a limited scenario, under certain specific conditions, whereas Ceph is a more generic solution for how to build a scalable distributed file system&lt;br /&gt;
&lt;br /&gt;
* Achieves high performance, reliability, and availability through three design features: decoupled data and metadata, dynamically distributed metadata, and reliable autonomic distributed object storage.&lt;br /&gt;
** Decoupled data and metadata: metadata operations (open, close) go to the metadata cluster; clients interact directly with OSDs for I/O.&lt;br /&gt;
** Distributed metadata: metadata operations make up a large fraction of the workload. Ceph distributes this workload across many metadata servers (MDSes), which maintain the file hierarchy.&lt;br /&gt;
** Autonomic object storage: OSDs organise amongst themselves, taking advantage of their onboard CPU and memory. Ceph delegates data migration, replication, failure detection, and recovery to the cluster of OSDs.&lt;br /&gt;
&lt;br /&gt;
* Distributed Meta Data&lt;br /&gt;
** Unlike GFS&lt;br /&gt;
** Clusters of MDSes.&lt;br /&gt;
** Utilizes Dynamic Subtree partitioning: Dynamically mapped subtrees of directories to MDSes. Workloads for every subtree are monitored. Subtrees assigned to MDSes accordingly, in a coarse way.&lt;br /&gt;
&lt;br /&gt;
* Near-Posix like interface: selectively extend interface while relaxing consistency semantics.&lt;br /&gt;
** ex: &#039;&#039;readdirplus&#039;&#039; is an extension which optimizes for a common sequence of operations: &#039;&#039;readdir&#039;&#039; followed by multiple &#039;&#039;stats&#039;&#039;. It relies on brief caching to improve performance, which may let small concurrent changes go unnoticed.&lt;br /&gt;
* Object Storage Devices (OSDs) have some intelligence (unlike GFS), and autonomously distribute the data, rather than being controlled by a master.&lt;br /&gt;
** Uses EBOFS (instead of ext3). Implemented in user space to avoid dealing with kernel issues. Aggressively schedules disk writes.&lt;br /&gt;
** Uses hashing in the distribution process to &#039;&#039;&#039;uniformly&#039;&#039;&#039; distribute data&lt;br /&gt;
** The actual algorithm for distributing data is as follows:&lt;br /&gt;
*** file + offset → hash(object ID) → CRUSH(placement group) → OSD&lt;br /&gt;
** Each client has knowledge of the entire storage network&lt;br /&gt;
** Tracks failure groups (same breaker, switch, etc.), hot data, etc.&lt;br /&gt;
** Number of replicas is changeable on the fly, but the placement group is not&lt;br /&gt;
*** For example, if every client on the planet is accessing the same file, you can scale out for that data.&lt;br /&gt;
** You don&#039;t ask where to go, you just go, which makes this very scalable&lt;br /&gt;
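The mapping above can be sketched in Python. This is an illustrative stand-in, not Ceph&#039;s real CRUSH: the constants, OSD names, and the hash-based placeholder for the CRUSH step are all assumptions.&lt;br /&gt;

```python
# Hypothetical sketch of the Ceph-style placement pipeline:
#   file + offset -> object ID -> hash -> placement group -> CRUSH -> OSDs
import hashlib

NUM_PGS = 128                              # assumed placement-group count
OSDS = ["osd.%d" % i for i in range(12)]   # assumed cluster of 12 OSDs
REPLICAS = 3

def object_id(file_id, offset, stripe=4 * 2**20):
    # Each stripe of a file becomes one named object.
    return "%s.%d" % (file_id, offset // stripe)

def placement_group(oid):
    # hash(object ID) -> placement group.
    h = int.from_bytes(hashlib.sha1(oid.encode()).digest()[:8], "big")
    return h % NUM_PGS

def crush(pg, replicas=REPLICAS):
    # Deterministic pseudo-random OSD choice standing in for real CRUSH:
    # any client computes the same answer without asking a master.
    chosen, i = [], 0
    while len(chosen) != replicas:
        h = int.from_bytes(
            hashlib.sha1(("%d:%d" % (pg, i)).encode()).digest()[:8], "big")
        osd = OSDS[h % len(OSDS)]
        if osd not in chosen:
            chosen.append(osd)
        i += 1
    return chosen
```

Two independent clients calling crush(placement_group(object_id("inode42", 0))) compute the same replica list with no lookup traffic, which is exactly the "you just go" property.&lt;br /&gt;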
&lt;br /&gt;
Any distributed file system that aims to be scalable needs to cut down on the number of control messages flying around, rather than on the actual data transfer, which is what Ceph does with the CRUSH function. A client or OSD only needs to know the CRUSH algorithm to compute the location of a file on its own (instead of asking a master server about it), which eliminates the traditional file-allocation-list approach. &lt;br /&gt;
&lt;br /&gt;
* CRUSH is sufficiently advanced to be called magic.&lt;br /&gt;
** O(log n) in the size of the data&lt;br /&gt;
** CPUs are stupidly fast, so the above is minimal overhead&lt;br /&gt;
*** the network, despite being fast, has latency, etc. &lt;br /&gt;
*** Computation scales much better than communication.&lt;br /&gt;
&lt;br /&gt;
* Storage is composed of variable-length atoms&lt;br /&gt;
&lt;br /&gt;
= Class Discussion = &lt;br /&gt;
&lt;br /&gt;
== File Size ==&lt;br /&gt;
In Anil’s opinion, “how does the file system size compare to the server storage size?” is a key parameter that distinguishes the GFS and Ceph designs from the early file systems NFS, AFS, and Plan 9. In the early file system designs, the file system was a fraction of the server&#039;s storage size, whereas in GFS and Ceph the file system can be orders of magnitude larger than any single server. &lt;br /&gt;
&lt;br /&gt;
== Segue on drives and sequential access following GFS section ==&lt;br /&gt;
&lt;br /&gt;
* Structure of GFS does match some other modern systems:&lt;br /&gt;
** Hard drives are like parallel tapes, very suited for streaming.&lt;br /&gt;
** Flash devices are log-structured too, but have an abstracting firmware.&lt;br /&gt;
*** They do erasure in bulk, in the &#039;&#039;&#039;background&#039;&#039;&#039;. &lt;br /&gt;
*** Used to be we needed specialized FS for [http://en.wikipedia.org/wiki/Memory_Technology_Device MTDs] to get better performance; though now we have better micro-controllers in some embedded systems to abstract away the hardware.&lt;br /&gt;
* Architectures that start big, often end up in the smallest things.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Lookups vs hashing ==&lt;br /&gt;
One key aspect of the Ceph design is the attempt to replace communication with computation by using the hashing-based mechanism CRUSH. The following line from Anil epitomizes the general approach followed in the field of Computer Science: “If one abstraction does not work, stick another one in”.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_10&amp;diff=18660</id>
		<title>DistOS 2014W Lecture 10</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_10&amp;diff=18660"/>
		<updated>2014-02-23T19:25:44Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Added line about EBOFS&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==GFS and Ceph (Feb. 4)==&lt;br /&gt;
* [http://research.google.com/archive/gfs-sosp2003.pdf Sanjay Ghemawat et al., &amp;quot;The Google File System&amp;quot; (SOSP 2003)]&lt;br /&gt;
* [http://www.usenix.org/events/osdi06/tech/weil.html Weil et al., Ceph: A Scalable, High-Performance Distributed File System (OSDI 2006)].&lt;br /&gt;
&lt;br /&gt;
== GFS ==&lt;br /&gt;
GFS is a distributed file system designed specifically for Google&#039;s needs, and two assumptions were made while designing it:&lt;br /&gt;
&lt;br /&gt;
# Most of the data is written in the form of appends (writes at the end of a file). &lt;br /&gt;
# Data is read from files in a streaming fashion (lots of data read via sequential access). &lt;br /&gt;
&lt;br /&gt;
Because of this, they decided to emphasize performance for sequential access. These two assumptions are also why they chose such a huge chunk size (64 MB): you can easily read large blocks if you stick to sequential access.&lt;br /&gt;
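The 64 MB chunk size makes the offset-to-chunk arithmetic trivial. A small illustrative sketch (not Google&#039;s code) of how few chunks a large sequential read touches:&lt;br /&gt;

```python
# Illustrative GFS-style chunk arithmetic (64 MB chunks, per the paper).
CHUNK_SIZE = 64 * 2**20

def chunk_index(offset):
    # A byte offset maps to exactly one chunk.
    return offset // CHUNK_SIZE

def chunks_for_read(offset, length):
    # Chunk indexes touched by a sequential read of `length` bytes.
    return list(range(chunk_index(offset),
                      chunk_index(offset + length - 1) + 1))
```

A 256 MB streaming read starting at offset 0 touches only chunks [0, 1, 2, 3], so the master is consulted for just four chunk locations; this is why contacting the metadata server stays a rare event.&lt;br /&gt;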
&lt;br /&gt;
* Very different design because of the workload that it is designed for:&lt;br /&gt;
** Because of the number of small files that have to be indexed for the web, it is no longer practical to have a file system that stores these individually. Too much overhead. Punts problem to userspace, incl. record delimitation.&lt;br /&gt;
* Don&#039;t care about latency, surprising considering it&#039;s Google, the guys who change the TCP IW standard recommendations for latency.&lt;br /&gt;
* Mostly seeking (sequentially) through entire file.&lt;br /&gt;
* Paper from 2003, mentions still using 100BASE-T links.&lt;br /&gt;
* Data-heavy, metadata light. Contacting the metadata server is a rare event.&lt;br /&gt;
* Consider hardware failures as normal operating conditions:&lt;br /&gt;
** uses commodity hardware&lt;br /&gt;
** All the replication (!)&lt;br /&gt;
** Data checksumming&lt;br /&gt;
* Performance degrades for small random access workload; use other filesystem.&lt;br /&gt;
* Path of least resistance to scale, not to do something super CS-smart.&lt;br /&gt;
* Google used to re-index every month, swapping out indexes. Now, it&#039;s much more online. GFS is now just a layer to support a more dynamic layer.&lt;br /&gt;
* The paper seems to lack any mention of security. This FS probably could only exist on a trusted network.&lt;br /&gt;
&lt;br /&gt;
== How other filesystems compare to GFS and Ceph ==&lt;br /&gt;
&lt;br /&gt;
* Other File Systems: AFS, NFS, Plan 9, traditional Unix&lt;br /&gt;
&lt;br /&gt;
* Data and metadata are held together.&lt;br /&gt;
** They did not optimize for different access patterns:&lt;br /&gt;
*** Data → big, long transfers&lt;br /&gt;
*** Metadata → small, low latency&lt;br /&gt;
** Can&#039;t scale separately&lt;br /&gt;
&lt;br /&gt;
* Designed for lower latency&lt;br /&gt;
&lt;br /&gt;
* (Mostly) designed for POSIX semantics&lt;br /&gt;
** how the requirements that led to the ‘standard’ evolved&lt;br /&gt;
&lt;br /&gt;
* Assumed that a file is a fraction of the size of a server&lt;br /&gt;
** eg. files on a Unix system were meant to be text files.&lt;br /&gt;
** Huge files spread over many servers not even in the cards for NFS&lt;br /&gt;
** Meant for small problems, not web-scale&lt;br /&gt;
*** Google has a copy of the publicly accessible internet&lt;br /&gt;
**** Their strategy is to copy the internet to index it&lt;br /&gt;
**** Insane → insane filesystem&lt;br /&gt;
**** One file may span multiple servers&lt;br /&gt;
&lt;br /&gt;
* Even mainframes, scale-up solutions, ultra-reliable systems, with data sets bigger than RAM don&#039;t have the scale of GFS or CEPH.&lt;br /&gt;
&lt;br /&gt;
* Point-to-point access; much less load-balancing, even in AFS&lt;br /&gt;
** One server to service multiple clients.&lt;br /&gt;
** Single point of entry, single point of failure, bottleneck&lt;br /&gt;
&lt;br /&gt;
* Less focus on fault tolerance&lt;br /&gt;
** No notion of data replication.&lt;br /&gt;
&lt;br /&gt;
* Reliability was a property of the host, not the network&lt;br /&gt;
&lt;br /&gt;
==Ceph==&lt;br /&gt;
&lt;br /&gt;
* Ceph is crazy and tries to do everything&lt;br /&gt;
* GFS was very specifically designed to work in a limited scenario, under certain specific conditions, whereas Ceph is a more generic solution for how to build a scalable distributed file system&lt;br /&gt;
&lt;br /&gt;
* Achieves high performance, reliability, and availability through three design features: decoupled data and metadata, dynamically distributed metadata, and reliable autonomic distributed object storage.&lt;br /&gt;
** Decoupled data and metadata: metadata operations (open, close) go to the metadata cluster, while clients interact directly with OSDs for I/O.&lt;br /&gt;
** Distributed metadata: metadata operations make up a large share of the workload. Ceph distributes this workload across many Metadata Servers (MDSes) to maintain the file hierarchy.&lt;br /&gt;
** Autonomic object storage: OSDs organise amongst themselves, taking advantage of their onboard CPU and memory. Ceph delegates data migration, replication, failure detection, and recovery to the cluster of OSDs.&lt;br /&gt;
&lt;br /&gt;
* Distributed Meta Data&lt;br /&gt;
** Unlike GFS&lt;br /&gt;
** Clusters of MDSes.&lt;br /&gt;
** Utilizes Dynamic Subtree Partitioning: subtrees of the directory hierarchy are dynamically mapped to MDSes. The workload on every subtree is monitored, and subtrees are reassigned to MDSes accordingly, in a coarse way.&lt;br /&gt;
&lt;br /&gt;
* Near-POSIX interface: selectively extends the interface while relaxing consistency semantics.&lt;br /&gt;
** ex: &#039;&#039;readdirplus&#039;&#039; is an extension that optimizes for a common sequence of operations: &#039;&#039;readdir&#039;&#039; followed by multiple &#039;&#039;stat&#039;&#039;s. It requires brief caching to improve performance, which may let small concurrent changes go unnoticed.&lt;br /&gt;
* Object Storage Devices (OSDs) have some intelligence (unlike GFS), and autonomously distribute the data, rather than being controlled by a master.&lt;br /&gt;
** Uses EBOFS (instead of ext3). Implemented in user space to avoid dealing with kernel issues. Aggressively schedules disk writes.&lt;br /&gt;
** Uses hashing in the distribution process to &#039;&#039;&#039;uniformly&#039;&#039;&#039; distribute data&lt;br /&gt;
** The actual algorithm for distributing data is as follows:&lt;br /&gt;
*** file + offset → hash(object ID) → CRUSH(placement group) → OSD&lt;br /&gt;
** Each client has knowledge of the entire storage network&lt;br /&gt;
** Tracks failure groups (same breaker, switch, etc.), hot data, etc.&lt;br /&gt;
** Number of replicas is changeable on the fly, but the placement group is not&lt;br /&gt;
*** For example, if every client on the planet is accessing the same file, you can scale out for that data.&lt;br /&gt;
** You don&#039;t ask where to go, you just go, which makes this very scalable&lt;br /&gt;
&lt;br /&gt;
Any distributed file system that aims to be scalable needs to cut down on the number of control messages flying around, rather than on the actual data transfer, which is what Ceph does with the CRUSH function. A client or OSD only needs to know the CRUSH algorithm to compute the location of a file on its own (instead of asking a master server about it), which eliminates the traditional file-allocation-list approach. &lt;br /&gt;
&lt;br /&gt;
* CRUSH is sufficiently advanced to be called magic.&lt;br /&gt;
** O(log n) in the size of the data&lt;br /&gt;
** CPUs are stupidly fast, so the above is minimal overhead&lt;br /&gt;
*** the network, despite being fast, has latency, etc. &lt;br /&gt;
*** Computation scales much better than communication.&lt;br /&gt;
&lt;br /&gt;
* Storage is composed of variable-length atoms&lt;br /&gt;
&lt;br /&gt;
= Class Discussion = &lt;br /&gt;
&lt;br /&gt;
== File Size ==&lt;br /&gt;
In Anil’s opinion, “how does the file system size compare to the server storage size?” is a key parameter that distinguishes the GFS and Ceph designs from the early file systems NFS, AFS, and Plan 9. In the early file system designs, the file system was a fraction of the server&#039;s storage size, whereas in GFS and Ceph the file system can be orders of magnitude larger than any single server. &lt;br /&gt;
&lt;br /&gt;
== Segue on drives and sequential access following GFS section ==&lt;br /&gt;
&lt;br /&gt;
* Structure of GFS does match some other modern systems:&lt;br /&gt;
** Hard drives are like parallel tapes, very suited for streaming.&lt;br /&gt;
** Flash devices are log-structured too, but have an abstracting firmware.&lt;br /&gt;
*** They do erasure in bulk, in the &#039;&#039;&#039;background&#039;&#039;&#039;. &lt;br /&gt;
*** Used to be we needed specialized FS for [http://en.wikipedia.org/wiki/Memory_Technology_Device MTDs] to get better performance; though now we have better micro-controllers in some embedded systems to abstract away the hardware.&lt;br /&gt;
* Architectures that start big, often end up in the smallest things.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Lookups vs hashing ==&lt;br /&gt;
One key aspect of the Ceph design is the attempt to replace communication with computation by using the hashing-based mechanism CRUSH. The following line from Anil epitomizes the general approach followed in the field of Computer Science: “If one abstraction does not work, stick another one in”.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_10&amp;diff=18659</id>
		<title>DistOS 2014W Lecture 10</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_10&amp;diff=18659"/>
		<updated>2014-02-23T19:09:51Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: More on distributed meta data&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==GFS and Ceph (Feb. 4)==&lt;br /&gt;
* [http://research.google.com/archive/gfs-sosp2003.pdf Sanjay Ghemawat et al., &amp;quot;The Google File System&amp;quot; (SOSP 2003)]&lt;br /&gt;
* [http://www.usenix.org/events/osdi06/tech/weil.html Weil et al., Ceph: A Scalable, High-Performance Distributed File System (OSDI 2006)].&lt;br /&gt;
&lt;br /&gt;
== GFS ==&lt;br /&gt;
GFS is a distributed file system designed specifically for Google&#039;s needs, and two assumptions were made while designing it:&lt;br /&gt;
&lt;br /&gt;
# Most of the data is written in the form of appends (writes at the end of a file). &lt;br /&gt;
# Data is read from files in a streaming fashion (lots of data read via sequential access). &lt;br /&gt;
&lt;br /&gt;
Because of this, they decided to emphasize performance for sequential access. These two assumptions are also why they chose such a huge chunk size (64 MB): you can easily read large blocks if you stick to sequential access.&lt;br /&gt;
&lt;br /&gt;
* Very different design because of the workload that it is designed for:&lt;br /&gt;
** Because of the number of small files that have to be indexed for the web, it is no longer practical to have a file system that stores these individually. Too much overhead. Punts problem to userspace, incl. record delimitation.&lt;br /&gt;
* Don&#039;t care about latency, surprising considering it&#039;s Google, the guys who change the TCP IW standard recommendations for latency.&lt;br /&gt;
* Mostly seeking (sequentially) through entire file.&lt;br /&gt;
* Paper from 2003, mentions still using 100BASE-T links.&lt;br /&gt;
* Data-heavy, metadata light. Contacting the metadata server is a rare event.&lt;br /&gt;
* Consider hardware failures as normal operating conditions:&lt;br /&gt;
** uses commodity hardware&lt;br /&gt;
** All the replication (!)&lt;br /&gt;
** Data checksumming&lt;br /&gt;
* Performance degrades for small random access workload; use other filesystem.&lt;br /&gt;
* Path of least resistance to scale, not to do something super CS-smart.&lt;br /&gt;
* Google used to re-index every month, swapping out indexes. Now, it&#039;s much more online. GFS is now just a layer to support a more dynamic layer.&lt;br /&gt;
* The paper seems to lack any mention of security. This FS probably could only exist on a trusted network.&lt;br /&gt;
&lt;br /&gt;
== How other filesystems compare to GFS and Ceph ==&lt;br /&gt;
&lt;br /&gt;
* Other File Systems: AFS, NFS, Plan 9, traditional Unix&lt;br /&gt;
&lt;br /&gt;
* Data and metadata are held together.&lt;br /&gt;
** They did not optimize for different access patterns:&lt;br /&gt;
*** Data → big, long transfers&lt;br /&gt;
*** Metadata → small, low latency&lt;br /&gt;
** Can&#039;t scale separately&lt;br /&gt;
&lt;br /&gt;
* Designed for lower latency&lt;br /&gt;
&lt;br /&gt;
* (Mostly) designed for POSIX semantics&lt;br /&gt;
** how the requirements that led to the ‘standard’ evolved&lt;br /&gt;
&lt;br /&gt;
* Assumed that a file is a fraction of the size of a server&lt;br /&gt;
** eg. files on a Unix system were meant to be text files.&lt;br /&gt;
** Huge files spread over many servers not even in the cards for NFS&lt;br /&gt;
** Meant for small problems, not web-scale&lt;br /&gt;
*** Google has a copy of the publicly accessible internet&lt;br /&gt;
**** Their strategy is to copy the internet to index it&lt;br /&gt;
**** Insane → insane filesystem&lt;br /&gt;
**** One file may span multiple servers&lt;br /&gt;
&lt;br /&gt;
* Even mainframes, scale-up solutions, ultra-reliable systems, with data sets bigger than RAM don&#039;t have the scale of GFS or CEPH.&lt;br /&gt;
&lt;br /&gt;
* Point-to-point access; much less load-balancing, even in AFS&lt;br /&gt;
** One server to service multiple clients.&lt;br /&gt;
** Single point of entry, single point of failure, bottleneck&lt;br /&gt;
&lt;br /&gt;
* Less focus on fault tolerance&lt;br /&gt;
** No notion of data replication.&lt;br /&gt;
&lt;br /&gt;
* Reliability was a property of the host, not the network&lt;br /&gt;
&lt;br /&gt;
==Ceph==&lt;br /&gt;
&lt;br /&gt;
* Ceph is crazy and tries to do everything&lt;br /&gt;
* GFS was very specifically designed to work in a limited scenario, under certain specific conditions, whereas Ceph is a more generic solution for how to build a scalable distributed file system&lt;br /&gt;
&lt;br /&gt;
* Achieves high performance, reliability, and availability through three design features: decoupled data and metadata, dynamically distributed metadata, and reliable autonomic distributed object storage.&lt;br /&gt;
** Decoupled data and metadata: metadata operations (open, close) go to the metadata cluster, while clients interact directly with OSDs for I/O.&lt;br /&gt;
** Distributed metadata: metadata operations make up a large share of the workload. Ceph distributes this workload across many Metadata Servers (MDSes) to maintain the file hierarchy.&lt;br /&gt;
** Autonomic object storage: OSDs organise amongst themselves, taking advantage of their onboard CPU and memory. Ceph delegates data migration, replication, failure detection, and recovery to the cluster of OSDs.&lt;br /&gt;
&lt;br /&gt;
* Distributed Meta Data&lt;br /&gt;
** Unlike GFS&lt;br /&gt;
** Clusters of MDSes.&lt;br /&gt;
** Utilizes Dynamic Subtree Partitioning: subtrees of the directory hierarchy are dynamically mapped to MDSes. The workload on every subtree is monitored, and subtrees are reassigned to MDSes accordingly, in a coarse way.&lt;br /&gt;
&lt;br /&gt;
* Near-POSIX interface: selectively extends the interface while relaxing consistency semantics.&lt;br /&gt;
** ex: &#039;&#039;readdirplus&#039;&#039; is an extension that optimizes for a common sequence of operations: &#039;&#039;readdir&#039;&#039; followed by multiple &#039;&#039;stat&#039;&#039;s. It requires brief caching to improve performance, which may let small concurrent changes go unnoticed.&lt;br /&gt;
* Unlike GFS, the Object Storage Devices (OSDs) have some intelligence, and autonomously distribute the data, rather than being controlled by a master.&lt;br /&gt;
** Uses hashing in the distribution process to &#039;&#039;&#039;uniformly&#039;&#039;&#039; distribute data&lt;br /&gt;
** The actual algorithm for distributing data is as follows:&lt;br /&gt;
*** file + offset → hash(object ID) → CRUSH(placement group) → OSD&lt;br /&gt;
** Each client has knowledge of the entire storage network&lt;br /&gt;
** Tracks failure groups (same breaker, switch, etc.), hot data, etc.&lt;br /&gt;
** Number of replicas is changeable on the fly, but the placement group is not&lt;br /&gt;
*** For example, if every client on the planet is accessing the same file, you can scale out for that data.&lt;br /&gt;
** You don&#039;t ask where to go, you just go, which makes this very scalable&lt;br /&gt;
&lt;br /&gt;
Any distributed file system that aims to be scalable needs to cut down on the number of control messages flying around, rather than on the actual data transfer, which is what Ceph does with the CRUSH function. A client or OSD only needs to know the CRUSH algorithm to compute the location of a file on its own (instead of asking a master server about it), which eliminates the traditional file-allocation-list approach. &lt;br /&gt;
&lt;br /&gt;
* CRUSH is sufficiently advanced to be called magic.&lt;br /&gt;
** O(log n) in the size of the data&lt;br /&gt;
** CPUs are stupidly fast, so the above is minimal overhead&lt;br /&gt;
*** the network, despite being fast, has latency, etc. &lt;br /&gt;
*** Computation scales much better than communication.&lt;br /&gt;
&lt;br /&gt;
* Storage is composed of variable-length atoms&lt;br /&gt;
&lt;br /&gt;
= Class Discussion = &lt;br /&gt;
&lt;br /&gt;
== File Size ==&lt;br /&gt;
In Anil’s opinion, “how does the file system size compare to the server storage size?” is a key parameter that distinguishes the GFS and Ceph designs from the early file systems NFS, AFS, and Plan 9. In the early file system designs, the file system was a fraction of the server&#039;s storage size, whereas in GFS and Ceph the file system can be orders of magnitude larger than any single server. &lt;br /&gt;
&lt;br /&gt;
== Segue on drives and sequential access following GFS section ==&lt;br /&gt;
&lt;br /&gt;
* Structure of GFS does match some other modern systems:&lt;br /&gt;
** Hard drives are like parallel tapes, very suited for streaming.&lt;br /&gt;
** Flash devices are log-structured too, but have an abstracting firmware.&lt;br /&gt;
*** They do erasure in bulk, in the &#039;&#039;&#039;background&#039;&#039;&#039;. &lt;br /&gt;
*** Used to be we needed specialized FS for [http://en.wikipedia.org/wiki/Memory_Technology_Device MTDs] to get better performance; though now we have better micro-controllers in some embedded systems to abstract away the hardware.&lt;br /&gt;
* Architectures that start big, often end up in the smallest things.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Lookups vs hashing ==&lt;br /&gt;
One key aspect of the Ceph design is the attempt to replace communication with computation by using the hashing-based mechanism CRUSH. The following line from Anil epitomizes the general approach followed in the field of Computer Science: “If one abstraction does not work, stick another one in”.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_10&amp;diff=18658</id>
		<title>DistOS 2014W Lecture 10</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_10&amp;diff=18658"/>
		<updated>2014-02-23T18:49:25Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: POSIX w/relaxing&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==GFS and Ceph (Feb. 4)==&lt;br /&gt;
* [http://research.google.com/archive/gfs-sosp2003.pdf Sanjay Ghemawat et al., &amp;quot;The Google File System&amp;quot; (SOSP 2003)]&lt;br /&gt;
* [http://www.usenix.org/events/osdi06/tech/weil.html Weil et al., Ceph: A Scalable, High-Performance Distributed File System (OSDI 2006)].&lt;br /&gt;
&lt;br /&gt;
== GFS ==&lt;br /&gt;
GFS is a distributed file system designed specifically for Google&#039;s needs, and two assumptions were made while designing it:&lt;br /&gt;
&lt;br /&gt;
# Most of the data is written in the form of appends (writes at the end of a file). &lt;br /&gt;
# Data is read from files in a streaming fashion (lots of data read via sequential access). &lt;br /&gt;
&lt;br /&gt;
Because of this, they decided to emphasize performance for sequential access. These two assumptions are also why they chose such a huge chunk size (64 MB): you can easily read large blocks if you stick to sequential access.&lt;br /&gt;
&lt;br /&gt;
* Very different design because of the workload that it is designed for:&lt;br /&gt;
** Because of the number of small files that have to be indexed for the web, it is no longer practical to have a file system that stores these individually. Too much overhead. Punts problem to userspace, incl. record delimitation.&lt;br /&gt;
* Don&#039;t care about latency, surprising considering it&#039;s Google, the guys who change the TCP IW standard recommendations for latency.&lt;br /&gt;
* Mostly seeking (sequentially) through entire file.&lt;br /&gt;
* Paper from 2003, mentions still using 100BASE-T links.&lt;br /&gt;
* Data-heavy, metadata light. Contacting the metadata server is a rare event.&lt;br /&gt;
* Consider hardware failures as normal operating conditions:&lt;br /&gt;
** uses commodity hardware&lt;br /&gt;
** All the replication (!)&lt;br /&gt;
** Data checksumming&lt;br /&gt;
* Performance degrades for small random access workload; use other filesystem.&lt;br /&gt;
* Path of least resistance to scale, not to do something super CS-smart.&lt;br /&gt;
* Google used to re-index every month, swapping out indexes. Now, it&#039;s much more online. GFS is now just a layer to support a more dynamic layer.&lt;br /&gt;
* The paper seems to lack any mention of security. This FS probably could only exist on a trusted network.&lt;br /&gt;
&lt;br /&gt;
== How other filesystems compare to GFS and Ceph ==&lt;br /&gt;
&lt;br /&gt;
* Other File Systems: AFS, NFS, Plan 9, traditional Unix&lt;br /&gt;
&lt;br /&gt;
* Data and metadata are held together.&lt;br /&gt;
** They did not optimize for different access patterns:&lt;br /&gt;
*** Data → big, long transfers&lt;br /&gt;
*** Metadata → small, low latency&lt;br /&gt;
** Can&#039;t scale separately&lt;br /&gt;
&lt;br /&gt;
* Designed for lower latency&lt;br /&gt;
&lt;br /&gt;
* (Mostly) designed for POSIX semantics&lt;br /&gt;
** how the requirements that led to the ‘standard’ evolved&lt;br /&gt;
&lt;br /&gt;
* Assumed that a file is a fraction of the size of a server&lt;br /&gt;
** eg. files on a Unix system were meant to be text files.&lt;br /&gt;
** Huge files spread over many servers not even in the cards for NFS&lt;br /&gt;
** Meant for small problems, not web-scale&lt;br /&gt;
*** Google has a copy of the publicly accessible internet&lt;br /&gt;
**** Their strategy is to copy the internet to index it&lt;br /&gt;
**** Insane → insane filesystem&lt;br /&gt;
**** One file may span multiple servers&lt;br /&gt;
&lt;br /&gt;
* Even mainframes, scale-up solutions, ultra-reliable systems, with data sets bigger than RAM don&#039;t have the scale of GFS or CEPH.&lt;br /&gt;
&lt;br /&gt;
* Point-to-point access; much less load-balancing, even in AFS&lt;br /&gt;
** One server to service multiple clients.&lt;br /&gt;
** Single point of entry, single point of failure, bottleneck&lt;br /&gt;
&lt;br /&gt;
* Less focus on fault tolerance&lt;br /&gt;
** No notion of data replication.&lt;br /&gt;
&lt;br /&gt;
* Reliability was a property of the host, not the network&lt;br /&gt;
&lt;br /&gt;
==Ceph==&lt;br /&gt;
&lt;br /&gt;
* Ceph is crazy and tries to do everything&lt;br /&gt;
* GFS was very specifically designed to work in a limited scenario, under certain specific conditions, whereas Ceph is a more generic solution for how to build a scalable distributed file system&lt;br /&gt;
&lt;br /&gt;
* Achieves high performance, reliability, and availability through three design features: decoupled data and metadata, dynamically distributed metadata, and reliable autonomic distributed object storage.&lt;br /&gt;
** Decoupled data and metadata: metadata operations (open, close) go to the metadata cluster, while clients interact directly with OSDs for I/O.&lt;br /&gt;
** Distributed metadata: metadata operations make up a large share of the workload. Ceph distributes this workload across many Metadata Servers (MDSes) to maintain the file hierarchy.&lt;br /&gt;
** Autonomic object storage: OSDs organise amongst themselves, taking advantage of their onboard CPU and memory. Ceph delegates data migration, replication, failure detection, and recovery to the cluster of OSDs.&lt;br /&gt;
&lt;br /&gt;
* Unlike GFS, distributes metadata, not just for read-only copies&lt;br /&gt;
* Near-POSIX interface: selectively extends the interface while relaxing consistency semantics.&lt;br /&gt;
** ex: &#039;&#039;readdirplus&#039;&#039; is an extension that optimizes for a common sequence of operations: &#039;&#039;readdir&#039;&#039; followed by multiple &#039;&#039;stat&#039;&#039;s. It requires brief caching to improve performance, which may let small concurrent changes go unnoticed.&lt;br /&gt;
* Unlike GFS, the Object Storage Devices (OSDs) have some intelligence, and autonomously distribute the data, rather than being controlled by a master.&lt;br /&gt;
** Uses hashing in the distribution process to &#039;&#039;&#039;uniformly&#039;&#039;&#039; distribute data&lt;br /&gt;
** The actual algorithm for distributing data is as follows:&lt;br /&gt;
*** file + offset → hash(object ID) → CRUSH(placement group) → OSD&lt;br /&gt;
** Each client has knowledge of the entire storage network&lt;br /&gt;
** Tracks failure groups (same breaker, switch, etc.), hot data, etc.&lt;br /&gt;
** Number of replicas is changeable on the fly, but the placement group is not&lt;br /&gt;
*** For example, if every client on the planet is accessing the same file, you can scale out for that data.&lt;br /&gt;
** You don&#039;t ask where to go, you just go, which makes this very scalable&lt;br /&gt;
&lt;br /&gt;
Any distributed file system that aims to be scalable needs to cut down on the number of control messages flying around, rather than on the actual data transfer, which is what Ceph does with the CRUSH function. A client or OSD only needs to know the CRUSH algorithm to compute the location of a file on its own (instead of asking a master server about it), which eliminates the traditional file-allocation-list approach. &lt;br /&gt;
&lt;br /&gt;
* CRUSH is sufficiently advanced to be called magic.&lt;br /&gt;
** O(log n) in the size of the data&lt;br /&gt;
** CPUs are stupidly fast, so the above is minimal overhead&lt;br /&gt;
*** the network, despite being fast, has latency, etc. &lt;br /&gt;
*** Computation scales much better than communication.&lt;br /&gt;
&lt;br /&gt;
* Storage is composed of variable-length atoms&lt;br /&gt;
&lt;br /&gt;
= Class Discussion = &lt;br /&gt;
&lt;br /&gt;
== File Size ==&lt;br /&gt;
In Anil’s opinion, “how does the file system size compare to the server storage size?” is a key parameter that distinguishes the GFS and Ceph designs from the early file systems NFS, AFS, and Plan 9. In the early file system designs, the file system was a fraction of the server&#039;s storage size, whereas in GFS and Ceph the file system can be orders of magnitude larger than any single server. &lt;br /&gt;
&lt;br /&gt;
== Segue on drives and sequential access following GFS section ==&lt;br /&gt;
&lt;br /&gt;
* Structure of GFS does match some other modern systems:&lt;br /&gt;
** Hard drives are like parallel tapes, very suited for streaming.&lt;br /&gt;
** Flash devices are log-structured too, but have an abstracting firmware.&lt;br /&gt;
*** They do erasure in bulk, in the &#039;&#039;&#039;background&#039;&#039;&#039;. &lt;br /&gt;
*** Used to be we needed specialized FS for [http://en.wikipedia.org/wiki/Memory_Technology_Device MTDs] to get better performance; though now we have better micro-controllers in some embedded systems to abstract away the hardware.&lt;br /&gt;
* Architectures that start big, often end up in the smallest things.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Lookups vs hashing ==&lt;br /&gt;
One key aspect of the Ceph design is the attempt to replace communication with computation, using the hashing-based mechanism CRUSH. The following line from Anil epitomizes the general approach followed in computer science: “If one abstraction does not work, stick another one in.”&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_10&amp;diff=18657</id>
		<title>DistOS 2014W Lecture 10</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_10&amp;diff=18657"/>
		<updated>2014-02-23T18:32:33Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Added more overview information.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==GFS and Ceph (Feb. 4)==&lt;br /&gt;
* [http://research.google.com/archive/gfs-sosp2003.pdf Sanjay Ghemawat et al., &amp;quot;The Google File System&amp;quot; (SOSP 2003)]&lt;br /&gt;
* [http://www.usenix.org/events/osdi06/tech/weil.html Weil et al., Ceph: A Scalable, High-Performance Distributed File System (OSDI 2006)].&lt;br /&gt;
&lt;br /&gt;
== GFS ==&lt;br /&gt;
GFS is a distributed file system designed specifically for Google&#039;s needs, built around two assumptions:&lt;br /&gt;
&lt;br /&gt;
# Most of the data is written in the form of appends (writes at the end of a file). &lt;br /&gt;
# Data is read from files in a streaming fashion (a lot of data via sequential access). &lt;br /&gt;
&lt;br /&gt;
Because of this, they emphasized performance for sequential access. These two assumptions are also why they chose such a huge chunk size (64 MB): large blocks are easy to read when access is sequential.&lt;br /&gt;
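The arithmetic behind the large-chunk design can be sketched in a few lines (a toy illustration, not Google&#039;s code; the 64 MB constant is from the paper, the function name is hypothetical):&lt;br /&gt;

```python
# Toy sketch of GFS-style chunk addressing (hypothetical helper, not Google's code).
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as in the GFS paper

def chunk_for_offset(offset):
    """Return (chunk index, offset within that chunk) for a byte offset in a file."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# A byte 200 MB into a file lands in chunk 3, 8 MB into that chunk.
print(chunk_for_offset(200 * 1024 * 1024))
```

Because a sequential 1 GB read touches only sixteen 64 MB chunks, a client rarely needs to ask the master for chunk locations, which is exactly the data-heavy, metadata-light behaviour noted below.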
&lt;br /&gt;
* Very different design because of the workload that it is designed for:&lt;br /&gt;
** Because of the number of small files that have to be indexed for the web, it is no longer practical to have a file system that stores these individually. Too much overhead. Punts problem to userspace, incl. record delimitation.&lt;br /&gt;
* Don&#039;t care about latency, surprising considering it&#039;s Google, the guys who change the TCP IW standard recommendations for latency.&lt;br /&gt;
* Mostly seeking (sequentially) through entire file.&lt;br /&gt;
* Paper from 2003, mentions still using 100BASE-T links.&lt;br /&gt;
* Data-heavy, metadata light. Contacting the metadata server is a rare event.&lt;br /&gt;
* Consider hardware failures as normal operating conditions:&lt;br /&gt;
** uses commodity hardware&lt;br /&gt;
** All the replication (!)&lt;br /&gt;
** Data checksumming&lt;br /&gt;
* Performance degrades for small random access workload; use other filesystem.&lt;br /&gt;
* Path of least resistance to scale, not to do something super CS-smart.&lt;br /&gt;
* Google used to re-index every month, swapping out indexes. Now, it&#039;s much more online. GFS is now just a layer to support a more dynamic layer.&lt;br /&gt;
* The paper seems to lack any mention of security. This FS probably could only exist on a trusted network.&lt;br /&gt;
&lt;br /&gt;
== How other filesystems compare to GFS and Ceph ==&lt;br /&gt;
&lt;br /&gt;
* Other File Systems: AFS, NFS, Plan 9, traditional Unix&lt;br /&gt;
&lt;br /&gt;
* Data and metadata are held together.&lt;br /&gt;
** They did not optimize for different access patterns:&lt;br /&gt;
*** Data → big, long transfers&lt;br /&gt;
*** Metadata → small, low latency&lt;br /&gt;
** Can&#039;t scale separately&lt;br /&gt;
&lt;br /&gt;
* Designed for lower latency&lt;br /&gt;
&lt;br /&gt;
* (Mostly) designed for POSIX semantics&lt;br /&gt;
** how the requirements that lead to the ‘standard’ evolved&lt;br /&gt;
&lt;br /&gt;
* Assumed that a file is a fraction of the size of a server&lt;br /&gt;
** eg. files on a Unix system were meant to be text files.&lt;br /&gt;
** Huge files spread over many servers not even in the cards for NFS&lt;br /&gt;
** Meant for small problems, not web-scale&lt;br /&gt;
*** Google has a copy of the publicly accessible internet&lt;br /&gt;
**** Their strategy is to copy the internet to index it&lt;br /&gt;
**** Insane → insane filesystem&lt;br /&gt;
**** One file may span multiple servers&lt;br /&gt;
&lt;br /&gt;
* Even mainframes, scale-up solutions, ultra-reliable systems, with data sets bigger than RAM don&#039;t have the scale of GFS or CEPH.&lt;br /&gt;
&lt;br /&gt;
* Point-to-point access; much less load-balancing, even in AFS&lt;br /&gt;
** One server to service multiple clients.&lt;br /&gt;
** Single point of entry, single point of failure, bottleneck&lt;br /&gt;
&lt;br /&gt;
* Less focus on fault tolerance&lt;br /&gt;
** No notion of data replication.&lt;br /&gt;
&lt;br /&gt;
* Reliability was a property of the host, not the network&lt;br /&gt;
&lt;br /&gt;
==Ceph==&lt;br /&gt;
&lt;br /&gt;
* Ceph is crazy and tries to do everything&lt;br /&gt;
* GFS was designed for a specific, limited scenario under certain workload conditions, whereas Ceph is a more generic solution for how to build a scalable distributed file system&lt;br /&gt;
&lt;br /&gt;
* Achieves high performance, reliability, and availability through three design features: decoupled data and metadata, dynamically distributed metadata, and reliable autonomic distributed object storage.&lt;br /&gt;
** Decoupled data and metadata: metadata operations (open, close) go to the metadata cluster, while clients interact directly with OSDs for I/O.&lt;br /&gt;
** Distributed metadata: metadata operations make up a large share of the workload. Ceph distributes this workload across many Metadata Servers (MDSs) to maintain the file hierarchy.&lt;br /&gt;
** Autonomic object storage: OSDs organize amongst themselves, taking advantage of their onboard CPU and memory. Ceph delegates data migration, replication, failure detection, and recovery to the cluster of OSDs.&lt;br /&gt;
&lt;br /&gt;
* Unlike GFS, distributes metadata, not just for read-only copies&lt;br /&gt;
* Near-POSIX interface: selectively extends the interface while relaxing consistency semantics. &lt;br /&gt;
* Unlike GFS, the Object Storage Devices (OSDs) have some intelligence, and autonomously distribute the data, rather than being controlled by a master.&lt;br /&gt;
** Uses hashing in the distribution process to &#039;&#039;&#039;uniformly&#039;&#039;&#039; distribute data&lt;br /&gt;
** The actual algorithm for distributing data is as follows:&lt;br /&gt;
*** file + offset → hash(object ID) → CRUSH(placement group) → OSD&lt;br /&gt;
** Each client has knowledge of the entire storage network&lt;br /&gt;
** Tracks failure groups (same breaker, switch, etc.), hot data, etc.&lt;br /&gt;
** Number of replicas is changeable on the fly, but the placement group is not&lt;br /&gt;
*** For example, if every client on the planet is accessing the same file, you can scale out for that data.&lt;br /&gt;
** You don&#039;t ask where to go, you just go, which makes this very scalable&lt;br /&gt;
&lt;br /&gt;
Any distributed file system that aims to be scalable needs to cut down on the number of coordination messages, rather than on the actual data transfer; this is what Ceph does with the CRUSH function. A client or OSD only needs to know the CRUSH algorithm (a function) to compute the location of a file on its own, instead of asking a master server about it, which eliminates the traditional file-allocation-list approach. &lt;br /&gt;
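The file + offset → object → placement group → OSD pipeline above can be sketched with ordinary hashing. This is a deliberately simplified stand-in: real CRUSH walks a hierarchical, failure-domain-aware cluster map, and every name and count below is hypothetical.&lt;br /&gt;

```python
import hashlib

# Simplified sketch of Ceph-style placement (NOT the real CRUSH algorithm).
# Every client computes the same answer locally, so no lookup server is needed.
NUM_PGS = 128            # placement groups (hypothetical count)
OSDS = list(range(12))   # hypothetical cluster of 12 OSDs
REPLICAS = 3

def place(file_id, offset, chunk_size=4 * 1024 * 1024):
    """Map (file, offset) to an ordered list of OSDs holding the replicas."""
    object_id = "%s.%d" % (file_id, offset // chunk_size)   # file + offset -> object
    h = int.from_bytes(hashlib.sha1(object_id.encode()).digest()[:8], "big")
    pg = h % NUM_PGS                                        # hash(object ID) -> placement group
    # Stand-in for CRUSH(placement group) -> OSDs; a fixed stride keeps replicas distinct.
    return [OSDS[(pg + i * 7) % len(OSDS)] for i in range(REPLICAS)]
```

The point is that any client with the same parameters computes the same replica list purely locally, so placement generates no lookup traffic at all.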
&lt;br /&gt;
* CRUSH is sufficiently advanced to be called magic.&lt;br /&gt;
** O(log n) of the size of the data&lt;br /&gt;
** CPUs stupidly fast, so the above is of minimal overhead&lt;br /&gt;
*** the network, despite being fast, has latency, etc. &lt;br /&gt;
*** Computation scales much better than communication.&lt;br /&gt;
&lt;br /&gt;
* Storage is composed of variable-length atoms&lt;br /&gt;
&lt;br /&gt;
= Class Discussion = &lt;br /&gt;
&lt;br /&gt;
== File Size ==&lt;br /&gt;
In Anil’s opinion, “how does file system size compare to server storage size?” is a key parameter that distinguishes the GFS and Ceph designs from the early file systems (NFS, AFS, Plan 9). In the early designs, the file system was a fraction of the server&#039;s storage size, whereas in GFS and Ceph the file system can be several orders of magnitude larger than any single server. &lt;br /&gt;
&lt;br /&gt;
== Segue on drives and sequential access following GFS section ==&lt;br /&gt;
&lt;br /&gt;
* Structure of GFS does match some other modern systems:&lt;br /&gt;
** Hard drives are like parallel tapes, very suited for streaming.&lt;br /&gt;
** Flash devices are log-structured too, but have an abstracting firmware.&lt;br /&gt;
*** They do erasure in bulk, in the &#039;&#039;&#039;background&#039;&#039;&#039;. &lt;br /&gt;
*** Used to be we needed specialized FS for [http://en.wikipedia.org/wiki/Memory_Technology_Device MTDs] to get better performance; though now we have better micro-controllers in some embedded systems to abstract away the hardware.&lt;br /&gt;
* Architectures that start big, often end up in the smallest things.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Lookups vs hashing ==&lt;br /&gt;
One key aspect of the Ceph design is the attempt to replace communication with computation, using the hashing-based mechanism CRUSH. The following line from Anil epitomizes the general approach followed in computer science: “If one abstraction does not work, stick another one in.”&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_10&amp;diff=18656</id>
		<title>DistOS 2014W Lecture 10</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_10&amp;diff=18656"/>
		<updated>2014-02-21T22:03:23Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: linebreak&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==GFS and Ceph (Feb. 4)==&lt;br /&gt;
* [http://research.google.com/archive/gfs-sosp2003.pdf Sanjay Ghemawat et al., &amp;quot;The Google File System&amp;quot; (SOSP 2003)]&lt;br /&gt;
* [http://www.usenix.org/events/osdi06/tech/weil.html Weil et al., Ceph: A Scalable, High-Performance Distributed File System (OSDI 2006)].&lt;br /&gt;
&lt;br /&gt;
== GFS ==&lt;br /&gt;
GFS is a distributed file system designed specifically for Google&#039;s needs, built around two assumptions:&lt;br /&gt;
&lt;br /&gt;
# Most of the data is written in the form of appends (writes at the end of a file). &lt;br /&gt;
# Data is read from files in a streaming fashion (a lot of data via sequential access). &lt;br /&gt;
&lt;br /&gt;
Because of this, they emphasized performance for sequential access. These two assumptions are also why they chose such a huge chunk size (64 MB): large blocks are easy to read when access is sequential.&lt;br /&gt;
&lt;br /&gt;
* Very different design because of the workload that it is designed for:&lt;br /&gt;
** Because of the number of small files that have to be indexed for the web, it is no longer practical to have a file system that stores these individually. Too much overhead. Punts problem to userspace, incl. record delimitation.&lt;br /&gt;
* Don&#039;t care about latency, surprising considering it&#039;s Google, the guys who change the TCP IW standard recommendations for latency.&lt;br /&gt;
* Mostly seeking (sequentially) through entire file.&lt;br /&gt;
* Paper from 2003, mentions still using 100BASE-T links.&lt;br /&gt;
* Data-heavy, metadata light. Contacting the metadata server is a rare event.&lt;br /&gt;
* Consider hardware failures as normal operating conditions:&lt;br /&gt;
** uses commodity hardware&lt;br /&gt;
** All the replication (!)&lt;br /&gt;
** Data checksumming&lt;br /&gt;
* Performance degrades for small random access workload; use other filesystem.&lt;br /&gt;
* Path of least resistance to scale, not to do something super CS-smart.&lt;br /&gt;
* Google used to re-index every month, swapping out indexes. Now, it&#039;s much more online. GFS is now just a layer to support a more dynamic layer.&lt;br /&gt;
* The paper seems to lack any mention of security. This FS probably could only exist on a trusted network.&lt;br /&gt;
&lt;br /&gt;
== How other filesystems compare to GFS and Ceph ==&lt;br /&gt;
&lt;br /&gt;
* Other File Systems: AFS, NFS, Plan 9, traditional Unix&lt;br /&gt;
&lt;br /&gt;
* Data and metadata are held together.&lt;br /&gt;
** They did not optimize for different access patterns:&lt;br /&gt;
*** Data → big, long transfers&lt;br /&gt;
*** Metadata → small, low latency&lt;br /&gt;
** Can&#039;t scale separately&lt;br /&gt;
&lt;br /&gt;
* Designed for lower latency&lt;br /&gt;
&lt;br /&gt;
* (Mostly) designed for POSIX semantics&lt;br /&gt;
** how the requirements that lead to the ‘standard’ evolved&lt;br /&gt;
&lt;br /&gt;
* Assumed that a file is a fraction of the size of a server&lt;br /&gt;
** eg. files on a Unix system were meant to be text files.&lt;br /&gt;
** Huge files spread over many servers not even in the cards for NFS&lt;br /&gt;
** Meant for small problems, not web-scale&lt;br /&gt;
*** Google has a copy of the publicly accessible internet&lt;br /&gt;
**** Their strategy is to copy the internet to index it&lt;br /&gt;
**** Insane → insane filesystem&lt;br /&gt;
**** One file may span multiple servers&lt;br /&gt;
&lt;br /&gt;
* Even mainframes, scale-up solutions, ultra-reliable systems, with data sets bigger than RAM don&#039;t have the scale of GFS or CEPH.&lt;br /&gt;
&lt;br /&gt;
* Point-to-point access; much less load-balancing, even in AFS&lt;br /&gt;
** One server to service multiple clients.&lt;br /&gt;
** Single point of entry, single point of failure, bottleneck&lt;br /&gt;
&lt;br /&gt;
* Less focus on fault tolerance&lt;br /&gt;
** No notion of data replication.&lt;br /&gt;
&lt;br /&gt;
* Reliability was a property of the host, not the network&lt;br /&gt;
&lt;br /&gt;
==Ceph==&lt;br /&gt;
&lt;br /&gt;
* Ceph is crazy and tries to do everything&lt;br /&gt;
* GFS was designed for a specific, limited scenario under certain workload conditions, whereas Ceph is a more generic solution for how to build a scalable distributed file system&lt;br /&gt;
* Unlike GFS, distributes metadata, not just for read-only copies&lt;br /&gt;
* Unlike GFS, the OSDs have some intelligence, and autonomously distribute the data, rather than being controlled by a master.&lt;br /&gt;
** Uses hashing in the distribution process to &#039;&#039;&#039;uniformly&#039;&#039;&#039; distribute data&lt;br /&gt;
** The actual algorithm for distributing data is as follows:&lt;br /&gt;
*** file + offset → hash(object ID) → CRUSH(placement group) → OSD&lt;br /&gt;
** Each client has knowledge of the entire storage network&lt;br /&gt;
** Tracks failure groups (same breaker, switch, etc.), hot data, etc.&lt;br /&gt;
** Number of replicas is changeable on the fly, but the placement group is not&lt;br /&gt;
*** For example, if every client on the planet is accessing the same file, you can scale out for that data.&lt;br /&gt;
** You don&#039;t ask where to go, you just go, which makes this very scalable&lt;br /&gt;
&lt;br /&gt;
Any distributed file system that aims to be scalable needs to cut down on the number of coordination messages, rather than on the actual data transfer; this is what Ceph does with the CRUSH function. A client or OSD only needs to know the CRUSH algorithm (a function) to compute the location of a file on its own, instead of asking a master server about it, which eliminates the traditional file-allocation-list approach. &lt;br /&gt;
&lt;br /&gt;
* CRUSH is sufficiently advanced to be called magic.&lt;br /&gt;
** O(log n) of the size of the data&lt;br /&gt;
** CPUs stupidly fast, so the above is of minimal overhead&lt;br /&gt;
*** the network, despite being fast, has latency, etc. &lt;br /&gt;
*** Computation scales much better than communication.&lt;br /&gt;
&lt;br /&gt;
* Storage is composed of variable-length atoms&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Class Discussion = &lt;br /&gt;
&lt;br /&gt;
== File Size ==&lt;br /&gt;
In Anil’s opinion, “how does file system size compare to server storage size?” is a key parameter that distinguishes the GFS and Ceph designs from the early file systems (NFS, AFS, Plan 9). In the early designs, the file system was a fraction of the server&#039;s storage size, whereas in GFS and Ceph the file system can be several orders of magnitude larger than any single server. &lt;br /&gt;
&lt;br /&gt;
== Segue on drives and sequential access following GFS section ==&lt;br /&gt;
&lt;br /&gt;
* Structure of GFS does match some other modern systems:&lt;br /&gt;
** Hard drives are like parallel tapes, very suited for streaming.&lt;br /&gt;
** Flash devices are log-structured too, but have an abstracting firmware.&lt;br /&gt;
*** They do erasure in bulk, in the &#039;&#039;&#039;background&#039;&#039;&#039;. &lt;br /&gt;
*** Used to be we needed specialized FS for [http://en.wikipedia.org/wiki/Memory_Technology_Device MTDs] to get better performance; though now we have better micro-controllers in some embedded systems to abstract away the hardware.&lt;br /&gt;
* Architectures that start big, often end up in the smallest things.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Lookups vs hashing ==&lt;br /&gt;
One key aspect of the Ceph design is the attempt to replace communication with computation, using the hashing-based mechanism CRUSH. The following line from Anil epitomizes the general approach followed in computer science: “If one abstraction does not work, stick another one in.”&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_10&amp;diff=18655</id>
		<title>DistOS 2014W Lecture 10</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_10&amp;diff=18655"/>
		<updated>2014-02-21T22:02:58Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: /* GFS */ Added a few tid bits.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==GFS and Ceph (Feb. 4)==&lt;br /&gt;
* [http://research.google.com/archive/gfs-sosp2003.pdf Sanjay Ghemawat et al., &amp;quot;The Google File System&amp;quot; (SOSP 2003)]&lt;br /&gt;
* [http://www.usenix.org/events/osdi06/tech/weil.html Weil et al., Ceph: A Scalable, High-Performance Distributed File System (OSDI 2006)].&lt;br /&gt;
&lt;br /&gt;
== GFS ==&lt;br /&gt;
GFS is a distributed file system designed specifically for Google&#039;s needs, built around two assumptions:&lt;br /&gt;
&lt;br /&gt;
# Most of the data is written in the form of appends (writes at the end of a file). &lt;br /&gt;
# Data is read from files in a streaming fashion (a lot of data via sequential access). &lt;br /&gt;
&lt;br /&gt;
Because of this, they emphasized performance for sequential access. These two assumptions are also why they chose such a huge chunk size (64 MB): large blocks are easy to read when access is sequential.&lt;br /&gt;
&lt;br /&gt;
* Very different design because of the workload that it is designed for:&lt;br /&gt;
** Because of the number of small files that have to be indexed for the web, it is no longer practical to have a file system that stores these individually. Too much overhead. Punts problem to userspace, incl. record delimitation.&lt;br /&gt;
* Don&#039;t care about latency, surprising considering it&#039;s Google, the guys who change the TCP IW standard recommendations for latency.&lt;br /&gt;
* Mostly seeking (sequentially) through entire file.&lt;br /&gt;
* Paper from 2003, mentions still using 100BASE-T links.&lt;br /&gt;
* Data-heavy, metadata light. Contacting the metadata server is a rare event.&lt;br /&gt;
* Consider hardware failures as normal operating conditions:&lt;br /&gt;
** uses commodity hardware&lt;br /&gt;
** All the replication (!)&lt;br /&gt;
** Data checksumming&lt;br /&gt;
* Performance degrades for small random access workload; use other filesystem.&lt;br /&gt;
* Path of least resistance to scale, not to do something super CS-smart.&lt;br /&gt;
* Google used to re-index every month, swapping out indexes. Now, it&#039;s much more online. GFS is now just a layer to support a more dynamic layer.&lt;br /&gt;
* The paper seems to lack any mention of security. This FS probably could only exist on a trusted network.&lt;br /&gt;
&lt;br /&gt;
== How other filesystems compare to GFS and Ceph ==&lt;br /&gt;
&lt;br /&gt;
* Other File Systems: AFS, NFS, Plan 9, traditional Unix&lt;br /&gt;
&lt;br /&gt;
* Data and metadata are held together.&lt;br /&gt;
** They did not optimize for different access patterns:&lt;br /&gt;
*** Data → big, long transfers&lt;br /&gt;
*** Metadata → small, low latency&lt;br /&gt;
** Can&#039;t scale separately&lt;br /&gt;
&lt;br /&gt;
* Designed for lower latency&lt;br /&gt;
&lt;br /&gt;
* (Mostly) designed for POSIX semantics&lt;br /&gt;
** how the requirements that lead to the ‘standard’ evolved&lt;br /&gt;
&lt;br /&gt;
* Assumed that a file is a fraction of the size of a server&lt;br /&gt;
** eg. files on a Unix system were meant to be text files.&lt;br /&gt;
** Huge files spread over many servers not even in the cards for NFS&lt;br /&gt;
** Meant for small problems, not web-scale&lt;br /&gt;
*** Google has a copy of the publicly accessible internet&lt;br /&gt;
**** Their strategy is to copy the internet to index it&lt;br /&gt;
**** Insane → insane filesystem&lt;br /&gt;
**** One file may span multiple servers&lt;br /&gt;
&lt;br /&gt;
* Even mainframes, scale-up solutions, ultra-reliable systems, with data sets bigger than RAM don&#039;t have the scale of GFS or CEPH.&lt;br /&gt;
&lt;br /&gt;
* Point-to-point access; much less load-balancing, even in AFS&lt;br /&gt;
** One server to service multiple clients.&lt;br /&gt;
** Single point of entry, single point of failure, bottleneck&lt;br /&gt;
&lt;br /&gt;
* Less focus on fault tolerance&lt;br /&gt;
** No notion of data replication.&lt;br /&gt;
&lt;br /&gt;
* Reliability was a property of the host, not the network&lt;br /&gt;
&lt;br /&gt;
==Ceph==&lt;br /&gt;
&lt;br /&gt;
* Ceph is crazy and tries to do everything&lt;br /&gt;
* GFS was designed for a specific, limited scenario under certain workload conditions, whereas Ceph is a more generic solution for how to build a scalable distributed file system&lt;br /&gt;
* Unlike GFS, distributes metadata, not just for read-only copies&lt;br /&gt;
* Unlike GFS, the OSDs have some intelligence, and autonomously distribute the data, rather than being controlled by a master.&lt;br /&gt;
** Uses hashing in the distribution process to &#039;&#039;&#039;uniformly&#039;&#039;&#039; distribute data&lt;br /&gt;
** The actual algorithm for distributing data is as follows:&lt;br /&gt;
*** file + offset → hash(object ID) → CRUSH(placement group) → OSD&lt;br /&gt;
** Each client has knowledge of the entire storage network&lt;br /&gt;
** Tracks failure groups (same breaker, switch, etc.), hot data, etc.&lt;br /&gt;
** Number of replicas is changeable on the fly, but the placement group is not&lt;br /&gt;
*** For example, if every client on the planet is accessing the same file, you can scale out for that data.&lt;br /&gt;
** You don&#039;t ask where to go, you just go, which makes this very scalable&lt;br /&gt;
&lt;br /&gt;
Any distributed file system that aims to be scalable needs to cut down on the number of coordination messages, rather than on the actual data transfer; this is what Ceph does with the CRUSH function. A client or OSD only needs to know the CRUSH algorithm (a function) to compute the location of a file on its own, instead of asking a master server about it, which eliminates the traditional file-allocation-list approach. &lt;br /&gt;
&lt;br /&gt;
* CRUSH is sufficiently advanced to be called magic.&lt;br /&gt;
** O(log n) of the size of the data&lt;br /&gt;
** CPUs stupidly fast, so the above is of minimal overhead&lt;br /&gt;
*** the network, despite being fast, has latency, etc. &lt;br /&gt;
*** Computation scales much better than communication.&lt;br /&gt;
&lt;br /&gt;
* Storage is composed of variable-length atoms&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Class Discussion = &lt;br /&gt;
&lt;br /&gt;
== File Size ==&lt;br /&gt;
In Anil’s opinion, “how does file system size compare to server storage size?” is a key parameter that distinguishes the GFS and Ceph designs from the early file systems (NFS, AFS, Plan 9). In the early designs, the file system was a fraction of the server&#039;s storage size, whereas in GFS and Ceph the file system can be several orders of magnitude larger than any single server. &lt;br /&gt;
&lt;br /&gt;
== Segue on drives and sequential access following GFS section ==&lt;br /&gt;
&lt;br /&gt;
* Structure of GFS does match some other modern systems:&lt;br /&gt;
** Hard drives are like parallel tapes, very suited for streaming.&lt;br /&gt;
** Flash devices are log-structured too, but have an abstracting firmware.&lt;br /&gt;
*** They do erasure in bulk, in the &#039;&#039;&#039;background&#039;&#039;&#039;. &lt;br /&gt;
*** Used to be we needed specialized FS for [http://en.wikipedia.org/wiki/Memory_Technology_Device MTDs] to get better performance; though now we have better micro-controllers in some embedded systems to abstract away the hardware.&lt;br /&gt;
* Architectures that start big, often end up in the smallest things.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Lookups vs hashing ==&lt;br /&gt;
One key aspect of the Ceph design is the attempt to replace communication with computation, using the hashing-based mechanism CRUSH. The following line from Anil epitomizes the general approach followed in computer science: “If one abstraction does not work, stick another one in.”&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_3&amp;diff=18654</id>
		<title>DistOS 2014W Lecture 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_3&amp;diff=18654"/>
		<updated>2014-02-21T21:56:58Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: added link to paper and title&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
==The Early Internet (Jan. 14)==&lt;br /&gt;
&lt;br /&gt;
* [https://homeostasis.scs.carleton.ca/~soma/distos/2014w/kahn1972-resource.pdf Robert E. Kahn, &amp;quot;Resource-Sharing Computer Communications Networks&amp;quot; (1972)]  [http://dx.doi.org/10.1109/PROC.1972.8911 (DOI)]&lt;br /&gt;
* [https://archive.org/details/ComputerNetworks_TheHeraldsOfResourceSharing Computer Networks: The Heralds of Resource Sharing (1972)] - video&lt;br /&gt;
&lt;br /&gt;
== Questions to consider: ==&lt;br /&gt;
* What were the purposes envisioned for computer networks?  How do those compare with the uses they are put to today?&lt;br /&gt;
* What sort of resources were shared?  What resources are shared today?&lt;br /&gt;
* What network architecture did they envision?  Do we still have the same architecture?&lt;br /&gt;
* What surprised you about this paper?&lt;br /&gt;
* What was unclear?&lt;br /&gt;
&lt;br /&gt;
==Group 1==&lt;br /&gt;
* video was mostly a summary of Kahn&#039;s paper&lt;br /&gt;
* process migration through different zones of air traffic control&lt;br /&gt;
* &amp;quot;distributed OS&amp;quot; meant something different than we normally think about, because many people would log in remotely to a single machine, it is very much like cloud infrastructure that we talk about today&lt;br /&gt;
* alto paper makes reference to Kahn&#039;s paper, and the alto designers had the foresight to see that networks like arpanet would be necessary&lt;br /&gt;
* would it be useful to have a co-processor responsible for maintaining shared resources even today?  Like the IMPs of the arpanet?  Today, computers are usually so fast it doesn&#039;t really matter.&lt;br /&gt;
&lt;br /&gt;
=== Questions ===&lt;br /&gt;
&lt;br /&gt;
* What were the purposes envisioned for computer networks?&lt;br /&gt;
** big computation, storage, resource sharing - &amp;quot;having a library on a hard disk&amp;quot;&lt;br /&gt;
&lt;br /&gt;
* How do those compare with the uses they are put to today?&lt;br /&gt;
** those things are being done, but mostly communication like instant messaging, email&lt;br /&gt;
&lt;br /&gt;
* What sort of resources were shared?&lt;br /&gt;
** databases, CPU time&lt;br /&gt;
&lt;br /&gt;
* What resources are shared today?&lt;br /&gt;
** mostly storage&lt;br /&gt;
&lt;br /&gt;
* What network architecture did they envision?  &lt;br /&gt;
** they had a checksum and acknowledge on each packet&lt;br /&gt;
** the IMPs were the network interface and the routers&lt;br /&gt;
** packet-switching&lt;br /&gt;
&lt;br /&gt;
* Do we still have the same architecture?&lt;br /&gt;
** packet-switching definitely won&lt;br /&gt;
** no, now IP doesn&#039;t checksum or acknowledge, but TCP has end-to-end checksum and acknowledge&lt;br /&gt;
** Kahn went on to learn from the errors of arpanet to design TCP/IP&lt;br /&gt;
** the job of network interface and router have been decoupled&lt;br /&gt;
&lt;br /&gt;
* What surprised you about this paper?&lt;br /&gt;
** everything&lt;br /&gt;
** how they were able to do this&lt;br /&gt;
** a network interface card and router was the size of a fridge&lt;br /&gt;
** high-level languages&lt;br /&gt;
** bootstrap protocol, bootstrapping an application&lt;br /&gt;
** primitive computers&lt;br /&gt;
** desktop publishing&lt;br /&gt;
** the logistics of running a cable from one university to another&lt;br /&gt;
** how old the idea of distributed operating systems is&lt;br /&gt;
&lt;br /&gt;
* What was unclear?&lt;br /&gt;
** much of the more technical specifications, but we mostly skipped over those&lt;br /&gt;
&lt;br /&gt;
==Group 2==&lt;br /&gt;
1. The main purpose of early networks was resource sharing. Abstraction for transmission. Message reliability was a by-product. The underlying idea is the same.&lt;br /&gt;
&lt;br /&gt;
2. Specialized Hardware/software and information sharing. super set of sharing.&lt;br /&gt;
&lt;br /&gt;
3. AD-HOC routing, it was TCP without saying it. Largely unchanged today.&lt;br /&gt;
&lt;br /&gt;
==Group 3==&lt;br /&gt;
===Envisioned computer network purposes===&lt;br /&gt;
* Improving reliability of services, due to redundant resource sets&lt;br /&gt;
* Resource sharing&lt;br /&gt;
* Usage modes:&lt;br /&gt;
** Users can use a remote terminal, from a remote office or home, to access those resources.&lt;br /&gt;
** Would allow centralization of resources, to improve ease of management and do away with inefficiencies&lt;br /&gt;
* Allow specialization of various sites. rather than each site trying to do it all&lt;br /&gt;
* Distributed simulations (notably air traffic control)&lt;br /&gt;
&lt;br /&gt;
Information-sharing is still relevant today, especially in research and large simulations. Remote access has mostly devolved into a specialized need.&lt;br /&gt;
&lt;br /&gt;
===Resources shared===&lt;br /&gt;
* Computing resources (especially expensive mainframes)&lt;br /&gt;
* Data sets&lt;br /&gt;
&lt;br /&gt;
===Network architecture===&lt;br /&gt;
* A primitive layered architecture&lt;br /&gt;
* Dedicated routing functions&lt;br /&gt;
* Various topologies:&lt;br /&gt;
** star&lt;br /&gt;
** loop&lt;br /&gt;
** bus&lt;br /&gt;
* Primarily packet- or message-switched&lt;br /&gt;
** Circuit-switching too expensive and has large setup times&lt;br /&gt;
** Doesn&#039;t require committing resources&lt;br /&gt;
* Primitive flow control and buffering&lt;br /&gt;
* Predates proper congestion control such as Van Jacobson&#039;s slow start&lt;br /&gt;
* Ad-hoc routing or based on something similar to RIP&lt;br /&gt;
* Anticipation of elephants and mice latency issues&lt;br /&gt;
* Unlike modern internet, error control and retransmission at every step&lt;br /&gt;
&lt;br /&gt;
The architecture today is similar, but the link layer is very different: use of Ethernet and ATM. The modern internet is a collection of autonomous systems, not a single network. Routing propagation is now large-scale and semi-automated (e.g., BGP externally, IS-IS and OSPF internally).&lt;br /&gt;
&lt;br /&gt;
===Surprising aspects===&lt;br /&gt;
&lt;br /&gt;
===Unclear portions===&lt;br /&gt;
* Weird packet format: page 1400 (page 4 of the PDF): “Node 6, discovering the message is for itself,&lt;br /&gt;
  replaces the destination address by the source address”
&lt;br /&gt;
==Group 4==&lt;br /&gt;
&lt;br /&gt;
* What were the purposes envisioned for computer networks? How do those compare with the uses they are put to today?&lt;br /&gt;
&lt;br /&gt;
Networks were envisioned as providing remote access to other computers, because useful resources such as computing power, large databases, and non-portable software were local to a particular computer, not themselves shared over the network.&lt;br /&gt;
&lt;br /&gt;
Today, we use networks mostly for sharing data, although with services like Amazon AWS, we&#039;re starting to share computing resources again.  We&#039;re also moving to support collaboration (e.g. Google Docs, GitHub, etc.).&lt;br /&gt;
&lt;br /&gt;
* What sort of resources were shared? What resources are shared today?&lt;br /&gt;
&lt;br /&gt;
Computing power was the key resource being shared; today, it&#039;s access to data.  (See above.)&lt;br /&gt;
&lt;br /&gt;
* What network architecture did they envision? Do we still have the same architecture?&lt;br /&gt;
&lt;br /&gt;
Surprisingly, yes: modern networks have substantially similar architectures to the ones described in these papers.  &lt;br /&gt;
Packet-switched networks are now ubiquitous.  We no longer bother with circuit-switching even for telephony, in contrast to the assumption that non-network data would continue to use the circuit-switched common-carrier network.  &lt;br /&gt;
&lt;br /&gt;
* What surprised you about this paper?&lt;br /&gt;
&lt;br /&gt;
We were surprised by the accuracy of the predictions given how early the paper was written — even things like electronic banking.  Also surprising were technological advances since the paper was written, such as data transfer speeds (we have networks that are faster than the integrated bus in the Alto), and the predicted resolution requirements (which we are nowhere near meeting).  The amount of detail in the description of the &#039;mouse pointing device&#039; was interesting too.&lt;br /&gt;
&lt;br /&gt;
* What was unclear? &lt;br /&gt;
&lt;br /&gt;
Nothing significant; we&#039;re looking at these with the benefit of hindsight.&lt;br /&gt;
&lt;br /&gt;
==Summary of the discussion from lecture==&lt;br /&gt;
Anil&#039;s view is that even today we can think of computer networks as primarily a resource-sharing platform. For example, when we access the web or search Google, we are making use of the resource sharing facilitated by the Internet (a network of interconnected computer networks). It is not possible to put 20,000 computers in our basements; instead, the Internet facilitates access to computing power and databases built from hundreds of thousands of computers. In fact, Google and other popular search engines have a local copy of the entire web in their data centers: a centralized copy of a large distributed system. A somewhat contradictory phenomenon, if you think about it in terms of the design goals of distributed systems. &lt;br /&gt;
&lt;br /&gt;
Another important takeaway from the discussion was that the “early to market/first player” with a new product or solution to a niche problem, and the one offering solutions based on simple mechanisms rather than complex ones, gets adopted faster. The classic example is the Internet. ARPANET, which was supposed to be an academic research project, was based on simple mechanisms, was open, and was the first of its kind; it was adopted widely and evolved into the Internet as we see it today. Note that this approach is not without drawbacks: security, for example, was not factored into the design of ARPANET, since it was intended to be a network between trusted parties, which was fine then. But when ARPANET evolved into the Internet, security became an area requiring major focus. In Silicon Valley the focus is on being the “first player” in a niche market, and to meet that objective simple frameworks/mechanisms are often used. In doing so there is a possibility of leaving out components that turn out to be a vital missing link, a recent example being the security flaw in Snapchat that led to user data being exposed.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_4&amp;diff=18653</id>
		<title>DistOS 2014W Lecture 4</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_4&amp;diff=18653"/>
		<updated>2014-02-21T21:54:55Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: added link to paper and title&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==The Alto (Jan. 16)==&lt;br /&gt;
&lt;br /&gt;
* [https://homeostasis.scs.carleton.ca/~soma/distos/2014w/alto.pdf Thacker et al., &amp;quot;Alto: A Personal computer&amp;quot; (1979)]  ([https://archive.org/details/bitsavers_xeroxparcttoAPersonalComputer_6560658 archive.org])&lt;br /&gt;
&lt;br /&gt;
Discussions on the Alto&lt;br /&gt;
&lt;br /&gt;
==CPU, Memory, Disk==&lt;br /&gt;
&lt;br /&gt;
====CPU====&lt;br /&gt;
&lt;br /&gt;
The general hardware architecture of the CPU was biased towards the user, meaning that a greater focus was put on I/O capabilities and less on computational power (arithmetic, etc.). There were two levels of task-switching; the CPU provided sixteen fixed-priority tasks with hardware interrupts, each of which was permanently assigned to a piece of hardware. Only one of these tasks (the lowest-priority) was dedicated to the user. This task ran a virtual machine for BCPL (a C-like language); the user had no access at all to the underlying microcode. Other languages could be emulated as well.&lt;br /&gt;
&lt;br /&gt;
====Memory====&lt;br /&gt;
&lt;br /&gt;
The Alto started with 64K of 16-bit words of memory and eventually grew to 256K words. However, the higher memory was not accessible except through special tricks, similar to the way that memory above 4GB is not accessible today on 32-bit systems without special tricks.&lt;br /&gt;
&lt;br /&gt;
====Task Switching====&lt;br /&gt;
&lt;br /&gt;
One confusing point was that they refer to “tasks” both as the 16 fixed hardware tasks and as the many software tasks that could be multiplexed onto the lowest-priority of those hardware tasks. In either case, task switching was cooperative: until a task gave up control by running a specific instruction, no other task could run. From a modern perspective this looks like a major security problem, since malicious software could simply never relinquish the CPU. However, making the hardware-service tasks first-class in this sense (with full access to the CPU and memory) made the hardware simpler, because much of the complexity could be handled in software. Perhaps the first hints of what we now think of as drivers?&lt;br /&gt;
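&lt;br /&gt;
The cooperative, fixed-priority scheme described above can be sketched as follows. This is an illustrative model only, not Alto microcode; the task names and scheduler shape are our invention. Each task is a generator that runs until it voluntarily yields control (the analogue of a task-switch instruction), and the scheduler always resumes the highest-priority runnable task.&lt;br /&gt;

```python
# Illustrative sketch (not Alto microcode) of cooperative, fixed-priority
# task switching.  Each task runs until it executes the analogue of the
# Alto's task-switch instruction -- here, a plain `yield`.

def scheduler(tasks):
    """tasks: list of generators, index 0 = highest priority.
    Runs until every task has finished; returns a trace of what ran."""
    trace = []
    while any(t is not None for t in tasks):
        for i, task in enumerate(tasks):      # highest priority first
            if task is None:
                continue
            try:
                trace.append((i, next(task)))  # run until the task yields
            except StopIteration:
                tasks[i] = None                # task finished
            break  # re-evaluate priorities after every switch
    return trace

def device_task(name, steps):
    for n in range(steps):
        yield f"{name}:{n}"  # do some work, then relinquish the CPU

# The lowest-priority "task" (the emulator) only runs once no
# higher-priority device task needs service.
trace = scheduler([device_task("disk", 2), device_task("emulator", 2)])
```

Note how the emulator starves until the disk task finishes: exactly the property that makes cooperative switching simple for hardware service but dangerous for untrusted software.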
&lt;br /&gt;
====Disk and Filesystem====&lt;br /&gt;
&lt;br /&gt;
To make use of the disk controller, read, write, truncate, delete, and similar commands were made available. To reduce the risk of global damage, structural information was saved in a label on each page. A hints mechanism was also available: the directory provided a hint as to where a file resided on disk. File integrity was checked using the seal bit and the label.&lt;br /&gt;
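&lt;br /&gt;
The label idea can be sketched as follows; this is a hypothetical illustration in Python (the structures and names are our own, not Xerox&#039;s). Because each page carries redundant structural metadata, a stale directory hint can be detected against the label, and damage from a bad pointer stays local.&lt;br /&gt;

```python
# Hypothetical sketch of page labels: each disk page redundantly records
# which file it belongs to, so directory "hints" can be verified and,
# if stale, recovered from by scanning labels.

from dataclasses import dataclass

@dataclass
class Page:
    file_id: int   # which file this page belongs to (part of the label)
    page_no: int   # position of this page within the file
    data: bytes

def read_page(disk, hint_addr, want_file, want_page):
    """Follow the directory hint, but trust the label on the page itself."""
    page = disk.get(hint_addr)
    if page and page.file_id == want_file and page.page_no == want_page:
        return page.data                 # hint was correct
    for page in disk.values():           # hint stale: recover via labels
        if page.file_id == want_file and page.page_no == want_page:
            return page.data
    raise IOError("page not found")

disk = {7: Page(file_id=42, page_no=0, data=b"hello")}
```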
&lt;br /&gt;
==Ethernet, Networking protocols==&lt;br /&gt;
Although the original motive of the Alto as a personal computer was to serve the needs of a single user, it was realized that communicating with other Altos/computers would facilitate resource sharing, for collaboration and for economic reasons. The main design objectives for the computer network connecting personal computers (Altos) were:&lt;br /&gt;
&lt;br /&gt;
Data transmission speed: bandwidth should at least match the memory bus speed, to give the end user a consistent experience in which resources accessed over the network have the same apparent latency as resources accessed within the computer&lt;br /&gt;
&lt;br /&gt;
Size of network: the capability to connect a large number of nodes together&lt;br /&gt;
&lt;br /&gt;
Reliability: once the user starts to use resources/services over a network, it is vital that the network be reliable enough to deliver the required quality of service.&lt;br /&gt;
&lt;br /&gt;
The Alto used a general packet transport system, which can be thought of as a set of standard communication protocols that facilitate interoperability. &lt;br /&gt;
&lt;br /&gt;
The key element enabling communication between the Alto and other computers was the Ethernet, a layer-2 protocol and mechanism developed in-house at Xerox by Robert Metcalfe et al. The Ethernet was a broadcast, packet-switched network with a bandwidth of 3 Mbit/s, able to connect 256 computers with up to 1 km between two connected nodes. Another important aspect of the Ethernet was that nodes/computers could be added, removed, powered on, or powered off without disturbing existing network communications. Since the Ethernet offered only best-effort service, with no guarantee of error-free delivery, a hierarchy of layered communication protocols was implemented in the Alto to achieve reliable communication over it.&lt;br /&gt;
&lt;br /&gt;
The Alto had the capability to act as a gateway connecting different networks together. Xerox had a “Xerox Internet” consisting of several hundred computers, 25 networks, and 20 gateways providing internet service back in 1979. &lt;br /&gt;
&lt;br /&gt;
The Ethernet communication system had two components: the Ethernet controller and the transceiver. The controller performed the encoding/decoding, buffering, and micromachine-interfacing functions, whereas the transceiver dealt with the transmission and reception of bits and operated in half-duplex mode. &lt;br /&gt;
&lt;br /&gt;
One important difference in the design of the Ethernet controller task, as opposed to those for the display and disk, was that there were no periodic events to wake the task up; instead, an S-group instruction set a flip-flop in the Ethernet hardware, which woke the Ethernet controller task. The Ethernet also used an interrupt-based mechanism to indicate completion, since packet reception and transmission happen asynchronously. The Ethernet microcode implemented a packet-filtering mechanism which accepted (1) packets destined for the host and (2) broadcast packets. It could also operate in a promiscuous mode, with the host address set to zero, receiving all packets; this could be used for debugging.&lt;br /&gt;
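&lt;br /&gt;
The address filter just described can be sketched as follows; this is our own illustration, not the actual microcode, and the convention that destination 0 means broadcast is an assumption about the 3 Mbit Ethernet.&lt;br /&gt;

```python
# Sketch of the receive filter described above (illustrative, not the
# real microcode).  Assumption: destination address 0 is broadcast.

BROADCAST = 0

def accept(packet_dest: int, host_addr: int) -> bool:
    """Accept packets for this host, broadcasts, or everything when the
    host address is set to zero (promiscuous mode, used for debugging)."""
    if host_addr == 0:
        return True  # promiscuous: receive all packets
    return packet_dest in (host_addr, BROADCAST)
```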
&lt;br /&gt;
The Ethernet had no security mechanism built into it. Since the Ethernet was a single collision domain, an exponential backoff algorithm was implemented to recover from collisions (which occur when two Ethernet transmitters try to use the ether at the same time).&lt;br /&gt;
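&lt;br /&gt;
Binary exponential backoff can be sketched as follows (parameters illustrative; the cap of 10 doublings follows later Ethernet practice, not necessarily the Alto&#039;s): after the n-th consecutive collision, a host waits a random number of slot times drawn from [0, 2**n - 1], doubling the mean delay each retry so that contending hosts spread out.&lt;br /&gt;

```python
# Sketch of binary exponential backoff for collision recovery.
# Parameters are illustrative, not taken from the Alto paper.

import random

def backoff_slots(collisions: int, max_exponent: int = 10) -> int:
    """Slot times to wait before retrying after `collisions` collisions."""
    n = min(collisions, max_exponent)   # cap the window growth
    return random.randrange(2 ** n)     # uniform over [0, 2**n - 1]
```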
&lt;br /&gt;
==Graphics, Mouse, Printing==&lt;br /&gt;
&lt;br /&gt;
===Graphics===&lt;br /&gt;
&lt;br /&gt;
A lot of time was spent on what paper and ink provide us in a display sense, constantly referencing an 8.5 by 11 inch piece of paper as the type of display they were striving for. This showed what they were attempting to emulate in the Alto&#039;s display. The authors proposed 500 - 1000 black-or-white bits per inch of display (i.e., 500 - 1000 dpi). However, they were unable to pursue this goal, instead settling for 70 dpi, allowing them to show things such as 10 pt text. They state that a 30 Hz refresh rate was found not to be objectionable. Interestingly, we would find this objectionable today, most likely from being spoiled by the sheer speed of modern computers, whereas the authors were used to slower performance. The Alto&#039;s display took up &#039;&#039;&#039;half&#039;&#039;&#039; the Alto&#039;s memory, a choice we found very interesting. &lt;br /&gt;
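&lt;br /&gt;
The “half the memory” figure is easy to sanity-check. Assuming the commonly cited 606 x 808 one-bit-per-pixel bitmap (our assumption; the paper&#039;s exact figures may differ) against the base 64K of 16-bit words:&lt;br /&gt;

```python
# Back-of-the-envelope check of the "half the memory" claim.
# 606 x 808 one-bit-per-pixel display is an assumed figure.

width, height = 606, 808
bitmap_words = width * height // 16      # 16-bit words for the bitmap
fraction = bitmap_words / (64 * 1024)    # share of a 64K-word memory
# bitmap_words is 30603: roughly 47 percent of 64K words, i.e. "half"
```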
&lt;br /&gt;
Another interesting point was that the authors thought it beneficial that they could access display memory directly rather than using conventional frame buffer organizations. While we are unsure what they meant by conventional frame buffer organizations, it is interesting to note that frame buffers are what we use for our displays today.&lt;br /&gt;
&lt;br /&gt;
===Mouse===&lt;br /&gt;
&lt;br /&gt;
The mouse outlined in the paper was 200 dpi (vs. a standard mouse from Apple, which is 1300 dpi) and had three buttons (one of the standard configurations of mice produced today). They were already using different mouse cursors (i.e., different pointer images for the cursor on screen). The really interesting point here is that the design outlined in the paper is so similar to designs we still use today. The only real divergence was the introduction of optical mice, although that did not altogether halt the use of non-optical mice. Today, we just have more flexibility in how we design mice (e.g., a scroll wheel, more buttons, etc.).&lt;br /&gt;
&lt;br /&gt;
===Printer===&lt;br /&gt;
&lt;br /&gt;
They state that the printer should print, in one second, an 8.5 by 11 inch page defined at 350 dots/inch (roughly 4000 horizontal scan lines of 3000 dots each). Ironically, that resolution is far beyond what the actual Alto display provided. However, they did not have enough memory to do this and had to work around it with techniques such as an incremental algorithm and a reduced number of scan lines. We were disappointed that they did not discuss the hardware implementation of the printer, only the software controller. Still, dividing the memory requirements of the printer between the device itself and the computer was quite a modern idea at the time, and still is.&lt;br /&gt;
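&lt;br /&gt;
The rounded figures in the paper check out arithmetically; at one page per second the controller must deliver over eleven million dots each second:&lt;br /&gt;

```python
# Rough check of the printing figures above: an 8.5 x 11 inch page
# at 350 dots/inch, delivered in one second.

dpi = 350
scan_lines = round(11 * dpi)        # 3850, the paper's "roughly 4000"
dots_per_line = round(8.5 * dpi)    # 2975, the paper's "roughly 3000"
dots_per_second = scan_lines * dots_per_line  # dot rate at 1 page/sec
```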
&lt;br /&gt;
===Other Interesting Notes===&lt;br /&gt;
&lt;br /&gt;
We found it interesting that peripheral devices were included at all.&lt;br /&gt;
&lt;br /&gt;
The author makes passing mention of a tablet to draw on. However, he stated that no one really liked the tablet, as it got in the way of the keyboard.&lt;br /&gt;
&lt;br /&gt;
A recurring theme was the lack of memory to implement what they had originally envisioned.&lt;br /&gt;
&lt;br /&gt;
==Applications, Programming Environment==&lt;br /&gt;
&lt;br /&gt;
=== Emulation ===&lt;br /&gt;
A notable feature is that the Alto implemented a BCPL emulator in the PROM microstore. Other emulators were available, but they were loaded in RAM. BCPL was used as the main implementation language for the computer&#039;s applications. Very little assembly was used.&lt;br /&gt;
&lt;br /&gt;
=== Programming Environments ===&lt;br /&gt;
The Alto ran the gamut of available programming environments. A conventional toolchain implemented in BCPL was offered (i.e., compiler, linker, debugger, file manager, etc.). Interactive programming environments such as Smalltalk and Interlisp were also available, but these fell prey to the Alto&#039;s limited main memory (64K) and suffered crippling performance issues.&lt;br /&gt;
&lt;br /&gt;
The only standardized facilities for programming environments to use were the file system and the communication protocols. All other hardware had to be accessed using custom methods.&lt;br /&gt;
&lt;br /&gt;
=== Personal Applications ===&lt;br /&gt;
Applications made use of the display, mouse, and keyboard. They were mostly involved with document production. For example, there was a text editor in which the user could specify formatting and typefaces. The PC also helped facilitate and automate aspects of logic-board design and assembly.&lt;br /&gt;
&lt;br /&gt;
=== Communication in applications ===&lt;br /&gt;
Most applications were designed with the assumption that the computer would exploit networked resources. For example, printing would be handled by a printing server, and file storage could be local or distributed. The Alto made use of existing services too: its clock was set by a &#039;time of day&#039; service, and it could be bootstrapped over Ethernet.&lt;br /&gt;
&lt;br /&gt;
Communication in applications was also used in new and novel ways. For example, the debugger mentioned above was network aware. It could help programmers debug software remotely.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_5&amp;diff=18652</id>
		<title>DistOS 2014W Lecture 5</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_5&amp;diff=18652"/>
		<updated>2014-02-21T21:53:47Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: /* The Mother of all Demos (Jan. 21) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=The Mother of all Demos (Jan. 21)=&lt;br /&gt;
&lt;br /&gt;
* [http://www.dougengelbart.org/firsts/dougs-1968-demo.html Doug Engelbart Institute, &amp;quot;Doug&#039;s 1968 Demo&amp;quot;]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/The_Mother_of_All_Demos Wikipedia&#039;s page on &amp;quot;The Mother of all Demos&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
= Introduction =&lt;br /&gt;
&lt;br /&gt;
Anil set the theme of the discussion for the week: to try to understand what the early visionaries/researchers wanted the computer to be, and what it has become. In other words, what was considered fundamental in those days, and where those ideas stand today. Features that were easy to implement using simple mechanisms were carried forward, whereas those that demanded more complex systems, or that were found to add little value in the near future, were pushed down the order. In this context, the following observations were made: (1) a truly distributed computational infrastructure only makes sense when we have something to distribute; (2) use cases drive large distributed systems, a good example being the Web. Another key observation from Anil was that there was always a utopian aspect to the early systems, be it NLS, ARPANET, or the Alto. One good example is that security was never considered essential in those systems, since they were assumed to operate in a trusted environment. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
; Operating system&lt;br /&gt;
: The software that turns the computer you have into the one you want (Anil)&lt;br /&gt;
&lt;br /&gt;
* What sort of computer did we want to have?&lt;br /&gt;
* What sort of abstractions did they want to be easy? Hard?&lt;br /&gt;
* What could we build with the internet (not just WAN, but also LAN)?&lt;br /&gt;
* Most dreams people had of their computers smacked into the wall of reality.&lt;br /&gt;
&lt;br /&gt;
= MOAD review in groups =&lt;br /&gt;
&lt;br /&gt;
* Chorded keyboard unfortunately obscure, partly because the attendees disagreed with the long-term investment of training the user.&lt;br /&gt;
* View control → hyperlinking system, but in a lightweight (more like nanoweight) markup language.&lt;br /&gt;
* Ad-hoc ticketing system&lt;br /&gt;
* Ad-hoc messaging system&lt;br /&gt;
** Used on a time-sharing system with shared storage&lt;br /&gt;
* Primitive revision control system&lt;br /&gt;
* Different vocabulary:&lt;br /&gt;
** Bug and bug smear (mouse and trail)&lt;br /&gt;
** Point rather than click&lt;br /&gt;
&lt;br /&gt;
= Class review =&lt;br /&gt;
&lt;br /&gt;
* Doug died Jul 2 2013&lt;br /&gt;
* Doug himself called it an “online system”, rather than offline composition of code using card punchers as was common in the day.&lt;br /&gt;
* What became of the tech:&lt;br /&gt;
** Chorded keyboards:&lt;br /&gt;
*** Exist but obscure&lt;br /&gt;
** Pre-ARPANET network:&lt;br /&gt;
*** Time-sharing mainframe&lt;br /&gt;
*** 13 workstations&lt;br /&gt;
*** Telephone and television circuit&lt;br /&gt;
** Mouse&lt;br /&gt;
*** “I sometimes apologize for calling it a mouse”&lt;br /&gt;
** Collaborative document editing integrated with screen sharing&lt;br /&gt;
** Videoconferencing&lt;br /&gt;
*** Part of the vision, but more for the demo at the time,&lt;br /&gt;
** Hyperlinks&lt;br /&gt;
*** The web on a mainframe&lt;br /&gt;
** Languages&lt;br /&gt;
*** Metalanguages&lt;br /&gt;
**** “Part and parcel of their entire vision of augmenting human intelligence.”&lt;br /&gt;
**** You must teach the computer about the language you are using.&lt;br /&gt;
**** They were the use case. It was almost designed more for augmenting programmer intelligence rather than human intelligence.&lt;br /&gt;
*** It was normal for the time to build new languages (domain-specific) for new systems. Nowadays, we standardize on one but develop large APIs, at the expense of conciseness. We look for short-term benefits; we minimize programmer effort.&lt;br /&gt;
*** Compiler compiler&lt;br /&gt;
** Freeze-pane&lt;br /&gt;
** Folding—Zoomable UI (ZUI)&lt;br /&gt;
*** Lots of systems do it, but not the default&lt;br /&gt;
*** Much easier to just present everything.&lt;br /&gt;
** Technologies that required further investment got left behind.&lt;br /&gt;
* The NLS had little to no security&lt;br /&gt;
** There was a minimal notion of a user&lt;br /&gt;
** There was a utopian aspect. Meanwhile, the Mac had no utopian aspect. Data exchange was through floppies. Any network was small, local, ad-hoc, and among trusted peers.&lt;br /&gt;
** The system wasn&#039;t envisioned to scale up to masses of people who didn&#039;t trust each other.&lt;br /&gt;
** How do you enforce secrecy?&lt;br /&gt;
* Part of the reason for lack of adoption of some of the tech was hardware. We can posit that a bigger reason would be infrastructure.&lt;br /&gt;
* Differentiate usability of system from usability of vision&lt;br /&gt;
** What was missing was the polish, the ‘sexiness’, and the intuitiveness of later systems like the Apple II and the Lisa.&lt;br /&gt;
** The usability of the later Alto is still less than commercial systems.&lt;br /&gt;
*** The word processor was modal, which is apt to confuse unmotivated and untrained users.&lt;br /&gt;
* In the context of the Mother of All Demos, the Alto doesn&#039;t seem entirely revolutionary. Xerox PARC raided Engelbart&#039;s team. They almost had a GUI; rather, they had what we would today call a virtual console, with a few things layered on top.&lt;br /&gt;
* What happens with visionaries that present a big vision is that the spectators latch onto specific aspects.&lt;br /&gt;
* To be comfortable with not adopting the vision, one must ostracize the visionary. People pay attention to things that fit into their world view.&lt;br /&gt;
* Use cases of networking have changed little, though the means did&lt;br /&gt;
* Fundamentally a resource-sharing system; everything is shared, unlike later systems where you would need to share things explicitly. The resources shared were those it fundamentally made sense to share: documents, printers, etc.&lt;br /&gt;
* Resource sharing was never enough. &#039;&#039;&#039;Information-sharing&#039;&#039;&#039; was the focus.&lt;br /&gt;
&lt;br /&gt;
“Mother of all Demos” is the nickname for Engelbart&#039;s demonstration of how computers could help humans become smarter. &lt;br /&gt;
&lt;br /&gt;
*What is most interesting in this work:&lt;br /&gt;
His idea included seeing computing devices as a means to communicate and retrieve information, rather than just crunch numbers. This idea is embodied in NLS, the “oN-Line System”.&lt;br /&gt;
&lt;br /&gt;
*Some information about the NLS system:&lt;br /&gt;
1) NLS was a revolutionary computer collaboration system from the 1960s. &lt;br /&gt;
2) Designed by Douglas Engelbart and implemented by researchers at the Augmentation Research Center (ARC) at the Stanford Research Institute (SRI). &lt;br /&gt;
3) The NLS system was the first to employ the practical use of :&lt;br /&gt;
  a) hypertext links,&lt;br /&gt;
  b) the mouse, &lt;br /&gt;
  c) raster-scan video monitors, &lt;br /&gt;
  d) information organized by relevance, &lt;br /&gt;
  e) screen windowing, &lt;br /&gt;
  f) presentation programs, &lt;br /&gt;
  g) and other modern computing concepts.&lt;br /&gt;
&lt;br /&gt;
= Alto review =&lt;br /&gt;
&lt;br /&gt;
* Fundamentally a personal computer&lt;br /&gt;
* Applications:&lt;br /&gt;
** Drawing program with curves and arcs for drawing&lt;br /&gt;
** Hardware design tools (mostly logic boards)&lt;br /&gt;
** Time server&lt;br /&gt;
* Less designed for reading than the NLS; more designed around paper. Xerox had a laser printer, and you would read what you printed. Hypertext was deprioritized, unlike in the NLS vision, which had focused on what could not be expressed on paper.&lt;br /&gt;
* Xerox had almost an obsession with making documents print beautifully.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_5&amp;diff=18651</id>
		<title>DistOS 2014W Lecture 5</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_5&amp;diff=18651"/>
		<updated>2014-02-21T21:53:34Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: added link to paper and title&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=The Mother of all Demos (Jan. 21)=&lt;br /&gt;
&lt;br /&gt;
If you can, watch the whole demo.  The Stanford version with annotated clips is good if you are short on time.&lt;br /&gt;
&lt;br /&gt;
* [http://www.dougengelbart.org/firsts/dougs-1968-demo.html Doug Engelbart Institute, &amp;quot;Doug&#039;s 1968 Demo&amp;quot;]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/The_Mother_of_All_Demos Wikipedia&#039;s page on &amp;quot;The Mother of all Demos&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
= Introduction =&lt;br /&gt;
&lt;br /&gt;
Anil set the theme of the discussion for the week: to try to understand what the early visionaries/researchers wanted the computer to be, and what it has become. In other words, what was considered fundamental in those days, and where those ideas stand today. Features that were easy to implement using simple mechanisms were carried forward, whereas those that demanded more complex systems, or that were found to add little value in the near future, were pushed down the order. In this context, the following observations were made: (1) a truly distributed computational infrastructure only makes sense when we have something to distribute; (2) use cases drive large distributed systems, a good example being the Web. Another key observation from Anil was that there was always a utopian aspect to the early systems, be it NLS, ARPANET, or the Alto. One good example is that security was never considered essential in those systems, since they were assumed to operate in a trusted environment. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
; Operating system&lt;br /&gt;
: The software that turns the computer you have into the one you want (Anil)&lt;br /&gt;
&lt;br /&gt;
* What sort of computer did we want to have?&lt;br /&gt;
* What sort of abstractions did they want to be easy? Hard?&lt;br /&gt;
* What could we build with the internet (not just WAN, but also LAN)?&lt;br /&gt;
* Most dreams people had of their computers smacked into the wall of reality.&lt;br /&gt;
&lt;br /&gt;
= MOAD review in groups =&lt;br /&gt;
&lt;br /&gt;
* Chorded keyboard unfortunately obscure, partly because the attendees disagreed with the long-term investment of training the user.&lt;br /&gt;
* View control → hyperlinking system, but in a lightweight (more like nanoweight) markup language.&lt;br /&gt;
* Ad-hoc ticketing system&lt;br /&gt;
* Ad-hoc messaging system&lt;br /&gt;
** Used on a time-sharing system with shared storage&lt;br /&gt;
* Primitive revision control system&lt;br /&gt;
* Different vocabulary:&lt;br /&gt;
** Bug and bug smear (mouse and trail)&lt;br /&gt;
** Point rather than click&lt;br /&gt;
&lt;br /&gt;
= Class review =&lt;br /&gt;
&lt;br /&gt;
* Doug died Jul 2 2013&lt;br /&gt;
* Doug himself called it an “online system”, rather than offline composition of code using card punchers as was common in the day.&lt;br /&gt;
* What became of the tech:&lt;br /&gt;
** Chorded keyboards:&lt;br /&gt;
*** Exist but obscure&lt;br /&gt;
** Pre-ARPANET network:&lt;br /&gt;
*** Time-sharing mainframe&lt;br /&gt;
*** 13 workstations&lt;br /&gt;
*** Telephone and television circuit&lt;br /&gt;
** Mouse&lt;br /&gt;
*** “I sometimes apologize for calling it a mouse”&lt;br /&gt;
** Collaborative document editing integrated with screen sharing&lt;br /&gt;
** Videoconferencing&lt;br /&gt;
*** Part of the vision, but more for the demo at the time,&lt;br /&gt;
** Hyperlinks&lt;br /&gt;
*** The web on a mainframe&lt;br /&gt;
** Languages&lt;br /&gt;
*** Metalanguages&lt;br /&gt;
**** “Part and parcel of their entire vision of augmenting human intelligence.”&lt;br /&gt;
**** You must teach the computer about the language you are using.&lt;br /&gt;
**** They were the use case. It was almost designed more for augmenting programmer intelligence rather than human intelligence.&lt;br /&gt;
*** It was normal for the time to build new languages (domain-specific) for new systems. Nowadays, we standardize on one but develop large APIs, at the expense of conciseness. We look for short-term benefits; we minimize programmer effort.&lt;br /&gt;
*** Compiler compiler&lt;br /&gt;
** Freeze-pane&lt;br /&gt;
** Folding—Zoomable UI (ZUI)&lt;br /&gt;
*** Lots of systems do it, but not the default&lt;br /&gt;
*** Much easier to just present everything.&lt;br /&gt;
** Technologies that required further investment got left behind.&lt;br /&gt;
* The NLS had little to no security&lt;br /&gt;
** There was a minimal notion of a user&lt;br /&gt;
** There was a utopian aspect. Meanwhile, the Mac had no utopian aspect. Data exchange was through floppies. Any network was small, local, ad-hoc, and among trusted peers.&lt;br /&gt;
** The system wasn&#039;t envisioned to scale up to masses of people who didn&#039;t trust each other.&lt;br /&gt;
** How do you enforce secrecy?&lt;br /&gt;
* Part of the reason for lack of adoption of some of the tech was hardware. We can posit that a bigger reason would be infrastructure.&lt;br /&gt;
* Differentiate usability of system from usability of vision&lt;br /&gt;
** What was missing was the polish, the ‘sexiness’, and the intuitiveness of later systems like the Apple II and the Lisa.&lt;br /&gt;
** The usability of the later Alto is still less than commercial systems.&lt;br /&gt;
*** The word processor was modal, which is apt to confuse unmotivated and untrained users.&lt;br /&gt;
* In the context of the Mother of All Demos, the Alto doesn&#039;t seem entirely revolutionary. Xerox PARC raided Engelbart&#039;s team. They almost had a GUI; rather, they had what we would today call a virtual console, with a few things layered on top.&lt;br /&gt;
* What happens with visionaries that present a big vision is that the spectators latch onto specific aspects.&lt;br /&gt;
* To be comfortable with not adopting the vision, one must ostracize the visionary. People pay attention to things that fit into their world view.&lt;br /&gt;
* Use cases of networking have changed little, though the means have&lt;br /&gt;
* Fundamentally a resource-sharing system; everything is shared, unlike later systems where you would need to do so explicitly. The resources shared were those it fundamentally made sense to share: documents, printers, etc.&lt;br /&gt;
* Resource sharing was never enough. &#039;&#039;&#039;Information-sharing&#039;&#039;&#039; was the focus.&lt;br /&gt;
&lt;br /&gt;
“Mother of all demos” is the nickname for Engelbart&#039;s 1968 demonstration of how computers could help humans become smarter. &lt;br /&gt;
&lt;br /&gt;
*More interesting in this work is that:&lt;br /&gt;
&amp;quot;His idea included seeing computing devices as a means to communicate and retrieve information, rather than just crunch numbers.&amp;quot; This idea is represented in NLS, the &amp;quot;oN-Line System&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
*Some information about the NLS system:&lt;br /&gt;
1) NLS was a revolutionary computer collaboration system from the 1960s. &lt;br /&gt;
2) Designed by Douglas Engelbart and implemented by researchers at the Augmentation Research Center (ARC) at the Stanford Research Institute (SRI). &lt;br /&gt;
3) The NLS system was the first to employ the practical use of:&lt;br /&gt;
  a) hypertext links,&lt;br /&gt;
  b) the mouse, &lt;br /&gt;
  c) raster-scan video monitors, &lt;br /&gt;
  d) information organized by relevance, &lt;br /&gt;
  e) screen windowing, &lt;br /&gt;
  f) presentation programs, &lt;br /&gt;
  g) and other modern computing concepts.&lt;br /&gt;
&lt;br /&gt;
= Alto review =&lt;br /&gt;
&lt;br /&gt;
* Fundamentally a personal computer&lt;br /&gt;
* Applications:&lt;br /&gt;
** Drawing program with curves and arcs&lt;br /&gt;
** Hardware design tools (mostly logic boards)&lt;br /&gt;
** Time server&lt;br /&gt;
* Less designed for reading than the NLS; more designed around paper. Xerox had a laser printer, and you would read what you printed. Hypertext was deprioritized, whereas the NLS vision had focused on what could not be expressed on paper.&lt;br /&gt;
* Xerox had almost an obsession with making documents print beautifully.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_6&amp;diff=18650</id>
		<title>DistOS 2014W Lecture 6</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_6&amp;diff=18650"/>
		<updated>2014-02-21T21:52:15Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: /* The Early Web (Jan. 23) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt; &#039;&#039;&#039;the point form notes for this lecture could be turned into full sentences/paragraphs&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==The Early Web (Jan. 23)==&lt;br /&gt;
&lt;br /&gt;
* [https://archive.org/details/02Kahle000673 Berners-Lee et al., &amp;quot;World-Wide Web: The Information Universe&amp;quot; (1992)], pp. 52-58&lt;br /&gt;
* [http://www.youtube.com/watch?v=72nfrhXroo8 Alex Wright, &amp;quot;The Web That Wasn&#039;t&amp;quot; (2007)], Google Tech Talk&lt;br /&gt;
&lt;br /&gt;
== Group Discussion on &amp;quot;The Early Web&amp;quot; ==&lt;br /&gt;
&lt;br /&gt;
Questions to discuss:&lt;br /&gt;
&lt;br /&gt;
# How do you think the web would have turned out if it were not the way it is now? &lt;br /&gt;
# What kind of infrastructure changes would you like to make? &lt;br /&gt;
&lt;br /&gt;
=== Group 1 ===&lt;br /&gt;
: Relatively satisfied with the present structure of the web; some suggested changes are in the areas below: &lt;br /&gt;
* Make use of the greater potential of Protocols &lt;br /&gt;
* More communication and interaction capabilities.&lt;br /&gt;
* Implementation changes in the present payment systems. For example, the use of &amp;quot;micro-computation&amp;quot; - a discussion we will return to in future classes. Also, cryptographic currencies.&lt;br /&gt;
* Augmented reality.&lt;br /&gt;
* More towards individual privacy. &lt;br /&gt;
&lt;br /&gt;
=== Group 2 ===&lt;br /&gt;
==== Problem of unstructured information ====&lt;br /&gt;
A large portion of the web serves content that is overwhelmingly concerned with presentation rather than with structuring content. Tim Berners-Lee himself bemoaned the death of the semantic web. His original vision of it was as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- Code from Wikipedia&#039;s article on the semantic web, except for the block quoting form, which this MediaWiki instance doesn&#039;t seem to support. --&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web&amp;amp;nbsp;– the content, links, and transactions between people and computers. A &amp;quot;Semantic Web&amp;quot;, which makes this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The &amp;quot;intelligent agents&amp;quot; people have touted for ages will finally materialize.&amp;lt;ref&amp;gt;{{cite book |last=Berners-Lee |first=Tim |authorlink=Tim Berners-Lee |coauthors=Fischetti, Mark |title=Weaving the Web |publisher=HarperSanFrancisco |year=1999 |pages=chapter 12 |isbn=978-0-06-251587-2 |nopp=true }}&amp;lt;/ref&amp;gt;&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For this vision to come true, information arguably needs to be structured, maybe even classified. The idea of a universal information classification system has been floated. The modern web, however, is mostly developed by software developers, not librarians and the like.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- TODO: Yahoo blurb. --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Also, how does one differentiate satire from fact?&lt;br /&gt;
&lt;br /&gt;
==== Valuation and deduplication of information ====&lt;br /&gt;
Another problem with the current WWW is the duplication of information. Redundancy is not in itself harmful, as it increases the availability of information, but is ad-hoc duplication of the information itself harmful?&lt;br /&gt;
&lt;br /&gt;
One then comes to the problem of assigning a value to the information found therein. How does one rate information, and according to what criteria? How does one authenticate the information? Often, popularity is used as an indicator of veracity, almost in a sophistic manner. See excessive reliance on Google page ranking or Reddit score for various types of information consumption for research or news consumption respectively.&lt;br /&gt;
&lt;br /&gt;
=== On the current infrastructure ===&lt;br /&gt;
The current &amp;lt;em&amp;gt;internet&amp;lt;/em&amp;gt; infrastructure should remain as is, at least in countries with more than a modicum of freedom of access to information. Centralization of control of access to information is a terrible power. See China and parts of the Middle-East. On that note, what can be said of popular sites, such as Google or Wikipedia, that serve as the main entry point for many access patterns?&lt;br /&gt;
&lt;br /&gt;
The problem, if any, in the current web infrastructure is of the web itself, not the internet.&lt;br /&gt;
&lt;br /&gt;
=== Group 3 ===&lt;br /&gt;
* What we want to keep &lt;br /&gt;
** Linking mechanisms&lt;br /&gt;
** Minimum permissions to publish&lt;br /&gt;
* What we don&#039;t like&lt;br /&gt;
** Relying on one source for a document &lt;br /&gt;
** Privacy links for security&lt;br /&gt;
* Proposal &lt;br /&gt;
** Peer-to-peer, distributed mechanisms for documents&lt;br /&gt;
** Reverse links with caching - distributed cache&lt;br /&gt;
** More availability for user - what happens when system fails? &lt;br /&gt;
** Key management to be considered - Is it good to have centralized or distributed mechanism? &lt;br /&gt;
&lt;br /&gt;
=== Group 4 ===&lt;br /&gt;
* An idea of web searching for us &lt;br /&gt;
* A suggestion of how different the web would have been if it had been implemented by &amp;quot;AI&amp;quot; people&lt;br /&gt;
** AI programs searching for data - A notion already being implemented by Google slowly.&lt;br /&gt;
* Generate report forums&lt;br /&gt;
* HTML equivalent is inspired by the AI communication&lt;br /&gt;
* Higher semantics apart from just indexing the data&lt;br /&gt;
** Problem : &amp;quot;How to bridge the semantic gap?&amp;quot;&lt;br /&gt;
** Search for more data patterns&lt;br /&gt;
&lt;br /&gt;
== Group design exercise — The web that could be ==&lt;br /&gt;
&lt;br /&gt;
* “The web that wasn&#039;t” mentioned the moans of librarians.&lt;br /&gt;
* A universal classification system is needed.&lt;br /&gt;
* The training overhead of classifiers (e.g., librarians) is high. See the master&#039;s degree that a librarian would need.&lt;br /&gt;
* More structured content, both classification, and organization&lt;br /&gt;
* Current indexing by crude brute-force searching for words, etc., rather than searching metadata&lt;br /&gt;
* Information doesn&#039;t have the same persistence, see bitrot and Vint Cerf&#039;s talk.&lt;br /&gt;
* Too concerned with presentation now.&lt;br /&gt;
* Tim Berners-Lee bemoaning the death of the semantic web.&lt;br /&gt;
* The problem of information duplication when information gets redistributed across the web. However, we do want redundancy.&lt;br /&gt;
* Too much developed by software developers&lt;br /&gt;
* Too reliant on Google for web structure&lt;br /&gt;
** See search-engine optimization&lt;br /&gt;
* Problem of authentication (of the information, not the presenter)&lt;br /&gt;
** Too dependent at times on the popularity of a site, almost in a sophistic manner.&lt;br /&gt;
** See Reddit&lt;br /&gt;
* How do you programmatically distinguish satire from fact&lt;br /&gt;
* The web&#039;s structure is also “shaped by inbound links but would be nice a bit more”&lt;br /&gt;
* Infrastructure doesn&#039;t need to change per se.&lt;br /&gt;
** The distributed architecture should stay. Centralization of control of allowed information and access is a terrible power. See China and the Middle-East.&lt;br /&gt;
** Information, for the most part, in itself, exists centrally (as per-page), though communities (to use a generic term) are distributed.&lt;br /&gt;
* Need more sophisticated natural language processing.&lt;br /&gt;
&lt;br /&gt;
== Class discussion ==&lt;br /&gt;
&lt;br /&gt;
Focusing on vision, not the mechanism.&lt;br /&gt;
&lt;br /&gt;
* Reverse linking&lt;br /&gt;
* Distributed content distribution (glorified cache)&lt;br /&gt;
** Both for privacy and redundancy reasons&lt;br /&gt;
** Centralized content certification was suggested, but it doesn&#039;t address the problem of root of trust and distributed consistency checking.&lt;br /&gt;
*** Distributed key management is a holy grail&lt;br /&gt;
*** What about detecting large-scale subversion attempts, like in China&lt;br /&gt;
* What is the new revenue model?&lt;br /&gt;
** What was TBL&#039;s revenue model (tongue-in-cheek, none)?&lt;br /&gt;
** Organisations like Google monetized the internet, and this mechanism could destroy their ability to do so.&lt;br /&gt;
* Search work is semi-distributed. Suggested letting the web do the work for you.&lt;br /&gt;
* Trying to structure content in a manner simultaneously palatable to both humans and machines.&lt;br /&gt;
* Using spare CPU time on servers for natural language processing (or other AI) of cached or locally available resources.&lt;br /&gt;
* Imagine a smushed Wolfram Alpha, Google, Wikipedia, and Watson, and then distributed over the net.&lt;br /&gt;
* The document was TBL&#039;s idea of the atom of content, whereas nowadays we really need something more granular.&lt;br /&gt;
* We want to extract higher-level semantics.&lt;br /&gt;
* Google may not be pure keyword search anymore. It is essentially AI now, but we still struggle with expressing what we want to Google.&lt;br /&gt;
* What about the adversarial aspect of content hosts, vying for attention?&lt;br /&gt;
* People do actively try to fool you.&lt;br /&gt;
* Compare to Google News, though that is very specific to that domain. Their vision is a semantic web, but they are incrementally building it.&lt;br /&gt;
* In a scary fashion, Google is one of the central points of failure of the web. Even scarier is less technically competent people who depend on Facebook for that.&lt;br /&gt;
* There is a semantic gap between how we express and query information, and how AI understands it.&lt;br /&gt;
* Can think of Facebook as a distributed human search infrastructure.&lt;br /&gt;
* A core service of an operating system is locating information. &#039;&#039;&#039;Search is infrastructure.&#039;&#039;&#039;&lt;br /&gt;
* The problem is not purely technical. There are political and social aspects.&lt;br /&gt;
** Searching for a file on a local filesystem should have an unambiguous answer.&lt;br /&gt;
** Asking the web is a different thing. “What is the best chocolate bar?”&lt;br /&gt;
* Is the web a network database, as understood in COMP 3005, which we consider harmful?&lt;br /&gt;
* For two-way links, there is the problem of restructuring data and all the dependencies.&lt;br /&gt;
* Privacy issues when tracing paths across the web.&lt;br /&gt;
* What about the problem of information revocation?&lt;br /&gt;
* Need more augmented reality and distributed and micro payment systems.&lt;br /&gt;
* We need distributed, mutually untrusting social networks.&lt;br /&gt;
** Now we have the problem of storage and computation, but we also take away some of the monetizable aspect.&lt;br /&gt;
* Distribution is not free. It is very expensive in very funny ways.&lt;br /&gt;
* The dream of harvesting all the computational power of the internet is not new.&lt;br /&gt;
** Startups have come and gone many times over that problem.&lt;br /&gt;
* Google&#039;s indexers understand many documents on the web quite well. However, Google only &#039;&#039;&#039;presents&#039;&#039;&#039; a primitive keyword-like interface. It doesn&#039;t expose the ontology.&lt;br /&gt;
* Organising information does not necessarily mean applying an ontology to it.&lt;br /&gt;
* The organisational methods we now use don&#039;t use ontologies, but rather are supplemented by them.&lt;br /&gt;
&lt;br /&gt;
Adding a couple of related points Anil mentioned during the discussion:&lt;br /&gt;
Distributed key management is a holy grail; no one has ever managed to get it working. Nowadays, databases have become important building blocks of the distributed operating system. Anil stressed that databases can in fact be considered an OS service these days. The question “How do you navigate the complex information space?” has remained a prominent question that the Web has always faced.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_6&amp;diff=18649</id>
		<title>DistOS 2014W Lecture 6</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_6&amp;diff=18649"/>
		<updated>2014-02-21T21:51:51Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: Replaced previous change. added link to paper and title.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt; &#039;&#039;&#039;the point form notes for this lecture could be turned into full sentences/paragraphs&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===The Early Web (Jan. 23)===&lt;br /&gt;
&lt;br /&gt;
* [https://archive.org/details/02Kahle000673 Berners-Lee et al., &amp;quot;World-Wide Web: The Information Universe&amp;quot; (1992)], pp. 52-58&lt;br /&gt;
* [http://www.youtube.com/watch?v=72nfrhXroo8 Alex Wright, &amp;quot;The Web That Wasn&#039;t&amp;quot; (2007)], Google Tech Talk&lt;br /&gt;
&lt;br /&gt;
== Group Discussion on &amp;quot;The Early Web&amp;quot; ==&lt;br /&gt;
&lt;br /&gt;
Questions to discuss:&lt;br /&gt;
&lt;br /&gt;
# How do you think the web would have turned out if it were not the way it is now? &lt;br /&gt;
# What kind of infrastructure changes would you like to make? &lt;br /&gt;
&lt;br /&gt;
=== Group 1 ===&lt;br /&gt;
: Relatively satisfied with the present structure of the web; some suggested changes are in the areas below: &lt;br /&gt;
* Make use of the greater potential of Protocols &lt;br /&gt;
* More communication and interaction capabilities.&lt;br /&gt;
* Implementation changes in the present payment systems. For example, the use of &amp;quot;micro-computation&amp;quot; - a discussion we will return to in future classes. Also, cryptographic currencies.&lt;br /&gt;
* Augmented reality.&lt;br /&gt;
* More towards individual privacy. &lt;br /&gt;
&lt;br /&gt;
=== Group 2 ===&lt;br /&gt;
==== Problem of unstructured information ====&lt;br /&gt;
A large portion of the web serves content that is overwhelmingly concerned with presentation rather than with structuring content. Tim Berners-Lee himself bemoaned the death of the semantic web. His original vision of it was as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- Code from Wikipedia&#039;s article on the semantic web, except for the block quoting form, which this MediaWiki instance doesn&#039;t seem to support. --&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web&amp;amp;nbsp;– the content, links, and transactions between people and computers. A &amp;quot;Semantic Web&amp;quot;, which makes this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The &amp;quot;intelligent agents&amp;quot; people have touted for ages will finally materialize.&amp;lt;ref&amp;gt;{{cite book |last=Berners-Lee |first=Tim |authorlink=Tim Berners-Lee |coauthors=Fischetti, Mark |title=Weaving the Web |publisher=HarperSanFrancisco |year=1999 |pages=chapter 12 |isbn=978-0-06-251587-2 |nopp=true }}&amp;lt;/ref&amp;gt;&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For this vision to come true, information arguably needs to be structured, maybe even classified. The idea of a universal information classification system has been floated. The modern web, however, is mostly developed by software developers, not librarians and the like.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- TODO: Yahoo blurb. --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Also, how does one differentiate satire from fact?&lt;br /&gt;
&lt;br /&gt;
==== Valuation and deduplication of information ====&lt;br /&gt;
Another problem with the current WWW is the duplication of information. Redundancy is not in itself harmful, as it increases the availability of information, but is ad-hoc duplication of the information itself harmful?&lt;br /&gt;
&lt;br /&gt;
One then comes to the problem of assigning a value to the information found therein. How does one rate information, and according to what criteria? How does one authenticate the information? Often, popularity is used as an indicator of veracity, almost in a sophistic manner. See excessive reliance on Google page ranking or Reddit score for various types of information consumption for research or news consumption respectively.&lt;br /&gt;
&lt;br /&gt;
=== On the current infrastructure ===&lt;br /&gt;
The current &amp;lt;em&amp;gt;internet&amp;lt;/em&amp;gt; infrastructure should remain as is, at least in countries with more than a modicum of freedom of access to information. Centralization of control of access to information is a terrible power. See China and parts of the Middle-East. On that note, what can be said of popular sites, such as Google or Wikipedia, that serve as the main entry point for many access patterns?&lt;br /&gt;
&lt;br /&gt;
The problem, if any, in the current web infrastructure is of the web itself, not the internet.&lt;br /&gt;
&lt;br /&gt;
=== Group 3 ===&lt;br /&gt;
* What we want to keep &lt;br /&gt;
** Linking mechanisms&lt;br /&gt;
** Minimum permissions to publish&lt;br /&gt;
* What we don&#039;t like&lt;br /&gt;
** Relying on one source for a document &lt;br /&gt;
** Privacy links for security&lt;br /&gt;
* Proposal &lt;br /&gt;
** Peer-to-peer, distributed mechanisms for documents&lt;br /&gt;
** Reverse links with caching - distributed cache&lt;br /&gt;
** More availability for user - what happens when system fails? &lt;br /&gt;
** Key management to be considered - Is it good to have centralized or distributed mechanism? &lt;br /&gt;
&lt;br /&gt;
=== Group 4 ===&lt;br /&gt;
* An idea of web searching for us &lt;br /&gt;
* A suggestion of how different the web would have been if it had been implemented by &amp;quot;AI&amp;quot; people&lt;br /&gt;
** AI programs searching for data - A notion already being implemented by Google slowly.&lt;br /&gt;
* Generate report forums&lt;br /&gt;
* HTML equivalent is inspired by the AI communication&lt;br /&gt;
* Higher semantics apart from just indexing the data&lt;br /&gt;
** Problem : &amp;quot;How to bridge the semantic gap?&amp;quot;&lt;br /&gt;
** Search for more data patterns&lt;br /&gt;
&lt;br /&gt;
== Group design exercise — The web that could be ==&lt;br /&gt;
&lt;br /&gt;
* “The web that wasn&#039;t” mentioned the moans of librarians.&lt;br /&gt;
* A universal classification system is needed.&lt;br /&gt;
* The training overhead of classifiers (e.g., librarians) is high. See the master&#039;s degree that a librarian would need.&lt;br /&gt;
* More structured content, both classification, and organization&lt;br /&gt;
* Current indexing by crude brute-force searching for words, etc., rather than searching metadata&lt;br /&gt;
* Information doesn&#039;t have the same persistence, see bitrot and Vint Cerf&#039;s talk.&lt;br /&gt;
* Too concerned with presentation now.&lt;br /&gt;
* Tim Berners-Lee bemoaning the death of the semantic web.&lt;br /&gt;
* The problem of information duplication when information gets redistributed across the web. However, we do want redundancy.&lt;br /&gt;
* Too much developed by software developers&lt;br /&gt;
* Too reliant on Google for web structure&lt;br /&gt;
** See search-engine optimization&lt;br /&gt;
* Problem of authentication (of the information, not the presenter)&lt;br /&gt;
** Too dependent at times on the popularity of a site, almost in a sophistic manner.&lt;br /&gt;
** See Reddit&lt;br /&gt;
* How do you programmatically distinguish satire from fact&lt;br /&gt;
* The web&#039;s structure is also “shaped by inbound links but would be nice a bit more”&lt;br /&gt;
* Infrastructure doesn&#039;t need to change per se.&lt;br /&gt;
** The distributed architecture should stay. Centralization of control of allowed information and access is a terrible power. See China and the Middle-East.&lt;br /&gt;
** Information, for the most part, in itself, exists centrally (as per-page), though communities (to use a generic term) are distributed.&lt;br /&gt;
* Need more sophisticated natural language processing.&lt;br /&gt;
&lt;br /&gt;
== Class discussion ==&lt;br /&gt;
&lt;br /&gt;
Focusing on vision, not the mechanism.&lt;br /&gt;
&lt;br /&gt;
* Reverse linking&lt;br /&gt;
* Distributed content distribution (glorified cache)&lt;br /&gt;
** Both for privacy and redundancy reasons&lt;br /&gt;
** Centralized content certification was suggested, but it doesn&#039;t address the problem of root of trust and distributed consistency checking.&lt;br /&gt;
*** Distributed key management is a holy grail&lt;br /&gt;
*** What about detecting large-scale subversion attempts, like in China&lt;br /&gt;
* What is the new revenue model?&lt;br /&gt;
** What was TBL&#039;s revenue model (tongue-in-cheek, none)?&lt;br /&gt;
** Organisations like Google monetized the internet, and this mechanism could destroy their ability to do so.&lt;br /&gt;
* Search work is semi-distributed. Suggested letting the web do the work for you.&lt;br /&gt;
* Trying to structure content in a manner simultaneously palatable to both humans and machines.&lt;br /&gt;
* Using spare CPU time on servers for natural language processing (or other AI) of cached or locally available resources.&lt;br /&gt;
* Imagine a smushed Wolfram Alpha, Google, Wikipedia, and Watson, and then distributed over the net.&lt;br /&gt;
* The document was TBL&#039;s idea of the atom of content, whereas nowadays we really need something more granular.&lt;br /&gt;
* We want to extract higher-level semantics.&lt;br /&gt;
* Google may not be pure keyword search anymore. It is essentially AI now, but we still struggle with expressing what we want to Google.&lt;br /&gt;
* What about the adversarial aspect of content hosts, vying for attention?&lt;br /&gt;
* People do actively try to fool you.&lt;br /&gt;
* Compare to Google News, though that is very specific to that domain. Their vision is a semantic web, but they are incrementally building it.&lt;br /&gt;
* In a scary fashion, Google is one of the central points of failure of the web. Even scarier is less technically competent people who depend on Facebook for that.&lt;br /&gt;
* There is a semantic gap between how we express and query information, and how AI understands it.&lt;br /&gt;
* Can think of Facebook as a distributed human search infrastructure.&lt;br /&gt;
* A core service of an operating system is locating information. &#039;&#039;&#039;Search is infrastructure.&#039;&#039;&#039;&lt;br /&gt;
* The problem is not purely technical. There are political and social aspects.&lt;br /&gt;
** Searching for a file on a local filesystem should have an unambiguous answer.&lt;br /&gt;
** Asking the web is a different thing. “What is the best chocolate bar?”&lt;br /&gt;
* Is the web a network database, as understood in COMP 3005, which we consider harmful?&lt;br /&gt;
* For two-way links, there is the problem of restructuring data and all the dependencies.&lt;br /&gt;
* Privacy issues when tracing paths across the web.&lt;br /&gt;
* What about the problem of information revocation?&lt;br /&gt;
* Need more augmented reality and distributed and micro payment systems.&lt;br /&gt;
* We need distributed, mutually untrusting social networks.&lt;br /&gt;
** Now we have the problem of storage and computation, but we also take away some of the monetizable aspect.&lt;br /&gt;
* Distribution is not free. It is very expensive in very funny ways.&lt;br /&gt;
* The dream of harvesting all the computational power of the internet is not new.&lt;br /&gt;
** Startups have come and gone many times over that problem.&lt;br /&gt;
* Google&#039;s indexers understand many documents on the web quite well. However, Google only &#039;&#039;&#039;presents&#039;&#039;&#039; a primitive keyword-like interface. It doesn&#039;t expose the ontology.&lt;br /&gt;
* Organising information does not necessarily mean applying an ontology to it.&lt;br /&gt;
* The organisational methods we now use don&#039;t use ontologies, but rather are supplemented by them.&lt;br /&gt;
&lt;br /&gt;
Adding a couple of related points Anil mentioned during the discussion:&lt;br /&gt;
Distributed key management is a holy grail; no one has ever managed to get it working. Nowadays, databases have become important building blocks of the distributed operating system. Anil stressed that databases can in fact be considered an OS service these days. The question “How do you navigate the complex information space?” has remained a prominent question that the Web has always faced.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_6&amp;diff=18648</id>
		<title>DistOS 2014W Lecture 6</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_6&amp;diff=18648"/>
		<updated>2014-02-21T21:51:04Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: added link to paper and title&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt; &#039;&#039;&#039;the point form notes for this lecture could be turned into full sentences/paragraphs&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==UNIX and Plan 9 (Jan. 28)==&lt;br /&gt;
&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/fall2008/unix.pdf Dennis M. Ritchie and Ken Thompson, &amp;quot;The UNIX Time-Sharing System&amp;quot; (1974)]&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/2014w/presotto-plan9.pdf Presotto et al., Plan 9, A Distributed System (1991)]&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/2014w/pike-plan9.pdf Pike et al., Plan 9 from Bell Labs (1995)]&lt;br /&gt;
&lt;br /&gt;
== Group Discussion on &amp;quot;The Early Web&amp;quot; ==&lt;br /&gt;
&lt;br /&gt;
Questions to discuss:&lt;br /&gt;
&lt;br /&gt;
# How do you think the web would have turned out if it were not the way it is now? &lt;br /&gt;
# What kind of infrastructure changes would you like to make? &lt;br /&gt;
&lt;br /&gt;
=== Group 1 ===&lt;br /&gt;
: Relatively satisfied with the present structure of the web; some suggested changes are in the areas below: &lt;br /&gt;
* Make use of the greater potential of Protocols &lt;br /&gt;
* More communication and interaction capabilities.&lt;br /&gt;
* Implementation changes in the present payment systems. For example, the use of &amp;quot;micro-computation&amp;quot; - a discussion we will return to in future classes. Also, cryptographic currencies.&lt;br /&gt;
* Augmented reality.&lt;br /&gt;
* More towards individual privacy. &lt;br /&gt;
&lt;br /&gt;
=== Group 2 ===&lt;br /&gt;
==== Problem of unstructured information ====&lt;br /&gt;
A large portion of the web serves content that is overwhelmingly concerned with presentation rather than with structuring content. Tim Berners-Lee himself bemoaned the death of the semantic web. His original vision of it was as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- Code from Wikipedia&#039;s article on the semantic web, except for the block quoting form, which this MediaWiki instance doesn&#039;t seem to support. --&amp;gt;&lt;br /&gt;
&amp;lt;blockquote&amp;gt;I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web&amp;amp;nbsp;– the content, links, and transactions between people and computers. A &amp;quot;Semantic Web&amp;quot;, which makes this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The &amp;quot;intelligent agents&amp;quot; people have touted for ages will finally materialize.&amp;lt;ref&amp;gt;{{cite book |last=Berners-Lee |first=Tim |authorlink=Tim Berners-Lee |coauthors=Fischetti, Mark |title=Weaving the Web |publisher=HarperSanFrancisco |year=1999 |pages=chapter 12 |isbn=978-0-06-251587-2 |nopp=true }}&amp;lt;/ref&amp;gt;&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For this vision to come true, information arguably needs to be structured, maybe even classified. The idea of a universal information classification system has been floated. The modern web, however, is mostly developed by software developers, not librarians and the like.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- TODO: Yahoo blurb. --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Also, how does one differentiate satire from fact?&lt;br /&gt;
&lt;br /&gt;
==== Valuation and deduplication of information ====&lt;br /&gt;
Another problem with the current WWW is the duplication of information. Redundancy is not in itself harmful, as it increases the availability of information, but is ad-hoc duplication of the information itself harmful?&lt;br /&gt;
&lt;br /&gt;
One then comes to the problem of assigning a value to the information found therein. How does one rate information, and according to what criteria? How does one authenticate the information? Often, popularity is used as an indicator of veracity, almost in a sophistic manner. See excessive reliance on Google page ranking or Reddit score for various types of information consumption for research or news consumption respectively.&lt;br /&gt;
&lt;br /&gt;
=== On the current infrastructure ===&lt;br /&gt;
The current &amp;lt;em&amp;gt;internet&amp;lt;/em&amp;gt; infrastructure should remain as is, at least in countries with more than a modicum of freedom of access to information. Centralization of control of access to information is a terrible power; see China and parts of the Middle East. On that note, what can be said of popular sites, such as Google or Wikipedia, that serve as the main entry point for many access patterns?&lt;br /&gt;
&lt;br /&gt;
The problem, if any, in the current web infrastructure is of the web itself, not the internet.&lt;br /&gt;
&lt;br /&gt;
=== Group 3 ===&lt;br /&gt;
* What we want to keep &lt;br /&gt;
** Linking mechanisms&lt;br /&gt;
** Minimum permissions to publish&lt;br /&gt;
* What we don&#039;t like&lt;br /&gt;
** Relying on one source for a document &lt;br /&gt;
** Privacy links for security&lt;br /&gt;
* Proposal &lt;br /&gt;
** Peer-to-peer, distributed mechanisms for document distribution&lt;br /&gt;
** Reverse links with caching - distributed cache&lt;br /&gt;
** More availability for user - what happens when system fails? &lt;br /&gt;
** Key management to be considered - Is it good to have centralized or distributed mechanism? &lt;br /&gt;
&lt;br /&gt;
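Group 3&#039;s reverse-link proposal can be sketched minimally: given each page&#039;s outgoing links, invert the map so every page knows who links to it (a toy sketch; the page names are hypothetical):

```python
from collections import defaultdict

def build_backlinks(forward):
    """Invert {page: [pages it links to]} into {page: sorted list of linkers}."""
    back = defaultdict(set)
    for src, targets in forward.items():
        for dst in targets:
            back[dst].add(src)
    return {page: sorted(srcs) for page, srcs in back.items()}

# Toy forward-link map (hypothetical pages).
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": [],
}
print(build_backlinks(links))
# {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```

A real system would also need to invalidate these entries as pages change, which is where the distributed-cache part of the proposal comes in.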
=== Group 4 ===&lt;br /&gt;
* The idea of the web searching for us &lt;br /&gt;
* A suggestion of how different the web might have been had it been implemented by the &amp;quot;AI&amp;quot; people&lt;br /&gt;
** AI programs searching for data, a notion Google is already slowly implementing&lt;br /&gt;
* Generate report forums&lt;br /&gt;
* HTML equivalent is inspired by the AI communication&lt;br /&gt;
* Higher semantics apart from just indexing the data&lt;br /&gt;
** Problem : &amp;quot;How to bridge the semantic gap?&amp;quot;&lt;br /&gt;
** Search for more data patterns&lt;br /&gt;
&lt;br /&gt;
== Group design exercise — The web that could be ==&lt;br /&gt;
&lt;br /&gt;
* “The web that wasn&#039;t” mentioned the moans of librarians.&lt;br /&gt;
* A universal classification system is needed.&lt;br /&gt;
* The training overhead of classifiers (e.g., librarians) is high; see the master&#039;s degree a librarian needs.&lt;br /&gt;
* More structured content: both classification and organization&lt;br /&gt;
* Current indexing relies on crude brute-force word search rather than metadata search&lt;br /&gt;
* Information doesn&#039;t have the same persistence, see bitrot and Vint Cerf&#039;s talk.&lt;br /&gt;
* Too concerned with presentation now.&lt;br /&gt;
* Tim Berners-Lee bemoaning the death of the semantic web.&lt;br /&gt;
* The problem of information duplication when information gets redistributed across the web. However, we do want redundancy.&lt;br /&gt;
* Too much developed by software developers&lt;br /&gt;
* Too reliant on Google for web structure&lt;br /&gt;
** See search-engine optimization&lt;br /&gt;
* Problem of authentication (of the information, not the presenter)&lt;br /&gt;
** Too dependent at times on the popularity of a site, almost in a sophistic manner.&lt;br /&gt;
** See Reddit&lt;br /&gt;
* How do you programmatically distinguish satire from fact&lt;br /&gt;
* The web&#039;s structure is also shaped by inbound links, though it would be nice to exploit them a bit more.&lt;br /&gt;
* Infrastructure doesn&#039;t need to change per se.&lt;br /&gt;
** The distributed architecture should still stay. Centralization of control over what information is allowed and who may access it is a terrible power. See China and the Middle East.&lt;br /&gt;
** Information, for the most part, exists centrally (per page), though communities (to use a generic term) are distributed.&lt;br /&gt;
* Need more sophisticated natural language processing.&lt;br /&gt;
&lt;br /&gt;
== Class discussion ==&lt;br /&gt;
&lt;br /&gt;
Focusing on vision, not the mechanism.&lt;br /&gt;
&lt;br /&gt;
* Reverse linking&lt;br /&gt;
* Distributed content distribution (glorified cache)&lt;br /&gt;
** Both for privacy and redundancy reasons&lt;br /&gt;
** Centralized content certification was suggested, but it doesn&#039;t address the problems of root of trust and distributed consistency checking.&lt;br /&gt;
*** Distributed key management is a holy grail&lt;br /&gt;
*** What about detecting large-scale subversion attempts, like in China&lt;br /&gt;
* What is the new revenue model?&lt;br /&gt;
** What was TBL&#039;s revenue model (tongue-in-cheek, none)?&lt;br /&gt;
** Organisations like Google monetized the internet, and this mechanism could destroy their ability to do so.&lt;br /&gt;
* Search work is semi-distributed. Suggested letting the web do the work for you.&lt;br /&gt;
* Trying to structure content in a manner simultaneously palatable to both humans and machines.&lt;br /&gt;
* Using spare CPU time on servers for natural language processing (or other AI) of cached or locally available resources.&lt;br /&gt;
* Imagine a smushed Wolfram Alpha, Google, Wikipedia, and Watson, and then distributed over the net.&lt;br /&gt;
* The document was TBL&#039;s idea of the atom of content, whereas nowadays we really need something more granular.&lt;br /&gt;
* We want to extract higher-level semantics.&lt;br /&gt;
* Google may not be pure keyword search anymore. It is essentially AI now, but we still struggle with expressing what we want to Google.&lt;br /&gt;
* What about the adversarial aspect of content hosts vying for attention?&lt;br /&gt;
* People do actively try to fool you.&lt;br /&gt;
* Compare to Google News, though that is very specific to that domain. Their vision is a semantic web, but they are incrementally building it.&lt;br /&gt;
* In a scary fashion, Google is one of the central points of failure of the web. Even scarier: less technically competent people depend on Facebook for that.&lt;br /&gt;
* There is a semantic gap between how we express and query information, and how AI understands it.&lt;br /&gt;
* Can think of Facebook as a distributed human search infrastructure.&lt;br /&gt;
* A core service of an operating system is locating information. &#039;&#039;&#039;Search is infrastructure.&#039;&#039;&#039;&lt;br /&gt;
* The problem is not purely technical. There are political and social aspects.&lt;br /&gt;
** Searching for a file on a local filesystem should have an unambiguous answer.&lt;br /&gt;
** Asking the web is a different thing. “What is the best chocolate bar?”&lt;br /&gt;
* Is the web a network database, as understood in COMP 3005, which we consider harmful?&lt;br /&gt;
* For two-way links, there is the problem of restructuring data and all the dependencies.&lt;br /&gt;
* Privacy issues when tracing paths across the web.&lt;br /&gt;
* What about the problem of information revocation?&lt;br /&gt;
* We need more augmented reality, plus distributed and micropayment systems.&lt;br /&gt;
* We need distributed, mutually untrusting social networks.&lt;br /&gt;
** Now we have the problem of storage and computation, but we also take away some of the monetizable aspects.&lt;br /&gt;
* Distribution is not free. It is very expensive in very funny ways.&lt;br /&gt;
* The dream of harvesting all the computational power of the internet is not new.&lt;br /&gt;
** Startups have come and gone many times over that problem.&lt;br /&gt;
* Google&#039;s indexers understand many documents on the web quite well. However, Google only &#039;&#039;&#039;presents&#039;&#039;&#039; a primitive keyword-like interface; it doesn&#039;t expose the ontology.&lt;br /&gt;
* Organising information does not necessarily mean applying an ontology to it.&lt;br /&gt;
* The organisational methods we now use don&#039;t use ontologies, but rather are supplemented by them.&lt;br /&gt;
&lt;br /&gt;
A couple of related points Anil mentioned during the discussion:&lt;br /&gt;
Distributed key management is a holy grail that no one has ever managed to get working. Nowadays, databases have become important building blocks of distributed operating systems; Anil stressed that databases can in fact be considered an OS service these days. “How do you navigate the complex information space?” has remained a prominent question that the web has always faced.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_7&amp;diff=18647</id>
		<title>DistOS 2014W Lecture 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_7&amp;diff=18647"/>
		<updated>2014-02-21T21:49:15Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: added link to paper and title&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Project ==&lt;br /&gt;
&lt;br /&gt;
We discussed moving the proposal due date back a week. We also discussed spending the class prior to that date discussing the primary papers people had chosen, in order to provide preliminary feedback. Anil spent some time going through the papers from OSDI 2012 and discussing which ones would make good projects and why.&lt;br /&gt;
&lt;br /&gt;
* Pick a primary paper.&lt;br /&gt;
* Find papers that cite that paper, papers it cites, etc. to collect a body of related work.&lt;br /&gt;
* Don&#039;t just give a history, tell a story!&lt;br /&gt;
* Do not try to summarize papers.&lt;br /&gt;
* Try to identify a pattern, a common ground between the papers.&lt;br /&gt;
&lt;br /&gt;
==UNIX and Plan 9 (Jan. 28)==&lt;br /&gt;
&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/fall2008/unix.pdf Dennis M. Ritchie and Ken Thompson, &amp;quot;The UNIX Time-Sharing System&amp;quot; (1974)]&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/2014w/presotto-plan9.pdf Presotto et. al, Plan 9, A Distributed System (1991)]&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/2014w/pike-plan9.pdf Pike et al., Plan 9 from Bell Labs (1995)]&lt;br /&gt;
&lt;br /&gt;
== Unix and Plan 9 ==&lt;br /&gt;
&lt;br /&gt;
UNIX was built as &amp;quot;a castrated version of Multics&amp;quot;, which was a very complex system. Multics was, arguably, so far ahead of its time that we are only now achieving its ambitions. Unix was much more modest, and therefore much more achievable and successful: just enough infrastructure to avoid reinventing the wheel, just a couple of programmers making something for their own use. Unix was not designed as a product or commercial entity at all; it was licensed out because AT&amp;amp;T was under severe antitrust scrutiny at the time.&lt;br /&gt;
&lt;br /&gt;
They wanted a few simple abstractions, so they made everything a file. Berkeley promptly broke this abstraction by introducing sockets for networking; Sun Microsystems then licensed Berkeley Unix and commercialized it. Plan 9 finally introduced networking using the right abstractions, but it was too late. Arguably, the reason the BSD folks didn&#039;t use the file abstraction was the difference in reliability. Files are generally reliable, and failures with them are catastrophic, so many applications simply didn&#039;t include logic to handle such I/O errors. Networks are much less reliable, and applications have to deal gracefully with timeouts and other errors.&lt;br /&gt;
&lt;br /&gt;
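The everything-is-a-file point can be illustrated with a minimal sketch (Python for brevity): the same read loop drains a regular file and a pipe, two different kernel objects, with no special cases:

```python
import os
import tempfile

def read_all(fd):
    """Drain a file descriptor using the one universal primitive, read(2)."""
    chunks = []
    while True:
        chunk = os.read(fd, 4096)
        if not chunk:  # an empty read means EOF, for files and pipes alike
            return b"".join(chunks)
        chunks.append(chunk)

# 1. A regular file on disk (temporary file, created just for the demo).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"a regular file")
    path = f.name
fd = os.open(path, os.O_RDONLY)
print(read_all(fd))  # b'a regular file'
os.close(fd)
os.unlink(path)

# 2. A pipe: a different kernel object, the exact same calls.
r, w = os.pipe()
os.write(w, b"a pipe")
os.close(w)  # closing the write end produces EOF for the reader
print(read_all(r))  # b'a pipe'
os.close(r)
```

The same loop also works on devices and, on Plan 9 and Linux, on /proc entries, which is exactly the uniformity Bell Labs was after.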
In Anil&#039;s opinion, Plan 9&#039;s use of the file abstraction to represent the network was not a good design idea. File I/O breaking is uncommon, but networks are inherently flaky: loss of connectivity is normal. Representing the network through file-system abstractions does not properly account for that flakiness. Put another way, the network does not have the reliability characteristics of mass storage, and how to deal with that fact while still using the file abstraction was a major question the Plan 9 designers left unanswered. Anil also added that Plan 9 was an elegant attempt at representing everything with the file abstraction, but, as noted above, they were trying too hard. In distributed systems, the best approach is this: if things have different semantics, they should have abstractions and APIs that reflect their characteristics, rather than hiding those characteristics and pretending they behave like something else in pursuit of too much generalization. In Anil&#039;s opinion, another reason Plan 9 was not widely adopted was that it was a bit late to the scene: by the time Plan 9 came out in the 1990s, UNIX systems with networking were already widespread, driven by the success of the Internet.&lt;br /&gt;
&lt;br /&gt;
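A minimal loopback sketch of the point above (toy code, not Plan 9): a network read can simply hang, so the caller must budget for a timeout, an error path that plain file reads rarely force applications to handle:

```python
import socket
import threading
import time

def fetch(addr, timeout=0.5):
    """Read from a network endpoint. Unlike a disk read, the normal failure
    mode is silence, so the caller must plan for timeouts."""
    try:
        with socket.create_connection(addr, timeout=timeout) as s:
            s.settimeout(timeout)
            return s.recv(4096)
    except (socket.timeout, OSError):
        return None  # a plain file API has no natural place for this case

def serve(srv, payload, delay):
    """Accept one connection; optionally stall before replying."""
    conn, _ = srv.accept()
    time.sleep(delay)
    if payload:
        conn.sendall(payload)
    conn.close()

srv = socket.socket()
srv.bind(("127.0.0.1", 0))  # loopback only, ephemeral port
srv.listen(1)
addr = srv.getsockname()

# A prompt server makes the network read look just like a file read...
t = threading.Thread(target=serve, args=(srv, b"data", 0))
t.start()
print(fetch(addr))  # b'data'
t.join()

# ...but a silent peer just hangs, and only the timeout saves us.
t = threading.Thread(target=serve, args=(srv, None, 2))
t.start()
print(fetch(addr))  # None
t.join()
srv.close()
```

This is the asymmetry the Plan 9 file interface papered over: the second case has no analogue in local file I/O.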
Another valuable point Anil made was that for a technology to be adopted and become successful, it should address a niche for which there are no successful incumbents.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Simon&#039;s Notes ==  &lt;br /&gt;
 &#039;&#039;&#039;These notes should be merged with the text above&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
* project proposal&lt;br /&gt;
** We will discuss the primary papers we&#039;ve chosen on Thursday, February 6th&lt;br /&gt;
* possible papers, remember to pick a topic you have some chance of understanding&lt;br /&gt;
** OSDI 2012 &lt;br /&gt;
*** datacenter (filesystems for doing X, heat management, etc...)&lt;br /&gt;
*** web stuff&lt;br /&gt;
*** distributed shared memory&lt;br /&gt;
*** distributed network I/O infrastructure&lt;br /&gt;
*** distributed databases (potentially)&lt;br /&gt;
*** anonymity systems&lt;br /&gt;
** Pick a conference (usenix is pretty systems oriented, maybe Lisa), go through their papers and find something interesting&lt;br /&gt;
** tell a story that connects several papers in the topic you choose&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* UNIX&lt;br /&gt;
** Relation to multics&lt;br /&gt;
*** Multics was a complex system, which was bad because it was slower, less used, etc...&lt;br /&gt;
*** Multics was not for end users, it was designed to support &amp;quot;utility computing&amp;quot; wherein computation was a service to be charged for&lt;br /&gt;
** What?&lt;br /&gt;
*** Just enough infrastructure to run my programs&lt;br /&gt;
*** It was really just supposed to be used by programmers&lt;br /&gt;
*** &amp;quot;By programmers for programmers&amp;quot;&lt;br /&gt;
*** Software and source licensed for a nominal fee&lt;br /&gt;
*** &amp;quot;Everything is a file&amp;quot;&lt;br /&gt;
*** the only difference was between files you could seek on and ones you couldn&#039;t&lt;br /&gt;
*** simple abstractions&lt;br /&gt;
** Networking&lt;br /&gt;
*** Berkeley folks made sockets, not files, which upset the folks at Bell Labs&lt;br /&gt;
*** Networks aren&#039;t exactly like files because they&#039;re unreliable&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Plan 9&lt;br /&gt;
** major ideas&lt;br /&gt;
*** procfs, later adopted by linux&lt;br /&gt;
** summary&lt;br /&gt;
*** a very elegant attempt to follow the philosophy &amp;quot;everything is a file&amp;quot;&lt;br /&gt;
*** trying too hard&lt;br /&gt;
** opinions&lt;br /&gt;
*** things that have different failure modes deserve different APIs&lt;br /&gt;
** niche?&lt;br /&gt;
*** they never found one&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Tangent about programming languages&lt;br /&gt;
** C was for system programming&lt;br /&gt;
** Java was for enterprise programming&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_8&amp;diff=18646</id>
		<title>DistOS 2014W Lecture 8</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2014W_Lecture_8&amp;diff=18646"/>
		<updated>2014-02-21T21:47:03Z</updated>

		<summary type="html">&lt;p&gt;Cdelahou: added link to paper and title&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==NFS and AFS (Jan 30)==&lt;br /&gt;
&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/2008-02-11/sandberg-nfs.pdf Russel Sandberg et al., &amp;quot;Design and Implementation of the Sun Network Filesystem&amp;quot; (1985)]&lt;br /&gt;
* [http://homeostasis.scs.carleton.ca/~soma/distos/2008-02-11/howard-afs.pdf John H. Howard et al., &amp;quot;Scale and Performance in a Distributed File System&amp;quot; (1988)]&lt;br /&gt;
&lt;br /&gt;
==Group 1==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NFS:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
1) per-operation traffic&lt;br /&gt;
&lt;br /&gt;
2) RPC-based&lt;br /&gt;
&lt;br /&gt;
3) unreliable&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;AFS:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
1) designed for 5000 clients&lt;br /&gt;
&lt;br /&gt;
2) high integrity.&lt;br /&gt;
&lt;br /&gt;
==Group 2==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NFS:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
1) designed to share disks over a network, not files&lt;br /&gt;
&lt;br /&gt;
2) more UNIX like&lt;br /&gt;
&lt;br /&gt;
3) portable&lt;br /&gt;
&lt;br /&gt;
4) use UDP&lt;br /&gt;
&lt;br /&gt;
5) it does not minimize network traffic.&lt;br /&gt;
&lt;br /&gt;
6) used the vnode interface&lt;br /&gt;
&lt;br /&gt;
7) did not require much hardware&lt;br /&gt;
&lt;br /&gt;
8) later versions took on features of AFS&lt;br /&gt;
&lt;br /&gt;
9) stateless protocol conflicts with files being stateful by nature.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;AFS:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
1) designed to share files over a network, not disks&lt;br /&gt;
&lt;br /&gt;
2) better scalability&lt;br /&gt;
&lt;br /&gt;
3) better security.&lt;br /&gt;
&lt;br /&gt;
4) minimize network traffic.&lt;br /&gt;
&lt;br /&gt;
5) less UNIX like&lt;br /&gt;
&lt;br /&gt;
6) plugin authentication&lt;br /&gt;
&lt;br /&gt;
7) needs more kernel storage due to complex commands&lt;br /&gt;
&lt;br /&gt;
8) inode concept replaced with fids&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Group 3==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NFS:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
1) cache assumption invalid.&lt;br /&gt;
&lt;br /&gt;
2) no locking&lt;br /&gt;
&lt;br /&gt;
3) bad security&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;AFS:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
1) cache assumption valid&lt;br /&gt;
&lt;br /&gt;
2) locking&lt;br /&gt;
&lt;br /&gt;
3) good security.&lt;br /&gt;
&lt;br /&gt;
==Group 4==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
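The discussion notes below describe RPC as pleasant to program with but ill-suited to flaky networks; a toy stub (hypothetical names, not a real RPC library) makes the hidden failure mode explicit:

```python
class RpcStub:
    """Pretends to be a local object: every method call is really a
    message over a transport, so it can fail for network reasons."""
    def __init__(self, transport):
        self._transport = transport

    def __getattr__(self, name):
        def call(*args):
            return self._transport(name, args)  # may raise ConnectionError
        return call

def flaky_transport(failures):
    """A transport that loses the first `failures` calls, then succeeds."""
    state = {"left": failures}
    def send(method, args):
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("lost reply for " + method)
        return ("ok", method, args)
    return send

# Looks exactly like a local call, but the first attempt fails in a way
# no local function ever would.
server = RpcStub(flaky_transport(failures=1))
try:
    server.getattr("/export/home")
except ConnectionError as e:
    print("caller must handle:", e)
print(server.getattr("/export/home"))  # ('ok', 'getattr', ('/export/home',))
```

NFS papered over this leak by retrying idempotent operations; AFS instead concentrated the interesting failure handling at open and close.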
Additional notes from the class discussion, capturing some of Anil&#039;s observations about NFS and AFS: The reason NFS shares at the file level rather than the block level is that block-level sharing is complicated to implement. NFS uses UDP as its transport protocol because UDP, being stateless, is in line with NFS&#039;s design philosophy of not maintaining state. The security and unreliability issues in NFS are an implication of using RPC. RPC is a nice way to program, but it was not designed for networks, where flakiness is an inherent characteristic; by analogy, from a programming point of view you never expect a function call to fail (to not return) because of a communication error. The AFS designers considered the network a bottleneck and tried to reduce chatter over it through caching. In Anil&#039;s opinion, the &#039;open&#039; and &#039;close&#039; operations in AFS were critical, and &#039;close&#039; assumes an importance comparable to a &#039;commit&#039; operation in a well-designed database system. Anil mentioned that AFS&#039;s security model is interesting in that, rather than using the UNIX access-list implementation, AFS used a single sign-on system based on Kerberos; in his opinion, the cool thing about Kerberos is the idea of using tickets to gain access. Another interesting point: despite having better features than NFS, AFS was not widely adopted, because its administrative mechanism was complex and required highly trained people and several days&#039; effort to set up and maintain.&lt;/div&gt;</summary>
		<author><name>Cdelahou</name></author>
	</entry>
</feed>