Soma-notes - User contributions [en]

DistOS 2014W Lecture 23

2014-04-07T20:16:39Z

Sjoy: /* Survey on Control Plane Frameworks for Software Defined Networking - Sijo */

'''Presentations'''
===Distributed Shared Memory Systems - Mojgan===
* Introduction to DSM systems
* Advantages and Disadvantages
* Classification of DSM systems
* Design considerations
* Examples of DSM systems
** OpenSSI
** Mermaid
** MOSIX
** DDM

===Survey: Fault Tolerance in Distributed File System - Mohammed===
* Abstract
* Introduction
** About fault tolerance in any distributed system. Comparison between different file systems.
** What is more suitable for mobile-based systems?
** Why is satisfying a high degree of fault tolerance one of the main issues for DFSs?
* Replication and fault tolerance
** What are the replica and placement policies? What is synchronization? What is its benefit?
*** Synchronous method
*** Asynchronous method
*** Semi-asynchronous method
* Cache consistency and fault tolerance
** What is the cache? What is its benefit? Cache consistency?
*** Write Once Read Many (WORM)
*** Transactional locking: read and write locks
*** Leasing
* Example DFSs mentioned in the paper
** Google File System (GFS)
** HDFS
** MooseFS
** iRODS
** GlusterFS
** Lustre
** Ceph
** PARADISE for mobile
* Conclusion

===Survey on Control Plane Frameworks for Software Defined Networking - Sijo===
* Introduction
** Traditional Networks - Control Plane and Forwarding Plane
** Software Defined Networking
*** Proposes decoupling the control and forwarding planes into independent layers
*** Network entities or nodes are specialized elements that do the forwarding
*** Control applications work on the logical view of the network provided by the controller, without having to worry about managing state distribution, topology discovery, etc.
* Theme, Argument Outline
** Need for using distributed systems design principles and tools in SDN controller design to achieve scalability and reliability
* Controller Platforms
** Centralized and distributed approaches
** Identify the need to use in controller platforms
** Centralized: started with NOX, followed by Maestro, Beacon, Floodlight, POX and OpenDaylight
** Distributed: ONIX, HyperFlow, YANC, ONOS
** Leverage parallel processing capabilities
* In detail about two systems:
** ONIX
** ONOS
* References

===Metadata management in Distributed File System - Sandarbh===
* What is metadata?
** Defined by the bare-minimum functions of an MDS (Metadata Server)
** Monitor the performance of the DFS so that it can be used further
** Structure of metadata in the paper
* Why is metadata management difficult?
** 50% of file operations are metadata operations
** Size of metadata
** Distribute the load evenly across all MDSs
** Be able to handle thousands of clients
** Be able to handle file/directory permission changes
** Recover data if some MDS goes down
** Be POSIX compliant
** Be able to scale: the addition of a new MDS shouldn't cause ripples
** Contrasting goals: replication vs. consistency; average-case improvements vs. guaranteed performance for each access
* Static sub-tree partitioning
** Advantage: clients know which MDS to contact for a file; prefix caching
** Disadvantage: directory hot-spot formation
* Static hashing-based partitioning
** Hash the filename or file identifier and assign it to an MDS
** Advantage: distributes load evenly; gets rid of hotspots
** Disadvantage
* "Don't ask me where your server is" approach
** Examples: Ceph, GlusterFS, OceanStore, hierarchical Bloom filters, Cassandra
** Responsibilities: replica management, consistency, access control, recovering metadata in case of a crash, talking to each other to handle the load dynamically
* What's not in the slides
** Not focused on replication of metadata
** Semantic-based search
* Structure of the survey
** Conventional metadata systems
** No-metadata approach
** Metadata approach of file systems designed for specific goals: GFS, Haystack, etc.
** Evolution history
** Comparison within category
** Cover the reliability and consistency part
** Summarize learnings with expected trends
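
The two static partitioning schemes in this outline can be contrasted with a minimal sketch. This is an illustration only, not code from the presentation: the server names, the subtree table and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib

# Hypothetical metadata servers and subtree assignment (illustrative only).
MDS_NODES = ["mds0", "mds1", "mds2", "mds3"]
SUBTREES = {"/": "mds2", "/home": "mds0", "/var": "mds1"}


def mds_by_subtree(path, subtrees=SUBTREES):
    """Static sub-tree partitioning: the longest matching directory prefix
    decides which MDS owns the path."""
    matches = [
        prefix for prefix in subtrees
        if path == prefix or path.startswith(prefix.rstrip("/") + "/")
    ]
    return subtrees[max(matches, key=len)]


def mds_by_hash(path, nodes=MDS_NODES):
    """Static hashing-based partitioning: hash the full path name and
    take it modulo the number of servers."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:4], "big") % len(nodes)]
```

With the subtree table, every file under <code>/home</code> lands on the same server, which is how a popular directory becomes a hot spot. With hashing, files in one directory are spread across servers, but a client can no longer infer the server from the directory prefix.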

===Distributed Stream Processing - Ronak Chaudhari===
* About Stream processing
** Data streams
** DBMS vs. stream processing
* Applications
** Monitoring applications
** Military applications
** Financial analysis
** Tracking applications
* Aurora
** Processes incoming streams
** Has its own query algebra
** System model, query model, runtime architecture
** QoS criteria
** SQuAL: the query algebra
** Aurora GUI
** Challenges in distributed operation
* Aurora vs Medusa
* Medusa
** Architecture
** Additions to Aurora: Lookup and Brain
** Failure detection
** Transfer of processing
** System API
** Load management
** High availability
** Benefits
* References

DistOS 2014W Lecture 23

2014-04-07T20:16:17Z

Sjoy: /* Survey on Control Plane Frameworks for Software Defined Networking - Sijo */

DistOS 2014W Lecture 23

2014-04-07T20:14:09Z

Sjoy: /* Survey on Control Plane Frameworks for Software Defined Networking - Sijo */

DistOS 2014W Lecture 12

2014-02-24T22:26:41Z

Sjoy: Additional notes from the class discussion

=Chubby (Feb 13)=
[https://www.usenix.org/legacy/events/osdi06/tech/burrows.html Burrows, The Chubby Lock Service for Loosely-Coupled Distributed Systems (OSDI 2006)]

==Introduction==

[http://en.wikipedia.org/wiki/Distributed_lock_manager#Google.27s_Chubby_lock_service Chubby], developed at Google, was designed to be a coarse-grained locking service for use within loosely coupled distributed systems (i.e., a network consisting of a high number of small machines). The key contribution was the implementation of Chubby (i.e., there were no new algorithms designed/introduced).

Its purpose is to allow clients to synchronize their activities and to agree on basic information about their environment. It is used to varying degrees by other Google projects such as GFS, MapReduce, and BigTable.

By coarse-grained locking, we mean locking resources for extended lengths of time. For example, an elected primary would handle all access to given data for hours or days.

It is basically an ultra-reliable, highly available file system for very small files that is used as a locking service.

Anil: "Once implemented, Chubby abstracts away all the crazy complicated stuff so you can more easily build your distributed system". Chubby is a tool that gives Google devs important guarantees to build on.

==Design==

The funny thing is that Chubby is essentially a filesystem (with files, file permissions, reading/writing, a hierarchical structure, etc.) with a few caveats: any file can act as a reader/writer lock, and only whole-file operations are performed (i.e., the whole file is written or read), as the files are quite small (256K max). The main reason for implementing the distributed lock service (Chubby) as a file system, rather than, say, as a library, was to make the system easier to use.

All the locks are fully advisory, meaning others can "go around" whoever holds the lock to access the resource (for reading and, sometimes, writing). Mandatory locks, by contrast, give completely exclusive access to a resource. Chubby uses advisory locks because, if a client holding a lock runs into a problem, there should be a way to release the lock gracefully rather than requiring the entire system to be brought down or rebooted.

It can be noted that Linux also uses advisory locks, as opposed to Windows, which relies on mandatory locks. This could be a shortcoming of Windows: when system files change, the system often must be rebooted because the locks on in-use files are not broken. With advisory locks, as in Linux, the system need only be rebooted when the kernel is modified/updated.
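
The meaning of "advisory" can be seen with ordinary Unix file locks. The following is a minimal sketch using Python's standard <code>fcntl.flock</code> on a POSIX system; it is not Chubby's API, and the function names are made up for the illustration. A second opener that asks for the lock is refused, but a writer that never asks can still write.

```python
import fcntl
import os
import tempfile


def try_exclusive(fd):
    """Try to take an exclusive advisory lock without blocking."""
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        return False


def advisory_lock_demo():
    tmp_fd, path = tempfile.mkstemp()
    os.close(tmp_fd)
    holder = os.open(path, os.O_RDWR)   # first open file description
    other = os.open(path, os.O_RDWR)    # independent second one
    try:
        holder_locked = try_exclusive(holder)
        other_locked = try_exclusive(other)      # refused while held
        bytes_written = os.write(other, b"x")    # not prevented: advisory
        fcntl.flock(holder, fcntl.LOCK_UN)
        other_locked_later = try_exclusive(other)
        return holder_locked, other_locked, bytes_written, other_locked_later
    finally:
        os.close(holder)
        os.close(other)
        os.unlink(path)
```

The lock only constrains processes that cooperate by asking for it, which is the property the notes attribute to Chubby's locks.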

Chubby also functions as a name server, but only really for functional names/roles, such as the mail server or a GFS server (i.e., Chubby is mainly used as a name server for logical/symbolic names for roles). It is a centralized place that maps names to resources through a unified interface. The name-value mappings in Chubby allow for a consistent, real-time, overall view of the entire system.

As a name server, Chubby provides guarantees not given with DNS (e.g., DNS is subject to a stale cache) as Chubby provides a unified view of the way things are in the system.

Chubby was made coarse-grained for scalability: coarse-grained locks make it possible to build a distributed system, whereas fine-grained locks wouldn't scale well. It can also be noted that fine-grained locks could be implemented on top of the coarse-grained locks. The entire point of Chubby was to give ultra-high availability and integrity.

==Implementation==

* Uses [http://en.wikipedia.org/wiki/Paxos_(computer_science) Paxos], which is an insanely complicated way of solving the distributed consensus problem.
** Given many proposed values, it chooses one to be agreed upon.

* A master Chubby server with 4 slaves (5 servers in total make up a Chubby cell)
** The master and the slaves all hold all the data.
** There is nothing particularly special about the master.
** If the master fails, one slave is elected as the new master.
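
The choice of five servers per cell follows from the majority rule that consensus protocols such as Paxos rely on. The sketch below shows only that arithmetic and a toy vote count; it is not Paxos itself, and the function names and vote format are assumptions made for the illustration.

```python
from collections import Counter


def quorum(cell_size):
    """Smallest number of servers that forms a strict majority."""
    return cell_size // 2 + 1


def tolerated_failures(cell_size):
    """How many servers can fail while a majority is still reachable."""
    return cell_size - quorum(cell_size)


def elect_master(votes, cell_size):
    """votes maps voter name -> candidate name. Returns the candidate
    backed by a majority of the whole cell, or None if there is none."""
    if not votes:
        return None
    candidate, count = Counter(votes.values()).most_common(1)[0]
    return candidate if count >= quorum(cell_size) else None
```

For a five-server cell the quorum is three, so the cell keeps working with up to two servers down.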

==Use cases==

==Discussion==

Where else do we see things such as Chubby? Where would you want this consistent, overall view?

You would want this consistent view in any sort of synchronized set of files across a set of systems, such as Dropbox. The main tenets of Chubby's design would hold wherever you want to make sure there is an online consensus. It should be noted that this is not like version control: with version control, everyone has their own copy, and the copies are merged later. In this type of system, however, there is only one version available throughout the distributed system. Chubby's design would differ from Dropbox in that Dropbox is designed so that you can work offline and then synchronize your changes once you are online again (i.e., there can sometimes be more than one version of a file, meaning you lack the consistent, overall view given by Chubby).

In Anil's opinion we can think of Chubby as an example of bootstrapping, based on the idea of building one good thing to meet your needs rather than adding mechanisms to existing systems. It is nice to have consistency in the world of distributed systems, but it comes at a cost. "Are you willing to pay for it?" is one of the main questions that distributed system designers and users often encounter. In Anil's view Chubby brings this cost down a bit. Anil mentioned that one of the main ideas of the Distributed Operating Systems course is to understand why you need different algorithms/mechanisms to build a distributed system, rather than looking at the internals of each algorithm in depth.

DistOS 2014W Lecture 12

2014-02-24T22:22:36Z

Sjoy: additional notes from the lecture

DistOS 2014W Lecture 8

2014-02-08T20:48:10Z

Sjoy: formatting

==Group 1==

'''NFS:'''

1) per operating traffic

2) rpc based

3) unreliable

'''AFS:'''

1) designed for 5000 clients

2) high integrity.

==Group 2==

'''NFS:'''

1) designed to share disks over a network, not files

2) more UNIX like

3) portable

4) uses UDP

5) does not minimize network traffic.

6) uses vnodes

7) does not require much hardware equipment

8) later versions took on features of AFS

9) stateless protocol conflicts with files being stateful by nature.

'''AFS:'''

1) designed to share files over a network, not disks

2) better scalability

3) better security.

4) minimize network traffic.

5) less UNIX like

6) plugin authentication

7) needs more kernel storage due to complex commands

8) inode concept replaced with fid

==Group 3==

'''NFS:'''

1) cache assumption invalid.

2) no locking

3) bad security

'''AFS:'''

1) cache assumption valid

2) locking

3) good security.

==Group 4==

----
Additional notes from the class discussion, capturing some of Anil's observations about NFS and AFS:

NFS shares at the file level rather than at the block level because block-level sharing is complicated to implement. NFS uses UDP as its transport protocol because UDP, being stateless, is in line with the NFS design philosophy of not maintaining state information. The security and unreliability issues in NFS are a consequence of using RPC. RPC is a nice programming model, but it was not designed for networks, where flakiness is an inherent characteristic: from a programming point of view, you never expect a function call to fail (that is, never return) because of a communication error.

The AFS designers considered the network a bottleneck and tried to reduce the amount of chatter over the network by using caching. In Anil's opinion the 'open' and 'close' operations in AFS are critical, and 'close' is as important as a 'commit' operation in a well-designed database system. Anil mentioned that the security model of AFS is interesting in that, rather than going for the UNIX access-list-based implementation, AFS used a single sign-on system based on Kerberos. In Anil's opinion the cool thing about Kerberos is the idea of using tickets to get access.

Another interesting fact mentioned was that, despite having better features than NFS, AFS did not get widely adopted. The reason was that the administrative mechanism for AFS was complex: it required highly trained people and quite a number of days of effort to set up and maintain.
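
The comparison of 'close' with a database 'commit' can be illustrated with a toy model of whole-file caching. This is a sketch under simplifying assumptions, not AFS code: the server is a plain dictionary, there are no callbacks or concurrent clients, and the class names are invented for the example.

```python
class FileServer:
    """Stand-in for a file server: a dict of whole files plus counters
    for how often the network would have been used."""

    def __init__(self):
        self.files = {}
        self.fetches = 0
        self.stores = 0

    def fetch(self, name):
        self.fetches += 1
        return self.files.get(name, b"")

    def store(self, name, data):
        self.stores += 1
        self.files[name] = data


class CachedFile:
    """Client-side handle: open fetches the whole file, writes stay in
    the local cache, and close ships the result back to the server."""

    def __init__(self, server, name):
        self.server = server
        self.name = name
        self.data = bytearray(server.fetch(name))  # 'open'
        self.dirty = False

    def write(self, chunk):
        self.data.extend(chunk)  # local only, no network chatter
        self.dirty = True

    def close(self):
        if self.dirty:
            self.server.store(self.name, bytes(self.data))  # the 'commit'
            self.dirty = False
```

However many writes happen in between, the server is contacted once at open and once at close, and other clients see nothing until the close.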

DistOS 2014W Lecture 10

2014-02-08T20:47:01Z

Sjoy: formatting

==Context==

== GFS ==

* Very different because of the workload that it is designed for.
** Because of the number of small files that have to be indexed for the web, etc., it is no longer practical to have a filesystem that stores these individually: too much overhead. GFS punts the problem to userspace, including record delimitation.
* Doesn't care about latency, which is surprising considering it's Google, the people who pushed to change the TCP initial window (IW) standard recommendations for the sake of latency.
* Mostly seeking through entire file.
* Paper from 2003, mentions still using 100BASE-T links.
* Data-heavy, metadata light. Contacting the metadata server is a rare event.
* Really good that they designed for unreliable hardware:
** All the replication
** Data checksumming
* Performance degrades for small random-access workloads; use another filesystem for those.
* Path of least resistance to scale, not to do something super CS-smart.
* Google used to re-index every month, swapping out indexes. Now, it's much more online. GFS is now just a layer to support a more dynamic layer.
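
The "data checksumming" point above can be sketched as follows. The GFS paper describes chunks being split into 64 KB blocks, each with a 32-bit checksum; using CRC32 via <code>zlib</code>, and the function names below, are assumptions made for this illustration rather than details from the notes.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks, as described in the GFS paper


def block_checksums(data, block_size=BLOCK_SIZE):
    """One 32-bit checksum per fixed-size block of the data."""
    return [
        zlib.crc32(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


def corrupt_blocks(data, expected, block_size=BLOCK_SIZE):
    """Indices of blocks whose checksum no longer matches."""
    actual = block_checksums(data, block_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


def first_valid_replica(replicas, expected):
    """Return the first replica whose blocks all verify, else None."""
    for replica in replicas:
        if block_checksums(replica) == expected:
            return replica
    return None
```

Checksums detect corruption on unreliable disks, and replication supplies another copy to read from when a block fails verification.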

=== Segue on drives ===

* Structure of GFS does match some other modern systems:
** Hard drives are like parallel tapes, very suited for streaming.
** Flash devices are log-structured too, but have abstracting firmware. You want to do erasure in bulk, in the '''background'''. We used to need specialized filesystems for MTDs to get better performance, though now some embedded systems have better microcontrollers to abstract away the hardware.
* Architectures that start big often end up in the smallest things.

== How other filesystems compare to GFS and Ceph ==

* Other File Systems: AFS, NFS, Plan 9, traditional Unix

* Data and metadata are held together.
** They did not optimize for different access patterns:
*** Data → big, long transfers
*** Metadata → small, low latency
** Can't scale separately

* Designed for lower latency

* (Mostly) designed for POSIX semantics
** how the requirements that led to the ‘standard’ evolved

* Assumed that a file is a fraction of the size of a server
** e.g., files on a Unix system were meant to be text files.
** Huge files spread over many servers were not even in the cards for NFS
** Meant for small problems, not web-scale
*** Google has a copy of the publicly accessible internet
**** Their strategy is to copy the internet to index it
**** Insane → insane filesystem
**** One file may span multiple servers

* Even mainframes, scale-up solutions, and ultra-reliable systems with data sets bigger than RAM don't have the scale of GFS or Ceph.

* Point-to-point access; much less load-balancing, even in AFS
** One server to service multiple clients.
** Single point of entry, single point of failure, bottleneck

* Less focus on fault tolerance
** No notion of data replication.

* Reliability was a property of the host, not the network

==Ceph==

<ul>
<li>Ceph is crazy and tries to do everything</li>
<li>Unlike GFS, distributes metadata, not just for read-only copies</li>
<li>Unlike GFS, the OSDs have some intelligence, and autonomously distribute the data, rather than being controlled by a master.
<ul>
<li>Uses hashing in the distribution process to '''uniformly''' distribute data</li>
<li>The actual algorithm for distributing data is as follows:
<math>\text{file} + \text{offset} \rightarrow \text{hash}(\text{object ID}) \rightarrow \text{CRUSH}(\text{placement group}) \rightarrow \text{OSD}</math></li>
<li>Each client has knowledge of the entire storage network.</li>
<li>Tracks failure groups (same breaker, switch, etc.), hot data, etc.</li>
<li>Number of replicas is changeable on the fly, but the placement group is not
<ul>
<li>For example, if every client on the planet is accessing the same file, you can scale out for that data.</li></ul>
</li>
<li>You don't ask where to go, you just go, which makes this very scalable</li></ul>
</li>
<li>CRUSH is sufficiently advanced to be called magic.
<ul>
<li><math>O(\log n)</math> in the number of storage devices</li>
<li>CPUs are stupidly fast, so the above is of minimal overhead, whereas the network, despite being fast, has latency, etc. Computation scales much better than communication.</li></ul>
</li>
<li>Storage is composed of variable-length atoms</li></ul>
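
The placement pipeline listed above (file and offset, then object ID, then placement group, then OSDs) can be sketched in a few lines. This is not CRUSH: the last step uses rendezvous (highest-random-weight) hashing as a much simpler stand-in, and the 4 MB object size, the number of placement groups and all function names are assumptions made for the illustration.

```python
import hashlib


def stable_hash(*parts):
    """Deterministic 64-bit hash of the given parts."""
    text = "/".join(str(p) for p in parts)
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def object_id(inode, offset, object_size=4 * 1024 * 1024):
    """file + offset -> object ID (object_size is a hypothetical value)."""
    return f"{inode}.{offset // object_size}"


def placement_group(oid, num_pgs):
    """hash(object ID) -> placement group."""
    return stable_hash(oid) % num_pgs


def place(pg, osds, replicas=3):
    """Placement group -> ordered list of OSDs, using rendezvous hashing
    in place of CRUSH: rank every OSD by hash(pg, osd), keep the top ones."""
    ranked = sorted(osds, key=lambda osd: stable_hash(pg, osd), reverse=True)
    return ranked[:replicas]


def locate(inode, offset, osds, num_pgs=128, replicas=3):
    """Any client can compute the location without asking a server."""
    pg = placement_group(object_id(inode, offset), num_pgs)
    return place(pg, osds, replicas)
```

Every client that knows the list of OSDs computes the same answer locally, which is the "you don't ask where to go, you just go" property. Asking for more replicas extends the same ranked list, so the existing replicas and the placement group stay where they are.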

----
Additional notes from the class discussion: In Anil’s opinion, “how does the file system size compare to the server storage size?” is a key parameter that distinguishes the GFS and Ceph designs from the early file systems NFS, AFS and Plan 9. In the early file system designs, the file system size was a fraction of the server storage size, whereas in GFS and Ceph the file system can be many times larger than the storage of a single server. One key aspect of the Ceph design is the attempt to replace communication with computation by using the hashing-based mechanism CRUSH. The following line from Anil epitomizes the general approach followed in the field of Computer Science: “If one abstraction does not work stick another one in”.

DistOS 2014W Lecture 10

2014-02-08T20:46:33Z

Sjoy: formatting

DistOS 2014W Lecture 10

2014-02-08T20:45:07Z

Sjoy: formatting change

==Context==

== GFS ==

* Very different because of the workload that it is designed for.
** Because of the number of small files that have to be indexed for the web, etc., it is no longer practical to have a filesystem that stores these individually. Too much overhead. Punts the problem to userspace, incl. record delimitation.
* Doesn't care about latency, which is surprising considering it's Google, the people who pushed to change the TCP initial window (IW) standard recommendations for the sake of latency.
* Mostly reading sequentially through entire files.
* Paper from 2003, mentions still using 100BASE-T links.
* Data-heavy, metadata light. Contacting the metadata server is a rare event.
* Really good that they designed for unreliable hardware:
** All the replication
** Data checksumming
* Performance degrades for small random-access workloads; use another filesystem for those.
* Path of least resistance to scale, not to do something super CS-smart.
* Google used to re-index every month, swapping out indexes. Now, it's much more online. GFS is now just a layer to support a more dynamic layer.
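The data checksumming mentioned above can be illustrated with a small sketch. The GFS paper describes chunks split into 64 KB blocks, each with a 32-bit checksum; the use of CRC32 here and the function names <code>block_checksums</code> and <code>corrupted_blocks</code> are assumptions made for the example, not details taken from GFS itself.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks, as described in the GFS paper


def block_checksums(chunk, block_size=BLOCK_SIZE):
    """Return one 32-bit checksum per block of the chunk (CRC32 is an assumption)."""
    return [
        zlib.crc32(chunk[start:start + block_size])
        for start in range(0, len(chunk), block_size)
    ]


def corrupted_blocks(chunk, expected, block_size=BLOCK_SIZE):
    """Return the indexes of blocks whose checksum no longer matches."""
    actual = block_checksums(chunk, block_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
```

A server holding a replica could recompute the checksums on every read and, on a mismatch, report the damaged block so that it is re-fetched from another replica. Only the damaged block is flagged, which is why checksumming per block rather than per chunk is attractive.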

=== Segue on drives ===

* Structure of GFS does match some other modern systems:
** Hard drives are like parallel tapes, very suited for streaming.
** Flash devices are log-structured too, but have an abstracting firmware. You want to do erasure in bulk, in the '''background'''. Used to be we needed specialized FS for MTDs to get better performance; though now we have better microcontrollers in some embedded systems to abstract away the hardware.
* Architectures that start big, often end up in the smallest things.

== How other filesystems compare to GFS and Ceph ==

* Other File Systems: AFS, NFS, Plan 9, traditional Unix

* Data and metadata are held together.
** They did not optimize for different access patterns:
*** Data → big, long transfers
*** Metadata → small, low latency
** Can't scale separately

* Designed for lower latency

* (Mostly) designed for POSIX semantics
** how the requirements that led to the ‘standard’ evolved

* Assumed that a file is a fraction of the size of a server
** eg. files on a Unix system were meant to be text files.
** Huge files spread over many servers not even in the cards for NFS
** Meant for small problems, not web-scale
*** Google has a copy of the publicly accessible internet
**** Their strategy is to copy the internet to index it
**** Insane → insane filesystem
**** One file may span multiple servers

* Even mainframes, scale-up solutions, ultra-reliable systems, with data sets bigger than RAM don't have the scale of GFS or Ceph.

* Point-to-point access; much less load-balancing, even in AFS
** One server to service multiple clients.
** Single point of entry, single point of failure, bottleneck

* Less focus on fault tolerance
** No notion of data replication.

* Reliability was a property of the host, not the network

==Ceph==

<ul>
<li>Ceph is crazy and tries to do everything</li>
<li>Unlike GFS, distributes metadata, not just for read-only copies</li>
<li>Unlike GFS, the OSDs have some intelligence, and autonomously distribute the data, rather than being controlled by a master.
<ul>
<li>Uses hashing in the distribution process to '''uniformly''' distribute data</li>
<li>The actual algorithm for distributing data is as follows:
<math>\text{file} + \text{offset} \rightarrow \text{hash}(\text{object ID}) \rightarrow \text{CRUSH}(\text{placement group}) \rightarrow \text{OSD}</math></li>
<li>Each client has knowledge of the entire storage network.</li>
<li>Tracks failure groups (same breaker, switch, etc.), hot data, etc.</li>
<li>Number of replicas is changeable on the fly, but the placement group is not
<ul>
<li>For example, if every client on the planet is accessing the same file, you can scale out for that data.</li></ul>
</li>
<li>You don't ask where to go, you just go, which makes this very scalable</li></ul>
</li>
<li>CRUSH is sufficiently advanced to be called magic.
<ul>
<li>Placement is computed in <math>O(\log n)</math> time, where <math>n</math> is the number of storage devices, not the size of the data</li>
<li>CPUs are stupidly fast, so the above adds minimal overhead, whereas the network, despite being fast, has latency, etc. Computation scales much better than communication.</li></ul>
</li>
<li>Storage is composed of variable-length atoms</li></ul>
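The placement pipeline described above (file + offset, then object ID, then placement group, then OSDs) can be sketched in a few lines. This is only an illustration under loud assumptions: the real CRUSH algorithm walks a weighted hierarchy of failure domains, whereas the sketch below substitutes plain rendezvous (highest-random-weight) hashing. The names <code>locate</code> and <code>place</code>, the 4 MiB object size and the placement group count are all made up for the example.

```python
import hashlib


def stable_hash(text):
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def object_id(file_id, offset, object_size):
    """Name the object that holds the given byte offset of the file."""
    return f"{file_id}.{offset // object_size}"


def placement_group(oid, pg_count):
    """Hash the object ID into one of a fixed number of placement groups."""
    return stable_hash(oid) % pg_count


def place(pg, osds, replicas):
    """Stand-in for CRUSH: rank OSDs by a per-(pg, osd) hash, keep the top ones."""
    ranked = sorted(osds, key=lambda osd: stable_hash(f"{pg}:{osd}"), reverse=True)
    return ranked[:replicas]


def locate(file_id, offset, osds, object_size=4 * 1024 * 1024, pg_count=128, replicas=3):
    """Compute which OSDs hold the data, using only local knowledge of the OSD list."""
    oid = object_id(file_id, offset, object_size)
    return place(placement_group(oid, pg_count), osds, replicas)
```

What the sketch does show is the property stressed in the notes: any client that knows the list of OSDs computes the same answer locally, so nobody has to ask a master where the data lives. Failure groups, device weights and hot-data handling are deliberately left out.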

=Additional notes from the class discussion=

In Anil’s opinion, “how does the file system size compare to the server storage size?” is a key parameter that distinguishes the GFS and Ceph designs from the earlier file systems NFS, AFS and Plan 9. In the early file system designs, the file system was a fraction of a server's storage, whereas in GFS and Ceph the file system can be many times larger than the storage of any single server. One key aspect of the Ceph design is the attempt to replace communication with computation by using the hash-based mechanism CRUSH. The following line from Anil epitomizes the general approach followed in computer science: “If one abstraction does not work, stick another one in”.

DistOS 2014W Lecture 8

2014-02-08T20:43:31Z

Sjoy: formatting change

==Group 1==

'''NFS:'''

1) per operating traffic

2) rpc based

3) unreliable

'''AFS:'''

1) designed for 5000 clients

2) high integrity.

==Group 2==

'''NFS:'''

1) designed to share disks over a network, not files

2) more UNIX like

3) portable

4) uses UDP

5) does not try to minimize network traffic.

6) uses vnodes

7) does not need much hardware

8) later versions took on features of AFS

9) stateless protocol conflicts with files being stateful by nature.

'''AFS:'''

1) designed to share files over a network, not disks

2) better scalability

3) better security.

4) minimize network traffic.

5) less UNIX like

6) plugin authentication

7) needs more kernel storage due to complex commands

8) inode concept replaced with fid

==Group 3==

'''NFS:'''

1) cache assumption invalid.

2) no locking

3) bad security

'''AFS:'''

1) cache assumption valid

2) locking

3) good security.

==Group 4==

==Additional notes from the class discussion==

Capturing some of Anil's observations about NFS and AFS: NFS shares at the file level rather than the block level because block-level sharing is complicated to implement. NFS uses UDP as its transport protocol because UDP, being stateless, is in line with the NFS design philosophy of not maintaining state information. The security and reliability problems in NFS follow from its use of RPC. RPC is a nice programming model, but it was not designed for networks, where flakiness is inherent: from a programming point of view, you never expect a function call to fail (not return) because of a communication error. The AFS designers considered the network a bottleneck and used caching to reduce the amount of chatter over it. In Anil's opinion the 'open' and 'close' operations in AFS are critical, and 'close' is as important as a 'commit' operation in a well-designed database system. Anil mentioned that the AFS security model is interesting in that, rather than going for the UNIX access-list-based implementation, AFS used a single sign-on system based on Kerberos. In Anil's opinion the cool thing about Kerberos is the idea of using tickets to get access. Another interesting point was that, despite having better features than NFS, AFS was not widely adopted. The reason was that administering AFS was complex: it required highly skilled people and a number of days of effort to set up and maintain.
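The point about RPC hiding network flakiness can be made concrete with a small sketch. Everything here is hypothetical and simulated: <code>make_flaky_transport</code> stands in for a network that loses replies, and <code>call_with_retry</code> and <code>RemoteCallError</code> are illustrative names, not part of NFS or of any RPC library.

```python
class RemoteCallError(Exception):
    """No reply arrived; the request may or may not have been executed."""


def call_with_retry(send, request, attempts=3):
    """Call a remote procedure, retrying when the reply times out."""
    last_error = None
    for _ in range(attempts):
        try:
            return send(request)
        except TimeoutError as exc:
            last_error = exc  # a local function call has no equivalent of this case
    raise RemoteCallError(f"no reply after {attempts} attempts") from last_error


def make_flaky_transport(lost_replies):
    """Simulated transport: the server does the work, but the first replies are lost."""
    state = {"lost": lost_replies, "executed": 0}

    def send(request):
        state["executed"] += 1  # the server executed the request
        if state["lost"] > 0:
            state["lost"] -= 1
            raise TimeoutError("reply lost in the network")
        return {"ok": True, "request": request}

    return send, state
```

With two lost replies, a caller eventually gets an answer, but the server has executed the request three times. Blind retries like this are only safe when operations are idempotent, which is one reason a stateless design and retransmission over UDP fit together in NFS.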

DistOS 2014W Lecture 8

2014-02-08T20:42:02Z

Sjoy: formatting change

==Group 1==

'''NFS:'''

1) per operating traffic

2) rpc based

3) unreliable

'''AFS:'''

1) designed for 5000 clients

2) high integrity.

==Group 2==

'''NFS:'''

1) designed to share disks over a network, not files

2) more UNIX like

3) portable

4) uses UDP

5) does not try to minimize network traffic.

6) uses vnodes

7) does not need much hardware

8) later versions took on features of AFS

9) stateless protocol conflicts with files being stateful by nature.

'''AFS:'''

1) designed to share files over a network, not disks

2) better scalability

3) better security.

4) minimize network traffic.

5) less UNIX like

6) plugin authentication

7) needs more kernel storage due to complex commands

8) inode concept replaced with fid

==Group 3==

'''NFS:'''

1) cache assumption invalid.

2) no locking

3) bad security

'''AFS:'''

1) cache assumption valid

2) locking

3) good security.

==Group 4==

Additional notes from the class discussion:

Capturing some of Anil's observations about NFS and AFS: NFS shares at the file level rather than the block level because block-level sharing is complicated to implement. NFS uses UDP as its transport protocol because UDP, being stateless, is in line with the NFS design philosophy of not maintaining state information. The security and reliability problems in NFS follow from its use of RPC. RPC is a nice programming model, but it was not designed for networks, where flakiness is inherent: from a programming point of view, you never expect a function call to fail (not return) because of a communication error. The AFS designers considered the network a bottleneck and used caching to reduce the amount of chatter over it. In Anil's opinion the 'open' and 'close' operations in AFS are critical, and 'close' is as important as a 'commit' operation in a well-designed database system. Anil mentioned that the AFS security model is interesting in that, rather than going for the UNIX access-list-based implementation, AFS used a single sign-on system based on Kerberos. In Anil's opinion the cool thing about Kerberos is the idea of using tickets to get access. Another interesting point was that, despite having better features than NFS, AFS was not widely adopted. The reason was that administering AFS was complex: it required highly skilled people and a number of days of effort to set up and maintain.

DistOS 2014W Lecture 8

2014-02-08T20:40:58Z

Sjoy: formatting change

==Group 1==

'''NFS:'''

1) per operating traffic

2) rpc based

3) unreliable

'''AFS:'''

1) designed for 5000 clients

2) high integrity.

==Group 2==

'''NFS:'''

1) designed to share disks over a network, not files

2) more UNIX like

3) portable

4) uses UDP

5) does not try to minimize network traffic.

6) uses vnodes

7) does not need much hardware

8) later versions took on features of AFS

9) stateless protocol conflicts with files being stateful by nature.

'''AFS:'''

1) designed to share files over a network, not disks

2) better scalability

3) better security.

4) minimize network traffic.

5) less UNIX like

6) plugin authentication

7) needs more kernel storage due to complex commands

8) inode concept replaced with fid

==Group 3==

'''NFS:'''

1) cache assumption invalid.

2) no locking

3) bad security

'''AFS:'''

1) cache assumption valid

2) locking

3) good security.

==Group 4==

==Additional notes from the class discussion==

Capturing some of Anil's observations about NFS and AFS: NFS shares at the file level rather than the block level because block-level sharing is complicated to implement. NFS uses UDP as its transport protocol because UDP, being stateless, is in line with the NFS design philosophy of not maintaining state information. The security and reliability problems in NFS follow from its use of RPC. RPC is a nice programming model, but it was not designed for networks, where flakiness is inherent: from a programming point of view, you never expect a function call to fail (not return) because of a communication error. The AFS designers considered the network a bottleneck and used caching to reduce the amount of chatter over it. In Anil's opinion the 'open' and 'close' operations in AFS are critical, and 'close' is as important as a 'commit' operation in a well-designed database system. Anil mentioned that the AFS security model is interesting in that, rather than going for the UNIX access-list-based implementation, AFS used a single sign-on system based on Kerberos. In Anil's opinion the cool thing about Kerberos is the idea of using tickets to get access. Another interesting point was that, despite having better features than NFS, AFS was not widely adopted. The reason was that administering AFS was complex: it required highly skilled people and a number of days of effort to set up and maintain.

DistOS 2014W Lecture 10

2014-02-06T21:02:14Z

Sjoy: Additional notes from the class discussion

==Context==

== GFS ==

* Very different because of the workload that it is designed for.
** Because of the number of small files that have to be indexed for the web, etc., it is no longer practical to have a filesystem that stores these individually. Too much overhead. Punts the problem to userspace, incl. record delimitation.
* Doesn't care about latency, which is surprising considering it's Google, the people who pushed to change the TCP initial window (IW) standard recommendations for the sake of latency.
* Mostly reading sequentially through entire files.
* Paper from 2003, mentions still using 100BASE-T links.
* Data-heavy, metadata light. Contacting the metadata server is a rare event.
* Really good that they designed for unreliable hardware:
** All the replication
** Data checksumming
* Performance degrades for small random-access workloads; use another filesystem for those.
* Path of least resistance to scale, not to do something super CS-smart.
* Google used to re-index every month, swapping out indexes. Now, it's much more online. GFS is now just a layer to support a more dynamic layer.

=== Segue on drives ===

* Structure of GFS does match some other modern systems:
** Hard drives are like parallel tapes, very suited for streaming.
** Flash devices are log-structured too, but have an abstracting firmware. You want to do erasure in bulk, in the '''background'''. Used to be we needed specialized FS for MTDs to get better performance; though now we have better microcontrollers in some embedded systems to abstract away the hardware.
* Architectures that start big, often end up in the smallest things.

== How other filesystems compare to GFS and Ceph ==

* Data and metadata are held together.
** Doesn't account for different access patterns:
*** Data → big, long transfers
*** Metadata → small, low latency
** Can't scale separately
* By design, a file is a fraction of the size of a server
** Huge files spread over many servers not even in the cards for NFS
** Meant for small problems, not web-scale
*** Google has a copy of the publicly accessible internet
**** Their strategy is to copy the internet to index it
**** Insane → insane filesystem
* Designed for lower latency
* Designed for POSIX semantics; how the requirements that lead to the ‘standard’ evolved
* Even mainframes, scale-up solutions, ultra-reliable systems, with data sets bigger than RAM don't have this scale.
* Reliability was a property of the host, not the network
* Point-to-point access; much less load-balancing, even in AFS
** Single point of entry, single point of failure, bottleneck
* No notion of data replication.

==Ceph==

<ul>
<li>Ceph is crazy and tries to do everything</li>
<li>Unlike GFS, distributes metadata, not just for read-only copies</li>
<li>Unlike GFS, the OSDs have some intelligence, and autonomously distribute the data, rather than being controlled by a master.
<ul>
<li>Uses hashing in the distribution process to '''uniformly''' distribute data</li>
<li>The actual algorithm for distributing data is as follows:
<math>\text{file} + \text{offset} \rightarrow \text{hash}(\text{object ID}) \rightarrow \text{CRUSH}(\text{placement group}) \rightarrow \text{OSD}</math></li>
<li>Each client has knowledge of the entire storage network.</li>
<li>Tracks failure groups (same breaker, switch, etc.), hot data, etc.</li>
<li>Number of replicas is changeable on the fly, but the placement group is not
<ul>
<li>For example, if every client on the planet is accessing the same file, you can scale out for that data.</li></ul>
</li>
<li>You don't ask where to go, you just go, which makes this very scalable</li></ul>
</li>
<li>CRUSH is sufficiently advanced to be called magic.
<ul>
<li>Placement is computed in <math>O(\log n)</math> time, where <math>n</math> is the number of storage devices, not the size of the data</li>
<li>CPUs are stupidly fast, so the above adds minimal overhead, whereas the network, despite being fast, has latency, etc. Computation scales much better than communication.</li></ul>
</li>
<li>Storage is composed of variable-length atoms</li></ul>

==Additional notes from the class discussion==

In Anil’s opinion, “how does the file system size compare to the server storage size?” is a key parameter that distinguishes the GFS and Ceph designs from the earlier file systems NFS, AFS and Plan 9. In the early file system designs, the file system was a fraction of a server's storage, whereas in GFS and Ceph the file system can be many times larger than the storage of any single server. One key aspect of the Ceph design is the attempt to replace communication with computation by using the hash-based mechanism CRUSH. The following line from Anil epitomizes the general approach followed in computer science: “If one abstraction does not work, stick another one in”.

DistOS 2014W Lecture 8

2014-02-06T20:40:38Z

Sjoy: formatting change

==Group 1==

'''NFS:'''

1) per operating traffic

2) rpc based

3) unreliable

'''AFS:'''

1) designed for 5000 clients

2) high integrity.

==Group 2==

'''NFS:'''

1) designed to share disks over a network, not files

2) more UNIX like

3) portable

4) uses UDP

5) does not try to minimize network traffic.

6) uses vnodes

7) does not need much hardware

8) later versions took on features of AFS

9) stateless protocol conflicts with files being stateful by nature.

'''AFS:'''

1) designed to share files over a network, not disks

2) better scalability

3) better security.

4) minimize network traffic.

5) less UNIX like

6) plugin authentication

7) needs more kernel storage due to complex commands

8) inode concept replaced with fid

==Group 3==

'''NFS:'''

1) cache assumption invalid.

2) no locking

3) bad security

'''AFS:'''

1) cache assumption valid

2) locking

3) good security.

==Group 4==

==Additional notes from the class discussion==

Capturing some of Anil's observations about NFS and AFS: NFS shares at the file level rather than the block level because block-level sharing is complicated to implement. NFS uses UDP as its transport protocol because UDP, being stateless, is in line with the NFS design philosophy of not maintaining state information. The security and reliability problems in NFS follow from its use of RPC. RPC is a nice programming model, but it was not designed for networks, where flakiness is inherent: from a programming point of view, you never expect a function call to fail (not return) because of a communication error. The AFS designers considered the network a bottleneck and used caching to reduce the amount of chatter over it. In Anil's opinion the 'open' and 'close' operations in AFS are critical, and 'close' is as important as a 'commit' operation in a well-designed database system. Anil mentioned that the AFS security model is interesting in that, rather than going for the UNIX access-list-based implementation, AFS used a single sign-on system based on Kerberos. In Anil's opinion the cool thing about Kerberos is the idea of using tickets to get access. Another interesting point was that, despite having better features than NFS, AFS was not widely adopted. The reason was that administering AFS was complex: it required highly skilled people and a number of days of effort to set up and maintain.

DistOS 2014W Lecture 8

2014-02-06T20:40:02Z

Sjoy: formatting change

==Group 1==

'''NFS:'''

1) per operating traffic

2) rpc based

3) unreliable

'''AFS:'''

1) designed for 5000 clients

2) high integrity.

==Group 2==

'''NFS:'''

1) designed to share disks over a network, not files

2) more UNIX like

3) portable

4) uses UDP

5) does not try to minimize network traffic.

6) uses vnodes

7) does not need much hardware

8) later versions took on features of AFS

9) stateless protocol conflicts with files being stateful by nature.

'''AFS:'''

1) designed to share files over a network, not disks

2) better scalability

3) better security.

4) minimize network traffic.

5) less UNIX like

6) plugin authentication

7) needs more kernel storage due to complex commands

8) inode concept replaced with fid

==Group 3==

'''NFS:'''

1) cache assumption invalid.

2) no locking

3) bad security

'''AFS:'''

1) cache assumption valid

2) locking

3) good security.

==Group 4==

==Additional notes from the class discussion==

Capturing some of Anil's observations about NFS and AFS: NFS shares at the file level rather than the block level because block-level sharing is complicated to implement. NFS uses UDP as its transport protocol because UDP, being stateless, is in line with the NFS design philosophy of not maintaining state information. The security and reliability problems in NFS follow from its use of RPC. RPC is a nice programming model, but it was not designed for networks, where flakiness is inherent: from a programming point of view, you never expect a function call to fail (not return) because of a communication error. The AFS designers considered the network a bottleneck and used caching to reduce the amount of chatter over it. In Anil's opinion the 'open' and 'close' operations in AFS are critical, and 'close' is as important as a 'commit' operation in a well-designed database system. Anil mentioned that the AFS security model is interesting in that, rather than going for the UNIX access-list-based implementation, AFS used a single sign-on system based on Kerberos. In Anil's opinion the cool thing about Kerberos is the idea of using tickets to get access. Another interesting point was that, despite having better features than NFS, AFS was not widely adopted. The reason was that administering AFS was complex: it required highly skilled people and a number of days of effort to set up and maintain.

DistOS 2014W Lecture 8

2014-02-06T20:39:27Z

Sjoy: Additional notes from the class discussion

==Group 1==

'''NFS:'''

1) per operating traffic

2) rpc based

3) unreliable

'''AFS:'''

1) designed for 5000 clients

2) high integrity.

==Group 2==

'''NFS:'''

1) designed to share disks over a network, not files

2) more UNIX like

3) portable

4) uses UDP

5) does not try to minimize network traffic.

6) uses vnodes

7) does not need much hardware

8) later versions took on features of AFS

9) stateless protocol conflicts with files being stateful by nature.

'''AFS:'''

1) designed to share files over a network, not disks

2) better scalability

3) better security.

4) minimize network traffic.

5) less UNIX like

6) plugin authentication

7) needs more kernel storage due to complex commands

8) inode concept replaced with fid

==Group 3==

'''NFS:'''

1) cache assumption invalid.

2) no locking

3) bad security

'''AFS:'''

1) cache assumption valid

2) locking

3) good security.

==Group 4==

==Additional notes from the class discussion==

Capturing some of Anil's observations about NFS and AFS: NFS shares at the file level rather than the block level because block-level sharing is complicated to implement. NFS uses UDP as its transport protocol because UDP, being stateless, is in line with the NFS design philosophy of not maintaining state information. The security and reliability problems in NFS follow from its use of RPC. RPC is a nice programming model, but it was not designed for networks, where flakiness is inherent: from a programming point of view, you never expect a function call to fail (not return) because of a communication error. The AFS designers considered the network a bottleneck and used caching to reduce the amount of chatter over it. In Anil's opinion the 'open' and 'close' operations in AFS are critical, and 'close' is as important as a 'commit' operation in a well-designed database system. Anil mentioned that the AFS security model is interesting in that, rather than going for the UNIX access-list-based implementation, AFS used a single sign-on system based on Kerberos. In Anil's opinion the cool thing about Kerberos is the idea of using tickets to get access. Another interesting point was that, despite having better features than NFS, AFS was not widely adopted. The reason was that administering AFS was complex: it required highly skilled people and a number of days of effort to set up and maintain.

DistOS 2014W Lecture 6

2014-01-29T22:07:07Z

Sjoy: minor edit

'''the point form notes for this lecture could be turned into full sentences/paragraphs'''

== Group Discussion on "The Early Web" ==

Questions to discuss:

# How do you think the web would have been if not like the present way?
# What kind of infrastructure changes would you like to make?

=== Group 1 ===
: Relatively satisfied with the present structure of the web; the suggested changes fall in the areas below:
* Make use of the greater potential of Protocols
* More communication and interaction capabilities.
* Implementation changes in the present payment method systems. Example usage of "Micro-computation" - a discussion we would get back to in future classes. Also, Cryptographic currencies.
* Augmented reality.
* More towards individual privacy.

=== Group 2 ===
==== Problem of unstructured information ====
A large portion of the web serves content that is overwhelmingly concerned with presentation rather than with structuring content. Tim Berners-Lee himself bemoaned the death of the semantic web. His original vision of it was as follows:


<blockquote>I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A "Semantic Web", which makes this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The "intelligent agents" people have touted for ages will finally materialize.<ref>{{cite book |last=Berners-Lee |first=Tim |authorlink=Tim Berners-Lee |coauthors=Fischetti, Mark |title=Weaving the Web |publisher=HarperSanFrancisco |year=1999 |pages=chapter 12 |isbn=978-0-06-251587-2 |nopp=true }}</ref></blockquote>

For this vision to come true, information arguably needs to be structured, maybe even classified. The idea of a universal information classification system has been floated. The modern web is mostly developed by software developers and similar professions, not by librarians and the like.



Also, how does one differentiate satire from fact?

==== Valuation and deduplication of information ====
Another common problem with the current web is the duplication of information. Redundancy that increases the availability of information is not harmful in itself, but can the same be said of ad-hoc duplication of the information itself?

One then comes to the problem of assigning a value to the information found therein. How does one rate information, and according to what criteria? How does one authenticate the information? Often, popularity is used as an indicator of veracity, almost in a sophistic manner. See the excessive reliance on Google page ranking for research and on Reddit scores for news consumption.

=== On the current infrastructure ===
The current internet infrastructure should remain as is, at least in countries with more than a modicum of freedom of access to information. Centralized control of access to information is a terrible power. See China and parts of the Middle East. On that note, what can be said of popular sites, such as Google or Wikipedia, that serve as the main entry point for many access patterns?

The problem, if any, in the current web infrastructure is of the web itself, not the internet.

=== Group 3 ===
* What we want to keep
** Linking mechanisms
** Minimum permissions to publish
* What we don't like
** Relying on one source for a document
** Privacy links for security
* Proposal
** Peer-to-peer, distributed mechanisms for documents
** Reverse links with caching - distributed cache
** More availability for user - what happens when system fails?
** Key management to be considered - Is it good to have centralized or distributed mechanism?

=== Group 4 ===
* An idea of web searching for us
* A suggestion of a different web if it would have been implemented by "AI" people
** AI programs searching for data - A notion already being implemented by Google slowly.
* Generate report forums
* HTML equivalent is inspired by the AI communication
* Higher semantics apart from just indexing the data
** Problem : "How to bridge the semantic gap?"
** Search for more data patterns

== Group design exercise — The web that could be ==

* “The web that wasn't” mentioned the moans of librarians.
* A universal classification system is needed.
* The training overhead of classifiers (e.g., librarians) is high. See the master's that a librarian would need.
* More structured content, both classification, and organization
* Current indexing by crude brute-force searching for words, etc., rather than searching metadata
* Information doesn't have the same persistence, see bitrot and Vint Cerf's talk.
* Too concerned with presentation now.
* Tim Berners-Lee bemoaning the death of the semantic web.
* The problem of information duplication when information gets redistributed across the web. However, we do want redundancy.
* Too much developed by software developers
* Too reliant on Google for web structure
** See search-engine optimization
* Problem of authentication (of the information, not the presenter)
** Too dependent at times on the popularity of a site, almost in a sophistic manner.
** See Reddit
* How do you programmatically distinguish satire from fact
* The web's structure is also “shaped by inbound links but would be nice a bit more”
* Infrastructure doesn't need to change per se.
** The distributed architecture should still stay. Centralization of control of allowed information and access is terrible power. See China and the Middle-East.
** Information, for the most part, in itself, exists centrally (as per-page), though communities (to use a generic term) are distributed.
* Need more sophisticated natural language processing.

== Class discussion ==

Focusing on vision, not the mechanism.

* Reverse linking
* Distributed content distribution (glorified cache)
** Both for privacy and redundancy reasons
** Suggested centralized content certification, but doesn't address the problem of root of trust and distributed consistency checking.
*** Distributed key management is a holy grail
*** What about detecting large-scale subversion attempts, like in China
* What is the new revenue model?
** What was TBL's revenue model (tongue-in-cheek, none)?
** Organisations like Google monetized the internet, and this mechanism could destroy their ability to do so.
* Search work is semi-distributed. Suggested letting the web do the work for you.
* Trying to structure content in a manner simultaneously palatable to both humans and machines.
* Using spare CPU time on servers for natural language processing (or other AI) of cached or locally available resources.
* Imagine a smushed Wolfram Alpha, Google, Wikipedia, and Watson, and then distributed over the net.
* The document was TBL's idea of the atom of content, whereas nowadays we really need something more granular.
* We want to extract higher-level semantics.
* Google may not be pure keyword search anymore. It is essentially AI now, but we still struggle with expressing what we want to Google.
* What about the adversarial aspect of content hosters, vying for attention?
* People do actively try to fool you.
* Compare to Google News, though that is very specific to that domain. Their vision is a semantic web, but they are incrementally building it.
* In a scary fashion, Google is one of the central points of failure of the web. Even scarier is less technically competent people who depend on Facebook for that.
* There is a semantic gap between how we express and query information, and how AI understands it.
* Can think of Facebook as a distributed human search infrastructure.
* A core service of an operating system is locating information. '''Search is infrastructure.'''
* The problem is not purely technical. There are political and social aspects.
** Searching for a file on a local filesystem should have an unambiguous answer.
** Asking the web is a different thing. “What is the best chocolate bar?”
* Is the web a network database, as understood in COMP 3005, which we consider harmful?
* For two-way links, there is the problem of restructuring data and all the dependencies.
* Privacy issues when tracing paths across the web.
* What about the problem of information revocation?
* Need more augmented reality and distributed and micro payment systems.
* We need distributed, mutually untrusting social networks.
** Now we have the problem of storage and computation, but we also take away some of the monetizable aspect.
* Distribution is not free. It is very expensive in very funny ways.
* The dream of harvesting all the computational power of the internet is not new.
** Startups have come and gone many times over that problem.
* Google's indexers understand many documents on the web quite well. However, Google only '''presents''' a primitive keyword-like interface. It doesn't expose the ontology.
* Organising information does not necessarily mean applying an ontology to it.
* The organisational methods we now use don't use ontologies, but rather are supplemented by them.

A couple of related points Anil mentioned during the discussion:
Distributed key management is a holy grail: no one has ever managed to get it working. Nowadays, databases have become important building blocks of the distributed operating system. Anil stressed that databases can in fact be considered an OS service these days. The question “How do you navigate a complex information space?” has remained a prominent one that the Web has always faced.

DistOS 2014W Lecture 7

2014-01-28T19:10:17Z

Sjoy: minor edit

== Project ==

We discussed moving the proposal due date back a week. We also discussed spending the class prior to that date discussing the primary papers people had chosen in order to provide preliminary feedback. Anil spent some time going through the papers from OSDI12 and discussing which ones would make good projects and why.

* Pick a primary paper.
* Find papers that cite that paper, papers it cites, etc. to collect a body of related work.
* Don't just give a history, tell a story!
* Do not try to summarize papers.
* Try to identify a pattern, a common ground between the papers.

== Unix and Plan 9 ==

UNIX was built as "a castrated version of Multics", which was a very complex system. Multics was, arguably, so far ahead of its time that we are only just achieving its ambitions now. Unix was much more modest, and therefore much more achievable and successful. It provided just enough infrastructure to avoid reinventing the wheel, and was written by just a couple of programmers making something for their own use. Unix was not designed as a product or commercial entity at all. It was licensed out because AT&T was under severe antitrust scrutiny at the time.

They wanted few, simple abstractions, so they made everything a file. Berkeley promptly broke this abstraction by introducing sockets for networking. Sun Microsystems licensed Berkeley Unix and commercialized it. Plan 9 finally introduced networking using the right abstractions, but was too late. Arguably, the reason the BSD folks didn't use the file abstraction was the difference in reliability. Files are generally reliable, and failures with them are catastrophic, so many applications simply didn't include logic to handle such I/O errors. Networks are much less reliable, and applications have to be able to deal gracefully with timeouts and other errors.

In Anil's opinion, Plan 9's use of the file abstraction to represent the network was not a good design. File I/O rarely breaks, but networks are inherently flaky, and loss of connectivity is normal. File system abstractions do not properly account for that flakiness. In other words, the network does not have the reliability characteristics of mass storage, and how to deal with this while using the file abstraction for networking was a major question left unanswered by the Plan 9 designers. Anil added that Plan 9 was an elegant attempt at representing everything as a file, but that its designers were trying too hard. In distributed systems, things with different semantics should have abstractions and APIs that reflect their characteristics, rather than hiding them and pretending they behave like something else for the sake of generalization. In Anil's opinion, another reason Plan 9 was not widely adopted was that it arrived late: by the time it came out in the 1990s, UNIX systems with networking were already widely adopted, driven by the success of the Internet.

Another valuable point Anil mentioned was that for a technology to get adopted and become successful it should serve or address a niche area for which there are no successful incumbents.
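The point that different failure modes deserve different APIs can be made concrete with a small sketch. It is an illustration only, not anything shown in the lecture: the function names and the retry policy are assumptions, and a local socket pair stands in for a real network link. The local read simply lets rare errors propagate, while the network read treats a timeout as an ordinary outcome that the caller must handle.

```python
import os
import socket
import tempfile


def read_local(path):
    """Read a local file; failures are rare, so errors simply propagate."""
    with open(path, "rb") as f:
        return f.read()


def read_remote(sock, nbytes, timeout_s, retries):
    """Read from a socket, treating timeouts as a normal, expected outcome."""
    sock.settimeout(timeout_s)
    for _ in range(retries):
        try:
            return sock.recv(nbytes)
        except socket.timeout:
            continue  # the peer is slow or gone; try again
    return None  # explicit "no answer" result that the caller must handle


# A connected pair of local sockets stands in for a network link (assumption).
a, b = socket.socketpair()
silent = read_remote(a, 16, timeout_s=0.05, retries=2)    # nothing sent yet
b.sendall(b"hello")
answered = read_remote(a, 16, timeout_s=0.05, retries=2)  # data is now waiting
a.close()
b.close()

# The same bytes read from a temporary local file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
local = read_local(tmp.name)
os.unlink(tmp.name)
```

The signature of `read_remote` exposes the timeout and retry count, whereas `read_local` has nothing comparable to expose. Forcing both behind one file-style `read` would hide exactly the characteristics the caller needs to reason about.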

== Simon's Notes ==

* project proposal
** We will discuss the primary papers we've chosen on Thursday, February 6th
* possible papers, remember to pick a topic you have some chance of understanding
** OSDI 2012
*** datacenter (filesystems for doing X, heat management, etc...)
*** web stuff
*** distributed shared memory
*** distributed network I/O infrastructure
*** distributed databases (potentially)
*** anonymity systems
** Pick a conference (USENIX is pretty systems oriented, maybe LISA), go through their papers and find something interesting
** tell a story that connects several papers in the topic you choose

* UNIX
** Relation to Multics
*** Multics was a complex system, which was bad because it was used less, was slower, etc.
*** Multics was not for end users, it was designed to support "utility computing" wherein computation was a service to be charged for
** What?
*** Just enough infrastructure to run my programs
*** It was really just supposed to be used by programmers
*** "By programmers for programmers"
*** Software and source licensed for a nominal fee
*** "Everything is a file"
*** the only difference was between files you could seek in and ones you couldn't
*** simple abstractions
** Networking
*** Berkeley folks made sockets, not files, which upset the folks at Bell Labs
*** Networks aren't exactly like files because they're unreliable
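The note that the only difference was between files you could seek in and ones you couldn't can be seen directly from a program, at least on a POSIX system. The sketch below is an illustration, not something from the lecture: a temporary regular file and a pipe are both read and written through the same file interface, but only the regular file supports seeking.

```python
import os
import tempfile

# A regular file supports random access through seek.
with tempfile.TemporaryFile() as regular:
    regular.write(b"abcdef")
    regular.seek(2)                      # jump to the third byte
    regular_is_seekable = regular.seekable()
    tail = regular.read()                # reads from the new position

# A pipe is read and written through the same interface, but it is a stream.
read_fd, write_fd = os.pipe()
with os.fdopen(read_fd, "rb") as reader, os.fdopen(write_fd, "wb") as writer:
    writer.write(b"abcdef")
    writer.flush()
    pipe_is_seekable = reader.seekable()
    head = reader.read(3)                # can only consume bytes in order
```

A socket behaves like the pipe in this respect, which is part of why a stream-style file interface for networking looked plausible. What the interface does not capture is the difference in reliability noted in the last bullet.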

* Plan 9
** major ideas
*** procfs, later adopted by Linux
** summary
*** a very elegant attempt to follow the philosophy "everything is a file"
*** trying too hard
** opinions
*** things that have different failure modes deserve different APIs
** niche?
*** they never found one

* Tangent about programming languages
** C was for system programming
** Java was for enterprise programming

DistOS 2014W Lecture 7

2014-01-28T19:08:54Z

Sjoy: additional notes from the lecture

== Project ==

We discussed moving the proposal due date back a week. We also discussed spending the class prior to that date discussing the primary papers people had chosen in order to provide preliminary feedback. Anil spent some time going through the papers from OSDI12 and discussing which ones would make good projects and why.

* Pick a primary paper.
* Find papers that cite that paper, papers it cites, etc. to collect a body of related work.
* Don't just give a history, tell a story!
* Do not try to summarize papers.
* Try to identify a pattern, a common ground between the papers.

== Unix and Plan 9 ==

UNIX was built as "a castrated version of Multics", which was a very complex system. Multics was, arguably, so far ahead of its time that we are only just achieving its ambitions now. Unix was much more modest, and therefore much more achievable and successful. It provided just enough infrastructure to avoid reinventing the wheel, and was written by just a couple of programmers making something for their own use. Unix was not designed as a product or commercial entity at all. It was licensed out because AT&T was under severe antitrust scrutiny at the time.

They wanted few, simple abstractions, so they made everything a file. Berkeley promptly broke this abstraction by introducing sockets for networking. Sun Microsystems licensed Berkeley Unix and commercialized it. Plan 9 finally introduced networking using the right abstractions, but was too late. Arguably, the reason the BSD folks didn't use the file abstraction was the difference in reliability. Files are generally reliable, and failures with them are catastrophic, so many applications simply didn't include logic to handle such I/O errors. Networks are much less reliable, and applications have to be able to deal gracefully with timeouts and other errors.

In Anil's opinion, Plan 9's use of the file abstraction to represent the network was not a good design. File I/O rarely breaks, but networks are inherently flaky, and loss of connectivity is normal. File system abstractions do not properly account for that flakiness. In other words, the network does not have the reliability characteristics of mass storage, and how to deal with this while using the file abstraction for networking was a major question left unanswered by the Plan 9 designers. Anil added that Plan 9 was an elegant attempt at representing everything as a file, but that its designers were trying too hard. In distributed systems, things with different semantics should have abstractions and APIs that reflect their characteristics, rather than hiding them and pretending they behave like something else for the sake of generalization. In Anil's opinion, another reason Plan 9 was not widely adopted was that it arrived late: by the time it came out in the 1990s, UNIX systems with networking were already widely adopted, driven by the success of the Internet.

Another valuable point Anil mentioned was that for a technology to get adopted and become successful it should serve or address a niche area for which there are no successful incumbents.

== Simon's Notes ==

* project proposal
** We will discuss the primary papers we've chosen on Thursday, February 6th
* possible papers, remember to pick a topic you have some chance of understanding
** OSDI 2012
*** datacenter (filesystems for doing X, heat management, etc...)
*** web stuff
*** distributed shared memory
*** distributed network I/O infrastructure
*** distributed databases (potentially)
*** anonymity systems
** Pick a conference (USENIX is pretty systems oriented, maybe LISA), go through their papers and find something interesting
** tell a story that connects several papers in the topic you choose

* UNIX
** Relation to Multics
*** Multics was a complex system, which was bad because it was used less, was slower, etc.
*** Multics was not for end users, it was designed to support "utility computing" wherein computation was a service to be charged for
** What?
*** Just enough infrastructure to run my programs
*** It was really just supposed to be used by programmers
*** "By programmers for programmers"
*** Software and source licensed for a nominal fee
*** "Everything is a file"
*** the only difference was between files you could seek in and ones you couldn't
*** simple abstractions
** Networking
*** Berkeley folks made sockets, not files, which upset the folks at Bell Labs
*** Networks aren't exactly like files because they're unreliable

* Plan 9
** major ideas
*** procfs, later adopted by Linux
** summary
*** a very elegant attempt to follow the philosophy "everything is a file"
*** trying too hard
** opinions
*** things that have different failure modes deserve different APIs
** niche?
*** they never found one

* Tangent about programming languages
** C was for system programming
** Java was for enterprise programming

DistOS 2014W Lecture 7

2014-01-28T18:59:26Z

Sjoy: additional notes from the lecture

== Project ==

We discussed moving the proposal due date back a week. We also discussed spending the class prior to that date discussing the primary papers people had chosen in order to provide preliminary feedback. Anil spent some time going through the papers from OSDI12 and discussing which ones would make good projects and why.

* Pick a primary paper.
* Find papers that cite that paper, papers it cites, etc. to collect a body of related work.
* Don't just give a history, tell a story!
* Do not try to summarize papers.
* Try to identify a pattern, a common ground between the papers.

== Unix and Plan 9 ==

UNIX was built as "a castrated version of Multics", which was a very complex system. Multics was, arguably, so far ahead of its time that we are only just achieving its ambitions now. Unix was much more modest, and therefore much more achievable and successful. It provided just enough infrastructure to avoid reinventing the wheel, and was written by just a couple of programmers making something for their own use. Unix was not designed as a product or commercial entity at all. It was licensed out because AT&T was under severe antitrust scrutiny at the time.

They wanted few, simple abstractions, so they made everything a file. Berkeley promptly broke this abstraction by introducing sockets for networking. Sun Microsystems licensed Berkeley Unix and commercialized it. Plan 9 finally introduced networking using the right abstractions, but was too late. Arguably, the reason the BSD folks didn't use the file abstraction was the difference in reliability. Files are generally reliable, and failures with them are catastrophic, so many applications simply didn't include logic to handle such I/O errors. Networks are much less reliable, and applications have to be able to deal gracefully with timeouts and other errors.

In Anil's opinion, Plan 9's use of the file abstraction to represent the network was not a good design. File I/O rarely breaks, but networks are inherently flaky, and loss of connectivity is normal. File system abstractions do not properly account for that flakiness. In other words, the network does not have the reliability characteristics of mass storage, and how to deal with this while using the file abstraction for networking was a major question left unanswered by the Plan 9 designers. Anil added that Plan 9 was an elegant attempt at representing everything as a file, but that its designers were trying too hard. In distributed systems, things with different semantics should have abstractions and APIs that reflect their characteristics, rather than hiding them and pretending they behave like something else for the sake of generalization. In Anil's opinion, another reason Plan 9 was not widely adopted was that it arrived late: by the time it came out in the 1990s, UNIX systems with networking were already widely adopted, driven by the success of the Internet.

== Simon's Notes ==

* project proposal
** We will discuss the primary papers we've chosen on Thursday, February 6th
* possible papers, remember to pick a topic you have some chance of understanding
** OSDI 2012
*** datacenter (filesystems for doing X, heat management, etc...)
*** web stuff
*** distributed shared memory
*** distributed network I/O infrastructure
*** distributed databases (potentially)
*** anonymity systems
** Pick a conference (USENIX is pretty systems oriented, maybe LISA), go through their papers and find something interesting
** tell a story that connects several papers in the topic you choose

* UNIX
** Relation to Multics
*** Multics was a complex system, which was bad because it was used less, was slower, etc.
*** Multics was not for end users, it was designed to support "utility computing" wherein computation was a service to be charged for
** What?
*** Just enough infrastructure to run my programs
*** It was really just supposed to be used by programmers
*** "By programmers for programmers"
*** Software and source licensed for a nominal fee
*** "Everything is a file"
*** the only difference was between files you could seek in and ones you couldn't
*** simple abstractions
** Networking
*** Berkeley folks made sockets, not files, which upset the folks at Bell Labs
*** Networks aren't exactly like files because they're unreliable

* Plan 9
** major ideas
*** procfs, later adopted by Linux
** summary
*** a very elegant attempt to follow the philosophy "everything is a file"
*** trying too hard
** opinions
*** things that have different failure modes deserve different APIs
** niche?
*** they never found one

* Tangent about programming languages
** C was for system programming
** Java was for enterprise programming

DistOS 2014W Lecture 5

2014-01-26T04:35:12Z

Sjoy: additional notes

= Introduction =

Anil set the theme of the week's discussion: to understand what the early visionaries and researchers wanted the computer to be, and what it has become. In other words, what was considered fundamental in those days, and where do those ideas stand today? Features that were easy to implement using simple mechanisms were carried forward, whereas those that demanded more complex systems, or that were found to add little value in the near future, were given lower priority. In the same context, the following observations were made: (1) a truly distributed computational infrastructure makes sense only when we have something to distribute; (2) use cases drive large distributed systems, a good example being the Web. Another key observation from Anil was that there was always a utopian aspect to the early systems, be it NLS, ARPANET, or the Alto. For example, security was never considered essential in those systems, which were assumed to operate in a trusted environment.

; Operating system
: The software that turns the computer you have into the one you want (Anil)

* What sort of computer did we want to have?
* What sort of abstractions did they want to be easy? Hard?
* What could we build with the internet (not just WAN, but also LAN)?
* Most dreams people had of their computers smacked into the wall of reality.

= MOAD review in groups =

* Chorded keyboard unfortunately obscure, partly because the attendees disagreed with the long-term investment of training the user.
* View control → hyperlinking system, but in a lightweight (more like nanoweight) markup language.
* Ad-hoc ticketing system
* Ad-hoc messaging system
** Used on a time-sharing system with shared storage.
* Primitive revision control system
* Different vocabulary:
** Bug and bug smear (mouse and trail)
** Point rather than click

= Class review =

* Doug died Jul 2 2013
* Doug himself called it an “online system”, rather than offline composition of code using card punchers as was common in the day.
* What became of the tech:
** Chorded keyboards:
*** Exist but obscure
** Pre-ARPANET network:
*** Time-sharing mainframe
*** 13 workstations
*** Telephone and television circuit
** Mouse
*** “I sometimes apologize for calling it a mouse”
** Collaborative document editing integrated with screen sharing
** Videoconferencing
*** Part of the vision, but more for the demo at the time.
** Hyperlinks
*** The web on a mainframe
** Languages
*** Metalanguages
**** “Part and parcel of their entire vision of augmenting human intelligence.”
**** You must teach the computer about the language you are using.
**** They were the use case. It was almost designed more for augmenting programmer intelligence rather than human intelligence.
*** It was normal for the time to build new languages (domain-specific) for new systems. Nowadays, we standardize on one but develop large APIs, at the expense of conciseness. We look for short-term benefits; we minimize programmer effort.
*** Compiler compiler
** Freeze-pane
** Folding—Zoomable UI (ZUI)
*** Lots of systems do it, but not the default
*** Much easier to just present everything.
** Technologies that required further investment got left behind.
* The NLS had little to no security
** There was a minimal notion of a user
** There was a utopian aspect. Meanwhile, the Mac had no utopian aspect. Data exchange was through floppies. Any network was small, local, ad-hoc, and among trusted peers.
** The system wasn't envisioned to scale up to masses of people who didn't trust each other.
** How do you enforce secrecy?
* Part of the reason for lack of adoption of some of the tech was hardware. We can posit that a bigger reason would be infrastructure.
* Differentiate usability of system from usability of vision
** What was missing was the polish, the ‘sexiness’, and the intuitiveness of later systems like the Apple II and the Lisa.
** The usability of the later Alto is still less than commercial systems.
*** The word processor was modal, which is apt to confuse unmotivated and untrained users.
* In the context of the Mother of All Demos, the Alto doesn't seem entirely revolutionary. Xerox PARC raided Engelbart's team. They almost had a GUI; rather, they had what we call today a virtual console, with a few things above.
* What happens with visionaries that present a big vision is that the spectators latch onto specific aspects.
* To be comfortable with not adopting the vision, one must ostracize the visionary. People pay attention to things that fit into their world view.
* Use cases of networking have changed little, though the means have.
* Fundamentally a resource-sharing system; everything is shared, unlike later systems where you would need to share explicitly. The resources shared were ones that fundamentally made sense to share: documents, printers, etc.
* Resource sharing was never enough. '''Information-sharing''' was the focus.

“The Mother of All Demos” is the nickname for Engelbart's 1968 demonstration, which showed how computers could help humans become smarter.

* The most interesting aspect of this work: "His idea included seeing computing devices as a means to communicate and retrieve information, rather than just crunch numbers." This idea is represented in NLS (the “oN-Line System”).

* Some information about the NLS system:
** NLS was a revolutionary computer collaboration system from the 1960s.
** It was designed by Douglas Engelbart and implemented by researchers at the Augmentation Research Center (ARC) at the Stanford Research Institute (SRI).
** The NLS system was the first to employ the practical use of:
*** hypertext links,
*** the mouse,
*** raster-scan video monitors,
*** information organized by relevance,
*** screen windowing,
*** presentation programs,
*** and other modern computing concepts.

= Alto review =

* Fundamentally a personal computer
* Applications:
** Drawing program with curves and arcs for drawing
** Hardware design tools (mostly logic boards)
** Time server
* Less designed for reading than the NLS. More designed around paper. Xerox had a laser printer, and you would read what you printed. Hypertext was deprioritized, unlike in the NLS vision, which had focused on what could not be expressed on paper.
* Xerox had almost an obsession with making documents print beautifully.

DistOS 2014W Lecture 6

2014-01-26T02:32:19Z

Sjoy: formatting

== Group Discussion on "The Early Web" ==

Questions to discuss:

# How do you think the web would have been if not like the present way?
# What kind of infrastructure changes would you like to make?

=== Group 1 ===
: Relatively satisfied with the present structure of the web. Changes were suggested in the following areas:
* Make use of the greater potential of Protocols
* More communication and interaction capabilities.
* Implementation changes in the present payment method systems. Example usage of "Micro-computation" - a discussion we would get back to in future classes. Also, Cryptographic currencies.
* Augmented reality.
* More towards individual privacy.

=== Group 2 ===
==== Problem of unstructured information ====
A large portion of the web serves content that is overwhelmingly concerned with presentation rather than with structuring content. Tim Berners-Lee himself bemoaned the death of the semantic web. His original vision of it was as follows:


<blockquote>I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A "Semantic Web", which makes this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The "intelligent agents" people have touted for ages will finally materialize.<ref>{{cite book |last=Berners-Lee |first=Tim |authorlink=Tim Berners-Lee |coauthors=Fischetti, Mark |title=Weaving the Web |publisher=HarperSanFrancisco |year=1999 |pages=chapter 12 |isbn=978-0-06-251587-2 |nopp=true }}</ref></blockquote>

For this vision to be true, information arguably needs to be structured, maybe even classified. The idea of a universal information classification system has been floated. The modern web is mostly developed by software developers and similar, not librarians and the like.



Also, how does one differentiate satire from fact?

==== Valuation and deduplication of information ====
Another problem common with the current WWW is the duplication of information. Redundancy is not harmful in itself, since it increases the availability of information, but is ad-hoc duplication of the information itself desirable?

One then comes to the problem of assigning a value to the information found therein. How does one rate information, and according to what criteria? How does one authenticate the information? Often, popularity is used as an indicator of veracity, almost in a sophistic manner. See the excessive reliance on Google page ranking for research and on Reddit score for news consumption.

=== On the current infrastructure ===
The current internet infrastructure should remain as is, at least in countries with more than a modicum of freedom of access to information. Centralization of control of access to information is a terrible power. See China and parts of the Middle East. On that note, what can be said of popular sites, such as Google or Wikipedia, that serve as the main entry point for many access patterns?

The problem, if any, in the current web infrastructure is of the web itself, not the internet.

=== Group 3 ===
* What we want to keep
** Linking mechanisms
** Minimum permissions to publish
* What we don't like
** Relying on one source for a document
** Privacy links for security
* Proposal
** Peer-peer to distributed mechanisms for documenting
** Reverse links with caching - distributed cache
** More availability for user - what happens when system fails?
** Key management to be considered - Is it good to have centralized or distributed mechanism?
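The reverse-link proposal above can be sketched as a small index that inverts the forward links. This is only an illustration under stated assumptions: the page names are hypothetical, and a real system would have to keep such an index consistent across a distributed cache, which is the hard part the group identified.

```python
from collections import defaultdict


def build_backlinks(forward_links):
    """Invert a map of page -> pages it links to into page -> pages linking to it."""
    backlinks = defaultdict(set)
    for source, targets in forward_links.items():
        for target in targets:
            backlinks[target].add(source)
    return backlinks


# Hypothetical forward links, as a crawler or cache node might record them.
forward = {
    "home": ["about", "papers"],
    "about": ["home"],
    "papers": ["home", "about"],
}

backlinks = build_backlinks(forward)
linking_to_about = sorted(backlinks["about"])
```

With forward links alone, answering "who links to this page?" requires scanning every page; the inverted index answers it with a single lookup, at the cost of having to be updated whenever any page changes its links.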

=== Group 4 ===
* An idea of web searching for us
* A suggestion of a different web if it would have been implemented by "AI" people
** AI programs searching for data - A notion already being implemented by Google slowly.
* Generate report forums
* HTML equivalent is inspired by the AI communication
* Higher semantics apart from just indexing the data
** Problem : "How to bridge the semantic gap?"
** Search for more data patterns

== Group design exercise — The web that could be ==

* “The web that wasn't” mentioned the moans of librarians.
* A universal classification system is needed.
* The training overhead of classifiers (e.g., librarians) is high. Consider the master's degree that a librarian would need.
* More structured content, in both classification and organization
* Current indexing relies on crude brute-force searching for words, etc., rather than on searching metadata
* Information doesn't have the same persistence, see bitrot and Vint Cerf's talk.
* Too concerned with presentation now.
* Tim Berners-Lee bemoaning the death of the semantic web.
* The problem of information duplication when information gets redistributed across the web. However, we do want redundancy.
* Too much developed by software developers
* Too reliant on Google for web structure
** See search-engine optimization
* Problem of authentication (of the information, not the presenter)
** Too dependent at times on the popularity of a site, almost in a sophistic manner.
** See Reddit
* How do you programmatically distinguish satire from fact
* The web's structure is also “shaped by inbound links but would be nice a bit more”
* Infrastructure doesn't need to change per se.
** The distributed architecture should still stay. Centralization of control of allowed information and access is a terrible power. See China and the Middle East.
** Information, for the most part, in itself, exists centrally (as per-page), though communities (to use a generic term) are distributed.
* Need more sophisticated natural language processing.

== Class discussion ==

Focusing on vision, not the mechanism.

* Reverse linking
* Distributed content distribution (glorified cache)
** Both for privacy and redundancy reasons
** Suggested centralized content certification, but doesn't address the problem of root of trust and distributed consistency checking.
*** Distributed key management is a holy grail
*** What about detecting large-scale subversion attempts, like in China
* What is the new revenue model?
** What was TBL's revenue model (tongue-in-cheek, none)?
** Organisations like Google monetized the internet, and this mechanism could destroy their ability to do so.
* Search work is semi-distributed. Suggested letting the web do the work for you.
* Trying to structure content in a manner simultaneously palatable to both humans and machines.
* Using spare CPU time on servers for natural language processing (or other AI) of cached or locally available resources.
* Imagine a smushed Wolfram Alpha, Google, Wikipedia, and Watson, and then distributed over the net.
* The document was TBL's idea of the atom of content, whereas nowadays we really need something more granular.
* We want to extract higher-level semantics.
* Google may not be pure keyword search anymore. It is essentially AI now, but we still struggle with expressing what we want to Google.
* What about the adversarial aspect of content hosters, vying for attention?
* People do actively try to fool you.
* Compare to Google News, though that is very specific to that domain. Their vision is a semantic web, but they are incrementally building it.
* In a scary fashion, Google is one of the central points of failure of the web. Even scarier are the less technically competent people who depend on Facebook for that.
* There is a semantic gap between how we express and query information, and how AI understands it.
* Can think of Facebook as a distributed human search infrastructure.
* A core service of an operating system is locating information. '''Search is infrastructure.'''
* The problem is not purely technical. There are political and social aspects.
** Searching for a file on a local filesystem should have an unambiguous answer.
** Asking the web is a different thing. “What is the best chocolate bar?”
* Is the web a network database, as understood in COMP 3005, which we consider harmful?
* For two-way links, there is the problem of restructuring data and all the dependencies.
* Privacy issues when tracing paths across the web.
* What about the problem of information revocation?
* Need more augmented reality, as well as distributed payment and micropayment systems.
* We need distributed, mutually untrusting social networks.
** Now we have the problem of storage and computation, but also take away some of the monetizable aspect.
* Distribution is not free. It is very expensive in very funny ways.
* The dream of harvesting all the computational power of the internet is not new.
** Startups have come and gone many times over that problem.
* Google's indexers understand many documents on the web quite well. However, Google only '''presents''' a primitive keyword-like interface. It doesn't expose the ontology.
* Organising information does not necessarily mean applying an ontology to it.
* The organisational methods we now use don't use ontologies, but rather are supplemented by them.

Adding a couple of related points Anil mentioned during the discussion:
*Distributed key management is a holy grail; no one has ever managed to get it working.
*Nowadays, databases have become important building blocks of the distributed operating system. Anil stressed that databases can in fact be considered an OS service these days.
*The question “How do you navigate the complex information space?” has remained a prominent one that the web has always faced.

DistOS 2014W Lecture 6

2014-01-26T02:31:00Z

Sjoy: formatting

== Group Discussion on "The Early Web" ==

Questions to discuss:

# How do you think the web would have been if not like the present way?
# What kind of infrastructure changes would you like to make?

=== Group 1 ===
: Relatively satisfied with the present structure of the web. Changes were suggested in the following areas:
* Make use of the greater potential of protocols.
* More communication and interaction capabilities.
* Implementation changes in present payment systems, for example the use of "Micro-computation" (a discussion we will get back to in future classes) and cryptographic currencies.
* Augmented reality.
* More emphasis on individual privacy.

=== Group 2 ===
==== Problem of unstructured information ====
A large portion of the web serves content that is overwhelmingly concerned with presentation rather than with structuring content. Tim Berners-Lee himself bemoaned the death of the semantic web. His original vision of it was as follows:


<blockquote>I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A "Semantic Web", which makes this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The "intelligent agents" people have touted for ages will finally materialize.<ref>{{cite book |last=Berners-Lee |first=Tim |authorlink=Tim Berners-Lee |coauthors=Fischetti, Mark |title=Weaving the Web |publisher=HarperSanFrancisco |year=1999 |pages=chapter 12 |isbn=978-0-06-251587-2 |nopp=true }}</ref></blockquote>

For this vision to be true, information arguably needs to be structured, maybe even classified. The idea of a universal information classification system has been floated. The modern web is mostly developed by software developers and similar, not librarians and the like.



Also, how does one differentiate satire from fact?

==== Valuation and deduplication of information ====
Another common problem with the current web is the duplication of information. Redundancy that increases the availability of information is not harmful in itself, but is ad-hoc duplication of the information itself harmful?

One then comes to the problem of assigning a value to the information found. How does one rate information, and according to what criteria? How does one authenticate the information? Often, popularity is used as an indicator of veracity, almost in a sophistic manner. See the excessive reliance on Google page ranking for research and on Reddit score for news consumption.

=== On the current infrastructure ===
The current internet infrastructure should remain as is, at least in countries with more than a modicum of freedom of access to information. Centralization of control of access to information is a terrible power. See China and parts of the Middle East. On that note, what can be said of popular sites, such as Google or Wikipedia, that serve as the main entry point for many access patterns?

The problem, if any, with the current web infrastructure lies in the web itself, not the internet.

=== Group 3 ===
* What we want to keep
** Linking mechanisms
** Minimum permissions to publish
* What we don't like
** Relying on one source for a document
** Privacy links for security
* Proposal
** Peer-to-peer or distributed mechanisms for documents
** Reverse links with caching (a distributed cache)
** More availability for the user: what happens when a system fails?
** Key management needs to be considered: is a centralized or a distributed mechanism better?
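
The reverse links in the proposal above can be pictured as an inverted copy of the ordinary link graph. What follows is only a minimal sketch under assumptions: the page names and the <code>build_backlinks</code> helper are hypothetical, and the group did not specify any implementation. It assumes every page's outbound links are already known in one place, which is exactly what a distributed cache would have to provide.

```python
from collections import defaultdict


def build_backlinks(forward_links):
    """Invert a page -> outbound-links map into a page -> inbound-links map."""
    backlinks = defaultdict(set)
    for source, targets in forward_links.items():
        for target in targets:
            backlinks[target].add(source)
    # Sort the sources so the result is deterministic.
    return {page: sorted(sources) for page, sources in backlinks.items()}


# Hypothetical three-page web, used only for illustration.
forward_links = {
    "a.example/home": ["b.example/paper", "c.example/notes"],
    "b.example/paper": ["c.example/notes"],
    "c.example/notes": [],
}

backlinks = build_backlinks(forward_links)
print(backlinks["c.example/notes"])  # ['a.example/home', 'b.example/paper']
```

A page that nobody links to does not appear in the result at all. The hard part raised in the discussion, keeping such an index consistent when pages move or are restructured, is not addressed by this sketch.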

=== Group 4 ===
* The idea of the web searching for us
* A suggestion of how different the web would be had it been implemented by "AI" people
** AI programs searching for data, a notion Google is already slowly implementing.
* Generate report forums
* HTML equivalent is inspired by the AI communication
* Higher semantics apart from just indexing the data
** Problem : "How to bridge the semantic gap?"
** Search for more data patterns

== Group design exercise — The web that could be ==

* “The web that wasn't” mentioned the moans of librarians.
* A universal classification system is needed.
* The training overhead of classifiers (e.g., librarians) is high. See the master's that a librarian would need.
* More structured content, in both classification and organization
* Current indexing works by crude brute-force searching for words, etc., rather than by searching metadata
* Information doesn't have the same persistence, see bitrot and Vint Cerf's talk.
* Too concerned with presentation now.
* Tim Berners-Lee bemoaning the death of the semantic web.
* The problem of information duplication when information gets redistributed across the web. However, we do want redundancy.
* Too much of the web is developed by software developers.
* Too reliant on Google for web structure
** See search-engine optimization
* Problem of authentication (of the information, not the presenter)
** Too dependent at times on the popularity of a site, almost in a sophistic manner.
** See Reddit
* How do you programmatically distinguish satire from fact?
* The web's structure is also “shaped by inbound links but would be nice a bit more”
* Infrastructure doesn't need to change per se.
** The distributed architecture should stay. Centralization of control over allowed information and access is a terrible power. See China and the Middle East.
** Information, for the most part, in itself, exists centrally (as per-page), though communities (to use a generic term) are distributed.
* Need more sophisticated natural language processing.

== Class discussion ==

Focusing on vision, not the mechanism.

* Reverse linking
* Distributed content distribution (glorified cache)
** Both for privacy and redundancy reasons
** Suggested centralized content certification, but that doesn't address the problem of root of trust and distributed consistency checking.
*** Distributed key management is a holy grail
*** What about detecting large-scale subversion attempts, like in China?
* What is the new revenue model?
** What was TBL's revenue model (tongue-in-cheek, none)?
** Organisations like Google monetized the internet, and this mechanism could destroy their ability to do so.
* Search work is semi-distributed. Suggested letting the web do the work for you.
* Trying to structure content in a manner simultaneously palatable to both humans and machines.
* Using spare CPU time on servers for natural language processing (or other AI) of cached or locally available resources.
* Imagine a smushed Wolfram Alpha, Google, Wikipedia, and Watson, and then distributed over the net.
* The document was TBL's idea of the atom of content, whereas nowadays we really need something more granular.
* We want to extract higher-level semantics.
* Google may not be pure keyword search anymore. It is essentially AI now, but we still struggle with expressing what we want to Google.
* What about the adversarial aspect of content hosters, vying for attention?
* People do actively try to fool you.
* Compare to Google News, though that is very specific to that domain. Their vision is a semantic web, but they are incrementally building it.
* In a scary fashion, Google is one of the central points of failure of the web. Even scarier are the less technically competent people who depend on Facebook for that.
* There is a semantic gap between how we express and query information, and how AI understands it.
* Can think of Facebook as a distributed human search infrastructure.
* A core service of an operating system is locating information. '''Search is infrastructure.'''
* The problem is not purely technical. There are political and social aspects.
** Searching for a file on a local filesystem should have an unambiguous answer.
** Asking the web is a different thing. “What is the best chocolate bar?”
* Is the web a network database, as understood in COMP 3005, which we consider harmful?
* For two-way links, there is the problem of restructuring data and all the dependencies.
* Privacy issues when tracing paths across the web.
* What about the problem of information revocation?
* Need more augmented reality, as well as distributed payment and micropayment systems.
* We need distributed, mutually untrusting social networks.
** Now we have the problem of storage and computation, but also take away some of the monetizable aspect.
* Distribution is not free. It is very expensive in very funny ways.
* The dream of harvesting all the computational power of the internet is not new.
** Startups have come and gone many times over that problem.
* Google's indexers understand many documents on the web quite well. However, Google only '''presents''' a primitive keyword-like interface. It doesn't expose the ontology.
* Organising information does not necessarily mean applying an ontology to it.
* The organisational methods we now use don't use ontologies, but rather are supplemented by them.
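
The "glorified cache" idea from the discussion above can be illustrated with content addressing, where a document is named by the hash of its bytes so that a copy fetched from an untrusted peer can be checked locally. This is only a sketch under assumptions: <code>PeerCache</code> and <code>fetch_verified</code> are hypothetical names, and the class did not settle on any mechanism. It also leaves open the root-of-trust problem noted above, since the reader still has to learn the correct digest from somewhere trustworthy.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Name a piece of content by the SHA-256 digest of its bytes."""
    return hashlib.sha256(data).hexdigest()


class PeerCache:
    """Hypothetical untrusted peer that stores content by digest."""

    def __init__(self):
        self.store = {}

    def put(self, data: bytes) -> str:
        cid = content_id(data)
        self.store[cid] = data
        return cid

    def get(self, cid: str):
        return self.store.get(cid)


def fetch_verified(peer: PeerCache, cid: str):
    """Return the content only if it still hashes to the requested digest."""
    data = peer.get(cid)
    if data is None or content_id(data) != cid:
        return None
    return data


peer = PeerCache()
cid = peer.put(b"lecture notes, revision 1")
print(fetch_verified(peer, cid) == b"lecture notes, revision 1")  # True

# Simulate a peer that silently replaces the cached copy.
peer.store[cid] = b"subverted copy"
print(fetch_verified(peer, cid))  # None
```

The check detects a tampered copy but says nothing about whether the original content was true, which is the separate authentication problem raised in the group notes.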

Adding a couple of related points Anil mentioned during the discussion:

-Distributed key management is a holy grail; no one has ever managed to get it working.

-Nowadays, databases have become important building blocks of the distributed operating system. Anil stressed that databases can in fact be considered an OS service these days.

-The question “How do you navigate the complex information space?” has remained a prominent one that the web has always faced.

DistOS 2014W Lecture 6

2014-01-26T02:30:30Z

Sjoy: Undo revision 18502 by Sjoy (talk)

== Group Discussion on "The Early Web" ==

Questions to discuss:

# How do you think the web would have been if not like the present way?
# What kind of infrastructure changes would you like to make?

=== Group 1 ===
: Relatively satisfied with the present structure of the web. Changes were suggested in the following areas:
* Make use of the greater potential of protocols.
* More communication and interaction capabilities.
* Implementation changes in present payment systems, for example the use of "Micro-computation" (a discussion we will get back to in future classes) and cryptographic currencies.
* Augmented reality.
* More emphasis on individual privacy.

=== Group 2 ===
==== Problem of unstructured information ====
A large portion of the web serves content that is overwhelmingly concerned with presentation rather than with structuring content. Tim Berners-Lee himself bemoaned the death of the semantic web. His original vision of it was as follows:


<blockquote>I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A "Semantic Web", which makes this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The "intelligent agents" people have touted for ages will finally materialize.<ref>{{cite book |last=Berners-Lee |first=Tim |authorlink=Tim Berners-Lee |coauthors=Fischetti, Mark |title=Weaving the Web |publisher=HarperSanFrancisco |year=1999 |pages=chapter 12 |isbn=978-0-06-251587-2 |nopp=true }}</ref></blockquote>

For this vision to be true, information arguably needs to be structured, maybe even classified. The idea of a universal information classification system has been floated. The modern web is mostly developed by software developers and similar, not librarians and the like.



Also, how does one differentiate satire from fact?

==== Valuation and deduplication of information ====
Another common problem with the current web is the duplication of information. Redundancy that increases the availability of information is not harmful in itself, but is ad-hoc duplication of the information itself harmful?

One then comes to the problem of assigning a value to the information found. How does one rate information, and according to what criteria? How does one authenticate the information? Often, popularity is used as an indicator of veracity, almost in a sophistic manner. See the excessive reliance on Google page ranking for research and on Reddit score for news consumption.

=== On the current infrastructure ===
The current internet infrastructure should remain as is, at least in countries with more than a modicum of freedom of access to information. Centralization of control of access to information is a terrible power. See China and parts of the Middle East. On that note, what can be said of popular sites, such as Google or Wikipedia, that serve as the main entry point for many access patterns?

The problem, if any, with the current web infrastructure lies in the web itself, not the internet.

=== Group 3 ===
* What we want to keep
** Linking mechanisms
** Minimum permissions to publish
* What we don't like
** Relying on one source for a document
** Privacy links for security
* Proposal
** Peer-to-peer or distributed mechanisms for documents
** Reverse links with caching (a distributed cache)
** More availability for the user: what happens when a system fails?
** Key management needs to be considered: is a centralized or a distributed mechanism better?

=== Group 4 ===
* The idea of the web searching for us
* A suggestion of how different the web would be had it been implemented by "AI" people
** AI programs searching for data, a notion Google is already slowly implementing.
* Generate report forums
* HTML equivalent is inspired by the AI communication
* Higher semantics apart from just indexing the data
** Problem : "How to bridge the semantic gap?"
** Search for more data patterns

== Group design exercise — The web that could be ==

* “The web that wasn't” mentioned the moans of librarians.
* A universal classification system is needed.
* The training overhead of classifiers (e.g., librarians) is high. See the master's that a librarian would need.
* More structured content, in both classification and organization
* Current indexing works by crude brute-force searching for words, etc., rather than by searching metadata
* Information doesn't have the same persistence, see bitrot and Vint Cerf's talk.
* Too concerned with presentation now.
* Tim Berners-Lee bemoaning the death of the semantic web.
* The problem of information duplication when information gets redistributed across the web. However, we do want redundancy.
* Too much of the web is developed by software developers.
* Too reliant on Google for web structure
** See search-engine optimization
* Problem of authentication (of the information, not the presenter)
** Too dependent at times on the popularity of a site, almost in a sophistic manner.
** See Reddit
* How do you programmatically distinguish satire from fact?
* The web's structure is also “shaped by inbound links but would be nice a bit more”
* Infrastructure doesn't need to change per se.
** The distributed architecture should stay. Centralization of control over allowed information and access is a terrible power. See China and the Middle East.
** Information, for the most part, in itself, exists centrally (as per-page), though communities (to use a generic term) are distributed.
* Need more sophisticated natural language processing.

== Class discussion ==

Focusing on vision, not the mechanism.

* Reverse linking
* Distributed content distribution (glorified cache)
** Both for privacy and redundancy reasons
** Suggested centralized content certification, but that doesn't address the problem of root of trust and distributed consistency checking.
*** Distributed key management is a holy grail
*** What about detecting large-scale subversion attempts, like in China?
* What is the new revenue model?
** What was TBL's revenue model (tongue-in-cheek, none)?
** Organisations like Google monetized the internet, and this mechanism could destroy their ability to do so.
* Search work is semi-distributed. Suggested letting the web do the work for you.
* Trying to structure content in a manner simultaneously palatable to both humans and machines.
* Using spare CPU time on servers for natural language processing (or other AI) of cached or locally available resources.
* Imagine a smushed Wolfram Alpha, Google, Wikipedia, and Watson, and then distributed over the net.
* The document was TBL's idea of the atom of content, whereas nowadays we really need something more granular.
* We want to extract higher-level semantics.
* Google may not be pure keyword search anymore. It is essentially AI now, but we still struggle with expressing what we want to Google.
* What about the adversarial aspect of content hosters, vying for attention?
* People do actively try to fool you.
* Compare to Google News, though that is very specific to that domain. Their vision is a semantic web, but they are incrementally building it.
* In a scary fashion, Google is one of the central points of failure of the web. Even scarier are the less technically competent people who depend on Facebook for that.
* There is a semantic gap between how we express and query information, and how AI understands it.
* Can think of Facebook as a distributed human search infrastructure.
* A core service of an operating system is locating information. '''Search is infrastructure.'''
* The problem is not purely technical. There are political and social aspects.
** Searching for a file on a local filesystem should have an unambiguous answer.
** Asking the web is a different thing. “What is the best chocolate bar?”
* Is the web a network database, as understood in COMP 3005, which we consider harmful?
* For two-way links, there is the problem of restructuring data and all the dependencies.
* Privacy issues when tracing paths across the web.
* What about the problem of information revocation?
* Need more augmented reality, as well as distributed payment and micropayment systems.
* We need distributed, mutually untrusting social networks.
** Now we have the problem of storage and computation, but also take away some of the monetizable aspect.
* Distribution is not free. It is very expensive in very funny ways.
* The dream of harvesting all the computational power of the internet is not new.
** Startups have come and gone many times over that problem.
* Google's indexers understand many documents on the web quite well. However, Google only '''presents''' a primitive keyword-like interface. It doesn't expose the ontology.
* Organising information does not necessarily mean applying an ontology to it.
* The organisational methods we now use don't use ontologies, but rather are supplemented by them.

Adding a couple of related points Anil mentioned during the discussion:
-Distributed key management is a holy grail; no one has ever managed to get it working.
-Nowadays, databases have become important building blocks of the distributed operating system. Anil stressed that databases can in fact be considered an OS service these days.
-The question “How do you navigate the complex information space?” has remained a prominent one that the web has always faced.

DistOS 2014W Lecture 6

2014-01-26T02:29:49Z

Sjoy: /* Class discussion */

== Group Discussion on "The Early Web" ==

Questions to discuss:

# How do you think the web would have been if not like the present way?
# What kind of infrastructure changes would you like to make?

=== Group 1 ===
: Relatively satisfied with the present structure of the web. Changes were suggested in the following areas:
* Make use of the greater potential of protocols.
* More communication and interaction capabilities.
* Implementation changes in present payment systems, for example the use of "Micro-computation" (a discussion we will get back to in future classes) and cryptographic currencies.
* Augmented reality.
* More emphasis on individual privacy.

=== Group 2 ===
==== Problem of unstructured information ====
A large portion of the web serves content that is overwhelmingly concerned with presentation rather than with structuring content. Tim Berners-Lee himself bemoaned the death of the semantic web. His original vision of it was as follows:


<blockquote>I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A "Semantic Web", which makes this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The "intelligent agents" people have touted for ages will finally materialize.<ref>{{cite book |last=Berners-Lee |first=Tim |authorlink=Tim Berners-Lee |coauthors=Fischetti, Mark |title=Weaving the Web |publisher=HarperSanFrancisco |year=1999 |pages=chapter 12 |isbn=978-0-06-251587-2 |nopp=true }}</ref></blockquote>

For this vision to be true, information arguably needs to be structured, maybe even classified. The idea of a universal information classification system has been floated. The modern web is mostly developed by software developers and similar, not librarians and the like.



Also, how does one differentiate satire from fact?

==== Valuation and deduplication of information ====
Another common problem with the current web is the duplication of information. Redundancy that increases the availability of information is not harmful in itself, but is ad-hoc duplication of the information itself harmful?

One then comes to the problem of assigning a value to the information found. How does one rate information, and according to what criteria? How does one authenticate the information? Often, popularity is used as an indicator of veracity, almost in a sophistic manner. See the excessive reliance on Google page ranking for research and on Reddit score for news consumption.

=== On the current infrastructure ===
The current internet infrastructure should remain as is, at least in countries with more than a modicum of freedom of access to information. Centralization of control of access to information is a terrible power. See China and parts of the Middle East. On that note, what can be said of popular sites, such as Google or Wikipedia, that serve as the main entry point for many access patterns?

The problem, if any, with the current web infrastructure lies in the web itself, not the internet.

=== Group 3 ===
* What we want to keep
** Linking mechanisms
** Minimum permissions to publish
* What we don't like
** Relying on one source for a document
** Privacy links for security
* Proposal
** Peer-to-peer or distributed mechanisms for documents
** Reverse links with caching (a distributed cache)
** More availability for the user: what happens when a system fails?
** Key management needs to be considered: is a centralized or a distributed mechanism better?

=== Group 4 ===
* The idea of the web searching for us
* A suggestion of how different the web would be had it been implemented by "AI" people
** AI programs searching for data, a notion Google is already slowly implementing.
* Generate report forums
* HTML equivalent is inspired by the AI communication
* Higher semantics apart from just indexing the data
** Problem : "How to bridge the semantic gap?"
** Search for more data patterns

== Group design exercise — The web that could be ==

* “The web that wasn't” mentioned the moans of librarians.
* A universal classification system is needed.
* The training overhead of classifiers (e.g., librarians) is high. See the master's that a librarian would need.
* More structured content, in both classification and organization
* Current indexing works by crude brute-force searching for words, etc., rather than by searching metadata
* Information doesn't have the same persistence, see bitrot and Vint Cerf's talk.
* Too concerned with presentation now.
* Tim Berners-Lee bemoaning the death of the semantic web.
* The problem of information duplication when information gets redistributed across the web. However, we do want redundancy.
* Too much of the web is developed by software developers.
* Too reliant on Google for web structure
** See search-engine optimization
* Problem of authentication (of the information, not the presenter)
** Too dependent at times on the popularity of a site, almost in a sophistic manner.
** See Reddit
* How do you programmatically distinguish satire from fact?
* The web's structure is also “shaped by inbound links but would be nice a bit more”
* Infrastructure doesn't need to change per se.
** The distributed architecture should stay. Centralization of control over allowed information and access is a terrible power. See China and the Middle East.
** Information, for the most part, in itself, exists centrally (as per-page), though communities (to use a generic term) are distributed.
* Need more sophisticated natural language processing.

== Class discussion ==

Focusing on vision, not the mechanism.

* Reverse linking
* Distributed content distribution (glorified cache)
** Both for privacy and redundancy reasons
** Suggested centralized content certification, but that doesn't address the problem of root of trust and distributed consistency checking.
*** Distributed key management is a holy grail
*** What about detecting large-scale subversion attempts, like in China?
* What is the new revenue model?
** What was TBL's revenue model (tongue-in-cheek, none)?
** Organisations like Google monetized the internet, and this mechanism could destroy their ability to do so.
* Search work is semi-distributed. Suggested letting the web do the work for you.
* Trying to structure content in a manner simultaneously palatable to both humans and machines.
* Using spare CPU time on servers for natural language processing (or other AI) of cached or locally available resources.
* Imagine a smushed Wolfram Alpha, Google, Wikipedia, and Watson, and then distributed over the net.
* The document was TBL's idea of the atom of content, whereas nowadays we really need something more granular.
* We want to extract higher-level semantics.
* Google may not be pure keyword search anymore. It is essentially AI now, but we still struggle with expressing what we want to Google.
* What about the adversarial aspect of content hosters, vying for attention?
* People do actively try to fool you.
* Compare to Google News, though that is very specific to that domain. Their vision is a semantic web, but they are incrementally building it.
* In a scary fashion, Google is one of the central points of failure of the web. Even scarier are the less technically competent people who depend on Facebook for that.
* There is a semantic gap between how we express and query information, and how AI understands it.
* Can think of Facebook as a distributed human search infrastructure.
* A core service of an operating system is locating information. '''Search is infrastructure.'''
* The problem is not purely technical. There are political and social aspects.
** Searching for a file on a local filesystem should have an unambiguous answer.
** Asking the web is a different thing. “What is the best chocolate bar?”
* Is the web a network database, as understood in COMP 3005, which we consider harmful?
* For two-way links, there is the problem of restructuring data and all the dependencies.
* Privacy issues when tracing paths across the web.
* What about the problem of information revocation?
* Need more augmented reality, as well as distributed payment and micropayment systems.
* We need distributed, mutually untrusting social networks.
** Now we have the problem of storage and computation, but also take away some of the monetizable aspect.
* Distribution is not free. It is very expensive in very funny ways.
* The dream of harvesting all the computational power of the internet is not new.
** Startups have come and gone many times over that problem.
* Google's indexers understand many documents on the web quite well. However, Google only '''presents''' a primitive keyword-like interface. It doesn't expose the ontology.
* Organising information does not necessarily mean applying an ontology to it.
* The organisational methods we now use don't use ontologies, but rather are supplemented by them.

Adding a couple of related points Anil mentioned during the discussion:
-Distributed key management is a holy grail; no one has ever managed to get it working.

-Nowadays, databases have become important building blocks of the distributed operating system. Anil stressed that databases can in fact be considered an OS service these days.

-The question “How do you navigate the complex information space?” has remained a prominent one that the web has always faced.

DistOS 2014W Lecture 6

2014-01-26T02:28:56Z

Sjoy: Adding couple of related points Anil mentioned during the discussion

== Group Discussion on "The Early Web" ==

Questions to discuss:

# How do you think the web would have been if not like the present way?
# What kind of infrastructure changes would you like to make?

=== Group 1 ===
: Relatively satisfied with the present structure of the web some changes suggested are in the below areas:
* Make use of the greater potential of Protocols
* More communication and interaction capabilities.
* Implementation changes in present payment systems, for example the use of "Micro-computation" (a discussion we will get back to in future classes) and cryptographic currencies.
* Augmented reality.
* More towards individual privacy.

=== Group 2 ===
==== Problem of unstructured information ====
A large portion of the web serves content that is overwhelmingly concerned with presentation rather than with structuring content. Tim Berners-Lee himself bemoaned the death of the semantic web. His original vision of it was as follows:


<blockquote>I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A "Semantic Web", which makes this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The "intelligent agents" people have touted for ages will finally materialize.<ref>{{cite book |last=Berners-Lee |first=Tim |authorlink=Tim Berners-Lee |coauthors=Fischetti, Mark |title=Weaving the Web |publisher=HarperSanFrancisco |year=1999 |pages=chapter 12 |isbn=978-0-06-251587-2 |nopp=true }}</ref></blockquote>

For this vision to be true, information arguably needs to be structured, maybe even classified. The idea of a universal information classification system has been floated. The modern web is mostly developed by software developers and similar, not librarians and the like.



Also, how does one differentiate satire from fact?

==== Valuation and deduplication of information ====
Another problem common with the current WWW is the duplication of information. Redundancy that increases the availability of information is not in itself harmful, but is ad-hoc duplication of the information itself harmful?

One then comes to the problem of assigning a value to the information found therein. How does one rate information, and according to what criteria? How does one authenticate the information? Often, popularity is used as an indicator of veracity, almost in a sophistic manner. See the excessive reliance on Google page ranking for research and on Reddit scores for news consumption.

=== On the current infrastructure ===
The current internet infrastructure should remain as is, at least in countries with more than a modicum of freedom of access to information. Centralization of control of access to information is a terrible power. See China and parts of the Middle East. On that note, what can be said of popular sites, such as Google or Wikipedia, that serve as the main entry point for many access patterns?

The problem, if any, in the current web infrastructure is of the web itself, not the internet.

=== Group 3 ===
* What we want to keep
** Linking mechanisms
** Minimum permissions to publish
* What we don't like
** Relying on one source for document
** Privacy links for security
* Proposal
** Peer-to-peer distributed mechanisms for documenting
** Reverse links with caching - distributed cache
** More availability for user - what happens when system fails?
** Key management to be considered - Is it good to have centralized or distributed mechanism?
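
The "reverse links" proposal can be illustrated with a small sketch. This is only an assumption about what such an index might look like: the page names are hypothetical, and a real system would have to distribute and cache this map rather than hold it in one dictionary.

```python
from collections import defaultdict

def build_backlinks(forward_links):
    """Invert a page -> outgoing-links map into a page -> incoming-links map."""
    backlinks = defaultdict(set)
    for source, targets in forward_links.items():
        for target in targets:
            backlinks[target].add(source)
    # Sort for a stable, readable result.
    return {page: sorted(sources) for page, sources in backlinks.items()}

# Hypothetical three-page web (names are made up for illustration).
forward_links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": [],
}
backlinks = build_backlinks(forward_links)
print(backlinks)
```

Note that every change to a page's outgoing links changes the reverse index of other pages, which is one reason two-way links are expensive to keep consistent.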

=== Group 4 ===
* An idea of web searching for us
* A suggestion of a different web if it would have been implemented by "AI" people
** AI programs searching for data - A notion already being implemented by Google slowly.
* Generate report forums
* HTML equivalent is inspired by the AI communication
* Higher semantics apart from just indexing the data
** Problem : "How to bridge the semantic gap?"
** Search for more data patterns

== Group design exercise — The web that could be ==

* “The web that wasn't” mentioned the moans of librarians.
* A universal classification system is needed.
* The training overhead of classifiers (e.g., librarians) is high. See the master's that a librarian would need.
* More structured content, both classification, and organization
* Current indexing by crude brute-force searching for words, etc., rather than searching metadata
* Information doesn't have the same persistence, see bitrot and Vint Cerf's talk.
* Too concerned with presentation now.
* Tim Berners-Lee bemoaning the death of the semantic web.
* The problem of information duplication when information gets redistributed across the web. However, we do want redundancy.
* Too much developed by software developers
* Too reliant on Google for web structure
** See search-engine optimization
* Problem of authentication (of the information, not the presenter)
** Too dependent at times on the popularity of a site, almost in a sophistic manner.
** See Reddit
* How do you programmatically distinguish satire from fact
* The web's structure is also “shaped by inbound links but would be nice a bit more”
* Infrastructure doesn't need to change per se.
** The distributed architecture should still stay. Centralization of control of allowed information and access is terrible power. See China and the Middle-East.
** Information, for the most part, in itself, exists centrally (as per-page), though communities (to use a generic term) are distributed.
* Need more sophisticated natural language processing.

== Class discussion ==

Focusing on vision, not the mechanism.

* Reverse linking
* Distributed content distribution (glorified cache)
** Both for privacy and redundancy reasons
** Suggested centralized content certification, but doesn't address the problem of root of trust and distributed consistency checking.
*** Distributed key management is a holy grail
*** What about detecting large-scale subversion attempts, like in China
* What is the new revenue model?
** What was TBL's revenue model (tongue-in-cheek, none)?
** Organisations like Google monetized the internet, and this mechanism could destroy their ability to do so.
* Search work is semi-distributed. Suggested letting the web do the work for you.
* Trying to structure content in a manner simultaneously palatable to both humans and machines.
* Using spare CPU time on servers for natural language processing (or other AI) of cached or locally available resources.
* Imagine a smushed Wolfram Alpha, Google, Wikipedia, and Watson, and then distributed over the net.
* The document was TBL's idea of the atom of content, whereas nowadays we really need something more granular.
* We want to extract higher-level semantics.
* Google may not be pure keyword search anymore. It is essentially AI now, but we still struggle with expressing what we want to Google.
* What about the adversarial aspect of content hosters, vying for attention?
* People do actively try to fool you.
* Compare to Google News, though that is very specific to that domain. Their vision is a semantic web, but they are incrementally building it.
* In a scary fashion, Google is one of the central points of failure of the web. Even scarier are the less technically competent people who depend on Facebook for that.
* There is a semantic gap between how we express and query information, and how AI understands it.
* Can think of Facebook as a distributed human search infrastructure.
* A core service of an operating system is locating information. '''Search is infrastructure.'''
* The problem is not purely technical. There are political and social aspects.
** Searching for a file on a local filesystem should have an unambiguous answer.
** Asking the web is a different thing. “What is the best chocolate bar?”
* Is the web a network database, as understood in COMP 3005, which we consider harmful?
* For two-way links, there is the problem of restructuring data and all the dependencies.
* Privacy issues when tracing paths across the web.
* What about the problem of information revocation?
* Need more augmented reality and distributed and micro payment systems.
* We need distributed, mutually untrusting social networks.
** Now we have the problem of storage and computation, but also take away some of the monetizable aspect.
* Distribution is not free. It is very expensive in very funny ways.
* The dream of harvesting all the computational power of the internet is not new.
** Startups have come and gone many times over that problem.
* Google's indexers understand many documents on the web quite well. However, Google only '''presents''' a primitive keyword-like interface. It doesn't expose the ontology.
* Organising information does not necessarily mean applying an ontology to it.
* The organisational methods we now use don't use ontologies, but rather are supplemented by them.

A couple of related points Anil mentioned during the discussion:
-Distributed key management is a holy grail; no one has ever managed to get it working.
-Nowadays databases have become important building blocks of the distributed operating system. Anil stressed that databases can in fact be considered an OS service these days.
-The question “How do you navigate a complex information space?” has remained a prominent question that the Web has always faced.

DistOS 2014W Lecture 2

2014-01-21T22:01:23Z

Sjoy: though I am not the one who originally volunteered for the lecture notes for lecture 2 adding the notes I have in an attempt to capture the essence of discussions we had on the day

Anil's view is that, though a single system view is desirable, it is a fool's errand to think that a 100% transparent single system view is achievable in the distributed operating system paradigm. An important question to answer when designing a DOS is how to ensure order in the distributed system, in terms of (a) how to control the participating systems and (b) how knowledge about the constituent systems is passed around. It therefore becomes critical to look at how a DOS achieves abstraction, and how it enables and makes use of abstractions in realizing the system you want to use or program. One aspect to consider in this context is how well a DOS can realize the standard abstractions: virtual memory, file system storage, etc.

To build a distributed system (scalable, reliable, maintainable) that works well, it has to be based on abstractions representing the required system rather than on illusions. Anil pointed out the tendency to make a DOS behave like a centralized single-system OS as a prominent pitfall to avoid in the design of distributed operating systems. The analogy he used was that such an attempt "will be like an elephant riding a unicycle": such a system can be built, but it would not be efficient or fast.

We now have a working definition of a distributed OS, so we look a little closer at the underlying network. The internet (and thus the vast majority of distributed OS work today) runs over the [https://en.wikipedia.org/wiki/TCP_IP TCP and IP protocols].

Anil observed that the Dist. OS abstractions which succeed are ones that don't hide the network. For example, the remote procedure call (RPC) style abstractions have generally failed because they try to hide the untrusted nature of the network. The result has been a hodge-podge of firewall software which is primarily for blocking RPC-based protocols like SMB, NFS, etc. REST, on the other hand, has succeeded on the open web because it doesn't "hide the network" in this way.

In conclusion, Anil mentioned that the focus of the course will be on how to build new abstractions and software services that enable a distributed operating system, not on approaches that take a single-system operating system and try to turn it into a distributed one.

DistOS 2014W Lecture 4

2014-01-17T00:59:23Z

Sjoy: /* Ethernet, Networking protocols */

Discussions on the Alto

==CPU, Memory, Disk==

====CPU====

The general hardware architecture of the CPU was biased towards the user, meaning that a greater focus was put on IO capabilities and less focus was put on computational power (arithmetic etc). There were two levels of task-switching; the CPU provided sixteen fixed-priority tasks with hardware interrupts, each of which was permanently assigned to a piece of hardware. Only one of these tasks (the lowest-priority) was dedicated to the user. This task actually ran a virtualized BCPL machine (a C-like language); the user had no access at all to the underlying microcode. Other languages could be emulated as well.

====Memory====

The Alto started with 64K of 16-bit words of memory and eventually grew to 256K words. However, the higher memory was not accessible except through special tricks, similar to the way that memory above 4GB is not accessible today on 32-bit systems without special tricks.

====Task Switching====

One thing that was confusing was that they refer to tasks both as the 16 fixed hardware tasks and the many software tasks that could be multiplexed onto the lowest-priority of those hardware tasks. In either case, task switching was cooperative; until a task gave up control by running a specific instruction, no other task could run. From a modern perspective this looks like a major security problem, since malicious software could simply never relinquish the CPU. However, the fact that hardware was first-class in this sense (with full access to the CPU and memory) made the hardware simpler because much of the complexity could be done in software. Perhaps the first hints of what we now think of as drivers?
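
A rough sketch of cooperative, fixed-priority task switching, using Python generators as stand-ins for tasks. This is a simplification under stated assumptions: the priority numbers, the task names, and the idea that every task is always ready to run are ours, not the paper's (on the Alto, hardware decides which tasks are requesting service).

```python
def task(name, steps, log):
    """A cooperative task: does one unit of work, then gives up the processor."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # explicit hand-back of control; nothing can preempt the task before this

def run(tasks):
    """Always resume the highest-priority task that still has work.

    Assumption for this sketch: a lower number means a higher priority.
    """
    ready = dict(tasks)  # priority -> generator
    while ready:
        priority = min(ready)
        try:
            next(ready[priority])
        except StopIteration:
            del ready[priority]  # task finished

log = []
run({0: task("disk", 2, log), 15: task("user", 3, log)})
print(log)
```

If a task's loop never reached its `yield`, `run` would never regain control, which is exactly the weakness of cooperative switching noted above.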

====Disk and Filesystem====

To make use of the disk controller, commands such as read, write, truncate, and delete were made available. To reduce the risk of global damage, structural information was saved in a label on each page. A hints mechanism was also available: the directory gives a hint of where a file resides on disk. File integrity was checked using a seal bit and the label.
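
One way to picture the label-and-hint idea is the sketch below. It is our reading of the mechanism, not the Alto's actual on-disk format: the `Page` fields, the dictionary standing in for the disk, and the linear scan used as a fallback are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Page:
    file_id: int   # label: which file owns this page
    page_no: int   # label: position of the page within that file
    data: bytes

def read_page(disk, file_id, page_no, hint):
    """Try the hinted disk address first; trust it only if the label matches."""
    page = disk.get(hint)
    if page is not None and (page.file_id, page.page_no) == (file_id, page_no):
        return page.data  # the hint was correct
    # The hint was stale or wrong: fall back to checking every label.
    for page in disk.values():
        if (page.file_id, page.page_no) == (file_id, page_no):
            return page.data
    raise FileNotFoundError((file_id, page_no))

# Hypothetical disk: address -> page.
disk = {10: Page(7, 0, b"hello"), 11: Page(7, 1, b"world")}
print(read_page(disk, 7, 1, hint=11))  # good hint
print(read_page(disk, 7, 1, hint=10))  # bad hint, still finds the page
```

The point of the design is that a wrong hint costs time but never correctness, because the label on the page itself is the authority.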

==Ethernet, Networking protocols==
Although the original motive of the Alto as a personal computer was to serve the needs of a single user, it was recognized that communicating with other Altos and computers would facilitate resource sharing, for both collaborative and economic reasons. The main design objectives for the computer network connecting personal computers (Altos) were:

Data transmission speed: Bandwidth which should at least match the memory bus speed to give the end user a consistent notion that the resources accessed over the network should also have the same latency as compared to resource accessed within the computer

Size of network: Capability to connect large number of nodes together

Reliability: Once the user starts to use resources/service over a network it is vital to ensure that the network is reliable enough so that the user gets the quality of service required.

Alto uses a general packet transport system which can be thought of as a set of standard communication protocols towards facilitating interoperability.

The key element enabling communication between the Alto and other computers was the Ethernet, a Layer 2 protocol and mechanism developed in-house at Xerox by Robert Metcalfe et al. The Ethernet was a broadcast, packet-switched network with a bandwidth of 3 Mbit/s which could connect 256 computers together and allowed a distance of up to 1 km between two connected nodes. Another important aspect of Ethernet was that nodes could be added, removed, powered on, or powered off without disturbing existing network communications. Since Ethernet offered only best-effort service with no guarantee of error-free delivery, a hierarchy of layered communication protocols was implemented in the Alto to achieve reliable communication over it.

Alto had the capability to act as a gateway connecting different networks together. Xerox had a “Xerox Internet” consisting of several hundred computers, 25 networks and 20 gateways providing internet service back in 1979.

The Ethernet communications system had two components: the Ethernet controller and the transceiver. The controller performed the encoding/decoding, buffering, and micromachine interfacing functions, whereas the transceiver dealt with the transmission and reception of bits and operated in half-duplex mode.

One important difference in the design of the Ethernet controller task, as opposed to the ones for display and disk, was that there were no periodic events to wake this task up. Instead, an S-group instruction was used to set a flip-flop in the Ethernet hardware, which in turn woke up the Ethernet controller task. The Ethernet also used an interrupt-based mechanism to indicate completion, since packet reception and transmission happen asynchronously. The Ethernet microcode implemented a packet filtering mechanism which accepted (1) packets destined for the host and (2) broadcast packets. It could also operate in a promiscuous mode, with the host address set to zero, receiving all packets; this could be used for debugging purposes.
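
The filtering rule described above can be sketched in a few lines. The function name is ours, and the use of destination address 0 to mark a broadcast packet is an assumption for this sketch rather than something stated in these notes.

```python
BROADCAST = 0  # assumption: destination 0 marks a broadcast packet

def accepts(host_address, destination):
    """Decide whether a station with this host address keeps an incoming packet."""
    if host_address == 0:
        return True  # promiscuous mode: host address zero receives everything
    return destination == host_address or destination == BROADCAST

print(accepts(5, 5))  # addressed to us
print(accepts(5, 9))  # addressed to another station
print(accepts(0, 9))  # promiscuous station sees it anyway
```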

Ethernet had no security mechanism built into it. Since Ethernet was a single collision domain, an exponential backoff algorithm was implemented to reduce collisions (which occur when two Ethernet transmitters try to use the ether at the same time).
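
As a minimal sketch of binary exponential backoff: after each consecutive collision the range of possible waiting times doubles, so competing transmitters become less likely to pick the same moment again. The function name, the slot-time unit, and the cap of 8 doublings are assumptions for illustration, not the Alto's actual parameters.

```python
import random

def backoff_slots(collisions, max_exponent=8, rng=random):
    """Pick a random wait, in slot times, after this many consecutive collisions.

    max_exponent is a hypothetical cap on how far the range is allowed to grow.
    """
    exponent = min(collisions, max_exponent)
    return rng.randrange(2 ** exponent)  # uniform in [0, 2**exponent - 1]

for collisions in range(1, 5):
    print(collisions, backoff_slots(collisions))
```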

==Graphics, Mouse, Printing==

===Graphics===

A lot of time was spent on what paper and ink provide us in a display sense, constantly referencing an 8.5 by 11 inch piece of paper as the type of display they were striving for. This showed what they were attempting to emulate in the Alto's display. The authors proposed 500 - 1000 black or white bits per inch of display (i.e. 500 - 1000 dpi). However, they were unable to pursue this goal, instead settling for 70 dpi for the display, allowing them to show things such as 10 pt text. They state that a 30 Hz refresh rate was found to not be objectionable. Interestingly, however, we would find this objectionable today, most likely from being spoiled by the sheer speed of computers today, whereas the authors were used to slower performance. The Alto's display took up '''half''' the Alto's memory, a choice we found very interesting.
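
A back-of-the-envelope check of the "half the memory" observation, assuming one bit per pixel, a full 8.5 by 11 inch area at 70 dpi, and the Alto's initial 64K words of 16 bits (the exact screen dimensions are not given in these notes, so the page-sized area is our assumption):

```python
DPI = 70
WIDTH_IN, HEIGHT_IN = 8.5, 11
MEMORY_BITS = 64 * 1024 * 16  # 64K words of 16 bits each

width_px = round(WIDTH_IN * DPI)
height_px = round(HEIGHT_IN * DPI)
bitmap_bits = width_px * height_px  # one bit per black-or-white pixel
fraction = bitmap_bits / MEMORY_BITS

print(f"{width_px} x {height_px} pixels = {bitmap_bits} bits")
print(f"{fraction:.0%} of a 64K-word memory")
```

Under these assumptions the bitmap comes to a bit under half of main memory, which is in line with the figure quoted above.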

Another interesting point was that the authors state that they thought it was beneficial that they could access display memory directly rather than using conventional frame buffer organizations. While we are unsure of what they meant by conventional frame buffer organizations, it is interesting to note that frame buffers are what we use today for our displays.

===Mouse===

The mouse outlined in the paper was 200 dpi (vs. a standard mouse from Apple which is 1300 dpi) and had three buttons (one of the standard configurations of mice that are produced today). They were already using different mouse cursors (i.e., the pointer image of the cursor on screen). The real interesting point here is that the design outlined in the paper was so similar to designs we still use today. The only real divergence was the use of optical mice, although the introduction of optical mice did not altogether halt the use of non-optical mice. Today, we just have more flexibility with regards to how we design mice (e.g., having a scroll wheel, more buttons, etc.).

===Printer===

They state that the printer should print, in one second, an 8.5 by 11 inch page defined with 350 dots/inch (roughly 4000 horizontal scan lines of 3000 dots each). Ironically enough, this is not even what they had wanted for the actual Alto display. However, they did not have enough memory to do this and had to work around it by using things such as an incremental algorithm and reducing the number of scan lines. We were disappointed that they did not actually discuss the hardware implementation of the printer, only the software controller. However, it is interesting that dividing the memory requirements of the printer between the hardware itself and the computer was quite a modern idea at the time, and still is.
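
To see why memory was the obstacle, the page size can be worked out directly. This assumes one bit per dot and compares against the Alto's initial 64K 16-bit words; the variable names are ours.

```python
DPI = 350
WIDTH_IN, HEIGHT_IN = 8.5, 11
ALTO_MEMORY_BITS = 64 * 1024 * 16  # 64K words of 16 bits each

dots_per_line = round(WIDTH_IN * DPI)  # close to the "3000 dots" quoted above
scan_lines = round(HEIGHT_IN * DPI)    # close to the "4000 scan lines" quoted above
page_bits = dots_per_line * scan_lines
memories_needed = page_bits / ALTO_MEMORY_BITS

print(f"{scan_lines} scan lines of {dots_per_line} dots = {page_bits} bits per page")
print(f"about {memories_needed:.1f} times the initial Alto memory")
```

A full page bitmap is roughly eleven times the machine's initial memory, and printing it in one second means producing over eleven million bits per second, so generating the page incrementally was the only realistic option.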

===Other Interesting Notes===

We found it interesting that peripheral devices were included at all.

The author makes a passing mention to having a tablet to draw on. However, he stated that no one really liked having the tablet as it got in the way of the keyboard.

A recurring theme was the lack of memory to implement what they had originally envisioned.

==Applications, Programming Environment==

DistOS 2014W Lecture 4

2014-01-17T00:58:33Z

Sjoy: /* Ethernet, Networking protocols */

DistOS 2014W Lecture 3

2014-01-16T21:36:31Z

Sjoy:

Questions to consider:
* What were the purposes envisioned for computer networks? How do those compare with the uses they are put to today?
* What sort of resources were shared? What resources are shared today?
* What network architecture did they envision? Do we still have the same architecture?
* What surprised you about this paper?
* What was unclear?

==Group 1==
* video was mostly a summary of Kahn's paper
* process migration through different zones of air traffic control
* "distributed OS" meant something different than we normally think about: because many people would log in remotely to a single machine, it is very much like the cloud infrastructure that we talk about today
* the Alto paper makes reference to Kahn's paper, and the Alto designers had the foresight to see that networks like ARPANET would be necessary
* would it be useful to have a co-processor responsible for maintaining shared resources even today? Like the IMPs of the ARPANET? Today, computers are usually so fast it doesn't really matter.

=== Questions ===

* What were the purposes envisioned for computer networks?
** big computation, storage, resource sharing - "having a library on a hard disk"

* How do those compare with the uses they are put to today?
** those things are being done, but mostly communication like instant messaging, email

* What sort of resources were shared?
** databases, CPU time

* What resources are shared today?
** mostly storage

* What network architecture did they envision?
** they had a checksum and acknowledge on each packet
** the IMPs were the network interface and the routers
** packet-switching

* Do we still have the same architecture?
** packet-switching definitely won
** no, now IP doesn't checksum or acknowledge, but TCP has end-to-end checksum and acknowledge
** Kahn went on to learn from the errors of arpanet to design TCP/IP
** the job of network interface and router have been decoupled
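
The end-to-end checksum-and-acknowledge idea can be sketched as follows. The checksum is the 16-bit ones' complement sum used by the TCP/IP family; the `receive` function and the string "ACK" are simplifications of ours and not how TCP actually encodes acknowledgements.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over the data (padded to an even length)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold any carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def receive(payload: bytes, checksum: int):
    """Final destination: acknowledge only if the payload arrived intact."""
    return "ACK" if internet_checksum(payload) == checksum else None

payload = b"hello"
checksum = internet_checksum(payload)  # computed once, by the original sender
print(receive(payload, checksum))      # intact: acknowledged
print(receive(b"hellp", checksum))     # corrupted in transit: no acknowledgement
```

Intermediate routers just forward the packet; only the endpoints check and acknowledge, and a missing acknowledgement is what prompts the sender to retransmit.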

* What surprised you about this paper?
** everything
** how they were able to do this
** a network interface card and router was the size of a fridge
** high-level languages
** bootstrap protocol, bootstrapping an application
** primitive computers
** desktop publishing
** the logistics of running a cable from one university to another
** how old the idea of distributed operating systems is

* What was unclear?
** much of the more technical specifications, but we mostly skipped over those

==Group 2==
1. The main purpose of early networks was resource sharing. Abstraction for transmission. Message reliability was a by-product. The underlying idea is the same.

2. Specialized Hardware/software and information sharing. super set of sharing.

3. AD-HOC routing, it was TCP without saying it. Largely unchanged today.

==Group 3==
===Envisioned computer network purposes===
* Improving reliability of services, due to redundant resource sets
* Resource sharing
* Usage modes:
** Users can use a remote terminal, from a remote office or home, to access those resources.
** Would allow centralization of resources, to improve ease of management and do away with inefficiencies
* Allow specialization of various sites, rather than each site trying to do it all
* Distributed simulations (notably air traffic control)

Information-sharing is still relevant today, especially in research and large simulations. Remote access has mostly devolved into a specialized need.

===Resources shared===
* Computing resources (especially expensive mainframes)
* Data sets

===Network architecture===
* A primitive layered architecture
* Dedicated routing functions
* Various topologies:
** star
** loop
** bus
* Primarily (packet|message)-switched
** Circuit-switching too expensive and has large setup times
** Doesn't require committing resources
* Primitive flow control and buffering
* Predates proper congestion control such as Van Jacobson's slow start
* Ad-hoc routing or based on something similar to RIP
* Anticipation of elephants and mice latency issues
* Unlike modern internet, error control and retransmission at every step

The architecture today is similar, but the link layer is very different: use of Ethernet and ATM. The modern internet is a collection of autonomous systems, not a single network. Routing propagation is now large-scale and semi-automated (e.g., BGP externally, IS-IS and OSPF internally).
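
Routing "similar to RIP" means distance-vector routing: each node periodically tells its neighbours how far it thinks every destination is, and each neighbour keeps the cheapest path it hears about. A minimal sketch follows, assuming hop counts as the metric. The `merge_vector` function and the table layout are hypothetical names for illustration; the cap of 16 as "unreachable" is the RIP convention.

```python
UNREACHABLE = 16  # RIP caps the hop-count metric at 16, meaning "no route"

def merge_vector(table, neighbor, advertised, link_cost=1):
    """Fold a neighbour's advertised distances into our routing table.

    table maps destination -> (cost, next_hop) and is updated in place.
    Returns True if any route changed (a real router would then re-advertise).
    """
    changed = False
    for dest, cost in advertised.items():
        new_cost = min(cost + link_cost, UNREACHABLE)
        current = table.get(dest)
        if current is None:
            better = new_cost < UNREACHABLE
        else:
            known_cost, next_hop = current
            # Accept a cheaper path, or any change reported by the current next hop.
            better = new_cost < known_cost or (
                next_hop == neighbor and new_cost != known_cost
            )
        if better:
            table[dest] = (new_cost, neighbor)
            changed = True
    return changed
```

For example, a node that first hears from B that C is three hops away records `("C": (4, "B"))`; when D later advertises C at one hop, the entry becomes `(2, "D")`. Protocols such as OSPF and IS-IS instead flood link-state information and compute shortest paths locally.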

===Surprising aspects===

===Unclear portions===
* Weird packet format: Page 1400 (4 of PDF): “Node 6, discovering the message is for itself, replaces the destination address by the source address”

==Group 4==

* What were the purposes envisioned for computer networks? How do those compare with the uses they are put to today?

Networks were envisioned as providing remote access to other computers, because useful resources such as computing power, large databases, and non-portable software were local to a particular computer, not themselves shared over the network.

Today, we use networks mostly for sharing data, although with services like Amazon AWS, we're starting to share computing resources again. We're also moving to support collaboration (e.g. Google Docs, GitHub, etc.).

* What sort of resources were shared? What resources are shared today?

Computing power was the key resource being shared; today, it's access to data. (See above.)

* What network architecture did they envision? Do we still have the same architecture?

Surprisingly, yes: modern networks have substantially similar architectures to the ones described in these papers.
Packet-switched networks are now ubiquitous. We no longer bother with circuit-switching even for telephony, in contrast to the assumption that non-network data would continue to use the circuit-switched common-carrier network.

* What surprised you about this paper?

We were surprised by the accuracy of the predictions given how early the paper was written — even things like electronic banking. Also surprising were technological advances since the paper was written, such as data transfer speeds (we have networks that are faster than the integrated bus in the Alto), and the predicted resolution requirements (which we are nowhere near meeting). The amount of detail in the description of the 'mouse pointing device' was interesting too.

* What was unclear?

Nothing significant; we're looking at these with the benefit of hindsight.

==Summary of the discussion from lecture==
Anil's view is that even today we can think of computer networks primarily as a resource-sharing platform. For example, when we access the web or search Google, we are making use of the resource sharing facilitated by the Internet (a network of interconnected computer networks). It's not possible to put 20,000 computers in our basements; instead, the Internet facilitates access to computing power and databases built from hundreds of thousands of computers. In fact, Google and other popular search engines keep a local copy of the entire web in their data centers: a centralized copy of a large distributed system. This is somewhat contradictory if you think about it in terms of the design goals of a distributed system.

Another important takeaway from the discussion was that the "early to market/first player" with a new product or solution to a niche problem, and the one offering solutions based on simple mechanisms rather than complex ones, gets adopted faster. The classic example is the Internet. The ARPANET, which was meant to be an academic research project, was based on simple mechanisms, open, and the first of its kind; it was adopted widely and evolved into the Internet as we see it today. This approach is not without drawbacks: for example, security was not factored into the design of the ARPANET, since it was intended to be a network between trusted parties, which was fine at the time. But when the ARPANET evolved into the Internet, security became an area that required major attention. In Silicon Valley the focus is on being the "first player" in a niche market, and simple frameworks and mechanisms are often used to meet that objective. In doing so, there is a risk of leaving out components that turn out to be a vital missing link; a recent example is the security flaw in Snapchat that led to user data being exposed.

DistOS 2014W Lecture 3

2014-01-16T21:35:51Z

Sjoy: Summary of the discussion from lecture

Questions to consider:
* What were the purposes envisioned for computer networks? How do those compare with the uses they are put to today?
* What sort of resources were shared? What resources are shared today?
* What network architecture did they envision? Do we still have the same architecture?
* What surprised you about this paper?
* What was unclear?

==Group 1==
* video was mostly a summary of Kahn's paper
* process migration through different zones of air traffic control
* "distributed OS" meant something different than we normally think about, because many people would log in remotely to a single machine, it is very much like cloud infrastructure that we talk about today

DistOS 2014W Lecture 1

2014-01-13T04:45:32Z

Sjoy: add/edit information supplementing notes from Lecture 1

'''What is an OS?''' Here are some ideas of what it could mean:
* a hardware abstraction
* a consistent execution environment (i.e., code written to an interface; think portable code)
* manages I/O
* resource management/multiplexing
* communication infrastructure (for example, inter-process communication mechanisms) between the users (processes, applications) of the operating system

An OS can be defined by the role it plays in the programming of systems. It takes care of resource management and creates abstractions. An OS turns hardware into the computer/API/interface you WANT to program.

This is similar to how the browser is becoming the OS of the web. The browser is
the key abstraction needed to run web apps. It is the interface web developers target.
It doesn't matter what you consume a given website on (e.g., a phone, tablet,
etc.); the browser abstracts the device's hardware and OS away.

'''So, what's a distributed OS?'''

Anil prefers to think of this 'logically' rather than functionally/physically. This is
because the old distributed operating system (DOS) model applies to today's systems
(i.e., managing multiple cores, etc.). The traditional definition is systems that
manage their resources over a network.

A lot of these definitions are hard to peg down because simplicity always gets in
the way of truth. These concepts do not fit into well-defined classes.

'''Anil's definition''': "taking the distributed pieces of a system you have and
turning it into the system you WANT."

It is good to think about DOSes in terms of who or what is in control: who makes and enforces decisions. The traditional kernel-process model is a dictatorship, an authoritarian model of control: the kernel controls what lives or dies. The internet, by contrast, is decentralised (e.g., DNS). Distributed systems may have distributed policies where there is no single source of power. Even within the DOS paradigm we can see instances of authoritarian/centralized approaches, one example being the walled-garden model employed by Apple's iOS. Anil's observation is that centralized systems have an inherent fragility built in, and that these kinds of systems come into existence and disappear after a while; examples are AOL and Myspace. Even Facebook looks to be a possible candidate for a similar fate.

----

Yuan Liu's Notes

'''(Normal) Operating Systems'''

OS allows you to run on (slightly) different hardware. Functionalities and responsibilities of OSes include:

* abstracts hardware such that hardware resources can be accessed by software
* provides consistent execution environment (which hardware doesn't provide)
* manages I/O (such as user I/O and machine I/O, e.g. network I/O, sensors, video, etc.)
* manages resources via multiplexing
* multiplexing (sharing): one resource wanted by multiple users
* O/S turns the computer you have into a computer you want to program
* manages synchronization and concurrency issues
* resource management and abstraction
* uses policies to manage resources
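
Multiplexing one resource among several users can be illustrated with round-robin time slicing, one of the simplest scheduling policies. The sketch below is only an illustration of the idea, not how any particular kernel does it; `round_robin`, the job names, and the abstract "units of work" are assumptions.

```python
from collections import deque

def round_robin(jobs, quantum):
    """Share one CPU among jobs by giving each a fixed time slice in turn.

    jobs maps a job name to the (positive) units of work it needs;
    quantum is the (positive) slice length.
    Returns the (name, units_run) slices in the order they executed.
    """
    queue = deque(jobs.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        run = min(quantum, remaining)
        timeline.append((name, run))
        if remaining > run:
            # Not finished: go to the back of the line and wait for another turn.
            queue.append((name, remaining - run))
    return timeline
```

Calling `round_robin({"editor": 3, "compiler": 5}, 2)` interleaves the two jobs rather than running one to completion, which is what lets each user behave as if the machine were theirs alone. The quantum and the queue order are the policy; the slicing itself is the mechanism.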

'''Distributed O/S'''
* turns a distributed system (with its hardware) into a distributed system you want to program
* resource management: who is in charge?
* in local O/S, the kernel is the boss
* in distributed O/S, the control is decentralized
* different humans control their own machines
* has distributed policies for managing resources
* who decides control? different than local O/S

'''Other thoughts'''
* a more centralized system will become fragile later
* concentration of policy tends to fall apart in the future, according to Anil