DistOS 2014W Lecture 6

Revision as of 22:10, 26 February 2014

the point form notes for this lecture could be turned into full sentences/paragraphs

The Early Web (Jan. 23)

Berners-Lee et al., "World-Wide Web: The Information Universe" (1992), pp. 52-58
Alex Wright, "The Web That Wasn't" (2007), Google Tech Talk

Group Discussion on "The Early Web"

Questions to discuss:

What do you think the web would have looked like if it had not developed the way it did?
What kind of infrastructure changes would you like to make?

Group 1

Relatively satisfied with the present structure of the web. Some changes are suggested in the following areas:

Make greater use of the potential of protocols.
More communication and interaction capabilities.
Changes to how present payment systems are implemented, for example the use of "micro-computation", a discussion we will return to in future classes. Also, cryptographic currencies.
Augmented reality.
More emphasis on individual privacy.

Group 2

Problem of unstructured information

A large portion of the web serves content that is overwhelmingly concerned with presentation rather than with structuring content. Tim Berners-Lee himself bemoaned the death of the semantic web. His original vision of it was as follows:

I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A "Semantic Web", which makes this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The "intelligent agents" people have touted for ages will finally materialize.

For this vision to come true, information arguably needs to be structured, maybe even classified. The idea of a universal information classification system has been floated. The modern web is mostly built by software developers and people in similar roles, not by librarians and the like.

Also, how does one differentiate satire from fact?

Valuation and deduplication of information

Another common problem with the current WWW is the duplication of information. Redundancy is not in itself harmful, since it increases the availability of information, but is ad-hoc duplication of the information itself harmful?
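To make the distinction between redundancy and duplication concrete, here is a minimal sketch of how exact copies served from different locations could be detected by hashing their content. This is an illustration only, not something proposed in the discussion: the function names and example URLs are hypothetical, and the approach only catches copies that are identical after whitespace and case normalization, not paraphrased or partially copied text.

```python
import hashlib
from typing import Dict, List


def content_key(text: str) -> str:
    """Return a SHA-256 digest of the text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(pages: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each content digest to the sorted list of URLs that serve that content."""
    groups: Dict[str, List[str]] = {}
    for url, text in pages.items():
        groups.setdefault(content_key(text), []).append(url)
    return {digest: sorted(urls) for digest, urls in groups.items()}


if __name__ == "__main__":
    # Hypothetical pages: two of them carry the same text with different spacing and case.
    pages = {
        "http://a.example/x": "Hello   World",
        "http://b.example/y": "hello world",
        "http://c.example/z": "Something else",
    }
    for digest, urls in group_duplicates(pages).items():
        print(digest[:12], urls)
```

Under this scheme, a group with several URLs could be treated as deliberate redundancy (many locations, one identity) rather than as independent copies that drift apart over time.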

One then comes to the problem of assigning a value to the information found therein. How does one rate information, and according to what criteria? How does one authenticate the information? Often, popularity is used as an indicator of veracity, almost in a sophistic manner. See the excessive reliance on Google page ranking for research and on Reddit scores for news consumption.

On the current infrastructure

The current internet infrastructure should remain as is, at least in countries with more than a modicum of freedom of access to information. Centralization of control of access to information is a terrible power. See China and parts of the Middle East. On that note, what can be said of popular sites, such as Google or Wikipedia, that serve as the main entry point for many access patterns?

The problem, if any, with the current web infrastructure lies in the web itself, not in the internet.

Group 3

What we want to keep
- Linking mechanisms
- Minimum permissions to publish
What we don't like
- Relying on one source for a document
- Privacy links for security
Proposal
- Peer-to-peer and distributed mechanisms for documents
- Reverse links with caching - distributed cache
- More availability for user - what happens when system fails?
- Key management to be considered - Is a centralized or a distributed mechanism better?
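The "reverse links with caching" proposal can be illustrated with a small sketch. It is a single-process stand-in under stated assumptions, not the group's design: the class name `BacklinkIndex`, its methods, and the page identifiers are all hypothetical, and a real system would have to distribute both the index and the cache across peers.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set


class BacklinkIndex:
    """Tracks outbound links, derives reverse links, and keeps a cached copy of each page."""

    def __init__(self) -> None:
        self._outbound: Dict[str, Set[str]] = {}
        self._inbound: Dict[str, Set[str]] = defaultdict(set)
        self._cache: Dict[str, str] = {}

    def publish(self, url: str, links: Iterable[str], body: str) -> None:
        """Record (or re-record) a page, replacing any stale reverse links it left behind."""
        for old_target in self._outbound.get(url, set()):
            self._inbound[old_target].discard(url)
        new_targets = set(links)
        self._outbound[url] = new_targets
        for target in new_targets:
            self._inbound[target].add(url)
        self._cache[url] = body

    def backlinks(self, url: str) -> List[str]:
        """Return the pages that currently link to the given URL, sorted."""
        return sorted(self._inbound.get(url, set()))

    def cached(self, url: str) -> Optional[str]:
        """Return the last published copy of a page, or None if it was never seen."""
        return self._cache.get(url)


if __name__ == "__main__":
    index = BacklinkIndex()
    index.publish("page-a", ["page-b", "page-c"], "first version of A")
    index.publish("page-d", ["page-b"], "D links to B")
    print(index.backlinks("page-b"))
    print(index.cached("page-a"))
```

Re-publishing a page removes its old reverse links before adding the new ones, which is the bookkeeping that two-way links force on every restructuring of content. The cached copy is what a reader would fall back on if the original source became unavailable.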

Group 4

The idea of the web searching for us
A suggestion of how the web might differ had it been implemented by "AI" people
- AI programs searching for data - A notion already being implemented by Google slowly.
Generate report forums
HTML equivalent is inspired by the AI communication
Higher semantics apart from just indexing the data
- Problem : "How to bridge the semantic gap?"
- Search for more data patterns

Group design exercise — The web that could be

“The web that wasn't” mentioned the moans of librarians.
A universal classification system is needed.
The training overhead of classifiers (e.g., librarians) is high. See the master's that a librarian would need.
More structured content, in both classification and organization
Current indexing relies on crude brute-force searching for words, etc., rather than searching metadata
Information doesn't have the same persistence, see bitrot and Vint Cerf's talk.
Too concerned with presentation now.
Tim Berners-Lee bemoaning the death of the semantic web.
The problem of information duplication when information gets redistributed across the web. However, we do want redundancy.
Developed too much by software developers
Too reliant on Google for web structure
- See search-engine optimization
Problem of authentication (of the information, not the presenter)
- Too dependent at times on the popularity of a site, almost in a sophistic manner.
- See Reddit
How do you programmatically distinguish satire from fact?
The web's structure is also “shaped by inbound links but would be nice a bit more”
Infrastructure doesn't need to change per se.
- The distributed architecture should still stay. Centralization of control of allowed information and access is a terrible power. See China and the Middle East.
- Information, for the most part, in itself, exists centrally (as per-page), though communities (to use a generic term) are distributed.
Need more sophisticated natural language processing.

Class discussion

Focusing on vision, not the mechanism.

Reverse linking
Distributed content distribution (glorified cache)
- Both for privacy and redundancy reasons
- Suggested centralized content certification, but this doesn't address the problem of root of trust and distributed consistency checking.
  - Distributed key management is a holy grail
  - What about detecting large-scale subversion attempts, like in China?
What is the new revenue model?
- What was TBL's revenue model (tongue-in-cheek, none)?
- Organisations like Google monetized the internet, and this mechanism could destroy their ability to do so.
Search work is semi-distributed. Suggested letting the web do the work for you.
Trying to structure content in a manner simultaneously palatable to both humans and machines.
Using spare CPU time on servers for natural language processing (or other AI) of cached or locally available resources.
Imagine a smushed Wolfram Alpha, Google, Wikipedia, and Watson, and then distributed over the net.
The document was TBL's idea of the atom of content, whereas nowadays we really need something more granular.
We want to extract higher-level semantics.
Google may not be pure keyword search anymore. It is essentially AI now, but we still struggle with expressing what we want to Google.
What about the adversarial aspect of content hosters, vying for attention?
People do actively try to fool you.
Compare to Google News, though that is very specific to that domain. Their vision is a semantic web, but they are incrementally building it.
In a scary fashion, Google is one of the central points of failure of the web. Even scarier are the less technically competent people who depend on Facebook for that.
There is a semantic gap between how we express and query information, and how AI understands it.
Can think of Facebook as a distributed human search infrastructure.
A core service of an operating system is locating information. Search is infrastructure.
The problem is not purely technical. There are political and social aspects.
- Searching for a file on a local filesystem should have an unambiguous answer.
- Asking the web is a different thing. “What is the best chocolate bar?”
Is the web a network database, as understood in COMP 3005, which we consider harmful?
For two-way links, there is the problem of restructuring data and all the dependencies.
Privacy issues when tracing paths across the web.
What about the problem of information revocation?
Need more augmented reality, and distributed and micropayment systems.
We need distributed, mutually untrusting social networks.
- Now we have the problem of storage and computation, and we also take away some of the monetizable aspect.
Distribution is not free. It is very expensive in very funny ways.
The dream of harvesting all the computational power of the internet is not new.
- Startups have come and gone many times over that problem.
Google's indexers understand many documents on the web quite well. However, Google only presents a primitive keyword-like interface. It doesn't expose the ontology.
Organising information does not necessarily mean applying an ontology to it.
The organisational methods we now use don't use ontologies, but rather are supplemented by them.

A couple of related points Anil mentioned during the discussion: Distributed key management is a holy grail that no one has ever managed to get working. Nowadays databases have become important building blocks of the distributed operating system. Anil stressed that databases can in fact be considered an OS service these days. The question “How do you navigate the complex information space?” has remained a prominent one that the web has always faced.
