DistOS 2014W Lecture 6


The Early Web (Jan. 23)

Berners-Lee et al., "World-Wide Web: The Information Universe" (1992), pp. 52-58
Alex Wright, "The Web That Wasn't" (2007), Google Tech Talk

Group Discussion on "The Early Web"

Questions to discuss:

How do you think the web might have turned out if it were not the way it is at present?
What kind of infrastructure changes would you like to make?

Group 1

This group was relatively satisfied with the present structure of the web, but suggested some changes in the following areas. First, it is suggested that the web make fuller use of the potential of its protocols, some of which is being overlooked or underused. It is also suggested that the communication and interaction capabilities of the web be expanded on, e.g. the chatting functionality built into so-called 'Web 2.0' websites and the collaboration aspect of services like Google Docs (now Google Drive). A move toward individual privacy is also desired.

Some more progressive suggestions are to implement new forms of monetization and augmented reality. The current economy of the web is primarily based on ad revenue and merchandise. The idea is put forward here that alternative forms of 'service-as-payment' may exist and may be viable (e.g. 'micro-computation' done in client-side JavaScript). This idea was discussed heavily in the context of public resource computing (as that topic is essentially what this suggestion is). The use of cryptographic currencies in lieu of the present payment methods was also suggested.

Group 2

Problem of unstructured information

A large portion of the web serves content that is overwhelmingly concerned with presentation rather than with structuring content. Tim Berners-Lee himself bemoaned the death of the semantic web. His original vision of it was as follows:

I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A "Semantic Web", which makes this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The "intelligent agents" people have touted for ages will finally materialize.

For this vision to be true, information arguably needs to be structured, maybe even classified. The idea of a universal information classification system has been floated. The modern web is mostly developed by software developers and similar, not librarians and the like.

Also, how does one differentiate satire from fact?

Valuation and deduplication of information

Another common problem with the current web is the duplication of information. Redundancy is not harmful in itself, since it increases the availability of information, but is ad hoc duplication of the information itself harmful?

One then comes to the problem of assigning a value to the information found therein. How does one rate information, and according to what criteria? How does one authenticate the information? Often, popularity is used as an indicator of veracity, almost in a sophistic manner. See the excessive reliance on Google page ranking for research and on Reddit score for news consumption.

On the current infrastructure

The current internet infrastructure should remain as is, at least in countries with more than a modicum of freedom of access to information. Centralization of control of access to information is a terrible power. See China and parts of the Middle East. On that note, what can be said of popular sites, such as Google or Wikipedia, that serve as the main entry point for many access patterns?

The problem, if any, in the current web infrastructure is of the web itself, not the internet.

Group 3

In redesigning the web from today's perspective, this group would like to keep the linking mechanisms and the fact that one needs only minimal permissions (or indeed none at all) to publish content. However, it is felt that there should be changes made to the structure of document ownership i.e. the reliance on a single source for document storage and for the 'authoritative' version of the document, and that there should be changes made to 'Privacy links for security'.

It is proposed that peer-to-peer mechanisms be used for distribution of documents, that reverse links be implemented, and that the overall architecture be made to be 'information-centric' rather than 'host-centric'. Using P2P technology for distribution would help remove the reliance on a single document source mentioned above. Using reverse links with caching (i.e. a web-wide distributed cache) is one mechanism by which this could be achieved. This also increases the property of availability: if a single host fails, the user may transparently be provided with a cached copy and would never know the difference. Of course, the key management architecture would need to be considered: should it be centralized or distributed, how are some document versions given authoritative status, how are cached versions verified, etc.
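One way to picture the 'information-centric' idea and the verification of cached copies is content addressing: a document is named by a hash of its bytes, so any peer can serve it and the reader can check the copy. The following is only a minimal sketch under that assumption. The `Peer` class, `fetch` function, and the choice of SHA-256 are illustrative and were not part of the group's proposal, and the sketch does not address which version of a document is authoritative or how keys would be managed.

```python
import hashlib
from typing import Dict, List, Optional


def content_id(data: bytes) -> str:
    """Name a document by the SHA-256 digest of its bytes."""
    return hashlib.sha256(data).hexdigest()


class Peer:
    """Hypothetical peer that stores document copies keyed by content ID."""

    def __init__(self, name: str, online: bool = True) -> None:
        self.name = name
        self.online = online
        self.store: Dict[str, bytes] = {}

    def publish(self, data: bytes) -> str:
        """Store a document locally and return the ID others can request."""
        cid = content_id(data)
        self.store[cid] = data
        return cid

    def get(self, cid: str) -> Optional[bytes]:
        """Return the stored copy, or None if offline or not held."""
        if not self.online:
            return None
        return self.store.get(cid)


def fetch(cid: str, peers: List[Peer]) -> Optional[bytes]:
    """Ask peers in order; accept the first copy whose digest matches the ID."""
    for peer in peers:
        data = peer.get(cid)
        if data is not None and content_id(data) == cid:
            return data
    return None  # no reachable peer held a verifiable copy


# Usage: the original host goes offline, one cache is corrupted, one is good.
origin, mirror, bad = Peer("origin"), Peer("mirror"), Peer("bad")
cid = origin.publish(b"hello web")
mirror.store[cid] = b"hello web"
bad.store[cid] = b"tampered"
origin.online = False
document = fetch(cid, [bad, origin, mirror])
```

Because the name is derived from the content, a failed host or a corrupted cache is skipped transparently. The cost is that an edited document gets a new name, which is exactly where the question of authoritative versions comes back in.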

Group 4

Imagine an entirely different web. The papers referenced above indicate that the web has two main functions: searching and indexing. Imagine a web where the searching was done for us, i.e. rather than 'browsing' the web, a user would simply read the content that was of immediate interest to them. This vision is how the web might have been if it were implemented by the AI people. Rather than 'dumb' web servers, the network would be populated with intelligent web agents, which would be empowered to amalgamate reports on behalf of a user.

For example: a user requests information on a topic from a relevant server (e.g. asks for spacecraft information from a NASA server), and that server begins compiling a report on the requested information from content that it has locally. Any information that it does not have in local storage it could then request from logically nearby servers. As information is returned, the original agent compiles its report and either waits until a sufficient report is generated or updates the user's view of the report live as new information becomes available.

In order to facilitate such communication, an HTML equivalent must be developed which emphasizes computer-comprehensible information rather than human-readable semantics. Such formats are the subject of current research in the AI field, and it would fall upon AI researchers to develop such an equivalent. It would require higher semantics than simple indexing of data, which is itself not well understood at this time; researchers are still working on the question of how to bridge the semantic gap, and data patterns which might help with the solution have yet to be found. A hopeful final note is that the notion of AI programs searching and compiling data is already being explored and even implemented, albeit slowly, by Google.
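The report-compiling behaviour described above can be sketched as a toy model. Everything here is an assumption made for illustration: the `Agent` class, the `hops` limit, and the plain-string "facts" stand in for the machine-comprehensible format that the group says does not yet exist.

```python
from typing import Dict, List, Optional, Set


class Agent:
    """Hypothetical web agent holding local content, indexed by topic."""

    def __init__(self, name: str, facts: Dict[str, List[str]]) -> None:
        self.name = name
        self.facts = facts
        self.neighbours: List["Agent"] = []  # logically nearby servers

    def compile_report(
        self, topic: str, hops: int = 1, seen: Optional[Set[str]] = None
    ) -> List[str]:
        """Gather local content first, then ask neighbours up to `hops` away."""
        seen = set() if seen is None else seen
        if self.name in seen:
            return []  # already consulted; avoids looping between neighbours
        seen.add(self.name)
        report = [f"{self.name}: {fact}" for fact in self.facts.get(topic, [])]
        if hops > 0:
            for neighbour in self.neighbours:
                report.extend(neighbour.compile_report(topic, hops - 1, seen))
        return report


# Usage with placeholder content; the server names are illustrative only.
first = Agent("server-a", {"spacecraft": ["local note 1"]})
second = Agent("server-b", {"spacecraft": ["remote note 2"]})
first.neighbours.append(second)
second.neighbours.append(first)
report = first.compile_report("spacecraft")
```

A live-updating view, as the group describes, would yield lines as they arrive rather than returning a finished list. The hard part, understanding what the content means, is not modelled here at all.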

Group design exercise — The web that could be

The web as it currently stands is a vast soup of unstructured information (much to the distress of librarians everywhere). Many ideas for "the web that could be" focus on a web for structured information, but have hit several difficulties. Primarily, nobody has ever been able to agree on a universal classification system for all potential information. In fact, there are those [citation needed] who claim that such a system may never be realized. Even if such a system were developed and accepted as a standard, it would invariably require advanced training to classify any information. Those who aren't convinced of the difficulty of this task should consider the master's degree which a librarian (information specialist) requires. That is to say: proper and useful (i.e. 'good') classification of content would not, in the general case, be possible by the content creators themselves, let alone the classification of the existing (and enormous) body of content.

Our current solution is a combination of various technologies including search engines and brute-force indexing, natural language processing, and tagging via the "semantic web". Unfortunately this system has its own problems. Arguably the primary reason is simply lack of user interest: the semantic web has effectively died because nobody bothered tagging anything. Some contributing factors include the problem of information duplication as it is redistributed across the web (however, some redundancy is desired), the fact that too much is developed by software developers rather than information specialists, and that we have become too reliant on Google for web structure (see search-engine optimization).

Another large problem that today's web faces is the problem of authentication (i.e. of the 'information', not the 'presenter'), which is far too dependent on the popularity of a site, almost in a sophistic manner, rather than on the factuality or authority of the content (see Reddit).

There is also a need to develop a notion of connotation in the semantic markup of the web, since that is such an integral part of human communication. How does an author programmatically distinguish satire from fact? Or even agreement with, or derision of, an article to which the author links? (For example, should an author writing an article about a hateful group (e.g. WBC 'protesting' a funeral) link to their website? On the one hand, for fair (not that the group in this example deserves anything fair) and unbiased reporting, they must provide reference to the group's own words; on the other hand, the article's author and publisher most likely do not want to be associated with the group, nor do they want to increase the popularity or relevance of the group's webpage, both mistakes that a web-crawler could easily make upon seeing the link. A solution is to have different keywords or a (simple) classification system for semantic links, e.g. <link-agree>, <link-disagree>, <link-omfgno>, etc.)
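To make the crawler's side of this concrete, here is a minimal sketch of how typed links might feed a popularity count. The `Link` record, the stance vocabulary ("agree", "disagree", "neutral"), and the rule that only "disagree" links are ignored are all assumptions for illustration; no real crawler or markup standard is being described.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    stance: str  # hypothetical vocabulary: "agree", "disagree", "neutral"


def endorsement_counts(links: Iterable[Link]) -> Counter:
    """Count inbound links per target, skipping links marked as disagreement."""
    counts: Counter = Counter()
    for link in links:
        if link.stance != "disagree":
            counts[link.target] += 1
    return counts


# Usage: a news article cites a group it condemns without boosting it.
links = [
    Link("news.example/article", "group.example", "disagree"),
    Link("blog.example/post", "wiki.example", "agree"),
    Link("blog.example/other", "wiki.example", "neutral"),
]
counts = endorsement_counts(links)
```

The reference to the group's own words is still there for readers, but it no longer raises the target's score. Whether authors would bother to label their links is the same user-interest problem noted above.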

Other problems with the web as it stands, not specifically to do with semantics, include persistence of information, the concern with presentation, and the structure enforced by one-way links. For the persistence of information, see bit rot and Vint Cerf's talk. The fact that the job of web development nowadays is primarily the job of a graphic designer can be seen as a major problem with the state of the web. It degrades the separation of semantics and layout which was envisioned by Tim Berners-Lee et al. and it makes computer comprehension more difficult. The movement of HTML5 away from tags such as <center> and towards greater separation of CSS and HTML is a step in the right direction, but it needs to propagate to the mindset of the developers as well. The web's structure is primarily influenced by the fact that links are one-way only: the linking page knows its target, but the target knows nothing of the pages that link to it. A proper and well thought out implementation of bidirectional links may be difficult, but it is a worthwhile pursuit given the influence it would have on the web as a whole. There is also a need for more sophisticated natural language processing, something that Wolfram|Alpha gives us hope for but that still seems far off.
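As a rough illustration of what bidirectional links involve, the sketch below keeps a reverse index alongside the usual forward links. The `LinkIndex` class and its method names are hypothetical, and a single in-memory index sidesteps the real difficulty, which is keeping such an index consistent across independently run hosts.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set


class LinkIndex:
    """Hypothetical index that records each link in both directions."""

    def __init__(self) -> None:
        self.forward: DefaultDict[str, Set[str]] = defaultdict(set)
        self.backward: DefaultDict[str, Set[str]] = defaultdict(set)

    def add_link(self, source: str, target: str) -> None:
        self.forward[source].add(target)
        self.backward[target].add(source)

    def backlinks(self, page: str) -> List[str]:
        """Pages that link to `page`, which today's web cannot answer directly."""
        return sorted(self.backward.get(page, set()))

    def remove_page(self, page: str) -> List[str]:
        """Drop a page; return the pages whose links to it now dangle."""
        for target in self.forward.pop(page, set()):
            self.backward[target].discard(page)
        return sorted(self.backward.pop(page, set()))


# Usage with placeholder page names.
index = LinkIndex()
index.add_link("a", "c")
index.add_link("b", "c")
index.add_link("c", "d")
linking_to_c = index.backlinks("c")
dangling = index.remove_page("c")
```

Removing or restructuring a page becomes an operation with visible dependents rather than a silent source of broken links, which is both the benefit and the cost of two-way links.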

All of this notwithstanding, the underlying infrastructure of the web doesn't need to change per se. The distributed architecture should obviously stay. Centralization of control over allowed information and access is a terrible power (see China and the Middle East). Information in and of itself, for the most part, exists centrally (on a per-page or per-document basis), though communities (to use a generic term) are distributed.

Class discussion

Focusing on vision, not the mechanism.

Reverse linking
Distributed content distribution (glorified cache)
- Both for privacy and redundancy reasons
- Suggested centralized content certification, but that doesn't address the problem of root of trust and distributed consistency checking.
  - Distributed key management is a holy grail
  - What about detecting large-scale subversion attempts, like in China?
What is the new revenue model?
- What was TBL's revenue model (tongue-in-cheek, none)?
- Organisations like Google monetized the internet, and this mechanism could destroy their ability to do so.
Search work is semi-distributed. Suggested letting the web do the work for you.
Trying to structure content in a manner simultaneously palatable to both humans and machines.
Using spare CPU time on servers for natural language processing (or other AI) of cached or locally available resources.
Imagine a smushed Wolfram Alpha, Google, Wikipedia, and Watson, and then distributed over the net.
The document was TBL's idea of the atom of content, whereas nowadays we really need something more granular.
We want to extract higher-level semantics.
Google may not be pure keyword search anymore. It essentially now uses AI to determine relevancy, but we still struggle with expressing what we want to Google.
What about the adversarial aspect of content hosters, vying for attention?
People do actively try to fool you.
Compare to Google News, though that is very specific to that domain. Their vision is a semantic web, but they are incrementally building it.
In a scary fashion, Google is one of the central points of failure of the web. Even scarier are the less technically competent people who depend on Facebook for that.
There is a semantic gap between how we express and query information, and how AI understands it.
Can think of Facebook as a distributed human search infrastructure.
The core service/function of an operating system is to locate information. Search is infrastructure.
The problem is not purely technical. There are political and social aspects.
- Searching for a file on a local filesystem should have an unambiguous answer.
- Asking the web is a different thing. “What is the best chocolate bar?”
Is the web a network database, as understood in COMP 3005, which we consider harmful?
For two-way links, there is the problem of restructuring data and all the dependencies.
Privacy issues when tracing paths across the web.
What about the problem of information revocation?
Need more augmented reality and distributed and micro payment systems.
We need distributed, mutually untrusting social networks.
- Now we have the problem of storage and computation, but we also take away some of the monetizable aspect.
Distribution is not free. It is very expensive in very funny ways.
The dream of harvesting all the computational power of the internet is not new.
- Startups have come and gone many times over that problem.
Google's indexers understand many documents on the web quite well. However, Google only presents a primitive keyword-like interface. It doesn't expose the ontology.
Organising information does not necessarily mean applying an ontology to it.
The organisational methods we now use don't use ontologies, but rather are supplemented by them.

A couple of related points Anil mentioned during the discussion: Distributed key management is a holy grail that no one has ever managed to get working. Nowadays databases have become important building blocks of the distributed operating system. Anil stressed that databases can in fact be considered an OS service these days. The question “How do you navigate the complex information space?” has remained a prominent question that the Web has always faced.
