DistOS 2014W Lecture 6



Group Discussion on "The Early Web"

Questions to discuss:

  1. How do you think the web would have turned out if it had not developed the way it did?
  2. What kind of infrastructure changes would you like to make?

Group 1

Relatively satisfied with the present structure of the web; the changes suggested fall into the areas below:
  • Make better use of the potential of existing protocols
  • More communication and interaction capabilities.
  • Changes to the present payment systems, for example "micro-computation" (a discussion we will get back to in future classes), as well as cryptographic currencies.
  • Augmented reality.
  • A greater emphasis on individual privacy.

Group 2

A large portion of the web serves content that is overwhelmingly concerned with presentation rather than with structuring content. Tim Berners-Lee himself has bemoaned the death of the semantic web.

  • Information should be classified in detail
    • Organize things on the web, e.g., Yahoo's indexers
    • The Universal Decimal Classification, an idea of Paul Otlet's, should be considered.
    • In the end, this comes down to the semantic web
  • Information redundancy
  • Information verification

Group 3

  • What we want to keep
    • Linking mechanisms
    • Minimum permissions to publish
  • What we don't like
    • Relying on a single source for a document
    • Privacy links for security
  • Proposal
    • Peer-to-peer, distributed mechanisms for hosting documents
    • Reverse links with caching, i.e., a distributed cache (see the sketch after this list)
    • More availability for users: what happens when the system fails?
    • Key management needs to be considered: is a centralized or a distributed mechanism better?
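
To make the reverse-link and distributed-cache proposal concrete, here is a minimal Python sketch. The class name, peers, and URLs are invented for illustration and were not part of the discussion: every forward link implies a reverse link, and cached replicas let a document survive the failure of its original host.

# Hypothetical sketch: a page store that tracks reverse links and records
# which peers hold cached copies of each document.
class PageStore:
    def __init__(self):
        self.forward = {}    # url -> set of urls it links to
        self.backward = {}   # url -> set of urls that link to it (reverse links)
        self.replicas = {}   # url -> list of peers holding a cached copy

    def add_link(self, src, dst):
        """Record src -> dst and the corresponding reverse link."""
        self.forward.setdefault(src, set()).add(dst)
        self.backward.setdefault(dst, set()).add(src)

    def cache_copy(self, url, peer):
        """Note that `peer` holds a cached replica of `url`."""
        self.replicas.setdefault(url, []).append(peer)

    def fetch(self, url, origin_up=True):
        """Prefer the origin; fall back to any cached replica if it is down."""
        if origin_up:
            return "fetched %s from origin" % url
        for peer in self.replicas.get(url, []):
            return "fetched %s from cache at %s" % (url, peer)
        return None

store = PageStore()
store.add_link("http://a.example/", "http://b.example/")
store.cache_copy("http://b.example/", "peer-17")
print(store.backward["http://b.example/"])                # who links to B?
print(store.fetch("http://b.example/", origin_up=False))  # B's host is down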

Group 4

  • The idea of the web searching for us
  • A suggestion of how different the web would be if it had been implemented by "AI" people
    • AI programs searching for data, a notion Google is already slowly implementing.
  • Generate report forums
  • An HTML equivalent inspired by how AI systems communicate
  • Higher-level semantics, beyond just indexing the data
    • Problem: "How do we bridge the semantic gap?"
    • Search for more data patterns

Group design exercise — The web that could be

  • “The web that wasn't” mentioned the moans of librarians.
  • A universal classification system is needed.
  • The training overhead for human classifiers (e.g., librarians) is high; consider the master's degree a librarian would need.
  • More structured content, in both classification and organization
  • Current indexing relies on crude brute-force searching for words, etc., rather than searching metadata (see the keyword-vs-metadata sketch after this list)
  • Information doesn't have the same persistence; see bit rot and Vint Cerf's talk.
  • Too concerned with presentation now.
  • Tim Berners-Lee bemoaning the death of the semantic web.
  • The problem of information duplication when information gets redistributed across the web. However, we do want redundancy.
  • Too much of the web was developed by software developers
  • Too reliant on Google for web structure
    • See search-engine optimization
  • Problem of authentication (of the information, not the presenter)
    • Too dependent at times on the popularity of a site, almost in a sophistic manner.
    • See Reddit
  • How do you programmatically distinguish satire from fact?
  • The web's structure is also shaped by inbound links, though it would be nice to have a bit more of this.
  • Infrastructure doesn't need to change per se.
    • The distributed architecture should stay. Centralized control over what information is allowed and who can access it is a terrible power; see China and the Middle East.
    • Information, for the most part, exists centrally (on a per-page basis), though communities (to use a generic term) are distributed.
  • Need more sophisticated natural language processing.
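
To illustrate the gap noted above between brute-force keyword matching and searching structured metadata, here is a small hypothetical Python sketch; the documents, fields, and classification values are invented for illustration. The keyword search matches raw words and cannot tell two senses of a word apart, while the metadata search answers a query about the classification directly.

# Hypothetical contrast between brute-force keyword search and a search over
# structured metadata. Documents and fields are invented for illustration.
documents = [
    {"url": "http://a.example/", "text": "jaguar speed in the wild",
     "metadata": {"topic": "animals", "type": "article"}},
    {"url": "http://b.example/", "text": "the new jaguar model is fast",
     "metadata": {"topic": "cars", "type": "advertisement"}},
]

def keyword_search(query):
    """Brute-force word matching: cannot tell the animal from the car."""
    return [d["url"] for d in documents if query.lower() in d["text"].lower()]

def metadata_search(**fields):
    """Search over structured metadata: the classification does the work."""
    return [d["url"] for d in documents
            if all(d["metadata"].get(k) == v for k, v in fields.items())]

print(keyword_search("jaguar"))          # both pages match
print(metadata_search(topic="animals"))  # only the page about the animal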

Class discussion

Focusing on vision, not the mechanism.

  • Reverse linking
  • Distributed content distribution (glorified cache)
    • Both for privacy and redundancy reasons
    • Centralized content certification was suggested, but it doesn't address the root-of-trust problem or distributed consistency checking (see the signing sketch at the end of these notes).
      • Distributed key management is a holy grail
      • What about detecting large-scale subversion attempts, like in China
  • What is the new revenue model?
    • What was TBL's revenue model (tongue-in-cheek, none)?
    • Organisations like Google monetized the internet, and this mechanism could destroy their ability to do so.
  • Search work is semi-distributed. Suggested letting the web do the work for you.
  • Trying to structure content in a manner simultaneously palatable to both humans and machines.
  • Using spare CPU time on servers for natural language processing (or other AI) of cached or locally available resources.
  • Imagine a smushed Wolfram Alpha, Google, Wikipedia, and Watson, and then distributed over the net.
  • The document was TBL's atom of content, whereas nowadays we really need something more granular.
  • We want to extract higher-level semantics.
  • Google may not be pure keyword search anymore. It is essentially AI now, but we still struggle with expressing what we want to Google.
  • What about the adversarial aspect of content hosts vying for attention?
  • People do actively try to fool you.
  • Compare to Google News, though that is very specific to that domain. Their vision is a semantic web, but they are incrementally building it.
  • In a scary fashion, Google is one of the central points of failure of the web. Even scarier: less technically competent people depend on Facebook for that.
  • There is a semantic gap between how we express and query information, and how AI understands it.
  • Can think of Facebook as a distributed human search infrastructure.
  • A core service of an operating system is locating information. Search is infrastructure.
  • The problem is not purely technical. There are political and social aspects.
    • Searching for a file on a local filesystem should have an unambiguous answer.
    • Asking the web is a different thing. “What is the best chocolate bar?”
  • Is the web a network database, as understood in COMP 3005, which we consider harmful?
  • For two-way links, there is the problem of restructuring data and all the dependencies.
  • Privacy issues when tracing paths across the web.
  • What about the problem of information revocation?
  • Need more augmented reality and distributed and micro payment systems.
  • We need distributed, mutually untrusting social networks.
    • Now we have the problem of storage and computation, but this also takes away some of the monetizable aspects.
  • Distribution is not free. It is very expensive in very funny ways.
  • The dream of harvesting all the computational power of the internet is not new.
    • Startups have come and gone many times over that problem.
  • Google's indexers understand many documents on the web quite well. However, Google only presents a primitive keyword-like interface; it doesn't expose the ontology.
  • Organising information does not necessarily mean applying an ontology to it.
  • The organisational methods we now use don't use ontologies, but rather are supplemented by them.
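
To make the content-certification point from the discussion concrete, here is a hypothetical sketch using Python's third-party cryptography package: a publisher signs a document once, and any cache or mirror can verify that the copy it serves is authentic. It deliberately leaves open the hard parts raised in class, namely how the public key itself is trusted and distributed (the root-of-trust and key-management problems).

# Hypothetical sketch of content certification: a publisher signs a document,
# and any mirror or cache can verify the copy it serves before passing it on.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# Publisher side: sign the document once.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()  # must be distributed out of band: the root-of-trust problem
document = b"<html>original content</html>"
signature = private_key.sign(document)

# Cache/mirror side: verify a copy against the publisher's signature.
def verify_copy(copy):
    try:
        public_key.verify(signature, copy)
        return True
    except InvalidSignature:
        return False

print(verify_copy(document))                           # True: authentic copy
print(verify_copy(b"<html>tampered content</html>"))   # False: subverted copy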