DistOS 2014W Lecture 6

From Soma-notes
= Group design exercise — The web that could be =
* “The web that wasn't” mentioned the moans of librarians.
* A universal classification system is needed.
* The training overhead of classifiers (e.g., librarians) is high. See the master's degree that a librarian would need.
* More structured content: both classification and organization
* Current indexing relies on crude brute-force searching for words rather than on searching metadata
* Information doesn't have the same persistence, see bitrot and Vint Cerf's talk.
* Too concerned with presentation now.
* Tim Berners-Lee bemoaning the death of the semantic web.
* The problem of information duplication when information gets redistributed across the web. However, we do want redundancy.
* Too much developed by software developers
* Too reliant on Google for web structure
** See search-engine optimization
* Problem of authentication (of the information, not the presenter)
** Too dependent at times on the popularity of a site, almost in a sophistic manner.
** See Reddit
* How do you programmatically distinguish satire from fact?
* The web's structure is also shaped by inbound links, but it would be nice to have a bit more of that.
* Infrastructure doesn't need to change per se.
** The distributed architecture should stay. Centralized control over what information is allowed, and who may access it, is a terrible power. See China and the Middle East.
** Information, for the most part, exists centrally (per page), though communities (to use a generic term) are distributed.
* Need more sophisticated natural language processing.
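The indexing complaint above (crude brute-force word search instead of metadata search) can be sketched as a toy contrast. The pages and classification fields below are made-up examples, not any real index format:

```python
# Toy contrast between brute-force keyword search and metadata search.
# The pages and their classification metadata are hypothetical.

pages = {
    "page1": {"text": "A history of chocolate bars and candy.",
              "meta": {"subject": "food", "type": "article"}},
    "page2": {"text": "Chocolate Labs are friendly dogs.",
              "meta": {"subject": "animals", "type": "article"}},
}

def keyword_search(word):
    """Brute-force: scan every page's full text for the word."""
    return [pid for pid, p in pages.items() if word in p["text"].lower()]

def metadata_search(field, value):
    """Structured: match against curated classification metadata."""
    return [pid for pid, p in pages.items() if p["meta"].get(field) == value]

print(keyword_search("chocolate"))         # matches both pages, food or not
print(metadata_search("subject", "food"))  # only the page classified as food
```

The keyword query cannot tell a chocolate bar from a chocolate Lab; the metadata query can, but only because a classifier did the work up front, which is exactly the librarian-style overhead noted above.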
= Class discussion =
Focusing on vision, not the mechanism.
* Reverse linking
* Distributed content distribution (glorified cache)
** Both for privacy and redundancy reasons
** Centralized content certification was suggested, but it doesn't address the problems of root of trust and distributed consistency checking.
*** Distributed key management is a holy grail
*** What about detecting large-scale subversion attempts, like in China
* What is the new revenue model?
** What was TBL's revenue model (tongue-in-cheek, none)?
** Organisations like Google monetized the internet, and this mechanism could destroy their ability to do so.
* Search work is semi-distributed. Suggested letting the web do the work for you.
* Trying to structure content in a manner simultaneously palatable to both humans and machines.
* Using spare CPU time on servers for natural language processing (or other AI) of cached or locally available resources.
* Imagine a smushed Wolfram Alpha, Google, Wikipedia, and Watson, and then distributed over the net.
* The document was TBL's idea of the atom of content, whereas nowadays we really need something more granular.
* We want to extract higher-level semantics.
* Google may not be pure keyword search anymore. It is essentially AI now, but we still struggle with expressing what we want to Google.
* What about the adversarial aspect of content hosts vying for attention?
* People do actively try to fool you.
* Compare to Google News, though that is very specific to that domain. Their vision is a semantic web, but they are incrementally building it.
* In a scary fashion, Google is one of the central points of failure of the web. Even scarier are the less technically competent people who depend on Facebook for that.
* There is a semantic gap between how we express and query information, and how AI understands it.
* Can think of Facebook as a distributed human search infrastructure.
* A core service of an operating system is locating information. '''Search is infrastructure.'''
* The problem is not purely technical. There are political and social aspects.
** Searching for a file on a local filesystem should have an unambiguous answer.
** Asking the web is a different thing. “What is the best chocolate bar?”
* Is the web a network database (as understood in COMP 3005), a model we consider harmful?
* For two-way links, there is the problem of restructuring data and all the dependencies.
* Privacy issues when tracing paths across the web.
* What about the problem of information revocation?
* Need more augmented reality, and distributed and micro-payment systems.
* We need distributed, mutually untrusting social networks.
** Now we have the problem of storage and computation, but it also takes away some of the monetizable aspects.
* Distribution is not free. It is very expensive in very funny ways.
* The dream of harvesting all the computational power of the internet is not new.
** Startups have come and gone many times over that problem.
* Google's indexers understand many documents on the web quite well. However, they only '''present''' a primitive keyword-like interface. They don't expose the ontology.
* Organising information does not necessarily mean applying an ontology to it.
* The organisational methods we now use don't use ontologies, but rather are supplemented by them.
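The reverse-linking idea that opens the discussion above can be sketched as a backlink index computed from ordinary one-way links, so every page can discover who points at it. The link graph here is a made-up example:

```python
# Sketch: invert a forward-link graph into a reverse-link (backlink) index.
# The pages and links below are hypothetical.
from collections import defaultdict

forward_links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": [],
}

def build_backlinks(links):
    """Invert the forward-link graph into a target -> sources map."""
    back = defaultdict(list)
    for source, targets in links.items():
        for target in targets:
            back[target].append(source)
    return dict(back)

backlinks = build_backlinks(forward_links)
print(backlinks["c.html"])  # the pages that link to c.html
```

This is a batch computation over a crawl rather than true two-way links in the infrastructure, which is why the restructuring and privacy concerns raised above still apply: the index must be rebuilt as links change, and it records who pointed where.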

Revision as of 22:14, 23 January 2014


= Group discussion on "The Early Web" =

Questions to discuss:
# How do you think the web would have turned out, if not the present way?
# What kind of infrastructure changes would you like to make?

== Group 1 ==

Relatively satisfied with the present structure of the web; some suggested changes are in the areas below:
* Make use of the greater potential of protocols
* More communication and interaction capabilities
* Changes to the present payment systems, e.g., "micro-computation" (a discussion we will return to in future classes) and cryptographic currencies
* Augmented reality
* More individual privacy

== Group 2 ==

A large portion of the web serves content that is overwhelmingly concerned with presentation rather than with structuring content. Tim Berners-Lee himself bemoaned the death of the semantic web.

* Information should be classified in detail
** Organize things on the web, e.g., Yahoo's indexers
** Consider the Universal Decimal Classification, an idea by Paul Otlet
** In the end it comes down to the semantic web
* Information redundancy
* Information verification

== Group 3 ==

* What we want to keep
** Linking mechanisms
** Minimal permissions needed to publish
* What we don't like
** Relying on one source for a document
** Privacy links for security
* Proposal
** Peer-to-peer distributed mechanisms for documents
** Reverse links with caching: a distributed cache
** More availability for users: what happens when the system fails?
** Key management to be considered: is a centralized or a distributed mechanism better?
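Group 3's reverse-links-with-caching proposal points toward content addressing: if a document is named by the hash of its bytes, a copy fetched from any untrusted peer's cache can be verified locally, without a central certifier. A minimal sketch (the cache and documents are hypothetical):

```python
# Sketch of content addressing for a distributed cache: a document's name is
# the SHA-256 digest of its bytes, so a copy from any untrusted peer can be
# checked against its own name. The cache contents here are made up.
import hashlib

cache = {}

def put(content: bytes) -> str:
    """Store content under its own SHA-256 digest and return that key."""
    key = hashlib.sha256(content).hexdigest()
    cache[key] = content
    return key

def get(key: str) -> bytes:
    """Fetch by key and verify the bytes actually hash to that key."""
    content = cache[key]
    if hashlib.sha256(content).hexdigest() != key:
        raise ValueError("cached copy does not match its address")
    return content

key = put(b"hello, distributed web")
assert get(key) == b"hello, distributed web"
```

Note this verifies integrity only; it says nothing about who is allowed to publish or revoke a document, which is where the key-management question above comes back in.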

== Group 4 ==

* An idea of the web searching for us
* A suggestion of how different the web would be if it had been implemented by "AI" people
** AI programs searching for data: a notion already being implemented, slowly, by Google
* Generate report forums
* An HTML equivalent inspired by AI communication
* Higher semantics, beyond just indexing the data
** Problem: "How to bridge the semantic gap?"
** Search for more data patterns
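Group 4's "search for more data patterns" can be illustrated with a toy pattern that lifts (subject, relation, object) triples out of free text, one small step from keyword matching toward semantics. The sentences and the pattern are made-up examples:

```python
# Toy pattern-based extraction: pull simple (subject, relation, object)
# triples out of free text. The text and the single hard-coded pattern
# are hypothetical; real systems need far more robust NLP.
import re

text = "Ottawa is the capital of Canada. Paris is the capital of France."

# Hypothetical pattern: "<X> is the capital of <Y>."
pattern = re.compile(r"(\w+) is the capital of (\w+)\.")

triples = [(city, "capital-of", country)
           for city, country in pattern.findall(text)]
print(triples)
```

The brittleness of such hand-written patterns is precisely the semantic gap: the pattern captures one phrasing of one relation, while the sophisticated natural language processing called for above would generalize across phrasings.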
