Notes
February 10
-----------
Web browser with history visualization/query
- understand what pages I've been visiting
Potential screens:
- basic web browser (where user will spend most of their time)
- history viewer (text)
- history viewer (graphical)
How do we display web history?
- basic chronological
- time, web page, title
Potential tasks
- what websites do I visit the most?
- what topics do I read about?
- what authors do I read?
- connect the above (topic & author, topic & website)
Web URLs are easy, get that from user visiting page
Time is easy, we know when things happen
But the other things...they require analyzing the page
Authorship is actually hard, no standard way of representing it
The "Semantic web" was an idea for representing the web for automatic analysis rather than human consumption
- all information would be represented in schemas
- so queries would be very easy
The semantic web never really took off, so it isn't what we generally get
What we *can* get is the plain text of a page
- we have what is between the body and heading tags
- if we look at the code for "reader" modes in browsers, they show how to do this
Let's assume we have some way to convert a web page into text
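A minimal sketch of one such conversion, using only Python's standard html.parser. This is much cruder than a browser's reader mode (it keeps navigation text, titles, and so on); the names TextExtractor and page_to_text are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []      # text fragments in document order
        self.skip_depth = 0  # > 0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def page_to_text(html):
    """Return the text of an HTML page as one space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```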
We have a bunch of words
- can run sort & uniq, get a list of words in a page
- could even get frequency, but that typically isn't so useful
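The sort & uniq step could look roughly like this in Python. Treating a "word" as a run of ASCII letters is an assumption made here to keep the sketch short; the function names are hypothetical.

```python
import re
from collections import Counter


def words_in(text):
    """Split text into lowercase words (assumption: a word is a run of a-z)."""
    return re.findall(r"[a-z]+", text.lower())


def unique_words(text):
    """Equivalent of `sort | uniq`: the sorted list of distinct words."""
    return sorted(set(words_in(text)))


def word_counts(text):
    """Equivalent of `sort | uniq -c`: how often each word occurs."""
    return Counter(words_in(text))
```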
Given a list of words, what is the "topic" of a page?
- to do this properly you'd need NLP of some kind (AI)
I could just remember which words are on which pages
- remove frequently occurring words automatically by deleting words that get above a certain frequency
Zipf's law
- frequency of the nth most common word in natural language falls off roughly as 1/n
- so most words are very infrequent, but a few are very frequent
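To get a feel for what 1/n implies, here is a small sketch of an idealized Zipf distribution. The vocabulary size of 10000 is an arbitrary assumption, and real text only follows the law approximately.

```python
def zipf_shares(vocab_size):
    """Share of all word occurrences for each rank n, if frequency is exactly 1/n."""
    weights = [1.0 / n for n in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


shares = zipf_shares(10000)       # assumed vocabulary of 10000 distinct words
top_100 = sum(shares[:100])       # share taken by the 100 most common words
print(f"top 100 of 10000 words: {top_100:.0%} of all occurrences")
```

Under these assumptions the 100 most common words account for roughly half of all occurrences, which is why dropping a small set of frequent words can shrink the stored data a lot.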
Instead of topics, we are just remembering the words that are present on different pages
- so in addition to the URL, we need to store which words each page has
- but I don't think that will be that much storage if we eliminate frequent words
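A rough in-memory sketch of remembering words per page with a frequency cutoff. Two assumptions that the notes leave open: "frequency" here means the number of pages a word has appeared on, and the cap value is arbitrary. WordIndex and its fields are hypothetical names.

```python
from collections import Counter


class WordIndex:
    """Remembers which pages each word is on, forgetting words that get too common."""

    def __init__(self, max_pages=3):
        self.max_pages = max_pages   # assumed cap; words above it count as "frequent"
        self.page_count = Counter()  # word -> number of pages it was seen on
        self.pages_for = {}          # word -> set of page ids (non-frequent words only)

    def add_page(self, page_id, words):
        for word in set(words):
            if self.page_count[word] > self.max_pages:
                continue  # already frequent: stop counting
            self.page_count[word] += 1
            if self.page_count[word] > self.max_pages:
                self.pages_for.pop(word, None)  # just crossed the cap: forget it
            else:
                self.pages_for.setdefault(word, set()).add(page_id)
```

Note that revisiting the same page would count its words again; a real version would need to check for that first.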
So then, how do we structure a database schema to fit all this?
Potential tables
- page IDs
- URLs -> UIDs
- use UIDs everywhere instead of URLs to save space
- can trim URLs to make sure they work consistently
- raw history
- page URL, time
- page->words (maybe not needed?)
- only stores words that haven't reached max frequency
- words->pages
- again only words that haven't reached max frequency
- represent pages using a UID, not the URL
- word frequency
- every word we've seen
- for each word, its frequency
- max frequency, once above this we stop counting
- ignored URLs
- pages that change frequently
Otherwise we can assume that pages don't change, i.e., the URL->contents mapping stays consistent
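One possible way to write these tables down, as a sketch using SQLite from Python. All table and column names are made up for illustration, the raw history stores the UID rather than the URL (following the "use UIDs everywhere" idea), and the URL trimming rules (lowercase host, drop fragment and trailing slash) are assumptions rather than anything decided above.

```python
import sqlite3
from urllib.parse import urlsplit, urlunsplit

SCHEMA = """
CREATE TABLE pages        (uid INTEGER PRIMARY KEY, url TEXT UNIQUE NOT NULL);
CREATE TABLE history      (uid INTEGER NOT NULL REFERENCES pages(uid),
                           visited_at TEXT NOT NULL);
CREATE TABLE word_pages   (word TEXT NOT NULL,
                           uid INTEGER NOT NULL REFERENCES pages(uid));
CREATE TABLE word_freq    (word TEXT PRIMARY KEY, freq INTEGER NOT NULL);
CREATE TABLE ignored_urls (uid INTEGER PRIMARY KEY REFERENCES pages(uid));
"""


def trim_url(url):
    """Normalize a URL so that trivially different spellings map to one UID."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))  # "" drops the #fragment


def page_uid(db, url):
    """Return the UID for a URL, creating a row in pages on first sight."""
    url = trim_url(url)
    db.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (url,))
    return db.execute("SELECT uid FROM pages WHERE url = ?", (url,)).fetchone()[0]


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
```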
Note that if we view too many pages with a given topic, that topic will turn into a frequent word and won't show up in our analysis
To visualize, we don't want pages->words, but words->pages
One problem with SQL: it doesn't like variable-sized records
Say we have a page with 1000 words, 20 of them unique non-frequent words
- we need to represent the mapping of the page's URL UID to each of these words
- but another page could have 2, another could have 2000
Can just have a table of word, UID with duplicates
- then later can index to get stats
(Need a cap on the maximum length of a word to exclude bizarre text strings)
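A sketch of the (word, UID) table: one fixed-size row per pair, so a page with 2 words and a page with 2000 words just contribute different numbers of rows. The table name, the index name, and the 30-character word cap are assumptions for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE word_pages (word TEXT NOT NULL, uid INTEGER NOT NULL)")


def store_words(uid, words, max_len=30):
    """Insert one row per distinct word on the page, skipping overlong strings."""
    rows = [(w, uid) for w in set(words) if len(w) <= max_len]
    db.executemany("INSERT INTO word_pages (word, uid) VALUES (?, ?)", rows)


store_words(1, ["zipf", "sql"])
store_words(2, ["zipf"])
store_words(3, ["zipf", "schema", "sql", "x" * 40])  # the 40-char string is dropped

# Index afterwards so that per-word lookups and stats are cheap.
db.execute("CREATE INDEX word_pages_by_word ON word_pages (word)")

stats = db.execute(
    "SELECT word, COUNT(*) FROM word_pages "
    "GROUP BY word ORDER BY COUNT(*) DESC, word"
).fetchall()
```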
Assume that we've solved this, what operations do we want to support?
- get topics ("interesting" words)
- for a topic, get the URLs of the pages that contain it
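These two operations could be sketched as queries over a pages table and a (word, UID) table. The table names are hypothetical, and defining an "interesting" word as one that appears on at least min_pages pages is only an assumption; the notes don't pin that down.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE pages      (uid INTEGER PRIMARY KEY, url TEXT UNIQUE NOT NULL);
CREATE TABLE word_pages (word TEXT NOT NULL, uid INTEGER NOT NULL);
""")


def get_topics(db, min_pages=2):
    """Words seen on at least min_pages pages, most widespread first."""
    rows = db.execute(
        "SELECT word FROM word_pages GROUP BY word "
        "HAVING COUNT(DISTINCT uid) >= ? "
        "ORDER BY COUNT(DISTINCT uid) DESC, word",
        (min_pages,),
    )
    return [row[0] for row in rows]


def urls_for_topic(db, word):
    """URLs of the pages that contain the given word."""
    rows = db.execute(
        "SELECT DISTINCT p.url FROM word_pages wp "
        "JOIN pages p ON p.uid = wp.uid "
        "WHERE wp.word = ? ORDER BY p.url",
        (word,),
    )
    return [row[0] for row in rows]
```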
But how do we treat news sites?
- visiting the same URL produces completely different content each time
- I think it should automatically be excluded
- so cbc.ca/news should automatically be ignored because its contents keep changing
- but pages it points to will be kept because they are consistent
So every time we visit a page we'll have to see if we've visited it before
- calculate word frequency, if it is very different from a past visit then we can add it to a list of ignored web pages
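One way to make "very different" concrete is cosine similarity between the word counts of two visits. This is a sketch of a swapped-in technique, not something the notes settled on; the 0.5 threshold is an arbitrary assumption, and it presumes we kept the word counts from the previous visit somewhere.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two word-count Counters (1.0 = same mix of words)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def has_changed(old_words, new_words, threshold=0.5):
    """True if the two visits look different enough to ignore this URL."""
    return cosine_similarity(Counter(old_words), Counter(new_words)) < threshold
```

A URL for which has_changed returns True would then be added to the ignored list.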
When the user wants to see their topic-based history, we would show them the list of words, perhaps using a word cloud
- select a word, then you can see the pages associated with it
- maybe get another word cloud that is associated with that word (to show co-occurrences)
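The counts behind that second word cloud could be computed roughly like this. The sketch assumes we have a page->words mapping in memory (words_by_page is a hypothetical dict of page ID to set of words); the same numbers could also come from joining a word->pages table with itself.

```python
from collections import Counter


def co_occurring(word, words_by_page):
    """For each other word, count the pages it shares with the selected word."""
    counts = Counter()
    for words in words_by_page.values():
        if word in words:
            counts.update(w for w in set(words) if w != word)
    return counts
```

The resulting counts could be used as the word sizes in the cloud.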
Do you all like word clouds?