Soma: Created page with "==Notes==

 February 10 -----------  Web browser with history visualization/query  - understand what pages I've been visiting  Potential screens:  - basic web browser (where user will spend most of their time)  - history viewer (text)  - history viewer (graphical)  How do we display web history?  - basic chronological     - time, web page, title   Potential tasks  - what websites do I visit the most?  - what topics do I read about?  - what authors do I read?  - conn..."

2023-02-10T21:36:13Z

==Notes==

<pre>
February 10
-----------

Web browser with history visualization/query
- understand what pages I've been visiting

Potential screens:
- basic web browser (where user will spend most of their time)
- history viewer (text)
- history viewer (graphical)

How do we display web history?
- basic chronological
  - time, web page, title

Potential tasks
- what websites do I visit the most?
- what topics do I read about?
- what authors do I read?
- connect the above (topic & author, topic & website)

Web URLs are easy, get that from user visiting page
Time is easy, we know when things happen
But the other things...they require analyzing the page

Authorship is actually hard, no standard way of representing it

The "Semantic web" was an idea for representing the web for automatic analysis rather than human consumption
- all information would be represented in schemas
- so queries would be very easy

The semantic web never really took off, so it isn't what we generally get

What we *can* get is the plain text of a page
- we have what is between the body and heading tags
- if we look at the code for "reader" modes in browsers, they show how to do this
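
A minimal sketch of the page-to-text step, using only Python's standard
html.parser module. This is far simpler than a browser's reader mode: it
keeps every piece of text that is not inside a script or style element.
The names TextExtractor and page_to_text are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping script and style elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside script/style elements.
        if self.skip_depth == 0:
            self.parts.append(data)


def page_to_text(html):
    """Return the text of an HTML document with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

A real reader mode also drops navigation, ads, and footers; this sketch
does not attempt that.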

Let's assume we have some way to convert a web page into text

We have a bunch of words
- can run sort & uniq, get a list of words in a page
- could even get frequency, but that typically isn't so useful
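
The sort & uniq step could look like this in Python (a sketch; the
tokenizing rule of "runs of letters and apostrophes" is an assumption,
and the function names are made up):

```python
import re
from collections import Counter


def words_in(text):
    """Lowercase the text and split it into runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def unique_words(text):
    """Equivalent of sort & uniq: each word once, in sorted order."""
    return sorted(set(words_in(text)))


def word_counts(text):
    """Equivalent of sort & uniq -c: how often each word appears."""
    return Counter(words_in(text))
```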

Given a list of words, what is the "topic" of a page?
- to do this properly you'd need NLP of some kind (AI)

I could just remember which words are on which pages
- remove frequently occurring words automatically by deleting words that
  get above a certain frequency

Zipf's law
- frequency of the nth most common word in natural language falls off as 1/n
- so most words are very infrequent, but a few are very frequent
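
To get a feel for what a 1/n falloff means, here is a small sketch that
computes each rank's share of all word occurrences under an idealized
Zipf distribution. Real text only approximates this, and the vocabulary
size of 10000 is an arbitrary assumption.

```python
def zipf_shares(vocab_size):
    """Share of all word occurrences for each rank under an ideal 1/n law."""
    weights = [1 / n for n in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


shares = zipf_shares(10000)
# Fraction of all occurrences covered by the 100 most common words.
top_100 = sum(shares[:100])
```

Under these assumptions the top 100 ranks cover roughly half of all
occurrences, which is why dropping a small set of frequent words removes
so much of the text.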

Instead of topics, we are just remembering the words that are present on different pages
- so in addition to the URL, we need to store which words each page has
- but I don't think that will be that much storage if we eliminate
frequent words
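
A sketch of the "remember words, drop the frequent ones" idea, kept in
memory for simplicity. It assumes frequency means the number of pages a
word has appeared on (the notes don't pin this down), and the class
name, method names, and cap value are all made up.

```python
from collections import Counter

MAX_FREQUENCY = 3  # hypothetical cap, would need tuning


class WordIndex:
    def __init__(self, max_frequency=MAX_FREQUENCY):
        self.max_frequency = max_frequency
        self.page_count = Counter()  # word -> number of pages it was seen on
        self.pages_for = {}          # word -> set of page UIDs

    def add_page(self, uid, words):
        for word in set(words):
            if self.page_count[word] > self.max_frequency:
                continue  # already a frequent word, stop counting
            self.page_count[word] += 1
            if self.page_count[word] > self.max_frequency:
                # The word just became frequent: forget its pages.
                self.pages_for.pop(word, None)
            else:
                self.pages_for.setdefault(word, set()).add(uid)

    def interesting_words(self):
        """Words that have not (yet) crossed the frequency cap."""
        return sorted(self.pages_for)
```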

So then, how do we structure a database schema to fit all this?

Potential tables
- page IDs
  - URLs -> UIDs
  - use UIDs everywhere instead of URLs to save space
  - can trim URLs to make sure they work consistently
- raw history
  - page URL, time
- page->words (maybe not needed?)
  - only stores words that haven't reached max frequency
- words->pages
  - again only words that haven't reached max frequency
  - represent pages using a UID, not the URL
- word frequency
  - every word we've seen
  - for each word, its frequency
  - max frequency, once above this we stop counting
- ignored URLs
  - pages that change frequently

We can assume that pages don't change, i.e., URL->contents mapping stays consistent

Note that if we view too many pages with a given topic, that topic will turn into a frequent word and won't show up in our analysis

To visualize, we don't want pages->words, but words->pages

One problem with SQL, it doesn't like variable-sized records

So if we have a page with 1000 words, 20 unique non-frequent words
- so need to represent the mapping of the page's URL UID to each of these words
- but another page could have 2, another could have 2000

Can just have a table of word, UID with duplicates
- then later can index to get stats

(Need a cap on the maximum length of a word to exclude bizarre text strings)
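
One way the tables discussed so far could be sketched, using SQLite
through Python's sqlite3 module. All table and column names, the
40-character word cap, and the choice to store the raw history by UID
rather than URL are assumptions. A single (word, UID) table with an
index on word stands in for both page->words and words->pages.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    uid INTEGER PRIMARY KEY,
    url TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS history (
    page_uid INTEGER NOT NULL REFERENCES pages(uid),
    visited_at TEXT NOT NULL
);
-- One row per (word, page) pair, so a page can have 2 words or 2000.
CREATE TABLE IF NOT EXISTS page_words (
    word TEXT NOT NULL CHECK (length(word) <= 40),
    page_uid INTEGER NOT NULL REFERENCES pages(uid)
);
CREATE INDEX IF NOT EXISTS page_words_by_word ON page_words(word);
CREATE TABLE IF NOT EXISTS word_frequency (
    word TEXT PRIMARY KEY,
    frequency INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS ignored_urls (
    page_uid INTEGER PRIMARY KEY REFERENCES pages(uid)
);
"""


def open_history_db(path=":memory:"):
    """Open (or create) the history database and make sure tables exist."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def page_uid(db, url):
    """Return the UID for a URL, creating a row the first time it is seen."""
    db.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (url,))
    row = db.execute("SELECT uid FROM pages WHERE url = ?", (url,)).fetchone()
    return row[0]
```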

Assume that we've solved this, what operations do we want to support?

- get topics ("interesting" words)
- for a topic, get the URLs of the pages that contain it
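
The two operations above could be sketched as SQL queries over a
(word, UID) table. The table names, column names, and example URLs are
made up, and ranking topics by the number of distinct pages is just one
possible definition of "interesting".

```python
import sqlite3


def topics(db, limit=20):
    """Words ranked by how many distinct pages contain them."""
    rows = db.execute(
        "SELECT word, COUNT(DISTINCT page_uid) AS n FROM page_words "
        "GROUP BY word ORDER BY n DESC, word LIMIT ?",
        (limit,),
    )
    return rows.fetchall()


def urls_for_topic(db, word):
    """URLs of the pages that contain the given word."""
    rows = db.execute(
        "SELECT DISTINCT pages.url FROM page_words "
        "JOIN pages ON pages.uid = page_words.page_uid "
        "WHERE page_words.word = ? ORDER BY pages.url",
        (word,),
    )
    return [url for (url,) in rows]


# Tiny in-memory example with made-up data.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE pages (uid INTEGER PRIMARY KEY, url TEXT UNIQUE);
CREATE TABLE page_words (word TEXT, page_uid INTEGER);
""")
db.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [(1, "https://example.com/a"), (2, "https://example.com/b")],
)
db.executemany(
    "INSERT INTO page_words VALUES (?, ?)",
    [("android", 1), ("android", 2), ("kotlin", 2)],
)
```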

But how do we treat news sites?
- visiting the same URL produces completely different content?
- I think it should automatically be excluded
- so cbc.ca/news should automatically be ignored because its contents keep changing
- but pages it points to will be kept because they are consistent

So every time we visit a page we'll have to see if we've visited it before
- calculate word frequency, if it is very different from a past visit then we can add it to a list of ignored web pages
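
A sketch of the "very different from a past visit" check. The notes
don't say how to compare word frequencies; cosine similarity between the
two word-count vectors is one possible measure, and the 0.5 threshold is
an arbitrary assumption.

```python
import math
from collections import Counter

SIMILARITY_THRESHOLD = 0.5  # hypothetical, would need tuning


def cosine_similarity(a, b):
    """Cosine similarity between two word-count Counters (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_ignore(previous_words, current_words,
                  threshold=SIMILARITY_THRESHOLD):
    """True if two visits to the same URL look like different content."""
    similarity = cosine_similarity(Counter(previous_words),
                                   Counter(current_words))
    return similarity < threshold
```

A URL for which should_ignore returns True would go into the ignored
URLs table.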

When the user wants to see their topic-based history, we would show them the list of words, perhaps using a word cloud
- select a word, then you can see the pages associated with it
- maybe get another word cloud that is associated with that word
  (to show co-occurrences)
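
The data behind the second word cloud could be computed like this (a
sketch, assuming a hypothetical in-memory mapping from each word to the
set of page UIDs that contain it):

```python
from collections import Counter


def co_occurring_words(pages_for, word):
    """Other words that share pages with `word`, most shared pages first.

    pages_for maps each word to the set of page UIDs containing it.
    """
    selected = pages_for.get(word, set())
    counts = Counter()
    for other, pages in pages_for.items():
        if other == word:
            continue
        shared = len(pages & selected)
        if shared:
            counts[other] = shared
    return counts.most_common()
```

The counts could then set the size of each word in the cloud.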

Do you all like word clouds?
</pre>

Mobile Apps 2023W Lecture 10 - Revision history