<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://homeostasis.scs.carleton.ca/wiki/index.php?action=history&amp;feed=atom&amp;title=Mobile_Apps_2023W_Lecture_10</id>
	<title>Mobile Apps 2023W Lecture 10 - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://homeostasis.scs.carleton.ca/wiki/index.php?action=history&amp;feed=atom&amp;title=Mobile_Apps_2023W_Lecture_10"/>
	<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Mobile_Apps_2023W_Lecture_10&amp;action=history"/>
	<updated>2026-04-28T09:29:17Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.42.1</generator>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Mobile_Apps_2023W_Lecture_10&amp;diff=24352&amp;oldid=prev</id>
		<title>Soma: Created page with &quot;==Notes==  &lt;pre&gt; February 10 -----------  Web browser with history visualization/query  - understand what pages I&#039;ve been visiting  Potential screens:  - basic web browser (where user will spend most of their time)  - history viewer (text)  - history viewer (graphical)  How do we display web history?  - basic chronological     - time, web page, title   Potential tasks  - what websites do I visit the most?  - what topics do I read about?  - what authors do I read?  - conn...&quot;</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Mobile_Apps_2023W_Lecture_10&amp;diff=24352&amp;oldid=prev"/>
		<updated>2023-02-10T21:36:13Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;==Notes==  &amp;lt;pre&amp;gt; February 10 -----------  Web browser with history visualization/query  - understand what pages I&amp;#039;ve been visiting  Potential screens:  - basic web browser (where user will spend most of their time)  - history viewer (text)  - history viewer (graphical)  How do we display web history?  - basic chronological     - time, web page, title   Potential tasks  - what websites do I visit the most?  - what topics do I read about?  - what authors do I read?  - conn...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;==Notes==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
February 10&lt;br /&gt;
-----------&lt;br /&gt;
&lt;br /&gt;
Web browser with history visualization/query&lt;br /&gt;
 - understand what pages I&amp;#039;ve been visiting&lt;br /&gt;
&lt;br /&gt;
Potential screens:&lt;br /&gt;
 - basic web browser (where user will spend most of their time)&lt;br /&gt;
 - history viewer (text)&lt;br /&gt;
 - history viewer (graphical)&lt;br /&gt;
&lt;br /&gt;
How do we display web history?&lt;br /&gt;
 - basic chronological&lt;br /&gt;
    - time, web page, title&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Potential tasks&lt;br /&gt;
 - what websites do I visit the most?&lt;br /&gt;
 - what topics do I read about?&lt;br /&gt;
 - what authors do I read?&lt;br /&gt;
 - connect the above (topic &amp;amp; author, topic &amp;amp; website)&lt;br /&gt;
&lt;br /&gt;
Web URLs are easy, get that from user visiting page&lt;br /&gt;
Time is easy, we know when things happen&lt;br /&gt;
But the other things...they require analyzing the page&lt;br /&gt;
&lt;br /&gt;
Authorship is actually hard, no standard way of representing it&lt;br /&gt;
&lt;br /&gt;
The &amp;quot;Semantic web&amp;quot; was an idea for representing the web for automatic analysis rather than human consumption&lt;br /&gt;
 - all information would be represented in schemas&lt;br /&gt;
 - so queries would be very easy&lt;br /&gt;
&lt;br /&gt;
The semantic web never really took off, so it isn&amp;#039;t what we generally get&lt;br /&gt;
&lt;br /&gt;
What we *can* get is the plain text of a page&lt;br /&gt;
 - we have what is between body and headline tags&lt;br /&gt;
 - if we look at the code for &amp;quot;reader&amp;quot; modes in browsers, they show how to do this&lt;br /&gt;
&lt;br /&gt;
Let&amp;#039;s assume we have some way to convert a web page into text&lt;br /&gt;
&lt;br /&gt;
We have a bunch of words&lt;br /&gt;
 - can run sort &amp;amp; uniq, get a list of words in a page&lt;br /&gt;
 - could even get frequency, but that typically isn&amp;#039;t so useful&lt;br /&gt;
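&lt;br /&gt;
A minimal Python sketch of the sort and uniq step, assuming the page has already been converted to plain text.&lt;br /&gt;
The function names and the letters-only tokenizer are illustrative assumptions, not something from the lecture:&lt;br /&gt;
&lt;br /&gt;
```python
import re
from collections import Counter

def page_words(text):
    # Lowercase the text and pull out runs of letters (a crude tokenizer).
    return re.findall(r"[a-z]+", text.lower())

def unique_words(text):
    # Same result as piping the words through sort and uniq.
    return sorted(set(page_words(text)))

def word_counts(text):
    # Same information as sort followed by uniq -c.
    return Counter(page_words(text))

sample = "The cat sat on the mat. The mat was flat."
print(unique_words(sample))
print(word_counts(sample).most_common(2))
```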
&lt;br /&gt;
Given a list of words, what is the &amp;quot;topic&amp;quot; of a page?&lt;br /&gt;
 - to do this properly you&amp;#039;d need NLP of some kind (AI)&lt;br /&gt;
 &lt;br /&gt;
I could just remember which words are on which pages&lt;br /&gt;
 - remove frequently occurring words automatically by deleting words that&lt;br /&gt;
   rise above a certain frequency&lt;br /&gt;
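&lt;br /&gt;
One possible reading of the frequency cutoff, sketched in Python.&lt;br /&gt;
It assumes frequency means the number of visited pages that contain a word.&lt;br /&gt;
MAX_FREQUENCY and the function names are hypothetical, and the sketch keeps counting past the cap for simplicity:&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter

MAX_FREQUENCY = 3  # hypothetical cap; the notes do not give a value

def record_page(totals, words):
    # Count each distinct word once per visited page.
    totals.update(set(words))

def is_frequent(totals, word):
    # A word is dropped once its page count rises above the cap.
    return totals[word] > MAX_FREQUENCY

def interesting_words(totals, words):
    # Keep only the words on this page that are still below the cap.
    return sorted(w for w in set(words) if not is_frequent(totals, w))

totals = Counter()
pages = [["news", "cat"], ["news", "dog"], ["news", "fish"], ["news", "cat"]]
for page in pages:
    record_page(totals, page)
print(interesting_words(totals, ["news", "cat"]))
```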
&lt;br /&gt;
Zipf&amp;#039;s law&lt;br /&gt;
 - frequency of words in natural language falls off as 1/n, where n is the word&amp;#039;s frequency rank&lt;br /&gt;
   - so most words are very infrequent, but a few are very frequent&lt;br /&gt;
&lt;br /&gt;
Instead of topics, we are just remembering the words that are present on different pages&lt;br /&gt;
 - so in addition to the URL, we need to store which words each page has&lt;br /&gt;
 - but I don&amp;#039;t think that will be that much storage if we eliminate&lt;br /&gt;
   frequent words&lt;br /&gt;
&lt;br /&gt;
So then, how do we structure a database schema to fit all this?&lt;br /&gt;
&lt;br /&gt;
Potential tables&lt;br /&gt;
 - page IDs&lt;br /&gt;
   - URLs -&amp;gt; UIDs&lt;br /&gt;
   - use UIDs everywhere instead of URLs to save space&lt;br /&gt;
   - can trim URLs to make sure they work consistently&lt;br /&gt;
 - raw history&lt;br /&gt;
   - page URL, time&lt;br /&gt;
 - page-&amp;gt;words (maybe not needed?)&lt;br /&gt;
    - only stores words that haven&amp;#039;t reached max frequency&lt;br /&gt;
 - words-&amp;gt;pages&lt;br /&gt;
    - again only words that haven&amp;#039;t reached max frequency&lt;br /&gt;
    - represent pages using a UID, not the URL&lt;br /&gt;
 - word frequency&lt;br /&gt;
    - every word we&amp;#039;ve seen&lt;br /&gt;
    - for each word, its frequency&lt;br /&gt;
    - max frequency, once above this we stop counting&lt;br /&gt;
 - ignored URLs&lt;br /&gt;
    - pages that change frequently&lt;br /&gt;
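&lt;br /&gt;
A rough sketch of how these tables might look in SQLite, using the Python sqlite3 module.&lt;br /&gt;
All table and column names are assumptions made for illustration.&lt;br /&gt;
The history table stores the page UID rather than the URL, following the idea of using UIDs everywhere:&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

# Hypothetical schema; one table per item in the list above
# (page-to-words is left out since it may not be needed).
SCHEMA = """
CREATE TABLE pages (uid INTEGER PRIMARY KEY, url TEXT UNIQUE NOT NULL);
CREATE TABLE history (uid INTEGER NOT NULL REFERENCES pages(uid),
                      visited_at TEXT NOT NULL);
CREATE TABLE word_pages (word TEXT NOT NULL,
                         uid INTEGER NOT NULL REFERENCES pages(uid));
CREATE TABLE word_frequency (word TEXT PRIMARY KEY, count INTEGER NOT NULL);
CREATE TABLE ignored_urls (uid INTEGER PRIMARY KEY REFERENCES pages(uid));
"""

def open_history_db(path=":memory:"):
    # Create a fresh database containing the sketch schema.
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db

def page_uid(db, url):
    # Map a URL to its UID, adding a row the first time the URL is seen.
    db.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (url,))
    row = db.execute("SELECT uid FROM pages WHERE url = ?", (url,)).fetchone()
    return row[0]
```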
&lt;br /&gt;
we can assume that pages don&amp;#039;t change, i.e., URL-&amp;gt;contents mapping stays consistent&lt;br /&gt;
&lt;br /&gt;
Note that if we view too many pages with a given topic, that topic will turn into a frequent word and won&amp;#039;t show up in our analysis&lt;br /&gt;
&lt;br /&gt;
To visualize, we don&amp;#039;t want pages-&amp;gt;words, but words-&amp;gt;pages&lt;br /&gt;
&lt;br /&gt;
One problem with SQL, it doesn&amp;#039;t like variable-sized records&lt;br /&gt;
&lt;br /&gt;
Say we have a page with 1000 words, 20 of them unique non-frequent words&lt;br /&gt;
 - we need to represent the mapping of the page&amp;#039;s URL UID to each of these words&lt;br /&gt;
 - but another page could have 2, another could have 2000&lt;br /&gt;
&lt;br /&gt;
Can just have a table of word, UID with duplicates&lt;br /&gt;
 - then later can index to get stats&lt;br /&gt;
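&lt;br /&gt;
A small sketch of the flat word, UID table, assuming SQLite through the Python sqlite3 module.&lt;br /&gt;
The table name, index name, and sample rows are made up for illustration.&lt;br /&gt;
Each page contributes as many rows as it has words, so no record needs a variable size:&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE word_pages (word TEXT, uid INTEGER)")

# One row per (word, page) sighting; duplicates are allowed.
rows = [("zipf", 1), ("schema", 1), ("zipf", 2), ("zipf", 2), ("browser", 3)]
db.executemany("INSERT INTO word_pages VALUES (?, ?)", rows)

# The index is added afterwards so lookups by word stay fast.
db.execute("CREATE INDEX word_pages_by_word ON word_pages (word)")

def pages_for_word(word):
    # UIDs of the pages that contain the given word.
    query = "SELECT DISTINCT uid FROM word_pages WHERE word = ? ORDER BY uid"
    return [uid for (uid,) in db.execute(query, (word,))]

def page_counts():
    # For every word, the number of distinct pages it appears on.
    query = ("SELECT word, COUNT(DISTINCT uid) FROM word_pages "
             "GROUP BY word ORDER BY word")
    return dict(db.execute(query).fetchall())
```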
&lt;br /&gt;
(Need a cap on the maximum length of a word to exclude bizarre text strings)&lt;br /&gt;
&lt;br /&gt;
Assume that we&amp;#039;ve solved this, what operations do we want to support?&lt;br /&gt;
&lt;br /&gt;
 - get topics (&amp;quot;interesting&amp;quot; words)&lt;br /&gt;
 - for a topic, get the URLs of the pages that contain it&lt;br /&gt;
&lt;br /&gt;
But how do we treat news sites?&lt;br /&gt;
 - visiting the same URL produces completely different content each time&lt;br /&gt;
 - I think it should automatically be excluded&lt;br /&gt;
   - so cbc.ca/news should automatically be ignored because its contents keep changing&lt;br /&gt;
   - but pages it points to will be kept because they are consistent&lt;br /&gt;
&lt;br /&gt;
So every time we visit a page we&amp;#039;ll have to see if we&amp;#039;ve visited it before&lt;br /&gt;
 - calculate word frequency, if it is very different from a past visit then we can add it to a list of ignored web pages&lt;br /&gt;
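&lt;br /&gt;
One way to decide whether two visits are very different is cosine similarity between their word counts.&lt;br /&gt;
This measure is a choice made for this sketch, not something the lecture specifies.&lt;br /&gt;
SIMILARITY_FLOOR is a hypothetical threshold that would need tuning:&lt;br /&gt;
&lt;br /&gt;
```python
import math
from collections import Counter

SIMILARITY_FLOOR = 0.5  # hypothetical threshold; would need tuning

def cosine_similarity(a, b):
    # Cosine of the angle between two word-count vectors (1.0 = same mix).
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def should_ignore(previous, current):
    # True when the similarity between the two visits is below the floor.
    return SIMILARITY_FLOOR > cosine_similarity(previous, current)

monday = Counter("budget vote budget minister".split())
tuesday = Counter("storm snow storm closure".split())
print(should_ignore(monday, monday), should_ignore(monday, tuesday))
```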
 &lt;br /&gt;
When the user wants to see their topic-based history, we would show them the list of words, perhaps using a word cloud&lt;br /&gt;
  - select a word, then you can see the pages associated with it&lt;br /&gt;
  - maybe get another word cloud that is associated with that word&lt;br /&gt;
    (to show co-occurrences)&lt;br /&gt;
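&lt;br /&gt;
A sketch of how the second word cloud could be computed, assuming an in-memory mapping from each word to the set of page UIDs that contain it.&lt;br /&gt;
The function name and sample data are hypothetical:&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter

def co_occurring(word_pages, target):
    # Count, for every other word, how many pages it shares with target.
    target_pages = word_pages.get(target, set())
    counts = Counter()
    for word, pages in word_pages.items():
        if word == target:
            continue
        shared = len(target_pages.intersection(pages))
        if shared:
            counts[word] = shared
    return counts

word_pages = {
    "zipf": {1, 2, 3},
    "frequency": {1, 2},
    "schema": {3, 4},
    "browser": {5},
}
print(co_occurring(word_pages, "zipf").most_common())
```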
&lt;br /&gt;
Do you all like word clouds?&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Soma</name></author>
	</entry>
</feed>