Mobile Apps 2023W Lecture 10
Revision as of 17:36, 10 February 2023 by Soma
Notes
February 10
-----------

Web browser with history visualization/query
 - understand what pages I've been visiting

Potential screens:
 - basic web browser (where user will spend most of their time)
 - history viewer (text)
 - history viewer (graphical)

How do we display web history?
 - basic chronological: time, web page, title

Potential tasks
 - what websites do I visit the most?
 - what topics do I read about?
 - what authors do I read?
 - connect the above (topic & author, topic & website)

Web URLs are easy, we get those from the user visiting pages
Time is easy, we know when things happen
But the other things... they require analyzing the page

Authorship is actually hard, there is no standard way of representing it

The "Semantic Web" was an idea for representing the web for automatic
analysis rather than human consumption
 - all information would be represented in schemas
 - so queries would be very easy
The Semantic Web never really took off, and isn't what we generally get

What we *can* get is the plain text of a page
 - we have what is between the body and heading tags
 - the code for "reader" modes in browsers shows how to do this

Let's assume we have some way to convert a web page into text

We then have a bunch of words
 - can run sort & uniq, get a list of the words in a page
 - could even get frequency, but that typically isn't so useful

Given a list of words, what is the "topic" of a page?
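The "sort & uniq" step above can be sketched in a few lines. This is a minimal sketch, not the lecture's code: the `unique_words` name and the letters-only regex tokenizer are my own choices, and it assumes we already have the page's plain text (e.g. from a reader-mode extractor).

```python
import re
from collections import Counter

def unique_words(text: str) -> Counter:
    """Lowercase the text, split it into words, and count each word.

    sorted(result) is the uniq'd word list; the counts are the
    per-page frequencies the notes say we usually won't need.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

counts = unique_words("The cat sat on the mat. The mat was flat.")
# sorted(counts) -> ['cat', 'flat', 'mat', 'on', 'sat', 'the', 'was']
```

A Counter doubles as both the unique-word set (its keys) and the frequency table, so the same pass can feed either use.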
 - to do this properly you'd need NLP of some kind (AI)

I could just remember which words are on which pages
 - remove frequently-occurring words automatically by deleting words
   that get above a certain frequency

Zipf's law
 - frequency of words in natural language falls off as 1/n
 - so most words are very infrequent, but a few are very frequent

Instead of topics, we are just remembering the words that are present
on different pages
 - so in addition to the URL, we need to store which words each page has
 - but I don't think that will be much storage if we eliminate
   frequent words

So then, how do we structure a database schema to fit all this?

Potential tables
 - page IDs
   - URLs -> UIDs
   - use UIDs everywhere instead of URLs to save space
   - can trim URLs to make sure they work consistently
 - raw history
   - page URL, time
 - page->words (maybe not needed?)
   - only stores words that haven't reached max frequency
 - words->pages
   - again, only words that haven't reached max frequency
   - represent pages using a UID, not the URL
 - word frequency
   - every word we've seen
   - for each word, its frequency
   - max frequency: once above this we stop counting
 - ignored URLs
   - pages that change frequently

We can assume that pages don't change, i.e., the URL->contents mapping
stays consistent

Note that if we view too many pages with a given topic, that topic will
turn into a frequent word and won't show up in our analysis

To visualize, we don't want pages->words, but words->pages

One problem with SQL: it doesn't like variable-sized records
 - say a page has 1000 words, 20 of them unique and non-frequent
 - we need to represent the mapping of the page's UID to each of
   those words
 - but another page could have 2, another could have 2000

Can just have a table of (word, UID) pairs with duplicates allowed
 - then later can index it to get stats

(Need a cap on the maximum length of a word to exclude bizarre text strings)

Assume that we've solved this, what operations do we want to support?
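The tables above can be sketched in SQLite. The table and column names here are my guesses at what the notes describe, not a fixed design, and the (word, UID) table with duplicates plus a later index is the variable-sized-record workaround the notes mention.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE pages (          -- page IDs: URL -> UID mapping
    uid  INTEGER PRIMARY KEY,
    url  TEXT UNIQUE          -- trimmed so URLs compare consistently
);
CREATE TABLE history (        -- raw history
    uid  INTEGER REFERENCES pages(uid),
    time REAL                 -- visit timestamp
);
CREATE TABLE word_page (      -- words->pages, duplicates allowed
    word TEXT,                -- length-capped, non-frequent words only
    uid  INTEGER REFERENCES pages(uid)
);
CREATE TABLE word_freq (      -- every word we've seen
    word  TEXT PRIMARY KEY,
    count INTEGER             -- stop counting once above max frequency
);
CREATE TABLE ignored_urls (   -- pages whose contents keep changing
    url TEXT PRIMARY KEY
);
CREATE INDEX idx_word ON word_page(word);  -- index later to get stats
""")
```

With this layout, "for a topic, get the URLs" is a single join from word_page to pages on the UID.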
 - get topics ("interesting" words)
 - for a topic, get the URLs associated with it

But how do we treat news sites?
 - visiting the same URL produces completely different content
 - I think such pages should automatically be excluded
 - so cbc.ca/news should automatically be ignored because its contents
   keep changing
 - but the pages it points to will be kept because they are consistent

So every time we visit a page we'll have to see if we've visited it before
 - calculate word frequency; if it is very different from a past visit
   then we can add the page to a list of ignored web pages

When the user wants to see their topic-based history, we would show them
the list of words, perhaps using a word cloud
 - select a word, then you can see the pages associated with it
 - maybe get another word cloud that is associated with that word
   (to show co-occurrences)

Do you all like word clouds?
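The "has this page changed too much?" check above can be sketched by comparing the word sets from two visits. The jaccard() helper and the 0.5 threshold are assumptions for illustration, not from the lecture.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap of two word sets: 1.0 = identical, 0.0 = disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def should_ignore(old_words: set, new_words: set,
                  threshold: float = 0.5) -> bool:
    """True if the page's words changed too much between visits."""
    return jaccard(old_words, new_words) < threshold

# e.g. a news front page between two visits shares almost no words:
old = {"budget", "election", "minister", "weather"}
new = {"hockey", "trade", "weather", "storm"}
# overlap is 1 word out of 7, well below 0.5, so the URL gets ignored
```

A stable article page, by contrast, would score near 1.0 on a revisit and stay in the history.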