WebFund 2016W Lecture 23

From Soma-notes
Revision as of 03:31, 6 April 2016

Video

The video from the lecture given on April 5, 2016 is now available at http://homeostasis.scs.carleton.ca/~soma/webfund-2016w/lectures/comp2406-2016w-lec23-05Apr2016.mp4.

Notes

In class

Lecture 23
----------

Scalability

* You replicate your web application; it should be
  "embarrassingly parallel" (no direct interaction
  between replicas)

* Communication between servers happens through the
  backend database

Why not have the web servers talk directly to each other?
 - you then have to figure out how to do
   synchronization/concurrency right
 - that's what databases are for!
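A minimal sketch of this architecture (the counter example and names are illustrative, not from the lecture): each server keeps no state of its own, so replicas never need to talk to each other, and the shared store is the only place where concurrent access has to be managed.

```javascript
// Sketch: two stateless "server" instances sharing state through one
// backend store rather than talking to each other directly.
// The store is the single place that has to get concurrency right.

const store = new Map(); // stands in for the backend database

// Any replica can handle any request because no state lives in the server.
function makeServer(name) {
  return {
    handleVisit(user) {
      const count = (store.get(user) || 0) + 1; // read shared state
      store.set(user, count);                   // write it back
      return `${name}: ${user} visit #${count}`;
    }
  };
}

const serverA = makeServer("A");
const serverB = makeServer("B");

console.log(serverA.handleVisit("alice")); // A: alice visit #1
console.log(serverB.handleVisit("alice")); // B: alice visit #2 -- B sees A's write
```

Because the handlers are stateless, adding another replica is just another makeServer() call; in a real deployment the Map would be a database providing atomic operations.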

So how in the world do you scale up databases?

First answer: use a minimal solution
 - only get the functionality that you want

First rule of scalability
 - you can't do everything at scale

So, you have to choose what you will do

Why are sacrifices necessary?

latency versus bandwidth

bandwidth: bits transferred per second on average
latency: time to get first bit of response after request

Consider a large truck full of hard disks driving
across Canada.
  - very, very high bandwidth
  - very, very high latency as well!
    (2 weeks to get first bit of response)
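The truck's numbers can be made concrete with a rough calculation (the disk count, capacity, and travel time below are invented round figures):

```javascript
// Back-of-the-envelope: bandwidth of a truck full of hard disks.
// All three inputs are made-up round numbers for illustration.

const disks = 1000;
const bytesPerDisk = 4e12;           // 4 TB per disk
const tripSeconds = 14 * 24 * 3600;  // two weeks across the country

const totalBits = disks * bytesPerDisk * 8;
const bitsPerSecond = totalBits / tripSeconds;

console.log((bitsPerSecond / 1e9).toFixed(1), "Gbit/s"); // ~26.5 Gbit/s
// Enormous bandwidth -- but the latency to the first bit is the full two weeks.
```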

Ideally, you want high bandwidth and low latency
 - bandwidth you get through parallelism
 - latency has to be engineered

A "supercomputer" is one with low-latency memory access,
for LOTS of memory
  - so it has to have fast interconnects
  - thus, accesses to different nodes aren't much
    slower than local accesses

The challenge for large web apps is having the database
answer queries with low latency

But some amount of latency is inevitable
 - speed of light is finite

So if you want fast access to your webserver worldwide
 - you need to replicate across the globe
 - be close to your clients


NoSQL databases became popular because of latency
concerns
 - you needed to be as fast as possible,
 - so strip it to the bone

Use an in-memory key-value store if it is sufficient
  - lowest latency
  - least functionality
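As a sketch of what "least functionality" means here, a key-value store is little more than a hash table: exact-key reads and writes, and nothing else (the class and keys below are illustrative):

```javascript
// Minimal in-memory key-value store: lowest latency, least functionality.
// All you get is get/set/delete by exact key -- no queries, no joins.

class KVStore {
  constructor() { this.data = new Map(); }
  set(key, value) { this.data.set(key, value); }
  get(key) { return this.data.get(key); }       // O(1) hash lookup
  delete(key) { return this.data.delete(key); }
}

const kv = new KVStore();
kv.set("session:42", { user: "alice" });
console.log(kv.get("session:42").user); // alice
// Anything beyond exact-key lookup (e.g. "all sessions for alice")
// would mean scanning every entry -- that's the functionality you gave up.
```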

If you have to, use an SQL database
  - highest latency
  - most functionality

Or use something in between (MongoDB)

Once you choose the type of database, you OPTIMIZE
 - minimize I/O and computation required per access
   (read or write)
 - example: query optimization
   - how you form the query
   - how the database is organized

Count the number of web pages that have the word
 "amazing" in them

How?
 - first, need a database with a copy of the web pages
 - then, you could do linear search through all
   of the web pages...
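The linear-search idea looks like this in miniature (the three "pages" are made up):

```javascript
// The naive approach: a linear scan of every page, checking for "amazing".
// Cost grows with the total size of the corpus, so this can't scale to the web.

const pages = [ // tiny stand-in for a crawled copy of the web
  { url: "a.example", text: "this site is amazing" },
  { url: "b.example", text: "nothing to see here" },
  { url: "c.example", text: "an Amazing deal awaits" },
];

function countPagesContaining(word) {
  return pages.filter(p => p.text.toLowerCase().includes(word)).length;
}

console.log(countPagesContaining("amazing")); // 2
```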

I ask this because a web search is a massive challenge
in query optimization

 - need to limit scope as early as possible in query
 - organize data so queries are quick to be answered
    - precompute as much as possible

The best you can do is table lookup. So have the right
tables ready!

Key tool is making an INDEX
 - table of search term and pointers to data

E.g., you have a table of customers sorted by ID
 - have an index of names, so a table of names versus
   IDs
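The customers example can be sketched as two tables: the data keyed by ID, plus a precomputed name index, so a name query becomes table lookups instead of a scan (the customer records are invented):

```javascript
// Sketch of the customers example: data keyed by ID, plus a separate
// index mapping names to IDs so name lookups don't need a full scan.

const customersById = new Map([
  [1, { id: 1, name: "Alice" }],
  [2, { id: 2, name: "Bob" }],
  [3, { id: 3, name: "Alice" }], // names need not be unique
]);

// Precompute the index once: name -> list of IDs.
const idsByName = new Map();
for (const c of customersById.values()) {
  if (!idsByName.has(c.name)) idsByName.set(c.name, []);
  idsByName.get(c.name).push(c.id);
}

// A name query is now two table lookups instead of a scan of every customer.
function findByName(name) {
  return (idsByName.get(name) || []).map(id => customersById.get(id));
}

console.log(findByName("Alice").map(c => c.id)); // [ 1, 3 ]
```

The trade-off is the one the notes describe: the index costs extra space and must be maintained on every write, in exchange for fast reads.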

Code

* analyzeLogs-filecount: http://homeostasis.scs.carleton.ca/~soma/webfund-2016w/code/analyzeLogs-filecount.zip

Note this version has node_modules removed; copy the node_modules directory from analyzeLogs-sol or run "npm install".