WebFund 2016W Lecture 23

From Soma-notes

Video

The video from the lecture given on April 5, 2016 is now available.

Notes

In class

Lecture 23
----------

Scalability

* You replicate your web application, which should be
  "embarrassingly parallel" (no direct interaction)

* Communication between servers happens through the
  backend database

Why not have the web servers talk directly to each other?
 - you then have to figure out how to do
   synchronization/concurrency right
 - that's what databases are for!

So how in the world do you scale up databases?

First answer: use a minimal solution
 - only get the functionality that you want

First rule of scalability
 - you can't do everything at scale

So, you have to choose what you will do

Why are sacrifices necessary?

latency versus bandwidth

bandwidth: bits transferred per second on average
latency: time to get first bit of response after request

Consider a large truck full of hard disks driving
across Canada.
  - very, very high bandwidth
  - very, very high latency as well!
    (2 weeks to get first bit of response)

Ideally, you want high bandwidth and low latency
 - bandwidth you get through parallelism
 - latency has to be engineered

A "supercomputer" is one with low-latency memory access,
for LOTS of memory
  - so it has to have fast interconnects
  - thus, accesses to different nodes aren't much
    slower than local accesses

Challenge for large web apps is having the database
answer queries with low latency

But some amount of latency is inevitable
 - speed of light is finite

So if you want fast access to your webserver worldwide
 - you need to replicate across the globe
 - be close to your clients


NoSQL databases became popular because of latency
concerns
 - you needed to be as fast as possible,
 - so strip it to the bone

Use an in-memory key-value store if it is sufficient
  - lowest latency
  - least functionality

If you have to, use an SQL database
  - highest latency
  - most functionality

Or use something in between (MongoDB)

Once you choose the type of database, you OPTIMIZE
 - minimize I/O and computation required per access
   (read or write)
 - example: query optimization
 - how you form the query
   - how database is organized

Count the number of web pages that have the word
 "amazing" in them

How?
 - first, need a database with a copy of the web pages
 - then, you could do linear search through all
   of the web pages...

I ask this because a web search is a massive challenge
in query optimization

 - need to limit scope as early as possible in query
 - organize data so queries are quick to be answered
    - precompute as much as possible

The best you can do is table lookup. So have the right
tables ready!

Key tool is making an INDEX
 - table of search term and pointers to data

E.g., you have a table of customers sorted by ID
 - have an index of names, so a table of names versus
   IDs

Student Notes

Counting Log Entries

  • Exercise related to Assignment 6: The following is to help you get started on adding single-page functionality to the web app.
    • How do we get the number of log files and the number of log entries from the database? We ask the database.
    • What do we want to do? Count the number of entries. Use the count method.
      • db.logs.count() counts records
    • How many distinct values are there?
      • db.logs.distinct("file") returns the distinct file names; its length is the number of files we have
    • The placeholder information is currently filled in on the server
      • How? First go to views. The Jade file already has numfiles and numentries variables which are passed in to the template through the call to render() in the routes. How do we update them?
    • What problems do we run into when using the count() and distinct() methods?
      • Reloading the page is one problem
      • The real problem is with the callbacks
        • We need two separate database operations to get the values. We could combine them, but that means nesting one database call inside the other's callback, which gets messy (see the sketch below)
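        • For example, the nested version would look something like this inside the route handler (a sketch only, reusing the render() variables mentioned above; error handling is omitted):

// sketch: both values fetched by nesting one callback inside the other
logsCollection.count({}, function(err, count) {
   logsCollection.distinct("file", function(err, files) {
      // only here do we finally have both values -- and proper
      // error handling at each level would make this messier still
      res.render('index', {numentries: count, numfiles: files.length});
   });
});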
        • Instead, we can make each one a separate route. Set up the following in index.js:
router.get('/count', function(req, res) {
   function reportCount(err, count) {
      if (err) {
         res.send(-1);
      } else {
         res.send(count);
      }
   }

   logsCollection.count({}, reportCount);
});
  • A problem with this code is that a bare number passed to res.send() is interpreted as an HTTP response code
  • We can change the code to fix this:
router.get('/count', function(req, res) {
   function reportCount(err, count) {
      if (err) {
         res.sendStatus(500);
      } else {
         res.send({count: count});
      }
   }

   logsCollection.count({}, reportCount);
});
  • And the route for the file count (named to match the /storedFileCount request the client-side script makes below):
router.get('/storedFileCount', function(req, res) {
   function reportStoredFiles(err, files) {
      if (err) {
         res.sendStatus(500);
      } else {
         // distinct() gives an array of file names; its length is the count
         res.send({count: files.length});
      }
   }

   logsCollection.distinct("file", reportStoredFiles);
});
  • This code allows us to get these values to the client
  • Currently, we do not have any client-side code to do anything with the values
  • We want to somehow update the page to reflect the correct values
  • We could fill the values in on the server side before sending the page to the client, but that can be a pain, so let's make an AJAX request from the browser and update the DOM instead
  • We will need a client-side script to do this
  • We can reference exam-storage to see how to link our scripts in the Jade templates
    • In layout.jade, we want to link the jquery script since we will be using it
    • From account.jade, we can see where our main client-side script was linked. We will need to do something like this for index.jade
    • Make sure that the scripts are stored somewhere in the public directory and that the paths in the links are correct
  • We can add some default text into index.jade to display if we do not have the actual file and log entry counts
  • Then we need to write our query.js script
    • The functions in this script are used to update the number of logs and number of entries shown in the browser. So, when you reload the page after uploading a file, the number of logs and entries also get updated accordingly.
      • updateStats() is used to make the AJAX requests to the server to get the updated values
      • updateStatsText() is used to update the DOM with the new values
$(function() {
   var numEntries = 0;
   var numFiles = 0;
   var stats = $("#stats");

   function updateStatsText() {
      stats.html("Currently we have " + numEntries +
                 " log entries in " + numFiles + " log files.");
   }

   function updateStats() {
      var numUpdated = 0;

      $.getJSON("/count", function(v) {
         numEntries = v.count;
         numUpdated++;

         // only update the text once both requests have returned
         if (numUpdated >= 2) {
            updateStatsText();
         }
      });

      $.getJSON("/storedFileCount", function(v) {
         numFiles = v.count;
         numUpdated++;

         if (numUpdated >= 2) {
            updateStatsText();
         }
      });
   }

   updateStats();
});
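  • The numUpdated counter works, but jQuery can also coordinate the two requests for us. An alternative sketch using $.when (same routes and variables assumed as above):

// alternative: let jQuery wait for both AJAX calls to finish
$.when($.getJSON("/count"), $.getJSON("/storedFileCount"))
   .done(function(countResult, fileResult) {
      // with multiple requests, $.when hands each result over
      // as an array of [data, statusText, jqXHR]
      numEntries = countResult[0].count;
      numFiles = fileResult[0].count;
      updateStatsText();
   });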
  • With the assignment, everything should show up at the bottom of the page.
  • This means updating the DOM in a similar fashion to what we have done here

Scalability

  • When replicating your web application, your code should be “embarrassingly parallel” (no direct interaction)
  • Communication between servers happens through the backend database
  • Why not have the web servers talk directly to each other?
    • You have to figure out how to do synchronization/concurrency right
    • That's what databases are for! Let it deal with the problem
  • So, how in the world do you scale up databases?
  • First answer: use a minimal solution
    • Only get the functionality that you want
  • Basic/first rule of scalability
    • You can't do everything at scale
  • So you have to choose what you will do; you have to make some sacrifices
  • Why are sacrifices necessary?
  • Latency vs Bandwidth
    • Bandwidth: bits transferred per second on average
    • Latency: time to get first bit of response after request
  • Consider a large truck full of hard disks driving across Canada. What's the bandwidth of this truck? Low bandwidth or high bandwidth?
    • Very, very high bandwidth: the truck takes two weeks to cross the country, but when it arrives, an enormous amount of data has been transferred (see the estimate below)
    • Very, very high latency as well! (2 weeks to get the first bit of the response)
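    • A back-of-the-envelope estimate (the disk count and capacity are made-up numbers, purely for illustration):

// rough estimate of the truck's "bandwidth" (all numbers assumed)
var disks = 1000;                      // hard disks in the truck
var bytesPerDisk = 4e12;               // 4 TB each
var tripSeconds = 14 * 24 * 60 * 60;   // two weeks of driving

var bits = disks * bytesPerDisk * 8;   // total bits delivered
console.log((bits / tripSeconds / 1e9).toFixed(1) + " Gbit/s");
// ~26.5 Gbit/s -- far more than most network links,
// but the first bit arrives two weeks after you ask for it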
  • Ideally you want HIGH BANDWIDTH and LOW LATENCY
    • Bandwidth you get through parallelism
    • Latency has to be engineered.
  • Latency is hard, so:
    • A "supercomputer" is one with low-latency memory access to lots of memory
  • Challenge for large web apps is having the database answer queries with low latency
  • If a user goes to your website, the page should load fast; even if it is only a bit slow, they will be unhappy. So you need low latency.
  • But some amount of latency is inevitable (laws of physics, Einstein)
    • The speed of light is finite. Light travels about 30 cm in a nanosecond (a short piece of wire); a microsecond is a good-sized spool; a millisecond is a thousand times larger again.
    • How far do signals go?
      • Microprocessors operate on the nanoseconds
      • Networks on the milliseconds
      • Ping time to get to the other side of the globe: 100s of milliseconds
        • Once delays get past roughly 50 milliseconds to half a second, people notice
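      • A quick sanity check of those numbers, using the vacuum speed of light (signals in fiber are roughly a third slower):

// how far do signals travel on each time scale?
var c = 3e8;                                  // speed of light, m/s (vacuum)
console.log(c * 1e-9 + " m per ns");          // 0.3 m  -- a short piece of wire
console.log(c * 1e-6 + " m per us");          // 300 m  -- a good-sized spool
console.log(c * 1e-3 / 1000 + " km per ms");  // 300 km

// halfway around the globe and back (~40,000 km round trip)
console.log(4e7 / c * 1000 + " ms minimum ping");  // ~133 ms, before any routing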
  • So if you want fast access to your web server worldwide
    • You need to replicate across the globe (this is what a CDN/content delivery network does)
    • Be close to your clients
  • In general, the more functionality a database has, the longer its operations take
  • NoSQL databases became popular because of latency concerns
    • You needed to be as fast as possible
    • So strip it to the bone
  • Use an in-memory key value store if it is sufficient
    • Lowest latency
    • Least functionality
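    • As a toy illustration of "least functionality" (this is a plain JavaScript Map, not a real store like Redis):

// toy in-memory key-value store: lowest latency, least functionality
var store = new Map();

store.set("user:42:name", "Alice");       // O(1) write, no disk I/O
console.log(store.get("user:42:name"));   // O(1) read: "Alice"

// no query language, no joins, no secondary indexes --
// if you need "all users named Alice", this is the wrong tool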
  • If you have to, use an SQL database
    • Highest latency
    • Most functionality
  • Or use something in between (MongoDB)
  • Once you choose the type of database, you optimize
    • Minimize I/O and computation required per access (read or write)
    • Example: query optimization (see the sketch below)
    • How you form the query
    • How the database is organized
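    • For instance, with our logs collection from earlier (the "level" field is assumed, for illustration only), the same question can cost very different amounts of I/O:

// BAD: ship every document to the app, then count there
logsCollection.find({}).toArray(function(err, docs) {
   var n = docs.filter(function(d) { return d.level === "error"; }).length;
});

// BETTER: let the database limit the scope as early as possible
logsCollection.count({level: "error"}, function(err, n) {
   // only the count crosses the wire, and an index on "level"
   // would let the database answer without scanning every document
});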


  • Why is Google making the big bucks? How do you query the entire Web?
    • Google is good at query optimization
  • Google has a copy of everything on the Web. Their Web crawlers do this for them
  • How do you make a query of the entire Web?
    • Use a very big database!
    • The interface to query through Google is just a regular website, it's just working with very large data sets!
  • How can we count the number of web pages that have the word "amazing" in them?
    • First you need a database with a copy of all the web pages
    • Then you could do a linear search through all of the pages of the web (this is not an efficient solution)
  • I ask this because a web search is a massive challenge in query optimization
    • We need to limit the scope as early as possible in the query
    • Organize data so queries can be answered quickly
    • Precompute as much as possible
  • The best you can do is table lookup. So have the right tables ready!
  • When you do a Google search, it is picking an answer out of answers it has already computed
  • A key tool is making an INDEX
    • Table of search terms and pointers to data
    • e.g. if you have a table of customers sorted by ID
      • Have an index of names: a table mapping names to IDs (see the sketch below)
    • An index is a very narrow table: one column for the indexed attribute and a second for the ID
    • Is this what Google is doing? No, they are doing much fancier things, but they won't tell anyone; it is layers and layers of proprietary stuff
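    • In MongoDB shell terms (the collection and field names are assumed, for illustration only):

// without an index, this query scans every customer document
db.customers.find({name: "Alice"});

// build an index on the "name" field: a sorted table of
// (name, document pointer) pairs maintained by the database
db.customers.createIndex({name: 1});

// the same query now does an index lookup instead of a full scan
db.customers.find({name: "Alice"});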
  • Microsoft has about 100 million lines of code, the Linux kernel is about 20 million lines, and Google is estimated at about 2 billion lines. This is not surprising; there is a lot of functionality
  • Look up MapReduce. You can use it to find all web pages with "amazing": you have a cluster of machines with the data spread across them, you tell each machine to count the number of times "amazing" appears in the pages it holds (the map), and then you accumulate the per-machine counts into a single total (the reduce). It can also handle failures during a query: the data is stored in a special replicated way, so if one node fails, others can take over its share of the job. It is functional programming at scale.
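    • A toy, single-process sketch of that shape (a real MapReduce run distributes the map step across machines; the data here is invented):

// pages, grouped as if each inner array lived on a different machine
var machines = [
   ["an amazing page", "nothing here"],
   ["amazing amazing stuff"],
   ["plain text", "one amazing word"]
];

// map: each "machine" counts occurrences of "amazing" in its own pages
var partialCounts = machines.map(function(pages) {
   return pages.reduce(function(sum, page) {
      return sum + (page.match(/amazing/g) || []).length;
   }, 0);
});

// reduce: accumulate the per-machine counts into a single total
var total = partialCounts.reduce(function(a, b) { return a + b; }, 0);

console.log(total); // 4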

Code

Note this version has node_modules removed; copy this directory from analyzeLogs-sol or run "npm install".