WebFund 2016W Lecture 23

From Soma-notes

Video

The video from the lecture given on April 5, 2016 is now available.

Notes

In class

Lecture 23
----------

Scalability

* You replicate your web application, which should be
  "embarrassingly parallel" (no direct interaction)

* Communication between servers happens through the
  backend database

Why not have the web servers talk directly to each other?
 - you then have to figure out how to do
   synchronization/concurrency right
 - that's what databases are for!

So how in the world do you scale up databases?

First answer: use a minimal solution
 - only get the functionality that you want

First rule of scalability
 - you can't do everything at scale

So, you have to choose what you will do

Why are sacrifices necessary?

latency versus bandwidth

bandwidth: bits transferred per second on average
latency: time to get first bit of response after request

Consider a large truck full of hard disks driving
across Canada.
  - very, very high bandwidth
  - very, very high latency as well!
    (2 weeks to get first bit of response)

Ideally, you want high bandwidth and low latency
 - bandwidth you get through parallelism
 - latency has to be engineered

A "supercomputer" is one with low-latency memory access,
for LOTS of memory
  - so it has to have fast interconnects
  - thus, accesses to different nodes aren't much
    slower than local accesses

Challenge for large web apps is having the database
answer queries with low latency

But some amount of latency is inevitable
 - speed of light is finite

So if you want fast access to your webserver worldwide
 - you need to replicate across the globe
 - be close to your clients


NoSQL databases became popular because of latency
concerns
 - you needed to be as fast as possible,
 - so strip it to the bone

Use an in-memory key-value store if it is sufficient
  - lowest latency
  - least functionality

If you have to, use an SQL database
  - highest latency
  - most functionality

Or use something in between (MongoDB)

Once you choose the type of database, you OPTIMIZE
 - minimize I/O and computation required per access
   (read or write)
 - example: query optimization
 - how you form the query
   - how database is organized

Count the number of web pages that have the word
 "amazing" in them

How?
 - first, need a database with a copy of the web pages
 - then, you could do linear search through all
   of the web pages...

I ask this because a web search is a massive challenge
in query optimization

 - need to limit scope as early as possible in query
 - organize data so queries are quick to be answered
    - precompute as much as possible

The best you can do is table lookup. So have the right
tables ready!

Key tool is making an INDEX
 - table of search term and pointers to data

E.g., you have a table of customers sorted by ID
 - have an index of names, so a table of names versus
   IDs

Student Notes

Counting Log Entries

  • Exercise related to Assignment 6: The following is to help you get started on adding single-page functionality to the web app.
    • How do we get the number of log files and the number of log entries from the database? We ask the database.
    • What do we want to do? Count the number of entries. Use the count method.
      • db.logs.count() counts records
    • How many distinct values are there?
      • db.logs.distinct("file") returns the distinct file names; its length is the number of files we have
    • The placeholder information is currently filled in on the server
      • How? First go to views. The Jade file already has numfiles and numentries variables which are passed in to the template through the call to render() in the routes. How do we update them?
    • What problems do we run into when using the count() and distinct() methods?
      • Reloading the page is one problem
      • The real problem is with the callbacks
        • We need two separate database operations to get the values. We could combine them, but that means nesting one database call inside the other's callback, which gets messy (see the sketch below)
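        • For example, the nested version would look something like this inside the route handler (a sketch only, reusing the render() variables mentioned above; error handling is omitted):

// sketch: both values fetched by nesting one callback inside the other
logsCollection.count({}, function(err, count) {
   logsCollection.distinct("file", function(err, files) {
      // only here do we finally have both values -- and proper
      // error handling at each level would make this messier still
      res.render('index', {numentries: count, numfiles: files.length});
   });
});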
        • Instead, we can make each one a separate route. Set up the following in index.js:
router.get('/count', function(req, res) {
   function reportCount(err, count) {
      if (err) {
         res.send(-1);
      } else {
         res.send(count);
      }
   }

   logsCollection.count({}, reportCount);
});
  • A problem with this code is that a bare number passed to res.send() is interpreted as an HTTP response code
  • We can change the code to fix this:
router.get('/count', function(req, res) {
   function reportCount(err, count) {
      if (err) {
         res.sendStatus(500);
      } else {
         res.send({count: count});
      }
   }

   logsCollection.count({}, reportCount);
});
  • And the route for the file count (named to match the /storedFileCount request the client-side script makes below):
router.get('/storedFileCount', function(req, res) {
   function reportStoredFiles(err, files) {
      if (err) {
         res.sendStatus(500);
      } else {
         // distinct() gives an array of file names; its length is the count
         res.send({count: files.length});
      }
   }

   logsCollection.distinct("file", reportStoredFiles);
});
  • This code allows us to get these values to the client
  • Currently, we do not have any client-side code to do anything with the values
  • We want to somehow update the page to reflect the correct values
  • We could fill the values in on the server side before sending the page to the client, but that can be a pain, so let's make an AJAX request from the browser and update the DOM instead
  • We will need a client-side script to do this
  • We can reference exam-storage to see how to link our scripts in the Jade templates
    • In layout.jade, we want to link the jquery script since we will be using it
    • From account.jade, we can see where our main client-side script was linked. We will need to do something like this for index.jade
    • Make sure that the scripts are stored somewhere in the public directory and that the paths in the links are correct
  • We can add some default text into index.jade to display if we do not have the actual file and log entry counts
  • Then we need to write our query.js script
    • The functions in this script are used to update the number of logs and number of entries shown in the browser. So, when you reload the page after uploading a file, the number of logs and entries also get updated accordingly.
      • updateStats() is used to make the AJAX requests to the server to get the updated values
      • updateStatsText() is used to update the DOM with the new values
$(function() {
   var numEntries = 0;
   var numFiles = 0;
   var stats = $("#stats");

   function updateStatsText() {
      stats.html("Currently we have " + numEntries +
                 " log entries in " + numFiles + " log files.");
   }

   function updateStats() {
      var numUpdated = 0;

      $.getJSON("/count", function(v) {
         numEntries = v.count;
         numUpdated++;

         // only update the text once both requests have returned
         if (numUpdated >= 2) {
            updateStatsText();
         }
      });

      $.getJSON("/storedFileCount", function(v) {
         numFiles = v.count;
         numUpdated++;

         if (numUpdated >= 2) {
            updateStatsText();
         }
      });
   }

   updateStats();
});
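  • The numUpdated counter works, but jQuery can also coordinate the two requests for us. An alternative sketch using $.when (same routes and variables assumed as above):

// alternative: let jQuery wait for both AJAX calls to finish
$.when($.getJSON("/count"), $.getJSON("/storedFileCount"))
   .done(function(countResult, fileResult) {
      // with multiple requests, $.when hands each result over
      // as an array of [data, statusText, jqXHR]
      numEntries = countResult[0].count;
      numFiles = fileResult[0].count;
      updateStatsText();
   });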
  • With the assignment, everything should show up at the bottom of the page.
  • This means updating the DOM in a similar fashion to what we have done here

Scalability

  • When replicating your web application, your code should be “embarrassingly parallel” (no direct interaction)
  • Communication between servers happens through the backend database
  • Why not have the web servers talk directly to each other?
    • You have to figure out how to do synchronization/concurrency right
    • That's what databases are for! Let it deal with the problem
  • So, how in the world do you scale up databases?
  • First answer: use a minimal solution
    • Only get the functionality that you want
  • Basic/first rule of scalability
    • You can't do everything at scale
  • So you have to choose what you will do; you have to make some sacrifices
  • Why are sacrifices necessary?
  • Latency vs Bandwidth
    • Bandwidth: bits transferred per second on average
    • Latency: time to get first bit of response after request
  • Consider a large truck full of hard disks driving across Canada. What's the bandwidth of this truck? Low bandwidth or high bandwidth?
    • Very, very high bandwidth: the truck takes two weeks to cross the country, but when it arrives, an enormous amount of data has been transferred (see the estimate below)
    • Very, very high latency as well! (2 weeks to get the first bit of the response)
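    • A back-of-the-envelope estimate (the disk count and capacity are made-up numbers, purely for illustration):

// rough estimate of the truck's "bandwidth" (all numbers assumed)
var disks = 1000;                      // hard disks in the truck
var bytesPerDisk = 4e12;               // 4 TB each
var tripSeconds = 14 * 24 * 60 * 60;   // two weeks of driving

var bits = disks * bytesPerDisk * 8;   // total bits delivered
console.log((bits / tripSeconds / 1e9).toFixed(1) + " Gbit/s");
// ~26.5 Gbit/s -- far more than most network links,
// but the first bit arrives two weeks after you ask for it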
  • Ideally you want HIGH BANDWIDTH and LOW LATENCY
    • Bandwidth you get through parallelism
    • Latency has to be engineered.
  • Latency is hard, so:
    • A "supercomputer" is one with low-latency memory access to lots of memory
  • Challenge for large web apps is having the database answer queries with low latency
  • If a user goes to your website, the page should load fast; even if it is only a bit slow, they will be unhappy. So you need low latency.
  • But some amount of latency is inevitable (laws of physics, Einstein)
    • The speed of light is finite. Light travels about 30 cm in a nanosecond (a short piece of wire); a microsecond is a good-sized spool; a millisecond is a thousand times larger again.
    • How far do signals go?
      • Microprocessors operate on the nanoseconds
      • Networks on the milliseconds
      • Ping time to get to the other side of the globe: 100s of milliseconds
        • Once delays get past roughly 50 milliseconds to half a second, people notice
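      • A quick sanity check of those numbers, using the vacuum speed of light (signals in fiber are roughly a third slower):

// how far do signals travel on each time scale?
var c = 3e8;                                  // speed of light, m/s (vacuum)
console.log(c * 1e-9 + " m per ns");          // 0.3 m  -- a short piece of wire
console.log(c * 1e-6 + " m per us");          // 300 m  -- a good-sized spool
console.log(c * 1e-3 / 1000 + " km per ms");  // 300 km

// halfway around the globe and back (~40,000 km round trip)
console.log(4e7 / c * 1000 + " ms minimum ping");  // ~133 ms, before any routing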
  • So if you want fast access to your web server worldwide
    • You need to replicate across the globe (this is what a CDN/content delivery network does)
    • Be close to your clients
  • In general, the more functionality a database has, the longer its operations take
  • NoSQL databases became popular because of latency concerns
    • You needed to be as fast as possible
    • So strip it to the bone
  • Use an in-memory key value store if it is sufficient
    • Lowest latency
    • Least functionality
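    • As a toy illustration of "least functionality" (this is a plain JavaScript Map, not a real store like Redis):

// toy in-memory key-value store: lowest latency, least functionality
var store = new Map();

store.set("user:42:name", "Alice");       // O(1) write, no disk I/O
console.log(store.get("user:42:name"));   // O(1) read: "Alice"

// no query language, no joins, no secondary indexes --
// if you need "all users named Alice", this is the wrong tool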
  • If you have to, use an SQL database
    • Highest latency
    • Most functionality
  • Or use something in between (MongoDB)
  • Once you choose the type of database, you optimize
    • Minimize I/O and computation required per access (read or write)
    • Example: query optimization (see the sketch below)
    • How you form the query
    • How the database is organized
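    • For instance, with our logs collection from earlier (the "level" field is assumed, for illustration only), the same question can cost very different amounts of I/O:

// BAD: ship every document to the app, then count there
logsCollection.find({}).toArray(function(err, docs) {
   var n = docs.filter(function(d) { return d.level === "error"; }).length;
});

// BETTER: let the database limit the scope as early as possible
logsCollection.count({level: "error"}, function(err, n) {
   // only the count crosses the wire, and an index on "level"
   // would let the database answer without scanning every document
});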


  • Why is Google making the big bucks? How do you query the entire Web?
    • Google is good at query optimization
  • Google has a copy of everything on the Web. Their Web crawlers do this for them
  • How do you make a query of the entire Web?
    • Use a very big database!
    • The interface to query through Google is just a regular website, it's just working with very large data sets!
  • How can we count the number of web pages that have the word "amazing" in them?
    • First you need a database with a copy of all the web pages
    • Then you could do a linear search through all of the pages of the web (this is not an efficient solution)
  • I ask this because a web search is a massive challenge in query optimization
    • We need to limit the scope as early as possible in the query
    • Organize data so queries can be answered quickly
    • Precompute as much as possible
  • The best you can do is table lookup. So have the right tables ready!
  • When you do a Google search, it is picking an answer out of answers it has already computed
  • A key tool is making an INDEX
    • Table of search terms and pointers to data
    • e.g. if you have a table of customers sorted by ID
      • Have an index of names: a table mapping names to IDs (see the sketch below)
    • An index is a very narrow table: one column for the indexed attribute and a second for the ID
    • Is this what Google is doing? No, they are doing much fancier things, but they won't tell anyone; it is layers and layers of proprietary stuff
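    • In MongoDB shell terms (the collection and field names are assumed, for illustration only):

// without an index, this query scans every customer document
db.customers.find({name: "Alice"});

// build an index on the "name" field: a sorted table of
// (name, document pointer) pairs maintained by the database
db.customers.createIndex({name: 1});

// the same query now does an index lookup instead of a full scan
db.customers.find({name: "Alice"});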
  • Microsoft has about 100 million lines of code, the Linux kernel is about 20 million lines, and Google is estimated at about 2 billion lines. This is not surprising; there is a lot of functionality
  • Look up MapReduce. You can use it to find all web pages with "amazing": you have a cluster of machines with the data spread across them, you tell each machine to count the number of times "amazing" appears in the pages it holds (the map), and then you accumulate the per-machine counts into a single total (the reduce). It can also handle failures during a query: the data is stored in a special replicated way, so if one node fails, others can take over its share of the job. It is functional programming at scale.
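    • A toy, single-process sketch of that shape (a real MapReduce run distributes the map step across machines; the data here is invented):

// pages, grouped as if each inner array lived on a different machine
var machines = [
   ["an amazing page", "nothing here"],
   ["amazing amazing stuff"],
   ["plain text", "one amazing word"]
];

// map: each "machine" counts occurrences of "amazing" in its own pages
var partialCounts = machines.map(function(pages) {
   return pages.reduce(function(sum, page) {
      return sum + (page.match(/amazing/g) || []).length;
   }, 0);
});

// reduce: accumulate the per-machine counts into a single total
var total = partialCounts.reduce(function(a, b) { return a + b; }, 0);

console.log(total); // 4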

Code

Note this version has node_modules removed; copy this directory from analyzeLogs-sol or run "npm install".