WebFund 2015W Lecture 12

Video

The video from the lecture given on February 23, 2015 is now available.

Notes

On-Screen Notes

Lecture 12
----------

Midterms to be returned Wednesday

Today: MongoDB and databases


Parts of a web application:
* Front End: client-side code (HTML, CSS, JavaScript)
  - runs in the browser
* Back End: server-side code
  - web server: PHP, Python, Java, *JavaScript* (node)
  - database: MySQL, Oracle, PostgresQL, MongoDB, ...

Early web servers just used a filesystem
 - good for static content
 - bad for concurrent access


Why concurrency?
 - many different web clients trying to modify the "same" resources
 - e.g., passwords stored in a file
 - *really* bad when using multiple web servers

Databases are good at
 - fine-grained data concurrency
 - not as easy to "scale" (cause concurrency is hard)
 - that's SOMEBODY ELSE's PROBLEM

Why not just have web clients talk to the database?
 (cut out the middle man)
 - security: don't want all clients to access ALL data
    - databases do have access control but generally not suitable for most web applications
    - also, data validation
 - scalability
    - databases are relatively slow
    - often stick a caching layer in front/separate out static content


Types of databases
------------------
SQL versus NoSQL databases

Actually, many kinds of NoSQL databases

Difference is *transactions*


SQL => relational databases

Data is divided into tables

tables have columns (fields) and rows (records)

Columns have types (date, string, number, etc)

Table for customers
fields (columns)
 - Customer ID (primary key)
 - name
 - street
 - city
 - province
 - etc.

Table for invoices
 - Invoice ID (primary key)
 - Customer ID
 - what they purchased
 - amount


Relations connect tables

Relations often work best with unique keys
 - fields where every record has a unique value


Scenario
 - creating first invoice for a new customer
 - problem: system crashes after adding the invoice but before adding the customer's info
 - transaction is what makes sure this never happens
   - if just an invoice or a customer change, transaction is not complete and data is thrown away

Transaction:
 - start transaction
 - make your changes
    - add the invoice
    - add the customer
 - commit transaction if all changes succeeded, otherwise abort

Relational databases have *overhead* and *complexty* and *security issues*


NoSQL => no transactions

We're using MongoDB
 - "document store"
 - i.e., JSON store (Technically, BSON)
 - no built-in relations
 - however, MongoDB can run code...JavaScript

In SQL databases
 - fields are pre-defined and relatively rigid

In MongoDB
 - can have any properties in any document
 - no consistency requirement

Indexing


 - records/documents are stored in order of primary keys
    - fast to find by primary key
 - what if I want to find records/documents by something that *isn't* a primary key
    - e.g., invoices by what was purchased, customers by city

basic search
 - if sorted, then you do binary search
 - maybe store so you can index into an arry/hash table
 - otherwise...exhaustive search

Indexing is a way to get around exhaustive search

An index is a data structure associating:
 * a field or fields with
 * the primary key

Index by city in customers
 - sorted list of cities and customer IDs

Why not index on every field?
 - space
 - time of insert/add operation

Databases

There are 3 parts to a web application:

The front end, which consists of code that runs in a browser - HTML, CSS, JavaScript
The back end, which has 2 parts:
- The web server - may be written in PHP, Python, Java, JavaScript, etc.
- The database - MySQL, Oracle, PostgreSQL, MongoDB, etc.

Why is a there a web server and why a data base?

It didn't used to be this way. Early web servers just used a file system. This is fine for static content but bad for concurrency.

Why do we need concurrency?

Many different clients may be trying to modify the same resources. This is an even larger problem if there are multiple web servers trying to access the same resources. We need to scale on the web and the easiest way to scale is to have multiple web servers. But then multiple web servers are trying to access the same resources. Even for distributed file systems, this is a problem. When a modification is made to a file, it may take some time for the changes to propagate and there may be syncing issues.

Databases are good at finely grained concurrency. They are not as easy to scale since it is difficult to manage and scale this level of concurrency. Typically these problems only get worked out through code being used extensively to fix bugs as they come up. We do not need to worry about this because that is somebody else's problem. There are existing services, database management systems, which handle all of this for us. We just need to pick one.

Why not just have web clients talk directly to the database (cut out the middle man)?

Security: We don't want all clients to have access to ALL of the data. Databases do have access control but generally not suitable for most web applications. You always want a web server in between just to make sure that no one messes with the data.
Data validation: Logic for validation is better to be implemented in the web server.
Scalability: Databases are relatively slow. It is often better to stick a caching layer in front and separate out the content.

Good web app design is often about “how do I keep from accessing the database too often?”.

SQL versus NoSQL

There are actually many kinds of NoSQL databases. The main difference is transactions. Only SQL supports transactions.

SQL (Standard Query Language) is a query language for relational databases. In a relational database, the data is divided into tables which have columns (fields) and rows(records). Each column has a type (date, string, number, etc.). The rows represent entries in the table. For example, if you have a table of customers, one row from the table represents a single customer - that is a record.

Databases are a way of creating persistent data structures that are richer in structure than files. Files are just byte streams. Databases have structure, the fundamental in this case being tables.

A table for customers might have the following fields (columns):

Customer ID (Primary key)
Name
Street
City
Province

Tables in a relational database are connected by relations. This works best with unique keys. A primary key is a field in a table with unique values (like the Customer ID in the example above). The primary key of one table can be included as a field in another table to create a relation between them.

Suppose now we have another table, this time for invoices:

Invoice ID (Primary key)
Customer ID (Foreign key - the relation to the Customers table)
Purchase
Amount

A relation between the Invoices table and the Customers table can be seen with the Customer ID. There should never be a case where a relation refers to a key value that does not exist in the table which has the primary key. In this example, every Customer ID in the Invoices table should refer to a valid Customer - one which exists in the Customers table. Therefore the application logic must take care of this. If you’re updating the Invoices table we should make sure the customer also exists.

Scenario:

Create the first invoice for a new customer
Add the record to the Invoices table
The system crashes before a record was added to the Customers table

In this scenario, we now have a relation which refers to a key that does not exist. Transactions are used to make sure that this never happens. If only part of the data gets inserted, it will not be committed. The transaction will only allow the data to actually be committed once everything is done.

Begin the transaction
Make changes...
Commit the transaction

In this course we will not be using relational databases because they have:

Overhead
Complexity
Security issues

MongoDB

Instead we will be using MongoDB, which is a document store. It is designed to store objects in the JSON (technically BSON) format. In the MongoDB terminology these objects are called documents. If we have an object in JavaScript, we can send it to a database in MongoDB and it will just save it within the collections in the database. It can later be retrieved via queries.

MongoDB does not have built-in relations but it does have the ability to run JavaScript code. Additionally, the structuring in MongoDB is not as rigid as in SQL databases. With SQL, fields are predefined and are rarely changed. In MonogoDB, we can have any properties in any document - there is no requirement for consistency.

Indexing

In MongoDB we still have the concept of a key. Documents are stored in a collection in the order of their primary key. This allows for quick searches based on the primary key.

What if we want to find documents by something that isn't a primary key?

If the documents are sorted on the value that we want to search by then a binary search can be done. Otherwise an exhaustive search must be run. Indexing is a way to get around exhaustive search!

An index is a data structure which associates one or more fields with the primary key. For example, if an index is made for customers on their 'City' values then we make a sorted list of cities associated with customer IDs.

Why not just make an index for every field or combinations of fields?

Every entry in an index structure takes up space.
Every time we add something we update every other index.

On the tutorial

In the tutorial, there are two programs provided - filestats and storenotes. The idea is to combine the two to make a single program - insertnotes.

filestats.js

This program opens a specific file, reads it and then prints to the console.

Some notes on parts of the code:

Var theFile = process.argv[2] - This takes the command line argument from position 2 (not safe in C as the argument might not have as many arguments, but in JS does not matter if it undefined!).
(theFile === undefined) checks if a file was given in the first place or not. If one was provided, the program will read it using fs.readFile
Utf8 - Encoding for text
The callback to getFileStats() checks for file reading errors and prints the characters and lines if there were no errors.
data.split() makes the data into an array of lines.

Database interaction

Mongodb process is running in the background (ps aux | grep mongo will show database running). Type mongo in a terminal to run the Mongo client. We can type help for assistance.

Type show dbs to show all of the databases which exist. in MongoDB, a database is a set of collections (a collection is made up of documents). To select a database, type use dbname. Then type show collections to see the collections inside the selected database.

Once a database is selected, type db.collectionname.find() to query a collection. Since no parameters are given to the find function, this will return all documents.

storenotes.js

Before running storenotes we need to install the required module. npm install mongodb installs the module needed for talking to MongoDB databases. Installing does a build process..to run some native code.

In the storenotes script, notes is an array of documents with title, owner and content.

mc.connect will connect to MongoDB. By default MongoDB is configured to listen on localhost port 27017.

mc.connect(‘mongodb://localhost/persistent-notes’,(function(err, db) {... connects to MongoDB and uses the persistent-notes database. The callback function is called when the connection is established.

There are a bunch of functions defined below that. These are chained together to drop the notes collection, make an index using the owner property and add some notes to the collection.