Notes
Lecture 8: NASD & GFS
---------------------
Questions?
- is NASD a NAS?
- how cost efficient? NASD or GFS?
- with the file server out of the loop, are there security issues with
NASD?
- checkpoint system?
- NASD in use?
- why just kill chunkservers? Not shut down?
- GFS file security?
NAS is just a file server
- dedicated, but a file server
- generally use standard network file sharing protocols
(CIFS, NFS)
NASD is a different beast
- disks are object servers, not file servers
- objects are just variable-sized chunks of data + metadata
(no code)
- contrast with blocks
With object-based distributed filesystems, we've added a level of indirection
- file server translates files to sets of objects and handles file
metadata
- object servers store objects
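A minimal sketch of this split, to make the indirection concrete. All names here (ObjectServer, FileServer, allocate, client_write, client_read) are hypothetical, and the round-robin placement is an assumption for illustration, not how NASD or GFS actually place data. The point is only that the file server holds the path-to-objects mapping while the object servers hold the bytes.

```python
class ObjectServer:
    """Stores opaque objects by ID; knows nothing about files or paths."""

    def __init__(self):
        self.objects = {}

    def put(self, object_id, data):
        self.objects[object_id] = data

    def get(self, object_id):
        return self.objects[object_id]


class FileServer:
    """Holds only metadata: path -> ordered list of (server index, object ID)."""

    def __init__(self, num_servers):
        self.num_servers = num_servers
        self.layouts = {}

    def allocate(self, path, num_objects):
        # Round-robin placement is an assumption made for this sketch.
        layout = [(n % self.num_servers, f"{path}#{n}") for n in range(num_objects)]
        self.layouts[path] = layout
        return layout

    def lookup(self, path):
        return self.layouts[path]


def client_write(fs, servers, path, data, object_size):
    """Client asks the file server for a layout, then writes to object servers itself."""
    pieces = [data[i:i + object_size] for i in range(0, len(data), object_size)]
    for (index, object_id), piece in zip(fs.allocate(path, len(pieces)), pieces):
        servers[index].put(object_id, piece)


def client_read(fs, servers, path):
    """One metadata lookup, then the data comes straight from the object servers."""
    return b"".join(servers[i].get(oid) for i, oid in fs.lookup(path))


servers = [ObjectServer() for _ in range(3)]
fs = FileServer(num_servers=3)
client_write(fs, servers, "/a", b"0123456789", object_size=4)
print(fs.lookup("/a"))
print(client_read(fs, servers, "/a"))
```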
Why add this level of indirection? Why not just use fixed-sized blocks?
(In GFS, we have chunks instead of objects, with a bit less metadata)
Objects are all about parallel access
- to enable performance
Client can ask for objects from multiple object servers at once
- file server doesn't have to be involved at all
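A rough sketch of the parallel read, assuming the client already has the file's layout. The servers are in-memory dictionaries standing in for network reads, and OBJECT_SERVERS, FILE_LAYOUT, fetch_object, and read_file are made-up names for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "object servers": server name -> {object ID -> bytes}.
OBJECT_SERVERS = {
    "drive-a": {"obj-1": b"Hello, "},
    "drive-b": {"obj-2": b"parallel "},
    "drive-c": {"obj-3": b"world"},
}

# Layout the client would fetch once from the file server: ordered (server, object ID) pairs.
FILE_LAYOUT = [("drive-a", "obj-1"), ("drive-b", "obj-2"), ("drive-c", "obj-3")]


def fetch_object(location):
    """Stand-in for a network read from a single object server."""
    server, object_id = location
    return OBJECT_SERVERS[server][object_id]


def read_file(layout):
    """Request every object concurrently, then join the parts in layout order."""
    with ThreadPoolExecutor(max_workers=len(layout)) as pool:
        parts = list(pool.map(fetch_object, layout))  # map preserves input order
    return b"".join(parts)


print(read_file(FILE_LAYOUT))  # b'Hello, parallel world'
```

Once the layout is known, the file server plays no part in the data path; only the object servers are contacted.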
The classic way we did redundancy & reliability in storage is with RAID
Sounds like most of you haven't used RAID
- Redundant Array of Inexpensive/Independent Disks
- idea is to combine multiple drives to get
larger, faster, more reliable storage
RAID-0: striping
RAID-1: mirroring
RAID-5: striping + parity
With RAID, data is distributed across disks at the block level
- drives have no notion of files, just blocks
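To show what "striping + parity" buys, here is a small sketch of XOR parity over one stripe. The block contents are made up, and real RAID-5 also rotates the parity block across disks, which this ignores.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, one per data disk in a stripe (contents are made up).
stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(stripe)

# Simulate losing the middle disk and rebuilding its block from the
# surviving data blocks plus the parity block.
survivors = [stripe[0], stripe[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)  # b'BBBB'
```

Note that everything here operates on anonymous blocks; nothing in the computation knows which file a block belongs to.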
The modern insight with distributed storage is that distributing at the block layer is too low-level
- better to distribute bigger chunks, like objects!
Read objects in parallel, rather than blocks
- files are big, so feasible to read multiple objects in parallel
We do "mirroring" with objects/chunks, i.e. have multiple copies
- parity/erasure codes mostly not worth the effort for
these systems (but later systems will use such things)
Security
- NASD security? How can clients securely access
individual drives?
In Linux, (POSIX) capabilities are a way to split up root privileges
- but that is actually not the "normal" meaning of capabilities
in a security context
Capabilities are tokens a process can present to a service to enable access
- separate authentication server gives out capability tokens
- idea is that the authentication server doesn't have to check
at access time; the check can be done in advance
With capabilities, the drives can control access without
needing to understand users, groups, etc.
- they just have to understand the tokens and have a way to
verify them
- make sure the tokens can't be faked!
Most single sign-on systems tend to have some sort of capability-like token underneath if they are really distributed
Note that capability tokens are ephemeral
- normally expire after a relatively short period of time (minutes or hours)
- needed to limit replay attacks (a stolen token only works until it expires)
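A generic sketch of an unforgeable, expiring token, assuming the issuing server and the drive share a secret key and the token is signed with an HMAC. This is not NASD's actual capability format; SHARED_SECRET, issue_capability, and drive_accepts are hypothetical names, and the "|"-separated message layout is an assumption for illustration.

```python
import hashlib
import hmac
import time

# Assumption: the issuing server and the drive share this secret; clients never see it.
SHARED_SECRET = b"demo-secret-known-to-issuer-and-drive"


def issue_capability(object_id, rights, lifetime_s, now=None):
    """Issuer side: bind object, rights, and expiry together, then sign them."""
    now = time.time() if now is None else now
    expires = int(now) + lifetime_s
    message = f"{object_id}|{rights}|{expires}".encode()
    tag = hmac.new(SHARED_SECRET, message, hashlib.sha256).hexdigest()
    return {"object_id": object_id, "rights": rights, "expires": expires, "tag": tag}


def drive_accepts(cap, object_id, wanted_right, now=None):
    """Drive side: check signature and expiry; no user or group lookup needed."""
    now = time.time() if now is None else now
    message = f"{cap['object_id']}|{cap['rights']}|{cap['expires']}".encode()
    expected = hmac.new(SHARED_SECRET, message, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, cap["tag"]):
        return False  # forged or tampered token
    if now >= cap["expires"]:
        return False  # expired token
    return cap["object_id"] == object_id and wanted_right in cap["rights"]


cap = issue_capability("obj-42", "r", lifetime_s=300, now=1000)
print(drive_accepts(cap, "obj-42", "r", now=1100))     # True: valid and unexpired
print(drive_accepts(cap, "obj-42", "r", now=2000))     # False: past expiry
forged = dict(cap, rights="rw")                        # client edits its own rights
print(drive_accepts(forged, "obj-42", "w", now=1100))  # False: tag no longer matches
```

The drive never contacts the issuer during the check, which is what keeps the issuer out of the data path.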
Imagine having 10,000 storage servers and one authentication server
- if auth server had to be involved in every file access,
would become a bottleneck
- but with capabilities it can issue them at a much slower rate
and sit back while mass data transfers happen
Capabilities are at the heart of NASD
What about GFS?
- nope, assumes a trusted data center
- I think it has UNIX-like file permissions, but
nothing fancy
- just to prevent accidental file damage
What was GFS for?
- building a search engine
- i.e., downloading and indexing the entire web!
- data comes in from crawlers
- indices built as batch jobs
Are GFS files regular files?
- they are weird because they are sets of records
- records can be duplicated, so each must have a unique ID
- for a record, think web page
- have to account for crawler messing up and
downloading same info multiple times
(i.e., if the crawler had a hardware or
software fault)
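A small sketch of how a reader could cope with duplicated records, assuming each record carries a unique ID as described above. The record shape (ID, payload) and the name unique_records are made up for illustration; this is not GFS client code.

```python
def unique_records(records):
    """Yield each record once, keyed by its ID; later duplicates are skipped."""
    seen = set()
    for record_id, payload in records:
        if record_id in seen:
            continue
        seen.add(record_id)
        yield record_id, payload


# Hypothetical crawl output where one page was written twice, e.g. after a retry.
crawled = [
    ("page-1", "<html>a</html>"),
    ("page-2", "<html>b</html>"),
    ("page-1", "<html>a</html>"),
]
print(list(unique_records(crawled)))
```

Keeping the filter on the reader side means the writer can simply retry after a fault without coordinating with anyone.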