Computer Organization: Difference between revisions

From Soma-notes
Revision as of 02:11, 1 October 2007

Computer Organization

Introduction

  • Information has been posted on the class website regarding running a Linux distribution on your own home machine (or in a virtual desktop environment). Students are encouraged to familiarize themselves with the Linux/Unix environment, as it is far less restricted and more robust than Windows/OSX.
  • Our research paper (i.e. Operating Systems) should be well on its way. Students should have key words and topics narrowed down.


Stored Program Computer

The ideas behind the stored program computer go back to the 1800s and Charles Babbage's "Difference Engine" and later "Analytical Engine". Since then they have evolved into what we now call the Von Neumann architecture, in which program and data share the same memory. Although Von Neumann was only one piece of the puzzle of modern computer architecture, the architecture is still named after him. In more recent times the details of the Von Neumann model have changed, but conceptually it is still the same: several different components talk to each other (via buses), all at the same time, which can get very complicated.


Central Processing Unit

The central processing unit (CPU) is made up of an Arithmetic Logic Unit (ALU) and Control Unit (CU).


Arithmetic Logic Unit

The ALU can perform all arithmetic (addition, subtraction, multiplication, division) and logical (AND, OR, NOT ... ) calculations at a very rapid rate.


Control Unit

The CU is made up of several components and duties :

  • Program counter (PC): holds the address of the instruction currently being executed (on many processors it already points at the next instruction to fetch).
  • Status word: stores information about the current state of the control unit, such as overflows and execution status. A word is the unit of data the processor natively deals with. Most modern processors are 32-bit or 64-bit; that is, their registers hold values that are 32 or 64 bits wide, and that width is what we call a word. The word size varies from CPU to CPU.
  • Fetch-decode cycle: the control unit is responsible for this algorithm, which grabs an instruction from main memory, decodes it and then hands it off to the part of the processor that handles that instruction (often the ALU).
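The fetch-decode cycle can be sketched as a small loop. This is only a toy model: the opcodes (LOAD, ADD, STORE, HALT), the single accumulator and the memory layout are assumptions made up for illustration, not a real instruction set.

```python
# Toy stored-program machine. The opcodes and the single accumulator are
# illustrative assumptions, not a real instruction set.

def run(memory):
    """Run the program held in `memory` and return the final memory contents."""
    pc = 0    # program counter: address of the instruction to fetch
    acc = 0   # accumulator register fed by the ALU
    while True:
        instruction = memory[pc]          # fetch
        opcode, operand = instruction     # decode
        pc += 1                           # move on to the next instruction
        if opcode == "LOAD":              # execute
            acc = memory[operand]
        elif opcode == "ADD":
            acc = acc + memory[operand]   # this is the ALU's job
        elif opcode == "STORE":
            memory[operand] = acc
        elif opcode == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode: {opcode}")

# Program and data live in the same memory, as in a stored program computer.
program = [
    ("LOAD", 4),   # address 0
    ("ADD", 5),    # address 1
    ("STORE", 6),  # address 2
    ("HALT", 0),   # address 3
    2,             # address 4: data
    3,             # address 5: data
    0,             # address 6: the result is stored here
]
result = run(list(program))
print(result[6])   # prints 5
```

Note that instructions and data sit side by side in the same list, which is the essence of the stored program idea.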

The above description of the control unit is very simplified. It is really a reflection of a CPU from the early to mid 1980s. Moore's Law states that every 18 months, the number of transistors that sit on a CPU chip will double. This "law" was introduced in the 1960s and currently still holds true. Why? Mainly because the industry sets its own targets around this figure. Given Moore's Law, you can imagine how much CPU architecture has changed since the 1980s.
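To get a feel for what doubling every 18 months means, here is a rough calculation. The function name is made up for illustration, and the 18-month period is simply the figure quoted in these notes.

```python
def transistor_growth(years, doubling_period_years=1.5):
    """Factor by which the transistor count grows after `years`."""
    return 2 ** (years / doubling_period_years)

# 15 years is ten doubling periods of 18 months each.
print(transistor_growth(15))   # prints 1024.0
```

In other words, under this assumption a chip 15 years on has roughly a thousand times as many transistors to work with.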


Transistors

With all of these extra transistors, what can we do with all of them? One (or both) of two things :

Integration

Integration refers to more and more components getting packed into the CPU. Devices such as :

  • Memory Controller (circuitry needed to drive the RAM)
  • Math Co-Processor (handles floating point values)
  • Memory Management Unit (MMU - used for virtualization)
  • L1 & L2 Cache (used for fast memory access when we don't want to go all the way to main memory. L1 cache is small but fast. L2 cache is large but slower).
  • Systems on a Chip : systems that have all of their components incorporated into a single chip. These currently exist, but are fairly slow in terms of overall performance.


Duplication

Why have one ALU when we can have 10? Modern processors use this technique quite a bit. Certain components can only be duplicated so much. Why? Aside from hardware constraints, we also have bottlenecks. For instance, a CPU has only one PC (program counter). What is the point of having several ALUs if the program counter cannot keep up?

The most common forms of duplication include :

  • Pipelining : In a pipelined CPU, a function unit of the CPU is partitioned into K smaller parts, called pipeline stages. When an operation is executed, it passes through the K stages in turn. Because the stages are separate pieces of hardware, they can work independently and in parallel: while one operation is in stage 2, the next one can already be in stage 1, so no part of the CPU is left idle. What happens if two instructions need the same piece of hardware? One of them has to wait. And what happens at a branch, where the CPU does not yet know which instruction comes next? The CPU can guess which way the branch will go and continue on its way. If it later learns that the path it took was incorrect, it can roll back its changes and carry on with the correct path. If the guesses are constantly wrong, the system slows to a crawl. This entire process of 'guessing' is known as speculative execution.



  • Thread Level Parallelism : if we can't get a single instruction stream parallelized, we can just have several separate threads. This means we can share all of the address space information such as cache because the threads will be running in the same process.
  • Multi core : the most current technology in practice. After several years of packing CPUs with more and more features, bottlenecks began to arise. Solution? Duplicate the CPU! As the industry currently shows, more and more computers are being released with dual or even quad core processors. What are some bottlenecks that can arise with multi core processors? The bus is one obvious answer. If we increase the amount of information the buses can handle, we can accommodate multiple CPUs.
  • CPU cache : Why not get rid of RAM altogether and simply put all of it on the CPU as L1 or L2 cache? Cache is built from static memory (SRAM), whereas main memory is dynamic memory (DRAM). DRAM stores a bit using only a fraction of the transistors that static memory needs (typically one transistor and a capacitor per bit, versus about six transistors), so it is far denser. This is why there is only around 4MB of CPU cache available in most modern day processors. What RAM lacks in speed it makes up for in size and parallelism.
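The benefit of pipelining from the list above can be estimated with a simple cycle count. This is a sketch under idealized assumptions: every stage takes exactly one cycle and there are no stalls or wrong guesses. The function names are made up for illustration.

```python
def unpipelined_cycles(instructions, stages):
    """Each instruction passes through all K stages before the next one starts."""
    return instructions * stages

def pipelined_cycles(instructions, stages):
    """The first instruction needs K cycles; each later one finishes a cycle after."""
    if instructions == 0:
        return 0
    return stages + (instructions - 1)

n, k = 1000, 5
print(unpipelined_cycles(n, k))   # prints 5000
print(pipelined_cycles(n, k))     # prints 1004
```

For long instruction streams the speedup approaches K, the number of stages. Every stall or wrong guess adds cycles to the pipelined figure, which is why constant bad guesses hurt so much.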

The book "Computer Architecture: A Quantitative Approach" by Hennessy and Patterson is an excellent book on CPU architecture.

What does all of this have to do with Operating Systems? If the OS does not understand resources (underlying hardware), then the system is doomed. We have to design the OS around the architecture (especially important with multi-core systems).


Latency & Bandwidth

Imagine the following situation : We have two facilities, A & B. A very large amount of information has to be transferred between the locations. We can do it in one of two ways :

  • Load up a semi-truck full of hard drives that are full of the information needed.
  • Connect the locations together via fiber optical cable to transfer the information.

Q : What method will allow the most information to be transported?

A : Obviously the truck full of hard drives, especially with modern day drives being capable of storing up to a terabyte of information. The amount of information that can be transferred in a given time is known as bandwidth.

Q : What is the quickest way to get one single bit of information from location B to location A?

A : This time, the answer is the fiber optic cable. How long it takes a piece of information to arrive is known as latency.
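The two answers can be checked with a back-of-the-envelope model. Every number below (trip time, drive count, link speed) is an assumption chosen for illustration, and the return drive of the truck is ignored.

```python
import math

# All figures are illustrative assumptions, not measurements.
TRUCK_TRIP_SECONDS = 10 * 3600          # one 10-hour drive from B to A
TRUCK_CAPACITY_BYTES = 1000 * 10**12    # 1000 drives of one terabyte each
FIBER_LATENCY_SECONDS = 0.005           # delay before the first bit arrives
FIBER_BYTES_PER_SECOND = 1.25 * 10**9   # a 10 Gbit/s link

def truck_time(size_bytes):
    """Seconds to deliver the data by truck (at least one full trip)."""
    trips = max(1, math.ceil(size_bytes / TRUCK_CAPACITY_BYTES))
    return trips * TRUCK_TRIP_SECONDS

def fiber_time(size_bytes):
    """Seconds to deliver the data over the cable: latency plus transfer time."""
    return FIBER_LATENCY_SECONDS + size_bytes / FIBER_BYTES_PER_SECOND

tiny = 1            # a single byte stands in for "one bit"
huge = 10**15       # a full truckload

print(fiber_time(tiny) < truck_time(tiny))   # prints True: low latency wins
print(truck_time(huge) < fiber_time(huge))   # prints True: high bandwidth wins
```

With these numbers the truck delivers its whole load in 10 hours, while the cable needs more than nine days for the same data, yet the cable delivers a single byte almost instantly.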



The above situation is also apparent when dealing with RAM. RAM is known as high-latency, high-bandwidth: the first byte takes a while to arrive, but a lot of data can follow it. Moving large amounts of data to the CPU can sometimes be sped up using the cache. This issue is a very commonly studied topic in the world of RAM - CPU connections.

Other devices and their latency :

  • CPU registers : lowest latency
  • L2 & L1 Cache : low latency
  • Hard disk : highest latency
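How much a cache helps can be estimated with the usual average access time formula: hit time plus miss rate times miss penalty. The latencies and the miss rate below are made-up round numbers, used only to show the shape of the calculation.

```python
def average_access_time(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average time per memory access when a cache sits in front of RAM."""
    return hit_time_ns + miss_rate * miss_penalty_ns

# Assumed figures: 1 ns cache hit, 5% of accesses miss, 100 ns to reach RAM.
with_cache = average_access_time(hit_time_ns=1, miss_rate=0.05, miss_penalty_ns=100)
without_cache = 100   # every access goes all the way to RAM

print(with_cache, "ns on average versus", without_cache, "ns straight from RAM")
```

The lower the miss rate, the closer the average gets to the latency of the cache itself.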


I/O

There are several different ways Input/Output devices can communicate with the CPU and RAM :

  • Polled I/O : Think of it as giving a person a task. You can poke the person every 5 minutes to see if they have completed the task. If the person has finished the task, they respond with a 'yes', otherwise 'no'. In this scenario, one party has to poke every 5 minutes and the other party has to respond every 5 minutes as they get poked. Obviously, not the most efficient solution.
  • Interrupt I/O : In this situation, when giving the person the task, we simply ask them to tell us when they have completed it. Much more efficient than polling.
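The difference between the two approaches can be simulated in a few lines. The `Device` class and its tick-based clock are hypothetical stand-ins for real hardware, and the sketch assumes the task needs at least one tick.

```python
class Device:
    """A pretend I/O device that finishes its task after a fixed number of ticks."""

    def __init__(self, ticks_needed, on_done=None):
        self.ticks_left = ticks_needed
        self.on_done = on_done            # interrupt handler, if any

    def tick(self):
        """Advance the device by one unit of time."""
        if self.ticks_left > 0:
            self.ticks_left -= 1
            if self.ticks_left == 0 and self.on_done is not None:
                self.on_done()            # "raise the interrupt"

    def is_done(self):
        return self.ticks_left == 0


def polled_io(ticks_needed):
    """Return how many times the CPU had to ask 'are you done yet?'."""
    device = Device(ticks_needed)
    questions = 0
    while True:
        questions += 1                    # the poke
        if device.is_done():
            return questions
        device.tick()


def interrupt_io(ticks_needed):
    """Return how many times the CPU was bothered about the task."""
    notifications = []
    device = Device(ticks_needed, on_done=lambda: notifications.append("done"))
    for _ in range(ticks_needed):
        device.tick()                     # the CPU is free to do other work here
    return len(notifications)


print(polled_io(100))      # prints 101
print(interrupt_io(100))   # prints 1
```

With polling, the CPU's attention is needed on every tick; with interrupts, it is needed once.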

If the I/O device is processing only a small amount of data, the system runs relatively efficiently. If the data is large, the communication between the I/O device, the CPU and the RAM becomes frequent. There are several solutions to this problem:

  • Memory Mapped I/O : devices are associated with logical primary memory addresses rather than having a specialized device address table.
  • Direct Memory Access (DMA) is when devices and their controllers are able to read and write information directly from the primary memory without the CPU getting in the way. The DMA controller performs the same transfer steps the CPU would, except that conceptually the data does not need to pass through any CPU registers. This direct access can increase the speed significantly.
  • Intelligent Devices : a final alternative is to have memory and chip sets that act in a way the CPU does on the I/O device itself. This means we do not need to talk to the CPU or RAM at all (it can all be done on the device itself).
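Memory mapped I/O from the list above can be pictured as a bus that routes some addresses to RAM and others to a device. The `Console` device, the `Bus` class and the address 0xFF00 are all assumptions invented for this sketch.

```python
class Console:
    """Hypothetical output device with a single data register."""

    def __init__(self):
        self.output = []

    def write_register(self, value):
        self.output.append(chr(value))


class Bus:
    """Routes loads and stores either to RAM or to a memory-mapped device."""

    RAM_SIZE = 256
    CONSOLE_ADDRESS = 0xFF00   # assumed address of the console's register

    def __init__(self, console):
        self.ram = [0] * self.RAM_SIZE
        self.console = console

    def store(self, address, value):
        if address == self.CONSOLE_ADDRESS:
            self.console.write_register(value)   # looks like a memory write
        elif 0 <= address < self.RAM_SIZE:
            self.ram[address] = value
        else:
            raise IndexError(f"nothing mapped at address {address:#x}")

    def load(self, address):
        if 0 <= address < self.RAM_SIZE:
            return self.ram[address]
        raise IndexError(f"nothing readable at address {address:#x}")


console = Console()
bus = Bus(console)
bus.store(0x10, 72)                      # an ordinary write to RAM
bus.store(Bus.CONSOLE_ADDRESS, 72)       # the same kind of write reaches the device
bus.store(Bus.CONSOLE_ADDRESS, 105)
print("".join(console.output))           # prints Hi
```

From the program's point of view there is no special device instruction: talking to the console is just a store to a particular address.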

SIMD Processors

Single Instruction Multiple Data (SIMD) processors handle one instruction accompanied by an entire array of data. This type of architecture is quite efficient at processing large amounts of data that all need to be operated on using the same instruction. Processing speeds are increased significantly because the processor does not have to decode an instruction for every single piece of data (since all pieces of data use the same instruction). Multimedia devices like MP3 players and gaming consoles all take advantage of this technology. These machines are also known as vector machines.
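The saving in instruction decoding can be illustrated by counting decodes. This is a conceptual model only: plain Python does not run SIMD instructions, and the lane width of 4 is an arbitrary assumption.

```python
def scalar_add(a, b):
    """One ADD instruction per pair of values."""
    decodes = 0
    result = []
    for x, y in zip(a, b):
        decodes += 1                      # fetch and decode an ADD for this pair
        result.append(x + y)
    return result, decodes


def simd_add(a, b, lanes=4):
    """One ADD instruction per group of `lanes` pairs."""
    decodes = 0
    result = []
    for start in range(0, len(a), lanes):
        decodes += 1                      # one decoded ADD covers the whole group
        group = zip(a[start:start + lanes], b[start:start + lanes])
        result.extend(x + y for x, y in group)
    return result, decodes


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10, 20, 30, 40, 50, 60, 70, 80]
print(scalar_add(a, b))   # prints ([11, 22, 33, 44, 55, 66, 77, 88], 8)
print(simd_add(a, b))     # prints ([11, 22, 33, 44, 55, 66, 77, 88], 2)
```

Both versions compute the same sums; the SIMD version simply needs far fewer instructions to do it.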

The Sony PS3 has one MIMD (multiple instruction, multiple data) processor and several SIMD processors. Compilers need to be adjusted and game programmers need to adapt their writing skills in favor of this new technology. Someone eventually has to slave over a library of machine language instructions in order to optimize the particular application to SIMD technology (not the most exciting task).


Boot Process

How is the OS started? How does the computer go from powering on to running the operating system (Windows, OSX ...)? This process takes a really long time when you consider how fast computers have become. Why is this? The system must determine all of the hardware present in the machine, along with each device's manufacturer, specifications, requirements etc. It can become a real mess! Before the OS can even begin to start, all of the hardware in the system must be properly configured. This all begins with the BIOS.


BIOS

BIOS (basic input/output system) is the code that runs before any other code runs. It is built into the system. In older times, the BIOS was burned into the machine and never changed. Nowadays, it can be flashed and reprogrammed. This is done when bugs in the BIOS are so troublesome that an update is needed (typically when the OS does not work well with the hardware). The BIOS configures all the hardware by running diagnostics etc. After hardware is configured, how is the OS launched? This process is not direct, mainly because the BIOS has limited capabilities due to legacy issues. The BIOS is configured in a way that it believes your computer's hardware is reflective of the year 1982. This means it believes you only have several megabytes of RAM, an 8086 processor, limited disk space and so on. It does, however, know how to load up to 512 bytes of information. This is where bootstrapping comes into play.


Bootstrapping

After hardware is configured, the BIOS loads a small amount of code known as the boot block (approx 400 bytes, with the rest of the 512 bytes used to store partition tables) into memory, where it is executed. The boot block is just enough code to load another program that makes the machine behave more like what it really is (not an 8086). From here we can initialize RAM, hard disk and eventually start the OS.
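The split between boot code and partition table can be made concrete. The sketch below assumes the classic PC master boot record layout, which these notes do not spell out: 446 bytes of code, four 16-byte partition entries, and a two-byte signature (0x55, 0xAA) at the end of the 512-byte sector. The function name is made up for illustration.

```python
SECTOR_SIZE = 512
CODE_SIZE = 446                 # boot code area (classic MBR layout, assumed here)
PARTITION_ENTRY_SIZE = 16
PARTITION_ENTRIES = 4
SIGNATURE = b"\x55\xaa"         # marks the sector as bootable

def split_boot_sector(sector):
    """Split a 512-byte boot sector into its code and partition entries."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("a boot sector is exactly 512 bytes")
    if sector[510:512] != SIGNATURE:
        raise ValueError("missing boot signature")
    code = sector[:CODE_SIZE]
    table = sector[CODE_SIZE:510]
    entries = [
        table[i:i + PARTITION_ENTRY_SIZE]
        for i in range(0, PARTITION_ENTRY_SIZE * PARTITION_ENTRIES, PARTITION_ENTRY_SIZE)
    ]
    return code, entries

# A blank but validly signed sector, built in memory for the example.
blank_sector = bytes(CODE_SIZE) + bytes(64) + SIGNATURE
code, entries = split_boot_sector(blank_sector)
print(len(code), len(entries))   # prints 446 4
```

The code area is small enough that all the boot block can realistically do is find and load a larger program.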

Macintosh computers run a next generation, more sophisticated BIOS called EFI. EFI allows the computer to go from BIOS to OS startup more easily and directly, because it recognizes the computer's hardware for what it truly is. Older Macs use Open Firmware (sometimes called Open Boot), which is a BIOS that allows the user to enter commands via a prompt.