Links to other weeks notes [ 2 ][ 3 ][ 4 ][ 5 ][ 6 ][ 7 ] [ 8 ][ 10 ][ 11 ][ 12 ]( 13 )[ 14 ]

Week 13, 6/9 October, 1997

Chapter 5, The Processor: Datapath and Control

The five classic components of a computer

The major components of a computer consist of control, datapath, memory, input, and output. The control and datapath are collectively referred as central processing unit (CPU), or simply processor. The processor is built on a piece of silicon chip with the integrated circuit technology. Typically, the size of the chip is only 1 cm^2 containing millions of transistors.

A simple implementation scheme

We'll see the necessary components to built a processor: how data moves (datapath) and how the operation of the processor is controlled (control logic). From this, we want to have an rough understanding on how the instructions are executed by the processor. We consider an extremely simple computer that can perform a subset of the MIPS instructions:

Memory references: lw and sw.
Arithmetic/logical computations: add, sub, and, or, slt.
Branch and jump: beq, j.

The diagram (Figure 5.1, page 272) is an overview of the major functional units and the major connections between them. The functional units are program counter register (PC), register file, arithmetic/logic unit (ALU), instruction and data memories. Note that memory is outside the CPU chip. PC, ALU, and register file are on the chip.

Executing one of the nine instructions requires the following steps.

Fetch an instruction: The value of PC (address of an instruction) is sent to instruction memory, memory responds with the requested instruction sending to CPU.
Read registers: The contents of two registers (or one in case of lw) are read. Which registers to read is encoded in the instruction by the rs and rt fields. This information is passed to register file. The register file sends out the contents of the read registers.
ALU operation: ALU uses the data from register file or immediate value field of the instruction to compute a result. For arithmetic/logic instructions (add, sub, and, or, slt), it simply uses the values of two registers and performs the corresponding operations. For beq, it uses ALU for comparison. For lw, and sw, it uses ALU to compute the address [register value + immediate value (offset)].
Depending on the instruction type: For lw, access the memory, then write the new value to a register. For sw, store value to data memory. for arithmetic/logic type, write the new value to a register. For beq, and j, compute and update the value of PC.

Clock Cycle

The computer operates in discrete steps. The timing is controlled by an interval clock run typically at 100 MHz, or equivalently 10 nanosecond (ns). In the following discussion, we assume that one instruction takes one period or one cycle in time. So that all instructions take the same time.

Building a Datapath

We'll look at the steps of the instruction execution in some detail.

(1) Instruction fetch

Instruction fetch involves the memory containing instructions, program counter register (PC), and a special adder to increment PC by 4 for the next address. See Figures 5.5 and 5.6 (page 277). All instruction execution begins with an instruction fetch with the following steps: PC sends address to memory and adder; memory reads the address content and sends out the instruction at the corresponding address; at the same time, adder computes PC+4; the result is written back to PC.

(2) R-format instructions (add, sub, and, or, slt)

The functional units consist of one register file and one ALU. See Figures 5.7 and 5.8 (page 278, 279). A register file is a collection of 32 registers. Register file contains two output data ports and one input data port. Three additional 5 bit control lines specify which two registers to read and which register to write. An arithmetic/logic operation always involves reading two registers and writing the result back to a register. ALU operation control line controls the type of operation.

(3) Load/Store datapath

Load or store needs two additional units, data memory and a device for sign extension. See Figure 5.9 and 5.10 (page 280). For a load operation, e.g., lw $1, 100($2), the content of the base address register ($2) is read from register file. Which register to read is specified in the instruction. The offset 100 in the immediate value field of the instruction is sign extended from 16 bits to 32 bits and is fed into ALU to compute the address (100+$2). This address is sent to data memory. The memory sends the content back to register file. The register ($1) is written with the new value.

For sw $1, 100($2), the contents in both register $1 and $2 are read. The address 100+$2 is computed by ALU. The resulting address and the content of $1 are sent to data memory to be written.

(4) Branch on equal datapath

The datapath is shown in Figure 5.11 (page 282). To execute the conditional branch, beq $1, $2, offset, two registers are read. The outputs are sent to ALU. The result (true or false) is sent to branch control logic. At the same time, the new branch target address is computed by first sign extending the offset and an adder is used to compute the new instruction address. PC is updated with the branch control logic if branch condition is true.

Final combined datapath is shown in Figure 5.14 (page 286). To reduce the cost of the datapath, common units are combined with the help of multiplexor.

Control

The multiplexors and other parts of the units need to be controlled by the main control logic. Which pass way to take in the datapath depends on the instruction. The control of ALU is relatively simple. The ALU control logic takes bit 0 to 5 from instruction and a 2-bit signal from main control and generates one of the five possible operations (and, or, add, sub, slt). Note that ALU is also used for lw and sw and beq. The 2-bit ALUop signal is used to decide which class of instructions the current computation is for. The main control is the master of the CPU. It takes the bit 31 to 26 from the instruction, based on this value, generates appropriate signals to another parts of the units (multiplexor, ALU control, memory, etc). The 6 bits (bit 32 to 26) encodes the type of operation specified by the instruction. The main control is also referred as instruction decoder.

Chapter 6, Enhancing Performance with Pipelining

We'll mention here only very briefly the concept of pipelining and superscalar designs of CPU. You'll learn more about them in an advanced course.

pipelining

A car manufacture assembly line is a good example of pipelining. Assemble a car takes some steps: prepare chassis, add engine, install seat, add wheels, final touch, each of which takes one hour, say. So the whole process takes 6 hours. If one car is assemble at a time, we can only have one car every 6 hours. However, if six groups of people work on six different cars in the different stage of the assembly, one car is then produced in one hour. This improves the productivity by a factor of 6.

Similarly, the execution of instructions can be divided naturally into steps. Look at the table

                  Time needed for each steps in nanosecond 
                                                1 ns = 10^{-9} second

Instruction type  instruction   register  ALU     Data      Register    Total
                     fetch        read operation store/load  write    

load word (lw)        10           5      10       10         5          40
store word (sw)       10           5      10       10                    35
R-format (add, sub,..)10           5      10                  5          30 
Branch (beq)          10           5      10                             25

For pipelined architecture to work, we need each step to take the same time. Since the longest is 10 ns, we can arrange for all instructions to take five steps (fetch, reg read, ALU, data fetch, reg write), each of 10 ns. Not all instructions need five steps, for example, R-format instruction need not store/load data. But pipeline requires the task to be uniform, which means for R-format instruction, during data fetch, nothing is being done.

For instructions execution one at a time, the sequence of instructions

  lw $1, 100($0)
  lw $2, 200($0)
  lw $3, 300($0)
  lw $4, 400($0)

takes 40*4 = 160 ns. For a pipelined execution, we can issue one fetch every 10 ns. As a result, the same sequence takes only 80 ns.

Pipelining Hazards

Pipelining may cause problem when data are dependent or when branch instruction is executed. For example,

   lw $1, 100($2)
   add $3, $1, $5

The second instruction needs the value $1. However, the the new value of $1 is not available yet, since the lw instruction has not finished yet and the register has not written. One solution is to add a nop instruction (no operation).

   lw $1, 100($2)
   nop
   add $3, $1, $5

The effect of lw instruction is seen only after two cycles. Similarly, the branch can not decide until sometime later. On MIPS R2000, or R3000, for reason of efficiency, the instruction after the branch is always executed. The effect of branch takes place only after two cycles. However, this complexity and the problem in lw is taking care of by the assembler. So when we write

   add $4, $7, $9
   beq $1, $3, 28
   and $12, $2, $5

The assembler insert nop,

   add $4, $7, $9
   beq $1, $3, 28
   nop
   and $12, $2, $5

The instruction after beq is called a delayed slot. The branch is called delayed branch. Actually, the assembler tries to insert useful instruction in the delayed slot. More likely the assembler will re-arrange the code as

   beq $1, $3, 28
   add $4, $7, $9
   and $12, $2, $5

Superscalar architecture

In pipelined architecture, CPU issues one instruction every cycle. In superscalar architecture, two or more instructions are issued per cycle. Multiple ALU and FPU units are built so that computation can be performed in parallel. Again data dependency may slow down the operation.