pipe1new

•
UERJ

Bruno Ochotorena
12/08/2014
Esta é uma pré-visualização de arquivo. Entre para ver o arquivo original
Curso de Tecnologia de Computadores
Professora: Roseli S. Wedemann
	 roseli@ime.uerj.br
	 salas D6019
Text: Computer Organization & Design, the Hardware
Software Interface, D. A. Patterson and J. L. Hennessy,
4th edition, Morgan Kaufman Publishers
Dave Patterson (http.cs.berkeley.edu/~patterson)
lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/
cs 152 L1 3 .1
DAP Fa97, Ó U.CB
Sobre o Curso
2
 
cs 152 L1 3 .2
DAP Fa97, Ó U.CB
Critério de Aprovação:
duas provas + prova final
para passar sem prova final, média >= 7
para passar com prova final, média >= 5
direito à 2a. chamada
para ir para a prova final, média >= 4
Nota = [ (P1 + P2) / 2 + PF ] / 2
Conteúdo
Capítulo 6: Enhancing Performance with Pipelining
Capítulo 9: Multiprocessors
Recap: Microprogramming
Specialize state-diagrams easily captured by microsequencer
simple increment & “branch” fields
datapath control fields
Control design reduces to Microprogramming 
Microprogramming is a fundamental concept
implement an instruction set by building a very simple processor and interpreting the instructions
essential for very complex instructions and when few register transfers are possible
overkill when ISA matches datapath 1-1
3
cs 152 L1 3 .3
DAP Fa97, Ó U.CB
 
Summary: Microprogramming one inspiration for RISC
4
cs 152 L1 3 .4
DAP Fa97, Ó U.CB
If simple instruction could execute at very high clock rate…
If you could even write compilers to produce microinstructions…
If most programs use simple instructions and addressing modes…
If microcode is kept in RAM instead of ROM so as to fix bugs …
If same memory used for control memory could be used instead as cache for “macroinstructions”…
Then why not skip instruction interpretation by a microprogram and simply compile directly into lowest language of machine? (microprogramming is overkill when ISA matches datapath 1-1)
Recap: Exceptions and Interrupts
5
cs 152 L1 3 .5
DAP Fa97, Ó U.CB
Exceptions are the hard part of control
Need to find convenient place to detect exceptions and to branch to state or microinstruction that saves PC and invokes the operating system
As we get pipelined, CPUs that support page faults on memory accesses, which means that the instruction cannot complete AND you must be able to restart the program at exactly the instruction with the exception, it gets even harder
Definitions of control variables
6
cs 152 L1 3 .6
DAP Fa97, Ó U.CB
EPC: Exception Program Counter (stores return address)
exp_addr: address of exception handler routine
cause: reg used for storing the cause of exception
rd: reg destination operand in instruction
rt: second reg source operand in instruction
SX: constant specified in instruction
ZX: zero extended immediate
Ori: S <= reg or ZX, S is a reg (see pg A-57)
 (see chapter 3 and section 5.6)
Modification to the Control Specification
7
IR <= MEM[PC]
PC <= PC + 4
R-type
S <= A fun B
R[rd] <= S
S <= A op ZX
R[rt] <= S
ORi
S <= A + SX
R[rt] <= M
M <= MEM[S]
LW
S <= A + SX
MEM[S] <= B
SW
“instruction fetch”
“decode”
Execute
Memory
Write-back
0000
0001
0100
0101
0110
0111
1000
1001
1010
1011
1100
BEQ
0010
If A = B
then PC <= S
S<= PC +SX
EPC <= PC - 4
PC <= exp_addr
cause <= 10 (RI)
other
undefined instruction
S <= A - B
overflow
EPC <= PC - 4
PC <= exp_addr
cause <= 12 (Ovf)
The Big Picture: Where are We Now? 
The Five Classic Components of a Computer
Today’s Topics: 
Pipelining by Analogy
8
Control
Datapath
Memory
Processor
Input
Output
6.1 An overview of Pipelining. It’s Natural!
cs 152 L1 3 .9
DAP Fa97, Ó U.CB
9
Example: Stages of Laundry
Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 30 minutes
“Folder” takes 30 minutes 
“Storer” takes 30 minutes to put clothes into drawers
A
B
C
D
Sequential Laundry
cs 152 L1 3 .10
DAP Fa97, Ó U.CB
10
Sequential laundry takes 8 hours for 4 loads
If they learned pipelining, how long would laundry take? 
30
T
a
s
k
O
r
d
e
r
Time
30
30
30
30
30
30
30
30
30
30
30
30
30
30
30
6 PM
7
8
9
10
11
12
1
2 AM
Pipelined Laundry: Start work ASAP
cs 152 L1 3 .11
DAP Fa97, Ó U.CB
11
Pipelined laundry takes 3.5 hours for 4 loads! 
T
a
s
k
O
r
d
e
r
12
2 AM
6 PM
7
8
9
10
11
1
Time
Pipelining Lessons
cs 152 L1 3 .12
DAP Fa97, Ó U.CB
12
Pipelining doesn’t help latency of single task, it helps throughput of entire workload
Multiple tasks operating simultaneously using different resources
Potential speedup = Number pipe stages
Pipeline rate limited by slowest pipeline stage
Stages should take about same amount of time
Unbalanced lengths of pipe stages reduces speedup
Time to “fill” pipeline and time to “drain” it reduces speedup
Stall for Dependences
6 PM
7
8
9
Time
T
a
s
k
O
r
d
e
r
The Five Stages of MIPS Instructions, ex. Load
13
cs 152 L1 3 .13
DAP Fa97, Ó U.CB
Ifetch: Instruction Fetch
Fetch the instruction from the Instruction Memory
Reg/Dec: Registers Fetch and Instruction Decode
Exec: Calculate the memory address
Mem: Read the data from the Data Memory
Wr: Write the data back to the register file
				 base addressing
Ex: lw $1, 100 ($0) $1 = Memory[$0 + 100]
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Load
We will limit our attention to 8 instrs
14
cs 152 L1 3 .14
DAP Fa97, Ó U.CB
lw: load word (8ns)	 instruction fetch: 2ns
sw: store word (7ns)	 register read or write: 1ns
add: add (6ns)		 ALU operation: 2ns
sub: subtract (6ns)	 memory acces: 2ns
and: and (6ns)			 
or: or	(6ns)		 
slt: set-less-than (compares 2 regs and sets a third to 1, if the first is less than the second, pg. 128) (6ns)
beq: branch-if-equal (branch to label if 2 regs are equal, pg 123) (5ns) 
 Ex: beq $3, $4, L if $3 = $4 go to L
 see Chapter 3 for descriptions of instructions
Pipelining
15
cs 152 L1 3 .15
DAP Fa97, Ó U.CB
Improve perfomance by increasing instruction throughput 
 figure 6.3 Ideal speedup is number of stages in the pipeline. Do we achieve this? 
Performance Metrics
16
cs 152 L1 3 .16
DAP Fa97, Ó U.CB
In the single-cycle design model (every instruction takes 1 clock cycle), the clock cycle must be stretched to accommodate the slowest instruction (Load is slowest and takes 8ns in MIPS).
In the pipelined design, the pipeline stages take a single clock cycle. The clock cycle must be long enough to accommodate the slowest operation (2ns).
Assume write to reg file occurs in first half of clock cycle and read from reg file occurs in second half of cycle.
Speedup = execution time without pipeline / execution time with pipeline = ideally to number of stages. 
(definition in pg. 101)
Performance Improvement
17
cs 152 L1 3 .17
DAP Fa97, Ó U.CB
pipe with 5 balanced stages: ideal speedup = 5, 8ns / 5 = 1.6ns clock cycle
stages may be imperfectly balanced and pipeline involves some overhead, so time per instruction (CPI) will exceed minimum possible and speedup will be less than number of stages.
In ex., for 3 instrs: speedup = 24ns / 14ns = 1.7
If we add 1000 lw instrs (1 instr per 2ns): speedup = 8024ns / 2014ns = 3.98 (almost 4)
Pipelining increases instruction throughput, as opposed to decreasing the execution time of an individual instruction. 
Instruction throughput is the important metric!
Basic Idea
18
cs 152 L1 3 .18
DAP Fa97, Ó U.CB
 			fig 6.10 (pg. 450)
What do we need to add to actually split the datapath into stages?
Instruction sets for piplines
19
cs 152 L1 3 .19
DAP Fa97, Ó U.CB
1. MIPS instrs are all the same length. Easier to fetch instr in 1st stage and
decode in 2nd stage.
2. MIPS has only a few instr formats, with the source reg fields in the same place. 2nd stage can start reading the reg file while HW is determining what type of instr was fetched. Reg read + decode in one stage.
3. Memory operands only appear in loads and stores in MIPS. Use the execute stage to calculate mem address and access mem in the next stage. Otherwise stages 3 and 4 would break into 3 stages: address, access, execute.
4. Operands must be aligned in memory (chapt 3, pg. 112). 1 data transfer takes 1 data mem access (1 pipeline stage).
Graphically Representing Pipelines
20
cs 152 L1 3 .20
DAP Fa97, Ó U.CB
 
Can help with answering questions like:
how many cycles does it take to execute this code?
what is the ALU doing during cycle 4?
use this representation to help understand datapaths
Conventional Pipelined Execution Representation
21
Program Flow
Time
cs 152 L1 3 .21
DAP Fa97, Ó U.CB
Single Cycle, Multiple Cycle, vs. Pipeline
22
cs 152 L1 3 .22
DAP Fa97, Ó U.CB
Clk
Cycle 1
Multiple Cycle Implementation:
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Cycle 6
Cycle 7
Cycle 8
Cycle 9
Cycle 10
Load
Load
Store
Pipeline Implementation:
Store
Clk
Single Cycle Implementation:
Load
Store
Waste
R-type
R-type
Cycle 1
Cycle 2
Why Pipeline?
23
cs 152 L1 3 .23
DAP Fa97, Ó U.CB
Suppose we execute 100 instructions
Single Cycle Machine
45 ns/cycle x 1 CPI x 100 inst = 4500 ns
Multicycle Machine
10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
Ideal pipelined machine
10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
speedup = 4500ns / 1040ns = 4.33
Why Pipeline? Because the resources are there!
24
cs 152 L1 3 .24
DAP Fa97, Ó U.CB
I
n
s
t
r.
O
r
d
e
r
Time (clock cycles)
Inst 0
Inst 1
Inst 2
Inst 4
Inst 3
Can pipelining get us into trouble?
25
cs 152 L1 3 .25
DAP Fa97, Ó U.CB
Yes: Pipeline Hazards (next instr can’t execute in next cycle)
structural hazards: attempt to use the same resource two different ways at the same time
E.g., combined washer/dryer would be a structural hazard or folder busy doing something else (watching TV)
data hazards: attempt to use item before it is ready
E.g., one sock of pair in dryer and one in washer; can’t fold until get sock from washer through dryer
instruction depends on result of prior instruction still in the pipeline
control hazards: attempt to make a decision before condition is evaluated
E.g., washing football uniforms and need to get proper detergent level; need to see after dryer before next load in
branch instructions
Can always resolve hazards by waiting
 pipeline control must detect the hazard
 take action (or delay action) to resolve hazards
Single Memory is a Structural Hazard
26
cs 152 L1 3 .26
DAP Fa97, Ó U.CB
Mem
I
n
s
t
r.
O
r
d
e
r
Time (clock cycles)
Load
Instr 1
Instr 2
Instr 3
Instr 4
Reg
Mem
Reg
Reg
Mem
Reg
Detection is easy in this case! (right half highlight means read, left half write)
Structural Hazards limit performance
27
cs 152 L1 3 .27
DAP Fa97, Ó U.CB
Example: if 1.3 memory accesses per instruction and only one memory access per cycle, then 30% of instrs are loads
average CPI = 0.7*1 + 0.3*2 = 1.3 cycles / instr (30% of the time stall 1 cycle)
otherwise resource is more than 100% utilized
Control Hazard Solutions
28
cs 152 L1 3 .28
DAP Fa97, Ó U.CB
Stall (Bubble): wait until decision is clear
It’s possible to move up decision to 2nd stage by adding hardware to check registers as being read, calculate branch address and update PC
Impact: 2 clock cycles per branch instruction => slow
I
n
s
t
r.
O
r
d
e
r
Time (clock cycles)
Add
Beq
Load
Reg
Mem
Reg
Reg
Mem
Reg
Reg
Mem
Reg
Control Hazards Limit Performance
29
cs 152 L1 3 .29
DAP Fa97, Ó U.CB
Estimate average CPI for stalling on branches. Suppose 17% of the instrs executed are branches (gcc execution benchmark).
average CPI = 0.83*1 + 0.17*2 = 1.17 cycles / instr
Control Hazard Solutions
30
cs 152 L1 3 .30
DAP Fa97, Ó U.CB
Predict: guess one direction then back up if wrong
Predict branch not taken is a common approach
Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right  50% of time)
I
n
s
t
r.
O
r
d
e
r
Time (clock cycles)
Add
Beq
Load
Reg
Mem
Reg
Reg
Mem
Reg
Mem
Reg
Mem
Reg
Dynamic Scheme
31
cs 152 L1 3 .31
DAP Fa97, Ó U.CB
Dynamic HW predictors make guesses depending on the behavior of each branch and may change predictions for a branch over the life of a program. 
Ex: Keep history of each branch as taken or not taken and use the past to predict the future: Such HW has  90% accuracy.
When the guess is wrong, the pipeline control must ensure that the instrs following the wrongly guessed branch have no effect, and must restart the pipe from the proper branch address.
Longer pipes raise cost of mispredictions as is usual with all solutions to control hazards. 
Control Hazard Solutions
32
cs 152 L1 3 .32
DAP Fa97, Ó U.CB
Redefine branch behavior (takes place after next instruction) “delayed branch”. Place a “safe”instr after the branch instr. The compiler rearranges instrs when possible (doesn’t change program results). 
Impact: 0 clock cycles per branch instruction if can find instruction to put in “slot” ( 50% of time)
Longer pipes, more slots, harder to fill.
I
n
s
t
r.
O
r
d
e
r
Time (clock cycles)
Add
Beq
Misc
Reg
Mem
Reg
Reg
Mem
Reg
Mem
Reg
Mem
Reg
Load
Mem
Reg
Mem
Reg
Data Hazard on r1
33
cs 152 L1 3 .33
DAP Fa97, Ó U.CB
An instr depends on the results of a previous instr still in the pipe. Too many of these dependencies to be solved by a compiler successfuly.
add r1 ,r2,r3
sub r4, r1 ,r3
and r6, r1 ,r7
or r8, r1 ,r9
xor r10, r1 ,r11
Data Hazard on r1:
34
cs 152 L1 3 .34
DAP Fa97, Ó U.CB
 Dependencies backwards in time are hazards
 
I
n
s
t
r.
O
r
d
e
r
Time (clock cycles)
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
IF
ID/RF
EX
MEM
WB
ALU
Im
Reg
Dm
Reg
Reg
Dm
Reg
Reg
Dm
Reg
Im
ALU
Reg
Dm
Reg
Data Hazard Solution:
35
cs 152 L1 3 .35
DAP Fa97, Ó U.CB
 “Forward” result from one stage to another
 
 “or” OK if define read/write properly
I
n
s
t
r.
O
r
d
e
r
Time (clock cycles)
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
IF
ID/RF
EX
MEM
WB
ALU
Im
Reg
Dm
Reg
Reg
Dm
Reg
Reg
Dm
Reg
Im
ALU
Reg
Dm
Reg
Forwarding (or Bypassing): What about Loads
36
cs 152 L1 3 .36
DAP Fa97, Ó U.CB
 Dependencies backwards in time are hazards
 (shading indicates when operand is used by the instr)
 
 Can’t solve with forwarding: 
 Must delay/stall instruction dependent on loads 
Study example of reordering of code to avoid stalls caused
by data hazards (pg 447).
Time (clock cycles)
lw r1,0(r2)
sub r4,r1,r3
IF
ID/RF
EX
MEM
WB
ALU
Im
Reg
Dm
Reg
Reg
Dm
Reg
6.2 Designing a Pipelined Processor
37
cs 152 L1 3 .37
DAP Fa97, Ó U.CB
Go back and examine your datapath and control diagram
associated resources with states
ensure that flows do not conflict, or figure out how to resolve
assert control in appropriate stage
To retain the value of an individual instr for its other four stages, the value output by one stage must be saved in a regs. We must place regs where there are dividing lines between the stages. These are called pipeline regs.
Pipelined Datapath (as in book); hard to read
38
cs 152 L1 3 .38
DAP Fa97, Ó U.CB
 
IF	 ID	 EX	MEM	 WB
	 
Fig 6.12
Data flowing from right to left does not affect the current instr, only later instrs in the pipe
Pipelined
Processor (almost) for slides
39
cs 152 L1 3 .39
DAP Fa97, Ó U.CB
What happens if we start a new instruction every cycle?
Exec
Reg. 
File
Mem
Access
Data
Mem
Reg
File
Equal
PC
Next PC
IR
Inst. Mem
Valid
IRex
Dcd Ctrl
IRmem
Ex Ctrl
IRwb
Mem Ctrl
WB Ctrl
Control and Datapath
40
cs 152 L1 3 .40
DAP Fa97, Ó U.CB
IR <- Mem[PC]; PC <– PC+4;
A <- R[rs]; B<– R[rt]
S <– A + B;
R[rd] <– S;
S <– A + SX;
M <– Mem[S]
R[rd] <– M;
S <– A or ZX;
R[rt] <– S;
S <– A + SX;
Mem[S] <- B
If Cond
PC < PC+SX;
Exec
Reg. 
File
Mem
Access
Data
Mem
Reg
File
Equal
PC
Next PC
IR
Inst. Mem
Pipelining the Load Instruction
41
cs 152 L1 3.41
DAP Fa97, Ó U.CB
The five independent functional units in the pipeline datapath are:
Instruction Memory for the Ifetch stage
Register File’s Read ports (bus A and busB) for the Reg/Dec stage
ALU for the Exec stage
Data Memory for the Mem stage
Register File’s Write port (bus W) for the Wr stage
Clock
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Cycle 6
Cycle 7
2nd lw
3rd lw
The Four Stages of R-type
42
cs 152 L1 3 .42
DAP Fa97, Ó U.CB
Ifetch: Instruction Fetch
Fetch the instruction from the Instruction Memory
Reg/Dec: Registers Fetch and Instruction Decode
Exec: 
ALU operates on the two register operands
Update PC
Wr: Write the ALU output back to the register file
Cycle 1
Cycle 2
Cycle 3
Cycle 4
R-type
Pipelining the R-type and Load Instruction
43
cs 152 L1 3 .43
DAP Fa97, Ó U.CB
We have pipeline conflict or structural hazard:
Two instructions try to write to the register file at the same time!
Only one write port
Clock
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Cycle 6
Cycle 7
Cycle 8
Cycle 9
R-type
R-type
Load
R-type
R-type
Ops! We have a problem!
Important Observation
44
cs 152 L1 3 .44
DAP Fa97, Ó U.CB
Each functional unit can only be used once per instruction
Each functional unit must be used at the same stage for all instructions:
Load uses Register File’s Write Port during its 5th stage
R-type uses Register File’s Write Port during its 4th stage
2 ways to solve this pipeline hazard.
Solution 1: Insert “Bubble” into the Pipeline
45
cs 152 L1 3 .45
DAP Fa97, Ó U.CB
Insert a “bubble” into the pipeline to prevent 2 writes at the same cycle
The control logic can be complex.
Lose instruction fetch and issue opportunity.
No instruction is started in Cycle 6!
Clock
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Cycle 6
Cycle 7
Cycle 8
Cycle 9
R-type
Load
R-type
R-type
Pipeline
Bubble
Solution 2: Delay R-type’s Write by One Cycle
46
cs 152 L1 3 .46
DAP Fa97, Ó U.CB
Delay R-type’s register write by one cycle:
Now R-type instructions also use Reg File’s write port at Stage 5
Mem stage is a NOOP stage: nothing is being done.
Clock
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Cycle 6
Cycle 7
Cycle 8
Cycle 9
R-type
R-type
Load
R-type
R-type
R-type
Mem
1
2
3
4
5
Modified Control & Datapath
47
cs 152 L1 3 .47
DAP Fa97, Ó U.CB
IR <- Mem[PC]; PC <– PC+4;
A <- R[rs]; B<– R[rt]
S <– A + B;
R[rd] <– M;
S <– A + SX;
M <– Mem[S]
R[rd] <– M;
S <– A or ZX;
R[rt] <– M;
S <– A + SX;
Mem[S] <- B
if Cond PC < PC+SX;
M <– S
Exec
Reg. 
File
Mem
Access
Data
Mem
S
Reg
File
Equal
PC
Next PC
IR
Inst. Mem
M <– S
The Four Stages of Store
48
cs 152 L1 3 .48
DAP Fa97, Ó U.CB
Ifetch: Instruction Fetch
Fetch the instruction from the Instruction Memory
Reg/Dec: Registers Fetch and Instruction Decode
Exec: Calculate the memory address
Mem: Write the data into the Data Memory
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Store
Wr
The Three Stages of Beq
49
cs 152 L1 3 .49
DAP Fa97, Ó U.CB
Ifetch: Instruction Fetch
Fetch the instruction from the Instruction Memory
Reg/Dec: 
Registers Fetch and Instruction Decode
Exec: 
compares the two register operand, 
select correct branch target address
latch into PC
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Beq
Wr
Control Diagram
50
IR <- Mem[PC]; PC < PC+4;
A <- R[rs]; B<– R[rt]
S <– A + B;
R[rd] <– S;
S <– A + SX;
M <– Mem[S]
R[rd] <– M;
S <– A or ZX;
R[rt] <– S;
S <– A + SX;
Mem[S] <- B
If Cond PC < PC+SX;
Exec
Reg. 
File
Mem
Access
Data
Mem
Reg
File
Equal
PC
Next PC
IR
Inst. Mem
M <– S
M <– S
cs 152 L1 3 .50
DAP Fa97, Ó U.CB
Datapath + Data Stationary Control
51
cs 152 L1 3 .51
DAP Fa97, Ó U.CB
Exec
Reg. 
File
Mem
Access
Data
Mem
Reg
File
Next PC
IR
Inst. Mem
Decode
Mem
Ctrl
WB 
Ctrl
rs
rt
op
rs
rt
fun
im
ex
me
wb
rw
v
me
wb
rw
v
wb
rw
v
Let’s Try it Out
52
cs 152 L1 3 .52
DAP Fa97, Ó U.CB
10	lw 	r1, r2(35)
14	addI 	r2, r2, 3
20	sub	r3, r4, r5
24	beq	r6, r7, 100
30	ori	r8, r9, 17
34	add	r10, r11, r12
100	and	r13, r14, 15
these addresses are octal
Start: Fetch 10
53
cs 152 L1 3 .53
DAP Fa97, Ó U.CB
Exec
Reg. 
File
Mem
Access
Data
Mem
Reg
File
PC
Next PC
IR
Inst. Mem
Decode
Mem
Ctrl
WB 
Ctrl
rs
rt
im
10	lw 	r1, r2(35)
14	addI 	r2, r2, 3
20	sub	r3, r4, r5
24	beq	r6, r7, 100
30	ori	r8, r9, 17
34	add	r10, r11, r12
100	and	r13, r14, 15
10
Fetch 14, Decode 10
54
cs 152 L1 3 .54
DAP Fa97, Ó U.CB
Exec
Reg. 
File
Mem
Access
Data
Mem
Reg
File
PC
Next PC
IR
Inst. Mem
Decode
Mem
Ctrl
WB 
Ctrl
2
rt
im
10	lw 	r1, r2(35)
14	addI 	r2, r2, 3
20	sub	r3, r4, r5
24	beq	r6, r7, 100
30	ori	r8, r9, 17
34	add	r10, r11, r12
100	and	r13, r14, 15
14
lw r1, r2(35)
Fetch 20, Decode 14, Exec 10
55
cs 152 L1 3 .55
DAP Fa97, Ó U.CB
Exec
Reg. 
File
Mem
Access
Data
Mem
r2
Reg
File
PC
Next PC
IR
Inst. Mem
Decode
Mem
Ctrl
WB 
Ctrl
2
rt
35
10	lw 	r1, r2(35)
14	addI 	r2, r2, 3
20	sub	r3, r4, r5
24	beq	r6, r7, 100
30	ori	r8, r9, 17
34	add	r10, r11, r12
100	and	r13, r14, 15
20
lw r1
addI r2, r2, 3
Fetch 24, Decode 20, Exec 14, Mem 10
56
cs 152 L1 3 .56
DAP Fa97, Ó U.CB
Exec
Reg. 
File
Mem
Access
Data
Mem
r2
r2+35
Reg
File
PC
Next PC
IR
Inst. Mem
Decode
Mem
Ctrl
WB 
Ctrl
4
5
3
10	lw 	r1, r2(35)
14	addI 	r2, r2, 3
20	sub	r3, r4, r5
24	beq	r6, r7, 100
30	ori	r8, r9, 17
34	add	r10, r11, r12
100	and	r13, r14, 15
24
lw r1
sub r3, r4, r5
addI r2, r2, 3
Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
57
cs 152 L1 3 .57
DAP Fa97, Ó U.CB
Exec
Reg. 
File
Mem
Access
Data
Mem
r4
r2+3
Reg
File
PC
Next PC
IR
Inst. Mem
Decode
Mem
Ctrl
WB 
Ctrl
M[r2+35]
6
7
10	lw 	r1, r2(35)
14	addI 	r2, r2, 3
20	sub	r3, r4, r5
24	beq	r6, r7, 100
30	ori	r8, r9, 17
34	add	r10, r11, r12
100	and	r13, r14, 15
30
lw r1
beq r6, r7 100
addI r2
sub r3
Fetch 34, Dcd 30, Ex 24, Mem 20, WB 14
58
cs 152 L1 3 58.
DAP Fa97, Ó U.CB
Exec
Reg. 
File
Mem
Access
Data
Mem
r6
r2+3
Reg
File
PC
Next PC
IR
Inst. Mem
Decode
Mem
Ctrl
WB 
Ctrl
r1=M[r2+35]
9
xx
10	lw 	r1, r2(35)
14	addI 	r2, r2, 3
20	sub	r3, r4, r5
24	beq	r6, r7, 100
30	ori	r8, r9, 17
34	add	r10, r11, r12
100	and	r13, r14, 15
34
beq
addI r2
sub r3
r4-r5
100
ori r8, r9 17
Fetch 100, Dcd 34, Ex 30, Mem 24, WB 20
59
cs 152 L1 3 .59
DAP Fa97, Ó U.CB
Exec
Reg. 
File
Mem
Access
Data
Mem
r9
Reg
File
PC
Next PC
IR
Inst. Mem
Decode
Mem
Ctrl
WB 
Ctrl
r1=M[r2+35]
11
12
10	lw 	r1, r2(35)
14	addI 	r2, r2, 3
20	sub	r3, r4, r5
24	beq	r6, r7, 100
30	ori	r8, r9, 17
34	add	r10, r11, r12
100	and	r13, r14, 15
100
beq
r2 = r2+3
sub r3
r4-r5
17
ori r8
xxx
add r10, r11, r12
ooops, we should have only one delayed instruction
Fetch 104, Dcd 100, Ex 34, Mem 30, WB 24
60
cs 152 L1 3 .60
DAP Fa97, Ó U.CB
Exec
Reg. 
File
Mem
Access
Data
Mem
r11
Reg
File
PC
Next PC
IR
Inst. Mem
Decode
Mem
Ctrl
WB 
Ctrl
r1=M[r2+35]
14
15
10	lw 	r1, r2(35)
14	addI 	r2, r2, 3
20	sub	r3, r4, r5
24	beq	r6, r7, 100
30	ori	r8, r9, 17
34	add	r10, r11, r12
100	and	r13, r14, 15
104
beq
r2 = r2+3
r3 = r4-r5
xx
ori r8
xxx
add r10
and r13, r14, r15
n
Squash the extra instruction in the branch shadow!
r9 | 17
Fetch 108, Dcd 104, Ex 100, Mem 34, WB 30
61
cs 152 L1 3 .61
DAP Fa97, Ó U.CB
Exec
Reg. 
File
Mem
Access
Data
Mem
r14
Reg
File
PC
Next PC
IR
Inst. Mem
Decode
Mem
Ctrl
WB 
Ctrl
r1=M[r2+35]
10	lw 	r1, r2(35)
14	addI 	r2, r2, 3
20	sub	r3, r4, r5
24	beq	r6, r7, 100
30	ori	r8, r9, 17
34	add	r10, r11, r12
100	and	r13, r14, 15
110
r2 = r2+3
r3 = r4-r5
xx
ori r8
add r10
and r13
n
Squash the extra instruction in the branch shadow!
r9 | 17
r11+r12
Fetch 114, Dcd 110, Ex 104, Mem 100, WB 34
62
cs 152 L1 3 .62
DAP Fa97, Ó U.CB
Exec
Reg. 
File
Mem
Access
Data
Mem
Reg
File
PC
Next PC
IR
Inst. Mem
Decode
Mem
Ctrl
WB 
Ctrl
r1=M[r2+35]
10	lw 	r1, r2(35)
14	addI 	r2, r2, 3
20	sub	r3, r4, r5
24	beq	r6, r7, 100
30	ori	r8, r9, 17
34	add	r10, r11, r12
100	and	r13, r14, 15
114
r2 = r2+3
r3 = r4-r5
r8 = r9 | 17
add r10
and r13
n
Squash the extra instruction in the branch shadow!
r11+r12
NO WB
NO Ovflow
r14 & R15
Summary: Pipelining
63
cs 152 L1 3 .63
DAP Fa97, Ó U.CB
What makes it easy
all instructions are the same length
just a few instruction formats
memory operands appear only in loads and stores 
What makes it hard?
structural hazards: suppose we had only one memory
control hazards: need to worry about branch instructions
data hazards: an instruction depends on a previous instruction 
We’ll build a simple pipeline and look at these issues 
We’ll talk about modern processors and what really makes it hard:
exception handling
trying to improve performance with out-of-order execution, etc.
Summary
64
cs 152 L1 3 .64
DAP Fa97, Ó U.CB
Pipelining is a fundamental concept
multiple steps using distinct resources
Utilize capabilities of the Datapath by pipelined instruction processing
start next instruction while working on the current one
limited by length of longest stage (plus fill/flush)
detect and resolve hazards
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
2
2
2
3
3
3
3
3
4
4
4
4
4
5
5
5
5
5
6
6
6
6
6
7
7
7
7
7
8
7
7
7
6
6
6
8
8
8
8
8
10
8
8
8
7
7
7
So where are in in the overall scheme of things.
Well, we just finished designing the processor’s datapath.
Now I am going to show you how to design the control for the datapath.
+1 = 7 min. (X:47)
9
9
9
9
9
10
10
10
10
10
11
11
11
11
11
12
12
12
12
12
13
13
13
13
13
14
14
14
14
14
15
15
15
15
15
16
16
16
16
16
17
17
17
17
17
18
18
18
18
18
19
19
19
19
19
20
20
20
20
20
21
21
21
21
23
36
21
21
21
19
16
16
22
22
22
22
24
23
23
23
23
26
24
24
24
24
28
25
25
25
25
30
26
26
26
26
32
27
27
27
27
34
28
28
28
28
36
29
29
29
29
38
30
30
30
30
40
31
31
31
31
42
32
32
32
33
44
33
33
33
35
46
34
34
34
37
48
35
35
35
39
50
36
36
36
41
52
37
37
38
43
54
38
38
40
45
56
39
39
42
47
58
40
40
44
49
60
41
41
46
51
62
42
42
48
53
64
43
43
50
55
66
44
44
52
57
68
45
45
54
59
70
46
46
56
61
72
47
47
58
63
74
48
48
60
65
76
49
49
62
67
78
50
50
65
70
81
94
50
50
50
47
44
44
51
51
66
71
82
52
53
68
73
84
53
55
70
75
86
54
57
72
77
88
55
59
74
79
90
56
61
76
81
92
57
63
78
83
94
58
65
80
85
96
59
67
82
87
98
60
69
84
89
100
61
71
86
91
102
62
73
88
93
104
63
75
90
95
106
64
77
92
97
108