pipe2new

•
UERJ

Bruno Ochotorena
12.08.2014
Esta é uma pré-visualização de arquivo. Entre para ver o arquivo original
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Recap: Sequential Laundry
Sequential laundry takes 8 hours for 4 loads
If they learned pipelining, how long would laundry take? 
30
T
a
s
k
O
r
d
e
r
Time
30
30
30
30
30
30
30
30
30
30
30
30
30
30
30
6 PM
7
8
9
10
11
12
1
2 AM
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Recap: Pipelining Lessons (it’s intuitive!)
Pipelining doesn’t help latency of single task, it helps throughput of entire workload
Multiple tasks operating simultaneously using different resources
Potential speedup = Number pipe stages
Pipeline rate limited by slowest pipeline stage
Stages should take about same amount of time
Unbalanced lengths of pipe stages reduces speedup
Time to “fill” pipeline and time to “drain” it reduces speedup
Stall for Dependences
6 PM
7
8
9
Time
T
a
s
k
O
r
d
e
r
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Recap: Ideal Pipelining
Maximum Speedup £ Number of stages
Speedup £ Time for unpipelined operation
 Time for longest stage
IF
DCD
EX
MEM
WB
IF
DCD
EX
MEM
WB
IF
DCD
EX
MEM
WB
IF
DCD
EX
MEM
WB
IF
DCD
EX
MEM
WB
Example: 40ns data path, 5 stages, Longest stage is 10 ns, Speedup £ 4
Assume instructions
are completely independent!
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Recap: Graphically Representing Pipelines
 
Can help with answering questions like:
how many cycles does it take to execute this code?
what is the ALU doing during cycle 4?
use this representation to help understand datapaths
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Recap: Can pipelining get us into trouble?
Yes: Pipeline Hazards
structural hazards: attempt to use the same resource two different ways at the same time
e.g., multiple memory accesses, multiple register writes
solutions: multiple memories, stretch pipeline
control hazards: attempt to make a decision before condition is evaulated
e.g., any conditional branch
solutions: prediction, delayed branch
data hazards: attempt to use item before it is ready
e.g., add r1,r2,r3; sub r4, r1 ,r5; lw r6, 0(r7); or r8, r6 ,r9
solutions: forwarding/bypassing, stall/bubble
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Recap: Pipelined Datapath with Data Stationary Control
npc
I mem
Regs
B
alu
S
D mem
m
IAU
PC
lw $2,20($5)
Regs
Operand Register Selects
ALU Op
MEM Op
Result Reg Select and Enable
Just like Time-State!
A
im
op
rw
n
<= PC + 4 + immed
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Recap
Pipelining is a fundamental concept
multiple steps using distinct resources
Utilize capabilities of the Datapath by pipelined instruction processing
start next instruction while working on the current one
limited by length of longest stage (plus fill/flush)
detect and resolve hazards
What makes it easy
all instructions are the same length
just a few instruction formats
memory operands appear only in loads and stores 
Hazards make it hard
We’ll build a simple pipeline and look at these issues
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
The Big Picture: Where are We Now? 
The Five Classic Components of a Computer
This Section´s Topics: 
Recap last lecture
Pipelined Control/ Do it yourself Pipelined Control
Hazards/Forwarding
Exceptions
Review MIPS R3000 pipeline
Advanced Pipelining?
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Recap: Control Diagram
IR <- Mem[PC]; PC < PC+4;
A <- R[rs]; B<– R[rt]
S <– A + B;
R[rd] <– S;
S <– A + SX;
M <– Mem[S]
R[rd] <– M;
S <– A or ZX;
R[rt] <– S;
S <– A + SX;
Mem[S] <- B
If Cond PC < PC+SX;
Exec
Reg. 
File
Mem
Access
Data
Mem
Reg
File
Equal
PC
Next PC
IR
Inst. Mem
M <– S
M <– S
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
6.3 (Parts of 6.6) Pipelined Control
cs 152 L1 3 .10
DAP Fa97, Ó U.CB
To specify control for the pipeline, we need only set the control values during each pipeline stage. Because each control line is associated with a component active in only a single pipeline stage, we divide the control lines into five groups.
The PC and Pipeline registers are written every clock cycle, so there is no seperate write signal for them. 
Since the control lines start with the EX stage, the control information is created during instruction decode. The signals are then used in the appropriate pipeline stage as the instr moves down the pipe. The signals for the stage are stored in the pipeline regs.
If control bit into 2 way multiplexor = 1 then multiplexor selects the input corresponding to 1, if control bit = 0 then multiplexor selects the input corresponding to 0.
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Specification of 9 Control Signals (Figs. 6.26 and 6.27)
 Instruction Fetch: Control signals to read Inst Mem and write 
 PC are always asserted (nothing special to control).
 ID / Reg file read: same as previous stage (no optional control 
 lines to set).
 EX / Address Calculation: signals to be set are RegDest 
 (selects result reg), ALUOp (2 control bits assert ALU
 operation, a 6-bit function code field specified in the immediate 
 field of the instr saved in the ID/EX pipe reg, see fig 6.26) and 
 ALUSrc (Read data reg 2 or a sign-extended immediate for the 
 ALU).
 Mem Access: control lines to be set are Branch (set by branch 
 instr), MemRead (load instr)and MemWrite (store instr). 
 PCSrc selects the next sequential address unless control 
 asserts Branch and the ALU result was zero.
 Write back: control lines are MemtoReg (decides between 
 sending the ALU result or the memory value to the reg file) and 
 RegWrite (writes the chosen value).
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
But recall use of “Data Stationary Control”
The Main Control generates the control signals during Reg/Dec
Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
Control signals for Mem (MemWr Branch) are used 2 cycles later
Control signals for Wr (MemtoReg MemWr) are used 3 cycles later
IF/ID Register
ID/Ex Register
Ex/Mem Register
Mem/Wr Register
Reg/Dec
Exec
Mem
ExtOp
ALUOp
RegDst
ALUSrc
Branch
MemWr
MemtoReg
RegWr
Main
Control
ExtOp
ALUOp
RegDst
ALUSrc
MemtoReg
RegWr
MemtoReg
RegWr
MemtoReg
RegWr
Branch
MemWr
Branch
MemWr
Wr
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Datapath + Data Stationary Control
Exec
Reg. 
File
Mem
Access
Data
Mem
Reg
File
Next PC
IR
Inst. Mem
Decode
Mem
Ctrl
WB 
Ctrl
rs
rt
op
rs
rt
fun
im
ex
me
wb
rw
v
me
wb
rw
v
wb
rw
v
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Optimized Data Path
cs 152 L1 3 .14
DAP Fa97, Ó U.CB
Suppose extra HW (branch address adder moved from EX to IF stage and comparisson logic) is added in the ID / RegFetch stage so that branches are resolved in the second cycle of branch execution. Next PC is known in second cycle of branch ---> saves 1 clock cycle of penalty.
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Let’s Try it Out
10	lw 	r1, r2(35)
14	addI 	r2, r2, 3
20	sub	r3, r4, r5
24	beq	r6, r7, 100
30	ori	r8, r9, 17
34	add	r10, r11, r12
100	and	r13, r14, 15
these addresses are octal
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Start: Fetch 10
Exec
Reg. 
File
Mem
Access
Data
Mem
Reg
File
IR
Inst. Mem
Decode
Mem
Ctrl
WB 
Ctrl
rs
rt
im
10	lw 	r1, r2(35)
14	addI 	r2, r2, 3
20	sub	r3, r4, r5
24	beq	r6, r7, 100
30	ori	r8, r9, 17
34	add	r10, r11, r12
100	and	r13, r14, 15
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Fetch 14, Decode 10
Exec
Reg. 
File
Mem
Access
Data
Mem
Reg
File
IR
Inst. Mem
Decode
Mem
Ctrl
WB 
Ctrl
2
rt
im
10	lw 	r1, r2(35)
14	addI 	r2, r2, 3
20	sub	r3, r4, r5
24	beq	r6, r7, 100
30	ori	r8, r9, 17
34	add	r10, r11, r12
100	and	r13, r14, 15
lw r1, r2(35)
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Fetch 20, Decode 14, Exec 10
Exec
Reg. 
File
Mem
Access
Data
Mem
r2
Reg
File
IR
Inst. Mem
Decode
Mem
Ctrl
WB 
Ctrl
2
rt
35
10	lw 	r1, r2(35)
14	addI 	r2, r2,
3
20	sub	r3, r4, r5
24	beq	r6, r7, 100
30	ori	r8, r9, 17
34	add	r10, r11, r12
100	and	r13, r14, 15
lw r1
addI r2, r2, 3
EX
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Fetch 24, Decode 20, Exec 14, Mem 10
Exec
Reg. 
File
Mem
Access
Data
Mem
r2
r2+35
Reg
File
IR
Inst. Mem
Decode
Mem
Ctrl
WB 
Ctrl
4
5
3
10	lw 	r1, r2(35)
14	addI 	r2, r2, 3
20	sub	r3, r4, r5
24	beq	r6, r7, 100
30	ori	r8, r9, 17
34	add	r10, r11, r12
100	and	r13, r14, 15
lw r1
sub r3, r4, r5
addI r2, r2, 3
M 
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
Exec
Reg. 
File
Mem
Access
Data
Mem
r4
r2+3
Reg
File
IR
Inst. Mem
Decode
Mem
Ctrl
WB 
Ctrl
M[r2+35]
6
7
10	lw 	r1, r2(35)
14	addI 	r2, r2, 3
20	sub	r3, r4, r5
24	beq	r6, r7, 100
30	ori	r8, r9, 17
34	add	r10, r11, r12
100	and	r13, r14, 15
lw r1
beq r6, r7 100
addI r2
sub r3
Note Delayed Branch: always execute ori after beq
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Fetch 100, Dcd 30, Ex 24, Mem 20, WB 14
Exec
Reg. 
File
Mem
Access
Data
Mem
r6
r2+3
Reg
File
IR
Inst. Mem
Decode
Mem
Ctrl
WB 
Ctrl
r1=M[r2+35]
9
xx
10	lw 	r1, r2(35)
14	addI 	r2, r2, 3
20	sub	r3, r4, r5
24	beq	r6, r7, 100
30	ori	r8, r9, 17
34	add	r10, r11, r12
100	and	r13, r14, 15
beq
addI r2
sub r3
r4-r5
100
ori r8, r9 17
M 
WB 
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Fetch 104, Dcd 100, Ex 30, Mem 24, WB 20
Exec
Reg. 
File
Mem
Access
Data
Mem
Reg
File
IR
Inst. Mem
Decode
Mem
Ctrl
WB 
Ctrl
10	lw 	r1, r2(35)
14	addI 	r2, r2, 3
20	sub	r3, r4, r5
24	beq	r6, r7, 100
30	ori	r8, r9, 17
34	add	r10, r11, r12
100	and	r13, r14, 15
Fill it in yourself!
?
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Fetch 110, Dcd 104, Ex 100, Mem 30, WB 24
Exec
Reg. 
File
Mem
Access
Data
Mem
Reg
File
IR
Inst. Mem
Decode
Mem
Ctrl
WB 
Ctrl
10	lw 	r1, r2(35)
14	addI 	r2, r2, 3
20	sub	r3, r4, r5
24	beq	r6, r7, 100
30	ori	r8, r9, 17
34	add	r10, r11, r12
100	and	r13, r14, 15
M 
Fill it in yourself!
?
?
?
?
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Fetch 114, Dcd 110, Ex 104, Mem 100, WB 30
Exec
Reg. 
File
Mem
Access
Data
Mem
Reg
File
IR
Inst. Mem
Decode
Mem
Ctrl
WB 
Ctrl
10	lw 	r1, r2(35)
14	addI 	r2, r2, 3
20	sub	r3, r4, r5
24	beq	r6, r7, 100
30	ori	r8, r9, 17
34	add	r10, r11, r12
100	and	r13, r14, 15
M 
Fill it in yourself!
?
?
?
?
?
?
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Pipeline Hazards Again
I-Fet ch DCD MemOpFetch OpFetch Exec Store
IFetch DCD ° ° °
Structural
Hazard
I-Fet ch DCD OpFetch Jump
IFetch DCD ° ° °
Control Hazard
 IF DCD EX Mem WB
 IF DCD OF Ex Mem
RAW (read after write) Data Hazard
WAW Data Hazard (write after write) 
 IF DCD OF Ex WB
WAR Data Hazard (write after read) 
 IF DCD EX Mem WB
 IF DCD EX Mem WB
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
6.4 Data Hazards
Avoid some “by design”
eliminate WAR by always fetching operands early (DCD) in pipe
eliminate WAW by doing all WBs in order (last stage, static)
Detect and resolve remaining ones
stall or forward (if possible)
Introduce a Forwarding Unit in the datapath
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Compiler Scheduling Solution
*
	sub $2, $1, $3
	nop
	nop
	and $12, $2, $5
	or $13, $6, $2
	add $14, $2, $2
	sw $15, 100($2)
This code works properly but the 2 nops occupy 2 clock cycles that do no useful work. These dependencies occur too often to rely on compilers.
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Hazard Detection
Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline.
A RAW hazard exists on register r if r Î Rregs( i ) Ç Wregs( j )
Keep a record of pending writes (for inst's in the pipe) and compare with operand regs of current instruction. 
When instruction issues, reserve its result register. 
When on operation completes, remove its write reservation.
A WAW hazard exists on register r if r Î Wregs( i ) Ç Wregs( j )
A WAR hazard exists on register r if r Î Wregs( i ) Ç Rregs( j )
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Forwarding
*
cs 152 L1 3 .29
DAP Fa97, Ó U.CB
If we can take the inputs to the ALU from any pipeline register rather than just the ID/EX, then we can forward the proper data.
By adding multiplexors to the input of the ALU and with the proper controls, we can run the pipeline at full speed in the presence of some data dependencies. 
Notation:
	rs: 1st regs source operand
	rt: 2nd regs source operand
	rw (rd in text book): regs destination operand
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Record of Pending Writes
Current operand registers
Pending writes
hazard <=
((rs == rwex) & regWex) OR
((rs == rwmem) & regWme) OR
((rs == rwwb) & regWwb) OR
((rt == rwex) & regWex) OR
((rt == rwmem) & regWme) OR
((rt == rwwb) & regWwb) 
npc
I mem
Regs
B
alu
S
D mem
m
IAU
PC
Regs
A
im
op
rw
n
op
rw
n
op
rw
n
op rw rs rt
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Resolve RAW by forwarding
Detect nearest valid write op operand register and forward into op latches, bypassing remainder of the pipe
Increase muxes to add paths from pipeline registers
Data Forwarding = Data Bypassing
npc
I mem
Regs
B
alu
S
D mem
m
IAU
PC
Regs
A
im
op
rw
n
op
rw
n
op
rw
n
op rw rs rt
Forward
mux
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
6.5 What about memory operations?
° If instructions are initiated in order and 
 operations always occur in the same 
 stage, there can be no hazards between
 memory operations!
° What does delaying WB on arithmetic
 operations cost?
 – cycles ?
 – hardware ?
° What about data dependence on loads?
 R1 <- R4 + R5
 R2 <- Mem[ R2 + I ]
 R3 <- R2 + R1
=>
Instr is in the ID stage means:
 the instruction’s control signals are in
 the IF/ID register.
A
B
op Rd Ra Rb
op Rd Ra Rb
 Rd 
to reg
file
R
T
 Rd 
"Delayed Loads"
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Load Data Hazard
cs 152 L1 3 .33
DAP Fa97, Ó U.CB
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
No Solution by Forwarding
cs 152 L1 3 .34
DAP Fa97, Ó U.CB
cs 152 L1 3 .34
DAP Fa97, Ó U.CB
Since the dependence between the load and the following instr goes backwards in time, this hazard cannot be solved by forwarding.
It is necessary to stall the pipeline. Include a Hazard Detection Unit which operates during the ID stage
Condition for this load hazard detection:
	if (ID/EX.MemRead and
	 (( ID/EX RegsRt = IF/ID.RegsRs ) or
 ID/EX RegsRt = IF/ID.RegsRt )))
 stall the pipe one clock cycle
If instr is load and destination regs field in EX stage matches either source regs of the instr in the ID stage, then stall.
After the stall the forwarding unit handles the dependency to avoid loosing one more cycle.
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Bubbles Caused by Stalls
cs 152 L1 3 .35
DAP Fa97, Ó U.CB
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Stalling
cs 152 L1 3 .36
DAP Fa97, Ó U.CB
If the instruction in the ID stage is stalled, then the instr in the IF stage must also be stalled (so we don’t loose the fetched instr).
Stalling mechanism: prevent the PC and the IF/ID pipe regs from changing.
The instr in the IF stage will continue to be read using the same PC, and the regs in the ID stage will continue to be read using the same instr fields in the IF/ID pipe regs. Like restarting the washing machine with the same clothes while letting the dryer continue working empty.
It is also necessary to set the EX, MEM and WB control signals of the ID/ EX pipe regs to 0 (the hazard has already been detected in the ID stage). These values are propagated forward in the pipe and no regs
or memory locations are changed. Actually only the signals RegWrite and MemWrite need to be 0.
In the example in previous figure, the HW repeats in cycle 4 what it did for instrs and and or in clock 3. The execution time for these instrs is stretched and the fetch of the add instr is delayed.
The bubble delays everything behind it and proceeds down the
 pipe.
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Hazard Detection and Forwarding Units
cs 152 L1 3 .37
DAP Fa97, Ó U.CB
The Forwarding Unit controls the ALU multiplexors to replace the value from a general purpose regs with the value from the proper pipe regs. 
The Hazard Detection Unit controls the writing of the PC and IF/ID regs plus the multiplexor that chooses between the real control values and all 0s. It stalls and deasserts the control fields if the load-use hazard test is true.
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Cycles 2 and 3 (Single-Cycle Diagrams)
cs 152 L1 3 .36
DAP Fa97, Ó U.CB
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Cycles 4 and 5
cs 152 L1 3 .39
DAP Fa97, Ó U.CB
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Cycles 6 and 7
cs 152 L1 3 .
DAP Fa97, Ó U.CB
1
1
1
1
1
1
2
2
2
2
2
2
3
3
3
3
3
3
4
4
4
4
4
4
5
5
5
5
5
5
6
6
6
6
6
6
Data stationary control -------> finite state diagram
7
7
7
7
7
7
8
8
8
8
8
8
So where are in in the overall scheme of things.
Well, we just finished designing the processor’s datapath.
Now I am going to show you how to design the control for the datapath.
+1 = 7 min. (X:47)
9
9
9
9
9
9
10
10
10
10
10
10
11
11
11
11
11
11
12
12
12
12
12
12
The main control here is identical to the one in the single cycle processor. It generate all the control signals necessary for a given instruction during that instruction’s Reg/Decode stage.
All these control signals will be saved in the ID/Exec pipeline register at the end of the Reg/Decode cycle.
The control signals for the Exec stage (ALUSrc, ... etc.) come from the output of the ID/Exec register. That is they are delayed ONE cycle from the cycle they are generated.
The rest of the control signals that are not used during the Exec stage is passed down the pipeline and saved in the Exec/Mem register.
The control signals for the Mem stage (MemWr, Branch) come from the output of the Exec/Mem register. That is they are delayed two cycles from the cycle they are generated.
Finally, the control signals for the Wr stage (MemtoReg & RegWr) come from the output of the Exec/Wr register: they are delayed three cycles from the cycle they are generated.
+2 = 45 min. (Y:45)
13
13
13
13
13
13
14
14
14
14
14
14
15
15
15
15
15
15
16
16
16
16
16
16
17
17
17
17
17
17
18
18
18
18
18
18
19
19
19
19
19
19
20
20
20
20
20
20
21
21
21
21
21
21
22
22
22
22
22
22
23
23
23
23
23
23
24
24
24
24
24
24
25
25
25
25
25
25
26
26
26
26
26
26
27
27
27
27
27
28
28
28
28
28
28
29
29
29
29
29
30
30
30
30
30
30
31
31
31
31
31
31
32
32
32
32
32
32
33
33
33
33
34
34
34
34
35
35
35
35
36
36
36
36
37
37
37
37
38
38
38
38
39
39
39
39
40
40
40
40