Can't always forward
• Even with forwarding, a load word can still cause a hazard:
  – an instruction tries to read a register following a load instruction that writes to the same register.
• Thus, we need a hazard detection unit to "stall" the instruction that tries to use the loaded value too soon
[Pipeline diagram, CC 1–CC 9: lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7. The value loaded into $2 is not available until the end of the lw's MEM stage, too late to forward to the EX stage of the immediately following and.]
Stalling
• We can stall the pipeline by keeping an instruction in the same stage
[Pipeline diagram, CC 1–CC 10: lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2. The and is held in ID for one cycle while a bubble (nop) takes its place in EX, delaying it and all later instructions by one cycle.]
Hazard Detection Unit
• Stall by letting an instruction that won't write anything go forward
[Datapath diagram: the pipelined datapath with forwarding, extended with a hazard detection unit in the ID stage. The unit examines ID/EX.MemRead and compares ID/EX.RegisterRt against IF/ID.RegisterRs and IF/ID.RegisterRt; on a load-use hazard it holds PC and IF/ID and forces the control signals entering ID/EX to 0.]
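The stall test itself is small. Below is a minimal C sketch of it, following the pipeline-register field names in the figure; the struct layout and the surrounding simulator state are assumptions of this sketch, not from the slides.

    #include <stdbool.h>
    #include <stdint.h>

    /* Pipeline-register fields named after the figure's signals;
       the struct layout is illustrative, not from the slides. */
    struct if_id_reg { uint8_t rs, rt; };            /* IF/ID.RegisterRs, IF/ID.RegisterRt */
    struct id_ex_reg { bool mem_read; uint8_t rt; }; /* ID/EX.MemRead, ID/EX.RegisterRt    */

    /* Stall when the instruction in EX is a load whose destination
       register is a source of the instruction currently being decoded. */
    bool load_use_stall(struct if_id_reg id, struct id_ex_reg ex)
    {
        return ex.mem_read && (ex.rt == id.rs || ex.rt == id.rt);
    }

    /* On a stall: hold PC and IF/ID for one cycle, and zero the control
       signals entering ID/EX so a bubble (nop) flows down the pipeline. */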
Branch Hazards
• When we decide to branch, other instructions are in the pipeline!
• We are predicting "branch not taken"
  – need to add hardware for flushing instructions if we are wrong
[Pipeline diagram, CC 1–CC 9: 40 beq $1, $3, 28; 44 and $12, $2, $5; 48 or $13, $6, $2; 52 add $14, $2, $2; 72 lw $4, 50($7). By the time the branch at 40 resolves, the instructions at 44, 48, and 52 have already been fetched; if the branch is taken they must be flushed and fetch continues at 72.]
Flushing Instructions
[Datapath diagram: the branch comparison ("=" unit on the register values) and the branch-target adder are moved into the ID stage, and an IF.Flush signal zeroes the instruction in IF/ID when the branch is taken.]
Note: we've also moved the branch decision to the ID stage
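A minimal C sketch of that ID-stage resolution; all names here (the register values, pc, if_id_instr) are illustrative simulator state, not from the slides.

    #include <stdint.h>

    /* Branch resolved in ID: compare the two register values, and if the
       branch is taken, redirect fetch and squash the one instruction that
       was fetched behind the branch (IF.Flush turns it into a nop). */
    void resolve_beq_in_id(uint32_t rs_val, uint32_t rt_val, uint32_t target,
                           uint32_t *pc, uint32_t *if_id_instr)
    {
        if (rs_val == rt_val) {   /* the "=" unit in the figure           */
            *pc = target;         /* target from the ID-stage adder       */
            *if_id_instr = 0;     /* IF.Flush: the fetched slot becomes nop */
        }
    }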
Branches
• If the branch is taken, we have a penalty of one cycle
• For our simple design, this is reasonable
• With deeper pipelines, the penalty increases and static branch prediction drastically hurts performance
• Solution: dynamic branch prediction
[State diagram, "A 2-bit prediction scheme": two predict-taken states and two predict-not-taken states. Each taken outcome moves one state toward predict-taken and each not-taken outcome one state toward predict-not-taken, so the prediction flips only after two consecutive mispredictions.]
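The scheme is easy to express as a saturating counter. Below is a minimal C sketch; the per-branch table and the indexing by PC bits are assumptions of this sketch, not something the slide specifies.

    #include <stdbool.h>
    #include <stdint.h>

    #define TABLE_SIZE 1024              /* table size is an assumption */
    static uint8_t counter[TABLE_SIZE];  /* each entry holds a state 0..3 */

    /* States 0 and 1 predict not taken; states 2 and 3 predict taken. */
    bool predict_taken(uint32_t pc)
    {
        return counter[(pc >> 2) % TABLE_SIZE] >= 2;
    }

    /* Move one state toward the actual outcome, saturating at 0 and 3,
       so a single misprediction does not flip a strongly biased prediction. */
    void train(uint32_t pc, bool taken)
    {
        uint8_t *c = &counter[(pc >> 2) % TABLE_SIZE];
        if (taken && *c < 3) (*c)++;
        else if (!taken && *c > 0) (*c)--;
    }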
Branch Prediction
• Sophisticated techniques:
  – A "branch target buffer" to help us look up the destination (see the sketch after this list)
  – Correlating predictors that base prediction on global behavior and recently executed branches (e.g., prediction for a specific branch instruction based on what happened in previous branches)
  – Tournament predictors that use different types of prediction strategies and keep track of which one is performing best
  – A "branch delay slot" which the compiler tries to fill with a useful instruction (make the one-cycle delay part of the ISA)
• Branch prediction is especially important because it enables other more advanced pipelining techniques to be effective!
• Modern processors predict correctly 95% of the time!
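As a rough illustration of the branch target buffer idea, here is a minimal C sketch: a small table, indexed by PC bits, that remembers each branch's last target so fetch can redirect without waiting for decode. Table size, indexing, and entry layout are all assumptions of this sketch.

    #include <stdbool.h>
    #include <stdint.h>

    #define BTB_SIZE 256                 /* size is an assumption */
    struct btb_entry { uint32_t pc, target; bool valid; };
    static struct btb_entry btb[BTB_SIZE];

    /* At fetch: if this PC was seen before, return its remembered target. */
    bool btb_lookup(uint32_t pc, uint32_t *target)
    {
        struct btb_entry *e = &btb[(pc >> 2) % BTB_SIZE];
        if (e->valid && e->pc == pc) { *target = e->target; return true; }
        return false;
    }

    /* When a taken branch resolves: remember where it went. */
    void btb_update(uint32_t pc, uint32_t target)
    {
        struct btb_entry *e = &btb[(pc >> 2) % BTB_SIZE];
        e->pc = pc; e->target = target; e->valid = true;
    }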
Improving Performance
• Try to avoid stalls! E.g., reorder these instructions:
  lw $t0, 0($t1)
  lw $t2, 4($t1)
  sw $t2, 0($t1)
  sw $t0, 4($t1)
  Swapping the two stores removes the load-use stall: sw $t2, 0($t1) no longer immediately follows the lw that produces $t2, and since the stores go to different addresses the result is unchanged.
• Dynamic Pipeline Scheduling
  – Hardware chooses which instructions to execute next
  – Will execute instructions out of order (e.g., doesn't wait for a dependency to be resolved, but rather keeps going!)
  – Speculates on branches and keeps the pipeline full (may need to roll back if the prediction was incorrect)
• Trying to exploit instruction-level parallelism
Advanced Pipelining
• Increase the depth of the pipeline
• Start more than one instruction each cycle (multiple issue)
• Loop unrolling to expose more ILP (better scheduling)
• "Superscalar" processors
  – DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
• All modern processors are superscalar and issue multiple instructions, usually with some limitations (e.g., different "pipes")
• VLIW: very long instruction word, static multiple issue (relies more on compiler technology)
• This class has given you the background you need to learn more!
Chapter 6 Summary
• Pipelining does not improve latency, but does improve throughput
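As a worked example (numbers illustrative, not from the slides): with five pipeline stages of 200 ps each, one instruction still takes 1000 ps from fetch to write-back, so latency is unchanged; but once the pipeline is full, one instruction completes every 200 ps, a fivefold gain in throughput.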
[Chart: processor designs placed on slower-to-faster scales, both by instructions per clock (IPC = 1/CPI, from 1 to several) and by instruction latency: single-cycle (Section 5.4), multicycle (Section 5.5), pipelined, deeply pipelined, multiple-issue pipelined (Section 6.9), and multiple issue with deep pipeline (Section 6.10).]
Chapter Seven
Memories: Review
• SRAM:
  – value is stored on a pair of inverting gates
  – very fast, but takes up more space than DRAM (4 to 6 transistors)
• DRAM:
  – value is stored as a charge on a capacitor (must be refreshed)
  – very small, but slower than SRAM (by a factor of 5 to 10)
[Diagram: memory cell structures, including a DRAM cell built from a word line, a pass transistor, a capacitor, and a bit line.]
Exploiting Memory Hierarchy
• Users want large and fast memories! (costs below are 2004 figures)
  – SRAM access times are 0.5–5 ns, at a cost of $4,000 to $10,000 per GB
  – DRAM access times are 50–70 ns, at a cost of $100 to $200 per GB
  – Disk access times are 5 to 20 million ns, at a cost of $0.50 to $2 per GB
• Try and give it to them anyway
  – build a memory hierarchy
[Diagram: the memory hierarchy, with the CPU at the top, then Level 1, Level 2, …, Level n; distance from the CPU in access time increases going down, as does the size of the memory at each level.]
Locality
• A principle that makes having a memory hierarchy a good idea
• If an item is referenced,
  – temporal locality: it will tend to be referenced again soon
  – spatial locality: nearby items will tend to be referenced soon
  Why does code have locality?
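A small C loop (an illustrative example, not from the slides) shows both kinds at once:

    /* Temporal locality: sum, i, and the loop's own instructions are
       reused on every iteration. Spatial locality: a[i] touches
       consecutive memory addresses. */
    int sum_array(const int *a, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }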
• Our initial focus: two levels (upper, lower)
  – block: minimum unit of data
  – hit: data requested is in the upper level
  – miss: data requested is not in the upper level
Cache
• Two issues:
  – How do we know if a data item is in the cache?
  – If it is, how do we find it?
• Our first example:
  – block size is one word of data
  – "direct mapped"
  For each item of data at the lower level, there is exactly one location in the cache where it might be.
  e.g., lots of items at the lower level share locations in the upper level
Direct Mapped Cache
• Mapping: address is modulo the number of blocks in the cache
[Diagram: an 8-block direct-mapped cache (indices 000 through 111) in front of memory; the word addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, and 11101 all map to cache block 001, because the index is the address modulo 8, i.e., the low three address bits.]
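The mapping is a one-liner in C. A minimal sketch, assuming byte addresses, one-word (4-byte) blocks as in the slides, and a power-of-two number of blocks:

    #include <stdint.h>

    /* Direct-mapped placement: index = block address modulo the number of
       cache blocks; the bits above the index form the tag stored for the
       later hit comparison. */
    uint32_t cache_index(uint32_t byte_addr, uint32_t num_blocks)
    {
        uint32_t block_addr = byte_addr >> 2;  /* drop the 2 byte-offset bits */
        return block_addr % num_blocks;        /* e.g., 8 blocks: low 3 bits  */
    }

    uint32_t cache_tag(uint32_t byte_addr, uint32_t num_blocks)
    {
        return (byte_addr >> 2) / num_blocks;  /* bits above the index */
    }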
Direct Mapped Cache
• For MIPS:
  [Diagram: a direct-mapped cache for MIPS; the address (showing bit positions) is split into tag, index, and byte-offset fields, the stored tag is compared against the address tag to produce the hit signal, and the selected data word is read out.]
  What kind of locality are we taking advantage of?