Buscar

pipe3new

Esta é uma pré-visualização de arquivo. Entre para ver o arquivo original

cs 152 L1 3 .*
DAP Fa97, Ó U.CB
6.6 Branch Hazards (Control Hazards)
cs 152 L1 3 .
DAP Fa97, Ó U.CB
Control Hazards are simple to understand and occur much less frequently than data hazards.
They can’t be solved as effectively as forwarding is for data hazards. 
There are 2 schemes for resolving these hazards and one optimization to improve these schemes.
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
a) Assume Branch not Taken
cs 152 L1 3 .42
DAP Fa97, Ó U.CB
Stalling until branches complete is too slow and reduces potential speedup too much.
Improvement strategy: Assume branch will not be taken and continue execution down the sequential instr stream. If the branch is taken, the instrs that are being fetched and decoded must be discarded, and execution continues at the branch target. 
If branches are taken ½ the time, and if it costs little to discard the instrs, this scheme halves the cost of control hazards.
To discard instrs, the original control values are changed to 0’s (like load hazards). Change also for instrs in the IF, ID and EX stages when the branch reaches MEM - Flush instrs in the IF, ID, and EX stages. 
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Impact of Branch Instrs on Pipe Efficiency
cs 152 L1 3 .3
DAP Fa97, Ó U.CB
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
b) Reducing the Delay of Branches
04/26/99
4
cs 152 L1 3 .4
DAP Fa97, Ó U.CB
Reduce the cost of the taken branch.
In first implementation, the next PC for a branch is selected in the MEM stage.
Move the branch execution to the ID stage. Special HW in ID: adder for address calculation (moved from MEM stage); Exclusive-OR and AND logic (for equality testing).
Only one instruction to flush if the branch is taken.
To flush: control line, IF.Flush, that zeros the instr control signals of the IF/ID pipe regst (the instr becomes a nop).
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Datapath for Branch (HW to Flush Instr after branch)
IF.Flush control signal actually comes from equality checker (evaluates 
 branch condition).
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Example: Pipelined Branch Execution
04/26/99
6
cs 152 L1 3 .6
DAP Fa97, Ó U.CB
Assume the pipeline is optimized for branches not taken and branch execution is moved to ID stage.
36 sub $10, $4, $8
40 beq $1, $3, 7 #PC-relative branch to 40+4+7*4 = 72
44 and $12, $2, $5
48 or $13, $2, $6
52 add $14, $4, $2
56 slt $15, $6, $7
…
72 lw $4, 50($7)
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Branch Execution of Sample Code
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Dynamic Branch Prediction
04/26/99
8
cs 152 L1 3 .8
DAP Fa97, Ó U.CB
Assuming Branch taken is a crude form of branch prediction. Try to detect regularities in branch behavior.
 
Algorithm: look up the address of the instr to see if a branch was taken the last time this instr was executed and, if so, begin fetching new instrs from the same place as the last time.
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Implementation of Dynamic Branch Prediction
Implementation: branch prediction buffer or branch history table, a small memory indexed by the lower portion of the address of the branch instr. This memory contains a bit that says wether the branch was taken or not. 
 The branch prediction buffer can be implemented as a small, special buffer accessed with the instr address during the IF stage. 
If the instr is predicted as taken, fetching begins from the target as soon as the PC is known (this can be as early as the ID stage). 
Otherwise sequential fetching continues. If prediction is wrong, prediction bit must be changed.
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Example Branch Prediction (1 bit history table)
04/26/99
*
9
cs 152 L1 3 .9
DAP Fa97, Ó U.CB
Consider the following loop with: i=1 initially, j=1, h=10
Loop: g = g + A[ i ];
 i = i + j;
 if (i != h) go to Loop;
Loop: add $t1, $s3, $s3		#Temp reg $t1 = 2 *i
 add $t1, $t1, $t1		#Temp reg $t1 = 4*i
 add $t1, $t1, $s5		#$t1 = address of A[ i ] 
 lw $t0, 0($t1)		#Temp reg $t0 = A[ i ]
 add $s1, $s1, $t0		#g = g + A[ i ]
 add $s3, $s3, $s4		#i = i + j
 bne $s3, $s3, Loop 	# go to Loop if i != j
The algorithm will mispredict the first time (i = 1, bit =0) and last time (i=10, bit =1).
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Loops and Prediction
04/26/99
10
cs 152 L1 3 .10
DAP Fa97, Ó U.CB
The prediction may be wrong (it may have been put there by another branch that has the same low order bits). This doesn’t affect correctness. If the prediction is wrong, the prediction bit is inverted and stored back, and the proper bit sequence is executed.
The prediction accuracy for this branch that is taken 90% of the time is only 80% (2 incorrect predictions and eight correct ones).
For highly regular branches, we want the accuracy of the predictor to match the taken branch frequency.
Solution: Use a 2 bit prediction buffer. A prediction must be wrong twice before it is changed.
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Finite State Machine for 2-bit Prediction Scheme
A branch that strongly favors taken or not taken will be mispredicted only once. The 2 bits are used to encode the four states in the system. 
Assume code has exited an earlier loop:
 when enters loop bits=10 (2nd state, right top) predicts correctly;
 during other prediction bits = 11 (1st state, left top);
 at last loop prediction bits = 11, predicts wrongly and changes to10;
 after last prediction bits = 10 (2nd state, right top).
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Compiler Scheduling for Branch Delays 
04/27/99
12
cs 152 L1 3 .12
DAP Fa97, Ó U.CB
A delayed branch always executes the following instr, but the 2nd instr following the branch will be affected by the branch.
Compilers and assemblers try to place an instr that always executes after the branch in the branch delay slot. The SW tries to make the successor instr valid and useful.
Limitations of scheduling:
 a) restrictions on instrs that are scheduled in the slot,
 b) the ability to predict at compile time wether a
 branch will be taken or not.
As machines go to longer pipes and towards issuing 
multiple instrs / cycle a single delay slot is not very 
useful. This strategy is loosing popularity. 
Transistors in chip increase, dynamic predictors become more popular.
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
3 Ways to Schedule Branch Delay Slot
In b) and c) performance is improved only when execution proceeds 
 in expected direction.
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Comments on Branch Delay Scheduling
In a) the slot is scheduled with an independent instr from before the branch. This is the best choice, it always improves performance.
b) and c) are used when a) is not possible ($s1 written by add and read by if).
In b) usually the target will have to be copied into slot because it can be reached by another path. This strategy is preferred when branch is taken with high probability (such as loop braches).
In c) the slot is scheduled with the not-taken fall- through.
To make b) and c) scheduling possible, it is necessary to be OK to execute the sub instr when the branch goes in the unexpected direction (work is wasted but the program executes correctly, e. g. $t4 is not used when branch takes unexpected direction).
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Execution Model Performance Comparisson
Compare performance of single-cycle, multicycle and pipelined control for instr mix of gcc (fig 4.54)
instr mix:
 22% loads, 11% stores, 49% R-format, 16% branches, 2% jumps
operation times for major functional units: mem access = 2ns, ALU operations = 2 ns, regs file read / write = 1 ns.
Single-cycle: 8 ns / instruction = cycle duration.
Multicycle:
 No. of cycles for each operation (CPI) in multicycle implementation: 
 loads = 5, stores = 4, R-format = 4, branches
= 3, jumps = 3 cycles. 
 CPI = 0.22*5 + 0.11*4 + 0.49*4 + 0.16*3 + 0.02*3 = 4.04 cycles / instr
 Average time for instr execution = 4.04 * 2 ns = 8.08 ns.
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Performance Comparisson Continuation
Pipelined: assume
 1/2 of loads followed by instr that uses load result; branch delay on misprediction 1 cycle and 1/4 branches are mispredicted; jumps always delay 1 cycle.
 loads: 1 cycle when no dependency and 2 when there is dependency, 
 average cycle for loads = 1.5 cycles
 stores and R-format: 1 cycle
 branches: 1 cycle when prediction is correct and 2 cycles when 
 wrong, average cycle for branches = 1 + 1*0.25 = 1.25 cycles
 jumps: 1 cycle + 1 cycle delay = 2 cycles
 CPI = 1.5 *0.22 + 1*0.11 + 1*0.49 + 1.25*0.16 + 2*0.02 = 1.17 cycles
 average instr execution time 1.17*2 ns = 2.34 ns
Pipelined control time is:
 8 ns / 2.34 ns = 3.4 times faster than single-cycle control, 
 8.08 ns / 2.34 ns = 3.5 times faster than multicycle control (book does 4.04 ns / 2.34 ns = 1.73 times faster - wrong...see example pg 397 chapter 5).
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
6.7 Exceptions
Exceptions and interrupts are events other than branches or jumps that change the normal flow of instr execution.
Exception: an unexpected event from within the processor. Ex: arithmetic overflow
Interrupts: a change of control flow from outside the processor. Ex: Interrupts caused by I/O devices to communicate with the processor.
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
What about Interrupts, Traps, Faults?
External Interrupts:
Allow pipeline to drain,
Load PC with interupt address
Faults (within instruction, restartable)
Force trap instruction into IF (transfer control to the 
 exception routine)
disable writes till trap hits WB
must save multiple PCs or PC + state
Refer to MIPS solution
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Example: Arithmetic Overflow Exception
What happens if an over flow exception occurs in the add instr of the following instr sequence? Instr addresses are given in hexadecimal.
 40 sub $11, $2, $4
 44 and $12, $2, $5
 48 or $13, $2, $6
 4C add $1, $2, $1
 50 slt $15, $6, $7
 54 lw $16, 50($7)
assume the instrs to be invoked on an exception begin:
 4000040 sw $25, 1000($0)
 4000044 sw $26, 1004($0)
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Bubbles are Inserted
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
How to Flush Unwanted Instructions
Flush instrs in IF by turning it into nop (IF.Flush control signal).
Use multiplexors already in ID to zero control signals of instr in ID stage. A new ID.Flush control signal is ORed with the stall signal from the Hazard Detection Unit to flush during ID.
Use a new signal EX.Flush is used to cause new multiplexors to zero the control lines in EX.
To start fetching instrs at location 40000040, add an additional input to the PC multiplexor that sends 40000040 to the PC.
If the instr is not stopped in the middle of its execution, the programmer will see a wrong value of $1 (not the original value that caused the overflow). The exception must be detected and flush signals set during the EX of add instr. EX.Flush prevents this instr from writing in the WB stage.
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Datapath with Controls to Handle Exceptions
Cause regs records cause of exception; Exception PC saves address of 
 instr that caused the exception (actually saves address + 4). In example
 EPC = 4C + 4 = 50 (in hexadecimal).
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Other Causes of Exceptions
I/O device request
Invoking Operating System service from user program
Using an undefined instr
HW malfunction
How to associate an exception with the corresponding instr when there are 5 instrs active in a clok cycle?
Multiple exceptions can occur simultaneously in a single clock cycle.
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Exception Problem
Exceptions/Interrupts: 5 instructions executing in 5 stage pipeline
How to stop the pipeline?
Restart?
Who caused the interrupt?
Stage	Problem interrupts occurring
IF		Page fault on instruction fetch; misaligned memory 		access; memory-protection violation
ID		Undefined or illegal opcode
EX	Arithmetic exception
MEM	Page fault on data fetch; misaligned memory 			access; memory-protection violation; memory error
Load with data page fault, Add with instruction page fault?
Solution 1: interrupt vector/instruction 2: interrupt ASAP, restart everything incomplete
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Exception Handling
npc
I mem
Regs
B
alu
S
D mem
m
IAU
PC
lw $2,20($5)
Regs
A
im
op
rw
n
detect bad instruction address
detect bad instruction
detect overflow
detect bad data address
Allow exception to take effect
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Solutions to Exception Problem
I/O device requests and HW malfunction: not associated with a specific instr. Implementation has some flexibility as to when to interrupt the pipeline. For I/O request, stop at simplest instr. For HW malfunction, it is best to stop as soon as possible because HW is unstable.
Prioritize exceptions so it is easy to determine which is serviced first. In most MIPs implementations the HW sorts instrs so that earliest instr is interrupted.
EPC has address of interrupted instrs.
Cause regs records all possible exceptions in a clock cycle.
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Pipeline Changes Order of Exception Ocurrence
instr__1__ _2___3___4___5___6___7___8___9___10
i-3 IF ID EX MEM WB
i-2 IF ID EX MEM WB
i-1 IF ID EX MEM WB
i LW IF ID EX MEM WB
i+1 ADD IF ID EX MEM WB
i+2 IF ID EX MEM WB
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
HW SW Interface for Exception Handling
By knowing in which stage an exception can occur, the exception SW can match the exception to the instr.
Exceptions are collected in the cause reg so that the HW can interrupt based on later exceptions, once the earliest one has been serviced.
Precise Interrupts: The pipeline is stopped so that the instrs prior to the falty instr j complete and the following are flushed and later restarted. 
Implementation: HW associates an interrupt vector with each instr and disables the effects of the of the faulty instr j when it is detected. The instr follows down the pipe and the exception is treated at the end of the MEM stage of j. This guarantees that the exceptions of instr i are treated before those of instr i+1.
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
HW SW Interface (Cont...)
The HW saves the address in EPC, and the cause of the falty instr in the cause reg and then jumps to a prearranged address of the exception handling routine.
The OS looks at the cause and acts appropriately.
For undefined instr, HW malfunction, arithmetic overflow, the OS normally kills the program and returns an indicator of the reason.
For I/O device request or an OS service call, the OS saves the state of the program, performs the desired task, and then restores the program to continue execution.
MIPS and the most machines today support precise interrupts or precise exceptions because they simplify the interface with the OS (e.g. to support virtual memory).
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Imprecise Interrupts or Imprecise Exceptions
Imprecise Interrupts: Treat exception when it occurs.
instr__1__ _2___3___4___5___6___7___8___9___10
i-3 IF ID EX MEM WB
i-2 IF ID EX MEM WB
i-1 IF ID EX MEM WB
i LW IF ID EX MEM WB
i+1 ADD IF ID EX MEM WB
i+2 IF ID EX MEM WB
Treats ADD exception
at the end of 5th cycle. Stop instrs i-2, i-1, i, and i+1. Reinitiate pipe at intrs i-2 after exception handling. It will then detect the pg fault of instr i, stopping instrs i...i+3.
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Why use Imprecise Exceptions?
The difficulty of making the HW associate the correct exception with the correct instr in pipelined computers sometimes leads designers to relax precise exception treatment.
Let the OS system determine: 
which instr caused the exception (from the type of exception and the stage in which it occured)
which instrs to flush
from which instr to restart
cs 152 L1 3 .*
DAP Fa97, Ó U.CB
Problems for Chapter 6
6.1, 6.2, 6.3, 6.4, 6.5, 6.7, 6.8, 6.9, 6.11, 6.12, 6.13, 6.14, 6.15, 6.19, 6.23, 
Study analytic performance examples done in class and in the book.
1
1
1
1
1
1
1
2
2
2
2
2
2
2
3
3
3
3
3
3
3
4
4
4
4
4
4
5
5
5
5
5
5
6
6
6
6
6
6
7
7
7
7
7
7
8
8
8
8
8
8
9
9
10
10
10
9
9
11
11
11
10
10
12
12
12
11
11
13
13
13
12
14
14
14
13
15
15
15
14
16
16
16
15
17
17
17
16
18
18
18
17
19
19
19
18
20
20
20
21
21
21
22
22
22
23
23
23
24
24
24
25
25
25
25
26
26
27
27
27
28
28
29
29
30
31
32
33

Teste o Premium para desbloquear

Aproveite todos os benefícios por 3 dias sem pagar! 😉
Já tem cadastro?

Outros materiais