Baixe o app para aproveitar ainda mais
Esta é uma pré-visualização de arquivo. Entre para ver o arquivo original
cs 152 L1 3 .* DAP Fa97, Ó U.CB 6.6 Branch Hazards (Control Hazards) cs 152 L1 3 . DAP Fa97, Ó U.CB Control Hazards are simple to understand and occur much less frequently than data hazards. They can’t be solved as effectively as forwarding is for data hazards. There are 2 schemes for resolving these hazards and one optimization to improve these schemes. cs 152 L1 3 .* DAP Fa97, Ó U.CB a) Assume Branch not Taken cs 152 L1 3 .42 DAP Fa97, Ó U.CB Stalling until branches complete is too slow and reduces potential speedup too much. Improvement strategy: Assume branch will not be taken and continue execution down the sequential instr stream. If the branch is taken, the instrs that are being fetched and decoded must be discarded, and execution continues at the branch target. If branches are taken ½ the time, and if it costs little to discard the instrs, this scheme halves the cost of control hazards. To discard instrs, the original control values are changed to 0’s (like load hazards). Change also for instrs in the IF, ID and EX stages when the branch reaches MEM - Flush instrs in the IF, ID, and EX stages. cs 152 L1 3 .* DAP Fa97, Ó U.CB Impact of Branch Instrs on Pipe Efficiency cs 152 L1 3 .3 DAP Fa97, Ó U.CB cs 152 L1 3 .* DAP Fa97, Ó U.CB b) Reducing the Delay of Branches 04/26/99 4 cs 152 L1 3 .4 DAP Fa97, Ó U.CB Reduce the cost of the taken branch. In first implementation, the next PC for a branch is selected in the MEM stage. Move the branch execution to the ID stage. Special HW in ID: adder for address calculation (moved from MEM stage); Exclusive-OR and AND logic (for equality testing). Only one instruction to flush if the branch is taken. To flush: control line, IF.Flush, that zeros the instr control signals of the IF/ID pipe regst (the instr becomes a nop). cs 152 L1 3 .* DAP Fa97, Ó U.CB Datapath for Branch (HW to Flush Instr after branch) IF.Flush control signal actually comes from equality checker (evaluates branch condition). cs 152 L1 3 .* DAP Fa97, Ó U.CB Example: Pipelined Branch Execution 04/26/99 6 cs 152 L1 3 .6 DAP Fa97, Ó U.CB Assume the pipeline is optimized for branches not taken and branch execution is moved to ID stage. 36 sub $10, $4, $8 40 beq $1, $3, 7 #PC-relative branch to 40+4+7*4 = 72 44 and $12, $2, $5 48 or $13, $2, $6 52 add $14, $4, $2 56 slt $15, $6, $7 … 72 lw $4, 50($7) cs 152 L1 3 .* DAP Fa97, Ó U.CB Branch Execution of Sample Code cs 152 L1 3 .* DAP Fa97, Ó U.CB Dynamic Branch Prediction 04/26/99 8 cs 152 L1 3 .8 DAP Fa97, Ó U.CB Assuming Branch taken is a crude form of branch prediction. Try to detect regularities in branch behavior. Algorithm: look up the address of the instr to see if a branch was taken the last time this instr was executed and, if so, begin fetching new instrs from the same place as the last time. cs 152 L1 3 .* DAP Fa97, Ó U.CB Implementation of Dynamic Branch Prediction Implementation: branch prediction buffer or branch history table, a small memory indexed by the lower portion of the address of the branch instr. This memory contains a bit that says wether the branch was taken or not. The branch prediction buffer can be implemented as a small, special buffer accessed with the instr address during the IF stage. If the instr is predicted as taken, fetching begins from the target as soon as the PC is known (this can be as early as the ID stage). Otherwise sequential fetching continues. If prediction is wrong, prediction bit must be changed. cs 152 L1 3 .* DAP Fa97, Ó U.CB Example Branch Prediction (1 bit history table) 04/26/99 * 9 cs 152 L1 3 .9 DAP Fa97, Ó U.CB Consider the following loop with: i=1 initially, j=1, h=10 Loop: g = g + A[ i ]; i = i + j; if (i != h) go to Loop; Loop: add $t1, $s3, $s3 #Temp reg $t1 = 2 *i add $t1, $t1, $t1 #Temp reg $t1 = 4*i add $t1, $t1, $s5 #$t1 = address of A[ i ] lw $t0, 0($t1) #Temp reg $t0 = A[ i ] add $s1, $s1, $t0 #g = g + A[ i ] add $s3, $s3, $s4 #i = i + j bne $s3, $s3, Loop # go to Loop if i != j The algorithm will mispredict the first time (i = 1, bit =0) and last time (i=10, bit =1). cs 152 L1 3 .* DAP Fa97, Ó U.CB Loops and Prediction 04/26/99 10 cs 152 L1 3 .10 DAP Fa97, Ó U.CB The prediction may be wrong (it may have been put there by another branch that has the same low order bits). This doesn’t affect correctness. If the prediction is wrong, the prediction bit is inverted and stored back, and the proper bit sequence is executed. The prediction accuracy for this branch that is taken 90% of the time is only 80% (2 incorrect predictions and eight correct ones). For highly regular branches, we want the accuracy of the predictor to match the taken branch frequency. Solution: Use a 2 bit prediction buffer. A prediction must be wrong twice before it is changed. cs 152 L1 3 .* DAP Fa97, Ó U.CB Finite State Machine for 2-bit Prediction Scheme A branch that strongly favors taken or not taken will be mispredicted only once. The 2 bits are used to encode the four states in the system. Assume code has exited an earlier loop: when enters loop bits=10 (2nd state, right top) predicts correctly; during other prediction bits = 11 (1st state, left top); at last loop prediction bits = 11, predicts wrongly and changes to10; after last prediction bits = 10 (2nd state, right top). cs 152 L1 3 .* DAP Fa97, Ó U.CB Compiler Scheduling for Branch Delays 04/27/99 12 cs 152 L1 3 .12 DAP Fa97, Ó U.CB A delayed branch always executes the following instr, but the 2nd instr following the branch will be affected by the branch. Compilers and assemblers try to place an instr that always executes after the branch in the branch delay slot. The SW tries to make the successor instr valid and useful. Limitations of scheduling: a) restrictions on instrs that are scheduled in the slot, b) the ability to predict at compile time wether a branch will be taken or not. As machines go to longer pipes and towards issuing multiple instrs / cycle a single delay slot is not very useful. This strategy is loosing popularity. Transistors in chip increase, dynamic predictors become more popular. cs 152 L1 3 .* DAP Fa97, Ó U.CB 3 Ways to Schedule Branch Delay Slot In b) and c) performance is improved only when execution proceeds in expected direction. cs 152 L1 3 .* DAP Fa97, Ó U.CB Comments on Branch Delay Scheduling In a) the slot is scheduled with an independent instr from before the branch. This is the best choice, it always improves performance. b) and c) are used when a) is not possible ($s1 written by add and read by if). In b) usually the target will have to be copied into slot because it can be reached by another path. This strategy is preferred when branch is taken with high probability (such as loop braches). In c) the slot is scheduled with the not-taken fall- through. To make b) and c) scheduling possible, it is necessary to be OK to execute the sub instr when the branch goes in the unexpected direction (work is wasted but the program executes correctly, e. g. $t4 is not used when branch takes unexpected direction). cs 152 L1 3 .* DAP Fa97, Ó U.CB Execution Model Performance Comparisson Compare performance of single-cycle, multicycle and pipelined control for instr mix of gcc (fig 4.54) instr mix: 22% loads, 11% stores, 49% R-format, 16% branches, 2% jumps operation times for major functional units: mem access = 2ns, ALU operations = 2 ns, regs file read / write = 1 ns. Single-cycle: 8 ns / instruction = cycle duration. Multicycle: No. of cycles for each operation (CPI) in multicycle implementation: loads = 5, stores = 4, R-format = 4, branches = 3, jumps = 3 cycles. CPI = 0.22*5 + 0.11*4 + 0.49*4 + 0.16*3 + 0.02*3 = 4.04 cycles / instr Average time for instr execution = 4.04 * 2 ns = 8.08 ns. cs 152 L1 3 .* DAP Fa97, Ó U.CB Performance Comparisson Continuation Pipelined: assume 1/2 of loads followed by instr that uses load result; branch delay on misprediction 1 cycle and 1/4 branches are mispredicted; jumps always delay 1 cycle. loads: 1 cycle when no dependency and 2 when there is dependency, average cycle for loads = 1.5 cycles stores and R-format: 1 cycle branches: 1 cycle when prediction is correct and 2 cycles when wrong, average cycle for branches = 1 + 1*0.25 = 1.25 cycles jumps: 1 cycle + 1 cycle delay = 2 cycles CPI = 1.5 *0.22 + 1*0.11 + 1*0.49 + 1.25*0.16 + 2*0.02 = 1.17 cycles average instr execution time 1.17*2 ns = 2.34 ns Pipelined control time is: 8 ns / 2.34 ns = 3.4 times faster than single-cycle control, 8.08 ns / 2.34 ns = 3.5 times faster than multicycle control (book does 4.04 ns / 2.34 ns = 1.73 times faster - wrong...see example pg 397 chapter 5). cs 152 L1 3 .* DAP Fa97, Ó U.CB 6.7 Exceptions Exceptions and interrupts are events other than branches or jumps that change the normal flow of instr execution. Exception: an unexpected event from within the processor. Ex: arithmetic overflow Interrupts: a change of control flow from outside the processor. Ex: Interrupts caused by I/O devices to communicate with the processor. cs 152 L1 3 .* DAP Fa97, Ó U.CB What about Interrupts, Traps, Faults? External Interrupts: Allow pipeline to drain, Load PC with interupt address Faults (within instruction, restartable) Force trap instruction into IF (transfer control to the exception routine) disable writes till trap hits WB must save multiple PCs or PC + state Refer to MIPS solution cs 152 L1 3 .* DAP Fa97, Ó U.CB Example: Arithmetic Overflow Exception What happens if an over flow exception occurs in the add instr of the following instr sequence? Instr addresses are given in hexadecimal. 40 sub $11, $2, $4 44 and $12, $2, $5 48 or $13, $2, $6 4C add $1, $2, $1 50 slt $15, $6, $7 54 lw $16, 50($7) assume the instrs to be invoked on an exception begin: 4000040 sw $25, 1000($0) 4000044 sw $26, 1004($0) cs 152 L1 3 .* DAP Fa97, Ó U.CB Bubbles are Inserted cs 152 L1 3 .* DAP Fa97, Ó U.CB How to Flush Unwanted Instructions Flush instrs in IF by turning it into nop (IF.Flush control signal). Use multiplexors already in ID to zero control signals of instr in ID stage. A new ID.Flush control signal is ORed with the stall signal from the Hazard Detection Unit to flush during ID. Use a new signal EX.Flush is used to cause new multiplexors to zero the control lines in EX. To start fetching instrs at location 40000040, add an additional input to the PC multiplexor that sends 40000040 to the PC. If the instr is not stopped in the middle of its execution, the programmer will see a wrong value of $1 (not the original value that caused the overflow). The exception must be detected and flush signals set during the EX of add instr. EX.Flush prevents this instr from writing in the WB stage. cs 152 L1 3 .* DAP Fa97, Ó U.CB Datapath with Controls to Handle Exceptions Cause regs records cause of exception; Exception PC saves address of instr that caused the exception (actually saves address + 4). In example EPC = 4C + 4 = 50 (in hexadecimal). cs 152 L1 3 .* DAP Fa97, Ó U.CB Other Causes of Exceptions I/O device request Invoking Operating System service from user program Using an undefined instr HW malfunction How to associate an exception with the corresponding instr when there are 5 instrs active in a clok cycle? Multiple exceptions can occur simultaneously in a single clock cycle. cs 152 L1 3 .* DAP Fa97, Ó U.CB Exception Problem Exceptions/Interrupts: 5 instructions executing in 5 stage pipeline How to stop the pipeline? Restart? Who caused the interrupt? Stage Problem interrupts occurring IF Page fault on instruction fetch; misaligned memory access; memory-protection violation ID Undefined or illegal opcode EX Arithmetic exception MEM Page fault on data fetch; misaligned memory access; memory-protection violation; memory error Load with data page fault, Add with instruction page fault? Solution 1: interrupt vector/instruction 2: interrupt ASAP, restart everything incomplete cs 152 L1 3 .* DAP Fa97, Ó U.CB Exception Handling npc I mem Regs B alu S D mem m IAU PC lw $2,20($5) Regs A im op rw n detect bad instruction address detect bad instruction detect overflow detect bad data address Allow exception to take effect cs 152 L1 3 .* DAP Fa97, Ó U.CB Solutions to Exception Problem I/O device requests and HW malfunction: not associated with a specific instr. Implementation has some flexibility as to when to interrupt the pipeline. For I/O request, stop at simplest instr. For HW malfunction, it is best to stop as soon as possible because HW is unstable. Prioritize exceptions so it is easy to determine which is serviced first. In most MIPs implementations the HW sorts instrs so that earliest instr is interrupted. EPC has address of interrupted instrs. Cause regs records all possible exceptions in a clock cycle. cs 152 L1 3 .* DAP Fa97, Ó U.CB Pipeline Changes Order of Exception Ocurrence instr__1__ _2___3___4___5___6___7___8___9___10 i-3 IF ID EX MEM WB i-2 IF ID EX MEM WB i-1 IF ID EX MEM WB i LW IF ID EX MEM WB i+1 ADD IF ID EX MEM WB i+2 IF ID EX MEM WB cs 152 L1 3 .* DAP Fa97, Ó U.CB HW SW Interface for Exception Handling By knowing in which stage an exception can occur, the exception SW can match the exception to the instr. Exceptions are collected in the cause reg so that the HW can interrupt based on later exceptions, once the earliest one has been serviced. Precise Interrupts: The pipeline is stopped so that the instrs prior to the falty instr j complete and the following are flushed and later restarted. Implementation: HW associates an interrupt vector with each instr and disables the effects of the of the faulty instr j when it is detected. The instr follows down the pipe and the exception is treated at the end of the MEM stage of j. This guarantees that the exceptions of instr i are treated before those of instr i+1. cs 152 L1 3 .* DAP Fa97, Ó U.CB HW SW Interface (Cont...) The HW saves the address in EPC, and the cause of the falty instr in the cause reg and then jumps to a prearranged address of the exception handling routine. The OS looks at the cause and acts appropriately. For undefined instr, HW malfunction, arithmetic overflow, the OS normally kills the program and returns an indicator of the reason. For I/O device request or an OS service call, the OS saves the state of the program, performs the desired task, and then restores the program to continue execution. MIPS and the most machines today support precise interrupts or precise exceptions because they simplify the interface with the OS (e.g. to support virtual memory). cs 152 L1 3 .* DAP Fa97, Ó U.CB Imprecise Interrupts or Imprecise Exceptions Imprecise Interrupts: Treat exception when it occurs. instr__1__ _2___3___4___5___6___7___8___9___10 i-3 IF ID EX MEM WB i-2 IF ID EX MEM WB i-1 IF ID EX MEM WB i LW IF ID EX MEM WB i+1 ADD IF ID EX MEM WB i+2 IF ID EX MEM WB Treats ADD exception at the end of 5th cycle. Stop instrs i-2, i-1, i, and i+1. Reinitiate pipe at intrs i-2 after exception handling. It will then detect the pg fault of instr i, stopping instrs i...i+3. cs 152 L1 3 .* DAP Fa97, Ó U.CB Why use Imprecise Exceptions? The difficulty of making the HW associate the correct exception with the correct instr in pipelined computers sometimes leads designers to relax precise exception treatment. Let the OS system determine: which instr caused the exception (from the type of exception and the stage in which it occured) which instrs to flush from which instr to restart cs 152 L1 3 .* DAP Fa97, Ó U.CB Problems for Chapter 6 6.1, 6.2, 6.3, 6.4, 6.5, 6.7, 6.8, 6.9, 6.11, 6.12, 6.13, 6.14, 6.15, 6.19, 6.23, Study analytic performance examples done in class and in the book. 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 8 8 8 9 9 10 10 10 9 9 11 11 11 10 10 12 12 12 11 11 13 13 13 12 14 14 14 13 15 15 15 14 16 16 16 15 17 17 17 16 18 18 18 17 19 19 19 18 20 20 20 21 21 21 22 22 22 23 23 23 24 24 24 25 25 25 25 26 26 27 27 27 28 28 29 29 30 31 32 33
Compartilhar