Baixe o app para aproveitar ainda mais
Esta é uma pré-visualização de arquivo. Entre para ver o arquivo original
cs 152 L1 3 .* DAP Fa97, Ó U.CB Recap: Sequential Laundry Sequential laundry takes 8 hours for 4 loads If they learned pipelining, how long would laundry take? 30 T a s k O r d e r Time 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 6 PM 7 8 9 10 11 12 1 2 AM cs 152 L1 3 .* DAP Fa97, Ó U.CB Recap: Pipelining Lessons (it’s intuitive!) Pipelining doesn’t help latency of single task, it helps throughput of entire workload Multiple tasks operating simultaneously using different resources Potential speedup = Number pipe stages Pipeline rate limited by slowest pipeline stage Stages should take about same amount of time Unbalanced lengths of pipe stages reduces speedup Time to “fill” pipeline and time to “drain” it reduces speedup Stall for Dependences 6 PM 7 8 9 Time T a s k O r d e r cs 152 L1 3 .* DAP Fa97, Ó U.CB Recap: Ideal Pipelining Maximum Speedup £ Number of stages Speedup £ Time for unpipelined operation Time for longest stage IF DCD EX MEM WB IF DCD EX MEM WB IF DCD EX MEM WB IF DCD EX MEM WB IF DCD EX MEM WB Example: 40ns data path, 5 stages, Longest stage is 10 ns, Speedup £ 4 Assume instructions are completely independent! cs 152 L1 3 .* DAP Fa97, Ó U.CB Recap: Graphically Representing Pipelines Can help with answering questions like: how many cycles does it take to execute this code? what is the ALU doing during cycle 4? use this representation to help understand datapaths cs 152 L1 3 .* DAP Fa97, Ó U.CB Recap: Can pipelining get us into trouble? Yes: Pipeline Hazards structural hazards: attempt to use the same resource two different ways at the same time e.g., multiple memory accesses, multiple register writes solutions: multiple memories, stretch pipeline control hazards: attempt to make a decision before condition is evaulated e.g., any conditional branch solutions: prediction, delayed branch data hazards: attempt to use item before it is ready e.g., add r1,r2,r3; sub r4, r1 ,r5; lw r6, 0(r7); or r8, r6 ,r9 solutions: forwarding/bypassing, stall/bubble cs 152 L1 3 .* DAP Fa97, Ó U.CB Recap: Pipelined Datapath with Data Stationary Control npc I mem Regs B alu S D mem m IAU PC lw $2,20($5) Regs Operand Register Selects ALU Op MEM Op Result Reg Select and Enable Just like Time-State! A im op rw n <= PC + 4 + immed cs 152 L1 3 .* DAP Fa97, Ó U.CB Recap Pipelining is a fundamental concept multiple steps using distinct resources Utilize capabilities of the Datapath by pipelined instruction processing start next instruction while working on the current one limited by length of longest stage (plus fill/flush) detect and resolve hazards What makes it easy all instructions are the same length just a few instruction formats memory operands appear only in loads and stores Hazards make it hard We’ll build a simple pipeline and look at these issues cs 152 L1 3 .* DAP Fa97, Ó U.CB The Big Picture: Where are We Now? The Five Classic Components of a Computer This Section´s Topics: Recap last lecture Pipelined Control/ Do it yourself Pipelined Control Hazards/Forwarding Exceptions Review MIPS R3000 pipeline Advanced Pipelining? cs 152 L1 3 .* DAP Fa97, Ó U.CB Recap: Control Diagram IR <- Mem[PC]; PC < PC+4; A <- R[rs]; B<– R[rt] S <– A + B; R[rd] <– S; S <– A + SX; M <– Mem[S] R[rd] <– M; S <– A or ZX; R[rt] <– S; S <– A + SX; Mem[S] <- B If Cond PC < PC+SX; Exec Reg. File Mem Access Data Mem Reg File Equal PC Next PC IR Inst. Mem M <– S M <– S cs 152 L1 3 .* DAP Fa97, Ó U.CB 6.3 (Parts of 6.6) Pipelined Control cs 152 L1 3 .10 DAP Fa97, Ó U.CB To specify control for the pipeline, we need only set the control values during each pipeline stage. Because each control line is associated with a component active in only a single pipeline stage, we divide the control lines into five groups. The PC and Pipeline registers are written every clock cycle, so there is no seperate write signal for them. Since the control lines start with the EX stage, the control information is created during instruction decode. The signals are then used in the appropriate pipeline stage as the instr moves down the pipe. The signals for the stage are stored in the pipeline regs. If control bit into 2 way multiplexor = 1 then multiplexor selects the input corresponding to 1, if control bit = 0 then multiplexor selects the input corresponding to 0. cs 152 L1 3 .* DAP Fa97, Ó U.CB Specification of 9 Control Signals (Figs. 6.26 and 6.27) Instruction Fetch: Control signals to read Inst Mem and write PC are always asserted (nothing special to control). ID / Reg file read: same as previous stage (no optional control lines to set). EX / Address Calculation: signals to be set are RegDest (selects result reg), ALUOp (2 control bits assert ALU operation, a 6-bit function code field specified in the immediate field of the instr saved in the ID/EX pipe reg, see fig 6.26) and ALUSrc (Read data reg 2 or a sign-extended immediate for the ALU). Mem Access: control lines to be set are Branch (set by branch instr), MemRead (load instr)and MemWrite (store instr). PCSrc selects the next sequential address unless control asserts Branch and the ALU result was zero. Write back: control lines are MemtoReg (decides between sending the ALU result or the memory value to the reg file) and RegWrite (writes the chosen value). cs 152 L1 3 .* DAP Fa97, Ó U.CB But recall use of “Data Stationary Control” The Main Control generates the control signals during Reg/Dec Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later Control signals for Mem (MemWr Branch) are used 2 cycles later Control signals for Wr (MemtoReg MemWr) are used 3 cycles later IF/ID Register ID/Ex Register Ex/Mem Register Mem/Wr Register Reg/Dec Exec Mem ExtOp ALUOp RegDst ALUSrc Branch MemWr MemtoReg RegWr Main Control ExtOp ALUOp RegDst ALUSrc MemtoReg RegWr MemtoReg RegWr MemtoReg RegWr Branch MemWr Branch MemWr Wr cs 152 L1 3 .* DAP Fa97, Ó U.CB Datapath + Data Stationary Control Exec Reg. File Mem Access Data Mem Reg File Next PC IR Inst. Mem Decode Mem Ctrl WB Ctrl rs rt op rs rt fun im ex me wb rw v me wb rw v wb rw v cs 152 L1 3 .* DAP Fa97, Ó U.CB Optimized Data Path cs 152 L1 3 .14 DAP Fa97, Ó U.CB Suppose extra HW (branch address adder moved from EX to IF stage and comparisson logic) is added in the ID / RegFetch stage so that branches are resolved in the second cycle of branch execution. Next PC is known in second cycle of branch ---> saves 1 clock cycle of penalty. cs 152 L1 3 .* DAP Fa97, Ó U.CB Let’s Try it Out 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 these addresses are octal cs 152 L1 3 .* DAP Fa97, Ó U.CB Start: Fetch 10 Exec Reg. File Mem Access Data Mem Reg File IR Inst. Mem Decode Mem Ctrl WB Ctrl rs rt im 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 cs 152 L1 3 .* DAP Fa97, Ó U.CB Fetch 14, Decode 10 Exec Reg. File Mem Access Data Mem Reg File IR Inst. Mem Decode Mem Ctrl WB Ctrl 2 rt im 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 lw r1, r2(35) cs 152 L1 3 .* DAP Fa97, Ó U.CB Fetch 20, Decode 14, Exec 10 Exec Reg. File Mem Access Data Mem r2 Reg File IR Inst. Mem Decode Mem Ctrl WB Ctrl 2 rt 35 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 lw r1 addI r2, r2, 3 EX cs 152 L1 3 .* DAP Fa97, Ó U.CB Fetch 24, Decode 20, Exec 14, Mem 10 Exec Reg. File Mem Access Data Mem r2 r2+35 Reg File IR Inst. Mem Decode Mem Ctrl WB Ctrl 4 5 3 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 lw r1 sub r3, r4, r5 addI r2, r2, 3 M cs 152 L1 3 .* DAP Fa97, Ó U.CB Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10 Exec Reg. File Mem Access Data Mem r4 r2+3 Reg File IR Inst. Mem Decode Mem Ctrl WB Ctrl M[r2+35] 6 7 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 lw r1 beq r6, r7 100 addI r2 sub r3 Note Delayed Branch: always execute ori after beq cs 152 L1 3 .* DAP Fa97, Ó U.CB Fetch 100, Dcd 30, Ex 24, Mem 20, WB 14 Exec Reg. File Mem Access Data Mem r6 r2+3 Reg File IR Inst. Mem Decode Mem Ctrl WB Ctrl r1=M[r2+35] 9 xx 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 beq addI r2 sub r3 r4-r5 100 ori r8, r9 17 M WB cs 152 L1 3 .* DAP Fa97, Ó U.CB Fetch 104, Dcd 100, Ex 30, Mem 24, WB 20 Exec Reg. File Mem Access Data Mem Reg File IR Inst. Mem Decode Mem Ctrl WB Ctrl 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 Fill it in yourself! ? cs 152 L1 3 .* DAP Fa97, Ó U.CB Fetch 110, Dcd 104, Ex 100, Mem 30, WB 24 Exec Reg. File Mem Access Data Mem Reg File IR Inst. Mem Decode Mem Ctrl WB Ctrl 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 M Fill it in yourself! ? ? ? ? cs 152 L1 3 .* DAP Fa97, Ó U.CB Fetch 114, Dcd 110, Ex 104, Mem 100, WB 30 Exec Reg. File Mem Access Data Mem Reg File IR Inst. Mem Decode Mem Ctrl WB Ctrl 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 M Fill it in yourself! ? ? ? ? ? ? cs 152 L1 3 .* DAP Fa97, Ó U.CB Pipeline Hazards Again I-Fet ch DCD MemOpFetch OpFetch Exec Store IFetch DCD ° ° ° Structural Hazard I-Fet ch DCD OpFetch Jump IFetch DCD ° ° ° Control Hazard IF DCD EX Mem WB IF DCD OF Ex Mem RAW (read after write) Data Hazard WAW Data Hazard (write after write) IF DCD OF Ex WB WAR Data Hazard (write after read) IF DCD EX Mem WB IF DCD EX Mem WB cs 152 L1 3 .* DAP Fa97, Ó U.CB 6.4 Data Hazards Avoid some “by design” eliminate WAR by always fetching operands early (DCD) in pipe eliminate WAW by doing all WBs in order (last stage, static) Detect and resolve remaining ones stall or forward (if possible) Introduce a Forwarding Unit in the datapath cs 152 L1 3 .* DAP Fa97, Ó U.CB Compiler Scheduling Solution * sub $2, $1, $3 nop nop and $12, $2, $5 or $13, $6, $2 add $14, $2, $2 sw $15, 100($2) This code works properly but the 2 nops occupy 2 clock cycles that do no useful work. These dependencies occur too often to rely on compilers. cs 152 L1 3 .* DAP Fa97, Ó U.CB Hazard Detection Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline. A RAW hazard exists on register r if r Î Rregs( i ) Ç Wregs( j ) Keep a record of pending writes (for inst's in the pipe) and compare with operand regs of current instruction. When instruction issues, reserve its result register. When on operation completes, remove its write reservation. A WAW hazard exists on register r if r Î Wregs( i ) Ç Wregs( j ) A WAR hazard exists on register r if r Î Wregs( i ) Ç Rregs( j ) cs 152 L1 3 .* DAP Fa97, Ó U.CB Forwarding * cs 152 L1 3 .29 DAP Fa97, Ó U.CB If we can take the inputs to the ALU from any pipeline register rather than just the ID/EX, then we can forward the proper data. By adding multiplexors to the input of the ALU and with the proper controls, we can run the pipeline at full speed in the presence of some data dependencies. Notation: rs: 1st regs source operand rt: 2nd regs source operand rw (rd in text book): regs destination operand cs 152 L1 3 .* DAP Fa97, Ó U.CB Record of Pending Writes Current operand registers Pending writes hazard <= ((rs == rwex) & regWex) OR ((rs == rwmem) & regWme) OR ((rs == rwwb) & regWwb) OR ((rt == rwex) & regWex) OR ((rt == rwmem) & regWme) OR ((rt == rwwb) & regWwb) npc I mem Regs B alu S D mem m IAU PC Regs A im op rw n op rw n op rw n op rw rs rt cs 152 L1 3 .* DAP Fa97, Ó U.CB Resolve RAW by forwarding Detect nearest valid write op operand register and forward into op latches, bypassing remainder of the pipe Increase muxes to add paths from pipeline registers Data Forwarding = Data Bypassing npc I mem Regs B alu S D mem m IAU PC Regs A im op rw n op rw n op rw n op rw rs rt Forward mux cs 152 L1 3 .* DAP Fa97, Ó U.CB 6.5 What about memory operations? ° If instructions are initiated in order and operations always occur in the same stage, there can be no hazards between memory operations! ° What does delaying WB on arithmetic operations cost? – cycles ? – hardware ? ° What about data dependence on loads? R1 <- R4 + R5 R2 <- Mem[ R2 + I ] R3 <- R2 + R1 => Instr is in the ID stage means: the instruction’s control signals are in the IF/ID register. A B op Rd Ra Rb op Rd Ra Rb Rd to reg file R T Rd "Delayed Loads" cs 152 L1 3 .* DAP Fa97, Ó U.CB Load Data Hazard cs 152 L1 3 .33 DAP Fa97, Ó U.CB cs 152 L1 3 .* DAP Fa97, Ó U.CB No Solution by Forwarding cs 152 L1 3 .34 DAP Fa97, Ó U.CB cs 152 L1 3 .34 DAP Fa97, Ó U.CB Since the dependence between the load and the following instr goes backwards in time, this hazard cannot be solved by forwarding. It is necessary to stall the pipeline. Include a Hazard Detection Unit which operates during the ID stage Condition for this load hazard detection: if (ID/EX.MemRead and (( ID/EX RegsRt = IF/ID.RegsRs ) or ID/EX RegsRt = IF/ID.RegsRt ))) stall the pipe one clock cycle If instr is load and destination regs field in EX stage matches either source regs of the instr in the ID stage, then stall. After the stall the forwarding unit handles the dependency to avoid loosing one more cycle. cs 152 L1 3 .* DAP Fa97, Ó U.CB Bubbles Caused by Stalls cs 152 L1 3 .35 DAP Fa97, Ó U.CB cs 152 L1 3 .* DAP Fa97, Ó U.CB Stalling cs 152 L1 3 .36 DAP Fa97, Ó U.CB If the instruction in the ID stage is stalled, then the instr in the IF stage must also be stalled (so we don’t loose the fetched instr). Stalling mechanism: prevent the PC and the IF/ID pipe regs from changing. The instr in the IF stage will continue to be read using the same PC, and the regs in the ID stage will continue to be read using the same instr fields in the IF/ID pipe regs. Like restarting the washing machine with the same clothes while letting the dryer continue working empty. It is also necessary to set the EX, MEM and WB control signals of the ID/ EX pipe regs to 0 (the hazard has already been detected in the ID stage). These values are propagated forward in the pipe and no regs or memory locations are changed. Actually only the signals RegWrite and MemWrite need to be 0. In the example in previous figure, the HW repeats in cycle 4 what it did for instrs and and or in clock 3. The execution time for these instrs is stretched and the fetch of the add instr is delayed. The bubble delays everything behind it and proceeds down the pipe. cs 152 L1 3 .* DAP Fa97, Ó U.CB Hazard Detection and Forwarding Units cs 152 L1 3 .37 DAP Fa97, Ó U.CB The Forwarding Unit controls the ALU multiplexors to replace the value from a general purpose regs with the value from the proper pipe regs. The Hazard Detection Unit controls the writing of the PC and IF/ID regs plus the multiplexor that chooses between the real control values and all 0s. It stalls and deasserts the control fields if the load-use hazard test is true. cs 152 L1 3 .* DAP Fa97, Ó U.CB Cycles 2 and 3 (Single-Cycle Diagrams) cs 152 L1 3 .36 DAP Fa97, Ó U.CB cs 152 L1 3 .* DAP Fa97, Ó U.CB Cycles 4 and 5 cs 152 L1 3 .39 DAP Fa97, Ó U.CB cs 152 L1 3 .* DAP Fa97, Ó U.CB Cycles 6 and 7 cs 152 L1 3 . DAP Fa97, Ó U.CB 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 Data stationary control -------> finite state diagram 7 7 7 7 7 7 8 8 8 8 8 8 So where are in in the overall scheme of things. Well, we just finished designing the processor’s datapath. Now I am going to show you how to design the control for the datapath. +1 = 7 min. (X:47) 9 9 9 9 9 9 10 10 10 10 10 10 11 11 11 11 11 11 12 12 12 12 12 12 The main control here is identical to the one in the single cycle processor. It generate all the control signals necessary for a given instruction during that instruction’s Reg/Decode stage. All these control signals will be saved in the ID/Exec pipeline register at the end of the Reg/Decode cycle. The control signals for the Exec stage (ALUSrc, ... etc.) come from the output of the ID/Exec register. That is they are delayed ONE cycle from the cycle they are generated. The rest of the control signals that are not used during the Exec stage is passed down the pipeline and saved in the Exec/Mem register. The control signals for the Mem stage (MemWr, Branch) come from the output of the Exec/Mem register. That is they are delayed two cycles from the cycle they are generated. Finally, the control signals for the Wr stage (MemtoReg & RegWr) come from the output of the Exec/Wr register: they are delayed three cycles from the cycle they are generated. +2 = 45 min. (Y:45) 13 13 13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 15 16 16 16 16 16 16 17 17 17 17 17 17 18 18 18 18 18 18 19 19 19 19 19 19 20 20 20 20 20 20 21 21 21 21 21 21 22 22 22 22 22 22 23 23 23 23 23 23 24 24 24 24 24 24 25 25 25 25 25 25 26 26 26 26 26 26 27 27 27 27 27 28 28 28 28 28 28 29 29 29 29 29 30 30 30 30 30 30 31 31 31 31 31 31 32 32 32 32 32 32 33 33 33 33 34 34 34 34 35 35 35 35 36 36 36 36 37 37 37 37 38 38 38 38 39 39 39 39 40 40 40 40
Compartilhar