Baixe o app para aproveitar ainda mais
Esta é uma pré-visualização de arquivo. Entre para ver o arquivo original
Curso de Tecnologia de Computadores Professora: Roseli S. Wedemann roseli@ime.uerj.br salas D6019 Text: Computer Organization & Design, the Hardware Software Interface, D. A. Patterson and J. L. Hennessy, 4th edition, Morgan Kaufman Publishers Dave Patterson (http.cs.berkeley.edu/~patterson) lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/ cs 152 L1 3 .1 DAP Fa97, Ó U.CB Sobre o Curso 2 cs 152 L1 3 .2 DAP Fa97, Ó U.CB Critério de Aprovação: duas provas + prova final para passar sem prova final, média >= 7 para passar com prova final, média >= 5 direito à 2a. chamada para ir para a prova final, média >= 4 Nota = [ (P1 + P2) / 2 + PF ] / 2 Conteúdo Capítulo 6: Enhancing Performance with Pipelining Capítulo 9: Multiprocessors Recap: Microprogramming Specialize state-diagrams easily captured by microsequencer simple increment & “branch” fields datapath control fields Control design reduces to Microprogramming Microprogramming is a fundamental concept implement an instruction set by building a very simple processor and interpreting the instructions essential for very complex instructions and when few register transfers are possible overkill when ISA matches datapath 1-1 3 cs 152 L1 3 .3 DAP Fa97, Ó U.CB Summary: Microprogramming one inspiration for RISC 4 cs 152 L1 3 .4 DAP Fa97, Ó U.CB If simple instruction could execute at very high clock rate… If you could even write compilers to produce microinstructions… If most programs use simple instructions and addressing modes… If microcode is kept in RAM instead of ROM so as to fix bugs … If same memory used for control memory could be used instead as cache for “macroinstructions”… Then why not skip instruction interpretation by a microprogram and simply compile directly into lowest language of machine? (microprogramming is overkill when ISA matches datapath 1-1) Recap: Exceptions and Interrupts 5 cs 152 L1 3 .5 DAP Fa97, Ó U.CB Exceptions are the hard part of control Need to find convenient place to detect exceptions and to branch to state or microinstruction that saves PC and invokes the operating system As we get pipelined, CPUs that support page faults on memory accesses, which means that the instruction cannot complete AND you must be able to restart the program at exactly the instruction with the exception, it gets even harder Definitions of control variables 6 cs 152 L1 3 .6 DAP Fa97, Ó U.CB EPC: Exception Program Counter (stores return address) exp_addr: address of exception handler routine cause: reg used for storing the cause of exception rd: reg destination operand in instruction rt: second reg source operand in instruction SX: constant specified in instruction ZX: zero extended immediate Ori: S <= reg or ZX, S is a reg (see pg A-57) (see chapter 3 and section 5.6) Modification to the Control Specification 7 IR <= MEM[PC] PC <= PC + 4 R-type S <= A fun B R[rd] <= S S <= A op ZX R[rt] <= S ORi S <= A + SX R[rt] <= M M <= MEM[S] LW S <= A + SX MEM[S] <= B SW “instruction fetch” “decode” Execute Memory Write-back 0000 0001 0100 0101 0110 0111 1000 1001 1010 1011 1100 BEQ 0010 If A = B then PC <= S S<= PC +SX EPC <= PC - 4 PC <= exp_addr cause <= 10 (RI) other undefined instruction S <= A - B overflow EPC <= PC - 4 PC <= exp_addr cause <= 12 (Ovf) The Big Picture: Where are We Now? The Five Classic Components of a Computer Today’s Topics: Pipelining by Analogy 8 Control Datapath Memory Processor Input Output 6.1 An overview of Pipelining. It’s Natural! cs 152 L1 3 .9 DAP Fa97, Ó U.CB 9 Example: Stages of Laundry Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer takes 30 minutes “Folder” takes 30 minutes “Storer” takes 30 minutes to put clothes into drawers A B C D Sequential Laundry cs 152 L1 3 .10 DAP Fa97, Ó U.CB 10 Sequential laundry takes 8 hours for 4 loads If they learned pipelining, how long would laundry take? 30 T a s k O r d e r Time 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 6 PM 7 8 9 10 11 12 1 2 AM Pipelined Laundry: Start work ASAP cs 152 L1 3 .11 DAP Fa97, Ó U.CB 11 Pipelined laundry takes 3.5 hours for 4 loads! T a s k O r d e r 12 2 AM 6 PM 7 8 9 10 11 1 Time Pipelining Lessons cs 152 L1 3 .12 DAP Fa97, Ó U.CB 12 Pipelining doesn’t help latency of single task, it helps throughput of entire workload Multiple tasks operating simultaneously using different resources Potential speedup = Number pipe stages Pipeline rate limited by slowest pipeline stage Stages should take about same amount of time Unbalanced lengths of pipe stages reduces speedup Time to “fill” pipeline and time to “drain” it reduces speedup Stall for Dependences 6 PM 7 8 9 Time T a s k O r d e r The Five Stages of MIPS Instructions, ex. Load 13 cs 152 L1 3 .13 DAP Fa97, Ó U.CB Ifetch: Instruction Fetch Fetch the instruction from the Instruction Memory Reg/Dec: Registers Fetch and Instruction Decode Exec: Calculate the memory address Mem: Read the data from the Data Memory Wr: Write the data back to the register file base addressing Ex: lw $1, 100 ($0) $1 = Memory[$0 + 100] Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Load We will limit our attention to 8 instrs 14 cs 152 L1 3 .14 DAP Fa97, Ó U.CB lw: load word (8ns) instruction fetch: 2ns sw: store word (7ns) register read or write: 1ns add: add (6ns) ALU operation: 2ns sub: subtract (6ns) memory acces: 2ns and: and (6ns) or: or (6ns) slt: set-less-than (compares 2 regs and sets a third to 1, if the first is less than the second, pg. 128) (6ns) beq: branch-if-equal (branch to label if 2 regs are equal, pg 123) (5ns) Ex: beq $3, $4, L if $3 = $4 go to L see Chapter 3 for descriptions of instructions Pipelining 15 cs 152 L1 3 .15 DAP Fa97, Ó U.CB Improve perfomance by increasing instruction throughput figure 6.3 Ideal speedup is number of stages in the pipeline. Do we achieve this? Performance Metrics 16 cs 152 L1 3 .16 DAP Fa97, Ó U.CB In the single-cycle design model (every instruction takes 1 clock cycle), the clock cycle must be stretched to accommodate the slowest instruction (Load is slowest and takes 8ns in MIPS). In the pipelined design, the pipeline stages take a single clock cycle. The clock cycle must be long enough to accommodate the slowest operation (2ns). Assume write to reg file occurs in first half of clock cycle and read from reg file occurs in second half of cycle. Speedup = execution time without pipeline / execution time with pipeline = ideally to number of stages. (definition in pg. 101) Performance Improvement 17 cs 152 L1 3 .17 DAP Fa97, Ó U.CB pipe with 5 balanced stages: ideal speedup = 5, 8ns / 5 = 1.6ns clock cycle stages may be imperfectly balanced and pipeline involves some overhead, so time per instruction (CPI) will exceed minimum possible and speedup will be less than number of stages. In ex., for 3 instrs: speedup = 24ns / 14ns = 1.7 If we add 1000 lw instrs (1 instr per 2ns): speedup = 8024ns / 2014ns = 3.98 (almost 4) Pipelining increases instruction throughput, as opposed to decreasing the execution time of an individual instruction. Instruction throughput is the important metric! Basic Idea 18 cs 152 L1 3 .18 DAP Fa97, Ó U.CB fig 6.10 (pg. 450) What do we need to add to actually split the datapath into stages? Instruction sets for piplines 19 cs 152 L1 3 .19 DAP Fa97, Ó U.CB 1. MIPS instrs are all the same length. Easier to fetch instr in 1st stage and decode in 2nd stage. 2. MIPS has only a few instr formats, with the source reg fields in the same place. 2nd stage can start reading the reg file while HW is determining what type of instr was fetched. Reg read + decode in one stage. 3. Memory operands only appear in loads and stores in MIPS. Use the execute stage to calculate mem address and access mem in the next stage. Otherwise stages 3 and 4 would break into 3 stages: address, access, execute. 4. Operands must be aligned in memory (chapt 3, pg. 112). 1 data transfer takes 1 data mem access (1 pipeline stage). Graphically Representing Pipelines 20 cs 152 L1 3 .20 DAP Fa97, Ó U.CB Can help with answering questions like: how many cycles does it take to execute this code? what is the ALU doing during cycle 4? use this representation to help understand datapaths Conventional Pipelined Execution Representation 21 Program Flow Time cs 152 L1 3 .21 DAP Fa97, Ó U.CB Single Cycle, Multiple Cycle, vs. Pipeline 22 cs 152 L1 3 .22 DAP Fa97, Ó U.CB Clk Cycle 1 Multiple Cycle Implementation: Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10 Load Load Store Pipeline Implementation: Store Clk Single Cycle Implementation: Load Store Waste R-type R-type Cycle 1 Cycle 2 Why Pipeline? 23 cs 152 L1 3 .23 DAP Fa97, Ó U.CB Suppose we execute 100 instructions Single Cycle Machine 45 ns/cycle x 1 CPI x 100 inst = 4500 ns Multicycle Machine 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns Ideal pipelined machine 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns speedup = 4500ns / 1040ns = 4.33 Why Pipeline? Because the resources are there! 24 cs 152 L1 3 .24 DAP Fa97, Ó U.CB I n s t r. O r d e r Time (clock cycles) Inst 0 Inst 1 Inst 2 Inst 4 Inst 3 Can pipelining get us into trouble? 25 cs 152 L1 3 .25 DAP Fa97, Ó U.CB Yes: Pipeline Hazards (next instr can’t execute in next cycle) structural hazards: attempt to use the same resource two different ways at the same time E.g., combined washer/dryer would be a structural hazard or folder busy doing something else (watching TV) data hazards: attempt to use item before it is ready E.g., one sock of pair in dryer and one in washer; can’t fold until get sock from washer through dryer instruction depends on result of prior instruction still in the pipeline control hazards: attempt to make a decision before condition is evaluated E.g., washing football uniforms and need to get proper detergent level; need to see after dryer before next load in branch instructions Can always resolve hazards by waiting pipeline control must detect the hazard take action (or delay action) to resolve hazards Single Memory is a Structural Hazard 26 cs 152 L1 3 .26 DAP Fa97, Ó U.CB Mem I n s t r. O r d e r Time (clock cycles) Load Instr 1 Instr 2 Instr 3 Instr 4 Reg Mem Reg Reg Mem Reg Detection is easy in this case! (right half highlight means read, left half write) Structural Hazards limit performance 27 cs 152 L1 3 .27 DAP Fa97, Ó U.CB Example: if 1.3 memory accesses per instruction and only one memory access per cycle, then 30% of instrs are loads average CPI = 0.7*1 + 0.3*2 = 1.3 cycles / instr (30% of the time stall 1 cycle) otherwise resource is more than 100% utilized Control Hazard Solutions 28 cs 152 L1 3 .28 DAP Fa97, Ó U.CB Stall (Bubble): wait until decision is clear It’s possible to move up decision to 2nd stage by adding hardware to check registers as being read, calculate branch address and update PC Impact: 2 clock cycles per branch instruction => slow I n s t r. O r d e r Time (clock cycles) Add Beq Load Reg Mem Reg Reg Mem Reg Reg Mem Reg Control Hazards Limit Performance 29 cs 152 L1 3 .29 DAP Fa97, Ó U.CB Estimate average CPI for stalling on branches. Suppose 17% of the instrs executed are branches (gcc execution benchmark). average CPI = 0.83*1 + 0.17*2 = 1.17 cycles / instr Control Hazard Solutions 30 cs 152 L1 3 .30 DAP Fa97, Ó U.CB Predict: guess one direction then back up if wrong Predict branch not taken is a common approach Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right 50% of time) I n s t r. O r d e r Time (clock cycles) Add Beq Load Reg Mem Reg Reg Mem Reg Mem Reg Mem Reg Dynamic Scheme 31 cs 152 L1 3 .31 DAP Fa97, Ó U.CB Dynamic HW predictors make guesses depending on the behavior of each branch and may change predictions for a branch over the life of a program. Ex: Keep history of each branch as taken or not taken and use the past to predict the future: Such HW has 90% accuracy. When the guess is wrong, the pipeline control must ensure that the instrs following the wrongly guessed branch have no effect, and must restart the pipe from the proper branch address. Longer pipes raise cost of mispredictions as is usual with all solutions to control hazards. Control Hazard Solutions 32 cs 152 L1 3 .32 DAP Fa97, Ó U.CB Redefine branch behavior (takes place after next instruction) “delayed branch”. Place a “safe”instr after the branch instr. The compiler rearranges instrs when possible (doesn’t change program results). Impact: 0 clock cycles per branch instruction if can find instruction to put in “slot” ( 50% of time) Longer pipes, more slots, harder to fill. I n s t r. O r d e r Time (clock cycles) Add Beq Misc Reg Mem Reg Reg Mem Reg Mem Reg Mem Reg Load Mem Reg Mem Reg Data Hazard on r1 33 cs 152 L1 3 .33 DAP Fa97, Ó U.CB An instr depends on the results of a previous instr still in the pipe. Too many of these dependencies to be solved by a compiler successfuly. add r1 ,r2,r3 sub r4, r1 ,r3 and r6, r1 ,r7 or r8, r1 ,r9 xor r10, r1 ,r11 Data Hazard on r1: 34 cs 152 L1 3 .34 DAP Fa97, Ó U.CB Dependencies backwards in time are hazards I n s t r. O r d e r Time (clock cycles) add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11 IF ID/RF EX MEM WB ALU Im Reg Dm Reg Reg Dm Reg Reg Dm Reg Im ALU Reg Dm Reg Data Hazard Solution: 35 cs 152 L1 3 .35 DAP Fa97, Ó U.CB “Forward” result from one stage to another “or” OK if define read/write properly I n s t r. O r d e r Time (clock cycles) add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11 IF ID/RF EX MEM WB ALU Im Reg Dm Reg Reg Dm Reg Reg Dm Reg Im ALU Reg Dm Reg Forwarding (or Bypassing): What about Loads 36 cs 152 L1 3 .36 DAP Fa97, Ó U.CB Dependencies backwards in time are hazards (shading indicates when operand is used by the instr) Can’t solve with forwarding: Must delay/stall instruction dependent on loads Study example of reordering of code to avoid stalls caused by data hazards (pg 447). Time (clock cycles) lw r1,0(r2) sub r4,r1,r3 IF ID/RF EX MEM WB ALU Im Reg Dm Reg Reg Dm Reg 6.2 Designing a Pipelined Processor 37 cs 152 L1 3 .37 DAP Fa97, Ó U.CB Go back and examine your datapath and control diagram associated resources with states ensure that flows do not conflict, or figure out how to resolve assert control in appropriate stage To retain the value of an individual instr for its other four stages, the value output by one stage must be saved in a regs. We must place regs where there are dividing lines between the stages. These are called pipeline regs. Pipelined Datapath (as in book); hard to read 38 cs 152 L1 3 .38 DAP Fa97, Ó U.CB IF ID EX MEM WB Fig 6.12 Data flowing from right to left does not affect the current instr, only later instrs in the pipe Pipelined Processor (almost) for slides 39 cs 152 L1 3 .39 DAP Fa97, Ó U.CB What happens if we start a new instruction every cycle? Exec Reg. File Mem Access Data Mem Reg File Equal PC Next PC IR Inst. Mem Valid IRex Dcd Ctrl IRmem Ex Ctrl IRwb Mem Ctrl WB Ctrl Control and Datapath 40 cs 152 L1 3 .40 DAP Fa97, Ó U.CB IR <- Mem[PC]; PC <– PC+4; A <- R[rs]; B<– R[rt] S <– A + B; R[rd] <– S; S <– A + SX; M <– Mem[S] R[rd] <– M; S <– A or ZX; R[rt] <– S; S <– A + SX; Mem[S] <- B If Cond PC < PC+SX; Exec Reg. File Mem Access Data Mem Reg File Equal PC Next PC IR Inst. Mem Pipelining the Load Instruction 41 cs 152 L1 3.41 DAP Fa97, Ó U.CB The five independent functional units in the pipeline datapath are: Instruction Memory for the Ifetch stage Register File’s Read ports (bus A and busB) for the Reg/Dec stage ALU for the Exec stage Data Memory for the Mem stage Register File’s Write port (bus W) for the Wr stage Clock Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 2nd lw 3rd lw The Four Stages of R-type 42 cs 152 L1 3 .42 DAP Fa97, Ó U.CB Ifetch: Instruction Fetch Fetch the instruction from the Instruction Memory Reg/Dec: Registers Fetch and Instruction Decode Exec: ALU operates on the two register operands Update PC Wr: Write the ALU output back to the register file Cycle 1 Cycle 2 Cycle 3 Cycle 4 R-type Pipelining the R-type and Load Instruction 43 cs 152 L1 3 .43 DAP Fa97, Ó U.CB We have pipeline conflict or structural hazard: Two instructions try to write to the register file at the same time! Only one write port Clock Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 R-type R-type Load R-type R-type Ops! We have a problem! Important Observation 44 cs 152 L1 3 .44 DAP Fa97, Ó U.CB Each functional unit can only be used once per instruction Each functional unit must be used at the same stage for all instructions: Load uses Register File’s Write Port during its 5th stage R-type uses Register File’s Write Port during its 4th stage 2 ways to solve this pipeline hazard. Solution 1: Insert “Bubble” into the Pipeline 45 cs 152 L1 3 .45 DAP Fa97, Ó U.CB Insert a “bubble” into the pipeline to prevent 2 writes at the same cycle The control logic can be complex. Lose instruction fetch and issue opportunity. No instruction is started in Cycle 6! Clock Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 R-type Load R-type R-type Pipeline Bubble Solution 2: Delay R-type’s Write by One Cycle 46 cs 152 L1 3 .46 DAP Fa97, Ó U.CB Delay R-type’s register write by one cycle: Now R-type instructions also use Reg File’s write port at Stage 5 Mem stage is a NOOP stage: nothing is being done. Clock Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 R-type R-type Load R-type R-type R-type Mem 1 2 3 4 5 Modified Control & Datapath 47 cs 152 L1 3 .47 DAP Fa97, Ó U.CB IR <- Mem[PC]; PC <– PC+4; A <- R[rs]; B<– R[rt] S <– A + B; R[rd] <– M; S <– A + SX; M <– Mem[S] R[rd] <– M; S <– A or ZX; R[rt] <– M; S <– A + SX; Mem[S] <- B if Cond PC < PC+SX; M <– S Exec Reg. File Mem Access Data Mem S Reg File Equal PC Next PC IR Inst. Mem M <– S The Four Stages of Store 48 cs 152 L1 3 .48 DAP Fa97, Ó U.CB Ifetch: Instruction Fetch Fetch the instruction from the Instruction Memory Reg/Dec: Registers Fetch and Instruction Decode Exec: Calculate the memory address Mem: Write the data into the Data Memory Cycle 1 Cycle 2 Cycle 3 Cycle 4 Store Wr The Three Stages of Beq 49 cs 152 L1 3 .49 DAP Fa97, Ó U.CB Ifetch: Instruction Fetch Fetch the instruction from the Instruction Memory Reg/Dec: Registers Fetch and Instruction Decode Exec: compares the two register operand, select correct branch target address latch into PC Cycle 1 Cycle 2 Cycle 3 Cycle 4 Beq Wr Control Diagram 50 IR <- Mem[PC]; PC < PC+4; A <- R[rs]; B<– R[rt] S <– A + B; R[rd] <– S; S <– A + SX; M <– Mem[S] R[rd] <– M; S <– A or ZX; R[rt] <– S; S <– A + SX; Mem[S] <- B If Cond PC < PC+SX; Exec Reg. File Mem Access Data Mem Reg File Equal PC Next PC IR Inst. Mem M <– S M <– S cs 152 L1 3 .50 DAP Fa97, Ó U.CB Datapath + Data Stationary Control 51 cs 152 L1 3 .51 DAP Fa97, Ó U.CB Exec Reg. File Mem Access Data Mem Reg File Next PC IR Inst. Mem Decode Mem Ctrl WB Ctrl rs rt op rs rt fun im ex me wb rw v me wb rw v wb rw v Let’s Try it Out 52 cs 152 L1 3 .52 DAP Fa97, Ó U.CB 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 these addresses are octal Start: Fetch 10 53 cs 152 L1 3 .53 DAP Fa97, Ó U.CB Exec Reg. File Mem Access Data Mem Reg File PC Next PC IR Inst. Mem Decode Mem Ctrl WB Ctrl rs rt im 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 10 Fetch 14, Decode 10 54 cs 152 L1 3 .54 DAP Fa97, Ó U.CB Exec Reg. File Mem Access Data Mem Reg File PC Next PC IR Inst. Mem Decode Mem Ctrl WB Ctrl 2 rt im 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 14 lw r1, r2(35) Fetch 20, Decode 14, Exec 10 55 cs 152 L1 3 .55 DAP Fa97, Ó U.CB Exec Reg. File Mem Access Data Mem r2 Reg File PC Next PC IR Inst. Mem Decode Mem Ctrl WB Ctrl 2 rt 35 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 20 lw r1 addI r2, r2, 3 Fetch 24, Decode 20, Exec 14, Mem 10 56 cs 152 L1 3 .56 DAP Fa97, Ó U.CB Exec Reg. File Mem Access Data Mem r2 r2+35 Reg File PC Next PC IR Inst. Mem Decode Mem Ctrl WB Ctrl 4 5 3 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 24 lw r1 sub r3, r4, r5 addI r2, r2, 3 Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10 57 cs 152 L1 3 .57 DAP Fa97, Ó U.CB Exec Reg. File Mem Access Data Mem r4 r2+3 Reg File PC Next PC IR Inst. Mem Decode Mem Ctrl WB Ctrl M[r2+35] 6 7 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 30 lw r1 beq r6, r7 100 addI r2 sub r3 Fetch 34, Dcd 30, Ex 24, Mem 20, WB 14 58 cs 152 L1 3 58. DAP Fa97, Ó U.CB Exec Reg. File Mem Access Data Mem r6 r2+3 Reg File PC Next PC IR Inst. Mem Decode Mem Ctrl WB Ctrl r1=M[r2+35] 9 xx 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 34 beq addI r2 sub r3 r4-r5 100 ori r8, r9 17 Fetch 100, Dcd 34, Ex 30, Mem 24, WB 20 59 cs 152 L1 3 .59 DAP Fa97, Ó U.CB Exec Reg. File Mem Access Data Mem r9 Reg File PC Next PC IR Inst. Mem Decode Mem Ctrl WB Ctrl r1=M[r2+35] 11 12 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 100 beq r2 = r2+3 sub r3 r4-r5 17 ori r8 xxx add r10, r11, r12 ooops, we should have only one delayed instruction Fetch 104, Dcd 100, Ex 34, Mem 30, WB 24 60 cs 152 L1 3 .60 DAP Fa97, Ó U.CB Exec Reg. File Mem Access Data Mem r11 Reg File PC Next PC IR Inst. Mem Decode Mem Ctrl WB Ctrl r1=M[r2+35] 14 15 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 104 beq r2 = r2+3 r3 = r4-r5 xx ori r8 xxx add r10 and r13, r14, r15 n Squash the extra instruction in the branch shadow! r9 | 17 Fetch 108, Dcd 104, Ex 100, Mem 34, WB 30 61 cs 152 L1 3 .61 DAP Fa97, Ó U.CB Exec Reg. File Mem Access Data Mem r14 Reg File PC Next PC IR Inst. Mem Decode Mem Ctrl WB Ctrl r1=M[r2+35] 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 110 r2 = r2+3 r3 = r4-r5 xx ori r8 add r10 and r13 n Squash the extra instruction in the branch shadow! r9 | 17 r11+r12 Fetch 114, Dcd 110, Ex 104, Mem 100, WB 34 62 cs 152 L1 3 .62 DAP Fa97, Ó U.CB Exec Reg. File Mem Access Data Mem Reg File PC Next PC IR Inst. Mem Decode Mem Ctrl WB Ctrl r1=M[r2+35] 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 114 r2 = r2+3 r3 = r4-r5 r8 = r9 | 17 add r10 and r13 n Squash the extra instruction in the branch shadow! r11+r12 NO WB NO Ovflow r14 & R15 Summary: Pipelining 63 cs 152 L1 3 .63 DAP Fa97, Ó U.CB What makes it easy all instructions are the same length just a few instruction formats memory operands appear only in loads and stores What makes it hard? structural hazards: suppose we had only one memory control hazards: need to worry about branch instructions data hazards: an instruction depends on a previous instruction We’ll build a simple pipeline and look at these issues We’ll talk about modern processors and what really makes it hard: exception handling trying to improve performance with out-of-order execution, etc. Summary 64 cs 152 L1 3 .64 DAP Fa97, Ó U.CB Pipelining is a fundamental concept multiple steps using distinct resources Utilize capabilities of the Datapath by pipelined instruction processing start next instruction while working on the current one limited by length of longest stage (plus fill/flush) detect and resolve hazards 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6 6 6 7 7 7 7 7 8 7 7 7 6 6 6 8 8 8 8 8 10 8 8 8 7 7 7 So where are in in the overall scheme of things. Well, we just finished designing the processor’s datapath. Now I am going to show you how to design the control for the datapath. +1 = 7 min. (X:47) 9 9 9 9 9 10 10 10 10 10 11 11 11 11 11 12 12 12 12 12 13 13 13 13 13 14 14 14 14 14 15 15 15 15 15 16 16 16 16 16 17 17 17 17 17 18 18 18 18 18 19 19 19 19 19 20 20 20 20 20 21 21 21 21 23 36 21 21 21 19 16 16 22 22 22 22 24 23 23 23 23 26 24 24 24 24 28 25 25 25 25 30 26 26 26 26 32 27 27 27 27 34 28 28 28 28 36 29 29 29 29 38 30 30 30 30 40 31 31 31 31 42 32 32 32 33 44 33 33 33 35 46 34 34 34 37 48 35 35 35 39 50 36 36 36 41 52 37 37 38 43 54 38 38 40 45 56 39 39 42 47 58 40 40 44 49 60 41 41 46 51 62 42 42 48 53 64 43 43 50 55 66 44 44 52 57 68 45 45 54 59 70 46 46 56 61 72 47 47 58 63 74 48 48 60 65 76 49 49 62 67 78 50 50 65 70 81 94 50 50 50 47 44 44 51 51 66 71 82 52 53 68 73 84 53 55 70 75 86 54 57 72 77 88 55 59 74 79 90 56 61 76 81 92 57 63 78 83 94 58 65 80 85 96 59 67 82 87 98 60 69 84 89 100 61 71 86 91 102 62 73 88 93 104 63 75 90 95 106 64 77 92 97 108
Compartilhar