11 - SuperScalar Procs and RAT

ucla | CS M151B | 2024-02-15 16:10


CPU Performance

  • latency - time to complete a single task
  • throughput - number of tasks completed per unit time (a toy calculation follows this list)
    • computation capacity, e.g., FLOPS
    • memory bandwidth, e.g., GB/s
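
A toy calculation (Python; all numbers are hypothetical) showing that overlapping tasks raises throughput without changing per-task latency:

```python
# Hypothetical numbers: per-task latency stays fixed, but overlapping
# several tasks in flight multiplies throughput.
tasks = 1_000_000
latency_s = 2e-9        # 2 ns to finish one task
in_flight = 4           # tasks overlapped at once (e.g., via pipelining)

total_time_s = tasks * latency_s / in_flight   # assumes perfect overlap
throughput = tasks / total_time_s              # tasks per second

print(f"latency   : {latency_s * 1e9:.0f} ns per task")
print(f"throughput: {throughput / 1e9:.0f} billion tasks/s")
```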

      Single cycle effect

  • effectively reduces CPI to 1, but CT (cycle time) increases because one cycle must cover the propagation delay of the entire datapath

    Multi-cycle effect

  • decrease CT to roughly 1/num_stages of the single-cycle CT by inserting latches between stages
  • but CPI increases, since each instruction now takes multiple cycles

    Pipeline effect

  • keeps the small CT of multi-cycle while bringing CPI back toward 1
  • but requires more hardware (latches, forwarding paths) and introduces control hazards
  • adding more stages increases CPI (more hazard stalls) while the CT reduction levels off
  • deeper pipelines also require more forwarding and suffer more from control hazards
  • a rough comparison of all three designs is sketched below
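
A minimal comparison sketch (Python; the stage count, stage delay, and latch overhead are assumptions, not lecture numbers):

```python
# Assumptions: 5 uniform stages of 200 ps each, 20 ps latch overhead,
# no hazard stalls. Time = instruction_count * CPI * CT.
n_instrs = 1_000_000
stages, stage_ps, latch_ps = 5, 200, 20

t_single = n_instrs * 1 * (stages * stage_ps)              # CPI=1, long CT
t_multi  = n_instrs * stages * (stage_ps + latch_ps)       # short CT, CPI=stages
t_pipe   = (n_instrs + stages - 1) * (stage_ps + latch_ps) # CPI->1 once full

for name, t in [("single-cycle", t_single),
                ("multi-cycle", t_multi),
                ("pipelined", t_pipe)]:
    print(f"{name:>12}: {t / 1e6:.1f} us")
```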

    Reducing CPI

  • increase instruction throughput: IPC = 1/CPI
  • parallelize with
    • ILP - instruction level parallelism
    • TLP - thread level parallelism
    • MLP - memory level parallelism

      SuperScalar Processor

  • e.g., extend a scalar pipeline to IPC=2 by fetching 2 instructions simultaneously; then we must change the behavior of each stage

    IF stage

  • still just 1 PC, but we increment it by 8 bytes each cycle (two 4-byte instructions)
  • BUT, we change IM (instruction memory) to accept 2 inputs (PC and PC+4) and produce 2 outputs, O1 and O2
  • i.e., multiple read ports - a minimal fetch loop is sketched below
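
A minimal fetch-loop sketch (Python; instruction memory is word-indexed for simplicity):

```python
# Two-wide fetch: one PC, incremented by 8 bytes (2 words) per cycle;
# the instruction memory is dual-ported, so it returns O1 and O2 at once.
imem = [f"instr_{i}" for i in range(8)]   # indexed in words, not bytes

pc = 0                                     # word address (byte addr / 4)
while pc + 1 < len(imem):
    o1, o2 = imem[pc], imem[pc + 1]        # the two read ports: PC, PC+4
    print(f"cycle: fetched {o1}, {o2}")
    pc += 2                                # +8 bytes per cycle
```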

    ID stage

  • replicate decode hardware
  • have multiple read ports to register file

    EX stage

  • replicate ALU (2 ALUs)
  • or, to decrease cost, partition the set of possible ops between the two ALUs
  • e.g., ALU1 does arithmetic and ALU2 does logic; this also requires steering logic, and we must stall instruction 2 for a cycle whenever it needs the same ALU class as instruction 1 (sketched after this list)
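
A sketch of that steering/stall decision (Python; the op classes are illustrative, not the lecture's exact split):

```python
# ALU1 handles arithmetic, ALU2 handles logic. If both instructions of a
# fetched pair need the same ALU class, the younger one stalls a cycle.
ARITH = {"add", "sub"}
LOGIC = {"and", "or", "xor"}

def issue_pair(op1, op2):
    """Return (ops issued this cycle, ops deferred to the next cycle)."""
    same_class = ({op1, op2} <= ARITH) or ({op1, op2} <= LOGIC)
    return ([op1], [op2]) if same_class else ([op1, op2], [])

print(issue_pair("add", "or"))    # (['add', 'or'], []) - parallel issue
print(issue_pair("add", "sub"))   # (['add'], ['sub']) - structural stall
```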

    MEM stage

  • no replication, just have 1 ALU output that goes to DM
  • if the other ALU output ALSO needs to access memory, we stall for 1 cycle and redirect the 2nd output to the DM on the next cycle
  • such conflicts do occur, but stalling is much faster and cheaper than replicating DM, which would sit idle whenever the pair has at most one memory access (sketched below)
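
A sketch of the single-DM-port policy (Python; the instruction encoding is just illustrative):

```python
# One data-memory port: if both instructions of a pair access memory,
# the younger one is replayed on the next cycle.
def mem_schedule(pairs):
    cycles = []
    for older, younger in pairs:
        both_mem = older[0] in ("lw", "sw") and younger[0] in ("lw", "sw")
        if both_mem:
            cycles.append([older])     # port conflict: 1-cycle stall
            cycles.append([younger])
        else:
            cycles.append([older, younger])
    return cycles

prog = [(("lw", "r1"), ("add", "r2")), (("lw", "r3"), ("sw", "r4"))]
for c in mem_schedule(prog):
    print(c)   # 1 cycle, then 2 cycles for the conflicting pair
```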

    WB stage

  • have multiple write ports to the register file - a multi-ported register file is sketched below
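
A minimal multi-ported register file sketch (Python; the port counts are assumed for a 2-wide machine):

```python
# 4 read ports (2 instructions x 2 sources) and 2 write ports.
# Same-cycle writes are applied in program order, which matters for WAW.
class RegFile:
    def __init__(self, n=32):
        self.regs = [0] * n

    def read(self, *indices):                 # up to 4 read ports
        return tuple(self.regs[i] for i in indices)

    def write_back(self, writes):             # [(reg, value), ...], older first
        for reg, value in writes:
            self.regs[reg] = value

rf = RegFile()
rf.write_back([(5, 10), (6, 20)])             # both instructions write
print(rf.read(5, 6, 5, 6))                    # (10, 20, 10, 20)
```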

    Data Hazards and Forwarding

    WAW

  • write-after-write is an issue now, and the ordering of writes must be enforced
  • we can do this by delaying the younger instruction's write (the "bottom" wire in the datapath) - a tiny demo follows
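
A tiny demo of why that ordering matters (Python, hypothetical values):

```python
# Both instructions of a pair write r5 in the same cycle. Applying the
# writes in program order ("delaying" the younger write) makes the
# younger value the one that survives.
regs = {5: 0}
writes = [(5, 111), (5, 222)]       # (reg, value): older first, younger second

for reg, val in writes:             # correct: program order enforced
    regs[reg] = val
assert regs[5] == 222               # younger instruction's result is visible

for reg, val in reversed(writes):   # wrong order: stale value wins
    regs[reg] = val
assert regs[5] == 111
```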

    RAW

  • RAW is much more of an issue now, and it gets worse as the issue width N increases: e.g., with N=4 instructions read in parallel, each instruction's sources must be checked against the destinations of every older instruction in the group
  • so most superscalar designs stop around N=4, because the dependency-checking hardware grows roughly quadratically with N (counted below)
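
A rough comparator count (Python; 2 source registers per instruction is an assumption):

```python
# Each instruction's sources must be compared against the destination of
# every older instruction in the same issue group, so checks grow ~N^2.
def raw_checks(n, srcs_per_instr=2):
    return sum(srcs_per_instr * older for older in range(n))

for n in (2, 4, 8):
    print(f"N={n}: {raw_checks(n)} comparisons per cycle")
# N=2: 2, N=4: 12, N=8: 56 - quadratic growth in the checking hardware
```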

    Tradeoff

  • higher IPC
  • but higher hardware cost: replication, dependency checks
  • stalls also reduce the achieved IPC
  • but we can mitigate stalls by executing independent instructions out of order
  • the issue is resolving instruction dependencies: in-order parallel issue rarely keeps all N pipelines of a superscalar busy, so instead we try out-of-order execution

    Out of Order Exec (OOE)

    To eliminate WAR and WAW dependencies

  • go through instructions in order; within a group of N parallel instructions, if one is not independent, skip it and choose the next ready instruction instead (each time an instruction executes, it is "removed from our list") - a toy scheduler is sketched after this list
  • problems: finding available instrs, tracking dependencies, pipeline hazards, preserving memory ordering
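
A toy scheduler sketch (Python) of the skip-and-pick-next idea; dependencies are listed explicitly, and each result is assumed ready the cycle after it executes:

```python
# Each instruction is (name, deps): deps are the older instructions whose
# results it needs. Each cycle, issue up to N ready instructions and skip
# the rest; executed instructions are removed from the list.
def schedule(instrs, n=2):
    done, remaining, cycles = set(), list(instrs), []
    while remaining:
        ready = [ins for ins in remaining if set(ins[1]) <= done][:n]
        cycles.append([name for name, _ in ready])
        for ins in ready:
            remaining.remove(ins)
        done.update(name for name, _ in ready)
    return cycles

prog = [("I1", []), ("I2", ["I1"]), ("I3", []), ("I4", ["I2", "I3"])]
print(schedule(prog))   # [['I1', 'I3'], ['I2'], ['I4']] - I3 fills I2's slot
```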

    Static Scheduling

  • the compiler internally tracks dependencies
  • have the compiler be clever and reorder instructions when it observes they are independent (illustrated below)
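
A toy illustration (Python; the 1-cycle load-use stall model and the instruction strings are assumptions):

```python
# If a load's consumer is the very next instruction, it costs a 1-cycle
# stall; the compiler hides it by hoisting an independent instruction
# between producer and consumer.
def total_cycles(order, producer, consumer):
    i, j = order.index(producer), order.index(consumer)
    stall = 1 if j == i + 1 else 0          # back-to-back load-use
    return len(order) + stall               # 1 cycle/instr + stalls

before = ["lw r1", "add r2,r1", "or r3"]    # consumer right after the load
after  = ["lw r1", "or r3", "add r2,r1"]    # compiler reordered: same result
print(total_cycles(before, "lw r1", "add r2,r1"))   # 4
print(total_cycles(after,  "lw r1", "add r2,r1"))   # 3
```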

    Dynamic scheduling

  • now we observe that having many more registers can help remove false dependencies, but reimplementing the register file is too costly and we don't want to change the ISA
  • instead we use internal registers called "physical registers" (more free regs than the ARF provides)
  • and call the ISA regs "logical or architected registers"
  • and we map logical regs to physical regs
  • then we need a mapping algo and physical registers to allocate/deallocate
  • the following example (instruction sequence I1-I5 from lecture, not reproduced here) is missing 2 WARs: between I1->I2 and I4->I5, but assume they are not there:
  • then we can decouple I2 and I3 by mapping R5=S in I3 and propagating it downward - this removes a WAR dependency
  • do the same for WAW by mapping to T
    Mapping Algo

  • for each instr, every time we write to a reg, rename the destination
  • then keep track of the renamings and rewrite any later reads of that register in future instructions

    Physical structure

  • we do this using a RAT (register alias table) that maps each logical (ARF) register to a physical (PRF) register - see the sketch at the end of this section

    Process

  • setup
  • dependency observation
  • rename dest regs and propagate
  • repeat until the program ends; if physical regs are full, stall until some are freed
  • during operation, recycle a physical reg after the last instruction that requires it - we know when this is safe once the renamed logical register is used as a destination reg again
  • once this happens, we can reflect the change to the ARF to make the result architecturally "true/visible"
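
A minimal renaming sketch (Python) tying this section together: a RAT maps logical to physical regs, each destination write gets a fresh physical reg, and the old mapping is recycled on redefinition. The register counts are assumptions, and a real core also waits for older readers before recycling:

```python
# RAT-based register renaming: logical regs R0..R7, physical regs P0..P15.
class Renamer:
    def __init__(self, n_logical=8, n_physical=16):
        self.rat = {f"R{i}": f"P{i}" for i in range(n_logical)}
        self.free = [f"P{i}" for i in range(n_logical, n_physical)]

    def rename(self, dest, src1, src2):
        # sources read the CURRENT mapping - this "rewrites" later reads
        s1, s2 = self.rat[src1], self.rat[src2]
        # redefinition of dest: its old physical reg can be recycled
        # (simplified; a real core also checks that no older reader is
        # still pending, and stalls here if the free list is empty)
        self.free.append(self.rat[dest])
        self.rat[dest] = self.free.pop(0)
        return self.rat[dest], s1, s2

r = Renamer()
print(r.rename("R5", "R1", "R2"))   # ('P8', 'P1', 'P2')
print(r.rename("R5", "R5", "R3"))   # ('P9', 'P8', 'P3'): RAW kept, WAW gone
```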