ENEE446---Lectures-11/7/07

A. Yavuz Oruç, Professor, UMD, College Park

Copyright © 2007 A. Yavuz Oruç. All rights reserved.

Instruction Scheduling in Multi Function Pipelines

(1) Static Instruction Scheduling: In-Order (Hardware) or Compiler-Driven Out-of-Order Execution (Software)

(2) Dynamic Scheduling (Real-time Scheduling-Hardware).

Static Scheduling-In-order execution:

Fetch, decode and execute steps of instructions and operand fetch and store steps are overlapped or pipelined in the order they are written in the program.

Example: MIPS Pipeline

IF ID EX ME WB              1st Instruction
   IF ID EX ME WB           2nd Instruction
      IF ID EX ME WB        3rd Instruction

Static Scheduling-Out-of-order execution

The latencies apply when there is a data dependence between the instructions.

Previous Instruction     Next Instruction         Execution Time   Latency (Stall)
FP ALU Instruction       FP ALU Instruction       4                3
FP ALU Instruction       ST double Instruction    4                2
LD double Instruction    FP ALU Instruction       1                1
LD double Instruction    SD double Instruction    1                0
Integer Instruction      Integer Instruction      1                0

Example: Dependency Stalls

Based on the latencies specified above

LD        IF ID EX DM WB
FP ALU       IF ID stall FP1 FP2

Out-of-order execution may reduce the number of stalls:

Assembly Code:

Loop:
//Load a double precision number from memory location [R1] into F0
      L.D F0,0(R1) ; F0=vector element
//Add the double precision numbers in F0 and F2 and store the result in F4.
      ADD.D F4,F0,F2 ;
//Store the double precision number in F4 into memory location [R1]
      S.D 0(R1),F4 ;
//Decrement R1 by 8.
      DSUBUI R1,R1,8 ;
//Branch back to the loop if R1 is not zero
      BNEZ R1,Loop ;branch if R1!=zero
//No operation
      NOP ;

With Stalls

Loop:
1  L.D F0,0(R1) ;
2  Stall; (Because ADD.D needs F0)
3  ADD.D F4,F0,F2 ;
4  Stall; (Because S.D must wait on ADD.D for F4)
5  Stall;
6  S.D 0(R1),F4 ;
7  DSUBUI R1,R1,8 ;
8  Stall; (Because BNEZ must wait on DSUBUI for R1)
9  BNEZ R1,Loop ;
10 Stall; Delayed branch slot (To determine the branch direction)

Total pipeline time per iteration: 10 clock cycles.

Reorder to Reduce the Stalls (Compiler generated)

1 Loop: L.D F0,0(R1)
2 Stall (ADD.D needs F0)
3 ADD.D F4,F0,F2;
4 DSUBUI R1,R1,8;
5 Stall; (BNEZ must wait on DSUBUI for R1)
6 BNEZ R1,Loop; delayed branch (the instruction following it will always be executed!)
7 S.D 8(R1),F4 ;store address is adjusted since R1 is changed.

7 clock cycles (4 for loop overhead + 3 for useful computation).
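The two cycle counts above can be reproduced with a small model. The following is a minimal Python sketch, not part of the lecture: it assumes in-order, single-issue execution, takes the stall values from the latency table, and adds one assumed entry, ("int", "branch") = 1, read off the stall the slides place between DSUBUI and BNEZ. The names `LATENCY` and `issue_cycles` and the instruction tuples are hypothetical.

```python
# Stall cycles between a producer and a dependent consumer, from the latency table.
# ("int", "branch") = 1 is an assumption inferred from the stall the slides place
# between DSUBUI and BNEZ; it is not a row of the table.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
    ("int", "branch"): 1,
}


def issue_cycles(program):
    """Return the 1-based issue cycle of each instruction.

    program is a list of (text, kind, dest, sources) tuples, issued in order,
    one instruction per cycle unless a data dependence forces stalls.
    """
    last_writer = {}  # register -> (issue cycle, kind) of its latest producer
    cycles = []
    cycle = 0
    for _text, kind, dest, sources in program:
        earliest = cycle + 1
        for reg in sources:
            if reg in last_writer:
                produced_at, producer_kind = last_writer[reg]
                stall = LATENCY.get((producer_kind, kind), 0)
                earliest = max(earliest, produced_at + 1 + stall)
        cycle = earliest
        cycles.append(cycle)
        if dest is not None:
            last_writer[dest] = (cycle, kind)
    return cycles


original = [
    ("L.D F0,0(R1)", "load", "F0", ["R1"]),
    ("ADD.D F4,F0,F2", "fp_alu", "F4", ["F0", "F2"]),
    ("S.D 0(R1),F4", "store", None, ["F4", "R1"]),
    ("DSUBUI R1,R1,8", "int", "R1", ["R1"]),
    ("BNEZ R1,Loop", "branch", None, ["R1"]),
]
reordered = [original[0], original[1], original[3], original[4],
             ("S.D 8(R1),F4", "store", None, ["F4", "R1"])]

# The original order leaves the branch delay slot empty, which costs one more cycle.
original_total = issue_cycles(original)[-1] + 1
# In the reordered loop the S.D fills the delay slot, so its issue cycle is the total.
reordered_total = issue_cycles(reordered)[-1]
print(original_total, reordered_total)  # prints: 10 7
```

The model only tracks the most recent writer of each register, which is enough for this loop body; it does not model structural hazards or memory dependences.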

Can the number of clock cycles be reduced further?

Loop Unrolling Technique to reduce stalls:

Loop:
1  L.D F0,0(R1)          Stall;
2  ADD.D F4,F0,F2        Stall; Stall;
3  S.D 0(R1),F4 ;        //Drop DSUBUI & BNEZ
4  L.D F6,-8(R1)         Stall;
5  ADD.D F8,F6,F2        Stall; Stall;
6  S.D -8(R1),F8 ;       //Drop DSUBUI & BNEZ
7  L.D F10,-16(R1)       Stall;
8  ADD.D F12,F10,F2      Stall; Stall;
9  S.D -16(R1),F12 ;     //Drop DSUBUI & BNEZ
10 L.D F14,-24(R1)       Stall;
11 ADD.D F16,F14,F2      Stall; Stall;
12 S.D -24(R1),F16
13 DSUBUI R1,R1,#32      //Modified as 4*8 = 32
14 BNEZ R1,LOOP
15 NOP

Number of clock cycles per iteration = 27/4 = 6.75.

Almost nothing is gained over the reordered loop (7 cycles) by unrolling alone.

What if we unroll and move the loads to the beginning of the program:

Loop:
1  L.D F0, 0(R1);
2  L.D F6,-8(R1);
3  L.D F10,-16(R1);
4  L.D F14,-24(R1);
5  ADD.D F4,F0,F2;
6  ADD.D F8,F6,F2;
7  ADD.D F12,F10,F2;
8  ADD.D F16,F14,F2;
9  S.D 0(R1),F4;
10 S.D -8(R1),F8;
11 DSUBUI R1,R1,#32;     //placed one slot early so that BNEZ does not stall on R1
12 S.D 16(R1),F12;       //16-32 = -16
13 BNEZ R1,LOOP;
14 S.D 8(R1),F16 ;       //8-32 = -24

14 cycles / 4 = 3.5 cycles/iteration, which is pretty close to 3 operations per loop.

This is the bare minimum to load, add, and store the result, i.e., x[i] = x[i] + s;

The number of cycles should approach 3 as the number of times we unroll increases.
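The trend toward 3 cycles can be checked with simple arithmetic. The sketch below is an illustration, not part of the lecture: it assumes that an unrolled and scheduled body has no stalls and a filled branch delay slot, so that k copies of the body cost k loads + k adds + k stores + DSUBUI + BNEZ = 3k + 2 cycles. That assumption only holds once k is large enough for independent instructions to cover the latencies (k = 4 in the example above). The function name `cycles_per_element` is hypothetical.

```python
def cycles_per_element(k):
    """Cycles per original loop iteration when the body is unrolled k times.

    Assumes a stall-free schedule with the branch delay slot filled:
    k loads + k adds + k stores + DSUBUI + BNEZ = 3k + 2 cycles in total.
    """
    return (3 * k + 2) / k


for k in (4, 8, 16, 64):
    print(k, cycles_per_element(k))
```

For k = 4 this gives 14/4 = 3.5, matching the schedule above; the two overhead instructions are amortized over more elements as k grows.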

However, it is not practical to unroll too many times. (why?)

Dynamic Scheduling (Out-of-Order Execution in Hardware)

Simplifies compiling

Key idea: Move the interdependent instructions as far away as possible without violating the integrity of the code.

Example:

LDD R1,R2;
ADD R1,R3;
SUB R0,R4;

can execute faster when it is reordered as

LDD R1,R2;
SUB R0,R4;
ADD R1,R3;
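Whether two instructions may be swapped comes down to checking the registers they read and write. Here is a minimal Python sketch of that check. It assumes a two-operand format in which the first register of ADD and SUB is both a source and the destination, and in which LDD R1,R2 loads R1 using R2 as the address; the lecture does not spell out these semantics, and the names `Instr` and `hazards` are hypothetical.

```python
from collections import namedtuple

# text: assembly string, dest: register written, sources: registers read
Instr = namedtuple("Instr", "text dest sources")


def hazards(first, second):
    """Return the hazards that forbid moving `second` ahead of `first`."""
    found = set()
    if first.dest in second.sources:
        found.add("RAW")  # second reads what first writes
    if second.dest in first.sources:
        found.add("WAR")  # second overwrites what first still has to read
    if first.dest == second.dest:
        found.add("WAW")  # both write the same register
    return found


# Operand roles below are assumptions about the two-operand format.
ldd = Instr("LDD R1,R2", "R1", ("R2",))
add = Instr("ADD R1,R3", "R1", ("R1", "R3"))
sub = Instr("SUB R0,R4", "R0", ("R0", "R4"))

print(hazards(ldd, add))  # non-empty: ADD must stay behind LDD
print(hazards(add, sub))  # empty set: SUB may move ahead of ADD
```

Under these assumptions SUB is independent of both LDD and ADD, so placing it between them fills the load delay without changing the result.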

Scoreboarding Algorithm

(Copyright by John Kubiatowicz (http.cs.berkeley.edu/~kubitron). The notes (actual images) that follow are taken from http.cs.berkeley.edu/~kubitron with some changes.)

Key Steps:

1--Decode and issue instruction (In-order). Do not issue if there is a structural hazard (resource conflict).

2--Do not issue if a previously issued instruction that has not yet completed has the same destination register (i.e., stall when there is potential for WAW).

3--Read operands. Do not proceed to execution until all source operands are available.

4--Execute (Out-of-order)-Write Results

Hazards are controlled and avoided by the scoreboard by stalling instructions that would cause structural and data hazards.

The scoreboard decides when a stalled instruction can read its operands, resume execution, and when it can write its results into its target registers. Since instructions can execute out-of-order, it is possible to have WAR hazards unless they are stalled.

It is also possible to have a RAW hazard as scoreboarding does not use data forwarding, and load instructions followed by register instructions have data dependences that cannot be avoided even if data forwarding is used.

Stalling instructions with WAW hazards ensures that out-of-order execution of instructions will not cause an incorrect writing of results into registers.

Stalling instructions until their operands become ready ensures that out-of-order execution of instructions will not lead to incorrect results, and also avoids potential deadlocks:

Stalling the issue of SUB until it gets both its operands avoids a potential deadlock: If both ADD and SUB are allowed to issue and execute with SUB ahead of ADD, then ADD must wait for SUB to complete execution as there is only one integer unit, and SUB must wait to read R1 and write back R2, leading to a deadlock.

In reality, this should not happen for the very reason that there is a logical order of execution between the two instructions (i.e., ADD followed by SUB), and forcing SUB not to issue until it gets both its operands avoids the deadlock.

Instructions with WAR hazards are allowed to execute out-of-order as they do not imply any data dependency between instructions. They are stalled after execution to make sure that the previously issued instructions can read their operands before they are written over.

Scoreboard Data Structure:

For each instruction we keep a table of entries:

Instruction status: (1) ID and Issue, (2) Operand Read, (3) Execute, (4) Write back
Indicates where the instruction is in the pipeline.

For each functional unit we keep a table of entries:

Functional unit status: Indicates the state of the functional unit (FU).

Busy: Yes or No
Op: +, -, *, /, etc.
Fj, Fk: Source registers from which the functional unit receives its operands.
Fi: Destination register to which the functional unit writes its result.
Qj, Qk: Functional units outputting to source registers Fj, Fk.
Rj, Rk: Flags indicating when Fj, Fk are ready (To avoid RAW hazards)

Instruction Status, Condition to Proceed, and Bookkeeping (When done):

Issue
   Condition to proceed: FU is free (no structural hazards), no WAW hazard
   Bookkeeping:
      Busy(FU) <- Yes;
      Op(FU) <- op; (From instr)
      Fi(FU) <- D (From instr)
      Fj(FU) <- S1 (From instr)
      Fk(FU) <- S2 (From instr)
      Qj(FU) <- Result(S1) (functional unit that outputs to S1)
      Qk(FU) <- Result(S2) (functional unit that outputs to S2)
      if Qj(FU) = null, Rj <- Yes else Rj <- No
      if Qk(FU) = null, Rk <- Yes else Rk <- No
      Result(D) <- FU

Read operands
   Condition to proceed: Rj(FU) and Rk(FU) are Yes
   Bookkeeping: Rj <- No, Rk <- No (Clear for next read of operands)

Execution complete
   Condition to proceed: FU is done

Write result
   Condition to proceed: no WAR hazard. There is a WAR hazard if, for some FUx,
      Fj(FUx) = Fi(FU) AND Rj(FUx) = Yes OR
      Fk(FUx) = Fi(FU) AND Rk(FUx) = Yes
   Bookkeeping:
      For all f,
         if Qj(f) = FU then Rj(f) <- Yes;
         if Qk(f) = FU then Rk(f) <- Yes;
      Busy(FU) <- No
      Result(Fi(FU)) <- Null; (Remove Fi from FU.)
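The conditions and bookkeeping above can be written as a small data structure. The following Python sketch is one possible reading of the table, not the lecture's or Kubiatowicz's implementation: the class and method names are hypothetical, `None` stands in for null, and functional units are picked by name rather than by operation type.

```python
class FunctionalUnit:
    def __init__(self, name):
        self.name = name
        self.busy = False
        self.op = self.fi = self.fj = self.fk = None
        self.qj = self.qk = None   # FUs producing Fj, Fk (None = no producer)
        self.rj = self.rk = False  # operand ready and not yet read


class Scoreboard:
    def __init__(self, fu_names):
        self.fus = {n: FunctionalUnit(n) for n in fu_names}
        self.result = {}  # register -> name of the FU that will write it

    def can_issue(self, fu_name, dest):
        # No structural hazard (FU free) and no WAW hazard (no pending writer).
        return not self.fus[fu_name].busy and dest not in self.result

    def issue(self, fu_name, op, dest, s1, s2):
        fu = self.fus[fu_name]
        fu.busy, fu.op, fu.fi, fu.fj, fu.fk = True, op, dest, s1, s2
        fu.qj, fu.qk = self.result.get(s1), self.result.get(s2)
        fu.rj, fu.rk = fu.qj is None, fu.qk is None
        self.result[dest] = fu_name

    def can_read_operands(self, fu_name):
        fu = self.fus[fu_name]
        return fu.rj and fu.rk

    def read_operands(self, fu_name):
        fu = self.fus[fu_name]
        fu.rj = fu.rk = False  # clear for next read of operands

    def can_write_result(self, fu_name):
        # WAR check: no other busy FU still has to read our destination register.
        fu = self.fus[fu_name]
        for other in self.fus.values():
            if other is fu or not other.busy:
                continue
            if (other.fj == fu.fi and other.rj) or (other.fk == fu.fi and other.rk):
                return False
        return True

    def write_result(self, fu_name):
        fu = self.fus[fu_name]
        for other in self.fus.values():
            if other.qj == fu_name:
                other.rj, other.qj = True, None  # clearing Qj is an addition to the table
            if other.qk == fu_name:
                other.rk, other.qk = True, None
        del self.result[fu.fi]
        fu.busy = False


sb = Scoreboard(["mult", "add", "div"])
sb.issue("mult", "MUL.D", "F0", "F2", "F4")
sb.issue("add", "ADD.D", "F6", "F0", "F8")  # RAW on F0: must wait for mult
print(sb.can_read_operands("add"))           # False until mult writes F0
print(sb.can_issue("div", "F6"))             # False: WAW with the pending ADD.D
```

The registers and instructions in the usage lines are made up for illustration. The sketch models only the scoreboard state, not cycle timing or the execution of the operations themselves.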

Tomasulo's Algorithm (www.cs.ucf.edu/courses/eel5708/slides/lecture_15_tomasulo.ppt)

The main idea is to use register renaming to avoid stalling instructions because of WAR and WAW hazards.

Distributed Computing: Control is distributed into the function units.

Register Renaming: Operands are held in buffers, called "reservation stations". Registers in instructions are replaced by values or pointers to reservation stations (RS).

Parallelism: More reservation stations than registers => more parallelism than compiler optimization.

Common Data Bus: Results flow to FUs from reservation stations over a Common Data Bus that broadcasts the results to all FUs => avoidance of RAW hazards (similar to data forwarding).

Loads and Stores: are treated as FUs with reservation stations.

Example (Register renaming):

DIV.D F0,F2,F4;
ADD.D F6,F0,F8;    ---> RAW with DIV.D
SD.D F6, 0(R1);    ---> RAW with ADD.D
SUB.D F8,F10,F14;  ---> WAR with ADD.D
MUL.D F6,F10,F8;   ---> WAW with ADD.D, RAW with SUB.D, WAR with SD.D

WAW and WAR hazards are eliminated if we rename F6 in ADD.D and SD.D as S, and F8 in SUB.D and MUL.D as T.

DIV.D F0,F2,F4;
ADD.D S,F0,F8;
SD.D S, 0(R1);
SUB.D T,F10,F14;
MUL.D F6,F10,T;
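The effect of renaming can be checked mechanically by listing the register hazards before and after. The following Python sketch does a simple pairwise comparison of destinations and sources; the function name `find_hazards` and the tuple layout are hypothetical, and the store is modeled as reading its data register and R1 without writing any register.

```python
def find_hazards(program):
    """Return (kind, earlier_index, later_index) for every register hazard.

    program is a list of (text, dest, sources); dest is None for a store.
    """
    found = []
    for j, (_, dest_j, src_j) in enumerate(program):
        for i in range(j):
            _, dest_i, src_i = program[i]
            if dest_i is not None and dest_i in src_j:
                found.append(("RAW", i, j))
            if dest_j is not None and dest_j in src_i:
                found.append(("WAR", i, j))
            if dest_j is not None and dest_j == dest_i:
                found.append(("WAW", i, j))
    return found


original = [
    ("DIV.D F0,F2,F4", "F0", ("F2", "F4")),
    ("ADD.D F6,F0,F8", "F6", ("F0", "F8")),
    ("SD.D F6,0(R1)", None, ("F6", "R1")),
    ("SUB.D F8,F10,F14", "F8", ("F10", "F14")),
    ("MUL.D F6,F10,F8", "F6", ("F10", "F8")),
]
renamed = [
    ("DIV.D F0,F2,F4", "F0", ("F2", "F4")),
    ("ADD.D S,F0,F8", "S", ("F0", "F8")),
    ("SD.D S,0(R1)", None, ("S", "R1")),
    ("SUB.D T,F10,F14", "T", ("F10", "F14")),
    ("MUL.D F6,F10,T", "F6", ("F10", "T")),
]

for kind, i, j in find_hazards(original):
    print(kind, original[i][0], "->", original[j][0])
print(sorted({kind for kind, _, _ in find_hazards(renamed)}))  # only RAW remains
```

The pairwise check ignores intervening redefinitions of a register, which is adequate for this five-instruction example but would over-report hazards in longer code.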

Renaming can be done statically by a compiler, but static renaming has two limitations:

- The number of registers places an upper bound on the number of registers that can be renamed.

- The renaming of a register is limited to blocks of code between branches unless a sophisticated analysis of where branches might lead the program execution is carried out across branches.

Hardware renaming (Tomasulo's Algorithm) removes these two restrictions.

Tomasulo Algorithm Steps:

1. Issue: Get the next instruction from the FIFO instruction queue. If a matching reservation station is free (no structural hazard), control issues the instruction. If the operands are available, it sends them to the reservation station; otherwise, it keeps track of the functional units that will produce the operands, and renames the registers in the instruction by pointing them to the reservation station buffers.

2. Execution: Operate on operands (EX). If both operands of a function unit are ready then execute the instruction; otherwise monitor the Common Data Bus for results. Waiting until all operands are ready avoids RAW hazards. (Potential structural hazard due to multiple issue of instructions: more than one reservation station may compete for execution on the attached FU.)

3. Write result: Complete execution (WB). Write on the Common Data Bus to all awaiting units; mark reservation stations as available; store results to registers and/or memory.

Each reservation station has seven fields:

1-Op: Operation to perform in the attached functional unit

2-Qj, 3-Qk: Reservation stations that will produce the source operands for the attached functional unit. Qj, Qk = 0 => the operand is ready or available in Vj, Vk.

4-Vj, 5-Vk: Values of source operands (For each operand, either the V field or the Q field is valid.)

6-A: Address of a memory operand or result.

7-Busy: Indicates reservation station or FU is busy.

The register file has a Qi field for each register which specifies the id of the reservation station that will write into the corresponding register if one exists. Qi is left blank when there are no pending instructions that will write that register.
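The reservation station fields and the Qi field of the register file can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the lecture's implementation: it covers arithmetic stations only (no A field, no loads or stores), uses `None` where the notes use 0 or blank, names a station explicitly at issue, and lets the caller choose the order in which ready stations execute. All class, method, and station names are hypothetical.

```python
import operator

# Operations supported by this sketch (an assumption, not a full FP unit).
OPS = {"ADD.D": operator.add, "SUB.D": operator.sub,
       "MUL.D": operator.mul, "DIV.D": operator.truediv}


class Station:
    def __init__(self, name):
        self.name = name
        self.busy = False
        self.op = None
        self.vj = self.vk = None  # operand values, valid when Qj/Qk is None
        self.qj = self.qk = None  # producing stations, None = operand ready


class Tomasulo:
    def __init__(self, station_names, registers):
        self.stations = {n: Station(n) for n in station_names}
        self.regs = dict(registers)  # register -> value
        self.qi = {}                 # register -> station that will write it

    def issue(self, name, op, dest, s1, s2):
        rs = self.stations[name]
        if rs.busy:
            return False  # structural hazard: station not free
        rs.busy, rs.op = True, op
        rs.qj, rs.qk = self.qi.get(s1), self.qi.get(s2)
        rs.vj = self.regs[s1] if rs.qj is None else None
        rs.vk = self.regs[s2] if rs.qk is None else None
        self.qi[dest] = name  # renaming: dest now refers to this station
        return True

    def ready(self, name):
        rs = self.stations[name]
        return rs.busy and rs.qj is None and rs.qk is None

    def run(self, name):
        """Execute a ready station and broadcast its result on the CDB."""
        if not self.ready(name):
            raise RuntimeError(name + " is still waiting for operands")
        rs = self.stations[name]
        value = OPS[rs.op](rs.vj, rs.vk)
        for other in self.stations.values():
            if other.qj == name:
                other.vj, other.qj = value, None
            if other.qk == name:
                other.vk, other.qk = value, None
        for reg in [r for r, q in self.qi.items() if q == name]:
            self.regs[reg] = value
            del self.qi[reg]
        rs.busy = False
        return value


t = Tomasulo(["Div1", "Add1", "Add2", "Mult1"],
             {"F0": 0.0, "F2": 8.0, "F4": 2.0, "F6": 0.0,
              "F8": 1.0, "F10": 5.0, "F14": 3.0})
t.issue("Div1", "DIV.D", "F0", "F2", "F4")
t.issue("Add1", "ADD.D", "F6", "F0", "F8")   # waits on Div1, captures old F8
t.issue("Add2", "SUB.D", "F8", "F10", "F14")
t.issue("Mult1", "MUL.D", "F6", "F10", "F8")  # waits on Add2
for name in ("Add2", "Mult1", "Div1", "Add1"):  # an out-of-order completion
    t.run(name)
print(t.regs["F0"], t.regs["F6"], t.regs["F8"])
```

In this run SUB.D and MUL.D complete before ADD.D, yet ADD.D still uses the old value of F8 (captured at issue) and F6 ends up holding the MUL.D result, because Qi for F6 points at the last station issued to write it. The register values are made up for the example.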

End of Lecture