Pipelining Hazards
Hazards are conditions that will lead to incorrect results if not mitigated
Mitigation strategies: add hardware, stall the pipeline or fix the software (e.g. add no-ops)
Hazard types: data, control, structural
Stalling
Stalling the pipeline means to halt an instruction, not allowing it to progress.
Data Hazards
ADD X2, X0, X1
ADDI X3, X2, #1
Hazard: First instruction writes to X2
, second instruction uses it, but we expect that it has been updated at this point
Solution: We need a 2 cycle stall
We assume something called forwarding through the register file (i.e it writes in first half and reads in second half of the clock cycle)
Observation: the result of X0 + X1
is present in the pipeline after clock cycle 3 (and this matters because that means we can use it earlier instead of stalling 2 cycles, see #Data Forwarding)
Data Forwarding
Data forwarding allows us to access data as soon as it's ready, so we don't have to wait for writeback to complete.
ADD X2, X0, X1
ADDI X3, X2, #1
ADD X4, X5, X2
Now instruction 3 also uses X2
. Should we stall the pipeline again? That would defeat the point of pipelining, we might as well just use a single-cycle processor. Instead, we can forward the data.
The forwarding unit controls the ALU input (A and B) mux selects. The logic for A is as follows:
// If the current instruction writes to a register and we use that register in the next instruction
if (EX/ME.RegWrite && EX/ME.Rd == ID/EX.Rn) {
ForwardA = 2 // Forward from ME
} else if (ME/WB.RegWrite && ME/WB.Rd == ID/EX.Rn) {
ForwardA = 1 // Forward from WB
} else {
ForwardA = 0 // No forwarding
}
Load Hazards
A load hazard is a special case of a data hazard: the value is produced in ME instead of EX. The following dependent instruction must stall for 1 cycle (even with forwarding).
MOV X1, #3
LDR X2, [X0, #8]
ADD X3, X1, X2
We need to stall instruction 3 (and consequently any proceeding instructions). After 1 cycle, the load hazard is gone and the stall signal is de-asserted.
Alternatively, instead of stalling, a compiler can schedule (re-order) instructions to avoid a load hazard: ```armasm LDR X2, [X0, #8] MOV X1, #3 ADD X3, X1, X2 ```Logic to detect load hazards while in ID:
if (ID/EX.MemRead && (ID/EX.Rd == IF/ID.Rn || ID/EX.Rd = IF/ID.Rm)) {
stall = 1
} else {
stall = 0
}
in other words, if the instruction in EX is LDR (only LDR
has MemRead
control signal) and the destination register of the instruction in EX is equal to any of the src registers of instructions in ID, the stall.
Control Hazards
Control hazards are caused by branch instructions. They are an issue because we are branching to new code while there are other instructions lined up in the pipeline.
The branch penalty can be reduced to 1 cycle by:
- Moving the branch adder from EX to ID and connecting its output to the PC mux
- Conditional branches also require forwarding paths from ME and WB to ID for checking the branch condition
Conditional Branching
If a conditional branch depends on a preceding:
- ALU instr, it must stall 1 cycle
- LDR instr, it must stall 2 cycles
Case 1: conditional branch dependent on preceding ADD
ADD X1, X2, X3
CBZ X1, label
ORR X4, X5, X6
Note: if the CBZ is taken, the ORR will be squashed
Impact on CPI (cycles per instruction)
- Assume 20% of dynamic instruction mix are branches
- 50% of branches are unconditional
- 50% of branches are conditional (half are taken)
Calculating the branch penalty:
- Ideal CPI = 1
- Average CPI: 1.15
Conditional Branch Prediction
- With deeper pipelines (e.g 10-20 stages) and wider pipelines (4-8 instr. / stage), branch penalties are more significant
- Branch prediction is worth the extra hardware / chip area
- Predict the branch decision (taken or not taken) and the branch target address
- Speculatively execute the predicted path
- Execute the branch to check the prediction and recover if wrong (squash the mispredicted instructions and start fetching correct instructions)
Static prediction (without runtime history):
- Forward branch: predict not taken (easy)
- Backward branch: predict taken (more likely -- loops)
Dynamic Prediction (with runtime history):
- Low-order PC bits index into the BHT (e.g 1200134)
if (instr.address == PC && predictor == T) {
PC <- targetAddress
} else {
PC <- PC + 4
}
- On a correct prediction, the branch penalty is 0 cycles
Structural Hazards
-
Occur when different instructions need the same resource (at the same time)
-
E.g instr. and data share same mem. IF and ME (for
LDR/STR
) can't operate in parallel- stall in IF
-
E.g no branch adder: branch target addr. must be calculated with the ALU in EX
- 2 cycle branch penalty
Solution: add more hardware!