|
<< Click to Display Table of Contents >> Navigation: ASA-EMulatR Reference Guide > Introduction > Appendix > Appendix J – Branch Prediction Mechanics |
This appendix describes the branch prediction subsystem implemented by CBox and BranchPredictor. Chapter 13.11 describes how the pipeline consumes, resolves, and trains predictions; this appendix documents the predictor structures, strategies, interface, and expected performance.
CBox (CBoxLib/) owns a BranchPredictor instance (BranchPredictor.h). CBox provides the pipeline-facing interface; BranchPredictor contains the prediction structures and algorithms. CBox follows the same construction pattern as PalBox — it receives CPUIdType, HWPCBBank*, and IPRStorage_HotExt* from ExecutionCoordinator and does not own these resources.
The predictor operates independently per CPU. In SMP configurations, each CPU has its own CBox and BranchPredictor with no shared state. Context switches flush predictor state via flushPredictionState().
Branch History Table (BHT)
512 entries, 2-way set associative. Each entry contains a 2-bit saturating counter that tracks the recent behavior of the branch at that address. The counter encodes four states:
Value |
State |
Prediction |
Transition on Taken |
|---|---|---|---|
00 |
Strongly not-taken |
Not-taken |
→ 01 (weakly not-taken) |
01 |
Weakly not-taken |
Not-taken |
→ 10 (weakly taken) |
10 |
Weakly taken |
Taken |
→ 11 (strongly taken) |
11 |
Strongly taken |
Taken |
→ 11 (saturates) |
On a not-taken outcome, the counter decrements (saturates at 00). Prediction is "taken" when counter >= 2 (bit 1 set). The 2-bit design tolerates one anomalous outcome without changing the prediction — a loop that iterates 100 times and exits once will be correctly predicted 99% of the time.
BHT indexing: the entry is selected by hashing the branch PC. The 2-way set associativity allows two branches that hash to the same index to coexist without aliasing. Power-of-2 sizing enables efficient masking (index = hash & 0x1FF).
Branch Target Buffer (BTB)
1024 entries. Each entry stores a PC → target address mapping. When the BHT predicts "taken," the BTB provides the target address. BTB entries are updated when a branch resolves as taken — the actual target is written into the BTB. For not-taken branches, no BTB update occurs (the sequential PC+4 is used).
Global History Register
16-bit shift register recording the taken/not-taken outcome of the last 16 branches. Each branch outcome shifts in from the right: m_globalHistory = (m_globalHistory << 1) | taken. The global history register is maintained for future use by advanced prediction strategies (gshare, tournament). The current HistoryTable strategy does not use global history for index computation.
The active prediction strategy is controlled by the ICCSR (Implementation-Specific Control and Status Register) branch prediction mode bits. CBox provides setStrategy(BranchStrategy) to change the active strategy at runtime (typically during ICCSR writes). Three strategies are implemented:
NeverTaken (ICCSR mode 0) — always predicts not-taken. Every branch fetches sequentially from PC+4. Useful as a baseline for measuring predictor benefit. Hit rate: ~30–40% for typical workloads (most loops are backward branches that are usually taken).
DisplacementBased (ICCSR mode 1) — predicts taken if the branch displacement is negative (backward branch), not-taken if positive (forward branch). This is a static heuristic: backward branches are typically loop-closing branches that are taken on all but the final iteration. No BHT state is used. Hit rate: ~65–75% for typical workloads.
HistoryTable (ICCSR mode 2, default) — uses the BHT 2-bit saturating counter to predict taken/not-taken, and the BTB for target address. This is a dynamic strategy that learns from runtime behavior. Hit rate: ~85–95% for typical workloads, varying by branch type (see Section J.5).
All prediction hot-path methods are marked AXP_HOT AXP_ALWAYS_INLINE to eliminate call overhead. CBox avoids virtual function calls on all critical paths.
queryPrediction(pc, outPredictedTarget) → bool
Called by IBox during fetch. Checks BHT for taken/not-taken, looks up BTB for the target if taken. Returns true if a taken prediction with valid target is available. Marked AXP_HOT.
predictTaken(pc, displacement) → bool
Lower-level interface. Returns the taken/not-taken prediction for the given PC using the active strategy. Delegates directly to BranchPredictor::predict(). Marked AXP_HOT AXP_ALWAYS_INLINE.
getPredictedTarget(pc) → quint64
Returns the BTB-cached target address for the given PC. Marked AXP_HOT.
recordPrediction(pc, predictedTarget)
Called by IBox after making a speculative prediction. Updates BTB with the predicted target, increments prediction statistics, updates global history. Marked AXP_HOT AXP_FLATTEN.
recordBranchResolution(pc, actualTarget, wasTaken, wasPredicted, predictedCorrectly)
Called by the pipeline in stage_WB() at retirement. Trains the BHT (increments or decrements the saturating counter), updates BTB with actual target if taken, updates prediction accuracy statistics, and updates global history with the actual outcome. This is the sole training entry point — all predictor learning happens here.
flushPredictionState()
Clears all prediction state: invalidates all BHT entries, invalidates all BTB entries, clears global history register, resets all counters to weakly not-taken (01). Called on context switch and pipeline flush from PAL mode. After flush, the predictor cold-starts — all branches will initially be predicted not-taken until training rebuilds the tables.
setStrategy(BranchStrategy)
Changes the active prediction strategy. Called when ICCSR branch prediction mode bits are written. Takes effect immediately for the next prediction.
getStatistics() / resetStatistics()
Returns or resets cumulative prediction statistics: totalPredictions, correctPredictions, incorrectPredictions, and computed accuracy().
The following hit rates are expected for the HistoryTable strategy under typical workloads. These are design targets derived from published EV6 branch prediction literature and empirical measurement of the 2-bit saturating counter scheme.
Branch Type |
Instructions |
Expected Hit Rate |
Notes |
|---|---|---|---|
Unconditional direct |
BR, BSR |
100% |
Target known at decode, no prediction needed |
Backward conditional (loops) |
BNE, BGT, etc. |
95%+ |
Strongly biased taken; 2-bit counter saturates quickly |
Forward conditional |
BEQ, BLE, etc. |
80–90% |
More variable; error handling branches bias not-taken |
Function call (direct) |
BSR |
100% |
Unconditional, target at decode |
Function return |
RET |
~60% (BTB only) |
Indirect target varies by call site; RAS would improve to ~98% |
Indirect jump |
JMP, JSR, COROUTINE |
~50–70% |
Target varies; BTB caches last target only |
Aggregate: Across a typical workload mix (integer-heavy, moderate branching), the HistoryTable strategy is expected to achieve 85–93% overall accuracy. The primary sources of misprediction are indirect jumps (switch statements, virtual calls) and loop exit branches (the single misprediction when a loop terminates).
Misprediction cost: Each misprediction invalidates up to 3 pipeline stages (IF, DE, IS), wasting the cycles spent fetching and decoding the wrongly-predicted instruction stream. With a 6-stage pipeline, the maximum penalty is 3 cycles per misprediction.
The complete prediction lifecycle across the pipeline:
IBox fetch: detect branch in decoded instruction
↓
IBox → CBox::queryPrediction(branchPC, outTarget)
↓ BHT lookup → taken/not-taken
↓ BTB lookup → predicted target (if taken)
↓
IBox → CBox::recordPrediction(branchPC, predictedTarget)
↓ Updates statistics, global history
↓
IBox populates FetchResult: predTaken, predTarget
↓
stage_IF: transfers prediction into PipelineSlot
↓
stage_EX: grain resolves actual branch outcome
↓ Sets slot.branchTaken, slot.branchTarget
↓ Compares actual vs predicted
↓ If mismatch → slot.flushPipeline = true
↓
stage_WB: retirement
↓
Pipeline → CBox::recordBranchResolution(pc, target, taken, ...)
↓ BHT counter increment/decrement (training)
↓ BTB update (if taken)
↓ Statistics update
↓ Global history update
Cache friendliness: BHT and BTB are contiguous arrays (not linked structures). Power-of-2 sizing enables efficient index masking. The 2-way set-associative BHT uses simple linear probing within each set.
Hot path optimization: predictTaken() and getPredictedTarget() are marked AXP_HOT AXP_ALWAYS_INLINE. recordPrediction() is marked AXP_HOT AXP_FLATTEN. The hash function is AXP_ALWAYS_INLINE. All critical paths avoid virtual function calls.
Debug support: DEBUG_BRANCH_PREDICTION compile-time flag enables detailed per-branch logging. Statistics tracking is always enabled with minimal overhead. Statistics can be queried via getStatistics() and reset via resetStatistics().
Context switch behavior: flushPredictionState() is called on context switch. All BHT and BTB entries are invalidated. Counters reset to weakly not-taken (01). The predictor cold-starts for the new process. This prevents stale predictions from the previous process from causing systematic mispredictions in the new process.
The current implementation is a simplified model of the EV6 branch prediction subsystem. The following table compares the emulator's predictor against the documented EV6 hardware:
Feature |
EV6 Hardware |
ASA-EMulatR |
|---|---|---|
BHT entries |
4096 |
512 (2-way) |
BTB entries |
2048 |
1024 |
Local history predictor |
4096-entry, 10-bit history |
Not implemented |
Global history bits |
12 |
16 (maintained, not yet used for indexing) |
Return Address Stack |
8-entry |
Not implemented |
Planned enhancements:
Return Address Stack (RAS) — an 8-entry stack dedicated to JSR/RET prediction. JSR pushes PC+4 onto the stack; RET pops the predicted return address. This would improve RET prediction from ~60% to ~98%, as function return targets follow a strict LIFO pattern.
Gshare predictor — XOR the PC with global history to index the BHT. This captures correlations between branches (e.g., if branch A was taken, branch B is likely not-taken). Relatively simple to implement using the existing global history register.
Tournament predictor — run multiple prediction schemes in parallel with a meta-predictor that selects the most accurate scheme for each branch. The EV6 hardware uses a tournament between local and global predictors; this is the most architecturally accurate enhancement.
Indirect branch predictor — a specialized predictor for computed jumps (switch statements, virtual method dispatch) that tracks multiple targets per PC rather than the single BTB entry used today.
These enhancements are not required for architectural correctness — misprediction affects performance modeling only, never program behavior. The emulator guarantees correct execution regardless of prediction accuracy.
See Also: 13.11 Branch Handling (pipeline integration); Chapter 14 – Execution Domains (“Boxes”) (CBox role); CBoxLib/BranchPredictor.h (source); Alpha 21264/EV6 Hardware Reference Manual, Section 2.5 (branch prediction).