Appendix L – Pipeline Cycle Mechanics


This appendix provides the unified view of one pipeline cycle — what happens during a single tick() call when 6 instructions occupy 6 stages simultaneously. The individual stages are documented in Chapter 13.6; the retirement contract is documented in Appendix K; this appendix explains how they combine into a functioning pipeline.

 


 

L.1 The Steady-State Snapshot

 

At steady state, the pipeline holds 6 instructions at 6 different stages of execution. Each instruction is at a different point in its lifecycle. The oldest instruction is about to retire in WB; the youngest has just been fetched in IF. All 6 are processed within a single tick() call.

 

Cycle N — Steady State (6 instructions in flight):

 

Stage     Slot     PC       Status
WB (5)    [head]   0x1000   Retiring (oldest, about to commit)
MEM (4)   [h-1]    0x1004   Register writeback via commitPending
EX (3)    [h-2]    0x1008   Grain executing (all real work here)
IS (2)    [h-3]    0x100C   Issue (pass-through)
DE (1)    [h-4]    0x1010   Decode (pass-through)
IF (0)    [h-5]    0x1014   Just fetched (youngest)

 

 Architectural PC: 0x1000 (matches WB, about to become 0x1004)

 Fetch PC: 0x1018 (IBox will fetch this next cycle)

 

The 6 instructions span 24 bytes of sequential code (6 × 4 bytes per Alpha instruction). The architectural PC always matches the WB slot and advances only when an instruction retires. The youngest in-flight instruction (IF, 0x1014) is 5 instructions ahead of the architectural PC, so the fetch PC (0x1018) runs 6 instructions (24 bytes) ahead.
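The PC arithmetic in the snapshot can be checked with a few lines of C++. This is a minimal sketch: the helper names (pcAtStage, nextFetchPC) and constants are hypothetical, and it assumes strictly sequential code with no branches in flight.

```cpp
#include <cstdint>

// Hypothetical helpers; only the 4-byte instruction size and the
// six-stage depth are taken from the text above.
constexpr std::uint64_t kInstrBytes = 4;
constexpr int kStageCount = 6;

// PC held by a logical stage (0 = IF ... 5 = WB) when the WB slot holds
// archPC and the in-flight code is strictly sequential.
constexpr std::uint64_t pcAtStage(std::uint64_t archPC, int stage) {
    return archPC + static_cast<std::uint64_t>(kStageCount - 1 - stage) * kInstrBytes;
}

// Address IBox fetches next: one instruction past the IF slot.
constexpr std::uint64_t nextFetchPC(std::uint64_t archPC) {
    return archPC + static_cast<std::uint64_t>(kStageCount) * kInstrBytes;
}

// Compile-time checks against the Cycle N snapshot.
static_assert(pcAtStage(0x1000, 5) == 0x1000, "WB holds the architectural PC");
static_assert(pcAtStage(0x1000, 0) == 0x1014, "IF holds the youngest instruction");
static_assert(nextFetchPC(0x1000) == 0x1018, "fetch PC for the next cycle");
```

The static_assert lines reproduce the three addresses shown in the snapshot (0x1000, 0x1014, 0x1018).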

 


 

L.2 Anatomy of One tick() Call

 

tick() is called once per cycle by AlphaCPU::runOneInstruction(). It performs four actions in strict order:

 

Phase 1 — Supply fetch result. The FetchResult from IBox (containing the DecodedInstruction and grain pointer) is stored in m_pendingFetch, ready for stage_IF() to consume.

 

Phase 2 — Execute all 6 stages. execute() runs all stages in reverse order: stage_WB() → stage_MEM() → stage_EX() → stage_IS() → stage_DE() → stage_IF(). Each stage processes its slot independently. This is where all pipeline work occurs — retirement, register writeback, grain execution, fetch consumption.

 

Phase 3 — Rotate the ring buffer. advanceRing() increments m_head: m_head = (m_head + 1) % STAGE_COUNT. This causes every instruction to "move forward" one stage — the slot that was IF becomes DE, DE becomes IS, and so on. The slot that was WB wraps around and becomes the new IF slot (cleared, ready for the next fetch).

 

Phase 4 — Increment cycle counter. m_cycleCount++ records the passage of one pipeline cycle. BoxResult is returned to AlphaCPU indicating pipeline state (advance, stall, fault, PAL transfer).
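The four phases can be sketched as follows. This is an illustrative stand-in, not the AlphaPipeline source: MiniPipeline and callOrder are hypothetical, the stage bodies only record their names, the fetch result is reduced to a PC value, and tick() returns void where the real function returns a BoxResult.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Minimal stand-in for the pipeline; stage bodies only log their names.
struct MiniPipeline {
    static constexpr int STAGE_COUNT = 6;

    int m_head = 0;
    std::uint64_t m_cycleCount = 0;
    std::uint64_t m_pendingFetch = 0;    // stand-in for the real FetchResult
    std::vector<std::string> callOrder;  // hypothetical trace, illustration only

    void stage_WB()  { callOrder.push_back("WB"); }
    void stage_MEM() { callOrder.push_back("MEM"); }
    void stage_EX()  { callOrder.push_back("EX"); }
    void stage_IS()  { callOrder.push_back("IS"); }
    void stage_DE()  { callOrder.push_back("DE"); }
    void stage_IF()  { callOrder.push_back("IF"); }

    // Oldest stage first, youngest last.
    void execute() {
        stage_WB(); stage_MEM(); stage_EX();
        stage_IS(); stage_DE();  stage_IF();
    }

    void advanceRing() noexcept { m_head = (m_head + 1) % STAGE_COUNT; }

    void tick(std::uint64_t fetchedPC) {
        m_pendingFetch = fetchedPC;  // Phase 1: supply fetch result
        callOrder.clear();
        execute();                   // Phase 2: run all six stages
        advanceRing();               // Phase 3: rotate the ring
        ++m_cycleCount;              // Phase 4: count the cycle
    }
};
```

After one call such as p.tick(0x1018), callOrder lists the stages from WB to IF, m_head has moved by one, and m_cycleCount is 1.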

 

Critical detail: advanceRing() does not move instruction data. It moves the head pointer. The PipelineSlot objects remain in their physical array positions — the logical-to-physical mapping changes because stage(logicalIndex) computes m_slots[(m_head - logicalIndex + STAGE_COUNT) % STAGE_COUNT]. By incrementing m_head, the physical slot that was WB (oldest) becomes IF (youngest), and every other slot shifts one logical position toward WB. This is O(1) — no data copies, no slot swaps.
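A small ring makes the "pointer moves, data stays" property concrete. The sketch below uses the index expression quoted in the paragraph above verbatim; the MiniRing and Slot types and the physicalIndex() helper are hypothetical simplifications of the real classes.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Simplified slot; the real PipelineSlot carries far more state.
struct Slot {
    bool valid = false;
    std::uint64_t pc = 0;
};

class MiniRing {
public:
    static constexpr int STAGE_COUNT = 6;
    enum Stage { IF = 0, DE = 1, IS = 2, EX = 3, MEM = 4, WB = 5 };

    // Logical stage index to physical array index, as quoted in the text.
    int physicalIndex(int logicalIndex) const {
        assert(logicalIndex >= 0 && logicalIndex < STAGE_COUNT);
        return (m_head - logicalIndex + STAGE_COUNT) % STAGE_COUNT;
    }

    Slot& stage(int logicalIndex) { return m_slots[physicalIndex(logicalIndex)]; }

    // O(1): only the head index changes, no slot is copied or swapped.
    void advanceRing() noexcept { m_head = (m_head + 1) % STAGE_COUNT; }

private:
    std::array<Slot, STAGE_COUNT> m_slots{};
    int m_head = 0;
};
```

Taking the address of stage(WB) before advanceRing() and comparing it with the address of stage(IF) afterwards yields the same object, which is the recycling behaviour described above. Likewise, the object that was stage(MEM) is stage(WB) after the call.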

 


 

L.3 Why Reverse Execution Order

 

The stage execution order (WB→MEM→EX→IS→DE→IF) is not arbitrary — it is the sole correctness mechanism for three pipeline properties:

 

RAW hazard resolution without forwarding. commitPending() runs in stage_WB() (or stage_MEM() depending on configuration) before stage_EX(). This means the register file contains the result from the instruction that executed last cycle before the current cycle's instruction reads registers. No forwarding muxes, no scoreboard stalls for data dependencies, no bypass paths. The reverse order creates an implicit one-cycle forwarding window for free.

 

Older-before-younger guarantee. By processing WB first, the oldest instruction retires (or faults) before any younger instruction can modify state. If the WB instruction faults, the flush discards all younger stages. If it retires normally, its results are committed before any younger instruction's results could interfere. This is the foundation of precise exceptions.

 

Commit-before-execute safety. commitPending() at the top of WB commits the deferred result from the previous cycle before the current cycle's stage_WB() makes any fault/CALL_PAL/retirement decisions. The committed result is from an older, already-validated instruction. Even if the current WB instruction faults, the older instruction's result is safely in the register file.

 

If the stages ran in forward order (IF→DE→IS→EX→MEM→WB), all three properties would require explicit forwarding logic, hazard detection, and speculative register state management. The reverse order eliminates all of this at zero runtime cost.
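The RAW argument can be demonstrated with a toy model. This is a sketch under stated assumptions, not the emulator's code: ToyCore, Pending, executeAddImm() and runCycle() are hypothetical, and the only behaviour borrowed from the text is that commitPending() writes a deferred result into the register file. An older instruction has produced r1 = 5 (still pending), and a younger instruction computes r1 + 1 in the same cycle.

```cpp
#include <array>
#include <cstdint>

// Deferred register write produced by the instruction that executed last cycle.
struct Pending {
    bool valid = false;
    int reg = 0;
    std::int64_t value = 0;
};

struct ToyCore {
    std::array<std::int64_t, 32> regs{};  // all registers start at 0
    Pending pending;

    // Write the deferred result into the register file.
    void commitPending() {
        if (pending.valid) {
            regs[pending.reg] = pending.value;
            pending.valid = false;
        }
    }

    // EX for a hypothetical "add immediate": reads ra from the register file.
    std::int64_t executeAddImm(int ra, std::int64_t imm) const {
        return regs[ra] + imm;
    }
};

// Runs one cycle in which an older instruction's result (r1 = 5) is pending
// and a younger instruction computes r1 + 1. Returns the younger result.
std::int64_t runCycle(bool reverseOrder) {
    ToyCore core;
    core.pending.valid = true;
    core.pending.reg = 1;
    core.pending.value = 5;

    std::int64_t result = 0;
    if (reverseOrder) {
        core.commitPending();                // older stage first
        result = core.executeAddImm(1, 1);   // sees r1 == 5
    } else {
        result = core.executeAddImm(1, 1);   // sees stale r1 == 0
        core.commitPending();
    }
    return result;
}
```

runCycle(true) returns 6, the architecturally correct value. runCycle(false) returns 1 because EX read the register before the older result was committed, which is the case a forward-order design would need forwarding or a stall to handle.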

 


 

L.4 Pipeline Warmup

 

After reset or flush, the pipeline is empty. It takes 6 cycles to reach steady state. During warmup, the pipeline contains bubbles (empty slots with valid = false) in the later stages:

 

Cycle 0: IF=Instr[0] DE=bubble IS=bubble EX=bubble MEM=bubble WB=bubble

Cycle 1: IF=Instr[1] DE=Instr[0] IS=bubble EX=bubble MEM=bubble WB=bubble

Cycle 2: IF=Instr[2] DE=Instr[1] IS=Instr[0] EX=bubble MEM=bubble WB=bubble

Cycle 3: IF=Instr[3] DE=Instr[2] IS=Instr[1] EX=Instr[0] MEM=bubble WB=bubble

Cycle 4: IF=Instr[4] DE=Instr[3] IS=Instr[2] EX=Instr[1] MEM=Instr[0] WB=bubble

Cycle 5: IF=Instr[5] DE=Instr[4] IS=Instr[3] EX=Instr[2] MEM=Instr[1] WB=Instr[0]

 └─── FIRST RETIREMENT: Instr[0] exits ───┘

 

Cycle 6: IF=Instr[6] ... steady state ... WB=Instr[1]

 One instruction enters, one retires — peak throughput.

 

Key observations: No instruction retires during cycles 0–4. The first retirement occurs at cycle 5 (6-cycle latency from fetch to retirement). Bubbles flow through harmlessly — stage_WB() checks slot.valid and returns immediately for empty slots. During warmup, stage_EX() processes empty slots as no-ops, so there is no wasted computation — only wasted pipeline capacity.

 

Warmup cost: 5 cycles of zero throughput before the first retirement. This cost is paid on every full flush (fault delivery, CALL_PAL, HW_REI, interrupt). For the SRM boot sequence, where CALL_PAL and TLB miss faults occur frequently, the warmup cost is a measurable fraction of total execution time.
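The warmup timeline can be reproduced with a short simulation. The sketch below is a hypothetical model: for readability it shifts values through a vector, whereas the real pipeline rotates a head pointer and never copies slots. It assumes IBox supplies one instruction every cycle.

```cpp
#include <vector>

// Returns the 0-based cycle in which the first instruction occupies WB,
// starting from an empty pipeline. Returns -1 for a non-positive depth.
int firstRetirementCycle(int stageCount) {
    if (stageCount <= 0) return -1;

    // Index 0 = IF ... stageCount-1 = WB. -1 marks a bubble.
    std::vector<int> stages(static_cast<std::size_t>(stageCount), -1);
    int nextInstr = 0;

    for (int cycle = 0; cycle < 4 * stageCount; ++cycle) {
        // Every occupant moves one stage toward WB.
        for (int s = stageCount - 1; s > 0; --s) {
            stages[s] = stages[s - 1];
        }
        stages[0] = nextInstr++;  // IF receives a new fetch

        if (stages[stageCount - 1] != -1) {
            return cycle;         // WB holds a real instruction
        }
    }
    return -1;
}
```

For the 6-stage pipeline, firstRetirementCycle(6) returns 5, matching the Cycle 5 row of the warmup diagram and the 5 preceding cycles of zero throughput.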

 


 

L.5 Pipeline Drain

 

Pipeline drain is the inverse of warmup: instructions exit the pipeline without new instructions entering. It happens in two ways. A flush empties slots immediately, in a single call. Alternatively, if IBox cannot supply a fetch (e.g., IBox is stalled on an ITB miss), the IF slot receives a bubble, and the pipeline drains naturally as instructions retire from WB without replacement.

 

Drain on full flush (e.g., fault delivery):

 

 Cycle N: 6 instructions in flight (steady state)

 Cycle N: Fault detected in WB → flush() called

 Cycle N: All 6 slots cleared instantly (single call)

 Cycle N+1: Pipeline empty — warmup begins at fault vector PC

 

Full flush discards up to 5 cycles of in-flight work (the 5 instructions in IF through MEM that had not yet retired). Only the WB instruction's fault is architecturally visible — the other 5 instructions are silently discarded as if they never executed.

 

Drain on partial flush (branch misprediction):

 

 Cycle N: 6 instructions in flight

 Cycle N: Misprediction detected in EX → flushYoungerSlots()

 Cycle N: IF, DE, IS cleared (3 instructions discarded)

 Cycle N: EX, MEM, WB preserved (3 instructions continue)

 Cycle N+1: IF fetches from corrected target

 Cycle N+3: Pipeline refilled — steady state resumes

 

Partial flush discards 3 instructions and refills in 3 cycles. The misprediction penalty is 3 cycles of lost throughput — the pipeline never fully empties.
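Both flush penalties fall out of the same simple model: count the cycles after the flush in which WB holds a bubble. The function below is a hypothetical sketch that shifts an array instead of rotating a ring, and it assumes IBox delivers a valid fetch on every cycle after the flush.

```cpp
#include <array>

// Counts the post-flush cycles in which WB holds a bubble.
// slotsCleared = 3 models flushYoungerSlots() from EX (IF, DE, IS cleared);
// slotsCleared = 6 models a full flush.
int lostRetirementSlots(int slotsCleared) {
    constexpr int N = 6;
    if (slotsCleared < 0) slotsCleared = 0;
    if (slotsCleared > N) slotsCleared = N;

    // Index 0 = IF ... 5 = WB. Start from steady state (all valid).
    std::array<bool, N> valid;
    valid.fill(true);
    for (int s = 0; s < slotsCleared; ++s) {
        valid[s] = false;  // clear the youngest slots first
    }

    int lost = 0;
    // 2 * N cycles is enough for every injected bubble to leave WB.
    for (int cycle = 0; cycle < 2 * N; ++cycle) {
        for (int s = N - 1; s > 0; --s) {
            valid[s] = valid[s - 1];  // advance one stage toward WB
        }
        valid[0] = true;              // IF receives a fresh fetch
        if (!valid[N - 1]) {
            ++lost;                   // nothing to retire this cycle
        }
    }
    return lost;
}
```

lostRetirementSlots(3) returns 3 (the misprediction penalty) and lostRetirementSlots(6) returns 5 (the full flush penalty). The full flush loses 5 rather than 6 because the first refetched instruction needs 5 further cycles to travel from IF to WB.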

 


 

L.6 The One-In-One-Out Invariant

 

At steady state (6 valid instructions in flight, no stalls, no flushes), exactly one instruction enters the pipeline at IF and exactly one instruction exits at WB per cycle. This is the single-issue throughput ceiling — one instruction per cycle (IPC = 1.0).

 

The invariant holds because:

 

The ring buffer has exactly STAGE_COUNT (6) slots. advanceRing() rotates the head pointer by one position per cycle. The slot exiting WB is recycled as the new IF slot. If IBox supplies a valid FetchResult, the new IF slot is filled. If IBox cannot supply (stall), the IF slot remains a bubble.

 

Throughput in practice: IPC = 1.0 is the theoretical maximum and the typical case for sequential integer code. Actual IPC is reduced by: pipeline stalls (barriers, multi-cycle FP, device backpressure), branch mispredictions (3-cycle refill penalty), full flushes (5-cycle warmup penalty), and IBox fetch stalls (ITB miss, cache miss). For SRM boot code with frequent CALL_PAL and TLB misses, observed IPC can drop below 0.3.
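A rough feel for how these penalties erode IPC can be had from a simple additive model. Everything here is an assumption for illustration: the CycleModel struct, the function names, and the sample counts are hypothetical, and the model ignores any overlap between penalties. Only the 5-cycle and 3-cycle penalties come from the text.

```cpp
#include <cstdint>

// Hypothetical event counts for a run; not measured data.
struct CycleModel {
    std::uint64_t retired;         // instructions retired
    std::uint64_t fullFlushes;     // faults, CALL_PAL, HW_REI, interrupts
    std::uint64_t mispredictions;  // partial flushes
    std::uint64_t stallCycles;     // cycles lost to EX/MEM/fetch stalls
};

constexpr std::uint64_t kFullFlushPenalty = 5;   // from the text
constexpr std::uint64_t kMispredictPenalty = 3;  // from the text

// One cycle per retired instruction plus the additive penalties.
std::uint64_t estimateCycles(const CycleModel& m) {
    return m.retired
         + m.fullFlushes * kFullFlushPenalty
         + m.mispredictions * kMispredictPenalty
         + m.stallCycles;
}

double estimateIpc(const CycleModel& m) {
    const std::uint64_t cycles = estimateCycles(m);
    return cycles == 0 ? 0.0
                       : static_cast<double>(m.retired) / static_cast<double>(cycles);
}
```

With made-up counts of 1000 retired instructions, 100 full flushes, 50 mispredictions and 1500 stall cycles, the model gives 3150 cycles and an IPC of roughly 0.32. With no penalties at all it gives exactly 1.0.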

 


 

L.7 Multi-Cycle Stall Impact

 

When an instruction stalls in stage_EX() (multi-cycle FP operation, device backpressure), the pipeline partially empties from the WB end while backing up from the EX end:

 

Cycle N: Steady state (6 instructions)

Cycle N+1: EX stalls (FP divide, ~20 cycles)

 WB retires its instruction normally

 MEM commits its instruction normally

 IF, DE, IS frozen (cannot advance past stalled EX)

 

Cycle N+2: WB: empty (retired last cycle, no replacement from MEM)

 MEM: empty (committed last cycle, no replacement from EX)

 EX: still stalled

 IS, DE, IF: frozen

 

Cycles N+3 through N+20: EX still stalled

 WB: empty (no retirements — throughput = 0)

 MEM: empty

 EX: stalled instruction

 IS, DE, IF: frozen (3 instructions waiting)

 

Cycle N+21: EX stall clears — instruction advances to MEM

Cycle N+22: Stalled instruction reaches WB, retires

 IS instruction enters EX

Cycle N+23: Pipeline refilling...

Cycle N+25: Steady state resumes (3-cycle refill from EX forward)

 

Observations: During the stall, the pipeline drains from the WB end. WB and MEM empty within 2 cycles. EX holds the stalled instruction. IS, DE, IF are frozen and hold their instructions intact — they are not discarded, just blocked. No retirement occurs for the duration of the stall (throughput = 0). After the stall clears, it takes 2 additional cycles for the stalled instruction to reach WB and retire, plus 3 more cycles for the pipeline to refill to steady state from the IF end.

 

Barrier stalls behave identically but stall in stage_MEM() instead of stage_EX(). The stalled instruction has already executed (its result is in slot.payLoad and m_pending). Only 1 stage (WB) drains ahead of the stall point, and the pipeline refills in 4 cycles after clearing (EX through IF must refill).

 


 

L.8 advanceRing() — Head Pointer Rotation

 

The ring buffer advancement is the simplest operation in the pipeline and the most important to understand correctly:

 

void advanceRing() noexcept {
    // Rotate the logical-to-physical mapping by one slot; no slot data moves.
    m_head = (m_head + 1) % STAGE_COUNT;
}

 

What this does: The physical slot at m_head was WB (oldest). After increment, that same physical slot is now IF (youngest) — ready to receive the next fetch. Every other physical slot shifts one logical position toward WB. No data is copied. No slots are swapped. The instructions stay in place; the interpretation of which slot is which stage changes.

 

Before advanceRing() (m_head = 3):

 Physical:  [0]    [1]    [2]    [3]    [4]    [5]
 Logical:   IS     DE     IF     WB     MEM    EX
 PCs:       100C   1010   1014   1000   1004   1008
                                 ↑ m_head (oldest)

 

After advanceRing() (m_head = 4):

 

 Physical:  [0]    [1]    [2]    [3]    [4]    [5]
 Logical:   EX     IS     DE     IF     WB     MEM
 PCs:       100C   1010   1014   1018   1004   1008
                                        ↑ m_head (oldest)

 

Physical slot [3] was WB (PC 0x1000, just retired). After rotation, it becomes IF and receives PC 0x1018 from the next fetch. Physical slot [4] is now the oldest (WB), holding PC 0x1004. Every instruction has advanced one logical stage.

 

When advanceRing() is skipped: If the pipeline is stalled (isPipelineStalled() returns true), advanceRing() is not called. All instructions remain in their current stages. The stalled stage will be re-evaluated on the next tick(). This is the mechanism that freezes the pipeline on stalls — not special stall logic in each stage, but simply not rotating the ring.
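The stall mechanism reduces to a single conditional around the rotation. In the sketch below, isPipelineStalled() and advanceRing() are named in the text, while the StallDemo struct and the stallCyclesRemaining counter are hypothetical stand-ins for whatever condition the real stages report. Stage execution is omitted.

```cpp
#include <cstdint>

struct StallDemo {
    static constexpr int STAGE_COUNT = 6;

    int m_head = 0;
    std::uint64_t m_cycleCount = 0;
    int stallCyclesRemaining = 0;  // hypothetical stall source

    bool isPipelineStalled() const { return stallCyclesRemaining > 0; }

    void advanceRing() noexcept { m_head = (m_head + 1) % STAGE_COUNT; }

    void tick() {
        // (all six stage functions would run here)
        if (isPipelineStalled()) {
            --stallCyclesRemaining;  // hold every instruction in place
        } else {
            advanceRing();           // normal cycle: rotate the ring
        }
        ++m_cycleCount;              // a stalled tick still counts as a cycle
    }
};
```

With stallCyclesRemaining set to 2, the first two tick() calls leave m_head unchanged and the third rotates it, while m_cycleCount reaches 3.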

 


 

L.9 Pipeline Performance Characteristics

 

Characteristic            Value              Notes
Pipeline depth            6 stages           IF → DE → IS → EX → MEM → WB
Issue width               1 (single-issue)   One instruction enters IF per cycle
Peak throughput           1 IPC              One retirement per cycle at steady state
Fetch-to-retire latency   6 cycles           Instruction spends 1 cycle per stage
Full flush penalty        5 cycles           Warmup from empty to first retirement
Misprediction penalty     3 cycles           Partial flush (IF/DE/IS), refill from corrected target

 

Comparison to EV6 hardware: The real EV6 is a 4-wide superscalar with a 7-stage pipeline, capable of 4 IPC peak. ASA-EMulatR's 6-stage single-issue pipeline is an intentional simplification — it preserves architectural correctness (instruction ordering, exception precision, memory model) while eliminating the complexity of superscalar dispatch, out-of-order execution, and speculative register renaming. The pipeline depth (6 vs 7 stages) is close enough that stall and flush penalties are representative of the hardware's behavior.

 


 

L.10 Pipeline Cycle Invariants

 

The following invariants hold for every pipeline cycle:

 

One tick, one cycle. tick() is called exactly once per pipeline cycle by AlphaCPU::runOneInstruction(). There is no mechanism to execute half a cycle or skip a cycle. The cycle counter (m_cycleCount) increments by exactly 1 per tick() call.

 

All stages execute per cycle. execute() calls all 6 stage functions every cycle, regardless of slot validity. Empty slots (valid = false) return immediately from their stage function. There is no logic that selectively skips stages.

 

Reverse order is unconditional. The execution order WB→MEM→EX→IS→DE→IF never varies. There is no dynamic reordering, no priority inversion, and no conditional stage skipping. The order is a compile-time constant expressed by the sequence of function calls in execute().

 

At most one instruction enters per cycle. stage_IF() consumes at most one FetchResult per tick(). If IBox did not supply a fetch (stall, miss), the IF slot is a bubble.

 

At most one instruction exits per cycle. stage_WB() retires at most one instruction per tick(). If the WB slot is empty (bubble), no retirement occurs.

 

advanceRing() is all-or-nothing. Either the ring advances (all 6 slots shift one logical position) or it does not (pipeline stalled). There is no partial advancement — you cannot advance some stages while holding others. This is enforced by the single m_head pointer.

 

Slot data is never copied between stages. Instructions do not move between PipelineSlot objects. The logical stage assignment changes via head pointer rotation. A PipelineSlot is born in one physical array position and dies in that same position. This eliminates copy overhead and simplifies pointer/reference stability.

 

See Also: 13.3 Pipeline Structure - Ring Buffer ; 13.5 Pipeline Execution - tick() and execute() ; 13.6 Stage Implementations ; B.2 - Pipeline Retirement Mechanics ; cpuCoreLib/AlphaPipeline.h.