Navigation: ASA-EMulatR Reference Guide > Introduction > Appendix > Appendix G – Instruction Grain Mechanics
This appendix provides the authoritative reference for the instruction grain subsystem — the mechanism by which Alpha AXP instructions are represented, resolved, cached, and executed. The grain is the central abstraction in ASA-EMulatR: it mediates every instruction execution from decode through retirement. Understanding the grain lifecycle is prerequisite to understanding any chapter that discusses instruction execution.
An instruction grain is a lightweight, immutable, pre-registered C++ object that encapsulates the identity and execution behavior of a single Alpha AXP instruction type. One grain exists for each unique combination of opcode, function code, and platform — approximately 616 grains cover the entire Alpha AXP instruction set including platform-specific variants for VMS, Tru64/Unix, and PAL-internal operations.
Flyweight pattern: Grains are not allocated per-instruction. They are singletons — one shared object per instruction type, pre-registered at program startup. Every ADDQ instruction executed during the lifetime of the emulator uses the same ADDQ grain object. The grain carries no per-instance state; all per-instruction data (operands, results, faults) lives in the PipelineSlot. This is the GoF flyweight pattern: shared intrinsic state (the grain) with extrinsic state passed in via the slot.
What a grain knows: its opcode, its function code, its mnemonic, its instruction format (operate, memory, branch, PAL), its grain type classification, its target platform, and how to execute itself given a PipelineSlot. What a grain does not know: the current PC, the register values, the memory address, the fault state. All of those belong to the slot.
Every grain inherits from InstructionGrain, which defines the polymorphic interface:
struct InstructionGrain {
    // ── Data Members (8 bytes) ────────────────────────────
    quint32 rawBits;      // Original 32-bit instruction (template default)
    quint8  flags;        // Packed format/issue flags (GrainFlags)
    quint8  latency;      // Expected cycle latency
    quint8  throughput;   // Instructions per cycle (reciprocal)
    // ── Vtable pointer (8 bytes) ──────────────────────────
    // Total: 16 bytes per grain

    virtual ~InstructionGrain() = 0;

    // ── Virtual Interface ─────────────────────────────────
    virtual quint8        opcode() const = 0;
    virtual quint16       functionCode() const = 0;
    virtual QString       mnemonic() const = 0;
    virtual GrainType     grainType() const = 0;
    virtual void          execute(PipelineSlot& slot) const noexcept = 0;
    virtual GrainPlatform platform() const {
        return GrainPlatform::NONE;
    }

    AXP_ALWAYS_INLINE bool canDualIssue() const noexcept {
        return (flags & GF_CanDualIssue) != 0;
    }
};
Memory footprint: 8 bytes of data + 8 bytes of vtable pointer = 16 bytes per grain. With 616 grains, the entire grain registry occupies approximately 10 KB. This fits comfortably in L1 cache.
The execute() contract: execute(PipelineSlot& slot) is the sole entry point for instruction execution. Every grain overrides this method. It takes a PipelineSlot by reference, performs all work through box delegation, and returns void. All results, faults, and stall conditions are communicated by modifying the slot in place — never by return value. This is the "letter box" pattern: the slot is both the input envelope and the output mailbox.
The rawBits field: rawBits on the grain is a template default, not the current instruction's bits. The actual instruction bits for the executing instruction are in slot.di.rawBits(). The grain is a shared singleton — it cannot carry per-instruction data.
The authoritative definition of all Alpha AXP instructions is a tab-separated table (GrainMaster.tsv) containing one row per instruction type. Each row specifies: mnemonic/qualifier, opcode (hex and decimal), function code, instruction format, architecture (VMS, Unix, PAL, Alpha), and classification metadata.
This table is the single source of truth. All grain header files, registration code, and validation are generated from this table. Corrections to instruction encodings are made in the table and regenerated — never by manually editing individual grain headers.
A Python code generator (python/generate_all_grains.py) reads GrainMaster.tsv and produces:
Per-grain inline headers — one header per instruction type, organized by category (grains/generated/PAL/, grains/generated/Integer/, grains/generated/FloatingPoint/, grains/generated/Memory/, grains/generated/Branch/). Each header defines a class inheriting from InstructionGrain, implementing all virtual methods, and including the REGISTER_GRAIN macro.
A master include file — a single header that includes all generated grain headers, ensuring every grain is compiled and registered.
The generator validates the input table during generation: detecting opcode/function code collisions, flagging architecture conflicts (VMS vs Unix overlapping encodings), verifying function code ranges, and ensuring no duplicate definitions. Generation fails on validation errors.
Each generated grain header includes the REGISTER_GRAIN macro (from GrainRegistrationCore.h):
#define REGISTER_GRAIN(GrainType) \
    static GrainAutoRegistrar<GrainType> \
        GRAIN_CONCAT(autoReg_, __COUNTER__)
GrainAutoRegistrar is a template struct whose constructor creates a static instance of the grain class and registers it with the InstructionGrainRegistry singleton:
template<typename GrainType>
struct GrainAutoRegistrar {
    GrainAutoRegistrar() {
        static GrainType grain;
        InstructionGrainRegistry::instance().registerGrain(
            grain.opcode(), grain.functionCode(), &grain);
    }
};
Registration occurs during static initialization, before main() executes. By the time AlphaCPU initializes, all 616 grains are registered and ready for lookup. No runtime allocation occurs — every grain is a static object with program lifetime.
InstructionGrainRegistry is a global singleton that stores all registered grains in a flat hash map keyed by makeGrainKey(opcode, functionCode). The key combines the 6-bit opcode and the function code into a single lookup value. The registry supports platform-aware lookup: it first attempts an exact match with the specified GrainPlatform, then falls back to GrainPlatform::NONE (architecture-universal grains) if no platform-specific match exists.
GrainResolver is the runtime entry point for grain lookup. IBox calls GrainResolver::instance().resolve(pc, opcode, func) when decoding an instruction. The resolver delegates to the registry's lookup(opcode, func, platform) method. If lookup returns nullptr, the instruction is unimplemented — IBox records it as nullptr in the DecodedInstruction and the pipeline will fault with ILLEGAL_INSTRUCTION when the empty grain reaches stage_EX().
The resolver holds a configurable platform setting (m_overridePlatform) defaulting to GrainPlatform::VMS. This allows the same binary to emulate different PAL operating system personalities by switching the platform before boot.
Five opcodes are PAL-mode hardware instructions whose "function code" field carries a variable operand (typically an IPR index) rather than a fixed encoding: HW_MFPR (0x19), HW_LD (0x1B), HW_MTPR (0x1D), HW_REI (0x1E), HW_ST (0x1F). For these opcodes, the registry cannot match on function code because the function code varies per instruction instance (e.g., HW_MTPR to SIRR uses function code 0x0018, HW_MTPR to ASTRR uses 0x0019).
The resolution strategy: on exact-match miss, if the opcode is one of the five PAL hardware opcodes, the resolver falls back to lookup(opcode, 0x0000) — the wildcard registration. Each PAL hardware grain is registered with function code 0x0000, and the actual IPR index or operand is extracted from slot.di.rawBits() at execution time by the grain's execute() method. This avoids registering thousands of per-IPR grains for a handful of PAL opcodes.
IBox accesses two independent decode caches through global singleton accessors (pcDecodeCache() and paDecodeCache()). Both caches store DecodedInstruction records — not grain pointers alone. A DecodedInstruction contains the grain pointer, decoded register indices (ra, rb, rc), literal value, branch displacement, instruction semantics, PC, and physical address. This means decode happens once; all subsequent encounters of the same instruction skip the entire decode and grain resolution path.
The two caches serve fundamentally different roles and have different sharing semantics:
PA Cache (Physical Address) — Global, Shared Across All CPUs
Keyed by physical address (PaKey::fromPA(pa)). The PA cache is a single global instance shared by all CPUs. A decoded instruction at physical address 0x5000 is identical regardless of which CPU decodes it — the same physical memory contains the same instruction bits, resolving to the same grain, with the same decoded register fields. When CPU 0 decodes an instruction and inserts it into the PA cache, CPU 1 encountering the same physical address gets an immediate hit and skips the entire decode path. This is the "decode once, shared everywhere" principle.
The PA cache survives context switches because physical addresses are identity-stable — a context switch changes virtual-to-physical mappings but does not change the code at a physical address. The PA cache is invalidated only on self-modifying code (a store to a physical address that has a cached decode) and page unmap (the physical page is recycled for different content).
PC Cache (Virtual Address) — Global Singleton, Internally Sharded by CPU
Keyed by virtual PC (PcKey::fromVA(pc)). The PC cache is a single global singleton but is internally sharded by CPU ID. Each CPU has its own partition within the PC cache because virtual-to-physical mappings are per-process — two CPUs running different processes may map the same virtual address to different physical pages. CPU 0's PC=0x10000 may decode to ADDQ while CPU 1's PC=0x10000 (in a different process) may decode to SUBQ.
The PC cache is the fast-path cache. On sequential code execution within a single process, the PC cache provides a direct hit on the next instruction. The PC cache is invalidated per-CPU on context switch (VA mappings change for that CPU), TBIA (TLB invalidate all), and ITBIS (instruction TLB invalidate single). Invalidation uses an O(1) generation counter bump — no entry walking is required.
Cache hierarchy summary:
| Cache | Key | Scope | Structure | Survives Context Switch |
|---|---|---|---|---|
| PC Cache | Virtual PC >> 2 | Per-CPU (sharded) | 4-way set-associative, 4096 buckets | No — flushed per-CPU |
| PA Cache | Physical Address >> 2 | Global (shared, all CPUs) | Direct-mapped or set-associative | Yes — physical identity unchanged |
Memory footprint: The PA cache is a single shared instance — its memory cost is fixed regardless of CPU count. The PC cache is sharded by CPU, so its memory scales with CPU count. In a 4-CPU configuration: one shared PA cache (~64–128 KB) plus four PC cache shards (4 × ~2 MB = ~8 MB). Total decode cache memory: approximately 8–9 MB for a 4-CPU system. Compare to the naive approach of duplicating both caches per CPU (4 × ~4 MB = ~16 MB with no cross-CPU decode sharing).
Concurrency: Both caches use lock-free reads via seqlock/versioning. The PA cache (shared across CPUs) requires atomic operations for concurrent inserts from multiple CPUs — a seqlock on each bucket ensures torn reads are detected and retried. The PC cache shards are per-CPU with no cross-CPU contention on reads or writes within a shard.
The instruction fetch and decode flow follows a lazy decode-once pattern — the grain is resolved and the instruction is decoded only on the first encounter. All subsequent encounters use the cached DecodedInstruction:
IBox::fetchAndDecode(pc, pa, cpuId):
┌─ STEP 1: PC Cache Lookup (fastest path, per-CPU) ────────────┐
│ PcKey pcKey = PcKey::fromVA(pc) │
│ cached = pcDecodeCache().lookup(cpuId, pcKey) │
│ if (cached && cached→physicalAddress == pa) │
│ return cached ← HIT (~4–7 cycles) │
└──────────────────────────────────────────────────────────────┘
│ miss
┌─ STEP 2: PA Cache Lookup (shared, all CPUs) ─────────────────┐
│ PaKey paKey = PaKey::fromPA(pa) │
│ cached = paDecodeCache().lookup(paKey) │
│ if (cached) │
│ pcDecodeCache().insert(cpuId, pcKey, *cached) ← promote │
│ return cached ← HIT (~10–20 cycles) │
│ (another CPU already decoded this instruction) │
└──────────────────────────────────────────────────────────────┘
│ miss
┌─ STEP 3: Full Decode (cold path) ────────────────────────────┐
│ rawBits = m_guestMemory→readInst32(pa) │
│ opcode = extractOpcode(rawBits) │
│ func = extractFunction(rawBits) │
│ grain = GrainResolver::resolve(pc, opcode, func) │
│ di = buildDecodedInstruction(grain, rawBits, pc, pa) │
│ paDecodeCache().insert(paKey, di) ← shared globally │
│ pcDecodeCache().insert(cpuId, pcKey, di) ← per-CPU shard │
│ return di ← MISS (~135–370 cycles) │
└──────────────────────────────────────────────────────────────┘
For hot loops, the PC cache hit rate approaches 100% — the entire decode and resolution cost is amortized to zero. For cold code (first execution, post-flush, post-context-switch), the full decode path runs once and populates both caches. In SMP configurations, the PA cache provides cross-CPU sharing: when CPU 0 decodes a shared library function, CPUs 1–3 hitting the same physical address get an immediate PA cache hit without repeating the decode.
| Event | PC Cache | PA Cache | Scope |
|---|---|---|---|
| Context switch | Flush (generation bump) | Untouched | Single CPU (the switching CPU) |
| TBIA (invalidate all TLB) | Flush (generation bump) | Untouched | Single CPU |
| ITBIS (invalidate single VA) | Invalidate entry | Untouched | Single CPU |
| Self-modifying code (store to code PA) | Flush (conservative) | Invalidate PA entry | All CPUs |
| Page unmap (PFN recycled) | Untouched | Invalidate page range | All CPUs (shared PA cache) |
The key principle: PC cache invalidations are virtual events (the VA→PA mapping changed, not the code). PA cache invalidations are physical events (the code itself changed or the physical page was recycled). This separation is what allows the PA cache to be shared globally and survive context switches — physical identity is stable across processes and CPUs.
Self-modifying code note: On self-modifying code detection (a store to a physical address with a cached decode), the PA cache entry is invalidated globally (all CPUs see the invalidation because it is a single shared instance). The PC cache is flushed conservatively for all CPUs because any CPU may hold a PC cache entry pointing to the now-stale physical decode. This is the most expensive invalidation event but is extremely rare in practice — Alpha AXP code is not self-modifying outside of PAL loader operations.
Every instruction grain follows the same execution contract without exception. There is no special-case logic for any instruction category. The contract:
1. The pipeline calls grain→execute(slot). This is the only entry point. The pipeline does not inspect the grain type, does not branch on instruction category, and does not call any box directly. The grain is the sole dispatcher.
2. The grain calls box→executeXXX(slot). The grain knows which box handles its instruction class: EBox for integer operations, FBox for floating-point, MBox for memory access, PalBox for PAL functions. The grain calls the specific box method for its instruction (e.g., slot.m_eBox→executeAdd(slot)). The grain does not perform the computation itself — it delegates to the box.
3. The box performs all work and modifies the slot. The box reads operands from the register file via the slot, performs the computation, and writes results back into the slot: slot.payLoad for the result value, slot.needsWriteback = true if a register write is pending, slot.faultPending = true with slot.trapCode if a fault was detected, slot.stalled = true if the operation cannot complete this cycle.
4. The grain returns. No return value. The pipeline inspects the slot after execute() returns and takes action based on the slot's state: advance to MEM (normal), stall (slot.stalled), prepare fault delivery (slot.faultPending), or flush younger stages (slot.flushPipeline for branch misprediction).
No grain ever modifies architectural state directly. No grain writes to the register file. No grain writes to memory. No grain updates the PC. All side effects flow through the slot and are committed by the pipeline at the appropriate stage (commitPending in MEM, store commit in WB, PC update at retirement). This invariant is what makes the pipeline restartable — a faulting or flushed instruction's grain execution has no observable side effect.
The grain execution model has no execution bias. Every grain — integer arithmetic, floating-point, memory load, memory store, branch, barrier, CALL_PAL — follows the identical four-step contract described in Section G.6. The pipeline does not prioritize, reorder, or treat any grain category differently from any other.
Specifically:
No opcode-based dispatch in the pipeline. The pipeline never inspects slot.di.opcode to decide what to do. It calls grain→execute(slot) unconditionally. The grain's virtual dispatch (vtable call) is the sole routing mechanism.
No fast-path / slow-path distinction. Integer operations do not get a cheaper execution path than floating-point operations at the pipeline level. Latency differences are modeled within the box (FBox may set slot.stalled for multi-cycle FP), not by the pipeline treating the grain differently.
No instruction-specific pipeline logic. The only instruction-specific behavior in the pipeline is in stage_WB(), where isCallPal(slot.di) is checked for the CALL_PAL serialization path. This is a retirement concern (Section K.6), not an execution concern — the grain itself executed through the standard path in stage_EX() like any other instruction.
Why this matters: Bias-free execution guarantees that adding a new instruction type requires only a new grain class and registration — zero pipeline modifications. The pipeline is closed to modification, open to extension. This is the primary maintainability benefit of the grain architecture.
Each grain routes to exactly one execution box. The routing is determined at compile time by the grain's execute() implementation — there is no runtime dispatch table.
| Box | Grain Types | Examples | Access Pattern |
|---|---|---|---|
| EBox | Integer arithmetic, logical, shift, compare, conditional move | ADDQ, SUBQ, AND, BIS, SRL, CMPEQ, CMOVNE | slot.m_eBox→executeXXX(slot) |
| FBox | Floating-point arithmetic, conversion, comparison | ADDT, MULT, DIVT, CVTQS, CMPTEQ, SQRTS | slot.m_fBox→executeXXX(slot, variant) |
| MBox | Memory load, store, unaligned access, LDA/LDAH | LDQ, STQ, LDQ_U, STQ_U, LDA, LDAH, LDL, STL | slot.m_mBox→executeXXX(slot) |
| CBox | Memory barriers, cache hints | MB, WMB, EXCB, TRAPB, FETCH, FETCH_M, ECB | slot.m_cBox→executeXXX(slot) |
| PalBox | PAL functions | CALL_PAL (HALT, CSERVE, SWPCTX, etc.) | slot.m_palBox→executeCallPal(slot) |
| IBox | Branch resolution (condition evaluation only) | BEQ, BNE, BGT, BR, BSR, JMP, JSR, RET | Grain evaluates condition directly from slot register values |
| (None) | NOP-like instructions | UNOP, FNOP, WH64 | Grain returns immediately (no box call) |
FP variant routing: Floating-point grains decode the instruction variant (/S, /SU, /SUC, /SUI, /D, /C, etc.) from the function code before calling the box. The variant is passed as a parameter to the FBox method: slot.m_fBox→executeAdd(slot, variant). The FBox uses the variant to select rounding mode, trap handling, and IEEE compliance behavior.
Box access: Boxes are owned by AlphaCPU and pointers are published directly to the PipelineSlot. The slot holds direct box pointers: slot.m_eBox, slot.m_fBox, slot.m_mBox, slot.m_cBox, slot.m_palBox. The call chain slot.m_eBox→executeAdd(slot) resolves to a single pointer dereference — no intermediate accessor, no APC indirection. This is a hot path (called for every instruction execution) and is always in L1 cache.
The 616 grains are organized into categories that correspond to the Alpha AXP instruction set architecture. Categories determine the generated header directory and the GrainType classification:
| Category | Directory | Approx. Count | Opcode Range |
|---|---|---|---|
| Integer Operate | grains/generated/Integer/ | ~120 | 0x10–0x13 |
| Floating-Point Operate | grains/generated/FloatingPoint/ | ~250 | 0x14–0x17 |
| Memory Access | grains/generated/Memory/ | ~50 | 0x08–0x0F, 0x20–0x2F |
| Branch | grains/generated/Branch/ | ~20 | 0x30–0x3F, 0x1A |
| PAL (CALL_PAL functions) | grains/generated/PAL/ | ~60 | 0x00 |
| PAL Hardware | grains/generated/PAL/ | ~5 | 0x19, 0x1B, 0x1D, 0x1E, 0x1F |
| Miscellaneous / Barrier | grains/generated/Misc/ | ~15 | 0x18 (MISC function codes) |
Floating-point grains account for approximately 40% of the total count because each FP mnemonic (ADDT, SUBT, etc.) is expanded into multiple variants representing different IEEE rounding and trap modes (/S, /SU, /SUC, /SUI, /D, /C, etc.). Each variant has a unique function code and a dedicated grain.
The complete lifecycle of a grain from creation to consumption:
BUILD TIME:
generate_all_grains.py reads GrainMaster.tsv
→ produces ~616 inline header files
→ produces master include file
COMPILE TIME:
REGISTER_GRAIN macro creates static GrainAutoRegistrar objects
→ each registrar creates a static grain instance (program lifetime)
→ each registrar calls InstructionGrainRegistry::registerGrain()
→ 616 grains registered before main() executes
RUNTIME — First encounter of an instruction:
IBox::fetchAndDecode(pc, pa)
→ PC cache miss → PA cache miss
→ readInst32(pa) fetches raw bits from memory
→ GrainResolver::resolve(pc, opcode, func) → grain pointer
→ buildDecodedInstruction(grain, raw, pc, pa) → full decode
→ insert into PA cache and PC cache
→ return DecodedInstruction to pipeline via FetchResult
RUNTIME — Subsequent encounters:
IBox::fetchAndDecode(pc, pa)
→ PC cache hit → return cached DecodedInstruction
(grain resolution and decode cost = zero)
RUNTIME — Execution:
stage_EX() → slot.grain→execute(slot)
→ grain calls box→executeXXX(slot)
→ box reads operands, computes, writes slot.payLoad
→ grain returns → pipeline inspects slot state
RUNTIME — Retirement:
stage_WB() → commitPending() writes register result
→ store commit writes SafeMemory (if store)
→ predictor trained (if branch)
→ instruction retired
(grain object is untouched — it is stateless, reusable forever)
Why flyweight singletons instead of per-instruction objects? The Alpha AXP has 616 instruction types but executes billions of instructions per boot. Allocating an object per executed instruction would generate enormous allocation pressure. Flyweight singletons reduce the grain subsystem's memory footprint to ~10 KB regardless of execution volume.
Why virtual dispatch instead of function pointers or switch statements? Virtual dispatch via the vtable provides O(1) polymorphic execution with type safety. A switch statement on opcode would require the pipeline to know about instruction types — violating the bias-free execution principle. Function pointers would work but lack the encapsulation and type safety of the class hierarchy. Virtual dispatch also enables derived grain classes (e.g., template-based FP grains) to share implementation across variants.
Why code generation from a table instead of hand-written grains? With 616 instruction types, manual maintenance is unsustainable. Encoding errors (wrong opcode, wrong function code, missing registration) are the most common source of grain bugs. The table-driven approach ensures that corrections propagate uniformly and instantly. Validation is automated as part of generation.
Why dual decode caches instead of a single cache? Virtual addresses change on context switch; physical addresses do not. A single VA-keyed cache would require full invalidation on every context switch, flushing hot decode state for code that hasn't changed. The dual cache allows the PA cache to survive context switches while the PC cache is rebuilt from PA cache promotions — reducing cold-start cost after a switch.
Why no execution bias? Bias-free execution means the pipeline is a fixed machine that processes any instruction identically. New instructions require only a new grain and registration — no pipeline code changes. This is critical for a 616-instruction ISA where instruction-specific pipeline logic would create an unmaintainable web of special cases.
See Also: Chapter 13 – AlphaPipeline Implementation (stage_EX grain execution); Chapter 14 – Execution Domains ("Boxes") (box routing); B.2 – Pipeline Retirement Mechanics (deferred writeback, store commit); B.1 – Pipeline Cycle Mechanics (stage_EX in context); grainFactoryLib/InstructionGrain.h; grainFactoryLib/InstructionGrainRegistry.h; grainFactoryLib/GrainResolver.h; grainFactoryLib/GrainMaster.tsv.