|
Appendix M – SPAM TLB/PTE Management Mechanics
This appendix provides the authoritative reference for the SPAM (Software Page Address Map) subsystem — the software model of the Alpha AXP Translation Lookaside Buffer. SPAM is the sole mechanism by which virtual addresses are translated to physical addresses in ASA-EMulatR. Every instruction fetch, every data load, every data store passes through SPAM. Understanding this subsystem is prerequisite to understanding Chapter 17 (Translation and TLB/PTE Management), Chapter 18 (Fault Dispatcher), and Chapter 20 (PAL Emulation and Boot Sequence).
SPAM is a software TLB — a per-CPU cache of recently used page table entries (PTEs) that maps virtual page numbers (VPNs) to physical frame numbers (PFNs) with associated permissions. On a real Alpha processor, the TLB is a content-addressable memory (CAM) array in silicon. In ASA-EMulatR, SPAM replaces this hardware structure with a software-managed set-associative cache built on a 4-dimensional shard array.
SPAM is not a simple hash table. It is a full TLB implementation that supports all Alpha AXP TLB operations: insertion (from PTE fill), lookup (translation hot path), existence check (TBCHK instruction), per-VA invalidation (TBIS/TBISD/TBISI), per-ASN invalidation (TBIAP), full invalidation (TBIA), and context-switch invalidation. It models address space numbers (ASNs), granularity hints (GH) for superpages, the ASM (address space match) global bit, and instruction vs. data stream separation. It enforces the same coherency contracts that real Alpha hardware enforces in SMP configurations.
The core data structure of SPAM is a statically allocated 4-dimensional array of buckets:
m_shards[cpuId][realm][sizeClass][bucketIndex]
Each dimension serves a specific isolation purpose:
Dimension 1 — CPU ID (0..MAX_CPUS-1): Each emulated CPU owns its own complete shard. CPU 0 never touches CPU 1's buckets during lookup or insertion. This is the fundamental SMP isolation property: TLB lookups are per-CPU with zero cross-CPU contention on the hot path. Cross-CPU interaction occurs only during invalidation broadcasts.
Dimension 2 — Realm (D=0, I=1): Instruction-stream and data-stream translations are separated into distinct shard planes. The Alpha architecture defines separate ITB (instruction TLB) and DTB (data TLB), and SPAM preserves this separation. TBISI invalidates only the I-realm; TBISD invalidates only the D-realm; TBIS invalidates both. This separation also allows the I-realm and D-realm to have different population characteristics — kernel code pages may be permanently resident in the I-realm while data pages churn in the D-realm.
Dimension 3 — Size Class (GH 0..3): Alpha supports four page sizes via the Granularity Hint (GH) field in the PTE: GH=0 (8 KB), GH=1 (64 KB), GH=2 (512 KB), GH=3 (4 MB). Each size class has its own bucket plane because the VPN computation differs by page size — a 4 MB superpage masks out more VA bits than an 8 KB page. Sharding by size class eliminates the need to probe all four page sizes on every lookup when only one size is populated (which is the common case — see GH coverage bitmap, Section M.7).
Dimension 4 — Bucket Index (0..BUCKETS_PER_SHARD-1): Within a shard plane, entries are distributed across buckets using a splitmix-style hash of the VPN, realm, and size class. Bucket count is a power of two (default 128) so the index is computed by masking: hash(vpn, realm, sizeClass) & (BUCKETS_PER_SHARD - 1). ASN and the global flag are not included in the hash — they affect match, not identity.
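The index computation described above can be sketched as follows (a minimal standalone sketch: the splitmix64 constants are the standard ones, but the helper names and the exact mixing of realm/sizeClass into the hash input are illustrative, not the actual SPAM API):

```cpp
#include <cstdint>

constexpr unsigned kBucketsPerShard = 128;  // power of two, so masking works

// Standard splitmix64 finalizer: cheap, well-distributed avalanche.
inline uint64_t splitmix64(uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

// Hash VPN, realm, and size class; ASN and the global flag are deliberately
// excluded -- they qualify a match, they do not identify the bucket.
inline unsigned bucketIndex(uint64_t vpn, unsigned realm, unsigned sizeClass) {
    uint64_t h = splitmix64(vpn ^ (uint64_t(realm) << 56)
                                ^ (uint64_t(sizeClass) << 58));
    return unsigned(h & (kBucketsPerShard - 1));  // mask, not modulo
}
```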
Sizing: With MAX_CPUS=4, REALMS=2, SIZE_CLASSES=4, BUCKETS_PER_SHARD=128, and 4-way associativity, the total entry count is 4 × 2 × 4 × 128 × 4 = 16,384 entries. Each entry is compact (PFN + PermMask + epoch captures + tag), keeping the entire SPAM structure in tens of kilobytes per CPU — well within L2 cache.
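The sizing arithmetic can be checked at compile time (constants taken from the text; the names are illustrative):

```cpp
// Entry count: CPUs x realms x size classes x buckets x ways.
constexpr unsigned MAX_CPUS = 4, REALMS = 2, SIZE_CLASSES = 4,
                   BUCKETS_PER_SHARD = 128, ASSOC_WAYS = 4;
constexpr unsigned TOTAL_ENTRIES =
    MAX_CPUS * REALMS * SIZE_CLASSES * BUCKETS_PER_SHARD * ASSOC_WAYS;
static_assert(TOTAL_ENTRIES == 16384, "4 x 2 x 4 x 128 x 4 = 16384");
```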
The canonical AlphaPTE is a 64-bit wrapper around the raw PTE value from memory. It provides constexpr accessors for all architectural fields: Valid (V), Fault-on-Execute (FOE), Fault-on-Write (FOW), Fault-on-Read (FOR), Address Space Match (ASM), Granularity Hint (GH bits 6:5), physical frame number (PFN, bits 32+), and the protection enables (KRE, KWE, ERE, EWE, SRE, SWE, URE, UWE).
PTETraits is a template policy class parameterized by CPU generation (EV4/EV5/EV6) and physical address width (typically 44 bits for EV6). It provides generation-portable accessors for PFN extraction, permission mask computation, GH decoding, and PTE sanitization. All three generations share the same 8 KB base page (pageShift=13); the traits encapsulate the differences in PTE bit layout and TLB register encoding.
The compact PermMask is a 4-bit field extracted from the PTE protection enables: User-Read, User-Write, Kernel-Read, Kernel-Write. This is the only permission information cached in SPAM entries — the full PTE protection bits are reduced to this minimal set at insertion time, avoiding per-lookup bit extraction.
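A sketch of the fill-time reduction. The protection-enable bit positions follow the standard Alpha PTE layout (KRE=8, URE=11, KWE=12, UWE=15); the 4-bit layout of PermMask itself is illustrative here, not the actual encoding:

```cpp
#include <cstdint>

enum Perm : uint8_t { KR = 1, KW = 2, UR = 4, UW = 8 };  // hypothetical layout

// Reduce the PTE protection enables to the 4-bit PermMask once, at insert
// time, so the lookup path never extracts PTE bits.
inline uint8_t extractPermMask(uint64_t pte) {
    uint8_t m = 0;
    if (pte & (1ULL << 8))  m |= KR;  // KRE: kernel read enable
    if (pte & (1ULL << 12)) m |= KW;  // KWE: kernel write enable
    if (pte & (1ULL << 11)) m |= UR;  // URE: user read enable
    if (pte & (1ULL << 15)) m |= UW;  // UWE: user write enable
    return m;
}
```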
Each SPAM bucket entry consists of a tag and a payload:
SPAMTag: {VPN, sizeClass, realm}. The tag is the lookup identity. Two entries match if their tags are identical and their ASN/epoch checks pass. The tag does not include ASN because multiple ASNs may map the same VPN to different PFNs — ASN is a match qualifier, not an identity component.
SPAMEntry: {tag, pfn, permMask, asn, flags (valid, global, locked, transitioning), globalGenAtFill, asnGenAtFill, accessCount}. The entry caches the translation result (PFN + PermMask) and the epoch snapshots captured at fill time. The flags.global bit reflects the PTE's ASM bit (ASM=1 → global, survives context switches). The accessCount supports replacement policy decisions (SRRIP/Clock).
Each SPAMBucket is a bounded set-associative cache with configurable associativity (default 4-way, maximum 64-way). The bucket holds up to AssocWays entries in a flat array. Slot occupancy is tracked by a 64-bit atomic bitmap (m_occ): bit N is set if slot N contains a valid entry. This bitmap enables O(1) empty-slot detection for insertion and provides a level-zero fast-reject for lookups — if no bits are set, the bucket is empty and the lookup returns immediately without entering the seqlock.
The bucket uses a classic even/odd seqlock (m_ver) to allow concurrent readers and a single writer:
Writer (insert/invalidate): beginWrite() sets m_ver to odd (write in progress), the writer mutates entry data, endWrite() sets m_ver to even (write complete). The odd→even transition publishes all writes via release semantics.
Reader (find/probe): The reader loads m_ver with acquire semantics (v0). If v0 is odd, a writer is active — the reader spins. The reader then reads the entry data. After reading, the reader loads m_ver again (v1). If v0 == v1, the read was consistent (no writer intervened). If v0 ≠ v1, the read was torn — the reader retries from the top.
The seqlock provides lock-free reads for the common case (no concurrent writer) and bounded retry for the rare case (concurrent write). No mutex is ever acquired on the lookup hot path. The seqlock's acquire fence on the read side orders subsequent relaxed epoch loads correctly, which is critical for the lazy invalidation scheme.
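The even/odd protocol can be sketched as a standalone structure (the real SPAMBucket wraps this around its entry array and adds the m_occ fast-reject; `ver` stands in for m_ver):

```cpp
#include <atomic>
#include <cstdint>

// Minimal even/odd seqlock sketch: single writer, lock-free readers.
struct SeqLock {
    std::atomic<uint32_t> ver{0};  // even = quiescent, odd = write in progress

    void beginWrite() {            // even -> odd: announce the writer
        ver.store(ver.load(std::memory_order_relaxed) + 1,
                  std::memory_order_release);
    }
    void endWrite() {              // odd -> even: publish all writes
        ver.store(ver.load(std::memory_order_relaxed) + 1,
                  std::memory_order_release);
    }

    template <typename ReadFn>
    auto read(ReadFn fn) {
        for (;;) {
            uint32_t v0 = ver.load(std::memory_order_acquire);
            if (v0 & 1) continue;                          // writer active: spin
            auto result = fn();                            // speculative read
            uint32_t v1 = ver.load(std::memory_order_acquire);
            if (v0 == v1) return result;                   // consistent snapshot
        }                                                  // torn read: retry
    }
};
```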
Insertion uses lock-free CAS (compare-and-swap) on the occupancy bitmap to claim a free slot. The insert path scans m_occ for a clear bit, attempts a CAS to set it, and if successful writes the entry under the seqlock. If no free slot exists (all ways occupied), the replacement policy selects a victim for eviction.
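The slot-claim loop can be sketched as follows (a hedged sketch; `tryClaimSlot` is named after the method described later in this appendix, but this standalone form is illustrative):

```cpp
#include <atomic>
#include <cstdint>

// Lock-free slot claim on the 64-bit occupancy bitmap. Returns the claimed
// slot index, or -1 if every way is occupied (at which point the
// replacement policy must pick a victim).
inline int tryClaimSlot(std::atomic<uint64_t>& occ, unsigned kWays) {
    uint64_t cur = occ.load(std::memory_order_relaxed);
    for (;;) {
        uint64_t mask = (kWays < 64) ? ((1ULL << kWays) - 1) : ~0ULL;
        uint64_t free = ~cur & mask;
        if (!free) return -1;                  // bucket full
        int slot = 0;                          // find lowest clear bit
        while (!((free >> slot) & 1)) ++slot;
        if (occ.compare_exchange_weak(cur, cur | (1ULL << slot),
                                      std::memory_order_acquire,
                                      std::memory_order_relaxed))
            return slot;                       // claimed; entry written under seqlock
        // CAS failed: cur now holds the current bitmap; retry
    }
}
```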
The defining feature of SPAM is its lazy invalidation scheme. Rather than walking every entry on a context switch or TBIAP — which would be O(N) in the entry count — SPAM bumps a generation counter in O(1) and detects stale entries at lookup time by comparing captured-at-fill generations against the live epoch. An entry is considered live if and only if both epoch axes pass:
Axis 1 — Global Epoch (context-switch guard):
One std::atomic<quint32> counter per CPU: globalEpoch. Bumped on every non-ASM invalidation (context switch, PTBR write). Global entries (ASM=1) are unconditionally alive on this axis — they survive context switches by definition. Non-global entries (ASM=0) must have entry.globalGenAtFill == epochTable->globalEpoch. A mismatch means a context switch occurred since the entry was filled; the entry is lazily dead and find() returns nullptr. The entry is not touched, not cleared, not walked — it simply fails the epoch check on the next lookup and is ignored.
Axis 2 — Per-ASN Epoch (TBIAP guard):
Two arrays of 256 std::atomic<quint32> counters per CPU: itbEpoch[256] and dtbEpoch[256], indexed by ASN. Bumped per-ASN on TBIAP. Global entries skip this check entirely. Non-global entries must have entry.asnGenAtFill == getCurrent(epochTable, realm, asn). A mismatch means TBIAP was issued for this ASN since the entry was filled; the entry is lazily dead.
Combined liveness predicate:
isAlive(entry) =
    if entry.flags.global:
        return true                                  // global entries always alive
    if entry.globalGenAtFill != globalEpoch:
        return false                                 // axis 1 failed: context switch
    if entry.asnGenAtFill != realmEpoch[asn]:
        return false                                 // axis 2 failed: TBIAP
    return true                                      // both axes pass: entry is live
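Rendered as C++ against the entry fields named above (a hedged sketch; the real predicate lives inside SPAMBucket::find() and reads quint32 epochs through the attached per-CPU table):

```cpp
#include <atomic>
#include <cstdint>

struct Entry {                      // minimal stand-in for SPAMEntry
    uint32_t globalGenAtFill;       // axis 1 snapshot, captured at fill
    uint32_t asnGenAtFill;          // axis 2 snapshot, captured at fill
    bool     global;                // PTE ASM bit
};

// Relaxed epoch loads are safe: the seqlock's acquire fence in the calling
// find() provides the necessary ordering.
inline bool isAlive(const Entry& e,
                    const std::atomic<uint32_t>& globalEpoch,
                    const std::atomic<uint32_t>& realmEpochForAsn) {
    if (e.global) return true;      // global entries always alive
    if (e.globalGenAtFill != globalEpoch.load(std::memory_order_relaxed))
        return false;               // axis 1 failed: context switch
    if (e.asnGenAtFill != realmEpochForAsn.load(std::memory_order_relaxed))
        return false;               // axis 2 failed: TBIAP
    return true;                    // both axes pass: entry is live
}
```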
Invalidation cost summary:
| Operation | SPAMEpoch Method | Cost |
|---|---|---|
| Context switch / invalidateNonASM | bumpGlobal() | O(1) — one atomic increment |
| TBIAP (per ASN, both realms) | bumpBoth(asn) | O(1) — two atomic increments |
| TBISI (per ASN, I-stream only) | bumpITB(asn) | O(1) — one atomic increment |
| TBISD (per ASN, D-stream only) | bumpDTB(asn) | O(1) — one atomic increment |
| TBIA (nuke everything) | bumpGlobal() + bumpAll() | O(256) — 513 atomic increments |
| TBIS (single VA) | Walk buckets for all GH values | O(GH_COUNT) — probe + invalidate matching entries |
Memory ordering contract: All epoch bumps use memory_order_release (publishes the invalidation). All epoch reads use memory_order_relaxed — the seqlock's acquire fence in SPAMBucket::find() provides the necessary ordering. This avoids unnecessary fence overhead on the read hot path.
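The write side of this contract can be sketched as follows (a standalone rendering with uint32_t in place of quint32 and `EpochTable` standing in for the PerCPUEpochTable shown below):

```cpp
#include <atomic>
#include <cstdint>

struct EpochTable {
    std::atomic<uint32_t> globalEpoch{0};
    std::atomic<uint32_t> itbEpoch[256]{};
    std::atomic<uint32_t> dtbEpoch[256]{};
};

inline void bumpGlobal(EpochTable& t) {              // context switch / PTBR write
    t.globalEpoch.fetch_add(1, std::memory_order_release);
}

inline void bumpBoth(EpochTable& t, unsigned asn) {  // TBIAP: both realms
    t.itbEpoch[asn].fetch_add(1, std::memory_order_release);
    t.dtbEpoch[asn].fetch_add(1, std::memory_order_release);
}

inline void bumpAll(EpochTable& t) {                 // TBIA: 512 increments
    for (unsigned a = 0; a < 256; ++a) bumpBoth(t, a);
}
```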
Each emulated CPU has its own PerCPUEpochTable, allocated as a member of the SPAMShardManager and attached to every bucket during construction. The table is aligned to a 64-byte cache line to prevent false sharing between CPUs:
struct alignas(64) PerCPUEpochTable {
    static constexpr unsigned MAX_ASN = 256;
    std::atomic<quint32> globalEpoch { 0 };     // axis 1
    std::atomic<quint32> itbEpoch[MAX_ASN] {};  // axis 2 (I-realm)
    std::atomic<quint32> dtbEpoch[MAX_ASN] {};  // axis 2 (D-realm)
};
Per-CPU isolation is the critical SMP property. The epoch tables are per-CPU, not shared. When CPU 0 performs a context switch, it bumps only m_asnEpochs[0].globalEpoch. CPU 1's epoch table is untouched. This means:
On the lookup hot path, a CPU reads only its own epoch table — zero cross-CPU cache line contention. No CPU ever writes another CPU's epoch table during normal operation. The only cross-CPU interaction occurs during explicit SMP invalidation broadcasts (Section M.10), and even those are implemented by each CPU bumping its own table in response to an invalidation command.
During construction, SPAMShardManager attaches each CPU's epoch table to all of that CPU's buckets via bucket.attachEpochTable(&m_asnEpochs[cpuId], realm). Each bucket holds a pointer to its owning CPU's epoch table and its realm, which find() uses for the liveness check without any additional indirection.
Because the Alpha architecture allows four page sizes (GH 0..3) and a lookup does not know the page size a priori, SPAM must probe up to four size classes per lookup. To avoid this penalty in the common case — where only 8 KB pages (GH=0) are populated — SPAM maintains a per-CPU, per-realm GH coverage bitmap:
quint8 m_ghCoverage[MAX_CPUS][REALMS];
The bitmap encodes 8 bits — two nibbles of 4 bits each:
| Bit | Meaning | Nibble |
|---|---|---|
| 0 | GH=0 non-global entries exist | Low |
| 1 | GH=1 non-global entries exist | Low |
| 2 | GH=2 non-global entries exist | Low |
| 3 | GH=3 non-global entries exist | Low |
| 4 | GH=0 global entries exist | High |
| 5 | GH=1 global entries exist | High |
| 6 | GH=2 global entries exist | High |
| 7 | GH=3 global entries exist | High |
Set on insert: When tlbInsert() successfully inserts an entry, the corresponding bit is set via a single OR operation. Cleared on TBIA: The byte is zeroed. Not cleared on TBIAP: Other ASNs may still have entries at that GH, so clearing would create false negatives.
The bitmap is conservative and one-directional: bits are set on insert and cleared only on TBIA. A stale set bit costs one extra empty-bucket probe (killed in ~1 ns by the bucket's m_occ bitmap fast-reject). False negatives are impossible. This is an acceptable trade: the bitmap eliminates 75% of GH probes in the typical userspace case (only GH=0 populated) at the cost of a single byte load per lookup.
The bitmap is NOT atomic. It is a plain quint8 because the contract guarantees single-writer semantics: only CPU N's insert path sets bits in m_ghCoverage[N][realm], and only CPU N's TBIA clears it. No cross-CPU contention exists.
Three-level fast-reject stack (outermost to innermost):
Level 0: GH coverage bitmap — skip entire size classes (manager)
Level 1: m_occ bitmap — skip empty buckets (bucket)
Level 2: Seqlock + epoch axes — detect stale entries (bucket)
In the typical userspace workload (GH=0 only, bitmap reads 0b0001), the GH loop in tlbLookup() skips GH=3, GH=2, GH=1 immediately — each skip costs one shift + AND + branch-not-taken. The probe drops from 8 bucket lookups (4 GH × 2 global/non-global) to 2.
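The nibble encoding and the per-candidate skip test can be sketched as follows (helper names are illustrative, not the actual SPAM API):

```cpp
#include <cstdint>

// Coverage-byte encoding: low nibble tracks non-global entries per GH,
// high nibble tracks global entries per GH.
inline unsigned coverageBit(unsigned gh, bool global) {
    return gh + (global ? 4u : 0u);
}

// The skip test on the lookup path: one shift + AND + branch per candidate.
inline bool mustProbe(uint8_t coverage, unsigned gh, bool global) {
    return (coverage >> coverageBit(gh, global)) & 1u;
}
```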
SPAM implements the complete Alpha AXP TLB management instruction set:
| Instruction | SPAM Method | Action | Scope |
|---|---|---|---|
| ITB_TAG + ITB_PTE write | tlbInsert(cpuId, Realm::I, ...) | Fill I-stream TLB entry from PTE | Local CPU |
| DTB_TAG + DTB_PTE write | tlbInsert(cpuId, Realm::D, ...) | Fill D-stream TLB entry from PTE | Local CPU |
| TLB lookup (fetch) | tlbLookup(cpuId, Realm::I, va, asn, ...) | Translate VA→PA for instruction fetch | Local CPU |
| TLB lookup (load/store) | tlbLookup(cpuId, Realm::D, va, asn, ...) | Translate VA→PA for data access | Local CPU |
| TBCHK | tbchkProbe(cpuId, realm, va, asn) | Boolean existence check (no PFN returned) | Local CPU |
| TBIS (both ITB+DTB) | tbisInvalidate(cpuId, va, asn) | Invalidate VA in both realms, all GH | Local CPU |
| TBISD (DTB only) | tbisdInvalidate(cpuId, va, asn) | Invalidate VA in D-realm, both banks | Local CPU |
| TBISI (ITB only) | tbisiInvalidate(cpuId, va, asn) | Invalidate VA in I-realm | Local CPU |
| TBIAP (per-ASN) | invalidateTLBsByASN(cpuId, asn) | Epoch bump + optional sweep for ASN | Local CPU |
| TBIA (invalidate all) | invalidateAllTLBs(cpuId) | Nuke all entries for CPU, clear GH bitmap | Local CPU |
GH validation on insert: GH is extracted from the PTE (source of truth, Rule 2.1). VA and PFN alignment are validated against the claimed page size (Rule 2.2). Misaligned superpage PTEs are degraded to 8 KB (GH=0) — validation may only reduce GH, never increase it (Rule 4.4). The tag uses the validated GH verbatim (Rule 4.5).
Lookup probing: When the page size is unknown (standard lookup), SPAM probes all populated GH values largest-first, filtered by the coverage bitmap (Rule 5.1). When the page size is known (from a TAG register write), SPAM probes only that size class (Rule 5.2, via tlbLookupWithKnownGH).
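The alignment check behind Rules 2.2 and 4.4 can be sketched as follows (a hypothetical helper; misaligned claims are degraded to GH=0, and validation never raises GH):

```cpp
#include <cstdint>

// Validate a claimed GH against VA and PFN alignment. Each GH step
// multiplies the page size by 8 (GH=0 -> 8 KB ... GH=3 -> 4 MB), so a
// GH=g page spans 8^g base 8 KB frames.
inline unsigned validateGH(unsigned claimedGH, uint64_t va, uint64_t pfn) {
    const uint64_t frames = 1ULL << (3 * claimedGH);   // 8 KB frames per page
    const bool vaOk  = ((va >> 13) % frames) == 0;     // VA on a page boundary
    const bool pfnOk = (pfn % frames) == 0;            // PFN likewise aligned
    return (vaOk && pfnOk) ? claimedGH : 0;            // reduce only, never raise
}
```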
When a bucket is full (all ways occupied) and an insertion is required, a victim must be selected for eviction. SPAM supports pluggable replacement policies via template parameters:
SRRIP (Static Re-Reference Interval Prediction) — default: Each entry maintains a 2-bit RRPV (Re-Reference Prediction Value). On hit, RRPV is reset to 0 (most recently used). On insert, new entries receive a high RRPV (long re-reference distance). On eviction, the entry with the highest RRPV is selected as victim. SRRIP provides good stability under scan workloads and balances recency against frequency.
Clock (second-chance): Each entry maintains a 1-bit reference flag. On hit, the reference bit is set. On eviction, a clock hand sweeps until an entry with reference bit 0 is found. Approximates LRU with minimal metadata overhead.
Random: A random way is selected. Zero state, zero overhead. Useful as a baseline comparison.
All policies respect entry pinning: entries with the locked flag set are never selected as victims. This allows kernel text and hot trampolines to be pinned in the TLB.
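The SRRIP victim search with pin-respect can be sketched as follows (a standalone sketch; `Way` and the function name are illustrative stand-ins for the bucket's per-way metadata):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct Way { uint8_t rrpv; bool locked; bool valid; };  // 2-bit RRPV in a byte

// Victim = an unpinned entry predicted to be re-referenced farthest in the
// future (RRPV == 3). If none exists, age all unpinned entries and retry;
// RRPV saturates at 3, so the loop terminates. Returns -1 when every valid
// entry is pinned (the caller reports insertion failure).
template <std::size_t kWays>
int chooseVictimSRRIP(std::array<Way, kWays>& ways) {
    for (;;) {
        for (std::size_t i = 0; i < kWays; ++i)
            if (ways[i].valid && !ways[i].locked && ways[i].rrpv == 3)
                return int(i);
        bool anyUnpinned = false;
        for (auto& w : ways)
            if (w.valid && !w.locked) {
                anyUnpinned = true;
                if (w.rrpv < 3) ++w.rrpv;   // age toward distant re-reference
            }
        if (!anyUnpinned) return -1;        // all pinned: no victim
    }
}
```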
Dead entry reclamation: Lazily dead entries (failed epoch check) are preferred as eviction candidates — they occupy a slot but will never match a lookup. The insert path checks for dead entries before invoking the replacement policy. The optional sweepDeadForASN() method eagerly reclaims dead slots for a specific ASN; it is called after TBIAP on the local CPU but not on remote CPUs, which rely on lazy detection. Over time, if dead entries accumulate without being reclaimed, insertion may fail (bucket full) even though no live entries are present — a capacity leak. The sweep mitigates this for the common local case; the rare remote case is acceptable because remote CPUs will naturally reclaim dead entries as their insert paths encounter them.
Unlike the PA decode cache (Appendix G, Section G.5.1) which is globally shared, the SPAM TLB is strictly per-CPU. CPU 0's TLB entries are never visible to CPU 1. This matches real Alpha hardware, where each processor has its own physically separate TLB. Sharing a TLB across CPUs would require cross-CPU coherency on every lookup — defeating the purpose of having a TLB at all.
The consequence: the same virtual page mapped by multiple CPUs results in independent SPAM entries in each CPU's shard. Each CPU pays its own TLB miss cost the first time it touches a page. This is architecturally correct — real hardware behaves identically.
When the operating system modifies a page table entry and needs to ensure all CPUs stop using the stale translation, it issues a TLB shootdown. On real Alpha hardware, this is done via Inter-Processor Interrupts (IPIs). In SPAM, the cross-CPU invalidation path broadcasts the invalidation command and each CPU acts on its own epoch table:
invalidateTLBsByASN_AllCPUs(sourceCpu, asn):
    for each cpu in 0..MAX_CPUS-1:
        m_asnEpochs[cpu].itbEpoch[asn].fetch_add(1, release)
        m_asnEpochs[cpu].dtbEpoch[asn].fetch_add(1, release)
    // optional: sweep dead entries on source CPU only
    sweepDeadForASN(sourceCpu, asn)
Each CPU's epoch table is bumped by the broadcasting CPU. Subsequent lookups on the target CPUs will detect the bumped epoch and treat matching entries as dead. No locks are taken. No IPI handler executes on the target CPUs. The invalidation is published via release semantics on the epoch counter; it becomes visible to the target CPU when that CPU next enters a seqlock read (which has acquire semantics).
Why this works without an IPI handler: In a real Alpha system, the IPI is necessary because the target CPU may have the stale TLB entry cached in silicon and must be interrupted to flush its hardware TLB. In SPAM, the "hardware TLB" is software memory (the epoch table), so bumping the counter from any thread is sufficient — the target CPU will see the new epoch value on its next lookup due to the seqlock's acquire fence. No interruption of the target CPU is required for correctness.
The PerCPUEpochTable is aligned to 64 bytes (alignas(64)) to ensure each CPU's epoch data occupies its own cache line. Without this alignment, two CPUs' epoch tables could share a cache line, causing false sharing: CPU 0's epoch bump would force CPU 1's cache line to reload, even though CPU 1's data was not modified. With proper alignment, each CPU's hot-path reads (epoch checks during find()) never compete with another CPU's writes (epoch bumps during invalidation).
The Alpha architecture defines a single-bit field in each PTE: ASM (Address Space Match).
ASM=1 (global): The translation is valid for all address spaces. Kernel text, kernel page tables, and shared-memory mappings carry this bit. These entries survive context switches and are immune to the global-epoch check (axis 1). They are also immune to per-ASN invalidation (axis 2) — only TBIA or explicit TBIS can remove a global entry.
ASM=0 (process-local): The translation is valid only when the current ASN matches the one stored in the entry. On a context switch, the global epoch is bumped, instantly killing all ASM=0 entries that were filled under the old epoch — without touching them.
The GH coverage bitmap tracks global and non-global entries separately (high nibble vs. low nibble). This allows the lookup path to skip probing for global entries when none exist at a given GH value, and vice versa.
Lazy invalidation creates a correctness guarantee but introduces a capacity concern: dead entries (failed epoch check) are not automatically removed from buckets. They still occupy an m_occ bit and their slot cannot be reclaimed by a new insertion unless explicitly detected and cleared.
How dead entries accumulate: CPU 0 issues TBIAP for ASN 5. CPU 0's epoch table is bumped and sweepDeadForASN reclaims dead entries on CPU 0. However, CPU 1 also has entries for ASN 5 in its shard. The cross-CPU epoch bump makes those entries lazily dead, but no sweep runs on CPU 1. Without reclamation, CPU 1's buckets would fill with dead entries until insertion fails.
Solution: Three-phase insert with inline dead reclamation.
The bucket's insert() method uses a three-phase slot acquisition sequence that handles dead entries as a natural part of the insert path — no external sweep is required for correctness:
bucket.insert(entry, currentAsn):
    Phase 1: tryClaimSlot() — bitmap has a clear bit?
        Yes → write entry, return true (common case, ~1 CAS)
        No  → bucket is full, fall through
    Phase 2: findDeadSlot(currentAsn) — epoch-dead entry exists?
        Scans kWays entries using the same liveness predicate as find():
            Axis 1: entry.globalGenAtFill != globalEpoch → dead
            Axis 2: entry.asnGenAtFill != realmEpoch[asn] → dead
        Found → reclaimSlot(deadSlot), tryClaimSlot(), write entry
        Not found → fall through
        Cost: kWays × (2 integer compares + 1 flag check) ≈ 8 insns
    Phase 3: chooseVictim() — evict a live entry
        Replacement policy (SRRIP/Clock/Random) selects the victim.
        Locked entries are never selected.
        Found → reclaimSlot(victimSlot), tryClaimSlot(), write entry
        All locked → return false (extremely rare)
Why this solves the remote-CPU capacity leak: When CPU 0 bumps CPU 1's epoch via cross-CPU TBIAP, no sweep runs on CPU 1. CPU 1's dead entries remain in their buckets. The next time CPU 1 needs to insert into a full bucket, Phase 2 detects the dead entries via the epoch mismatch and reclaims them on demand. No separate sweep, no periodic tick, no cross-CPU coordination. Dead entries are reclaimed exactly when the capacity is needed.
Concurrency note: findDeadSlot() is a read-only scan called before the seqlock is taken. It reads entry fields that find() on another thread may also be reading. This is safe because findDeadSlot() does not modify entries — the actual reclamation (clearing the m_occ bit and overwriting the slot) happens inside the seqlock critical section. A false positive (entry appears dead but was just refilled) is harmless: the seqlock ensures the overwrite is atomic. A false negative (entry appears live but just became dead) is harmless: Phase 3 will evict a live entry instead, and the dead entry will be reclaimed on the next insert.
Additional reclamation mechanisms (defense in depth):
1. Manager-level sweep (safety net): If the bucket's three-phase insert still returns false (all entries live and locked — pathological), the manager sweeps all 256 ASNs via sweepDeadForASN() and retries. This is O(256 × kWays) and extremely rare.
2. Local sweep after TBIAP: sweepDeadForASN() walks the local CPU's buckets and clears dead entries for the invalidated ASN. This runs only on the CPU that issued the invalidation — an eager optimization, not required for correctness.
3. TBIA as hard reset: A full TBIA clears all entries and resets the GH coverage bitmap, reclaiming all capacity in a single operation.
Why 4-dimensional sharding? Each dimension eliminates a class of contention or search cost. CPU sharding eliminates SMP contention. Realm sharding separates I/D (matching real hardware). Size-class sharding eliminates multi-GH probing in the common case. Bucket indexing provides O(1) address-to-entry mapping. Together, these four dimensions reduce the hot-path lookup to: one coverage bitmap byte load, one hash computation, one bucket probe with seqlock, and two epoch comparisons. The typical hit latency is 10–20 ns.
Why lazy invalidation instead of eager walk? An eager TBIAP would walk every bucket in the local CPU's shard looking for entries matching the target ASN — O(BUCKETS × WAYS) per invalidation. With 128 buckets × 4 size classes × 2 realms × 4 ways = 4096 entries to scan, this is costly and scales poorly. Lazy invalidation reduces TBIAP to two atomic increments. The amortized cost is paid at lookup time — one integer comparison per lookup against the epoch — which is negligible compared to the address translation work already being performed.
Why per-CPU epoch tables instead of a shared global epoch? A shared global epoch would create cross-CPU cache-line contention on every context switch and TBIAP — the very operations that must be fast. Per-CPU tables ensure that a context switch on CPU 0 does not force a cache-line reload on CPU 1. The only cross-CPU writes occur during SMP invalidation broadcasts, which are infrequent and dominated by IPI latency anyway.
Why seqlock instead of mutex? The TLB lookup is the single most performance-critical operation in the emulator — every instruction fetch and every data access requires a translation. A mutex on the lookup path would serialize all translations on a CPU, destroying throughput. The seqlock allows concurrent reads with zero contention and adds a retry loop only on the rare event of a concurrent write. In practice, writes (insertions and invalidations) are orders of magnitude less frequent than reads (lookups), making the seqlock optimal for this workload.
Why not share TLBs across CPUs like the PA decode cache? The PA decode cache (Appendix G) can be shared because decoded instructions at a physical address are CPU-independent — the same instruction bits produce the same grain. TLB entries cannot be shared because they carry ASN-specific and CPU-specific state (which process is running on which CPU). A shared TLB would require cross-CPU ASN matching and coherency checks on every lookup — adding overhead to the hottest path in the system for a sharing benefit that rarely materializes (most TLB entries are process-local).
See Also: Chapter 17 – Address Translation, TLB, and PTE (architectural context); Chapter 18 – Fault Dispatcher & Precise Exceptions (TLB miss faults); Chapter 19 – Interrupt Architecture & IPI (TLB shootdown protocol); Chapter 20 – Boot Sequence, PAL, and SRM Integration (ITB/DTB fill in PAL); Appendix G - Instruction Grain Mechanics (contrast: shared PA decode cache vs per-CPU TLB); pteLib/alpha_spam_manager.h; pteLib/alpha_spam_bucket.h; pteLib/SPAMEpoch_inl.h; pteLib/alpha_pte_traits.h; pteLib/alpha_pte_core.h.