Emergent Hardware Verification

Chapter M AI Hardware Systems with Actors: GPUs, Accelerators, and Distributed Compute

Chapter 6 developed Emergent Verification (EV) with the actor model as its substrate, and §6.10 showed that the same model carries through specification, architecture exploration, synthesis, FPGA emulation, and AI-driven RTL generation. The methodology argument was: pick a structural model that matches the silicon once, and the choice carries through every subsequent phase of the design flow.

This appendix follows the argument one step further — not by leaving hardware behind, but by climbing the hardware-systems ladder. Chapters 1–6 anchored EV at the IP, protocol, SoC, and chiplet-assembly level. Modern AI compute extends that ladder upward.

GPUs are massively parallel SoCs in their own right, with thousands of execution units, on-chip networks, and dedicated tensor cores. AI accelerators — Google’s Tensor Processing Unit (TPU), AWS Trainium, Microsoft Maia, Cerebras Wafer-Scale, Groq LPU, Tenstorrent Tensix — are silicon designed specifically for the linear-algebra workloads of modern neural networks.

Distributed training and inference clusters wire many such devices together with NVLink, RDMA over Converged Ethernet (RoCE), and InfiniBand fabric, behaving as one logical hardware system spanning racks. Robotic platforms close the loop with sensors, actuators, and embedded compute.

Each is a hardware system at a different scale, and each is structurally an actor system. Picking the actor model once carries forward through all of them — through both their design (the silicon and the cluster topology) and their programming (the kernels, the training pipeline, the inference engine).

The contribution of this appendix is to make the carry-through to AI hardware systems explicit. The framework introduced in Chapter 6 — the core actor_pkg plus its ten sibling packages — already contains the primitives that GPU programming, AI accelerator design, neural-network execution, and inference serving each re-invent in their own vocabulary.

One authored actor, one more rendering target. The carry-through has a sharper form in the book’s own terms. The thesis of the synthesis appendix (Appendix E) is that a single authored actor is rendered per substrate — a SystemVerilog class for in-simulator verification, synthesizable RTL for an FPGA or emulator, a C++ object for the host (Appendix K), a SystemC module for high-level synthesis (Appendix J) — and the renderings are the same definition, not rewrites. AI hardware adds two rendering targets to that list: the GPU kernel (a warp or persistent kernel pulling from a device-resident mailbox) and the accelerator tile (a Cerebras processing element or a Tenstorrent Tensix tile with private SRAM and hardware message ports). The lowering that takes an actor topology to a GPU kernel is the same mechanical compiler problem as the lowering that takes it to RTL (Appendix E §E.5; Appendix H §H.3), pointed at a different backend. The first-in-industry automatic conversion the book demonstrates — one synthesizable actor graph crossing from simulation to FPGA with no manual rewrite (Chapter 7) — is the mechanism; AI hardware is the highest rung of the ladder it reaches.

§M.1 maps GPU programming primitives to actor concepts. §M.2 establishes the neuron–actor isomorphism. §M.3 expresses neural networks as actor topologies, with the backward pass falling out of the lineage infrastructure the framework already provides.

§M.4 catalogues the framework primitives that ML systems re-implement under different names. §M.5 shows how the four standard parallelism strategies of distributed ML training — data, pipeline, tensor, and expert — become four ‘WIRE topologies over the same primitives. §M.6 shows how modern fabrics — RDMA NICs, programmable switches, DPUs, and on-chip meshes — have themselves converged on the actor model in silicon.

§M.7 treats inference serving (vLLM, TensorRT-LLM, Triton) as an actor topology with backpressure. §M.8 extends the model to robotic and embodied AI hardware systems. §M.9 states the unifying claim: the actor model is the structural model of concurrent hardware systems at every scale.

M.1 GPU Programming as Actor Programming

Modern GPU programming is fragmented across a stack of abstractions. A working CUDA application coordinates kernel launches, manual host-device memory transfers (cudaMemcpy), CUDA streams for asynchronous concurrency, CUDA graphs for compiled pipelines, NVIDIA Collective Communications Library (NCCL) collectives for multi-GPU communication, and Message Passing Interface (MPI) or Remote Direct Memory Access (RDMA) primitives for multi-host coordination.

Each layer has its own model, its own debugging story, and its own performance-tuning vocabulary. Production inference engines like vLLM and TensorRT-LLM are essentially hand-coded coordinations across all of these layers.

The reason this is hard is not the silicon. The silicon is a regular, tractable concurrent system: streaming multiprocessors (SMs) execute independently, warps are independent execution units, NVLink and PCIe are transport substrates, and high-bandwidth memory (HBM) is shared between actor populations.

The GPU is structurally an actor system, with the same proof that motivated Chapter 6’s hardware-actor isomorphism in §6.3. What makes GPU programming hard is that the dominant interface — CUDA — exposes the implementation primitives (threads, blocks, grids, memory tiers, streams) instead of the structural model (independent units passing messages). Programmers learn to think in CUDA’s vocabulary because the tooling does not give them another option.

The mapping

The relationship between GPU primitives and actor concepts is direct, and a mapping table makes it concrete:

• Thread / warp / block / grid — different placement granularities for an actor. A warp (32 threads on NVIDIA hardware) is the natural unit because it has shared local memory and executes cooperatively.
• Shared memory (per-block) — per-actor state when the actor is placed at block granularity.
• Global memory (HBM) — cross-actor message storage.
• Cooperative groups, __shfl_sync — in-warp message passing.
• Persistent kernel polling a queue — an actor’s run() loop reading its mailbox.
• CUDA stream — an actor’s execution context.
• CUDA graph — a compiled ‘WIRE topology.
• NCCL all-reduce, all-gather, reduce-scatter — distributed transport bridges, the same role as the bridges in actor_distributed_pkg with different codecs.
• NVLink, PCIe, RDMA — alternative transport substrates with different bandwidth and latency characteristics.
• Warp specialization (FlashAttention-3, Hopper) — different actors per warp role: load (producer) actors and compute (consumer) actors.

The last entry deserves attention. FlashAttention-3 and the warp-specialization patterns introduced for NVIDIA Hopper architecture organize a thread block as a producer-consumer pipeline: some warps (the producers) issue memory loads while others (the consumers) perform the matrix multiplication.

The pattern was developed for performance reasons — separating the roles allows latency hiding and reduces register pressure — but the result is a small actor system inside a single thread block, with each warp acting as one actor and the shared memory acting as the inter-actor mailbox. The pattern has actor structure without actor vocabulary.

Why CUDA feels hard

Programmers learn CUDA because no high-level abstraction has decisively replaced it for general-purpose GPU work. The closest substitutes — Halide for image-processing pipelines, TVM for deep-learning operator scheduling, JAX with XLA for functional dataflow — all start from a different abstraction (tensors, operators, schedules) and rediscover the underlying topology through compiler analysis.

Tensors are the wrong starting point for a model whose natural shape is concurrent message passing, because they discard the topology that the compiler then has to reinfer.

The actor model preserves the topology by construction. A topology of actors with typed-message ‘WIRE edges already says everything the compiler needs to know about which work depends on which, what can run in parallel, and where memory traffic occurs. The lowering from actor topology to GPU kernels (CUDA, HIP, oneAPI, Metal, ROCm, SYCL) is a code-generation problem, not an abstraction-reconstruction problem.

Actors as front-end, GPU primitives as backend

The structural shape of the proposal mirrors how programming languages relate to silicon already. Programmers write structured code in C++ or Rust, the compiler emits machine instructions, and nobody hand-writes assembly except in narrow performance-critical paths. The actor framework plays the same role for GPU work: programmers write actor topologies with typed messages, the framework lowers them to GPU kernels with appropriate placement, and CUDA or HIP becomes the equivalent of the assembly layer that exists but is rarely touched.

The lowering is not vapor; the literature has the pieces. CUDA graphs already accept a static topology and lower it to a launchable graph with minimal per-iteration overhead. Persistent kernels already run continuously and pull work from device-resident queues. Warp specialization already partitions a thread block into role-specific cooperating units. What is missing is a single front-end abstraction that a methodology can be built around. The actor framework supplies that abstraction.

The lowering is also not hypothetical in kind. The book already demonstrates a mechanical lowering of an authored actor to a hardware backend: a counter actor placed and routed on an iCE40 FPGA at 126.6 MHz (Appendix E), and the whole substrate-swap example verification loop — DUT, stimulus, scoreboard, coverage — synthesized to one fabric and run on both substrates with no manual rewrite: the four actors as software objects and the same four as the synthesized fabric, identical results under Verilator and under FireSim metasimulation, with the FPGA bitstream path scaffolded through FireSim (Chapter 7). That lowering carries a formal cycle-exactness guarantee through the Golden Gate / latency-insensitive-bounded-dataflow transform (Appendix E §E.5). The GPU placement layer is the same compiler aimed at a different backend — CUDA graphs in place of a netlist, persistent kernels in place of flip-flops, warp specialization in place of pipeline registers. What is established is the kind of lowering; what remains is the GPU backend itself, which §M.9 scopes as engineering.

Orchestration and computation, from top to bottom

Actors in this framework compute as well as orchestrate. An actor’s act() body holds arbitrary computation; the hardware-actor isomorphism in Chapter 6 (§6.3) established the correspondence at the level of flip-flops and logic gates, and Appendix E shows the same actor compiled to synthesizable RTL — a counter actor synthesized to 98 standard cells, and a two-instance chain of them placed and routed for iCE40 at a 126.6 MHz Fmax bound. The framework spans the full hardware-systems ladder, in one structural model, from a single gate up through distributed transports.

This vertical span is what eliminates the intermediate plumbing layers that today’s systems require. A working CUDA program contains a substantial fraction of code that is not computation per se: explicit host-device memory transfers, stream and event synchronization, kernel-launch parameter computation, shared-memory allocation, thread-block sizing. That code exists because the host programming model and the GPU programming model are different abstractions; the plumbing translates between them. The same pattern repeats at every abstraction boundary — between OOP testbench and SIMT kernel, between MPI rank and CUDA stream, between inference scheduler and kernel queue. Each boundary requires its own bridge code.

Under one structural model that spans both sides of every abstraction boundary, those bridges become compiler lowering targets rather than user-maintained code. The silicon is unchanged; the surface area of code the programmer maintains is smaller, because the abstraction layers that produced the plumbing are gone. This is the strongest version of the front-end argument from §M.1: not that actors are a nicer way to organize CUDA code, but that the actor model at every rung of the hardware-systems ladder removes the need for many of the layers between today.

Designing the silicon, not just programming it

The methodology applies symmetrically to programming GPUs and to designing the next generation of AI hardware. Google TPU, AWS Trainium and Inferentia, Microsoft Maia, Cerebras Wafer-Scale Engine, Groq LPU, Tenstorrent Tensix, and the Apple Neural Engine are silicon being designed today by hardware teams that face exactly the verification, synthesis, and AI-RTL problems Chapters 1–6 and Appendix H address — often at much larger scale than typical SoCs.¹ A modern AI accelerator is structurally a regular tile array of compute units connected by an on-chip network, with memory hierarchies, interconnect protocols, and reset/clock domains that match the verification problems the actor framework was built for. The seven-step continuum of §6.10 (specification \(\rightarrow \) architecture \(\rightarrow \) verification \(\rightarrow \) synthesis \(\rightarrow \) FPGA \(\rightarrow \) AI-RTL \(\rightarrow \) verification by construction) applies as cleanly to a TPU-class design as it does to an OpenTitan IP, and the framework’s distributed-transport bridges are the natural choice for cross-die communication in chiplet-based accelerator designs. The methodology is not specific to one rung of the hardware-systems ladder; it is the structural model of every rung.

The verification side carries the same guarantee at this scale. The construction argument of Appendix D §D.8 — a property proven once on the actor model, preserved into RTL by the cycle-exact Golden Gate transform, and anchored against the synthesized netlist by sequential equivalence checking — does not weaken when the design is a tensor tile array rather than an OpenTitan IP. The per-tile actor is small and the array is regular, and that regularity is exactly what makes the proof reusable: prove the tile once, and the structural induction over the mesh is the same argument the framework already supports at gate scale. The cross-die seam in a chiplet-based accelerator, in turn, is the same universal seam primitive that crosses a process boundary or a software-to-hardware boundary elsewhere in the book (Appendix L §L.4): one TransportBridgeActor declaration, with a different carrier — here a die-to-die link such as UCIe — behind it.

¹ The author led both formal and functional verification of a compute block on Microsoft Maia, Microsoft’s first AI accelerator. The methodology arguments in this appendix draw on that industrial experience.

M.2 The Neuron–Actor Isomorphism

The actor model’s structural shape coincides with that of biological neurons, and the coincidence is neither metaphorical nor accidental. Both are concurrent systems of independent units holding local state and exchanging messages over fixed connections.

The biological neuron

A neuron has private state: a membrane potential maintained by ion concentrations across the cell membrane. It receives inputs from many upstream neurons through synaptic connections, where neurotransmitter release at the presynaptic terminal modulates the membrane potential. The neuron integrates these inputs, and when the membrane potential crosses a threshold, an action potential propagates down the axon and broadcasts to all downstream neurons connected at the axon terminals. The neuron does not block on upstream activity; it integrates whatever arrives and fires when the threshold is met. There is no shared global execution context across the brain.

The artificial neuron

The mathematical model in a neural network keeps the same shape with arithmetic in place of biochemistry. State is the weight vector \(\mathbf {w}\) and bias \(b\). Inputs are activations \(\mathbf {x}\) from upstream neurons. The neuron computes \(y = f(\mathbf {w} \cdot \mathbf {x} + b)\) and broadcasts \(y\) to all downstream neurons in the next layer. The activation function \(f\) replaces the threshold-driven action potential. The arithmetic is different, but the topology, state ownership, and message flow are identical.

The actor

The actor model from Chapter 6 has the same structural shape. Private state lives in actor fields. Messages arrive through the mailbox. The act() method integrates state with each incoming message and decides what to publish downstream. Subscribers receive published messages through their own mailboxes. There is no shared execution context across the actor population. Here is a neuron written as one — private weights as fields, incoming activations integrated in act(), the scalar result published downstream:

typedef struct { real values[]; } Activation_s;   // vector: one value per input
typedef struct { real value;    } Scalar_s;       // one neuron's scalar output


class NeuronActor extends Actor;
   real weights[];
   real bias;


   function new(string name, real w[], real b);
      super.new(name);
      weights = w; bias = b;
   endfunction


   virtual task act(MsgBase msg);
      Msg#(Activation_s) m;
      Scalar_s out;
      real acc;
      if (!$cast(m, msg)) return;
      acc = bias;
      foreach (weights[i]) acc += weights[i] * m.payload.values[i];
      out.value = relu(acc);
      `PUBLISH_TRACED(out, msg)
   endtask
endclass

The biological neuron, the artificial neuron, and the actor share the same structural shape. Biology converged on this organization because brains are genuinely concurrent and need to operate without a global clock. Neural-network mathematics adopted the shape because it matches the underlying biological computation. The actor model arrived at the same shape from a different direction, by way of distributed-systems modeling. Three independent threads of work converging on the same answer is one piece of evidence that the answer is correct.

M.3 Neural Networks as Actor Topologies

A neural network is a directed graph of neurons. With the neuron–actor identity established, a neural network is a directed graph of actors, which is precisely an actor topology. A layer is a population of actors fanned out from a shared input source. A multi-layer network is a chain of such populations connected by ‘WIRE edges. Forward inference is message flow along these edges. Training, as developed below, is message flow along the same edges in reverse.

Forward pass

The forward pass through a feedforward neural network is a textbook actor topology:

• An InputActor publishes an activation message containing the input vector.
• A LayerActor composite contains \(N\) NeuronActors. Each subscribes to the upstream activation. Each independently computes its scalar output and publishes it.
• A ConcatActor subscribes to all \(N\) neuron outputs and assembles them into a single activation message for the next layer.
• The next LayerActor subscribes to that concatenated activation, and the pattern repeats.

For a small fully-connected network this is one-to-one with the math. For a transformer, the same shape applies at coarser granularity: each attention head is an actor, each feedforward block (FFN) is an actor, the residual connection is a fan-out followed by an addition actor, and layer normalization is a single actor on the activation stream. Modern architectures look complicated because of the depth and width, but the structural pieces remain the same handful of actor patterns composed.

Backward pass

The backward pass appears at first to break the unidirectional flow. Gradients have to travel against the forward edges, which seems to require either a separate reverse topology or an out-of-band mechanism. The actor framework already has the mechanism, in the form that Chapter 6’s observability infrastructure provides. Every MsgBase envelope carries trace_id and parent_span fields, populated by the ‘PUBLISH_TRACED macro. The forward pass leaves a lineage trail through the topology, recorded in those fields and in the RecorderActor when persistence is enabled.

Backward propagation is gradient messages flowing along the recorded lineage in reverse:

• Each forward activation message carries a trace_id and a parent_span pointing to the message that triggered it.
• A LossActor computes the loss and publishes a Gradient_s message tagged with the same trace_id.
• Each NeuronActor subscribes to its own Gradient_s stream. When a gradient arrives on that stream, the neuron pairs it with the cached forward activation for that pass (grouped by trace_id), applies the chain rule, and publishes upstream gradients along its wired inputs.
• The pattern terminates at the InputActor, by which point every neuron has accumulated its weight-update message for the optimizer.

The autograd graph that PyTorch and JAX construct dynamically during the forward pass has the same structure as the lineage graph the actor framework’s trace_id and parent_span fields already record. The autograd engine that walks the graph in reverse during loss.backward() is the same walking pattern that backward gradient messages would take through the actor topology. The graph structure exists in the framework already; the autograd use case is a different application of the same structure.

Activation checkpointing

Training large networks routinely exceeds GPU memory, and the standard remedy is activation checkpointing: discard intermediate activations during the forward pass and recompute them on demand during the backward pass, trading compute for memory. Frameworks implement this with custom torch.utils.checkpoint hooks that intercept the forward pass and re-run sub-graphs.

The actor framework’s actor_persistence_pkg (Chapter 6, persistence subsection) already provides the same capability under different vocabulary. RecorderActor captures upstream messages to a chosen scope and ReplayActor re-publishes them deterministically. Activation checkpointing is selective recording: capture the inputs to a sub-topology, discard the intermediate activations the sub-topology produces, and on the backward pass replay the captured inputs through the same sub-topology to recover the activations needed for gradient computation. The framework’s existing primitive does this directly; the only addition is a per-actor policy specifying which messages to retain through the backward pass.

M.4 ML’s Substrate, Already Built

The work of mapping ML systems to the actor framework is substantially less than it appears, because the framework’s existing packages already implement the substrate that ML systems require. A correspondence table makes the overlap concrete:

Table M.1: Framework primitives and their ML-systems equivalents.
Framework primitive	ML systems vocabulary	Status

`trace_id` + `parent_span` lineage	Autograd graph	Implemented
`RecorderActor` / `ReplayActor`	Activation checkpointing	Implemented
`EventSourcedActor`	Gradient accumulation, stateful optimizer	Implemented
Transport bridges (ZMQ, iceoryx, libfabric, NATS)	NCCL, Gloo, MPI	Same pattern, swap codecs
`Supervisor` / `LinkRegistry`	Fault-tolerant training, checkpoint-and-resume	Implemented
`RoundRobin` / `Broadcast` / `ScatterGather` routers	Data parallelism, all-reduce, scatter-gather collectives	Implemented
`ConsistentHash` / `LeastBusy` routers	Mixture-of-experts dispatch, request batching	Implemented
`MailboxMetrics` + `LatencyHistogram`	Profiler trace, GPU activity timeline	Implemented
`try_publish` returning false	Backpressure for inference serving, request shedding	Implemented
`Stash` pattern	KV-cache eviction policy, request reordering	Implemented
`Become` pattern	Adaptive batching mode switch, engine state transitions	Implemented

ML frameworks and inference engines have built each of these as separate subsystems over the past fifteen years. PyTorch’s autograd, FSDP’s gradient sharding, Megatron-LM’s pipeline scheduler, vLLM’s continuous batching, DeepSpeed’s ZeRO partitioning — each is a custom system specialized to one role. The actor framework expresses each of these in the same set of primitives, removing the duplication that comes from many parallel point solutions.

This is not a claim that the actor framework can immediately replace PyTorch. PyTorch is a multi-million-line ecosystem with extensive operator coverage and a large user base. The narrower claim is that the architectural primitives that PyTorch and the surrounding tools have repeatedly re-invented are already present in the actor framework, and a from-scratch ML platform built on the actor model would not need to re-invent them. For new systems — particularly inference serving and distributed-training infrastructure — the framework offers a coherent foundation rather than a stack of point solutions.

M.5 Distributed Training as Topology Patterns

Distributed training of large neural networks is partitioned along four orthogonal axes, each addressed historically by a different framework or framework extension. The axes interact with each other — production training jobs combine all four — and the orchestration is genuinely difficult. The interesting observation is that all four axes become ‘WIRE topology patterns over the same actor primitives.

Data parallelism

The simplest axis. Replicate the entire model topology \(N\) times, one per GPU. Partition the input batch into \(N\) shards and feed each replica one shard. After each backward pass, gradients are averaged across replicas via an all-reduce.

• Topology: \(N\) identical actor sub-topologies, each placed on its own GPU.
• Cross-GPU communication: a Broadcast or ScatterGather router collects per-replica gradients and publishes the averaged gradient back to each replica’s optimizer actor.
• Transport: actor_distributed_pkg bridges with an NCCL codec.

PyTorch DistributedDataParallel (DDP) is approximately this pattern with NCCL all-reduce hooks attached to autograd. Fully Sharded Data Parallel (FSDP) adds parameter sharding within the data-parallel group, which is itself another topology pattern — a ConsistentHash router across GPUs distributing parameter shards.

Pipeline parallelism

Layers as actors in series across GPUs, with transport bridges between adjacent layers. Each GPU holds a contiguous slice of the layer chain and processes microbatches in pipeline fashion to keep all GPUs busy.

• Topology: a single chain of LayerActors, with the chain partitioned across GPUs at chosen split points.
• Cross-GPU communication: transport bridges at the partition boundaries.
• Microbatch scheduling: a TimerActor or pipeline-coordinator actor that tracks microbatch arrival and schedules forward and backward steps.

GPipe and PipeDream implement this pattern with custom microbatch schedulers. The actor framework’s existing primitives — a chain with transport bridges and bounded mailboxes for backpressure — provide the same scaffolding.

Tensor parallelism

A single layer is split across multiple GPUs, with each GPU computing a shard of the layer’s matrix multiplication. The output is reassembled via an all-reduce or all-gather. The topology is a ScatterGather router that fans the layer’s input out to \(N\) ShardActors, each computing one tensor shard, with a subsequent reduction actor that collects and combines. The cross-GPU communication is all-reduce or all-gather via the same NCCL codec used by data parallelism.

Megatron-LM is the canonical implementation of tensor parallelism for transformer models. It is implemented as a custom set of torch.distributed primitives wrapping the relevant operators. The actor framework expresses the same pattern as a ScatterGather topology, with the operator-specific math living inside each shard actor’s act() body.

Expert parallelism (Mixture-of-Experts)

In a Mixture-of-Experts (MoE) model, each token is routed to one or a few of many possible expert sub-networks. The router decides which experts handle which tokens. The pattern is heavily used in models like Mixtral and DeepSeek-V3. The topology is a Router (ConsistentHash or learned) that dispatches each token’s activation to the chosen ExpertActors. The cross-GPU communication is all-to-all dispatch and combine, with the transport bridge handling the underlying all-to-all exchange (NCCL point-to-point send/recv groups).

DeepSpeed-MoE implements this pattern with a custom routing kernel and all-to-all dispatch. The actor framework’s existing Router primitives express the same pattern; the learned routing function lives inside the router actor’s act() body.

Composing the four

Real production training combines all four axes. A typical large-language-model training job is, for example, \(8\)-way tensor parallel within each node, \(4\)-way pipeline parallel across nodes, \(32\)-way data parallel across the resulting groups, and an MoE layer with \(64\) experts distributed across the data-parallel replicas. Today this composition is hand-orchestrated using framework-specific primitives that compose with varying degrees of friction.

In the actor framework, each axis is one topology pattern. Composing them is composing the topologies: a tensor-parallel sub-topology nested inside a pipeline-parallel chain, replicated \(N\) times for data parallelism, with an MoE Router substituted for one of the layers. The composition is a topology, not a system. The orchestration code that today fills tens of thousands of lines becomes ‘WIRE edges in a configuration file.

M.6 Wire-Level Actors: Kernel Bypass and In-Network Compute

The transport-bridge discussion treated NCCL, MPI, libfabric, and ZMQ as different codecs the framework can swap behind a uniform actor interface. That is correct at the application boundary, but it understates what is happening underneath. Each generation of high-performance network technology moved message-passing work out of software and into hardware, and the result is that modern fabrics are themselves implementing the actor model — in NIC silicon, switch silicon, and on-chip mesh networks. The actor abstraction is not riding on top of a network so much as the network has been converging toward an actor substrate.

The kernel as latency budget

A traditional TCP send pays for itself many times over before a byte hits the wire: a syscall ( 100 ns), a user-to-kernel buffer copy, kernel TCP processing (header construction, checksum, congestion accounting), a NIC driver call, a DMA into the kernel’s transmit ring; and on receive, an interrupt, kernel TCP reassembly, ACK generation, and another buffer copy. An 8-byte message on a 10 Gb LAN spends 5 ns on the wire and 50–100 µs in kernel-mediated processing round-trip. For an actor system trying to dispatch millions of messages per second, the kernel is not a background service; it is the entire latency budget.

Each subsequent generation of network technology pushed work out of the kernel and into hardware. RDMA over InfiniBand and RoCE register user buffers once with the NIC, after which posting a send is a memory-mapped write to a NIC doorbell with no syscall; the NIC DMAs directly from user memory to the wire, and the receiving NIC DMAs directly into a pre-registered user buffer. Reliable delivery, retransmit, and flow control happen in NIC silicon. The latency floor drops from 100 µs to 1–2 µs, dominated by PCIe traversal rather than software. Hardware tag matching on InfiniBand HCAs and Slingshot NICs matches incoming messages against a 64-bit tag space in hardware, dispatching to the correct pre-posted receive buffer with no CPU lookup. The tag identifies the destination, the buffer is its mailbox; this is, literally, a hardware actor mailbox, and libfabric’s FI_TAGGED endpoint exposes it directly. AWS EFA, NVIDIA DOCA, and hyperscaler kernel-bypass paths re-expose the same shape on commodity cloud infrastructure: user-space queues, NICs that DMA without kernel involvement, completion polling without interrupts. From the framework’s perspective each is one more TransportBridgeActor subclass; the application code is unchanged.

Switch fabrics as actors

The next step pushes computation, not just transport, into the network. NVIDIA’s SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) implements all-reduce inside the switch silicon. Each node sends its partial gradient toward the root; the switch fabric reduces partials at each hop, so the root receives an already-aggregated result. A 1024-node all-reduce that takes 2 ms with a ring algorithm shrinks to under 1 ms with SHARP, and the host CPUs do no aggregation work. The switch is, structurally, a ReduceActor: it receives messages from many sources, applies a reduction kernel, and emits the result downstream.

Programmable switches (Intel Tofino) generalize this. P4 — a domain-specific language for packet processing — compiles to switch ASIC pipelines, letting a P4 program implement any custom routing, transformation, or aggregation actor at line rate. From a topology’s point of view, the switch becomes one more participant in the actor graph; the difference is that this actor is made of silicon, runs at terabit speeds, and adds picoseconds rather than microseconds per hop.

Data Processing Units — NVIDIA BlueField, Intel IPU, AMD Pensando — complete the picture by moving the actor runtime onto the NIC itself. A DPU has its own ARM or embedded cores, packet processors, RDMA hardware, and accelerators. An actor that does protocol parsing, security inspection, or compression can be hosted bodily on the DPU, with the host CPU completely uninvolved in routine messaging. From the framework’s perspective the DPU is just another deployment node in the topology.

Networks-on-Chip: hardware-resident actor meshes

The same actor structure reappears, considerably smaller, inside accelerator chips themselves. Cerebras WSE-3 has roughly 900,000 processing elements on a wafer, each with its own SRAM and a hardware mailbox, communicating only through a static mesh routing fabric. Tenstorrent’s Tensix tiles each contain compute units, private SRAM, and FIFO-based message ports to neighbors on a 2D mesh NoC. The Google TPU pod’s inter-chip interconnect (ICI) pre-allocates routes for collectives and performs reductions over the ICI ring with the arithmetic computed on-chip (rather than in the switch fabric, as SHARP does). NVIDIA NVLink/NVSwitch silicon lets GPU memory regions be addressed across the fabric with hardware coherence on certain primitives.

Every one of these systems shares the same structural model: each tile has local state in private SRAM; the only mechanism for interaction is typed messages on wires; mailboxes are hardware FIFOs at each tile boundary; routing is performed by silicon without per-message host involvement; and failures and back-pressure are visible to the framework rather than buried in firmware. This is the actor model expressed in silicon. The framework’s actors and the silicon’s tiles are not analogous — they are the same model at different scales. A neuron actor (§M.2) mapped onto a Cerebras PE or a Tenstorrent tile is not compiled in any deep sense; the source-level actor and the hardware-resident actor share an isomorphism that lets the mapping be largely mechanical.

CXL and the memory-fabric horizon

The next dissolution of boundaries comes from CXL 3.0 (Compute Express Link), which exposes remote memory over a coherent fabric, adding roughly 200 ns of latency beyond local DRAM. From an actor’s perspective, a mailbox can be a region of CXL-attached memory shared across nodes in a rack, with hardware coherence handling synchronization. The distinction between “messages crossing a network” and “writes to a local mailbox” collapses into a routing decision made by the fabric, not a protocol decision made by the application. Combined with DPUs and programmable switches, the resulting architecture is one in which the application sees an actor topology while every layer of the substrate beneath it — NoC, NIC, switch, fabric — also implements an actor topology, with progressively coarser tile sizes.

Implications for the framework

This perspective sharpens what TransportBridgeActor is doing. It is not just a portability layer that lets the same actor code run over ZMQ in development and NCCL in production. It is the seam at which the application’s actor topology meets the substrate’s actor topology. Each transport choice picks a different point on a spectrum:

• TCP / ZMQ. Software-only, kernel-mediated; 50–100 µs latency. Right for development and for systems where the network is not the bottleneck.
• RDMA / libfabric. Kernel bypass, NIC handles transport; 1 µs latency. Right for HPC clusters and on-prem distributed training.
• NCCL. GPU-direct, fabric-aware collectives; bandwidth-optimal. Right for the GPU-cluster case the rest of this appendix develops.
• Shared memory / iceoryx. Zero-copy intra-node; 100 ns latency. Right for adjacent processes on the same machine.
• In-network compute (SHARP, P4). Aggregation and routing in switch silicon; sub-microsecond reductions. Right for collectives at cluster scale.
• NoC / on-chip mesh. Hardware-resident actors with picosecond hop latency. Right when actors and silicon tiles are the same object.

The application does not pick between these by rewriting itself; it picks a bridge implementation, and the same actor topology runs over each with different latency–bandwidth characteristics at the wire layer. The kernel is increasingly absent from the data path; the CPU is increasingly used only for actor-local compute, not for message movement; the NIC, switch, and on-chip mesh are increasingly doing what the actor framework would otherwise do in software. The framework’s role is to declare the topology, configure the bridges, and observe the system through trace IDs and structured logs — while the substrate, at every layer beneath, executes that same topology natively.

M.7 Inference Serving as Actor Topology

Inference serving for large language models is a custom-systems problem today. vLLM is on the order of hundreds of thousands of lines of Python, C++, and CUDA; TensorRT-LLM is comparable in scale. NVIDIA’s Triton Inference Server, addressing the broader serving role, adds another layer. Each system coordinates incoming request batching, KV-cache management, GPU stream orchestration, attention kernel launches, scheduling policies, and response streaming.

The structural shape underneath these systems is an actor topology with backpressure:

Each of vLLM’s components maps directly to one actor:

• Continuous batching is a BatcherActor reading from its mailbox and accumulating requests until either a batch-size threshold or a latency budget triggers a publish.
• KV-cache management is a KvCacheActor subscribing to per-sequence allocate, append, and free messages.
• Paged attention is the same KvCacheActor with a paged-allocator state. The pattern does not change.
• Backpressure is try_publish returning false: when the GPU is saturated, upstream actors observe the false return and either queue, shed, or signal upstream.
• Request tracing is the existing trace_id flowing through the topology, giving end-to-end latency breakdowns at no additional cost.
• Speculative decoding is a parallel DecodeActor running a draft model that publishes candidate tokens for the main DecodeActor to verify, with a DiffActor pattern checking acceptance.
• Multi-model serving is multiple inference topologies running concurrently, with a router actor at the front dispatching by model name.

The whole inference engine is a topology, not a system. The actor framework supplies the primitives; the model-specific behavior lives inside each actor’s act() body. Expressed as actors, the topology would be considerably smaller because the framework primitives already provide what each system today re-implements. The size delta cannot be claimed precisely without a worked rewrite, parallel to the UBUS one in Appendix A; that worked rewrite is left as future work. The architectural point — one set of primitives instead of many parallel point solutions — is independent of the specific number.

M.8 Robotic and Embodied Hardware Systems

Robotic platforms are hardware systems that close the loop with the physical world: sensors are hardware, actuators are hardware, and the embedded compute that mediates between them is hardware. The software is the orchestration layer between hardware boundaries. The structural shape is the same sensor-perception-policy-actuator pipeline regardless of platform:

• Sensor actors publish observations: camera frames, LIDAR returns, IMU samples, joint encoder readings.
• Perception actors subscribe to raw sensor streams and publish higher-level state: object detections, segmentation masks, pose estimates.
• Policy actors subscribe to perception output and publish actions: motor commands, gripper torques, navigation waypoints. For learned policies, this actor wraps a neural network whose internals are themselves an actor topology, often running on an embedded AI accelerator.
• Actuator actors subscribe to action commands and drive the underlying hardware.
• Safety supervisor actors subscribe to all of the above and maintain a watchdog that interposes between policy outputs and actuators, vetoing unsafe commands.

The Robot Operating System (ROS) is widely used for this stack and is itself a publish-subscribe message-passing framework. ROS topics are mailboxes, ROS nodes are actors, ROS bag files are recorder/replay. The vocabulary differs but the structure is the same. The actor framework’s contribution to this domain is the same as elsewhere: explicit lineage tracing across the perception-policy-actuator chain, supervision for fault tolerance, and transition between simulation and real hardware via transport bridges — the same bridges that connect chip-level testbenches to FPGA emulators in Appendix G carry over to connecting a robotics simulator to a physical platform.

M.9 The Unifying Claim

The methodology argument the book has been making in earlier chapters has a sharper form once GPU programming, AI accelerator design, distributed training and inference, and robotic systems are within scope:

The actor model fits the structural shape of concurrent hardware systems across multiple scales — the flip-flop and gate inside a block, the block inside an SoC, the SoC inside a chiplet assembly, the chiplet inside a GPU or AI accelerator, the accelerator inside a distributed training cluster, the cluster inside a robotic platform. The same actor framework, with appropriate transport substrates and placement decisions, can serve to design and verify systems at each of these scales. Chapters 1–6 establish the methodology at IP and SoC scale anchored in verification; this appendix extends it upward to AI hardware scale. The two halves are continuous in shape; the engineering work at each scale is its own undertaking.

Three independent lines of work converge on the same structural shape. Biology evolved neuron-as-actor for concurrent computation in nervous systems. Silicon settled on flip-flop-as-actor and gate-as-actor for concurrent computation in chips, and on warp-as-actor and SM-as-actor for the modern parallel hardware that runs ML workloads. The verification community arrived at actors from the bottom up, by way of dissatisfaction with the OOP component hierarchy that earlier methodologies imposed. The convergence is structural rather than coincidental, reflecting the shape of concurrent hardware systems.

The practical consequence is that the methodology spans the full hardware-systems ladder from gates to clusters, and the methodology investment compounds rather than fragmenting across scales. The same actor framework that runs the UBUS testbench in Chapter 6, the Earl Grey twenty-eight-IP chip-level test in Appendix C, and the Yosys-synthesized counter actor in Appendix E, also expresses a vLLM-class inference engine, a four-axis distributed training topology, and a sensor-policy-actuator robotic loop. The seven-step continuum from §6.10 (specification \(\rightarrow \) architecture \(\rightarrow \) verification \(\rightarrow \) synthesis \(\rightarrow \) FPGA \(\rightarrow \) AI-RTL \(\rightarrow \) verification by construction) applies as cleanly to a TPU or a Cerebras wafer as it does to an OpenTitan IP. Picking the right model of computation once carries forward through every subsequent phase and every subsequent scale.

M.10 Conclusion

Chapter 6 introduced the actor model as the substrate for hardware verification and showed that the model carries through specification, architecture exploration, synthesis, FPGA emulation, and AI-driven RTL generation. This appendix extends the methodology up the hardware-systems ladder: GPUs as massively parallel SoCs, AI accelerators as silicon designed for tensor workloads, distributed training and inference clusters as one logical hardware system spanning racks, and robotic platforms as embedded hardware closing the loop with the physical world. In each case the actor framework’s existing primitives — lineage tracing, recorder/replay persistence, supervision, routing, transport bridges, backpressure, observability — map directly onto capabilities that the corresponding domain has historically built as a custom system.

The four parallelism strategies of distributed deep learning (data, pipeline, tensor, expert) become four ‘WIRE topology patterns over the same primitives. Inference-serving stacks like vLLM and TensorRT-LLM are actor topologies with backpressure. Robotic control loops are sensor-policy-actuator chains with supervision. The neuron itself is an actor; a neural network is an actor topology; backward propagation is gradient messages flowing through the lineage that the framework already records.

The unifying claim is that the actor model is the structural model of concurrent hardware systems at every scale. Emergent Verification at IP-block and SoC scale — the methodology developed across Chapters 1–6 — is one rung of the ladder. GPU programming, AI accelerator design, distributed training, inference serving, and embodied AI are other rungs of the same ladder. The same framework, with appropriate transport substrates and placement decisions, designs and verifies them all — which is why this book is titled Emergent Hardware Verification rather than scoped to any single rung — the verification methodology is the same up and down the ladder.

The remaining work is engineering, and it is the lowering the book already demonstrates for the RTL backend (Appendix E; Chapter 7) aimed at new targets: a GPU placement layer that lowers actor topologies to CUDA, HIP, or oneAPI; a JIT-compilation pass that fuses adjacent actors with regular structure into single kernels; codecs for NCCL, Gloo, and MPI to layer onto the existing transport-bridge framework; and the natural extension of the synthesizable form (Appendix E) to AI-accelerator-style tile arrays. Each of these is a tractable project — the same compiler pointed at a new backend, not a new methodology. What the framework provides is a coherent foundation against which they each can be built.

Looking forward. The pure-C++ port that Appendix K develops is the natural starting point for this work: the same header-only library that supports general concurrent software is also the natural front-end for a GPU-targeting code generator. The transport-bridge pattern that Appendix L demonstrates with ZMQ generalizes to NCCL with the same shape. The neuron-as-actor mapping sketched here is implementable today as a feedforward network demonstration; scaling to a transformer is a question of adding attention and FFN actors with appropriate tensor shapes inside their act() bodies. The framework that Chapter 6 introduced supplies the foundation; the rest is engineering.

M.11 Notes and References

This appendix names a number of public systems and research contributions — vLLM, Megatron-LM, GPipe, FlashAttention, the TPU architecture papers, and the rest. Where the appendix makes a structural claim about one of them, the primary source documents the system as built; the actor-framework re-expression is this appendix’s contribution and is, with the exception of the Chapter 6 framework primitives the mapping reuses, future engineering work. The citations are for the systems being mapped onto, not for the mapping itself, and the full list is collected with the rest of the book’s citations in the References at the end of the book.

References

Inline, this book cites by author and year and keeps titles short so the prose stays readable; the full citations are collected here, ordered by the chapter that first relies on them. Entries follow one convention throughout: authors (year), title, venue or publisher, and — where useful — a one-line note on what the work grounds.

Chapter 1 — Emergent Verification

• Hewitt, C., Bishop, P., Steiger, R. (1973). A Universal Modular ACTOR Formalism for Artificial Intelligence. IJCAI. The original actor-model paper; the pedigree behind the substrate this book builds on.
• Foster, H. (Wilson Research Group / Siemens EDA, biennial since 2010; preceded by a 2007 Far West Research edition). The Wilson Research Group Functional Verification Study. The industry survey behind the verification-effort figures cited in Chapter 1.
• Fang, W., et al. (2024). AssertLLM: Generating and Evaluating Hardware Verification Assertions from Design Specifications via Multi-LLMs. arXiv preprint. The academic LLM-driven property-generation line referenced in §1.6.

Chapter 2 — RTL Design

• IEEE Std 1800-2023. IEEE Standard for SystemVerilog — Unified Hardware Design, Specification, and Verification Language. IEEE. The language reference behind the event-scheduling regions, data types, and SVA constructs of this chapter.
• Wolper, P. (1986). Expressing Interesting Properties of Programs in Propositional Temporal Logic. POPL. The data-independence argument behind the FIFO’s symbolic-token proof.
• Clarke, E., Grumberg, O., Jha, S., Lu, Y., Veith, H. (2000). Counterexample-Guided Abstraction Refinement. CAV. The CEGAR loop the MSI example develops.

Chapter 3 — Formal Verification

Proof engines: SAT, BMC, induction, interpolation, IC3.

• Tseitin, G. S. (1968). On the Complexity of Derivation in Propositional Calculus. Studies in Constructive Mathematics and Mathematical Logic. The linear-size CNF encoding every industrial SAT flow uses.
• Biere, A., Cimatti, A., Clarke, E., Zhu, Y. (1999). Symbolic Model Checking without BDDs. TACAS. The original bounded-model-checking paper.
• Sheeran, M., Singh, S., Stålmarck, G. (2000). Checking Safety Properties Using Induction and a SAT-Solver. FMCAD. \(k\)-induction.
• Craig, W. (1957). Three Uses of the Herbrand–Gentzen Theorem in Relating Model Theory and Proof Theory. Journal of Symbolic Logic 22. The interpolation theorem.
• McMillan, K. L. (2003). Interpolation and SAT-Based Model Checking. CAV. Craig interpolation adapted to hardware model checking.
• Bradley, A. R. (2011). SAT-Based Model Checking without Unrolling. VMCAI. The IC3 algorithm.
• Eén, N., Mishchenko, A., Brayton, R. (2011). Efficient Implementation of Property Directed Reachability. FMCAD. The PDR refinements of IC3.
• Vizel, Y., Gurfinkel, A. (2014). Interpolating Property Directed Reachability. CAV. The AVY engine: IC3 frames generalized by interpolants.
• Vardi, M. Y., Wolper, P. (1986). An Automata-Theoretic Approach to Automatic Program Verification. LICS. The \(\omega \)-automaton product construction behind every SVA model checker.

Logic and computability.

• Frege, G. (1879). Begriffsschrift. The birth of predicate logic.
• Gödel, K. (1931). Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik 38. The incompleteness theorems.
• Church, A. (1936). An Unsolvable Problem of Elementary Number Theory. American Journal of Mathematics 58. Undecidability of first-order validity.
• Turing, A. M. (1936). On Computable Numbers, with an Application to the Entscheidungsproblem. Proceedings of the London Mathematical Society. The machine model behind every engine in the chapter.
• Matiyasevich, Y. (1970). Enumerable Sets are Diophantine. Soviet Mathematics Doklady 11. The negative solution of Hilbert’s tenth problem; why quantifier-free non-linear integer arithmetic is undecidable.
• Curry, H. B. (1934). Functionality in Combinatory Logic. PNAS 20; and Howard, W. A. (1969, published 1980). The Formulae-as-Types Notion of Construction. In To H. B. Curry. Academic Press. The propositions-as-types correspondence.
• Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal 27. The probabilistic sibling’s foundation.
• Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review 65.
• Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS. The transformer architecture.

Theorem proving: tools, projects, and case studies.

• Boyer, R. S., Moore, J S. (1979). A Computational Logic. Academic Press. The prover tradition ACL2 descends from.
• Moore, J S., Lynch, T., Kaufmann, M. (1998). A Mechanically Checked Proof of the AMD5K86 Floating-Point Division Program. IEEE Transactions on Computers 47(9). The K5 FDIV-era proof of the case study.
• Russinoff, D. M. (2022). Formal Verification of Floating-Point Hardware Design: A Mathematical Approach. Springer, 2nd edition. The deepest single source on hardware theorem proving; the FPU pyramid of this chapter follows its Part V.
• Hunt, W. A. Jr., Swords, S. (2009). Centaur Technology Media Unit Verification. CAV. The daily ACL2 proof pipeline of the case study.
• Harrison, J. (2009). HOL Light: An Overview. TPHOLs. The prover behind Intel’s post-FDIV floating-point verification.
• Klein, G., et al. (2009). seL4: Formal Verification of an OS Kernel. SOSP. The Isabelle/HOL microkernel proof.
• Gonthier, G. (2008). Formal Proof — The Four-Color Theorem. Notices of the AMS 55(11).
• Leroy, X. (2006). Formal Certification of a Compiler Back-End. POPL. CompCert.
• Gonthier, G., et al. (2013). A Machine-Checked Proof of the Odd Order Theorem. ITP. The Feit–Thompson formalization.
• de Moura, L., Ullrich, S. (2021). The Lean 4 Theorem Prover and Programming Language. CADE-28.

AI-assisted proving.

• DeepMind (2024). AI Achieves Silver-Medal Standard Solving International Mathematical Olympiad Problems. Technical announcement. AlphaProof and AlphaGeometry 2.
• Yang, K., et al. (2023). LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. NeurIPS Datasets and Benchmarks. LeanDojo and ReProver.
• Song, P., Yang, K., Anandkumar, A. (2024). Towards Large Language Models as Copilots for Theorem Proving in Lean. arXiv preprint. Lean Copilot.
• ByteDance Seed (2025). Seed-Prover. arXiv preprint. LLM tactic search integrated into Mathlib workflows.

Textbooks and learning resources.

• Kroening, D., Strichman, O. (2016). Decision Procedures: An Algorithmic Point of View. Springer, 2nd edition. The standard reference for the bit-vector and DPLL(T) decision procedures this chapter builds on.
• Bradley, A. R., Manna, Z. (2007). The Calculus of Computation: Decision Procedures with Applications to Verification. Springer. Source of the decidability tables; unifies SAT, SMT, and theorem proving.
• Nipkow, T., Klein, G. (2014). Concrete Semantics: With Isabelle/HOL. Springer. The induction heuristics of Stage 5.
• Pierce, B. C., et al. Software Foundations. Online, free. The standard Coq tutorial.
• Theorem Proving in Lean 4 and Mathematics in Lean. Online, free. The Lean language reference and the Mathlib entry point.

Chapter 4 — Transaction-Level Verification

• IEEE Std 1364-2005. IEEE Standard for Verilog Hardware Description Language. IEEE. The original scoping and language rules this chapter contrasts SystemVerilog against.

Chapter 5 — The UVM Framework

• IEEE Std 1800.2-2020. IEEE Standard for Universal Verification Methodology Language Reference Manual. IEEE. The UVM standard this chapter walks and critiques.

Chapter 6 — Emergent Verification with Actors

• Armstrong, J. (2003). Making Reliable Distributed Systems in the Presence of Software Errors. PhD thesis, KTH. Erlang/OTP and the “let it crash” supervision philosophy the framework borrows.

Chapter 7 — Emulation

• Pfister, G. F. (1982). The Yorktown Simulation Engine: Introduction. DAC. The custom-processor-array ancestor of every modern emulator.

Appendix M — The Hardware-Systems Ladder

Where Appendix M makes a structural claim about an existing system (vLLM, Megatron-LM, GPipe, and the rest), the cited source documents the system as built; the actor-framework re-expression is the appendix’s contribution. The references are for the systems being mapped onto, not for the mapping itself.

Robotics.

• Quigley, M. et al. (2009). ROS: An Open-Source Robot Operating System. ICRA Workshop on Open Source Software.

Distributed deep-learning training.

• Huang, Y. et al. (2019). GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. NeurIPS. Pipeline parallelism with microbatch scheduling.
• Narayanan, D. et al. (2019). PipeDream: Generalized Pipeline Parallelism for DNN Training. SOSP.
• Shoeybi, M. et al. (2019). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053. Tensor parallelism for transformers.
• Rajbhandari, S. et al. (2020). ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. SC. The DeepSpeed ZeRO partitioning scheme.
• Rajbhandari, S. et al. (2022). DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale. ICML. Expert-parallel routing and all-to-all dispatch.
• Chen, T. et al. (2016). Training Deep Nets with Sublinear Memory Cost. arXiv:1604.06174. Activation checkpointing.

Inference serving and attention kernels.

• Kwon, W. et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP. The vLLM paper; documents PagedAttention and continuous batching.
• Dao, T. et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS.
• Shah, J. et al. (2024). FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. NeurIPS. Warp-specialization on Hopper.
• Leviathan, Y., Kalman, M., Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. ICML. Speculative-decoding pattern.

Mixture-of-Experts in production.

• Jiang, A. et al. (Mistral AI, 2024). Mixtral of Experts. arXiv:2401.04088.
• DeepSeek-AI (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437.

AI accelerator silicon.

• Jouppi, N. et al. (2017). In-Datacenter Performance Analysis of a Tensor Processing Unit. ISCA. The TPUv1 architecture paper; subsequent TPU generations are described in vendor documentation.
• Microsoft (2023). Azure Maia 100 and Azure Cobalt 100 announcement, Microsoft Ignite, November 2023. The first-party AI accelerator referenced in Appendix M.

Software systems and toolchains referenced.

• PyTorch DistributedDataParallel (DDP) and FullyShardedDataParallel (FSDP) — the torch.distributed and torch.distributed.fsdp reference documentation.
• NVIDIA NCCL — vendor documentation and source repository for the collective-communications library.
• NVIDIA CUDA Graphs and Cooperative Groups — the CUDA Toolkit Programming Guide.
• Halide, TVM, JAX/XLA — each a distinct compiler/runtime project with its own reference documentation; cited as comparison points for the front-end abstraction question, not as competitors.