Emergent Hardware Verification

Appendix G FPGA Emulation: Automatic Conversion of the Actor Testbench to Hardware

Chapter 7 makes the industrial argument that a synthesizable-actor testbench converts from software simulation to hardware emulation with no manual rewrite, for both a commercial emulator and FireSim. This appendix is the engineering reference behind that claim: it specifies the flow, the mechanism that makes the rendering mechanical and cycle-exact, the three compilation back-ends that share one authored graph, the concrete form the host-to-hardware seam takes on each, and the performance model. The methodology continuum (Appendix D) named FPGA emulation as step five — synthesized actor RTL produces a bitstream that runs at MHz clock rates while the same framework drives observation, scoreboarding, and coverage; here that step is laid out concretely.

The central practical question is the one Chapter 7 answers in prose and this appendix answers in implementation detail: once the whole actor graph (design and verification alike) synthesizes onto the fabric, what remains on the host, and how does that thin remainder attach without losing the topology? The answer is that very little remains on the host. Every actor that is a finite state machine (the design blocks, and the scoreboards, coverage, RAL, and stimulus that check them) synthesizes and rides the fabric together (Appendix E), and the `WIRE edges between them become wires in the fabric. What stays on the host is only what is genuinely software: the workload read-out a person or dashboard consumes, external I/O, and any link the program chose to terminate in software. Those few seams attach through the same distributed transports (ZMQ, NATS, iceoryx, libfabric) that Appendix L used for cross-machine deployment, with no DPI, no host-side parsing of vendor debug protocols, and no per-platform plumbing.

G.1 The claim, in engineering terms: no manual conversion step

The contribution is not that an actor testbench can be made to run on an emulator with effort; it is that the conversion has no manual step. Chapter 7’s Table 7.3 lays the two columns side by side; restated at the level at which this appendix works:

• The status quo is a manual port because the dominant testbench language (UVM) is non-synthesizable and host-resident by construction. Four manual efforts follow: the testbench is split and partly re-coded (a hand-written synthesizable BFM, non-synthesizable VIP re-coded — the published Palladium Ethernet case study reports its UVM VIP “was rewritten in C++”); the boundary is hand-plumbed over SCE-MI per interface; coverage is hand-relocated into a bound emulator module; the partition is hand-hinted.
• An actor testbench has neither property. It is synthesizable in full (Appendix E’s Five Rules, with the whole verification loop, scoreboard included, synthesized to gates in §E.4), and its only host crossings are the ones the engineer deliberately placed at real seams. Each of the four manual efforts therefore becomes a compiler pass or a graph walk over a declaration the engineer already wrote: render each actor to RTL; emit the `WIRE topology as wires; generate the seam adapter from the TransportBridgeActor declaration; hand the `WIRE graph to the partitioner as the cut plan; carry coverage in the intermediate representation; reuse the workload artifact unchanged.

Remove the root cause — a non-synthesizable, host-resident testbench — and all four manual efforts disappear together. The rest of this appendix is the mechanism that performs each automatically.

G.2 Why the rendering is mechanical and cycle-exact

The load-bearing step is the first: rendering each actor to RTL. If that step required human judgment, “no manual conversion” would be false. It does not, and the strongest available existence proof is the open FireSim/Golden Gate stack, whose published guarantees this appendix borrows directly.

An actor is a latency-insensitive dataflow node. An actor in the synthesizable form is a finite state machine with bounded mailboxes and a ready/valid handshake on every channel (Appendix E). That is, structurally, a primitive latency-insensitive bounded dataflow network (LI-BDN) node in the sense Golden Gate uses (Magyar/Biancolin et al., ICCAD 2019): a node connected to others by bounded token queues, obeying the partial-implementation, self-cleaning, and no-extraneous-dependencies properties. The `WIRE graph is an LI-BDN; publish()/mailbox hand-offs are token channels.

Rendering is a compiler pass with a formal guarantee. Golden Gate takes arbitrary FIRRTL and emits a cycle-exact hardware model automatically; the LI-BDN partial-implementation property is a formal guarantee that the emitted model’s output token history matches the source RTL cycle for cycle, verified per optimized model by a push-button bounded-model-checker (the paper’s LIME flow). The mechanical rendering Golden Gate performs on a processor’s register file is exactly what the framework performs on a scoreboard’s match table: a structural lowering, not a re-authoring. Appendix E demonstrates the same lowering by hand on a counter actor (98 cells; a two-instance chain of them places and routes at 126.6 MHz on iCE40); Golden Gate shows the lowering is automatable at scale.

Abstract pieces become target-time RTL too. Where a verification actor models something that has no direct gate form — a golden reference computed in software, a memory’s timing — the FASED precedent applies (Biancolin et al., FPGA 2019): write the model as target-time RTL and apply the same transform to it as to the design, binding host services (host DRAM, a host computation) behind the same token handshake. The actor’s act() body is already that target-time description; nothing about a scoreboard forces it onto the host.

The consequence is that “each actor becomes RTL” is, for the whole graph, a single automatable lowering with a cycle-exactness guarantee — the property that lets the conversion be a build target rather than a project. Appendix E §E.5 makes the same argument from the synthesis side — the by-hand lowering there and the automatic lowering here are the same structural pass — and that cycle-exactness guarantee is what lets the derivation stand in for the equivalence proof in verification by construction (Appendix D §D.8).

G.3 The flow

The path from class-based actor specification to running hardware has four stages.

1. Synthesizable form. Each actor’s class form (the design blocks and the verification actors alike) is rewritten or transpiled into the synthesizable form (Appendix E). For SystemVerilog actors this is a manual translation today (the Five Rules are mechanical, and Appendix H sketches the AI-assisted automation); for SystemC actors (Appendix J) the existing HLS tool chain handles this stage automatically; the FIRRTL-based path (§G.2) automates it for any front-end that lowers to FIRRTL.
2. Synthesis to vendor, open-source, or decoupled-simulator primitives. The synthesizable RTL is mapped to a target by a compiler. Yosys (open-source) targets Lattice iCE40 and ECP5 directly; nextpnr places and routes; icepack/ecppack produces a bitstream. For larger devices (Xilinx Versal, Intel Stratix) the vendor tool chain (Vivado, Quartus) or the commercial emulator’s compiler takes the same lint-clean RTL. For a decoupled simulator (FireSim), Golden Gate FAME-1-transforms the same RTL into an LI-BDN. One RTL, three compilers (§G.4).
3. On-fabric instantiation. The bitstream (or emulator image, or FAME target) contains the synthesized actor blocks plus a thin seam adapter for each genuine host crossing. The adapter is the hardware-side counterpart of actor_distributed_pkg: it serializes message bundles into a byte stream over a host link (PCIe DMA, USB, JTAG, Ethernet) or implements the platform’s native bridge protocol. The framework does not care which link.
4. Host-side seam. The host runs only what cannot synthesize: the workload read-out and any external-I/O endpoints. Where such a seam exists, a bridge actor sits on it (§G.6). The verification actors are not on the host — they synthesized in stage 2 and ride the fabric with the design.

A bridge actor at such a seam is exactly what actor_distributed_pkg already provides for cross-process and cross-machine deployment. The fabric is one more substrate the transport abstraction spans; it does not care whether the actor on the far side of a seam is in another Linux process, on another machine, or in hardware. The difference from cross-machine regression is only that here most of the graph sits on one side — the fabric — and the seam carries workload-level traffic, not per-transaction traffic.

G.4 Three back-ends, one graph

The four-stage flow targets three kinds of hardware from the same authored graph. They differ only in stage 2’s compiler and stage 4’s seam protocol; both are generated, not authored.

Open-source FPGA (Yosys → nextpnr → bitstream). The fully open path, used for bring-up and for this book’s runnable artifacts. Yosys synthesizes the lint-clean RTL; nextpnr places and routes; icepack/ecppack emits the bitstream; the seam is a serial or ZMQ link. The capacity ceiling is the single board, so this path proves the mechanism (the whole graph synthesizes and runs) rather than the scale. §G.5 gives the chain.

Commercial emulator (Palladium / ZeBu / Veloce). The vendor compiler takes the same synthesized RTL (it does not distinguish a scoreboard actor from a DUT block, because both are RTL) and maps it onto the processor array or FPGA array. The `WIRE graph is handed to the partitioner as the cut plan (§G.7); the genuine seams become SCE-MI transactors (the Accellera Standard Co-Emulation Modeling Interface, v2.1/v2.4), generated from the TransportBridgeActor declaration (§G.6). The scale path: billions of gates, MHz-class clock, the full debug and power apps Chapter 7 surveys.

FireSim (decoupled, FPGA-accelerated, cycle-exact). Golden Gate FAME-1-transforms the same RTL into an LI-BDN target; the genuine seams become FireSim bridges (a target-side stub plus a host-side bridge_driver_t). Because FireSim is open and offers metasimulation, this back-end is the one on which the conversion is not just automatic but provable (§G.10): the whole graph runs in software under the FAME transform, bit- and cycle-exactly reproducible against an FPGA run, on a laptop, before any bitstream exists.

The invariant across all three is the authored graph. A program that buys a commercial emulator and a program that runs FireSim author the same testbench; the back-end and the seam adapter are the only differences, and both are emitted.

G.5 Open-source tool chain path

For the open-source FPGA back-end, the full chain is:

class-based actor (actor_pkg/*.sv)
    |
    v    manual or AI-assisted transpilation
    |    (synthesizable-form rules + AI-RTL translation)
synthesizable .sv
    |
    v    Verilator --lint-only -Wall (sanity)
    |
synthesizable .sv (lint-clean)
    |
    v    Yosys (read_verilog -sv ... ; synth_ice40 / synth_ecp5)
    |
gate-level netlist
    |
    v    nextpnr-ice40 / nextpnr-ecp5
    |
placed-and-routed design
    |
    v    icepack / ecppack
    |
.bin / .bit bitstream
    |
    v    iceprog / openFPGALoader
    |
running FPGA hardware (design + verification actors)
    |
    v    ZMQ / serial over USB / Ethernet      (the one seam)
    |
host: workload read-out via a bridge actor

Every stage is open-source and runs on commodity hardware. Lattice iCE40 and ECP5 boards (the iCEBreaker, ULX3S, ECPIX-5) are inexpensive (iCEBreaker and ULX3S are well under $200), well-documented, and supported end-to-end. The synthesizable counter actor from Appendix E (98 cells per stage) fits trivially; that appendix places and routes it to a 126.6 MHz Fmax bound (nextpnr, iCE40). The verification loop of the substrate-swap example (§G.10) — stimulus, DUT, scoreboard, coverage — lowers to roughly 300 flip-flops through Yosys, the same flow. These are single-FPGA numbers; multi-FPGA emulation of a real SoC lands lower (1–5 MHz under load), constrained by partitioning, inter-FPGA TDM, and the routing cliff (Chapter 7, §7.6).

For Xilinx and Intel devices the chain is the same shape with vendor tools replacing the open-source steps; for FireSim the same RTL enters Golden Gate instead of Yosys. The methodology does not change.

G.6 The bridge: one declaration, three seam realizations

The bridge actor is the integration primitive at a software-to-hardware seam: the host read-out, external I/O, or a link whose far side is software. It is the only place the host link appears; the rest of the graph is on the fabric and needs no bridge. One declaration (a typed message in, a typed message out, a transport named) yields three concrete realizations, one per back-end. This is the universal seam primitive: the same TransportBridgeActor that crosses a process boundary in distributed regression, a machine boundary across a farm, and a language boundary to a Python consumer crosses the software-to-hardware boundary here, with only the carrier changing (Appendix L §L.4). The three realizations below are that one declaration over three carriers.

Realization 1: the distributed-transport proxy (simulation and cross-machine). The base form, identical to the cross-process deployment of Appendix L:

class FpgaProxyActor extends Actor;
  // Configured at construction with the on-FPGA actor's name and the
  // transport's address (ZMQ endpoint, serial port, etc.).
  TransportBridgeActor bridge;

  task act(MsgBase msg);
    // Forward inbound to the FPGA: serialize, send over transport.
    bridge.send(serialize(msg));
  endtask

  // Background thread: receive from FPGA, deserialize, publish into
  // the local `WIRE topology so subscribers see the message as if
  // the FPGA actor had published locally.
  task receive_loop();
    forever begin
      Bytes b = bridge.recv();
      MsgBase m = deserialize(b);
      publish(m);        // fan out to all wired consumers
    end
  endtask
endclass

It is on the order of fifty lines: act() is serialize-and-send, a background loop deserializes and republishes, and the rest of the framework is unaffected. The names are illustrative — in the shipped actor_distributed_pkg the transport calls are send_bytes/recv_bytes over a byte unsigned [] buffer and the background loop is the bridge actor’s forked run(); the listing keeps short pseudonyms to show the shape. The serialization is whatever the transport encodes (MessagePack/Protobuf over ZMQ; zero-copy shared memory over iceoryx for same-machine PCIe/USB boards); the hardware-side serializer is generated from the message struct definitions, not hand-written.
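
The field-walk marshalling mentioned above can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not the shipped generator: the message type AxiWrite, its field widths, and the helpers layout, pack_msg, and unpack_msg are all hypothetical, and a fixed little-endian layout stands in for whatever encoding the chosen transport actually uses.

```python
import struct
from dataclasses import dataclass, field, fields


# Hypothetical message type. The "fmt" metadata is this sketch's stand-in
# for the bit widths a real message struct definition would carry.
@dataclass
class AxiWrite:
    addr: int = field(metadata={"fmt": "I"})    # 32-bit address
    length: int = field(metadata={"fmt": "B"})  # 8-bit burst length
    data: int = field(metadata={"fmt": "Q"})    # 64-bit payload word


def layout(msg_type) -> str:
    """Walk the declared fields in order and derive the wire format."""
    return "<" + "".join(f.metadata["fmt"] for f in fields(msg_type))


def pack_msg(msg) -> bytes:
    """Serialize a message by visiting its fields in declaration order."""
    values = [getattr(msg, f.name) for f in fields(msg)]
    return struct.pack(layout(type(msg)), *values)


def unpack_msg(msg_type, payload: bytes):
    """Rebuild a message from bytes using the same derived layout."""
    return msg_type(*struct.unpack(layout(msg_type), payload))


if __name__ == "__main__":
    wire = pack_msg(AxiWrite(addr=0x1000, length=3, data=0xDEADBEEF))
    print(len(wire), unpack_msg(AxiWrite, wire))
```

Because both directions derive the layout from the same declaration, nothing about the wire format is written by hand; a hardware-side serializer generated from the same field list would follow the same walk.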

Realization 2: the SCE-MI transactor (commercial emulator). On a commercial emulator the same declaration must become a SCE-MI transactor, and its shape is fixed by the declaration, which is why it can be generated rather than hand-written. SCE-MI standardizes a multi-channel message interface that carries transactions, not pins; a well-formed transactor encapsulates the per-cycle bus protocol on the emulator side and crosses the boundary once per transaction. (The published gap between a 64× signal-based and a 2091× transaction-based ZeBu acceleration is exactly this transaction-versus-pin choice.) For an actor whose outbound message is an AXI burst the target-side transactor takes the shape:

// Emitted target-side SCE-MI transactor for an AXI-burst bridge actor.
// One inbound message = one full burst; one outbound = one response.
// The host crosses the boundary TWICE per burst, not 2*N times.
module axi_burst_xtor (
  input  logic           clk, rst_n,
  input  axi_burst_msg_t burst_in,        // {addr, len, data[]} -- one txn
  input  logic           burst_in_valid,
  output logic           burst_in_ready,
  output axi_resp_msg_t  resp_out,        // {status, rdata[]} -- one txn
  output logic           resp_out_valid,
  input  logic           resp_out_ready,
  axi_if.master          axi              // pin-level AXI to the DUT
);
  // The actor's act() handler, rendered to RTL, runs the per-beat AXI
  // protocol internally: unpack burst_in once, drive len+1 beats, collect
  // the responses, pack them into one resp_out. No per-beat host crossing.
endmodule

Three facts make this a generation problem, not an authoring one. The message type fixes the wire format (the marshalling is the field walk the distributed transport already does). The protocol unfold is the actor’s own act() body, rendered to RTL by §G.2 — there is nothing to re-code. And the transaction granularity is automatic: because the actor communicates in whole typed messages and never in pins, the emitted transactor is transaction-level by construction and physically cannot be the “chatty” per-pin transactor that collapses throughput (§G.9).
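
The granularity argument can be made concrete with a small behavioral model. The sketch below is Python, not emitted RTL, and everything in it is an assumption for illustration: BurstMsg and RespMsg stand in for axi_burst_msg_t and axi_resp_msg_t, the burst is taken to be an incrementing (INCR) burst, and run_burst and host_crossings are hypothetical helper names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BurstMsg:          # hypothetical stand-in for axi_burst_msg_t
    addr: int
    length: int          # AXI convention: number of beats minus one
    data: List[int]


@dataclass
class RespMsg:           # hypothetical stand-in for axi_resp_msg_t
    status: str
    beats_done: int


def run_burst(msg: BurstMsg, bytes_per_beat: int = 4) -> Tuple[RespMsg, List[Tuple[int, int]]]:
    """Unfold one burst message into per-beat (address, data) pairs.

    The unfold happens entirely on the target side, so the host sees one
    message in and one response out regardless of the beat count.
    """
    if len(msg.data) != msg.length + 1:
        raise ValueError("data must hold length+1 beats")
    beats = [(msg.addr + i * bytes_per_beat, word) for i, word in enumerate(msg.data)]
    return RespMsg(status="OKAY", beats_done=len(beats)), beats


def host_crossings(beats: int, per_beat: bool) -> int:
    """Boundary crossings for one burst: 2*N if chatty, 2 if transaction-level."""
    return 2 * beats if per_beat else 2


if __name__ == "__main__":
    resp, beats = run_burst(BurstMsg(addr=0x100, length=3, data=[1, 2, 3, 4]))
    print(resp, beats)
    print(host_crossings(4, per_beat=True), host_crossings(4, per_beat=False))
```

For a four-beat burst the model gives eight crossings for a per-beat transactor and two for the transaction-level one, which is the 2*N versus 2 contrast stated in the listing's header comment.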

Realization 3: the FireSim bridge. On FireSim the same declaration maps to a FireSim bridge — a target-side BlackBox mixing in the bridge interface (channel annotations, a serialized constructor argument, the module name) and a host-side bridge_driver_t with a non-blocking tick(). The target stub is the FireSim equivalent of the SCE-MI BFM; the bridge_driver_t is the equivalent of the host proxy. The same mechanism produces both from the TransportBridgeActor declaration, exactly as for the SCE-MI pair.

The point across the three: an actor on either side of the seam wires to the bridge with the same `WIRE call it would use for any local actor, and neither side knows the other is across a substrate boundary. The bridge earns its place because of that boundary, not because the testbench lives on the host. Its few lines are the surface area of one seam, not of the substrate swap, which re-renders the whole graph and adds no bridge inside the verification loop at all. And the host side of each realization (the SCE-MI host proxy, the FireSim bridge_driver_t, the distributed-transport receiver) is itself a plain C++ object, which is to say a pure-C++ actor of the same framework (Appendix K §K.5): the seam is not a language boundary on either side, only a substrate boundary in the middle.

G.7 The partition plan is the wire graph

A real SoC does not fit one FPGA, and partitioning is the long pole of FPGA-based emulation: a bad cut slices a high-fanout bus and its per-FPGA place-and-route spins for days. The classical recovery is manual partition hints that do not transfer between RTL revisions. The actor graph removes the inference.

The `WIRE graph at elaboration time is a list of edges between named producer and consumer actors, each with a known message type (hence a known bit width), a known fan-out cardinality, and an explicit ready/valid back-pressure structure. It is a directly usable partition plan: place each actor (or each densely-coupled subgraph) on its own device, with the inter-actor edges becoming the inter-device crossings. FireAxe (ISCA 2024) confirms this is the cheap kind of cut: its fast mode partitions at latency-insensitive (ready/valid or credit-based) boundaries at near-zero accuracy cost, falling back to the slower exact mode only when a boundary has combinational paths through it. The Five Rules put a ready/valid handshake on every `WIRE edge, so every actor boundary is already a fast-mode cut. A design not written to that discipline must hunt for latency-insensitive cut points; an actor graph is nothing but latency-insensitive cut points. The partitioner is handed the plan rather than inferring it, and the partition is invisible to the verification environment: when a cut crosses a device boundary it is bridged by time-division multiplexing, and neither the producing nor the consuming actor can tell the wire between them now crosses a seam.
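
A minimal sketch of what "the wire graph is the cut plan" means in practice, written in Python under stated assumptions: the Edge record, the actor names, the bit widths, and the two-device placement are invented for illustration, and the two extra bits per crossing model only the ready and valid wires of the handshake.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Edge:
    producer: str
    consumer: str
    width_bits: int      # known from the edge's message type


def cut_plan(edges: List[Edge], placement: Dict[str, int]) -> List[Edge]:
    """Edges whose endpoints sit on different devices are the crossings."""
    return [e for e in edges if placement[e.producer] != placement[e.consumer]]


def cut_width(edges: List[Edge], placement: Dict[str, int]) -> int:
    """Total inter-device signal count: payload plus ready and valid per edge."""
    return sum(e.width_bits + 2 for e in cut_plan(edges, placement))


# Hypothetical four-actor verification loop and a two-device placement.
EDGES = [
    Edge("stimulus", "dut", 16),
    Edge("dut", "scoreboard", 32),
    Edge("stimulus", "scoreboard", 16),
    Edge("scoreboard", "coverage", 8),
]
PLACEMENT = {"stimulus": 0, "dut": 0, "scoreboard": 1, "coverage": 1}

if __name__ == "__main__":
    for edge in cut_plan(EDGES, PLACEMENT):
        print(edge)
    print("signals crossing the cut:", cut_width(EDGES, PLACEMENT))
```

The partitioner's input here is nothing more than the edge list and a placement; the crossing set and its width fall out of a single pass, with no inference over the RTL.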

G.8 Observation continuity

Because the verification actors are the same authored actors, re-rendered rather than rewritten, the methodology carries across the swap intact.

• The same scoreboard runs. It checked a behavioral actor at architecture-exploration time, a synthesizable RTL block during DV, and now is itself synthesized onto the fabric, wired to the synthesized DUT by the same `WIRE edge. The properties it checks do not change.
• The same coverage actor runs, and its bins port through the IR. Coverage bins are defined at the message level, not the signal level, and, expressed at the intermediate-representation level (the simulator-independent-coverage result, ASPLOS 2023), produce identical coverage maps in a software simulator, on an emulator, and under formal, merging trivially across them. A “ready/valid handshake” coverage metric is coverage of `WIRE edges by another name.
• The same RAL drives stimulus. Symbolic register access (ral.write_reg("CTRL", val)) becomes a typed register-write message the synthesized RAL actor unfolds into bus transactions on the fabric; only the human-issued commands and summary replies cross the seam.
• The same regression infrastructure aggregates. A central dashboard subscribed to test-pass and coverage messages aggregates identically whether they came from simulation or hardware; actor_distributed_pkg already does this across machines, and the fabric is one more endpoint on the same bus.
• The same workload drives both. A version-controlled workload description (the FireMarshal pattern, ISPASS 2021) drives the functional simulation and the hardware run unchanged.

The continuity is architecturally necessary, not aspirational: the framework knows only typed messages and `WIRE edges and does not know whether the actor across an edge is in this process, on another machine, or in hardware. That ignorance is the methodology’s strength.

G.9 Performance: FMR, Amdahl, and the seam

FPGA-based execution is faster than software simulation but introduces timing realities the software framework does not face.

Direct versus decoupled (FMR). A direct FPGA prototype runs one host cycle per target cycle (an FPGA-cycle-to-model-cycle ratio, FMR, of 1) at maximum speed but cannot host FPGA-hostile structures. A decoupled simulator (FireSim) runs a variable host-cycle ratio (FMR $\geq 1$): it is slower, but it tolerates host latency, hosts FPGA-hostile structures, and stays deterministic; representative figures are FMR $\approx 1.0$–$1.5$ unoptimized, rising to $\approx 7$–$9$ when a host-side DRAM timing model is bound in (Golden Gate / FASED). The fabric clock itself is 50–200 MHz on mid-range devices; the effective target rate is that clock divided by FMR.

Amdahl on the seam. The system rate is governed by the fraction $f$ of workload cycles that block on a host round trip (Chapter 7, §7.16). The actor methodology drives $f$ toward zero by construction: the verification actors are on the fabric, so per-transaction traffic is intra-fabric wires, and the seam carries only workload-level read-out and commands. The per-cycle control loop that would pin $f$ near one in a host-resident UVM flow simply does not exist here.
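
The two effects above can be combined in a back-of-envelope model. The Python sketch below is one simple way to write it down, not the formulation of Chapter 7 §7.16, which is not reproduced here: it assumes each target cycle costs FMR fabric cycles, and that a fraction f of target cycles additionally stalls for one full host round trip. The function names and the example numbers are illustrative assumptions.

```python
def effective_target_hz(fabric_hz: float, fmr: float) -> float:
    """Target clock rate seen by the workload: fabric clock divided by FMR."""
    if fmr < 1.0:
        raise ValueError("FMR is at least 1")
    return fabric_hz / fmr


def system_rate_hz(fabric_hz: float, fmr: float, f: float, round_trip_s: float) -> float:
    """Average target cycles per second when a fraction f blocks on the host.

    Assumed model: mean time per target cycle = FMR / fabric_hz + f * round_trip_s.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    cycle_s = fmr / fabric_hz
    return 1.0 / (cycle_s + f * round_trip_s)


if __name__ == "__main__":
    fabric = 100e6          # assumed 100 MHz fabric clock
    round_trip = 1e-6       # assumed 1 microsecond host round trip
    print(effective_target_hz(fabric, 8.0))
    for f in (0.0, 0.001, 0.1, 1.0):
        print(f, system_rate_hz(fabric, 1.0, f, round_trip))
```

Under these assumed numbers, a 100 MHz fabric at FMR 1 drops to roughly 1 MHz when every cycle blocks on a 1 µs round trip, and stays near the full fabric rate when f is close to zero, which is the regime the on-fabric verification actors are meant to reach.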

Throughput, latency, trace. Per-transaction traffic never crosses the seam (it stays on the fabric); only summaries do, at workload rate. If even the summary stream outruns the host, the bridge reports backpressure through the standard try_publish-returns-false mechanism and the bounded-mailbox discipline (actor_pkg’s capacity parameter) absorbs it. A host-to-fabric round trip ranges from hundreds of nanoseconds at best (PCIe) to microseconds or more (USB/socket); the seam tolerates it because it carries workload-level traffic, not a per-cycle loop. Trace bandwidth is bounded by construction: the coverage actor aggregates on the fabric and publishes summaries; a tracing actor stores full traces in on-fabric RAM and dumps on demand (the DESSERT snapshot-and-replay pattern, FPL 2018, is the open precedent). Multi-clock designs are handled by the framework’s clock discipline; FireSim’s clock-token mechanism is the decoupled-simulator instance of it.
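
The back-pressure behaviour can be pictured with a toy model of a bounded mailbox. This is a Python sketch of the idea only: the class BoundedMailbox and its take method are hypothetical, and only the try_publish name and the notion of a capacity parameter come from the text above; the real actor_pkg interface may differ.

```python
from collections import deque


class BoundedMailbox:
    """Toy model of a fixed-capacity mailbox with non-blocking publish."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._queue = deque()

    def try_publish(self, msg) -> bool:
        """Enqueue if there is room; return False to signal back-pressure."""
        if len(self._queue) >= self.capacity:
            return False
        self._queue.append(msg)
        return True

    def take(self):
        """Consumer side: remove and return the oldest message, or None."""
        return self._queue.popleft() if self._queue else None


if __name__ == "__main__":
    mailbox = BoundedMailbox(capacity=2)
    print([mailbox.try_publish(s) for s in ("summary-0", "summary-1", "summary-2")])
    print(mailbox.take(), mailbox.try_publish("summary-2"))
```

A producer that sees False holds its message and retries later, so a slow host consumer throttles the summary stream without dropping anything and without unbounded buffering on the fabric.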

G.10 Metasimulation: the conversion is provable without hardware

The FireSim back-end makes the conversion not merely automatic but provable on a laptop. FireSim metasimulation runs the FAME-transformed graph under Verilator with no FPGA, and the platform guarantees that what metasimulation observes is bit- and cycle-exactly what an FPGA run produces. The whole conversion (every actor rendered, the `WIRE edges as token channels, the seam bridge) is therefore validated before any bitstream exists.

appG_firesim_substrate_swap is the shipped artifact. The verification loop — a stimulus actor (a 16-bit LFSR), an accumulator DUT, a scoreboard actor (golden model, expected-value FIFO, comparator), and a coverage actor — is wired into one fabric and run two ways: the four actors as software objects, and the same four synthesized under Verilator. The results are identical (256 transactions, zero mismatches, full coverage); Yosys confirms every actor maps to gates (the scoreboard’s 231 flip-flops carry its golden model, FIFO, and counters, beside the DUT’s 33); and the ./firesim/ scaffold carries the identical fabric onto FireSim, where the host reads only the final counters. The scoreboard and coverage are on the fabric, not on the host — the substrate swap demonstrated end to end, not asserted.
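
For readers without the artifact to hand, the shape of the software rendering can be sketched as follows. This Python sketch is not the shipped appG_firesim_substrate_swap code: the LFSR polynomial and seed, the 32-bit accumulator width, and the four coverage bins on the top two stimulus bits are all assumptions chosen for illustration. The golden model and the DUT here perform the same arithmetic, so the zero-mismatch result holds by construction; the point is the wiring of stimulus, DUT, expected-value FIFO, comparator, and coverage.

```python
from collections import deque


def lfsr16(state: int) -> int:
    """One step of a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11; assumed)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


class Accumulator:
    """Stand-in DUT: a wrapping 32-bit running sum (width is an assumption)."""

    def __init__(self):
        self.total = 0

    def step(self, value: int) -> int:
        self.total = (self.total + value) & 0xFFFFFFFF
        return self.total


def run_loop(transactions: int = 256, seed: int = 0xACE1):
    state = seed
    dut = Accumulator()
    golden_total = 0            # scoreboard's golden model
    expected = deque()          # scoreboard's expected-value FIFO
    mismatches = 0
    bins_hit = set()            # coverage actor: top two stimulus bits
    for _ in range(transactions):
        state = lfsr16(state)                       # stimulus actor
        golden_total = (golden_total + state) & 0xFFFFFFFF
        expected.append(golden_total)
        observed = dut.step(state)                  # DUT
        if observed != expected.popleft():          # comparator
            mismatches += 1
        bins_hit.add(state >> 14)
    return transactions, mismatches, len(bins_hit)


if __name__ == "__main__":
    print(run_loop())
```

The same four roles, rendered to RTL and run under Verilator, are what the shipped example compares against its software rendering.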

G.11 What is shipped, what is roadmap

Honesty about the boundary, consistent with Chapter 7:
• Shipped: Appendix E’s counter actor synthesized and placed-and-routed on iCE40 at 126.6 MHz; the substrate-swap example’s whole verification loop synthesized to gates and run in metasimulation against its software rendering with identical results; the stimulus generator’s constraint solver compiled to fabric on a production constraint set and checked both directions against the reference solver (Appendix F); the bridge/transport architecture of §G.6 in its distributed-transport form (Appendix L).
• Mechanism, grounded in published work but not re-implemented here: the automatic FIRRTL-to-cycle-exact rendering (Golden Gate), the FireAxe fast-mode partition, the IR-level coverage, the FireMarshal workload artifact, the FireSim bridge emission.
• Roadmap: the end-to-end bring-up of a full SoC’s actor graph on a commercial emulator and on an FPGA board. This is bounded engineering on the architecture this appendix specifies, not architectural exploration.

The per-actor synthesis and the whole-loop fabric that underwrite the claim are demonstrated; the industrial-scale bring-up is the next step.

G.12 What this appendix has established

• The conversion of an actor testbench from simulation to hardware has no manual step: each of the four manual efforts the status quo requires (re-code, hand-plumb, relocate coverage, hand-partition) becomes a compiler pass or a graph walk over a declaration the engineer already wrote (§G.1).
• The first step — rendering each actor to RTL — is mechanical and cycle-exact, with a formal guarantee borrowed from the open Golden Gate/LI-BDN stack (§G.2).
• One authored graph compiles to three back-ends: open-source FPGA, commercial emulator (SCE-MI seam), and FireSim (FireSim-bridge seam over the FAME-transformed target); only the compiler and the generated seam adapter differ (§G.4, §G.6).
• The seam adapter is generated, not authored, in all three forms; the `WIRE graph is the partition plan; coverage and workload port through the IR and the workload artifact (§G.6–§G.8).
• The conversion is provable without hardware via FireSim metasimulation, and a runnable artifact (the substrate-swap example) demonstrates it end to end (§G.10).

The flow from class-based actor to running hardware is a sequence of well-understood stages, all backed by open-source (Yosys, nextpnr, Golden Gate) or vendor (Vivado, Quartus, the emulator compilers) tooling. There is no novel tool chain required — only the discipline of authoring the testbench as an actor graph, which Appendix E shows costs nothing at the silicon level and Chapter 7 shows buys the automatic substrate swap.