Emergent Hardware Verification

Chapter 7 Emulation

Chapter 1 introduced Emergent Verification as a methodology with three properties — adaptive planning, lightweight architecture, and a Formal + Simulation + Emulation pipeline running concurrently in a sprint cycle. The third property is itself a three-leg pipeline: formal, simulation, emulation. Chapter 3 took the formal leg. Chapters 4–6 took the simulation leg, ending at the actor framework whose typed-message wiring is the EV substrate the rest of the book builds on. This final chapter takes the third leg of the pipeline.

Emulation is the verification step where the RTL stops running as software on a workstation and starts running as logic on a specialized machine: custom emulation silicon (Cadence Palladium’s Boolean-processor array, Siemens Veloce’s emulator-on-chip fabric) or an array of large FPGAs (Synopsys ZeBu and Cadence Protium as single-reference-clock emulators, Synopsys HAPS as a distributed-clock prototype). The throughput multiplier is the headline: software simulation runs the design at single-digit kHz to low MHz; an emulator (custom silicon, or a single-reference-clock FPGA platform such as ZeBu or Protium) runs the same RTL in the 1–10 MHz range; an FPGA prototype with distributed clocking (HAPS and the custom FPGA-array boards in its class) can climb into the tens of MHz; silicon runs at GHz. Across that gradient, every order of magnitude in speed costs an order of magnitude in debug visibility, capacity, or compile and setup time. The chapter’s first half is the engineering reality of that gradient. The second half is the methodology argument: in EV, emulation is not the post-RTL deck the simulation team hands off to the software team; it is the third concurrent leg of the sprint, made operationally cheap because the actor framework’s typed messages and `WIRE edges survive the substrate swap. The book ends where that loop closes.

The chapter’s central contribution. The second half makes a claim this book has been building toward and that, to the author’s knowledge, no production verification methodology has made before: a complete testbench can move from software simulation to hardware emulation with no manual rewrite of any kind. Not a hand-written synthesizable bus-functional model; not a re-coded transactor; not a second, host-resident testbench reaching the design through hand-built SCE-MI plumbing; not a hand-tuned partition; not a re-instrumented coverage model. The same authored actor graph that ran in simulation — stimulus, device under test, scoreboard, coverage, register layer, and all — is re-rendered onto the emulator by the framework and the synthesis compiler, and the engineer changes nothing. The claim holds for both substrates a 2026 program would target: a commercial emulator (Palladium, ZeBu, Veloce) reached across the SCE-MI transactor boundary, and FireSim — the open-source, FPGA-accelerated, cycle-exact simulation platform — reached across its FAME/bridge boundary. One authored graph; two hardware substrates; zero manual conversion. This is, to the author’s knowledge, a first-in-industry property, and §7.17 develops it rigorously, with a runnable artifact (appG_firesim_substrate_swap) behind it.

Why this is not the way it is done today. The status quo is the opposite. Because the dominant testbench language (UVM/SystemVerilog) is non-synthesizable by construction, moving verification onto an emulator today is a manual port: the testbench stays on the host workstation; the part that must touch the emulator-resident design is hand-rewritten as a synthesizable BFM and a SCE-MI transactor; verification IP that cannot cross is re-coded in C++; coverage is hand-moved into a module bound into the emulator image; the partition is hand-hinted. Published flows state this plainly — in one DVCon/CDNLive Palladium case study the UVM Ethernet VIP “could not be reused” on the emulator and “was rewritten in C++.” That per-port, per-program rewrite is the tax this chapter removes. The landscape sections that follow establish why the tax exists (the transactor boundary, Amdahl’s law on a two-rate machine, the synthesizability wall); the contribution section shows why an actor-authored testbench does not pay it.

7.1 The third leg of Emergent Verification

The Chapter 1 picture (§1.6) put three lanes running in parallel through every sprint: formal beginning at the block and, via propagation of proven knowledge across interfaces, carrying up to subsystem, system, and even whole-SoC scope; simulation at the subsystem and system level; emulation at the SoC level, all three feeding the same coverage and bug-tracking infrastructure. Most of that picture was promissory: the book had not yet built the substrate that would let a single verification environment span the three. Chapter 6 built it: actors with typed messages, declarative `WIRE edges, mailboxes as ports. Appendices E and G showed the substrate continues to work after synthesis: the whole actor graph (design and verification actors alike) re-renders onto the fabric, with a bridge actor only where a genuine software-to-hardware seam remains (the host read-out, external I/O). This chapter is where the third lane finally runs.

The framing matters because it changes what emulation is for. The classical view is hand-off: simulation team verifies the RTL block by block; once the SoC integration freezes, the emulation team builds the deck, brings up the workloads, and the software team starts boot-up. Emulation is a downstream consumer of stabilized RTL. The EV view is concurrent: emulation comes online incrementally as blocks mature, in the same sprint that adds the formal proofs and the simulation tests, on the same actor substrate. The verification investment compounds across the three legs rather than being rebuilt for each.

What makes the EV view operationally possible is not new emulation hardware. The emulators have existed for decades. What makes it operationally possible is the substrate: a verification framework that does not distinguish between an actor running in software, an actor running on the host as a proxy for an actor running on hardware, and an actor running on an emulator’s processor array. The framework only knows about typed messages and `WIRE edges. The substrate swap that turns a 1 kHz Verilator run into a 5 MHz Palladium run does not change the scoreboards, the coverage actors, or the regression dashboards. That is the methodology claim this chapter delivers on.

The reason this matters operationally is also worth front-loading: emulators are logic evaluators, not general-purpose computers. They cannot run SystemVerilog’s dynamic OOP idioms — no new(), no randomize(), no run-time class allocation. A traditional UVM testbench therefore stays on the host workstation and talks to the emulator-resident DUT through a transactor. The transactor’s bandwidth becomes the system bottleneck; the rest of the chapter’s methodology argument hinges on what to do about that. The actor framework’s answer (Chapter 6 and Appendix E) is to make the testbench itself synthesizable so most of it runs on the emulator alongside the DUT — which collapses the boundary traffic to what genuinely has to cross.
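The bottleneck argument can be made concrete with a back-of-envelope model. The sketch below is illustrative only: the function name, the 5 MHz emulator clock, and the 20 microsecond host round-trip are assumptions chosen for the example, not measurements of any platform, and the model assumes every crossing blocks the emulator.

```python
def effective_rate_hz(emu_clock_hz: float,
                      crossings_per_cycle: float,
                      round_trip_s: float) -> float:
    """Effective design-cycle rate when each host crossing stalls the emulator.

    Assumed model: one design cycle costs 1/emu_clock_hz on the emulator,
    plus round_trip_s for every blocking crossing of the transactor boundary.
    """
    time_per_cycle = 1.0 / emu_clock_hz + crossings_per_cycle * round_trip_s
    return 1.0 / time_per_cycle


EMU_CLOCK_HZ = 5e6     # assumed emulator clock
ROUND_TRIP_S = 20e-6   # assumed host round-trip latency per crossing

# From "every cycle crosses" down to "one crossing per 10,000 cycles".
for crossings in (1.0, 0.01, 0.0001):
    rate = effective_rate_hz(EMU_CLOCK_HZ, crossings, ROUND_TRIP_S)
    print(f"{crossings:>8} crossings/cycle -> {rate / 1e3:10.1f} kHz")
```

Under these assumed numbers, a testbench that crosses the boundary on every cycle holds the run to about 50 kHz, while one that crosses once per ten thousand cycles runs within a few percent of the emulator clock. That is the shape of the argument for moving the testbench onto the emulator.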

Notation: `WIRE. The chapter refers to `WIRE on most of its pages; the reader arriving here without Chapter 6 deserves a one-paragraph gloss. `WIRE(producer, MsgType, consumer) is the actor framework’s only wiring primitive (Chapter 6; actor_pkg.sv line 203). It declares that consumer will receive every message of type MsgType that producer publishes; the framework’s publish() looks up the message’s type at runtime and fans out only to the actors wired for that exact type. The macro body itself is one line, PRODUCER.add_subscriber($typename(PAYLOAD_T), CONSUMER), but it is the only primitive, so the topology is fully visible in the wiring code with no hidden edges. The leading character is a backtick (`), not a backslash; it is SystemVerilog’s macro-invocation marker. Verilog adopted the backtick because the language reserves # for timing constructs (#5 means “delay 5 time units”), so the C-style #define / #FOO convention could not be reused. The same backtick prefix appears throughout the framework on `PUBLISH, `PUBLISH_TO, and `PUBLISH_TRACED.
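The type-keyed fan-out described above can be sketched in a few lines. This is a Python analogy, not the framework’s SystemVerilog: the class Actor, the function wire, and the message classes are hypothetical names introduced for illustration, and only add_subscriber and publish echo names that appear in the text.

```python
class Actor:
    """Toy actor: a mailbox plus a table of typed outgoing edges."""

    def __init__(self, name):
        self.name = name
        self.inbox = []      # mailbox: messages delivered to this actor
        self._edges = {}     # message type name -> list of consumer actors

    def add_subscriber(self, type_name, consumer):
        self._edges.setdefault(type_name, []).append(consumer)

    def publish(self, msg):
        # Look up the message's exact type at runtime and fan out only
        # to the consumers wired for that type.
        for consumer in self._edges.get(type(msg).__name__, []):
            consumer.inbox.append(msg)


def wire(producer, msg_type, consumer):
    """Analogue of the wiring primitive: one line, one visible edge."""
    producer.add_subscriber(msg_type.__name__, consumer)


class BusWrite:          # hypothetical message type
    def __init__(self, addr, data):
        self.addr, self.data = addr, data


class IrqRaised:         # hypothetical message type
    def __init__(self, line):
        self.line = line


monitor = Actor("monitor")
scoreboard = Actor("scoreboard")
coverage = Actor("coverage")

wire(monitor, BusWrite, scoreboard)   # scoreboard sees bus writes only
wire(monitor, BusWrite, coverage)
wire(monitor, IrqRaised, coverage)    # coverage sees both types

monitor.publish(BusWrite(0x40, 0xFF))
monitor.publish(IrqRaised(3))
```

Because wire is the only call that creates an edge, reading the three wire lines gives the whole topology, which is the property the gloss attributes to the macro.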

Shift-left. The industry term for this pattern is shift-left — moving verification activities earlier in the project timeline, so bugs are found pre-RTL-freeze rather than post-tape-out. Formal verification shifts block-level proofs left; emulation shifts SoC-integration and software-stack bring-up left. The EV claim is sharper: the actor substrate is the strongest shift-left lever, because it requires zero testbench refactoring to move work from one substrate to another. A bug that would have been found at silicon bring-up (an Android-boot-time race condition, say) is now findable two sprints earlier on the emulator, using the same scoreboards that block-level simulation used three sprints earlier still. The shift-left distance is the entire pipeline; the substrate is what carries the verification investment across it.

Why the substrate is the load-bearing claim. Hardware emulation is older than Emergent Verification. IBM’s Yorktown Simulation Engine ran in 1982. Quickturn shipped CoBALT in 1997, and what is now Cadence Palladium descends from it. EVE, the company behind the ZeBu line that became Synopsys’s emulator, was founded in 2000. The novelty in this chapter is not the emulator; the novelty is the claim that a verification methodology can be defined such that moving a block from simulation to emulation requires no scoreboard rewrite, no coverage-bin redefinition, no testbench restructuring — only the re-rendering of the same actor graph onto the fabric, with a bridge actor at the one seam that genuinely crosses to the host. Appendix G states this property structurally; this chapter shows what it buys, at the scale of an industrial SoC, when the substrate underneath is a 5 MHz emulator rather than a single-digit-kHz software simulator. The substrate is what makes the three legs of EV operationally concurrent rather than sequential.

7.2 Why emulation exists

The motivating arithmetic is stark. A modern SoC — 10 B gates, multiple CPU clusters, GPUs, NPUs, IO complex, memory subsystem — presents three numbers that decide what verification can and cannot do.

• Verilator at full speed simulates one second of design time in roughly 1000 seconds of wall-clock time on a workstation, when the design is small. On a 10 B-gate SoC the simulation rate collapses to single-digit kHz, and one second of design time becomes hours or days of wall-clock time. Booting Linux takes weeks. Booting Android is impractical.
• A commercial simulator (VCS, Xcelium, Questa) on the same SoC adds perhaps an order of magnitude over Verilator on the right hardware, but the result is still days of wall-clock time for a workload that would run in seconds on silicon.
• A modern emulator (Cadence Palladium, Siemens Veloce, Synopsys ZeBu, Cadence Protium) puts the same RTL on its own processor array or FPGA array and clocks it at 1–10 MHz: Linux boots in under an hour, Android in a few hours, multi-hour workload longevity tests are feasible. An FPGA prototype (Synopsys HAPS, or a custom FPGA-array board in the same class) at tens of MHz drops Linux boot to a few minutes and Android to under an hour — the prototyper’s speed advantage is most visible on these long workloads.
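The arithmetic behind these three bullets is a single division. The sketch below assumes a Linux boot of about 10^10 design cycles; that figure is an assumption chosen because it is consistent with the wall-clock times quoted above, not a measured value, and the three rates are representative points picked from the ranges in the text.

```python
BOOT_CYCLES = 1e10   # assumed cycle count for a Linux boot (illustrative)

RATES_HZ = {
    "software simulation, 5 kHz": 5e3,
    "emulator, 5 MHz": 5e6,
    "FPGA prototype, 50 MHz": 5e7,
}


def wall_clock_seconds(cycles: float, rate_hz: float) -> float:
    """Wall-clock time needed to execute `cycles` design cycles at `rate_hz`."""
    return cycles / rate_hz


def humanize(seconds: float) -> str:
    """Render a duration in the largest unit that keeps the value >= 1."""
    for unit, size in (("days", 86400), ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} seconds"


for platform, rate in RATES_HZ.items():
    print(f"{platform:28s} {humanize(wall_clock_seconds(BOOT_CYCLES, rate))}")
```

With the assumed cycle count, the same boot takes about three weeks in simulation, about half an hour on the emulator, and a few minutes on the prototype.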

The ratio that matters is not absolute speed — silicon is faster than any of these — but the throughput required to exercise software-driven workloads before silicon arrives. Simulation cannot reach it. Real silicon does not yet exist. Emulation closes the gap. Without it, three classes of bug are by construction invisible until tape-out.

Bugs only emulation finds.

• OS-boot regressions. The kernel exercises page tables, exception vectors, MMU translations, cache-coherence transitions, interrupt routing, and DMA paths in a way no testbench writes by hand. Most “boot doesn’t work” bugs in pre-silicon SoCs surface for the first time on the emulator.
• Workload-driven power surprises. Average power numbers from gate-level simulation are computed over microseconds of stimulus. A 90-second camera-launch profile at a GHz-class clock is roughly $10^{11}$ cycles; only emulator-driven dynamic power analysis (Cadence DPA, Synopsys ZeBu Empower) captures the bad windows.
• Long-tail concurrency. DRAM-refresh-while-self-refresh, cache coherence races under memory pressure, and ECC double-bit-error accumulation emerge only after tens of billions of cycles.

The first class is the decisive commercial argument. A six-month software bring-up that begins at silicon arrival can be compressed to four weeks by starting on emulation at RTL freeze. Published case studies report 5$\times$ reductions in time to first-software-release readiness; in the AI-accelerator world the multiplier is larger because the software stack is newer.

A real-world analogy. Simulation is a kitchen prep cook practicing a single recipe in slow motion. Emulation is the same kitchen running a full dinner service for two hundred people, slowed to one-twentieth of real speed but cooking every dish that the real menu calls for. Silicon is the same kitchen at full speed on opening night. The prep cook spots ingredient errors; the slowed-down service spots workflow and timing issues no isolated recipe could uncover; opening night spots only what slipped through. The three are complementary; none substitutes for the others.

7.3 Two audiences, two purposes

Emulation hardware serves two distinct user audiences with different goals, and the platforms have evolved to match. Understanding the split clarifies why the industry has both emulators and prototypers, and why a single program typically owns both.

DV acceleration. The verification team has spent months building scoreboards, coverage actors, and a RAL in simulation. The bug rate in simulation has dropped, but the workloads they want to run — multi-hour longevity, exhaustive random regression, deep instruction-stream coverage — are too slow to finish in days of simulator time. Emulation gives them the same testbench at thousand-times speed. They want full debug visibility because every emulator run is potentially the run that surfaces a bug, and they need to inspect it. They want deterministic compile because the emulator is part of the CI loop and unpredictable turn-around breaks the regression cadence. Emulators built for this audience (the custom-processor-array family, plus the high-end FPGA-based emulators) optimize for high debug visibility, predictable compile, hours-not-days turnaround, and MHz-class speed. The substrate question matters here: a UVM testbench cannot run on the emulator (UVM is non-synthesizable), so it stays on the host CPU and the boundary transactor becomes the bottleneck. The synthesizable-actor substrate of Chapter 6 avoids this trap — the scoreboard, coverage actors, and stimulus generators all ride the emulator, and only what genuinely has to cross the boundary actually does. §7.17 returns to this.

Software bring-up. The software team needs to bring up a real operating system, real device drivers, real applications — on silicon that does not yet exist. In practice they use both emulators and prototypers, at different stages. Early bring-up runs on an emulator (often Palladium or its peers): a single-reference-clock custom-processor-array gives the debug visibility a software engineer needs when nothing yet boots, the kernel hangs in mysterious places, and “where exactly did execution go?” is the question. Once the RTL stabilizes and the boot path is reliable, the team migrates the long workloads (driver longevity, full Android boot, app benchmarks) to a prototyper with distributed clocking, where SoC speed climbs by an order of magnitude and a multi-hour workload becomes a multi-minute one. The software team does not pick between emulator and prototyper; they use both, sequentially, as their workload matures.

The same substrate, two questions. The verifier asks: “what is the SoC doing at cycle 7,348,291?” The software engineer asks: “can the kernel boot, can the driver bind, can the application complete in time.” Different questions; different debug requirements; different speed-versus-visibility trades. Every Tier-1 program buys hardware for both, and the same RTL flows between them. The actor substrate accommodates both without methodology change: the verifier’s scoreboards subscribe to the same typed messages whether the run is on an emulator or a prototyper; the software engineer’s RAL writes go through the same path. The platform underneath swaps; the verification and software-bring-up environments do not.

7.4 A short history of emulation

Hardware emulation has a forty-year history that the contemporary verification engineer rarely sees, but that explains why the platforms look the way they do. The history is brief; the patterns are durable.

1982: Yorktown Simulation Engine. IBM Research, frustrated with the simulation throughput available for the 3081 mainframe verification, built the YSE: a massively parallel custom-processor array whose entire job was to evaluate one RTL design per cycle. The architecture — thousands of identical Boolean evaluators, each computing one logic primitive per clock, fed from a shared crossbar to shared memory — set the template for what would become Palladium. The YSE was an internal IBM tool; it never shipped commercially, but the architectural choice (purpose-built processors, not general-purpose CPUs) defined the high-end of the industry for decades.

1988: Quickturn and the FPGA approach. Quickturn Design Systems started selling logic emulators based on commodity FPGAs (early Xilinx parts). The hardware was cheaper, the speed was slower, and the partitioning problem (how to split a design across $N$ FPGAs) was the central engineering challenge. The Quickturn CoBALT family, shipped in 1997, was a commercially successful tens-of-millions-of-gates emulator. It was not an FPGA machine but Quickturn’s custom-Boolean-processor line, built on technology licensed from IBM and sold alongside the company’s original FPGA products; billion-gate scale arrived later, in the 2000s, with subsequent CoBALT and Palladium generations. Cadence acquired Quickturn in 1999, and the architectural family eventually evolved into today’s Palladium.

2000: EVE founded, FPGA emulation matures. EVE (Emulation and Verification Engineering, founded in Paris in 2000) committed to the FPGA approach as Quickturn moved away from it. EVE shipped the ZeBu (Zero Bug) line through the 2000s, refining the per-FPGA compile, the inter-FPGA TDM, and the probe-network technology that defined what FPGA-based emulation could do. Synopsys acquired EVE in 2012, and ZeBu has continued as Synopsys’s emulator line ever since. Today’s ZeBu Server 5 carries the architectural DNA of the original 2000 system.

Mid-2000s: prototyping splits from emulation. The realization that the same FPGA silicon could be optimized differently — for raw speed at the cost of debug — spawned the prototyping category. Synopsys HAPS and the custom FPGA-array boards that followed it serve this market. The split between the two product categories is architectural, not functional: emulators drive the whole design from a single reference clock — the custom-processor arrays and the single-reference-clock FPGA platforms (ZeBu, Protium) alike; prototypers run distributed clocking, letting each IP block clock at its own frequency. Distributed clocking is what gives prototypers their order-of-magnitude speed advantage. Both categories serve both audiences (verification and software bring-up); software teams live on emulators during early bring-up because debug visibility matters when nothing yet boots, and migrate to prototypers later when the RTL stabilizes and raw throughput matters more than visibility.

Mid-2010s: hybrid emulation becomes dominant. AMD’s published GPU pre-silicon flow with QEMU + RTL co-simulation, the SCE-MI 2 (Standard Co-Emulation Modeling Interface) standard from Accellera, and AMD’s (formerly Xilinx’s) open-source libsystemctlm-soc library together established the pattern: fast CPU model in software, cycle-accurate RTL on the emulator, glued by TLM-2.0 transactors. The pattern is now the dominant approach across major SoC programs in 2020–2026; pure-RTL emulation is reserved for the cases hybrid cannot capture.

2020–2024: cloud emulation. The major vendors began offering emulation as a hosted cloud service, shifting the consumption model from owning boxes to renting time by the gate-hour. The economic argument: even Tier-1 programs cannot fill a private fleet, and Tier-2 programs were priced out of ownership entirely. Cloud emulation made the technology accessible to a broader range of programs and shifted the utilization risk to the vendors.

2024: the contemporary generation. The platforms a modern program would benchmark in 2025–2026 are the latest generations of Cadence Palladium, Siemens Veloce, Synopsys ZeBu, Cadence Protium (the emulators) and Synopsys HAPS (the prototyper). The architectural choices (custom processor array versus FPGA array) and the consumption choices (on-prem versus cloud) are now mature. The next-generation race is about capacity (multi-die chiplet SoCs), power-analysis accuracy, and cloud integration depth.

What the history reveals. Three observations the forty-year arc makes visible:

1. The custom-processor-array idea is older than the FPGA approach. YSE in 1982 predates Quickturn’s FPGA approach in 1988 by six years. The current Palladium architecture is a direct intellectual descendant of YSE; Veloce’s custom emulator-on-chip fabric and ZeBu’s commercial-FPGA array are the alternatives that came later.
2. The fundamental tradeoffs have not changed. Capacity versus speed versus debug versus compile time was the tradeoff in 1982; it is the tradeoff in 2026. The numbers have improved by orders of magnitude on every axis, but the shape of the choice has not.
3. The methodology layer is where 2024–2026 progress is happening. Hybrid emulation, cloud emulation, AI-accelerated debug — the architectural primitives are old. What is new is the methodology and the integration of emulation into the broader verification flow. This chapter’s argument — that the actor framework makes emulation methodology-free for EV — is one step on that methodology-layer arc.

7.5 The platform taxonomy

The verification engineer’s mental model has four tiers between workstation and silicon, separated by orders of magnitude on every axis (Figure 7.2).

Software simulation runs the RTL as a program. The simulator step-by-step evaluates every gate and flop. Compile is fast (seconds to minutes), debug is omniscient (every signal at every cycle, breakpoint anywhere, time-reverse with the right tool), but the absolute throughput tops out below 1 MHz for chip-scale designs and falls precipitously as the design grows. The 1 kHz–1 MHz range stated in the figure is the band; the daily reality with full UVM and waveform dumping is at the low end of it — commercial event-driven simulators on sub-system regressions average hundreds of Hz to low single-digit kHz; cycle-based Verilator with a C++ harness and standard debug averages tens to low hundreds of kHz. Stripped-down benchmarks reach 1 MHz; real verification runs rarely do.

Emulators are purpose-built machines whose entire job is to evaluate one RTL design as fast as their silicon can. Three architectural families exist (the lineage subsection below traces them), and for the practical tradeoffs they pair into two camps: custom silicon versus commercial FPGAs. In the custom-silicon camp, Palladium compiles the RTL into a memory image and runs it on tens of thousands of identical Boolean processors — the same architectural choice that IBM made in 1982 — while Veloce runs it on purpose-built emulator-on-chip FPGA fabric. The commercial-FPGA family (ZeBu, Protium) partitions the RTL across an array of large commercially-available FPGAs, routes the inter-partition signals through a backplane, and runs the whole thing from a single reference clock that the partition strategy and FPGA timing can sustain. Both camps clock in the 1–10 MHz range, both support billions of gates, and both cost in the multi-million-dollar capital class. They differ structurally in debug visibility (memory-dump in custom silicon, probe-net-routed in FPGA arrays) and in compile determinism.

FPGA prototypes are not emulators in the formal sense. They share the FPGA silicon with the FPGA-based emulators but throw away the latter’s deterministic compile, full probe-network, and tight integration with the verification environment in exchange for raw speed. A prototyper (Synopsys HAPS, or a custom FPGA-array board in the same class) carrying the same RTL as an FPGA-based emulator runs an order of magnitude faster on the SoC clock — tens of MHz on its distributed clocking, higher still with substantial setup effort — but compiles in days, exposes only a small fraction of nets to probes, and treats debug as a recompile-driven activity. Prototypes carry the heaviest load when the SoC RTL is stable and the workload is throughput-bound — software stacks that need real-time end-to-end runs are the typical case — though DV teams also use prototypers for late-stage longevity and stress runs where compile cost is amortized over weeks of execution.

Real silicon is the destination. Native speed, no on-chip debug other than what design-for-test (DFT) explicitly exposes, post-fabrication only. Real silicon catches what slipped through the three pre-silicon tiers; everything an emulator finds before it is a tape-out spin saved.

The tradeoffs as one sentence. Every order of magnitude in speed costs an order of magnitude in debug visibility, capacity, or compile time. The four tiers are points on that curve; there is no single tier that wins on all axes, and no architectural reason to expect one. Mature programs run all three pre-silicon tiers concurrently, on the same RTL, with each tier handling the workload that matches its tradeoff.

The deeper axis: prototype versus decoupled simulator. The four tiers are the engineer’s operational view; the academic literature draws a sharper line within the FPGA-based tiers, and the chapter’s contribution turns on it. An FPGA prototype maps the target one-to-one onto the fabric: one host clock cycle is one target cycle (an FPGA-cycle-to-Model-cycle Ratio, FMR, of 1). Speed is maximal, but the target must fit the fabric directly and FPGA-hostile structures (multi-ported register files, off-chip DRAM timing) have nowhere to go. An FPGA-accelerated simulator decouples host time from target time: one target cycle takes a variable number of host cycles (FMR $\geq 1$). It is slower, but it can host those structures, tolerate variable host latency, and stay deterministic. The FAME taxonomy (FPGA Architecture Model Execution; Tan et al., ISCA 2010) classifies every platform on three axes (direct versus decoupled, RTL versus abstract model, single- versus multi-threaded), and the load-bearing sentence is that “host-target decoupling is what differentiates an FPGA-accelerated simulator from an FPGA prototype” (Biancolin et al., FASED, FPGA 2019). Commercial emulators lean toward the prototype end (compiled RTL, near-direct execution); FireSim is squarely the decoupled-simulator end (the FAME-1 transform, §7.18). The distinction matters here because an actor graph compiles to both ends from one source (§7.17): a latency-insensitive actor network is exactly the structure that sits above the direct/decoupled split.
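The FMR figure is easy to compute once the host-cycle cost of each target cycle is known. The sketch below treats FMR as host cycles spent per target cycle simulated; the per-cycle cost lists and the 50 MHz host clock are made-up illustrative values, and the function names are hypothetical.

```python
def fmr(host_cycles_per_target_cycle):
    """FMR: total host cycles spent divided by target cycles simulated."""
    costs = list(host_cycles_per_target_cycle)
    return sum(costs) / len(costs)


def target_rate_hz(host_clock_hz, fmr_value):
    """Effective target clock seen by the workload."""
    return host_clock_hz / fmr_value


HOST_CLOCK_HZ = 50e6   # assumed FPGA host clock

# Direct mapping (prototype): every target cycle costs exactly one host cycle.
direct = [1] * 8

# Decoupled (illustrative): most target cycles cost one host cycle, but two
# of them stall while a timing model for off-chip DRAM produces its answer.
decoupled = [1, 1, 4, 1, 1, 1, 6, 1]

for label, costs in (("direct", direct), ("decoupled", decoupled)):
    ratio = fmr(costs)
    rate_mhz = target_rate_hz(HOST_CLOCK_HZ, ratio) / 1e6
    print(f"{label:10s} FMR = {ratio:.2f}, target rate = {rate_mhz:.1f} MHz")
```

In this example the decoupled run spends 16 host cycles on 8 target cycles, an FMR of 2, so the workload sees half the host clock. The target still observes the same sequence of target cycles; only the host time per cycle varies.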

7.6 The industrial platforms

The emulation market splits along an axis of compile determinism versus raw throughput. Custom-silicon emulators (Cadence Palladium’s processor array, Siemens Veloce’s emulator-on-chip fabric) deliver full waveform visibility, quick and deterministic compile, fast bug-fix turn-around, and a decent run rate. This combination serves DV acceleration well and also serves early software bring-up well, since debuggability matters most when nothing yet boots. FPGA-based platforms span the speed range: Synopsys ZeBu and Cadence Protium are single-reference-clock emulators (Protium is marketed as a prototyping system but sits in the emulator class under this clocking criterion), and Synopsys HAPS is a distributed-clock prototyper. The single-reference-clock members match the processor-array emulators at 1–10 MHz, while the distributed-clock prototypers push runtime throughput an order of magnitude higher, accepting longer compile times and reduced debug visibility as the price. This combination serves late-stage software workloads on stable RTL well, and also serves DV teams that need high-gate-count capacity at the cost of probe density. Neither family is exclusive to one audience; the architectural choices set defaults, not assignments.

The specific product versions and capacity figures these vendors ship change every generation; the architectural choices stay stable. This section walks the two families by architecture and audience, not by product SKU.

Custom-silicon emulators (Palladium, Veloce)

The custom-silicon approach descends from IBM’s Yorktown Simulation Engine via the Quickturn CoBALT family (§7.4). The defining choice: do not use commodity FPGAs. Palladium maps the design onto a custom processor array whose individual elements are tiny Boolean evaluation engines, each capable of computing one logic primitive (AND, OR, XOR, MUX) per cycle; a modern system packs thousands of these processors per die and networks dies into a single emulation engine of billions of gates per chassis (Figure 7.3). Veloce reaches the same operating point through purpose-built emulator-on-chip FPGA fabric rather than Boolean processors — a different execution element with the same custom-silicon properties, deterministic compile and full visibility.

Compile and run. A DUT compiles into a memory image of operations and operands. Every cycle, every processor reads its instruction, fetches operands from a shared crossbar, computes one Boolean, and writes the result back. Compile times scale linearly with design size and are deterministic — the compile cost does not blow up as the design fills the box, and incremental recompiles after a small RTL change finish quickly. Honest production-grade run rates on flagship SoCs sit in the low-MHz range under load; best-case demonstrations on isolated CPU subsets reach higher, but the SoC-scale number is the one to plan against.
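A toy version of the compile-and-run loop makes the memory-image idea concrete. The sketch below is an illustration, not a model of any vendor’s machine: the instruction format (destination, operation, two operand names) and the full-adder netlist are hypothetical, and a real array evaluates many instructions in parallel per step where this sketch walks them serially in dependency order.

```python
# One entry per Boolean primitive a processor can evaluate.
OPS = {
    "AND": lambda a, b: a & b,
    "OR":  lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}

# "Compiled" image for a one-bit full adder: each instruction names a
# destination net, an operation, and two operand nets. The list is in
# dependency order, so every operand is written before it is read.
FULL_ADDER_IMAGE = [
    ("t0",   "XOR", "a",  "b"),
    ("t1",   "AND", "a",  "b"),
    ("sum",  "XOR", "t0", "cin"),
    ("t2",   "AND", "t0", "cin"),
    ("cout", "OR",  "t1", "t2"),
]


def run_cycle(image, inputs):
    """Evaluate one design cycle; return the value of every net."""
    nets = dict(inputs)                 # net memory, seeded with primary inputs
    for dest, op, src_a, src_b in image:
        nets[dest] = OPS[op](nets[src_a], nets[src_b])
    return nets


nets = run_cycle(FULL_ADDER_IMAGE, {"a": 1, "b": 1, "cin": 0})
print(nets["sum"], nets["cout"])
```

Two properties of the text show up even at this scale. The cost of a cycle is proportional to the number of instructions, which is why compile and run behavior scale predictably with design size, and every intermediate net (t0, t1, t2) is sitting in memory after the cycle, which is the basis of the visibility property discussed next.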

Full visibility, no recompile. Every net’s value lives in some processor’s memory at every cycle. Adding probes is a memory-dump command, not a recompile. The debug stacks built on this property (Cadence FullVision / Verisium Debug, Siemens equivalents) lean on memory-dump fast-paths to deliver waveform windows on demand. Full waveform dumping with quick turn-around is the architectural property that makes this family attractive for both DV acceleration and early software bring-up — a bug that surfaces at minute 47 of a Linux boot is debugged immediately with full context, no recompile in the loop, whether the engineer running the workload is a verifier closing a regression or a kernel driver author hunting an integration issue.

The bandwidth caveat. “100% visibility” is the architectural property; streaming 100% of nets to disk is a physical-bandwidth problem. Dump an FSDB waveform for billions of gates across millions of cycles and the PCIe channel from the emulator to the host workstation saturates, the disk array fills in minutes, and the emulator’s effective speed collapses to single-digit kHz — back into simulation regime. The operational practice is triggered dumping: arm a trigger condition (a specific actor message published, a counter reaching a threshold, a coverage bin firing), capture only the window of cycles around the trigger, dump that window at full resolution, discard the rest. The processor-array architecture wins on probe availability (any net is accessible without recompile); the disk-bandwidth physics wins on probe dumping, and every Tier-1 emulation flow learns to live within that envelope.
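Triggered dumping is, at its core, a ring buffer plus a post-trigger countdown. The sketch below shows the idea on a plain stream of samples; the function name, the window sizes, and the sample stream are hypothetical, and a real flow would arm the trigger inside the emulator runtime instead of in host Python.

```python
from collections import deque


def triggered_capture(samples, trigger, pre=3, post=2):
    """Return (cycle, sample) pairs in a window around the first trigger hit.

    Only the last `pre` cycles are ever retained before the trigger fires,
    so the storage cost is bounded no matter how long the run is.
    """
    history = deque(maxlen=pre)          # ring buffer of pre-trigger cycles
    stream = iter(enumerate(samples))
    for cycle, sample in stream:
        if trigger(sample):
            window = list(history) + [(cycle, sample)]
            # Keep `post` more cycles, then stop dumping.
            for _, later in zip(range(post), stream):
                window.append(later)
            return window
        history.append((cycle, sample))
    return []                            # trigger never fired: nothing dumped


# Illustrative stream: a counter value sampled once per cycle.
stream = range(100, 120)
window = triggered_capture(stream, trigger=lambda v: v == 110)
print(window)
```

The design choice is that nothing outside the window is ever written out: the cycles before the trigger are discarded as the ring buffer wraps, and the capture stops after the post-trigger count.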

Use models. The runtime catalogues document many named use cases; for our purposes they reduce to three primitives.

• In-circuit emulation (ICE). The DUT lives on the emulator; the rest of the system is real silicon (PCIe, Ethernet, USB, DDR) connected through rate-adapters (Cadence SpeedBridge, vendor equivalents from Siemens and Synopsys) that translate the emulator’s MHz-class clock to the GHz target. ICE is the mode for protocol-compliance testing against real test gear and for live driver bring-up against real interconnect silicon.
• Virtual / co-modeling. All stimulus is software — a SystemC TLM model of the host, a QEMU CPU complex, or a verification IP suite — driving the emulator over transactors at SCE-MI message granularity. Determinism, repeatability, and CI compatibility make this the dominant mode for AI-accelerator companies.
• Simulation acceleration. A small testbench in the simulator plus a large design on the emulator. Used when the testbench is too irregular to convert to transactors but the DUT is too big to run in software.

Dynamic Power Analysis. The vendor-specific power-emulation applications (Cadence DPA, Siemens Veloce Power) are the differentiator outside core performance. The mechanism is consistent across vendors: emulate the workload at full speed, stream toggle data per cycle to an analysis stage, feed only the high-power windows forward into signoff power tools (PrimePower, Joules) for gate-level accuracy. The point is not signoff accuracy — it is finding the bad workload windows in the first place. A multi-minute camera launch will never run through gate-level power simulation.

FPGA-based emulators and prototypers (ZeBu, Protium, HAPS)

FPGA-based emulation is the alternative architectural family. The compile flow is the classic three-stage one that defines this approach.

1. Partition the RTL graph across $N$ FPGAs. The compiler does this automatically from an EDIF netlist, plus the FPGA count and parallel-CPU budget. Manual partition hints are common at SoC scale.
2. Synthesize and place-and-route each partition independently. This is the long pole. Per-FPGA P&R is hours; cross-FPGA timing closure can extend it significantly.
3. Link the partitions via a direct-connect backplane, time-domain-multiplexing inter-FPGA signals across the chosen clocking strategy.

A single modern FPGA (the high-end AMD/Xilinx UltraScale+ parts the industry uses) clocks well into the hundreds of MHz natively. An SoC partitioned across many of them lands in a clock-speed range that depends sharply on the clocking strategy. This family is well suited to throughput-bound workloads on stable RTL — late-stage software bring-up is the most visible case, but DV teams also lean on FPGA-based platforms when the design has stopped churning and the remaining bugs need billions of cycles of stress to surface. The common factor is that the simulation-tractable issues have already been caught and what remains needs the runtime multiplier the FPGA platforms provide.

Clocking strategy — the structural split inside the FPGA family. Within FPGA-based platforms, two clocking strategies sit on opposite ends of an engineering-effort-versus-speed trade.

• Single-reference-clock systems drive every FPGA in the array from one shared clock. Inter-FPGA timing closure is uniform, the compile flow is more deterministic, and the system is straightforward to set up — but the system clock is bottlenecked by the slowest crossing in the entire fabric. Synopsys ZeBu and Cadence Protium configurations typically sit in this camp. SoC run rates settle in the low-MHz to high-single-digit MHz range; the speed ceiling is the cost of the simplicity.
• Distributed-clocking systems let each FPGA’s clock domain run independently, with carefully managed crossings between domains. The per-FPGA work can clock far faster — this is where the tens-of-MHz (and on small/well-partitioned designs, higher) prototyper SoC speeds come from — but the timing closure across distributed domains is the long pole, and the engineering setup cost is orders of magnitude higher than a single-reference-clock system. Synopsys HAPS and the custom FPGA-array boards in its class take this approach; the speed advantage comes at the price of program-level investment in clock-domain management.
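The "slowest crossing" bottleneck of a single-reference-clock system can be made concrete with a deliberately simplified model. The assumption here, which is the sketch's and not a vendor specification, is that each inter-FPGA crossing time-multiplexes `mux_ratio` logical signals onto one physical link toggling at `link_mhz`, so one design clock needs `mux_ratio` link slots. The crossing names and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Crossing:
    name: str
    link_mhz: float  # physical link rate (hypothetical value)
    mux_ratio: int   # logical signals sharing one physical wire

    def max_design_mhz(self) -> float:
        # One design clock must wait for every multiplexed signal to cross.
        return self.link_mhz / self.mux_ratio


def single_reference_clock_mhz(crossings: List[Crossing]) -> float:
    """With one shared clock, the whole array runs at the slowest crossing's rate."""
    return min(c.max_design_mhz() for c in crossings)


crossings = [
    Crossing("cpu-noc", link_mhz=800.0, mux_ratio=64),
    Crossing("noc-ddr", link_mhz=800.0, mux_ratio=128),
    Crossing("gpu-noc", link_mhz=800.0, mux_ratio=256),
]
print(single_reference_clock_mhz(crossings))  # prints 3.125
```

Under this model a single wide crossing sets the clock for every FPGA in the array, which is the speed ceiling the single-reference bullet describes; a distributed-clocking system avoids the global `min` at the price of managing each crossing separately.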

The choice is not academic. A team that picks a distributed-clocking platform for raw speed must staff for the setup; a team that picks single-reference for the simpler bring-up accepts the speed ceiling. Both are defensible engineering trades.

Visibility tradeoff. The architectural cost of the FPGA approach is that signal visibility is limited to what the FPGA’s probe-routing fabric can carry out at runtime. Without recompile, only a fraction of nets is accessible (typically a small percentage); 100% visibility is achievable but bandwidth-limited and often requires re-routing. The contrast with the custom-processor-array architectures is structural — it is not a software gap, it is what FPGAs and processor-arrays each are good at.

The routing cliff. The FPGA-array compile-determinism story has a second, sharper edge. Below roughly 70–80% box utilization, the partitioner finds inter-FPGA routes readily and compile times grow approximately linearly with design size. Above roughly 85–90% utilization, cross-FPGA timing closure becomes a cliff: a wide crossbar added by an RTL change can saturate inter-FPGA routing, and yesterday’s eight-hour compile becomes today’s three-day compile with no warning. The fix is rarely a code change to the new feature; it is a manual partition hint or a re-architected interconnect to relieve the congestion. Custom-processor-array emulators do not have this failure mode by construction — their compile is a memory-image layout problem, not a routing problem, so cost stays linear all the way to box capacity. This is the structural reason “compile determinism” on the comparison table separates the families: the routing cliff is what makes FPGA compile times unpredictable at the high end.

The two families compared

Table 7.1 sets the two architectural categories side by side, with the engineering tradeoffs that determine which audience each serves.

Table 7.1: Custom-silicon versus FPGA-based platforms.
	Custom-silicon (Palladium, Veloce)	FPGA-based (ZeBu, Protium, HAPS)

SoC run speed	low-MHz	single-ref clock: 1–10 MHz; distributed clock: tens-of-MHz
Cold compile (flagship)	predictable, hours	variable, hours to days
Incremental recompile	quick, deterministic	variable, hours
Probe budget	100% (memory-dump)	partial (probe-net routed)
Compile determinism	high	moderate to low
Clocking strategy	single ref (intrinsic)	single ref (ZeBu, Protium) or distributed (HAPS)
Setup effort	moderate	moderate to very high (distributed clocking is the costly one)
Power-emulation app	yes (DPA, Veloce Power)	yes (ZeBu Empower)
ICE support	full	full
Hybrid (QEMU + RTL)	full	full
Cloud option	hosted gate-hour pool	hosted gate-hour pool
Best fit	DV + early SW bring-up: full-wave dumping, quick bug-fix turnaround, deterministic compile	Late SW bring-up on stable RTL: fastest runtime, accepting higher compile cost and less debug visibility

Reading the table. Four observations:

1. The probe-budget axis is structural. Custom-processor arrays give 100% visibility for free; FPGA-based platforms give it bandwidth-limited or recompile-gated.
2. The compile-determinism axis splits the families. Custom-processor-array compile is predictable; FPGA compile depends on the slowest partition’s place-and-route, which can spike unpredictably as the design fills the box.
3. Speed differentiates within the FPGA family, by clocking strategy. Single-reference-clock platforms (ZeBu, Protium) cluster with custom-processor arrays at low-MHz; distributed-clocking platforms (HAPS) leap an order of magnitude — but absorb orders of magnitude more setup engineering.
4. The Tier-1 answer is rarely one platform. Major programs own platforms of both families, run them on the same RTL, and choose per-workload by debug-visibility, compile-time, capacity, or speed requirements. The diversified-vendor strategy is standard practice at the top of the industry.

The capital cost of these platforms is in the multi-million-dollar class for the emulators and the sub-million class for the prototypers; specific figures depend on configuration and vendor commercial terms that are rarely published.

Cloud. Both families — and the FPGA prototypers — are now available as cloud offerings: time on a vendor-hosted pool, billed by the gate-hour instead of bought as a box. The economics drive it. An emulator is a multi-million-dollar box that runs hot for the two months around RTL freeze and tape-out gates and sits idle the rest of the year, and even the largest programs cannot keep a private fleet busy year-round. Renting instead of owning is what makes it work: capacity scales elastically with demand, the cost of idle hardware disappears, and the technology reaches the Tier-2 programs that could never justify owning a box. Hosted alongside the vendor’s cloud debug and analysis tools, the turnaround matches an on-prem deployment.

Lineage and the 2024–2026 generations

Traced by lineage, the platforms form three lines rather than two, because the custom-silicon column of Table 7.1 has separate processor-array and custom-FPGA roots. Each line traces to a 1990s root, and the names on the boxes change every generation while the architecture does not. Processor-array: IBM YSE $\to $ Quickturn CoBALT (1997) $\to $ Cadence (1999) $\to $ Palladium I (2002) through XP2 (2013), Z1 (2015), Z2 (2021), Z3 (2024). Custom-FPGA: Mentor SimExpress (1996) $\to $ IKOS (2002) $\to $ Veloce (2007) $\to $ Siemens (2017) $\to $ Veloce CS (2024). Commercial-FPGA: EVE (2000) $\to $ ZeBu Server (2009, debuted at DAC) $\to $ Synopsys (2012) $\to $ Server 5 (2021) $\to $ ZeBu-200 (2025).

What changed in 2024–2026. The contemporary generation converged on shared FPGA silicon (the AMD Versal Premium VP1902 adaptive SoC) for the FPGA-based platforms, paired with custom chips for the processor-array emulators, and capacity jumped to tens of billions of gates: Cadence Palladium Z3 (a new custom emulation processor) paired with Protium X3 (the matching VP1902-based FPGA platform), both April 2024; Siemens Veloce CS, built on a purpose-built “Crystal” emulation chip (Veloce Strato CS quoted at up to 5$\times $ the prior Strato at full visibility, scaling from 40 million gates to 40+ billion gates; 2024); Synopsys ZeBu-200 (up to 15.4 B gates) with ZeBu Server 5 scaling beyond 60 B gates for multi-die verification (February 2025). The framing is uniformly AI-chip, multi-die / chiplet, and “digital-twin” software bring-up; analyst sizing puts hardware-assisted verification at roughly $0.76 B in 2025, growing past $3 B by 2035, with emulation the majority slice. These are vendor and analyst figures at announcement (directional, not independently measured), but the direction is the chapter’s: the capacity and software-bring-up pressure that make the automatic-conversion property valuable are accelerating, not plateauing.

The cross-vendor shape (analyst view). At a comparable generation (the Palladium-XP2 / Veloce 2 / ZeBu 3 era analyzed publicly by Rizzatti) the processor array wins compile speed, visibility-at-speed, and determinism; the commercial-FPGA platform wins raw clock — the highest of the three — and power and running cost, but pays in place-and-route compile time; the custom-FPGA platform sits between, with the strongest transaction-based verification. Each generation narrows the gaps. The point for this chapter is invariant across the generations: whichever family a program buys, the actor testbench converts onto it automatically, because the conversion targets the architecture-neutral property — synthesizable, latency-insensitive RTL — that all three families share.

7.7 Hybrid emulation: QEMU plus RTL

The dominant 2024–2026 pattern is hybrid emulation: a fast functional CPU model in software, cycle-accurate RTL on the emulator, glued together by TLM-2.0 transactors over SCE-MI. The CPU model is most often QEMU (open-source, very fast, AMBA-AXI bridged via libsystemctlm-soc); the RTL on the emulator is whatever IP is new and being verified — a new GPU, NPU, memory controller, IO complex.

Why hybrid dominates.

• CPU complexes are stable, peripheral IP is novel. An ARM Cortex you have shipped for five years adds no value when re-emulated; emulator capacity is too expensive to burn on commodity IP.
• Software boot times collapse. A QEMU-driven Android boot in roughly two hours is the headline talking point Cadence, Synopsys, and Siemens all use; an emulator-only model with the same RTL takes days.
• TLM-2.0 is genuinely portable. AMD’s libsystemctlm-soc is open source. The SCE-MI 2 standard from Accellera underpins all three vendors’ transactor implementations. The same testbench moves between simulator, emulator, and prototyper.

What hybrid costs. Boundary crossings are the slowest part of any cycle. A round trip from the host to the emulator and back is hundreds of nanoseconds at best over PCIe and microseconds typical over USB or socket transports (Appendix G names this latency). A “chatty” transactor design that crosses the boundary on every clock cycle therefore caps the effective system speed at single-digit-MHz or below, regardless of how fast the emulator clocks RTL natively. Buffer at the transaction layer, not the pin layer: an AXI burst as a single typed message crosses cheaply; per-pin signaling for the same burst crosses dozens of times and collapses throughput by an order of magnitude.

The QEMU–RTL synchronization model matters too: loosely-timed mode is fast but ignores cycle-level interactions; approximately-timed mode is slower but preserves them. Bugs that depend on cycle-accurate CPU behavior are by construction invisible in hybrid mode.

Published hybrid flows disable two specific things to keep the boundary tractable: cache coherence between the virtual CPU and the emulator-resident memory subsystem (the QEMU/Fast-Models CPU model cannot participate in the RTL coherence protocol cycle-accurately, so the hybrid configuration disables it and the workload must tolerate stale-cache behavior), and multi-core SMP across the virtual CPU side (ARM Fast Models in hybrid configurations typically run a single virtual core; SMP-dependent bugs do not surface). Teams who hybridize for speed schedule periodic “all-RTL” emulation runs for the cases the hybrid misses, specifically the coherence and SMP cases that the hybrid mode disables.
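The boundary-crossing cap can be worked through numerically. The sketch below assumes a blocking model in which the emulator stalls for the whole round trip each time it crosses to the host; that model, the function names, and the sample latencies are assumptions of this illustration, not measurements of any platform.

```python
def effective_mhz(emulator_mhz: float, round_trip_us: float, cycles_per_crossing: int) -> float:
    """Effective design clock when every `cycles_per_crossing` cycles pay one blocking round trip."""
    if emulator_mhz <= 0 or round_trip_us < 0 or cycles_per_crossing < 1:
        raise ValueError("rates must be positive and the batch at least one cycle")
    run_us = cycles_per_crossing / emulator_mhz  # MHz is cycles per microsecond
    return cycles_per_crossing / (run_us + round_trip_us)


def crossing_cap_mhz(round_trip_us: float, cycles_per_crossing: int) -> float:
    """Upper bound on effective speed, approached as the emulator clock grows without limit."""
    return cycles_per_crossing / round_trip_us


# Hypothetical numbers: a 5 MHz emulator, 0.5 us PCIe-class and 2 us socket-class round trips.
for label, rt_us, batch in [
    ("per-cycle crossing, 0.5 us", 0.5, 1),
    ("per-cycle crossing, 2 us", 2.0, 1),
    ("16-cycle burst per crossing, 0.5 us", 0.5, 16),
]:
    print(f"{label}: {effective_mhz(5.0, rt_us, batch):.2f} MHz "
          f"(cap {crossing_cap_mhz(rt_us, batch):.1f} MHz)")
```

With one crossing per cycle the cap is set entirely by the round-trip time, which is why batching at the transaction layer recovers most of the native emulator clock while a faster emulator alone does not.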

The actor reading. A QEMU-side actor and an emulator-side actor connected by ‘WIRE through whatever transport the SCE-MI implementation uses is precisely the proxy-actor pattern from Appendix G. The TLM transactions become typed messages; the same message types that the architecture model emitted (Appendix D, Steps 1–2) drive the QEMU side; the same scoreboard subscribes on the host. Hybrid emulation does not break the EV substrate — it instantiates it across a faster CPU front-end. The chapter returns to this point in §7.17.

7.8 The compile-and-run economics

Putting numbers to the tradeoffs (Table 7.2) makes the engineering reality concrete and explains why no platform dominates.

Table 7.2: Compile and run economics across the four tiers (typical, modern flagship SoC).

	Sim	Emulator	Prototype	Silicon
	(VCS / Verilator)	(Pall., ZeBu, Veloce, Protium)	(HAPS, S2C)	(post-tape-out)
Cold compile	30 s – 10 min	2–10 hrs	1–7 days	months
Incremental compile	seconds	20 min – 2 hrs	several hours	—
Run speed	1 kHz – 1 MHz	1–10 MHz	tens of MHz	GHz
Probe budget	100% nets	100% / probed (FPGA)	$\sim $10–20% nets	DFT only
Capacity / box	RAM-bound	billions of gates	billions of gates	1$\times $ chip
Capital	workstation	multi-million class	sub-million class	mask + wafer
Cost / engineer-hour	low	high	moderate	none
1 sec real workload	hours – never	100 – 1000 sec	3 – 100 sec	1 sec
Linux boot	impractical	under 1 hr	minutes	seconds
Android boot	impractical	a few hours	under 1 hr	seconds

Reading the table. Simulation wins below one second of real-time workload — the compile cost dominates and absolute speed does not matter. Emulators win from there out to multi-hour workloads — the only platform that can debug and re-run a workload at high gate count. Prototypes win above multi-hour workloads at stable RTL — software-team throughput at the cost of debug. Silicon wins above multi-day. The four tiers are not substitutes; they are points on the workload-duration curve, and a complete program uses all four.
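The "1 sec real workload" row follows from clock ratios alone, and a short calculation shows the shape of the curve. The sketch assumes a 1 GHz silicon clock and uses representative run speeds from the table's run-speed row; it ignores compile time, testbench overhead, and boundary stalls, so it is an idealized bound rather than a reproduction of the table's calibrated figures. The helper names are hypothetical.

```python
SILICON_HZ = 1e9  # assumed target clock for the finished chip


def wall_clock_seconds(workload_seconds: float, platform_hz: float,
                       silicon_hz: float = SILICON_HZ) -> float:
    """Time a platform needs to execute the cycles that silicon runs in `workload_seconds`."""
    cycles = workload_seconds * silicon_hz
    return cycles / platform_hz


def humanize(seconds: float) -> str:
    """Render a duration in the largest unit that keeps the number at or above one."""
    for unit, size in (("days", 86400.0), ("hours", 3600.0), ("minutes", 60.0)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} seconds"


TIERS = {
    "simulation at 1 kHz": 1e3,
    "simulation at 1 MHz": 1e6,
    "emulator at 1 MHz": 1e6,
    "emulator at 10 MHz": 1e7,
    "prototype at 50 MHz": 5e7,
    "silicon at 1 GHz": 1e9,
}

for name, hz in TIERS.items():
    print(f"{name}: {humanize(wall_clock_seconds(1.0, hz))} per second of real workload")
```

Scaling `workload_seconds` from one second to a multi-minute boot is what moves a workload from one tier's practical range into the next.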

The honest caveat. The vendors’ headline numbers — “tens of MHz emulation,” “billions of gates,” “$N\times $ faster than the previous generation,” “$\geq $95% power accuracy,” “Android boot in 2 hours” — are best-case demonstrated, not typical, and several depend on assumptions (small design size, isolated CPU subset, an emulator-mapped netlist that is not the tape-out netlist) that do not hold for a real flagship SoC. The numbers in the table are calibrated to what a real flagship SoC (10+ B gates, complex CPU / NPU / memory / IO mix) actually achieves with a competent team. Where the marketing diverges from the operational reality, the operational reality is what to plan against.

7.9 Cloud emulation and the shift from owning to renting

The emulator is a multi-million-dollar box that runs hot for the two months around RTL freeze and tape-out gates, and largely idle the rest of the year. Even the largest programs struggle to fill their fleets. Tier-2 semiconductor companies have historically been priced out of owning emulators at all.

The major vendors now offer cloud emulation, selling time on hosted emulator pools rather than boxes. The economics are simple: a team that needs a few hundred hours of emulation per quarter pays a fraction of what a private box costs to keep year-round and gets the same RTL turnaround.

The data-egress and IP-isolation problem. RTL is a company’s most sensitive asset. Cloud offerings all sell themselves on dedicated single-tenant hardware in isolated network enclaves, with the IP staying in the company’s own cloud account. The 2024–2026 market response has been broadly favorable, particularly among AI-accelerator startups and Tier-2 mobile/networking SoC vendors — the companies for whom owning a box was never viable. The broader Cloud EDA market is sized at roughly $4 B in 2025 and projected to roughly $7 B by 2034 (CAGR $\sim $6.4%); emulation is a meaningful and faster-growing slice.

The actor reading. The proxy-actor pattern from Appendix G is transport-agnostic by construction. “Whatever the FPGA platform supports” — the appendix’s phrasing — generalizes to ZMQ over wide-area network, NATS, libfabric over RDMA, or any other carrier the actor_distributed_pkg layer can speak. Cloud emulation is, from the actor framework’s point of view, one more transport. The verification environment does not change; the bridge’s serialization layer talks to a remote emulator pool instead of a PCIe DMA channel. The substrate ignorance the framework relies on (it does not care whether the actor on the other end of a ‘WIRE is in the same process, on another machine, on FPGA hardware, or in another data center) makes cloud emulation methodology-free for an actor-based environment.
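The transport-agnostic claim can be illustrated with a small sketch. It is not the actor framework's API: `Transport`, `LoopbackTransport`, `ProxyActor`, and the `AxiWrite` message are hypothetical names, and JSON from the standard library stands in for whatever serialization the real bridge uses. The point it shows is that the proxy depends only on a two-method carrier interface.

```python
import json
from collections import deque
from dataclasses import asdict, dataclass
from typing import Deque, Protocol


class Transport(Protocol):
    """Anything that can move bytes: socket, DMA channel, or wide-area carrier."""

    def send(self, payload: bytes) -> None: ...
    def recv(self) -> bytes: ...


@dataclass(frozen=True)
class AxiWrite:
    """Hypothetical typed message carried across the bridge."""
    addr: int
    data: int


class LoopbackTransport:
    """In-process stand-in for a real carrier; exposes the same two methods."""

    def __init__(self) -> None:
        self._queue: Deque[bytes] = deque()

    def send(self, payload: bytes) -> None:
        self._queue.append(payload)

    def recv(self) -> bytes:
        return self._queue.popleft()


class ProxyActor:
    """Serializes typed messages onto a transport it knows nothing else about."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def publish(self, msg: AxiWrite) -> None:
        self._transport.send(json.dumps(asdict(msg)).encode("utf-8"))

    def receive(self) -> AxiWrite:
        return AxiWrite(**json.loads(self._transport.recv().decode("utf-8")))


proxy = ProxyActor(LoopbackTransport())
proxy.publish(AxiWrite(addr=0x1000, data=0xDEADBEEF))
print(proxy.receive())
```

Swapping `LoopbackTransport` for a class that wraps a remote connection changes nothing in `ProxyActor`, which is the sense in which a cloud pool is one more transport.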

7.10 In-circuit emulation: when the emulator drives real silicon

ICE mode is the use case that gives emulation its industrial primacy. The emulator runs the new chip’s RTL at 1–10 MHz; the rest of the system is real silicon (a PCIe switch, a USB hub, an Ethernet PHY, a DDR DIMM, a power-management IC). SpeedBridge (Cadence) and equivalent rate-adapters from Synopsys and Siemens mediate between the emulator’s MHz-class clock and the real silicon’s GHz interface. A SpeedBridge buffers traffic, manages flow control, and presents the emulator to the outside world as if it were running at full speed.

Why ICE matters.

• Protocol compliance. Pre-silicon compliance testing against Rohde & Schwarz, Keysight, or Anritsu test gear — the only way to know whether a USB or Ethernet PHY will work with real test equipment before tape-out.
• Driver bring-up. Live operation against real interconnect silicon (a real PCIe root complex, a real Wi-Fi 6E radio) catches bugs at the analog boundary that no transactor reproduces.
• System integration. The new SoC plus the existing board ecosystem — chipset, PMIC, regulators, sensors — can be exercised end-to-end before silicon arrives. Networking-silicon programs have used pre-silicon ICE for years for NIC bring-up.

What ICE costs. The emulator no longer chooses its own pace. Real-time hard deadlines apply: a Wi-Fi PHY does not wait for the emulator to catch up. Flow control on most external links does not extend to the emulator. Debugging is by construction not omniscient — the bug may live in the interaction between emulated logic and real silicon, and the real silicon does not give its internal state to the debugger.

The cloud-driven decline. ICE’s share of total emulation use is declining in 2024–2026, and the driver is cloud emulation. Cloud-hosted emulator pools cannot host physical SpeedBridges to real PHYs — the cable has to plug in somewhere, and the cloud data center is somewhere else. Programs that target cloud-elastic CI workflows therefore migrate toward Virtual / hybrid emulation (transactor-driven, software stimulus), which is portable to any emulator pool the vendor exposes. ICE remains the right answer for protocol-compliance testing against real test equipment, for live-PHY analog-boundary validation, and for chip-plus-board integration — but those are increasingly specialized lab-only activities, not the daily CI flow. The chapter’s broader claim about substrate-agnostic methodology survives the shift: the proxy actor’s transport over a Virtual / hybrid setup is structurally the same as its transport over an ICE setup; only the bytes on the wire differ.

The actor reading. The proxy-actor pattern absorbs ICE without modification. The proxy actor’s transport happens to be a real protocol PHY rather than a host-side socket. The scoreboard still receives the same typed messages from the on-FPGA monitor actor; the bug-rate dashboard still subscribes to the same coverage events. The bridge’s serializer writes Ethernet frames instead of MessagePack records; the deserializer translates the inbound frames back into typed messages. The methodology is preserved; only the transport adapter changes. This is exactly the same architectural ignorance that makes cloud emulation cheap — the framework does not know what is on the other end of the wire.

7.11 Power emulation: workload-driven analysis

Power is the verification dimension that emulation uniquely enables. The argument is structural: average power numbers from gate-level simulation are computed over microseconds of stimulus, but the worst-case power events that determine package thermal limits, regulator design, and battery life happen on the time-scale of seconds-to-minutes of real workload. A 90-second camera launch contains roughly $10^{11}$ cycles — seven-to-eight orders of magnitude beyond what gate-level simulation can capture. Emulator-driven power analysis is the only pre-silicon technology that closes this gap.

The flow. The mechanism is consistent across the three vendors’ offerings (Cadence Dynamic Power Analysis, Synopsys ZeBu Empower, Siemens Veloce Power), and it is fundamentally two-pass: a cheap coarse pass identifies which windows are worth analyzing, and an expensive fine pass extracts cycle-accurate detail only for those windows. Dumping cycle-accurate SAIF data for the full Android boot is not viable: the volume is in the tens of terabytes and the emulator’s bandwidth saturates at the disk-write level. The two-pass structure is what makes the flow operationally tractable. The steps:

1. Emulate the workload at full speed. The DUT runs at 1–5 MHz; the workload (Android boot, camera launch, AI inference run, GPU benchmark) runs end-to-end in hours.
2. Pass 1 — coarse activity profile. The first emulator pass captures only register-level switching activity at coarse time intervals (say, every microsecond). The output is a lightweight “activity trend” that summarizes per-region activity over the whole workload. This pass costs little — the stream is small enough to write to disk during execution — and produces a visual time-series that a verifier or an automated threshold detector can scan in seconds.
3. Identify high-power windows. Inspect the activity trend for activity spikes that exceed the chip’s average by some threshold (typically 2–3$\times $). These spikes are the windows that determine the worst-case power, regulator sizing, and battery-life calculations.
4. Pass 2 — targeted SAIF dump. The emulator re-runs the workload with full cycle-accurate SAIF dumping enabled only during the windows identified in Pass 1. SAIF volume drops from tens of terabytes to gigabytes; the emulator’s bandwidth holds; the verifier gets full-fidelity activity for the windows that actually matter.
5. Feed Pass-2 SAIF forward to signoff power tools. Only the flagged windows are re-run through PrimePower or Joules at gate-level accuracy. The vendors quote accuracy within a few percent of post-layout signoff ($\geq $95%); the operational reality is more modest, for the reasons set out below.
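Step 3 of the flow above, turning the Pass-1 activity trend into the windows that Pass 2 will dump, can be sketched as a threshold scan. The function name, the `margin` padding, and the sample trend are hypothetical; the threshold of a multiple of the mean follows the 2–3$\times $ rule of thumb in the text, and a production detector would likely work per region rather than on one chip-wide series.

```python
from typing import List, Sequence, Tuple


def high_power_windows(activity: Sequence[float], factor: float = 2.0,
                       margin: int = 0) -> List[Tuple[int, int]]:
    """Return merged [start, end) index ranges where activity exceeds factor x mean."""
    if not activity:
        return []
    threshold = factor * (sum(activity) / len(activity))

    # First pass: find contiguous runs above the threshold.
    raw: List[Tuple[int, int]] = []
    start = None
    for i, level in enumerate(activity):
        if level > threshold and start is None:
            start = i
        elif level <= threshold and start is not None:
            raw.append((start, i))
            start = None
    if start is not None:
        raw.append((start, len(activity)))

    # Second pass: pad each run by `margin` intervals and merge any overlaps.
    merged: List[Tuple[int, int]] = []
    for s, e in raw:
        s, e = max(0, s - margin), min(len(activity), e + margin)
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


# A flat trend with two spikes; indices are coarse Pass-1 intervals.
trend = [10.0] * 20
trend[5] = trend[6] = 40.0
trend[14] = 35.0
print(high_power_windows(trend, factor=2.0, margin=1))  # prints [(4, 8), (13, 16)]
```

The returned ranges are what a Pass-2 run would use to enable and disable full SAIF dumping, so the dumped volume scales with the spike count rather than with workload length.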

What this catches. Three classes of power bug that pre-silicon teams find only with emulator-driven analysis:

• Workload-dependent worst-case windows. Synthetic stimulus rarely exercises the simultaneous-switching activity that real software produces. A camera ISP pipeline plus a GPU compositor plus a CPU running ML inference plus a USB host concurrently transferring is a regime the architect did not size for.
• Clock-gating-failure regressions. A new IP block whose enable signal has the wrong sensitivity defaults to ungated. Under real workload, this might dissipate watts of unnecessary power. Average-power analysis misses it; workload-driven analysis finds it the first time the workload runs.
• Coherence-protocol churn. Real software stresses cache coherence in ways synthetic stimulus does not. The resulting bus-cycle waste shows up as power overhead concentrated in specific software phases (kernel scheduler entry, syscall return). The fix is microarchitectural; without the workload-driven measurement, the team would not know to look.

The accuracy boundary — be honest about it. Vendor marketing tends to quote within-a-few-percent-of-signoff accuracy. The numbers a verification engineer should plan around are more modest, for a structural reason: the netlist the emulator runs is not the netlist that goes to tape-out. The emulator-mapped netlist comes from a synthesis flow optimized for emulator compilability, with placeholder clock structures, no clock-tree synthesis, no real foundry standard cells, and no back-annotated parasitics. The tape-out netlist has all of these and they materially change the power picture. The emulator’s power output is therefore a relative signal — a reliable way to compare workload window A against window B on the same emulator-mapped design — not an absolute signal that can substitute for signoff PrimePower / Joules on the post-layout netlist. The pre-silicon value is in the relative comparison, which is enough to find the bad workload windows. The absolute accuracy is whatever the synthesis flow and the cell-library extraction support, and it is rarely as good as the marketing suggests.

The actor reading. The emulator emits per-cycle activity events for every monitored region. An on-emulator ActivityAggregatorActor reduces the cycle-level activity to per-window summaries; the summary stream exits the emulator at much lower bandwidth than per-cycle traffic. A host-side PowerEstimateActor subscribes to the windowed activity messages, applies the cell-library characterization, emits per-window power-mW messages. A regression dashboard subscribed to those messages produces the workload power profile. The same dashboard that aggregated functional coverage now aggregates power coverage; both ride the same actor bus; both stream from the emulator through the proxy-actor bridge. No separate tooling.
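The aggregate-then-estimate dataflow described above can be modelled with two generator stages. This is a plain-Python illustration, not the framework's actor classes: the message types, the `(cycle, region, toggles)` event shape, and the per-toggle energy table standing in for cell-library characterization are all hypothetical. It relies on the identity that one picojoule per nanosecond is one milliwatt.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)
class WindowActivity:
    window: int
    region: str
    toggles: int


@dataclass(frozen=True)
class WindowPower:
    window: int
    region: str
    milliwatts: float


def aggregate(events: Iterable[Tuple[int, str, int]],
              cycles_per_window: int) -> Iterator[WindowActivity]:
    """Reduce cycle-ordered (cycle, region, toggles) events to per-window totals."""
    current: Optional[int] = None
    totals: Dict[str, int] = {}
    for cycle, region, toggles in events:
        window = cycle // cycles_per_window
        if current is not None and window != current:
            for name, count in totals.items():
                yield WindowActivity(current, name, count)
            totals = {}
        current = window
        totals[region] = totals.get(region, 0) + toggles
    if current is not None:  # flush the final, possibly partial, window
        for name, count in totals.items():
            yield WindowActivity(current, name, count)


def estimate(messages: Iterable[WindowActivity], pj_per_toggle: Dict[str, float],
             window_ns: float) -> Iterator[WindowPower]:
    """Convert windowed toggle counts to power: pJ / ns equals mW."""
    for m in messages:
        yield WindowPower(m.window, m.region, m.toggles * pj_per_toggle[m.region] / window_ns)


# Two-cycle windows at an assumed 1 GHz design clock, so each window spans 2 ns.
events = [(0, "npu", 100), (1, "npu", 300), (2, "npu", 50), (3, "npu", 50)]
for msg in estimate(aggregate(events, cycles_per_window=2), {"npu": 0.5}, window_ns=2.0):
    print(msg)
```

Only the `WindowActivity` stream would need to leave the emulator in this arrangement, which is the bandwidth reduction the paragraph attributes to the on-emulator aggregator.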

What it costs to set up. The infrastructure is not free. The activity-monitor instrumentation has to be added to the RTL (the synthesis-time hook that records per-region toggles); the cell-library characterization has to be done up front; the high-power-window threshold has to be tuned per design. None of this is hard; all of it is real engineering effort. A first power-emulation flow on a new architecture takes weeks; subsequent flows on the same architecture inherit the infrastructure and complete in hours per workload.

7.12 Gate-level emulation: post-synthesis netlists at emulator speed

The emulator’s substrate ignorance has a consequence the chapter has not yet exercised: the DUT does not have to be RTL. A post-synthesis gate-level netlist runs on the same emulator at the same MHz-class clock rate, and the methodology around it does not change. Gate-level emulation (GLE) is a major fraction of every Tier-1 program’s emulator hours and a load-bearing capability for the workloads it enables.

Why post-synthesis matters. Three classes of bug that the RTL emulation flow cannot catch but the gate-level flow can:

• X-propagation through real cells. RTL simulation typically treats Xs pessimistically (propagating through every gate) or optimistically (stopping at gates with don’t-cares). The synthesized netlist’s real cells resolve the propagation deterministically; some bugs are visible only in that resolution.
• Timing-correlated power. Power-emulation accuracy improves materially when the netlist comes from the same synthesis flow that produced the tape-out (with real cell-library characterization, not behavioral placeholders). The post-layout netlist is still not available at this stage; the post-synthesis netlist is one step closer than the RTL.
• SCANDUMP rehearsal at the gate level. The JTAG scan-chain debug flow that will be used post-silicon walks the actual scan-stitched netlist. Running the gate-level scan stitcher on the emulator validates the post-silicon debug path well before silicon arrives. A scan chain that fails on the emulator fails in the lab; one that works on the emulator is logically correct, leaving only physical failure modes (shift timing, IR drop, clock skew) for silicon.

Speed relative to gate-level simulation. The vendor papers consistently report 70–90$\times $ speedups over gate-level software simulation: a regression that takes a week of GLS finishes in a couple of hours on the emulator. As with all vendor speedup claims, the absolute number depends on workload, gate count, and probe instrumentation; the trend is reliable, the headline figure is best-case. Plan around a real-world band of one-to-two orders of magnitude depending on how aggressive the probe configuration is.

The actor reading. The actor substrate (§7.17) does not know whether the synthesized block on the emulator was a behavioral actor, an RTL block, or a gate-level netlist. The same scoreboard subscribes to the same typed messages; the same coverage actor records the same bins; the same RAL drives the same register sequences. The substrate ignorance the chapter has been arguing for cuts at this boundary too: gate-level is one more substrate the actor framework treats identically.

7.13 AI accelerators: the dominant 2024–2026 use case

The bulk of emulator capital purchases in 2024–2026 is being driven by AI accelerator development. The reason is structural: AI accelerators are large (10–100 B gates per die), dataflow-heavy (every cycle is exercised by a real workload), and software-dependent (the compiler, the runtime, the framework are as much the product as the silicon). The class of bug that AI workloads expose is not findable in simulation.

Why AI accelerators are different. Three properties that distinguish AI silicon verification from general-purpose SoC verification:

• Large blocks, regular structure. A modern NPU is dominated by a systolic array of multiply-accumulate units — thousands of regular MAC tiles — plus a smaller control plane. The block-level verification is straightforward; the integration challenge is the software stack that drives them.
• Compiler-as-verification. The accelerator’s software stack (PyTorch / TensorFlow / JAX compiler $\to $ accelerator IR $\to $ accelerator machine code) is as much the verification surface as the RTL. A correct accelerator with a buggy compiler is a defective product. The compiler is exercised by running real models; only emulation provides the throughput to run real models at all.
• Power and performance are first-order. A general-purpose SoC ships with a functional verification gate. An AI accelerator ships with functional, performance, and power gates. Performance is one headline number (FLOPS, TOPS, tokens-per-second); power efficiency is the other (TOPS-per-watt). Both require workload-scale measurement; both require emulation.

The workloads. The AI workloads that drive emulator capacity in 2024–2026:

• Large language model inference. A 70B-parameter LLM generating one token end-to-end exercises the memory hierarchy, the matrix-multiply units, the attention compute, and the output sampling. Per-token latency is the headline; emulator-driven measurement is the only pre-silicon technology that produces it.
• Large language model training. A training step exercises the same paths plus the gradient accumulation, the optimizer update, and the all-reduce communication if the accelerator supports it. With two to three orders of magnitude separating emulation clocks from silicon clocks, a multi-day training run at silicon speed would stretch to months or years on the emulator: impossibly slow for a full epoch, but fast enough to run the first dozen steps and catch the bugs they expose.
• Computer-vision inference. A camera-ISP-to-neural-network pipeline on a mobile or automotive SoC exercises the full path from sensor to inference output. Pre-silicon emulation catches the synchronisation bugs between the ISP and the NPU that simulation misses.
• Stable Diffusion / generative inference. A diffusion pass for image generation exercises the U-Net repeatedly with different timestep conditioning. The cyclic structure exposes memory-allocator bugs the AI compiler is supposed to handle but sometimes does not.

Hybrid emulation as the AI norm. The published AI-accelerator pre-silicon flows from 2024–2026 converge on hybrid emulation as the default. The CPU front-end (the host running PyTorch or TensorFlow) runs in QEMU or a fast SystemC model; the new accelerator RTL is on the emulator; the framework-to-accelerator transfer goes through TLM-2.0 transactors. The boundary between CPU and accelerator is where the verification battle lives — queue depths, DMA transfer patterns, completion-interrupt latency. Hybrid emulation places this boundary where it can be exercised by real software.

The actor reading. The AI accelerator dataflow maps cleanly onto the actor model. The tensor-flow IR the compiler emits is a graph of operators; each operator becomes an actor; the edges between operators are typed-message channels. The host-side verification environment subscribes to per-operator typed messages (“GEMM 4096x4096 completed in 870 $\mu $s, 124 W average”); the scoreboard checks against the compiler’s expected schedule; the coverage actor records the bin (“transformer-block N executed with all 16 attention heads in flight”). The substrate is the same actor framework Chapter 6 built; the only difference is that the workload happens to be an AI inference instead of a Linux boot.

7.14 Functional safety: fault injection at emulator speed

A class of pre-silicon verification the chapter has not yet named explicitly is functional safety — the discipline of proving that a chip’s safety mechanisms detect a defined fraction of the faults that real silicon will experience. The ISO 26262 automotive standard codifies this with Automotive Safety Integrity Level (ASIL) targets: an ASIL-B IP must demonstrate a single-point-fault metric (SPFM) of $\geq 90\%$; ASIL-C requires $\geq 97\%$; ASIL-D requires $\geq 99\%$. The metric is calculated from a fault campaign that injects a specified distribution of stuck-at and transient faults into the design and records whether each is detected by a safety mechanism.

Why emulation is the right substrate. A typical ASIL-B fault campaign for an automotive NPU injects $10^5$ to $10^6$ faults across the design. Each fault must run a complete safety-mechanism test, frequently a firmware self-test routine of $10^7$–$10^8$ target cycles. In software simulation at tens of kHz that is minutes to a few hours of wall-clock per fault; across $10^5$–$10^6$ faults run one after another, the campaign never finishes. On an emulator at 1–5 MHz each fault takes seconds to a couple of minutes, and with fault sampling and concurrent-fault injection the campaign compresses into days. Without emulation, automotive safety certification is structurally infeasible at modern gate counts.
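The arithmetic behind that comparison is easy to check. The sketch below is illustrative only: the function names are hypothetical, the sample point is picked from the ranges quoted above, and the campaign is assumed to run strictly one fault at a time with no sampling or concurrent injection.

```python
SECONDS_PER_DAY = 86_400


def seconds_per_fault(test_cycles: float, clock_hz: float) -> float:
    """Wall-clock seconds for one safety self-test at the given clock rate."""
    return test_cycles / clock_hz


def campaign_days(num_faults: float, test_cycles: float, clock_hz: float) -> float:
    """Duration in days of a serial campaign: one fault at a time,
    no fault sampling, no concurrent injection."""
    return num_faults * seconds_per_fault(test_cycles, clock_hz) / SECONDS_PER_DAY


# Assumed sample point, chosen from the ranges in the text.
NUM_FAULTS = 1e5    # low end of the 1e5..1e6 fault-list range
TEST_CYCLES = 1e7   # low end of the 1e7..1e8 self-test length
SIM_HZ = 1e4        # software simulation at 10 kHz
EMU_HZ = 5e6        # emulator at 5 MHz

sim_days = campaign_days(NUM_FAULTS, TEST_CYCLES, SIM_HZ)
emu_days = campaign_days(NUM_FAULTS, TEST_CYCLES, EMU_HZ)

print(f"simulation: {seconds_per_fault(TEST_CYCLES, SIM_HZ):.0f} s/fault, {sim_days:.0f} days")
print(f"emulation:  {seconds_per_fault(TEST_CYCLES, EMU_HZ):.0f} s/fault, {emu_days:.1f} days")
```

At this sample point the serial simulation campaign runs to roughly 1,157 days, against about 2.3 days on the emulator. Moving toward the upper ends of the fault-count and test-length ranges scales both figures by the same factor, which is why sampling and concurrent injection are still needed on the emulator.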

The flow. A representative automotive-NPU fault-campaign case study (DVCon 2024) walks the canonical structure:

1. Generate the fault list. A vendor tool (Siemens Functional Safety Database, Synopsys Z01X, Cadence Midas) enumerates the candidate fault sites and applies pruning rules to skip logically-equivalent or unreachable faults.
2. Inject and run. Each fault is injected (a stuck-at value forced on the target net), the firmware safety test runs, and the safety mechanism’s output is observed: did it raise an alert at the expected severity within the expected window?
3. Classify. Faults are classified as detected (alert raised), not detected / safe (no harmful effect), not detected / dangerous (silent failure), or multi-point (requires a second concurrent fault to manifest). The SPFM is the failure-rate-weighted fraction of faults that are neither single-point nor residual — detected by a safety mechanism, intrinsically safe, or multi-point.
4. Iterate the safety mechanism. If the SPFM falls below the ASIL target, the safety mechanism is strengthened (more coverage, faster detection, better isolation) and the campaign re-runs.
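The classification and metric in step 3 can be sketched in a few lines. This is a minimal illustration, not a vendor tool's data model: the type names, the fault-site names, and the failure rates are all hypothetical, and the "not detected / dangerous" class is taken to be the single-point and residual faults that the SPFM excludes.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    DETECTED = "detected"
    SAFE = "not detected / safe"
    DANGEROUS = "not detected / dangerous"   # single-point or residual
    MULTI_POINT = "multi-point"


@dataclass(frozen=True)
class FaultResult:
    site: str                 # hypothetical net name
    failure_rate_fit: float   # weight of this fault in the metric
    outcome: Outcome


# SPFM targets quoted in the text for each ASIL.
ASIL_SPFM_TARGET = {"B": 0.90, "C": 0.97, "D": 0.99}


def spfm(results) -> float:
    """Failure-rate-weighted fraction of faults that are neither
    single-point nor residual."""
    total = sum(r.failure_rate_fit for r in results)
    if total <= 0:
        raise ValueError("campaign has no weighted faults")
    dangerous = sum(r.failure_rate_fit for r in results
                    if r.outcome is Outcome.DANGEROUS)
    return 1.0 - dangerous / total


def meets_target(results, asil: str) -> bool:
    return spfm(results) >= ASIL_SPFM_TARGET[asil]


# Illustrative campaign; every value here is made up.
campaign = [
    FaultResult("u_mac.acc_q[3]", 40.0, Outcome.DETECTED),
    FaultResult("u_dbg.id_q[0]", 30.0, Outcome.SAFE),
    FaultResult("u_ctl.state_q[1]", 5.0, Outcome.DANGEROUS),
    FaultResult("u_ecc.syn_q[2]", 25.0, Outcome.MULTI_POINT),
]

print(f"SPFM = {spfm(campaign):.2%}")
print({asil: meets_target(campaign, asil) for asil in ASIL_SPFM_TARGET})
```

With these made-up weights the campaign scores 95%, which clears the ASIL-B target and misses ASIL-C, so step 4 would send the safety mechanism back for strengthening if ASIL-C were the goal.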

The actor reading. A FaultInjectionActor on the host publishes typed FaultInject_s messages naming the target net, the fault model (stuck-at-0, stuck-at-1, transient), and the test scenario; an on-emulator harness applies each fault. A DiagnosticCoverageActor subscribes to the same alert-handler messages the chip-level scoreboard already subscribes to (the Earl Grey example of §7.17 demonstrates the alert-handler topology); it classifies each fault outcome and emits a FaultOutcome_s message. A host-side dashboard aggregates the campaign’s SPFM in real time. The fault campaign is not separate infrastructure; it is one more set of subscribers on the same actor bus the functional verification flow already uses.

7.15 When the emulator finds a bug: the debug loop

The chapter has so far argued that emulation is operationally cheap on the actor substrate. The remaining question is: what does it actually look like when something fails? A Linux boot that hangs at minute 47, an Android boot that panics at hour 3, an AI inference that produces wrong outputs at iteration 1000 — these are the bugs emulation finds, and the debug workflow for them is qualitatively different from the simulation debug workflow.

The bug surfaces as a typed message. The first signal is not a waveform; it is a message on the actor bus. A scoreboard actor detects a divergence between expected and actual; an alert actor fires; a coverage actor records a bin that should not have been reachable. The dashboard flags the run. The verifier opens the dashboard; the bug has a typed-message context already attached — which actor published the failing message, which test, which workload phase.

The investigation pivots to the trace. The next step is to look at the activity around the failure. The custom-silicon emulators (Palladium, Veloce) give 100% net visibility at every cycle; the verifier requests a waveform window around the failure timestamp. The FPGA-based emulators (ZeBu) give bandwidth-limited visibility; the verifier examines what the probe network captured and, if necessary, requests a recompile that routes the suspect signals through the probe network for the next run. This is where the visibility tradeoff materializes. A bug on a Palladium gives waveform context immediately; a bug on a ZeBu may require a recompile cycle to expose the right signals.

The reproduce loop. The emulator’s checkpoint-and-replay (Palladium InfiniTrace, equivalent on Veloce and ZeBu) lets the verifier rewind to a known-good checkpoint and replay forward, with extra signals exposed. The pattern:

1. Take a checkpoint every $N$ million cycles during the workload.
2. When a bug surfaces, identify the latest pre-bug checkpoint.
3. Recompile with the suspect signals exposed.
4. Replay from the checkpoint with the new probe configuration.
5. The replay reproduces the bug with the visibility the original run lacked.

The total turn-around is hours — the recompile dominates, but the checkpoint replay itself is fast. On a Palladium with 100% visibility, the recompile step may be skipped entirely.
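Why the replay itself is fast follows from the checkpoint spacing. The sketch below is a back-of-envelope model under stated assumptions: the helper names are hypothetical, checkpoints are assumed to fall on exact multiples of the interval, and the failure cycle, interval, and clock rate are illustrative numbers rather than measurements.

```python
def latest_checkpoint(fail_cycle: int, interval: int) -> int:
    """Cycle of the last periodic checkpoint taken at or before the failure."""
    if interval <= 0:
        raise ValueError("checkpoint interval must be positive")
    return (fail_cycle // interval) * interval


def replay_seconds(fail_cycle: int, interval: int, clock_hz: float) -> float:
    """Emulator time needed to replay from that checkpoint to the failure."""
    return (fail_cycle - latest_checkpoint(fail_cycle, interval)) / clock_hz


# Illustrative values only.
FAIL_CYCLE = 14_100_000_000   # cycle at which the scoreboard flagged the bug
INTERVAL = 500_000_000        # N = 500 million cycles between checkpoints
CLOCK_HZ = 5e6                # emulator clock

start = latest_checkpoint(FAIL_CYCLE, INTERVAL)
print(f"restore checkpoint at cycle {start:,}")
print(f"replay takes {replay_seconds(FAIL_CYCLE, INTERVAL, CLOCK_HZ):.0f} s")
```

In this example the replay covers 100 million cycles, about 20 seconds of emulator time. The replay can never be longer than one checkpoint interval, so the choice of $N$ trades checkpoint storage against worst-case replay length. If the root cause precedes the visible failure by more than one interval, an earlier checkpoint has to be chosen by hand.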

SCANDUMP: the JTAG-driven full-state escape hatch. A specific debug primitive worth naming explicitly is SCANDUMP, a JTAG-driven full-state dump that reads the entire scan-chain state of the design through the JTAG interface. It originated as a post-silicon debug technique for hung chips on the bench, and emulators implement an equivalent capability: when the emulator-resident DUT hangs (a deadlock, a wedged FSM, a stalled bus), the emulator’s controller can scan out every flip-flop’s current value through the JTAG model in minutes, where waveform-based forensics would take days. Two operational uses make SCANDUMP load-bearing. The first is pre-silicon hang debug on the emulator: when the Linux boot wedges at minute 47 and the verifier needs to know the system’s exact architectural state at the hang, SCANDUMP delivers it in a single minutes-scale operation rather than reconstructing it from a partial waveform. The second is post-silicon rehearsal: the SCANDUMP-based debug flow developed in emulation is the same flow that will be used post-silicon when the real chip hangs in the lab. The team that practices SCANDUMP during emulation has the same debug instrumentation already validated when first silicon arrives. Vendor implementations vary in name and detail (Cadence’s full-vision scan-dump, Synopsys ZeBu’s equivalent), but the JTAG-driven full-scan-readout primitive is consistent across the families.

The fix and the regression. Once the bug is understood, the fix is in the RTL or the firmware. The fix gets committed; the next emulator nightly compiles the fix; the workload is re-run; the bug-regression passes. The same scoreboard that caught the original bug catches the regression if the fix is incomplete. The actor framework’s typed-message bus is what makes this loop closed without separate tooling.

The cross-leg learning. If the bug was triggered by a specific software sequence, the verifier can extract the sequence and create a simulation regression test from it. The actor framework’s RAL records register-write sequences as typed messages; the failing sequence becomes a simulation test that pre-emptively catches the same bug in the future without requiring the multi-hour workload. The emulation bug becomes a simulation bin — a permanent capture of the bug class.

What this is missing. Two things the workflow does not give the verifier:

• Cycle-by-cycle CPU behavior in hybrid mode. If the hybrid setup runs the CPU in QEMU rather than on the emulator, the CPU cycle activity is invisible. Bugs that depend on cycle-accurate CPU behavior are invisible too. The mitigation is to schedule periodic all-RTL emulation runs, but the cost is high (slower workloads, more emulator time).
• Power waveforms below the activity-monitor resolution. Power emulation works at window granularity (microseconds to milliseconds); bugs below that resolution (sub-cycle clock-gating glitches, IR drop transients) are not visible. Post-silicon power probes catch them; pre-silicon power emulation does not.

These limitations are not catastrophic — they account for a small fraction of pre-silicon bugs — but they are real, and the engineering team needs to know they exist. The emulation flow is not a complete verification methodology; it is the third leg of the EV pipeline, and the other two legs catch what it cannot.

7.16 Amdahl’s law on the emulator: the unnamed tax

Before walking through how the actor framework rides on an emulator, it is worth naming the single largest structural reason every emulator-using team has felt itself underutilizing expensive capital. It is Amdahl’s law, applied to a heterogeneous machine, and it is striking how thoroughly the emulation industry has discussed every symptom of it — transactor bandwidth, PCIe round-trip latency, SCE-MI marshalling overhead, accelerator-mode throughput — without ever naming the underlying equation.

The heterogeneous machine. An emulator deployed in any mode other than fully self-contained is two machines stitched together. The emulator part runs at $R_{\text {emul}}$, somewhere in the 1–10 MHz range on a custom-processor array and comparable on an FPGA-based platform. The host part runs the testbench and (in hybrid emulation) a fast CPU model; its effective rate $R_{\text {host}}$, once a workload cycle has to make a round trip to the host through a transactor or SCE-MI bridge, is set by the PCIe round-trip latency plus the host software stacked on top of it. With hundreds of nanoseconds per round trip and the UVM sequence-item generation and checking step added, $R_{\text {host}}$ lands at single-digit kHz to low tens of kHz for full transaction-by-transaction flows. The two machines differ by two to three orders of magnitude in clock rate.

The equation. Let $f$ be the fraction of workload cycles that have to be serviced by the host (transactor driven, host-resident scoreboard checked, RAL brokered), and $1-f$ the fraction that can be serviced inside the emulator without crossing. Amdahl’s law gives the system throughput exactly:

\[ R_{\text {system}} \;=\; \frac {1}{\dfrac {f}{R_{\text {host}}} + \dfrac {1-f}{R_{\text {emul}}}}. \]

The asymptote as $f \to 0$ is $R_{\text {emul}}$ (the emulator runs at full clock). The asymptote as $f \to 1$ is $R_{\text {host}}$ (the system collapses to the host rate, and the emulator capital sits idle most cycles). The function does not collapse linearly between those endpoints: the slow part dominates the sum far faster than intuition suggests, because the ratio $R_{\text {emul}}/R_{\text {host}}$ enters multiplicatively.
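Rearranging the equation makes the multiplicative effect explicit. Writing $\rho = R_{\text {emul}}/R_{\text {host}}$ (a shorthand introduced here for convenience), the slowdown relative to the emulator clock is

\[ \frac {R_{\text {emul}}}{R_{\text {system}}} \;=\; R_{\text {emul}}\left (\frac {f}{R_{\text {host}}} + \frac {1-f}{R_{\text {emul}}}\right ) \;=\; 1 + f\,(\rho - 1). \]

The rate is not linear in $f$, but its reciprocal is: the slowdown grows with slope $\rho - 1$, so when $\rho$ is in the hundreds each additional percentage point of host-bound cycles adds roughly $\rho /100$ to the slowdown factor.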

The cliff, with numbers. Plugging $R_{\text {emul}} = 5\,\text {MHz}$ and $R_{\text {host}} = 10\,\text {kHz}$ (representative for a UVM-style transactor flow) makes the cliff concrete.

Host-bound fraction $f$	System rate $R_{\text {system}}$	Loss vs. $R_{\text {emul}}$
$0.001$ (0.1 %)	$\sim 3.3\,\text {MHz}$	1.5$\times $
$0.01$ (1 %)	$\sim 835\,\text {kHz}$	6$\times $
$0.1$ (10 %)	$\sim 98\,\text {kHz}$	50$\times $
$0.5$ (50 %)	$\sim 20\,\text {kHz}$	250$\times $
$1.0$ (100 %, UVM-style)	$10\,\text {kHz}$	500$\times $

A UVM-style flow that brokers every transaction through the host sits at $f \approx 1$ and runs the system at the host rate, regardless of how fast the emulator’s processors execute underneath. This is the throughput the industry forgoes when the boundary traffic is synchronous — not a vendor failing, but the equation. Buying a faster emulator does not change the system rate when $f \to 1$; it only changes how much capital is idle.
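The table can be reproduced directly from the equation. The sketch below is a plain transcription of the formula; the function name `system_rate` is hypothetical, and the two rates are the representative values used for the table.

```python
def system_rate(f: float, r_host: float, r_emul: float) -> float:
    """Amdahl system throughput for host-bound fraction f (rates in Hz)."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    return 1.0 / (f / r_host + (1.0 - f) / r_emul)


R_EMUL = 5e6   # 5 MHz emulator clock
R_HOST = 1e4   # 10 kHz host-bound rate

for f in (0.001, 0.01, 0.1, 0.5, 1.0):
    r = system_rate(f, R_HOST, R_EMUL)
    print(f"f = {f:<5}  R_system = {r / 1e3:8.1f} kHz  loss = {R_EMUL / r:6.1f}x")
```

Changing `R_EMUL` while holding `f` near 1 shows the point of the paragraph above: the printed system rate barely moves, and only the loss factor grows.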

The surprise is not the formula — every computer scientist learns Amdahl’s law as an undergraduate — but that the emulation literature, marketing material, and field comparisons never write it down. Vendors quote $R_{\text {emul}}$ (“5 MHz emulator clock”); the teams running them measure $R_{\text {system}}$ (“50 kHz effective throughput,” a partly host-bound operating point); the gap between the two is treated as an operational nuisance to be optimized incrementally rather than as the predictable output of a one-line equation. Naming the equation is the precondition for fixing it.
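Given a quoted $R_{\text {emul}}$ and a measured $R_{\text {system}}$, the equation can be solved for the host-bound fraction a flow is actually operating at. As an illustration only, take the two quoted figures together with the representative $R_{\text {host}} = 10\,\text {kHz}$ used for the table (an assumption: the quoted figures do not come with a host rate):

\[ f \;=\; \frac {\dfrac {1}{R_{\text {system}}} - \dfrac {1}{R_{\text {emul}}}}{\dfrac {1}{R_{\text {host}}} - \dfrac {1}{R_{\text {emul}}}} \;=\; \frac {\dfrac {1}{50\,\text {kHz}} - \dfrac {1}{5\,\text {MHz}}}{\dfrac {1}{10\,\text {kHz}} - \dfrac {1}{5\,\text {MHz}}} \;\approx \; 0.20. \]

Under that assumption, a flow measuring 50 kHz on a 5 MHz emulator is blocking on the host for about one workload cycle in five.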

What $f$ actually counts: synchronous round-trips, not one-way streams. The equation as written can mislead if read too broadly. $f$ counts only the fraction of workload cycles that block on a host round trip, not every cycle that involves the boundary. One-way streams across the boundary — emulator $\to $ host or host $\to $ emulator — impose no Amdahl penalty, because they do not gate the emulator’s clock on a return value. Examples that are one-way and therefore cheap:

• Waveform dumping. Probed nets stream from the emulator into a host-side FSDB writer. The emulator never waits on the writer; the writer drains the stream at whatever rate the PCIe and disk subsystem allow. This is why dumping in emulation is “almost free” — not because the bandwidth is unlimited (it is not, see §7.6), but because the stream is one-way. A bandwidth-saturated dump throttles throughput at the disk-write level; it does not collapse $R_{\text {system}}$ to $R_{\text {host}}$.
• Structured logging and telemetry. An emulator-side observer publishes typed log messages to a host consumer; the emulator never blocks on the consumer’s acknowledgement.
• Fire-and-forget DPI calls. A DPI-C function called from the emulator that takes arguments and returns void (a coverage update, a printf-style log line, a one-way trace export) pays only the call overhead. DPI itself is not the cost; the cost is synchronous DPI inside a handshake that blocks on the return value.
• Host-injected stimulus that does not require synchronization. The host streams pre-computed inputs into the emulator (e.g., a recorded packet trace into an Ethernet IP’s RX FIFO) without waiting for cycle-level responses.

The Amdahl-collapsing traffic is specifically the two-way synchronous kind, where the emulator’s clock cannot advance until the host has responded:

• Ready/valid handshakes through the boundary. A UVM driver on the host raises valid on a virtual interface; the RTL on the emulator asserts ready; the driver waits for the handshake closure before it can issue the next beat. Each beat is one host round trip. A burst of $N$ beats costs $N$ round trips.
• Per-transaction scoreboard checking through transactors. The monitor publishes a transaction back to the host scoreboard; the scoreboard’s reaction (or even its existence as a blocking consumer) gates the next transaction.
• Cycle-accurate CSR access through host-resident RAL. A read returns a value the next cycle of host code depends on; the read therefore blocks.
• Hybrid SCE-MI calls in the synchronous mode. A request-reply pattern with a single outstanding transaction is exactly the pattern Amdahl punishes.

The practical implication: the same boundary, the same transport, the same PCIe link, can be either cheap (one-way) or order-of-magnitude expensive (synchronous handshake) depending on what flows across it. The emulator’s full-visibility waveform dump (megabytes per second of one-way stream) does not slow the emulator. A handful of UVM transactions per emulator cycle, each requiring host acknowledgement, will. The shape of the boundary traffic, not its volume, determines whether the emulator’s clock survives the crossing.
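A toy timing model shows the shape-versus-volume distinction. Everything in it is an assumption made for illustration: the function names, the message size, the link bandwidth, and the 100 µs round trip (the reciprocal of the 10 kHz host rate used earlier). It also idealizes one-way traffic as fully overlapped with emulation.

```python
def wall_seconds_one_way(emu_cycles: int, r_emul: float,
                         n_msgs: int, msg_bytes: int,
                         link_bytes_per_s: float) -> float:
    """Streamed traffic overlaps with emulation, so the slower of the two
    (running the cycles, draining the stream) sets the wall-clock time."""
    emu_time = emu_cycles / r_emul
    drain_time = n_msgs * msg_bytes / link_bytes_per_s
    return max(emu_time, drain_time)


def wall_seconds_synchronous(emu_cycles: int, r_emul: float,
                             n_msgs: int, round_trip_s: float) -> float:
    """Each message stalls the emulator clock for one host round trip."""
    return emu_cycles / r_emul + n_msgs * round_trip_s


# Same boundary, same message count, two traffic shapes (illustrative values).
EMU_CYCLES = 5_000_000      # one second of work at 5 MHz
R_EMUL = 5e6
N_MSGS = 50_000             # 1% of cycles produce a boundary message
MSG_BYTES = 64
LINK_BYTES_PER_S = 1e9      # assumed usable link bandwidth
ROUND_TRIP_S = 100e-6       # assumed blocking host round trip

one_way = wall_seconds_one_way(EMU_CYCLES, R_EMUL, N_MSGS, MSG_BYTES, LINK_BYTES_PER_S)
sync = wall_seconds_synchronous(EMU_CYCLES, R_EMUL, N_MSGS, ROUND_TRIP_S)
print(f"one-way: {one_way:.2f} s   synchronous: {sync:.2f} s")
```

The same 3.2 MB crosses the boundary in both cases. Streamed, the run takes the one second the emulator needs anyway; handshaked, it takes about six. The `max()` in the one-way model is the bandwidth-saturation case: a stream that cannot drain in time throttles the run, but only in proportion to the bytes, not the message count.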

The fix is to drive $f$ toward zero. The function is steep on the side of small $f$. At $f = 0.001$ the system still delivers 3.3 MHz, two-thirds of the emulator’s rate. At $f = 0.01$ it delivers 835 kHz: one-sixth of the emulator’s rate, but more than 80 times what the fully host-bound flow achieves. The methodology that wins on the emulator is whatever methodology makes $f$ small. The substrate swap the rest of this chapter describes (placing synthesizable actor versions of the scoreboard, coverage, RAL, and stimulus on the emulator alongside the DUT) is exactly that: an architecture that pushes $f$ from “every UVM transaction” down to “only what genuinely originates outside the simulated system,” which empirically is one to two orders of magnitude smaller.

The next section walks the mechanism.

7.17 Automatic conversion: the actor testbench that needs no rewrite

The chapter has so far walked the emulation landscape as an engineer choosing a platform would review it. The remainder is the contribution. In EV, moving the verification environment from a software simulator to a hardware emulator — a commercial box or FireSim — is automatic: the same authored actor graph is re-rendered onto the new substrate by the framework and the synthesis compiler, and the engineer writes no transactor, no synthesizable bus-functional model, no second testbench, and no partition hint. This section establishes the claim in four moves: the manual tax the rest of the industry pays; why the actor graph is synthesizable in full, testbench included; what, precisely, the framework converts automatically and where the human step would otherwise be; and the demonstration that the identical conversion targets both a commercial emulator and FireSim.

The status quo: conversion is a manual rewrite

Naming the tax precisely is the precondition for claiming to remove it. Taking a verification environment from simulation to an emulator today is four distinct manual efforts, each documented in the published commercial flows.

1. The testbench cannot cross, so it is split and partly rewritten. UVM is non-synthesizable: new(), randomize(), run-time class allocation, virtual-method hot paths, and unbounded mailboxes have no gate-level form. The testbench therefore stays on the host CPU, and a hand-written synthesizable BFM is placed on the emulator to drive the design’s pins. Verification IP that has to run at speed is re-coded in a synthesizable language — the Palladium Ethernet case study (Vtool, CDNLive) reports that its UVM Ethernet VIP could not be reused and “was rewritten in C++” as an accelerated VIP. The rewrite is per-protocol and per-program.
2. The boundary is hand-plumbed. Host testbench and emulator-resident BFM communicate over SCE-MI (the Accellera Standard Co-Emulation Modeling Interface, v2.1, 2010 / v2.4, 2016) — a multi-channel message interface every vendor implements. Wiring a transactor to it — the message ports, the marshalling, the clock control — is hand engineering repeated per interface, and it must be done well: a “chatty” transactor that crosses the boundary per pin rather than per transaction collapses throughput by an order of magnitude (§7.16).
3. Coverage and checking are relocated by hand. A UVM scoreboard or coverage class lives in the host testbench; to avoid a host round trip on every transaction, the engineer hand-moves coverage into a module bound into the emulator image and arranges a single end-of-run transfer instead of a per-transaction one. The Mentor co-emulation methodology documents exactly this relocation.
4. The partition is hand-hinted. On an FPGA-based emulator the automatic partitioner’s cuts slice across high-fanout buses; the engineer adds manual partition hints to recover predictable compile, and the hints do not transfer cleanly between RTL revisions (§7.6).

None of the four is a small effort, and together they are why “move the testbench to the emulator” is a multi-week specialist task that the industry staffs as a separate methodology with a separate team and a separate artifact stack. All four are consequences of one root cause: the testbench is written in a non-synthesizable language with an implicit, unavoidable host/hardware split. An actor testbench has neither property — it is synthesizable in full, and its only host crossings are the ones the engineer deliberately placed at real seams. Remove the root cause and all four manual efforts disappear together. The rest of this section shows how.

The synthesizable form

Appendix E establishes the engineering basis. The actor framework has two faces:

• A class-based simulation framework (Chapter 6, actor_pkg/) with mailbox, fork/join, virtual dispatch, dynamic allocation, supervision — none of which synthesize.
• A synthesizable pattern (Appendix E) with five rules that restrict the class form into RTL-emissible shape: no dynamic allocation, no virtual dispatch in the hot path, bounded mailboxes only, fixed-cardinality fan-out, ready/valid handshake on every channel.

The same act() FSM body, the same typed channels, the same ‘WIRE topology — only the runtime mechanisms swap for elaboration-time ones. The mechanical mapping is one-to-one: mailbox.get() becomes input ready/valid, publish() becomes output ready/valid, ‘WIRE becomes a direct wire connection at the parent, multi-subscriber fan-out becomes broadcast wire with per-consumer ready, and trace IDs ride alongside data inside the message bundle.
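The bounded-mailbox and ready/valid rules can be illustrated with a small cycle-level software model. This is a sketch of the handshake discipline only, not the framework's actor_pkg code or the Appendix E RTL: the class and function names are hypothetical, and the depth and consumer rate are arbitrary.

```python
from collections import deque


class ReadyValidChannel:
    """Bounded mailbox with ready/valid flags (software model of one channel)."""

    def __init__(self, depth: int):
        if depth < 1:
            raise ValueError("depth must be at least 1")
        self.depth = depth
        self._fifo = deque()
        self.max_occupancy = 0

    @property
    def ready(self) -> bool:
        """Producer-facing flag: there is room for one more message."""
        return len(self._fifo) < self.depth

    @property
    def valid(self) -> bool:
        """Consumer-facing flag: a message is waiting."""
        return len(self._fifo) > 0

    def put(self, msg) -> None:
        if not self.ready:
            raise RuntimeError("put() without ready")
        self._fifo.append(msg)
        self.max_occupancy = max(self.max_occupancy, len(self._fifo))

    def get(self):
        if not self.valid:
            raise RuntimeError("get() without valid")
        return self._fifo.popleft()


def run(messages, depth: int = 2, consumer_period: int = 2):
    """Producer offers one message per cycle; consumer accepts on every
    `consumer_period`-th cycle. Both flags are sampled before either side
    acts, as registered ready/valid signals would be."""
    ch = ReadyValidChannel(depth)
    received, sent, cycle = [], 0, 0
    while len(received) < len(messages):
        push = ch.ready and sent < len(messages)
        pop = ch.valid and cycle % consumer_period == 0
        if pop:
            received.append(ch.get())
        if push:
            ch.put(messages[sent])
            sent += 1
        cycle += 1
    return received, ch.max_occupancy


received, peak = run(list(range(10)))
print(received == list(range(10)), peak)
```

The producer is faster than the consumer, yet nothing is dropped or reordered and occupancy never exceeds the declared depth: back-pressure through `ready` replaces the unbounded queue growth a class-based mailbox would allow.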

The worked example in Appendix E demonstrates this at silicon: a 32-bit counter actor (a handful of lines of class form) becomes a 98-standard-cell counter_actor module (100 lines of SystemVerilog, of which $\sim $50 are executable module code), and two instances composed by counter_chain.sv place and route to 44 iCE40 logic cells (0.6% of a Lattice iCE40 HX8K) at 126.6 MHz with 26% margin over the 100 MHz target. Tool-chain runtime end-to-end is seconds; the synthesizable form is the structural shape a careful designer would write, and the actor methodology adds no overhead at the silicon level. This is the foundational technical fact this chapter rests on.

The whole graph synthesizes

The previous subsection rendered one actor to RTL, and nothing about it was special to the device under test. A scoreboard is an actor: internal state (its golden model, its match tables), a single typed-message handler, ‘WIRE edges in and out. So is a coverage collector, a stimulus generator, a register-abstraction layer. Each is a finite state machine, and the Five Rules of Appendix E turn each into the same synthesizable shape. The hardest of these to credit is the stimulus generator’s constraint solver — it looks like a runtime SAT search, the one part of a testbench that surely cannot be a wire; Appendix F shows it too compiles to a datapath, because randomize() needs only the model-finding half of a proof engine, and proves it on the full constraint set of an open-source RISC-V instruction generator. No actor in a verification environment is synthesizable “in principle but not in practice” — the discipline that lowers the DUT actor to gates lowers the scoreboard, the coverage actor, and the stimulus actor to gates just the same.

So the substrate swap is larger, and simpler, than moving the DUT. It is not “put the DUT on the emulator and keep the testbench on the host”; it is “re-render the entire authored actor graph — DUT and testbench alike — onto the emulator.” The author writes one graph; the framework renders every node of it for whatever substrate the run targets. In software simulation each node is a class object; on the emulator each node is RTL. The ‘WIRE edges that were mailbox hand-offs become ready/valid wires between adjacent blocks — wires inside the fabric, not host crossings. The verification environment runs at the emulator’s clock, beside the design it checks.

Runnable proof. appG_firesim_substrate_swap shows this end to end with no FPGA required. Four actors — a stimulus actor (a 16-bit LFSR), the accumulator DUT, a scoreboard actor (an independent golden model, an expected-value FIFO, a comparator), and a coverage actor (eight buckets) — are wired into one fabric and synthesized. Yosys lowers each to gates: the stimulus to 26 flip-flops, the DUT to 33, the scoreboard to 231 (it carries the golden model, the FIFO, and the counters), the coverage actor to 8 — on the order of 300 flip-flops for the whole verification loop. The example runs that loop two ways: the four actors as software objects, and the same four as the synthesized fabric under Verilator. Both report identical results — 256 transactions, zero mismatches, full coverage — and in the hardware run the host only resets the fabric, clocks it, and reads the final counters. The scoreboard and the coverage are on the fabric, not on the host. The ./firesim/ scaffold carries the identical fabric onto an FPGA through FireSim (§7.18).
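For readers without the artifact to hand, the software-object rendering of such a loop can be approximated as follows. This is a hypothetical stand-in, not the appG_firesim_substrate_swap source: the LFSR taps and seed, the 32-bit accumulator width, and the bucket definition (top three bits of the stimulus value) are all assumptions made for the sketch.

```python
from collections import deque

MASK32 = 0xFFFF_FFFF


class StimulusActor:
    """16-bit Fibonacci LFSR. Taps (16, 14, 13, 11) and the seed are assumed."""

    def __init__(self, seed: int = 0xACE1):
        self.state = seed & 0xFFFF

    def act(self) -> int:
        s = self.state
        bit = (s ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1
        self.state = (s >> 1) | (bit << 15)
        return self.state


class AccumulatorDut:
    """Running sum of the stimulus values; the 32-bit width is assumed."""

    def __init__(self):
        self.acc = 0

    def act(self, value: int) -> int:
        self.acc = (self.acc + value) & MASK32
        return self.acc


class ScoreboardActor:
    """Independent golden model, expected-value FIFO, and comparator."""

    def __init__(self):
        self._golden = 0
        self._expected = deque()
        self.transactions = 0
        self.mismatches = 0

    def on_stimulus(self, value: int) -> None:
        self._golden = (self._golden + value) % (1 << 32)
        self._expected.append(self._golden)

    def on_result(self, actual: int) -> None:
        expected = self._expected.popleft()
        self.transactions += 1
        if actual != expected:
            self.mismatches += 1


class CoverageActor:
    """Eight buckets keyed on the top three bits of the stimulus (assumed)."""

    def __init__(self):
        self.hit = [False] * 8

    def on_stimulus(self, value: int) -> None:
        self.hit[value >> 13] = True

    @property
    def full(self) -> bool:
        return all(self.hit)


def run(n_transactions: int = 256):
    stim, dut = StimulusActor(), AccumulatorDut()
    sb, cov = ScoreboardActor(), CoverageActor()
    for _ in range(n_transactions):
        value = stim.act()
        sb.on_stimulus(value)    # stimulus message fans out to the scoreboard
        cov.on_stimulus(value)   # ... and to the coverage actor
        sb.on_result(dut.act(value))
    return sb.transactions, sb.mismatches, cov.full


print(run())
```

Each class holds only fixed-size state and a single handler, which is the property that lets the same four nodes be rendered as RTL. The expected-value FIFO stays at depth one here because the software loop is sequential; in a pipelined fabric it absorbs the DUT's latency.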

Observation continuity. Because the verification actors are the same authored actors, re-rendered rather than rewritten, the methodology survives the swap intact:

1. The same scoreboard checks. The scoreboard actor that checked the behavioral DUT checks the synthesized DUT — the same actor, the same ‘WIRE edge, now a wire on the fabric instead of a mailbox in a process.
2. The same coverage actor records. Coverage bins are defined at the message level; a bin “alert severity 3 / type 7” fires identically in simulation and on the emulator because the coverage actor is the same FSM in both.
3. The same RAL drives. The register-abstraction actor unfolds high-level register commands into bus transactions on the fabric; only the human-issued commands and the summary replies are workload-level traffic.
4. The same regression infrastructure aggregates. The dashboard subscribes to test-pass and coverage-summary messages identically whether they came from an overnight Verilator run or an emulator workload run.

The framework knows only typed messages and ‘WIRE edges, and renders every node of the graph for the substrate at hand. That uniformity is the methodology’s strength: the swap that turns a 1 kHz Verilator run into a 5 MHz emulator run re-renders the verification environment onto the faster substrate rather than stranding it on the host.

What converts automatically (and where the human step would be)

The contribution is not that an actor testbench can be made to run on an emulator with effort; it is that the conversion has no manual step. Walk the six things that happen when a program retargets an actor graph from a simulator to an emulator, and against each, the manual effort it replaces (the four taxes of the status-quo subsection map onto these six mechanisms).

1. Each actor is rendered to RTL — by the compiler, not by hand. An actor is a finite state machine in the synthesizable form (Appendix E’s Five Rules); rendering it to RTL is a mechanical compiler pass, not a re-authoring. The strongest available existence proof is FireSim’s Golden Gate compiler: it takes arbitrary FIRRTL and emits a cycle-exact hardware model automatically, with a formal guarantee (the latency-insensitive bounded dataflow network’s partial-implementation property) that the emitted model matches the source cycle for cycle. The mechanical rendering Golden Gate performs on a processor is what the framework performs on a scoreboard. Replaces: hand-writing a synthesizable BFM and re-coding non-synthesizable VIP.
2. The ‘WIRE topology becomes wires — emitted, not netlisted by hand. The actor graph declares every edge once, with a known message type and therefore a known bit width and a known ready/valid discipline. Emitting the intra-fabric connections is a graph walk. Replaces: hand-wiring the testbench-to-DUT signal interface.
3. Bridges are generated only at genuine seams. The TransportBridgeActor abstraction (next subsection) already names where the graph touches software. The framework emits the seam adapter — a SCE-MI transactor on a commercial emulator, a FireSim bridge on FireSim — from that single declaration. Replaces: hand-plumbing SCE-MI message ports and marshalling, per interface.
4. The partition plan is the ‘WIRE graph. Because every actor edge is latency-insensitive (ready/valid), every edge is a legal fast-mode cut: FireAxe’s published result (ISCA 2024) is that latency-insensitive boundaries partition across FPGAs at near-zero accuracy cost. The graph the engineer authored is the partition plan; the partitioner is handed it rather than inferring it. Replaces: hand-tuned partition hints.
5. Coverage ports through the intermediate representation, not by relocation. Coverage here is a property of the message (a typed bin), not of a host class; expressed at the IR level it produces identical coverage maps in a software simulator, on an emulator, and under formal, and the maps merge trivially across them (the simulator-independent-coverage result, ASPLOS 2023). Replaces: hand-moving a coverage class into a bound emulator module.
6. The workload is the same artifact. A version-controlled workload description (the FireMarshal pattern, ISPASS 2021) drives the functional simulation and the emulation run unchanged; the stimulus that exercised the actors in software exercises them on the fabric. Replaces: re-creating the stimulus environment for the emulator.

The sum is the claim. Every step that is a manual effort in the status quo is, for an actor graph, a compiler pass or a graph walk over a declaration the engineer already wrote. There is no remaining human step between “it runs in simulation” and “it runs on the emulator” beyond choosing the target and pressing build. That is what “no manual effort” means, and it is the property no non-synthesizable, host-resident testbench can have — the absence is structural, not a tooling gap a vendor could close while keeping UVM.
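
Step 2, the graph walk that turns declared edges into wires, can be sketched in a few lines of Python. This is a minimal model rather than the framework’s emitter: the Edge record, the MSG_WIDTH table, the bit widths, and the _data/_valid/_ready naming are all assumptions made for illustration; in the real flow the width would follow from the typed message itself.

```python
# Hypothetical model of step 2: walk a declared edge list and emit one
# ready/valid channel per edge. Names and widths are illustrative only.
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    producer: str
    consumer: str
    msg_type: str


# Assumed payload widths in bits; not taken from the framework.
MSG_WIDTH = {"axi_burst_msg_t": 96, "axi_resp_msg_t": 40}


def emit_wires(edges):
    """Return SystemVerilog-style declarations, three signals per edge."""
    lines = []
    for e in edges:
        width = MSG_WIDTH[e.msg_type]
        base = f"{e.producer}_to_{e.consumer}"
        lines.append(f"logic [{width - 1}:0] {base}_data;")
        lines.append(f"logic {base}_valid;")
        lines.append(f"logic {base}_ready;")
    return lines


edges = [
    Edge("stim", "dut", "axi_burst_msg_t"),
    Edge("dut", "scoreboard", "axi_resp_msg_t"),
]
wires = emit_wires(edges)
print("\n".join(wires))
```

Each declared edge yields exactly three signals, so the emitted interface grows with the number of edges the engineer declared and needs no per-signal authoring.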

Table 7.3 sets the two columns side by side: on the left, the per-program manual effort the status-quo subsection enumerated; on the right, the mechanism that performs it without a human step, and the published result that grounds the mechanism.

Table 7.3: The conversion, step by step. Every manual port the status quo requires becomes an automatic mechanism over a declaration the engineer already wrote.
Manual step (status quo)	Automatic mechanism (actor graph)	Grounded in

Re-code non-synthesizable TB; hand-write a synthesizable BFM	every actor, the testbench included, is rendered to RTL by a compiler pass	Golden Gate / LI-BDN, ICCAD 2019; App. E (126.6 MHz iCE40)
Hand-wire the TB$\leftrightarrow $DUT signal interface	the `‘WIRE` topology is emitted as wires (a graph walk)	`actor_pkg` `‘WIRE`, Ch. 6
Hand-plumb a SCE-MI transactor per interface	the bridge is generated from the `TransportBridgeActor` declaration, at seams only	SCE-MI v2.4 (Accellera); FireSim bridge model
Hand-tune partition hints per RTL revision	the `‘WIRE` graph is the (fast-mode) partition plan	FireAxe, ISCA 2024
Relocate the coverage class into a bound emulator module	coverage lives in the IR; maps merge across sim / emulator / formal	simulator-independent coverage, ASPLOS 2023
Re-create the stimulus environment for the emulator	one version-controlled workload artifact drives every substrate	FireMarshal, ISPASS 2021
Prove it only on the scarce, sealed hardware	metasimulation: bit-/cycle-exact, laptop-provable, no FPGA	FireSim metasim; the substrate-swap example

Where a bridge is actually needed

If the whole graph is on the fabric, what is left for a bridge? Only a genuine software-to-hardware seam — an edge with software unavoidably on one side and hardware on the other. There are a few, and their number is set by the system’s real boundaries, not by the size of the design:

• The workload read-out. A person, or a CI dashboard, reads pass/fail and coverage summaries and issues high-level commands. That endpoint is software; the bridge carries this thin, low-rate traffic — the read-out the host performs in the substrate-swap example.
• External I/O. Stimulus that genuinely originates outside the simulated system, or output that genuinely leaves it, crosses at the system boundary.
• A link whose far side is software. A host CPU model in hybrid emulation, a remote emulator pool in cloud emulation, a real device in in-circuit emulation, or a host-resident switch connecting many emulated nodes. In each, one endpoint is software because the program chose to model it in software — not because verification is inherently host-resident.

At such a seam the bridge is an ordinary actor. It receives typed messages on one side, serializes them over a transport, and republishes inbound bytes as typed messages on the other — the same transport layer actor_distributed_pkg already provides for cross-process and cross-machine actors. The shape, in SystemVerilog (illustrative — the exact serialize/send_bytes/recv_bytes signatures live in actor_distributed_pkg, where the inbound poll runs as the base actor’s forked run()):

class FpgaProxyActor #(type T = MsgBase) extends Actor;
  TransportBridgeActor bridge;    // PCIe DMA / USB / ZMQ / Ethernet / shared mem

  function new(string name, TransportBridgeActor b);
    super.new(name);
    bridge = b;
  endfunction

  // host -> emulator: serialize and forward (per-message handler)
  virtual task act(MsgBase m);
    T msg;
    byte_stream_t bytes;
    if (!$cast(msg, m)) return;   // type-safe unwrap; other message types are ignored
    serialize(msg, bytes);        // fill the wire-format buffer
    bridge.send_bytes(bytes);     // wire format depends on transport
  endtask

  // emulator -> host: deserialize inbound bytes and republish to topology.
  // Expected to be forked from the base actor's run() (see actor_distributed_pkg).
  virtual task receive_loop();
    byte_stream_t bytes;
    T msg;
    forever begin
      bridge.recv_bytes(bytes);   // assumed to block until a message arrives
      msg = deserialize(bytes);
      publish(msg);               // back onto the actor bus
    end
  endtask
endclass

An actor on either side of the seam wires to this bridge with ‘WIRE the way it would wire to any other actor, and neither side knows the other is across a substrate boundary. The point to keep straight is that the bridge earns its place because of that boundary, not because the testbench lives on the host. The bridge’s $\sim $50 lines are the surface area of one seam; they are not the surface area of the substrate swap, which re-renders the whole graph and adds no bridge inside the verification loop at all.

Adapting to non-actor RTL. When the block under test is itself an actor, its synthesized form natively understands typed messages and the surrounding verification actors wire to it directly — all on the fabric. When the block is legacy non-actor RTL — a Verilog module with an AXI or APB interface — a small synthesizable BFM (bus functional model) sits between the actor fabric and the bus pins, translating typed messages into the bus protocol and back. The BFM is itself on the fabric: it is a hardware-to-hardware adapter between two on-chip representations, not a host crossing. The verification actors do not change; only the protocol-translation block is per-DUT-class, and it rides the emulator with everything else.

Scale does not add bridges. A 10 B-gate SoC is seven orders of magnitude larger than Appendix E’s iCE40 demonstration, but the arrangement does not gain a single bridge as it grows. More design means more synthesized actors on the fabric and more verification actors beside them; the seams stay where the system’s real boundaries are — the workload read-out, external I/O, and whatever links the program chose to terminate in software. The team that built the actor-based scoreboards, coverage actors, RAL, and regression dashboards for the simulation flow runs the same artifacts on the emulator, re-rendered to RTL. There is no second testbench, and no bank of bridges standing between the testbench and the design. What remains a roadmap item is the end-to-end FPGA bring-up of a full SoC at this scale — the per-actor synthesis and the whole-loop fabric are demonstrated (the substrate-swap example, Appendix G), and the move onto a commercial emulator is bounded engineering, not a new methodology.

The generated seam, concretely: from one bridge actor to a SCE-MI transactor

The bridge actor above is one declaration: a typed message in, a typed message out, a transport named. On a commercial emulator that declaration has to become a SCE-MI transactor — and the shape of that transactor is fixed by the declaration, which is exactly why it can be generated rather than written.

Recall what SCE-MI standardizes: a multi-channel message interface between a host proxy and an emulator-resident synthesizable model, carrying transactions, not pins. A well-formed transactor encapsulates the per-cycle bus protocol on the emulator side and crosses the boundary once per transaction; the gap between the published 64$\times $ (signal-based) and 2091$\times $ (transaction-based) ZeBu accelerations is precisely this transaction-versus-pin choice — the Amdahl mechanism of §7.16, in vendor numbers. For an actor whose outbound message is an AXI burst, the emitted target-side transactor takes the shape:

// Emitted target-side SCE-MI transactor for an AXI-burst bridge actor.
// One inbound message = one full burst; one outbound message = one response.
// The host crosses the boundary TWICE per burst, not 2*N times.
module axi_burst_xtor (
  input  logic           clk, rst_n,
  // SCE-MI message ports (driven by the host proxy over the message channel)
  input  axi_burst_msg_t burst_in,        // {addr, len, data[]} -- one transaction
  input  logic           burst_in_valid,
  output logic           burst_in_ready,
  output axi_resp_msg_t  resp_out,        // {status, rdata[]} -- one transaction
  output logic           resp_out_valid,
  input  logic           resp_out_ready,
  axi_if.master          axi              // pin-level AXI to the emulator-resident DUT
);
  // The actor's act() handler, rendered to RTL, runs the per-beat AXI protocol
  // internally: unpack burst_in once, drive len+1 AXI beats, collect the
  // responses, pack them into one resp_out. No per-beat host crossing.
endmodule

The host side is the same TransportBridgeActor proxy the simulation flow already used (the FpgaProxyActor above), now bound to the SCE-MI message channel instead of a ZMQ socket: it serializes the typed burst message into the input pipe and deserializes the response from the output pipe. Three facts make this generation, not authoring:

• The message type fixes the wire format. The burst’s fields — address, length, data — are the message struct; the marshalling is a field walk, identical to the one the distributed transport already performs for cross-machine actors.
• The protocol unfold is the actor’s own act() body. An AXI-driver actor already contains the beat-by-beat logic as its handler; rendered to RTL by the six-step mechanism, that handler is the transactor’s internal state machine. There is nothing to re-code.
• The transaction granularity is automatic. Because the actor communicates in whole typed messages and never in pins, the emitted transactor is transaction-level by construction — it physically cannot be the “chatty” per-pin transactor that collapses throughput, because the actor never exposed pins to cross.
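
The field-walk marshalling in the first point can be illustrated with a small Python model. It is a sketch under stated assumptions, not the actor_distributed_pkg wire format: RegWriteMsg, its field widths, the little-endian packing, and the helper names are hypothetical, and a variable-length payload such as a burst’s data[] would additionally need a length prefix.

```python
# Hypothetical field-walk marshaller: the wire format is derived from the
# message declaration, so no per-message packing code is written by hand.
import struct
from dataclasses import dataclass, field, fields


@dataclass(frozen=True)
class RegWriteMsg:
    # "fmt" holds a struct format code: I = 32-bit unsigned, B = 8-bit unsigned
    addr: int = field(metadata={"fmt": "I"})
    data: int = field(metadata={"fmt": "I"})
    strobe: int = field(metadata={"fmt": "B"})


def wire_format(cls):
    """Walk the declared fields in order and build a little-endian format string."""
    return "<" + "".join(f.metadata["fmt"] for f in fields(cls))


def serialize(msg):
    values = [getattr(msg, f.name) for f in fields(msg)]
    return struct.pack(wire_format(type(msg)), *values)


def deserialize(cls, raw):
    return cls(*struct.unpack(wire_format(cls), raw))


msg = RegWriteMsg(addr=0x1000, data=0xDEADBEEF, strobe=0xF)
raw = serialize(msg)
print(len(raw), deserialize(RegWriteMsg, raw) == msg)
```

Because the format string is derived from the message declaration, adding a field to the message changes the wire format on both sides of the seam without a hand-written marshaller.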

This is the concrete form of step three of the conversion: the seam a status-quo program hand-plumbs per interface is, for an actor graph, an emission whose every field is determined by a declaration the engineer wrote once. On FireSim the same declaration emits a FireSim bridge instead (§7.18): the target-side stub and host-side driver are the FireSim equivalents of this transactor and proxy.

What synthesizable actors do to Amdahl’s $f$

§7.16 named the equation; this is its corollary in the actor framework’s vocabulary. The previous subsection drew a sharp line between Amdahl-collapsing two-way synchronous traffic and Amdahl-free one-way streams. The actor framework wins on both axes by construction, not by clever engineering effort applied to a UVM substrate.

The structural payoff: actors eliminate the traffic category that creates $f$.

• Asynchronous at the transaction level. The actor framework’s communication primitive is publish(), which writes into the consumer’s mailbox and returns immediately: fire-and-forget by construction. The producer never blocks on the consumer’s acknowledgement; there is no transaction-level handshake to wait on. Compare UVM’s sequencer–driver pattern, where the handshake (get_next_item … item_done, plus get_response where used) closes synchronously per item, a synchronous round trip built into every transaction. The two patterns are not engineering variants of the same architecture; they are different models of computation. UVM’s is synchronous; the actors’ is asynchronous. The synchronous round-trip category that Amdahl punishes is a UVM property, not a verification property. An asynchronous-by-construction substrate does not generate it in the first place.
• Synthesizable at the signal level. Even the asynchronous messages between actors have to land somewhere. Appendix E’s Five Rules constrain each actor to a synthesizable shape (bounded mailbox, ready/valid handshake on every channel, fixed fan-out, no dynamic allocation, no virtual dispatch in the hot path), which means the actors themselves — producer, consumer, and the FIFO between them — compile to RTL and run inside the emulator. The ready/valid handshakes that connect adjacent actors are not host crossings; they are intra-emulator wires. The fraction of actor traffic that crosses the host-emulator boundary is whatever the program explicitly routes through a proxy bridge — by default, nothing.

The combined result is that the actor substrate does not just reduce $f$; it removes the category of cycles that count toward $f$. In UVM-style flows, $f \to 1$ because every transaction is synchronously brokered through the host; in actor-style flows, $f \to \varepsilon $ because the producer-consumer chain is asynchronous and both endpoints are inside the emulator. The Amdahl penalty is structural to the synchronous-handshake-through-host pattern, and the actor methodology does not instantiate that pattern.

The per-component view. The structural argument has a per-component reading: each piece of host-resident infrastructure in a UVM-style flow becomes a place where the actor framework drops the round-trip traffic the host-resident piece used to require.

• Stimulus. The host-resident sequencer + driver chain becomes a synthesizable stimulus actor on the emulator. The per-transaction round trip vanishes; stimulus is published at the emulator’s clock.
• Checking. The host-resident scoreboard becomes a synthesizable scoreboard actor on the emulator that subscribes to the DUT’s typed messages and emits pass/fail bins on its own typed channel. Only the bin updates cross the boundary, at the bin-flip rate rather than the transaction rate.
• Coverage. The host-resident coverage collector becomes a synthesizable coverage actor on the emulator. Bins are accumulated locally; periodic snapshots cross the boundary at human-readable cadence.
• RAL. The host-resident register abstraction layer’s CSR-broker role becomes a synthesizable RAL actor on the emulator. The host issues high-level “write register CTRL” commands at sequence-item granularity; the RAL actor unfolds them into bus transactions inside the emulator.

What is left to cross the boundary (external I/O that genuinely originates outside the simulated system, the high-level pass/fail and coverage-summary messages that go to the host dashboard, the human-issued RAL commands) is at the workload level, not the cycle level. Plugging plausible numbers back into the Amdahl equation, this moves a typical UVM-style flow from $f \approx 1$ to $f \approx 0.001$–$0.01$, and the system rate climbs from $\sim $10 kHz to $\sim $1–3 MHz. The emulator’s MHz clock finally shows up at the testbench level.
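
The arithmetic can be checked with a short Python sketch. The formula below is the standard serial-fraction (Amdahl) form and is assumed here to match the equation of §7.16; $f$ is taken as the fraction of target cycles that synchronize through the host. The 10 kHz host-bound rate comes from the figure above, while the 3 MHz emulator clock is an assumed value inside the 1–10 MHz emulator range.

```python
def system_rate(f, r_emul_hz, r_host_hz):
    """Effective cycles/second when a fraction f of cycles runs at the
    host-bound rate and the rest at the emulator clock (assumed Amdahl form)."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    seconds_per_cycle = f / r_host_hz + (1.0 - f) / r_emul_hz
    return 1.0 / seconds_per_cycle


R_EMUL_HZ = 3e6   # assumed emulator clock, not a measured value
R_HOST_HZ = 1e4   # host-synchronized rate, the ~10 kHz figure

for f in (1.0, 0.01, 0.001):
    rate = system_rate(f, R_EMUL_HZ, R_HOST_HZ)
    print(f"f = {f:<6} system rate = {rate / 1e3:8.1f} kHz")
```

With these assumed inputs the sketch gives about 10 kHz at $f = 1$, about 0.75 MHz at $f = 0.01$, and about 2.3 MHz at $f = 0.001$, the same order as the figures quoted above. A different emulator clock shifts the ceiling but not the shape: the host-bound term dominates until $f$ is small.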

What is shipped, and what is extrapolated. Honesty about the substrate matters here, and two worked examples now ship. Appendix E synthesizes a 32-bit counter actor through Yosys and place-and-routes it on an iCE40 HX8K at 126.6 MHz — real silicon, with a mechanical class-to-RTL translation. The substrate-swap example (Appendix G) carries the claim onto the testbench: it synthesizes a stimulus actor, a scoreboard actor (golden model, expected-value FIFO, comparator), and a coverage actor, wires them into one fabric with the DUT, and runs the whole verification loop — so “the scoreboard and coverage synthesize” is demonstrated, not asserted, and the fabric’s results match the software rendering exactly. What is still extrapolated is scale: a production RAL with deep register maps, and the full-SoC fabric at industrial gate counts, are larger instances of the same Five Rules and the same ready/valid discipline, not new mechanisms. The Amdahl math is just arithmetic — $f$ small gives $R_{\text {system}}$ close to $R_{\text {emul}}$ regardless of which actor pieces shrank $f$ — and the $f \approx 0.001$–$0.01$ figure is the target an actor-discipline program engineers toward. The end-to-end FPGA bring-up of a full SoC on a commercial emulator remains the roadmap item; the per-actor synthesis and the whole-loop fabric that underwrite it are shipped, and §7.18 carries the same fabric onto the open FireSim platform.

The mechanical question is how each of those host-resident chunks becomes a synthesizable actor. The next subsections walk it.

Multi-FPGA partitioning: the actor wire graph as the partition plan

A 10 B-gate SoC does not fit on one FPGA. On an FPGA-based emulator or prototyper, the compile flow partitions the RTL graph across dozens to hundreds of FPGAs. The partitioning is the long pole: a good cut produces partitions whose per-FPGA P&R closes in hours and whose inter-FPGA TDM cost is bearable; a bad cut produces partitions whose timing closure spins for days.

The classical experience is that automatic partitioning is workable but not great. The compiler does it from an EDIF netlist plus FPGA count, but the resulting cuts often slice across high-fanout buses or critical paths, and the engineer has to add manual partition hints to recover predictability. The hints are a per-design effort; they do not transfer cleanly between RTL revisions; they are the single largest source of compile-time variability in FPGA-based emulation.

The actor framework changes the partitioning starting point. The ‘WIRE graph at elaboration time is a list of edges between named producer and consumer actors. The edges have known message types (and therefore known bit widths); the cardinality of fan-out is known; the back-pressure structure is explicit. The ‘WIRE graph is a directly usable partition plan. A natural cut is one that places each actor (or each subgraph of densely-coupled actors) on its own FPGA, with the inter-actor edges becoming the inter-FPGA crossings.
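
As a rough illustration of why the graph is a usable partition plan, the Python sketch below scores a candidate placement by the bits that cross FPGA boundaries. The edge widths, actor names, and the two extra bits per edge for valid and ready are assumptions; a real partitioner would also weigh capacity and timing, which this model ignores.

```python
from collections import defaultdict


def crossing_bits(edges, placement):
    """Sum the signal bits on every edge whose endpoints sit on different FPGAs.

    edges:     iterable of (producer, consumer, payload_width_bits)
    placement: dict mapping actor name -> FPGA index
    Returns a dict mapping (fpga_a, fpga_b), with fpga_a < fpga_b, to total bits.
    """
    totals = defaultdict(int)
    for producer, consumer, width in edges:
        a, b = placement[producer], placement[consumer]
        if a != b:
            # payload plus one valid and one ready bit per ready/valid channel
            totals[(min(a, b), max(a, b))] += width + 2
    return dict(totals)


# Illustrative graph: widths are invented for the example.
edges = [
    ("stim", "dut", 96),
    ("dut", "scoreboard", 40),
    ("scoreboard", "coverage", 8),
]
placement = {"stim": 0, "dut": 0, "scoreboard": 1, "coverage": 1}
print(crossing_bits(edges, placement))
```

Since every input to the score is already present in the declared edge list, comparing two placements needs no netlist inference, only a second call with a different placement dictionary.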

FireAxe confirms the cut is the cheap one. The open FireSim platform (§7.18) makes this concrete and published. Its FireAxe extension (ISCA 2024) partitions a single large RTL design across multiple FPGAs, and it draws exactly the distinction the actor discipline enforces. FireAxe’s fast mode partitions at latency-insensitive boundaries — ready/valid or credit-based interfaces — and runs about twice as fast as its exact mode, the mode it must fall back to when a boundary has combinational paths running through it; adding the one cycle of latency that a ready/valid cut needs costs, in its authors’ measurements, nearly zero accuracy. The actor framework’s Five Rules put a ready/valid handshake on every ‘WIRE edge by construction, so every actor boundary is already a fast-mode cut. The partition plan is not merely legible in the ‘WIRE graph; it is built entirely from the cheap kind of edge. A design not written to that discipline must hunt for latency-insensitive cut points; an actor graph is nothing but latency-insensitive cut points.

What this buys for the FPGA-based emulators. On ZeBu and HAPS, the partition is the source of compile-time pain. An actor-based design ships with the partition plan implicit in its ‘WIRE graph; the engineer can feed the graph to the partitioner as a hint and skip the slow inference step. Compile-time variability drops; cross-FPGA timing closure becomes predictable.

What this buys for the custom-silicon emulators. On Palladium and Veloce, partitioning is the vendor compiler’s deterministic job rather than a user-visible problem, but the same property gives an analogous benefit: the actor graph is the natural unit for emulator-runtime checkpointing. Each actor has a well-defined state and a well-defined input/output interface; a checkpoint at the actor boundary is small and resumable. The InfiniTrace and equivalent unlimited-depth trace engines need exactly this structure to scale.

The unification. Partitioning across many FPGAs is invisible to the verification environment for a structural reason: the verification actors ride the same fabric as the design, and a ‘WIRE edge is just a wire. When the partitioner cuts an edge across an FPGA boundary it bridges the two halves with time-division multiplexing; neither the producing actor nor the consuming actor can tell that the wire between them now crosses a seam. The actor graph the engineer authored is the partition-independent view, and the emulator’s internal cuts never appear in it.

The same conversion, two targets: commercial emulators and FireSim

The six automatic steps describe a conversion, not a platform. The same authored graph reaches both kinds of hardware a 2026 program would target; only the compiler back-end and the seam protocol differ, and both are generated.

Target one: a commercial emulator (Palladium, ZeBu, Veloce). The actor graph synthesizes to RTL and compiles onto the processor array or FPGA array exactly as the design does. The vendor compiler does not distinguish a scoreboard actor from a DUT block, because both are RTL. The bridge actors at the genuine seams become SCE-MI transactors, generated from the TransportBridgeActor declaration rather than hand-coded; the ‘WIRE graph is handed to the partitioner as the partition plan; coverage rides as typed bins in the synthesized coverage actors, read out at workload cadence. What the program does not do is the four-step manual port of the status-quo subsection: no UVM rewrite, no hand-built transactor, no relocated coverage class, no hand-tuned partition. The verification investment made in simulation is the verification investment that runs on the emulator.

Target two: FireSim (open, FPGA-accelerated, cycle-exact). The same graph becomes a FAME-1 target under Golden Gate; the bridge actors become FireSim bridges; and because FireSim is open and offers metasimulation, the conversion is not just automatic but provable — the whole graph runs in software (Verilator) under the FAME transform, bit- and cycle-exactly reproducible against an FPGA run, on a laptop, with no hardware. §7.18 develops this, and appG_firesim_substrate_swap is the runnable artifact: stimulus, DUT, scoreboard, and coverage, all four synthesized to gates and run as one fabric, the host reading only the final counters.

Why both, from one graph. The two targets are the two faces of FPGA-accelerated verification the academic literature has long kept separate: the prototype (one host cycle per target cycle, the commercial-emulator end) and the decoupled simulator (a variable host-cycle ratio, the FireSim end). The FAME taxonomy (§7.5) calls them points on one axis: direct versus decoupled execution. An actor graph sits above that axis. It is a latency-insensitive bounded dataflow network of typed, handshaked nodes, precisely the structure both faces compile from. The methodology never had to choose between commercial emulation and open FPGA simulation, because the thing it converts (the graph) is what both substrates are built to run. One authored testbench, two hardware targets, zero manual conversion: that is the property, demonstrated at whole-loop fabric scale today, with full-SoC bring-up the roadmap item. The remaining subsections make each target concrete.

Worked example: OpenTitan Earl Grey on an emulator

Appendix C took OpenTitan’s Earl Grey SoC (an open-source security chip) and recast all twenty-eight of its IP types — the TileLink bus, the power island (pwrmgr, rstmgr, clkmgr, lc_ctrl), non-volatile memory, the key manager, the crypto block, the entropy chain, the CPU and interrupts, the common peripherals, GPIO / pinmux / pwm / adc, the serial buses, and USB — as cooperating actors. The appendix runs a seven-phase chip-level test on Verilator with behavioral actors, then ships a separate demonstration (§“Pure-Actor Real-RTL DV”) that runs the same actor topology against the real chip_earlgrey_verilator RTL, replacing OpenTitan’s six DPI bridges (uartdpi, gpiodpi, spidpi, usbdpi, jtagdpi, dmidpi) with a single actor substrate. The appendix’s line-count comparison was 10,944 actor-framework lines versus 204,589 UVM lines for the same architectural scope — a 19$\times $ reduction at chip scale.

Appendix C is a simulation case study; it does not run on an emulator. What it demonstrates concretely is that the actor topology and the verification investment built on it (scoreboards, coverage, RAL, supervision) are substrate-agnostic at the simulation level — the same actors verify behavioral models, then verify real RTL through the no-DPI substrate. Extending that demonstrated substrate-agnosticism to an emulator is the natural projection of the methodology, not an empirical result the appendix ships. What that projection would look like, if a program followed the substrate-swap pattern this chapter has been describing:

1. Synthesize Earl Grey. The OpenTitan repository ships with a Yosys-friendly Verilog source tree. Synthesis to a Lattice or Xilinx primitive set is the standard FPGA flow; an emulator’s vendor compiler takes the same RTL.
2. Re-render the DUT as synthesized actors. The behavioral Earl Grey actors that drove the appendix’s simulation flow are re-rendered to RTL — the same authored blocks (pwrmgr, rstmgr, clkmgr, lc_ctrl, the IO complex, the Ibex CPU, the cryptographic accelerators), now synthesized onto the emulator rather than executed as class objects.
3. Re-render the verification actors beside it. The Power-Manager Scoreboard, the Lifecycle State Tracker, the Entropy Chain Verifier, the Alert Receiver — all the actors that verified the simulation flow — are re-rendered onto the emulator too, wired to the synthesized DUT by the same ‘WIRE edges, now intra-fabric wires. They subscribe to the same typed messages they always did; only the workload read-out crosses to the host.
4. Boot Earl Grey’s ROM. The OpenTitan boot ROM runs on the Ibex CPU on the emulator at 2–5 MHz. The scoreboards, now on the fabric, observe the same architectural sequence (reset, lifecycle-state read, entropy-source initialization, OTP read) they observed in simulation. The boot completes in tens of seconds on the emulator; it would have completed in hours in simulation, if at all.
5. Run a real OS workload. OpenTitan ships TockOS as the reference RTOS. Booting Tock on the emulator-resident Earl Grey is a multi-minute workload; the scoreboards observe the syscall path, the timer interrupt chain, and the cryptographic accelerator calls.
6. Find the bugs the simulation never found. Long-running cryptographic-accelerator key-rotation, alert-handler escalation under concurrent fault injection, lifecycle-state-transition under power-glitch — the bugs that need billions of cycles and concurrent stress show up on the emulator, observed by the same actors that did the block-level verification.

The methodology claim is structural. Appendix C’s verification investment — twenty-eight behavioral actors plus the same actors verifying real Verilator RTL through the no-DPI substrate — is built on the same typed-message bus an emulator-resident synthesized actor publishes onto. The Power-Manager Scoreboard that observes PwrStateTransition_s in Appendix C subscribes to that message type by name; whether the publisher is the behavioral PwrmgrActor, the real-RTL block wrapped by Appendix C’s no-DPI chip testbench, or (projected) the same block re-rendered as a synthesized actor on the emulator, the scoreboard does not change. The empirical part of this chain is the first two; the third is the natural substrate-swap extension. The team that built Appendix C’s testbench would be the team that runs it on the emulator, with no second methodology, when the program engineers that final substrate swap.

What the emulator adds. Three things the simulation flow could not give:

• Real boot times. Tock boots in minutes; the simulation flow could not boot it in practical time.
• Real concurrent fault scenarios. The alert handler under concurrent ROM-integrity-check, OTP-readout, and crypto key-rotation stress: provoking these in simulation requires synthetic stimulus that the human verifier writes; on the emulator the real software stack provokes them organically.
• Real timing. The cryptographic accelerators’ completion latency varies with workload; the emulator’s MHz-class clock preserves the relative timing (within the constant emulator-to-silicon factor), letting the verifier see software-visible timing relationships that simulation flattens.

The Earl Grey example generalizes. Any actor-based verification environment, whether the DUT is open-source like Earl Grey or proprietary like a flagship mobile SoC, ports to the emulator the same way: re-render the DUT and the verification actors as synthesized actors on the fabric, with bridges only at the genuine software-to-hardware seams, and run. Same scoreboards, same coverage, same RAL, same dashboards. The boundary between simulation and emulation is no longer an artifact transition — it is a substrate substitution.

The same Earl Grey graph on both targets. The conversion this chapter claims is dual-target, and Earl Grey makes it concrete on an open-source chip. Appendix C’s 10,944-line actor graph — the twenty-eight IP actors plus the Power-Manager Scoreboard, the Lifecycle State Tracker, the Entropy Chain Verifier, and the Alert Receiver — is the single source for both targets. Neither sees a second testbench.

• On a commercial emulator (Palladium / ZeBu / Veloce). The vendor compiler takes the synthesized Earl Grey RTL and the synthesized verification actors as one netlist — it cannot tell the Alert Receiver from the alert handler, because both are RTL. The ‘WIRE graph is handed to the partitioner as the cut plan. The genuine seams — the host dashboard reading pass/fail, the JTAG and UART consoles, any ICE link to real silicon — become generated SCE-MI transactors of the shape above. The boot ROM and TockOS run on the Ibex at 2–5 MHz; the on-fabric scoreboards observe PwrStateTransition_s and the alert escalation sequence exactly as they did in simulation. No UVM rewrite, no hand transactor, no relocated coverage.
• On FireSim. The same graph becomes a FAME-1 target under Golden Gate; the same seams become FireSim bridges. Because FireSim offers metasimulation, the alert-handler topology is proven bit-/cycle-exact in Verilator — on a laptop, with no FPGA — before any bitstream is built, then runs unchanged on EC2 F1. The artifact that makes this credible at small scale is the substrate-swap example (the whole loop on the fabric); Earl Grey is the same construction at chip scale.

The only per-target artifacts in the entire flow are the generated seam adapters — SCE-MI transactors on one side, FireSim bridges on the other — and both are emitted from the same TransportBridgeActor declarations, not authored. The 10,944-line graph is byte-identical across the two. That is the dual-target conversion the chapter’s contribution names, on a real chip: one testbench, two hardware substrates, the only difference a generated adapter. (Honesty, unchanged from above: Appendix C ships the simulation-level demonstration; the full-chip bring-up of Earl Grey on a commercial emulator and on FireSim is the projected extension this section describes, with the substrate-swap example the shipped proof of the mechanism.)

Continuous integration on emulators

The traditional view of emulators is that they are too expensive and too slow to be part of a CI loop. CI runs on Jenkins or GitHub Actions on every commit; emulators take hours to compile and run; the two cadences do not match.

The actor framework’s distributed-regression layer (Appendix L) opens a different option: the emulator is one more execution backend in a regression pool, alongside the simulator pool. The actor framework’s typed-message bus aggregates results from both. A nightly CI cycle looks like:

• Commit lands at 5 PM. Jenkins kicks off a simulation regression on a 1000-instance Verilator pool; each instance runs a directed or random test; results stream into the regression dashboard as typed coverage messages.
• The emulator wakes at 6 PM. A cron-triggered emulator job picks the latest RTL, runs the incremental compile (20 min – 2 hrs on a processor-array emulator), and starts the workload battery: Linux boot, Android boot, the previous bug regression set. The emulator’s results stream into the same dashboard, via the same proxy actors.
• Cross-leg coverage merge. The dashboard subscribes to coverage messages from both simulation and emulation legs. A bin marked as “not yet hit” in simulation might be hit by the emulator’s Linux boot the same night; the bin is closed in the merge.
• Bug triage at 8 AM. The verifier’s morning report shows simulation regressions passed/failed, emulation regressions passed/failed, the coverage delta from both legs, and bugs filed by the cross-leg correlation engine when the same RTL signature triggered a failure both in simulation and on the emulator (with the emulator workload providing the bigger trace context).
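The cross-leg coverage merge in the cycle above can be sketched in a few lines. This is a minimal illustration, not the framework's implementation: the `CoverageMsg` type, its field names, the bin names, and `merge_coverage` are all hypothetical, chosen only to show how a dashboard that subscribes by message type can close a bin regardless of which leg hit it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CoverageMsg:
    """Hypothetical typed coverage message; the field names are illustrative."""
    leg: str       # "simulation" or "emulation"
    bin_name: str
    hits: int


def merge_coverage(messages, all_bins):
    """Merge per-leg hit counts; a bin is closed if any leg hit it at least once."""
    hits = defaultdict(lambda: defaultdict(int))
    for m in messages:
        hits[m.bin_name][m.leg] += m.hits
    report = {}
    for b in all_bins:
        legs = sorted(leg for leg, n in hits[b].items() if n > 0)
        report[b] = {"closed": bool(legs), "closed_by": legs}
    return report


# Invented example data: simulation misses a bin that the emulator's boot hits.
msgs = [
    CoverageMsg("simulation", "alert_escalation_phase0", 12),
    CoverageMsg("simulation", "kernel_panic_handler", 0),
    CoverageMsg("emulation", "kernel_panic_handler", 3),
]
bins = ["alert_escalation_phase0", "kernel_panic_handler", "ecc_double_bit_error"]
report = merge_coverage(msgs, bins)
for name, status in report.items():
    print(name, status)
```

The merge never inspects which substrate produced a message beyond recording it for the report, which is the property the text relies on: the dashboard keys on the bin, not on the leg.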

Appendix L ships the bridge architecture concretely — a three-process polyglot demonstration with two SystemC WorkerActors publishing through a ZmqPublisherBridge, a SystemC Scoreboard subscribing through a ZmqSubscriberBridge, and a Python subscriber as a polyglot consumer, all aggregating ten typed events through the same wire format. The nightly CI cycle described above is the natural extrapolation that appendix’s own §L.5 (“One Regression Bus for All Three Legs of EV”) forecasts: a central dashboard subscribed to typed coverage and result messages, aggregating across simulation and emulator workers over the same transport. The substrate is empirical at small scale; the regression-farm-scale deployment is operational engineering on the same substrate.

This requires no separate emulation team and no separate emulation dashboard. The substrate that aggregates simulation and emulation results is the same actor bus; the messages are typed; the dashboards subscribe by message type, not by substrate. The cost of adding emulation to CI is the cost of the emulator capacity and the incremental compile time; the methodology cost is zero.

The cloud variant. If the emulator is in the cloud, the CI integration is identical — the proxy actor talks to the remote pool the same way it would talk to an on-prem box. Cloud capacity makes the CI option viable for Tier-2 programs that cannot justify a private emulator at all. The dashboard does not know the difference.

7.18 An open, runnable substrate swap: the actor model on FireSim

Everything in this chapter so far has described emulation through closed platforms: Palladium’s processor array, ZeBu’s FPGA partitions, the vendor transactors that bridge host and emulator. Their internals can be described but not cited. FireSim is the exception: an open-source platform that realizes the host/target-decoupled, FPGA-accelerated approach in public. Built at Berkeley (Karandikar et al., ISCA 2018), it runs silicon-proven RISC-V RTL on commodity cloud FPGAs (Amazon EC2 F1) and on-prem boards. The precise framing matters: FireSim is FPGA-accelerated cycle-exact simulation, not commercial emulation. It transforms the target RTL so that one target clock cycle executes over a variable number of host FPGA cycles, rather than mapping the design one-to-one onto the fabric, and that host/target decoupling is exactly what makes the substrate swap mechanizable.

The mechanism, in the open. FireSim’s FAME-1 transform (the Golden Gate compiler, Magyar et al., ICCAD 2019) gates the target’s state with a global enable: the target advances one cycle only when every input channel has a token and every output channel is ready (targetFire). One target cycle then takes a variable number of host cycles — and the special case of exactly one host cycle per target cycle is an FPGA prototype, which is why this chapter could treat prototypes and emulators as points on a single axis. Host-side timing models plug into the same decoupling: FASED (Biancolin et al., FPGA 2019) models DRAM timing as target-time RTL instead of mapping a real controller. The published, formally grounded statement of the trick (a latency-insensitive bounded dataflow network) is the open analogue of what the closed FPGA-based emulators do internally and never document.
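The token-gating rule can be illustrated with a toy model. The sketch below is not Golden Gate's implementation; `Channel`, `target_step`, and `run_decoupled` are hypothetical names, and the channel capacity and the accumulator target are invented for the example. It shows only the property the paragraph describes: the target advances one cycle when its input channel holds a token and its output channel has room, so host-side stalls change how many host cycles elapse but not what the target computes.

```python
from collections import deque


class Channel:
    """Bounded token queue between host and target (capacity is illustrative)."""

    def __init__(self, capacity=2):
        self.q = deque()
        self.capacity = capacity

    def has_token(self):
        return len(self.q) > 0

    def ready(self):
        return len(self.q) < self.capacity


def target_step(state, x):
    """One target clock cycle of a toy 32-bit accumulator."""
    new_state = (state + x) & 0xFFFFFFFF
    return new_state, new_state


def run_decoupled(inputs, host_delays):
    """host_delays[i] is the number of idle host cycles before token i is delivered."""
    in_ch, out_ch = Channel(), Channel()
    state, outputs = 0, []
    host_cycles = target_cycles = 0
    pending = deque(zip(inputs, host_delays))
    wait = pending[0][1] if pending else 0
    while target_cycles < len(inputs):
        host_cycles += 1
        # Host side: deliver the next input token once its delay has elapsed.
        if pending and in_ch.ready():
            if wait == 0:
                in_ch.q.append(pending.popleft()[0])
                wait = pending[0][1] if pending else 0
            else:
                wait -= 1
        # Host side: drain at most one output token per host cycle.
        if out_ch.has_token():
            outputs.append(out_ch.q.popleft())
        # The firing condition: every input has a token, every output has room.
        if in_ch.has_token() and out_ch.ready():
            state, y = target_step(state, in_ch.q.popleft())
            out_ch.q.append(y)
            target_cycles += 1
    while out_ch.has_token():
        outputs.append(out_ch.q.popleft())
    return outputs, host_cycles, target_cycles


data = [1, 2, 3]
fast = run_decoupled(data, [0, 0, 0])     # host never stalls
stalled = run_decoupled(data, [0, 3, 0])  # host stalls before the second token
print(fast)
print(stalled)
```

With no stalls the model spends one host cycle per target cycle, which is the prototype special case; with stalls the host-cycle count grows while the target-visible output sequence stays the same.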

The target is the fabric; bridges are the seams. FireSim draws exactly the line this section has been drawing. Its target — the RTL that is FAME-transformed and placed on the FPGA — is the whole design, and a FireSim bridge is a host-side C++ endpoint paired with a target-side RTL stub, for the I/O that genuinely leaves the target: a console, a DRAM timing model, a network link. Lay the actor graph over this and the correspondence is exact. A plain actor is target RTL: stimulus, DUT, scoreboard, and coverage all synthesize into the FireSim target and run on the FPGA, just as they did on the fabric in the substrate-swap example. A bridge actor — the kind that sits at a software-to-hardware seam (§7.17) — is a FireSim bridge: its target side is on the FPGA, its host side is C++. FireSim arrived independently at the same split, and at the same conclusion that only the seams cross to software.
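The split between plain actors and bridge actors can be sketched as a simple rendering step. This is an assumption-laden illustration, not the framework's code: the `Actor` record, the `is_seam` flag, the actor names, and the `render` function are hypothetical. The point it shows is that the on-fabric portion of the graph is the same for both substrates, and only the generated seam adapter differs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Actor:
    """Hypothetical actor record; is_seam marks I/O that genuinely leaves the target."""
    name: str
    is_seam: bool = False


# Adapter kinds named in the text; the mapping keys are invented labels.
ADAPTER_KIND = {
    "commercial": "SCE-MI transactor",
    "firesim": "FireSim bridge",
}


def render(graph, substrate):
    """Split one authored graph into on-fabric actors and generated seam adapters."""
    kind = ADAPTER_KIND[substrate]
    fabric = [a.name for a in graph if not a.is_seam]
    adapters = {a.name: kind for a in graph if a.is_seam}
    return fabric, adapters


graph = [
    Actor("stimulus"),
    Actor("dut"),
    Actor("scoreboard"),
    Actor("coverage"),
    Actor("host_dashboard", is_seam=True),
    Actor("uart_console", is_seam=True),
]

fabric_a, adapters_a = render(graph, "commercial")
fabric_b, adapters_b = render(graph, "firesim")
print(fabric_a, adapters_a)
print(fabric_b, adapters_b)
```

In this sketch the authored graph is never edited per target; the substrate argument selects only which adapter kind is emitted at each seam.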

A runnable demonstration. appG_firesim_substrate_swap makes this concrete. It synthesizes the entire verification loop — stimulus, accumulator DUT, scoreboard, and coverage, each a finite state machine in the synthesizable form — into one fabric, and runs it two ways: the four actors as software objects, and the same four as the synthesized fabric under Verilator. The results are identical (256 transactions, zero mismatches, full coverage), and Yosys confirms every actor maps to gates — the scoreboard’s 231 flip-flops carry its golden model, FIFO, and counters, beside the DUT’s 33. In the hardware run the host does not drive transactions or hold a scoreboard; it resets the fabric, clocks it, and reads the final counters. The verification environment is on the fabric, not on the host. This is the substrate swap demonstrated end to end, not projected.
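The software side of such a loop can be sketched as four small state machines. The code below is not the shipped artifact: the class names, the stimulus pattern, the 32-bit accumulator width, and the 16-bin coverage model are assumptions made for illustration. It shows the shape of the run the paragraph describes, where the caller only starts the loop and reads the final counters.

```python
class Stimulus:
    """Emits n_txn 8-bit operands, one per step (the pattern is illustrative)."""

    def __init__(self, n_txn=256):
        self.i, self.n = 0, n_txn

    def done(self):
        return self.i >= self.n

    def step(self):
        x = (self.i * 37 + 11) & 0xFF  # odd multiplier, so 256 steps visit every byte value
        self.i += 1
        return x


class AccumulatorDut:
    """Toy DUT: a wrapping accumulator (width assumed to be 32 bits)."""

    def __init__(self):
        self.acc = 0

    def step(self, x):
        self.acc = (self.acc + x) & 0xFFFFFFFF
        return self.acc


class Scoreboard:
    """Golden model plus check and mismatch counters."""

    def __init__(self):
        self.golden = 0
        self.checked = 0
        self.mismatches = 0

    def check(self, x, observed):
        self.golden = (self.golden + x) & 0xFFFFFFFF
        self.checked += 1
        if observed != self.golden:
            self.mismatches += 1


class Coverage:
    """Sixteen bins keyed on the operand's high nibble (bin choice invented here)."""

    def __init__(self):
        self.bins = [0] * 16

    def sample(self, x):
        self.bins[x >> 4] += 1

    def percent(self):
        return 100.0 * sum(1 for b in self.bins if b) / len(self.bins)


def run_loop(n_txn=256):
    """Run the whole loop, then return only the status counters a host would read."""
    stim, dut, sb, cov = Stimulus(n_txn), AccumulatorDut(), Scoreboard(), Coverage()
    while not stim.done():
        x = stim.step()
        y = dut.step(x)
        sb.check(x, y)
        cov.sample(x)
    return {"checked": sb.checked, "mismatches": sb.mismatches, "coverage_pct": cov.percent()}


counters = run_loop()
print(counters)
```

Each class keeps fixed-size state and advances one step at a time, which is the property that lets the same four roles be expressed as synthesizable finite state machines.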

From metasimulation to FPGA. The ./firesim/ scaffold carries that fabric onto FireSim. The whole tb_fabric becomes the FireSim target — BlackBoxing the same .sv files Verilator ran, so the RTL on the FPGA is byte-for-byte the RTL of the substrate-swap example — and a single peek/poke bridge lets the host read the status counters. In FireSim metasimulation the entire target runs in software (Verilator) with no FPGA, and FireSim guarantees that what metasimulation observes is bit- and cycle-exactly what an FPGA run produces; on an EC2 FPGA instance (F1/F2) or an on-prem Alveo the identical build runs at megahertz. Across Verilator, metasimulation, and the FPGA, the fabric does not change — only the substrate underneath it does, and the host only ever reads status.

Why this needs an open platform. The demonstration requires four things at once: authoring a custom bridge (the proxy actor), source access to prove the synthesized actor is bit-for-bit the software actor, a platform anyone can rerun, and a way to prove the architecture with no hardware at all (metasimulation). The closed commercial emulators provide none of them — their transactor glue is sealed vendor IP, they are multi-million-dollar shared boxes, and they have no metasimulation of the bridge stack. FireSim, being open, is the one platform on which the substrate-swap claim becomes a reproducible artifact rather than an assertion. The methodology and the platform meet exactly at the host/target boundary both were built around.

7.19 Workload-scale emulation

Pre-silicon software bring-up is the headline workload class. Published Tier-1 flows consistently report that a six-to-twelve-month software bring-up beginning at silicon arrival compresses substantially — typical figures in the four-to-eight-week range — when bring-up starts on emulation at RTL freeze. The first software release date is the gating item for product launch; emulation moves it from after silicon to before silicon.

Linux boot as the sanity test. The de facto pre-tape-out sanity test is “does Linux boot.” The kernel exercises page tables, exception vectors, MMU translations, cache-coherence transitions, interrupt routing, DMA paths, peripheral discovery, device-tree parsing, and root-filesystem mount — a workload no testbench writes by hand. A passing boot is not a proof of correctness, but a failing boot is one of the most concentrated signals of integration bugs available.

The actor-based RAL is what makes the firmware-driver-runs-real-software flow operationally smooth. The driver’s CSR writes go through the same RAL the simulation testbench used; the publish path puts a typed register-write message on the bus; the coverage actor records the bin; the scoreboard checks the architectural effect. The team does not write a separate “boot harness” — the boot is one more sequence on top of the regression infrastructure that already exists.

Coverage at workload scale. Long-tail coverage bins that simulation effectively never reaches become routinely reachable on the emulator. “Kernel panic handler observed,” “ECC double-bit error logged,” and “power-management state transition under interrupt storm” were rare events at simulation throughput; at emulation throughput they are observed multiple times per Linux boot. The coverage closure plan that targeted these bins via constrained-random simulation pivots to relying on real workloads for them; simulation focuses on the corner cases real workloads do not exercise. The two legs complement rather than duplicate each other.

Workload-driven power. The same actor bus that carries coverage messages carries power-cost messages. Appendix D, §Step 2, specifies pre-RTL power estimation as a power-cost field tagged onto each actor’s published events plus a power-tracking actor that sums activity by clock domain and IP. The same architecture rides the substrate swap onto the emulator: each on-emulator actor’s published events carry the same power-cost field (driven by a hardware activity counter on the synthesized side), and an on-emulator aggregation actor sums identically. The host-side power-tracking actor translates the running sums to mW through a pre-characterized library. Workload-scale power characterization is not a new tool; it is the same actor-based instrumentation Appendix D specified pre-RTL, with the substrate underneath swapped from a simulator to an emulator.
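A minimal sketch of the aggregation follows. It is not Appendix D's code: `PowerEvent`, `PowerTracker`, the domain and IP names, and the coefficients are hypothetical, and the conversion to mW assumes a simple model in which the library gives milliwatts per activity unit per cycle. Only the structure is taken from the text: events carry a power-cost field, a tracker sums them by clock domain and IP, and a library lookup turns the sums into mW.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerEvent:
    """Hypothetical published event carrying a power-cost field."""
    clock_domain: str
    ip: str
    power_cost: int  # abstract activity units, e.g. from an activity counter


class PowerTracker:
    """Sums activity by (clock domain, IP) and converts the sums to mW."""

    def __init__(self, mw_per_unit_per_cycle):
        # Stand-in for the pre-characterized library; the values are invented.
        self.library = mw_per_unit_per_cycle
        self.sums = defaultdict(int)

    def on_event(self, ev):
        self.sums[(ev.clock_domain, ev.ip)] += ev.power_cost

    def report_mw(self, window_cycles):
        # Assumed model: average power = activity * coefficient / window length.
        return {
            key: total * self.library[key[1]] / window_cycles
            for key, total in self.sums.items()
        }


tracker = PowerTracker({"aes": 2.0, "uart": 0.5})
for ev in [
    PowerEvent("clk_main", "aes", 4),
    PowerEvent("clk_main", "aes", 6),
    PowerEvent("clk_io", "uart", 1),
]:
    tracker.on_event(ev)

power = tracker.report_mw(window_cycles=10)
print(power)
```

The tracker does not care whether `power_cost` came from a behavioral estimate or from a hardware counter, which is why the same interface can serve the pre-RTL and the on-emulator cases.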

Performance counters as typed messages. The same CoverageActor pattern that records functional bins records IPC, cache miss rates, and branch-prediction accuracy. A real workload on the emulator exercises these counters; the host-side dashboard subscribes to the typed messages and produces a performance characterization report. No separate performance-tracing tool, no separate instrumentation; the actor substrate carries everything that flows on the typed-message bus.

Save-and-restore: forward pointer. The save-and-restore checkpointing flow is what makes a one-hour Linux boot serve dozens of per-driver test runs without repeating the boot. The next section (§7.20) is dedicated to it because it is the operational unlock for the workload-scale flow, not a peripheral detail.

7.20 Save-and-restore: the operational unlock

Pre-silicon software bring-up at workload scale needs more than raw emulator throughput. Booting Linux on a modern emulator takes well under an hour, but no team waits for it on every test. The operational practice that makes emulator capital pay back for software teams is save-and-restore: a multi-engineer parallel-fast-forward workflow that turns one boot into hundreds of per-test runs.

The mechanism. Boot Linux once. Pause the emulator at a known-good state: after kernel init, after driver load, or after a specific kernel point that all subsequent tests need. Dump the entire state of the design to a checkpoint file: every flip-flop value (10 billion gates’ worth on a flagship SoC), every on-chip memory bank, every in-flight bus transaction, every clock-domain phase relationship. The checkpoint is typically tens of gigabytes; the write takes a few minutes; resume from the checkpoint takes a few more.

The team workflow. Fifty software engineers can load that one checkpoint in parallel on fifty allocated emulator slots. Each fast-forwards past the boot they did not need to repeat and starts running their specific driver test from the same warm state. The per-engineer iteration latency drops from one boot per test (an hour) to zero boots per test (start-from-checkpoint). An emulator that boots Linux once a day serves a team of dozens running per-driver tests from the same checkpoint — the throughput multiplier on per-engineer iteration is what makes the difference between “one emulator per engineer per workload” (uneconomic) and “one emulator boot per workload, fanned out across the team” (the economics every Tier-1 software-bring-up program runs on).
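The amortization argument is simple arithmetic, sketched below. The function name and every default value are assumptions for illustration: the 60-minute boot is the upper figure used in the text, while the 5-minute save, 5-minute restore, and 10-minute test are invented stand-ins for "a few minutes" and a typical driver test.

```python
def emulator_minutes(n_engineers, boot_min=60, save_min=5, restore_min=5,
                     test_min=10, use_checkpoint=True):
    """Total emulator time for one test per engineer, with or without a checkpoint."""
    if use_checkpoint:
        # One shared boot and save, then each engineer restores and runs a test.
        return boot_min + save_min + n_engineers * (restore_min + test_min)
    # Without a checkpoint every engineer repeats the boot before testing.
    return n_engineers * (boot_min + test_min)


team = 50
without_ckpt = emulator_minutes(team, use_checkpoint=False)
with_ckpt = emulator_minutes(team, use_checkpoint=True)
print("emulator minutes without checkpoint:", without_ckpt)
print("emulator minutes with checkpoint:", with_ckpt)
print("ratio:", round(without_ckpt / with_ckpt, 2))
```

Under these assumed numbers the saving grows with team size and with the ratio of boot time to test time; a program would substitute its own measured boot, restore, and test durations.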

Three structural consequences. The save-restore flow does not change what the emulator does — it changes what the team can do with it:

• Engineer-to-emulator ratio inverts. A pre-checkpoint flow needs one emulator per engineer per workload; a save-restore flow needs one emulator boot per workload, with the cost amortized across the whole team.
• Driver development goes parallel. Driver bring-up that previously serialized on emulator access can now run as a wide parallel sweep over the same checkpoint — one engineer per driver, each testing in isolation against the same warm state.
• Cloud emulation makes economic sense. A cloud emulator pool sized to one boot per day per workload, with save-restore to fan out to the development team, is the consumption model the usage-based pricing (§7.9) was designed for.

The actor reading. The checkpoint is one more transport for the emulator’s state. The actor framework treats a restored emulator the same way it treats a freshly booted one — the proxy actors reconnect, the scoreboards subscribe, the dashboard registers the run. No checkpoint-aware testbench code, no special restore-mode actors, no per-engineer testbench instance. The save-restore primitive is invisible at the methodology layer; the team gets the throughput multiplier for free.

7.21 The Sprint Spiral, finally closed

Chapter 1 introduced the Sprint Spiral: a verification program in which each two-week sprint adds incremental capability across all three legs of EV. Formal proves a new block-level invariant; simulation regresses the expanded test set; emulation begins to boot software on the maturing SoC. The three legs feed one regression dashboard, one coverage model, one bug-tracker. The picture was promissory in Chapter 1 because the substrate did not yet exist; Chapter 6 built the substrate; this chapter put the third leg on it.

What is different from the classical view. The classical view treats emulation as a hand-off: build the simulation testbench, regress it; freeze the RTL; transfer to the emulation team, who build the deck and bring up the workloads; transfer to the software team, who begin boot-up. Three teams, three artifact stacks, two hand-offs. The hand-offs cost months and reset the verification investment each time.

The EV view treats emulation as a continuous third lane:

• Sprint 1–2: a single block under the actor framework, verified in simulation and formally where applicable. The synthesizable form is enforced from day one — not because emulation is imminent, but because it costs nothing and the option is open.
• Sprint 3–4: the block joins a sub-system. The simulation testbench grows; formal extends to inter-block properties. The first FPGA dev-board bring-up of the block runs the same actor scoreboard, with a bridge actor across the genuine seam to the block now on the dev-board — a transitional split, until the scoreboard re-renders onto the fabric beside it as the sub-system matures.
• Sprint 5–6: the sub-system joins the SoC. The bridge moves from FPGA dev-board to commercial emulator (Palladium, ZeBu, or Veloce); the same scoreboard, coverage, and RAL ride along. Linux begins booting on the emulator in the background of every sprint.
• Sprint 7+: workload-scale verification. Android boot, software-driven power profiles, multi-day longevity. Throughput finally matches what the workload needs.

There is no hand-off, no transfer, no testbench rewrite at any sprint boundary. The actor topology is the constant; the substrate underneath swaps incrementally. The verification investment compounds across sprints rather than restarting.

What this costs. The synthesizable-form discipline (Appendix E’s Five Rules) does cost something in expressivity. A team that did not adopt it could use full SystemVerilog OOP in simulation, dynamic allocation, virtual dispatch on hot paths, unbounded mailboxes, run-time subscription. The cost shows up later: every block that did not respect the discipline has to be rewritten before it can ride on the emulator. The EV bet is that the discipline up front is cheaper than the rewrite later. The book has argued throughout that this bet is correct; the empirical answer comes from each program’s own measurement of where its verification time goes.

7.22 Looking back: the chapter, the section, the book

This section closes the chapter and the book.

The chapter

The chapter set out to deliver on the third leg of Emergent Verification, the lane Chapter 1 promised but the book had not yet built. The substrate was built in Chapter 6 (actors with typed messages and `WIRE edges); Appendix E demonstrated that the synthesizable form produces real silicon (a counter actor at 126.6 MHz on iCE40), and the substrate-swap example synthesized a whole verification loop (stimulus, scoreboard, coverage, and DUT) onto one fabric; Appendix G specified the bridge architecture for the software-to-hardware seams that remain once the rest of the graph is on the emulator. This chapter brought the industrial scale: Palladium, ZeBu, Veloce as the platforms; hybrid emulation as the dominant 2024–2026 pattern; cloud emulation as the economic shift; ICE as the bridge to real-system integration; Amdahl’s law as the structural reason the synthesizable-actor substrate buys an order of magnitude over UVM-style transactor flows. The Earl Grey case study made the simulation-side methodology argument concrete: twenty-eight IPs cooperating as actors, the same scoreboards and coverage and RAL running against real RTL through an actor substrate that replaced OpenTitan’s six DPI bridges. The emulator extension (the same actor graph re-rendered onto a Palladium or ZeBu, with bridges only at the genuine seams) is the natural next step the chapter laid out; the substrate makes it operationally cheap, and the per-actor synthesis and whole-loop fabric that underwrite it are shipped. What remains a roadmap item is the end-to-end bring-up of a full SoC at that scale, not a new methodology.

The methodology argument compresses to one sentence: because the testbench is an actor graph — synthesizable in full, with host crossings only at real seams — it converts from simulation to emulation with no manual rewrite, and the same authored graph reaches both a commercial emulator and FireSim. To the author’s knowledge this property — a complete verification environment that retargets from software simulation to hardware emulation automatically, for both substrate kinds — is new. The rest of the industry pays the four-step manual tax of §7.17 per interface, per port, and per program: a non-synthesizable testbench cannot cross, so it is split, partly re-coded, hand-plumbed to the boundary, and hand-partitioned. The actor methodology removes the tax not by automating the rewrite but by removing its cause — the testbench was never non-synthesizable and never host-resident to begin with. This is the engineering pay-off of Chapter 6’s substrate. The book’s earlier chapters built the substrate; this chapter showed what it buys when the substrate underneath is a 5 MHz custom-processor array, an FPGA-based emulator, or FireSim’s decoupled fabric rather than a 1 kHz Verilator process.

The arc

Stepping further back: this chapter is the seventh of seven, and the arc of the seven is what the book has been arguing throughout.

Chapter 1: the picture. Hardware is structurally an actor system. Hardware verification is the activity of confirming that the as-built actor system matches the intended one. The verification methodology that fits this shape has three properties: adaptive planning (Emergent Verification), lightweight architecture (the substrate that lets stimulus and checking grow with coverage rather than being built monolithic), and a three-leg pipeline (Formal starting at the block and propagating up to subsystem and SoC scope, Simulation at the sub-system and system level, Emulation at the SoC level). Chapter 1 named the picture; the rest of the book delivered on it.

Chapter 2: the silicon. A verification methodology that does not know the shape of the silicon it verifies is a methodology in the abstract. Chapter 2 walked the RTL — the concurrency model, the basic structural blocks, the arbiter, the handshake controller, the FIFO, the pipelined ALU, the cache controller, the memory controller — so that every subsequent claim about “what verification should do” had a concrete referent.

Chapter 3: the formal leg. The verification capability that handles properties no finite testbench can handle: parameterized properties (“for every $N$-core configuration”), inductive properties (“for every depth $D$ FIFO”), arithmetic correctness (“for every IEEE 754 input pair”). The chapter built the four-engine push-button stack (BMC, $k$-induction, Craig interpolation, IC3/PDR, all on a from-scratch CDCL/SAT core) and its SMT lift for the cases it covers, then climbed the logic ladder to theorem proving for the cases it does not. The closing thesis was that probability proposes and logic verifies, that LLMs and formal verification are siblings of mathematics finally working together, and that the asymmetry between deciding HOL validity (undecidable) and checking a specific proof (decidable) is what makes the top rung of the ladder operationally tractable in 2026.

Chapters 4–5: the simulation tradition. TLM and UVM are the simulation tradition the industry inherited. Chapter 4 walked the TLM architecture — the layered testbench, the constraint programming, the coverage — on its own terms. Chapter 5 walked the UVM framework, its design patterns, its base classes, its phasing, its factory, its sequencer arbitration, its register abstraction layer, its case study (UBUS). The two chapters together honour what the tradition got right while preparing the ground for what it got wrong.

Chapter 6: the substrate. What the tradition got wrong: it imposed shared-state OOP on a system whose underlying shape is independent components passing messages. The bottleneck was the wrong model of computation, not insufficient engineering on the right one. Chapter 6 named the right one: actors with typed messages on declared wires. The chapter built the implementation — actor_pkg/, its supervision layer, its routing layer, its observability layer — and rewrote the UBUS in the actor style to show the empirical line-count and clarity improvement. The substrate at the end of Chapter 6 is what every subsequent claim rests on.

Chapter 7: the third leg. The chapter you have just finished. The substrate that Chapter 6 built is what makes the platform swap from simulator to emulator operationally cheap. The verification investment that was made in the simulation flow rides on the emulator without rewrite, because the framework only knows about typed messages and `WIRE edges. The three legs of EV (formal, simulation, emulation) finally meet on one substrate, concurrent in every sprint, aggregating into one dashboard.

The principle

What the book has argued is not a tool, not a framework, not a methodology in the marketing sense. It is the observation that the right model of computation for hardware — typed messages on declared wires, with backpressure and supervision — generalizes into a substrate that absorbs every leg of the verification pipeline and every substrate the design will run on, from architecture exploration to silicon. The actor framework is one realization of that substrate. There will be others. What survives is the principle:

Pick the model of computation once. Pick it to match the shape of the silicon. The verification investment compounds rather than restarts.

This principle is the inheritance the book leaves. The corollaries fall out:

• The simulation testbench is the emulation testbench. Not because emulation got cheaper or simulators got faster, but because the testbench is built on a substrate that does not know which kind of substrate underlies its DUT.
• The formal proof and the simulation regression aggregate into one coverage model. Both publish typed bins onto the same actor bus; the bin marked closed by a formal proof is the bin the random regression no longer needs to chase.
• The architecture model and the synthesized RTL share an artifact. The architecture-exploration actor’s act() body is the same shape as the synthesized actor’s RTL; the substitution is mechanical, not a rewrite.
• The pre-RTL power model and the workload-scale power emulation share an actor. The PowerEstimateActor’s interface does not change; the substrate underneath swaps from a behavioral energy model to an emulator’s per-cycle activity feed.
• The cross-team hand-off becomes a same-team substrate swap. The simulation team, the emulation team, the post-silicon team — all use the same testbench, the same coverage, the same RAL, the same dashboards.

Each of these corollaries was independently engineered by someone, somewhere, at high cost, before the substrate existed. The substrate makes them properties of the substrate rather than per-program engineering efforts.

The historical analogue

The argument has a historical analogue worth naming. Operating systems before Unix were per-machine artifacts: every new computer brought a new operating system, written from scratch, with its own utilities, its own command languages, its own file systems. Unix’s contribution was not a new operating-system feature; it was a substrate (the file abstraction, pipes, the C language) that made operating systems portable across machines. The cost of the substrate was discipline up front (write everything in C, use pipes for composition, treat everything as a file); the pay-off was that the next machine got an operating system in months instead of years.

The actor framework’s contribution to hardware verification has the same shape. The cost is discipline up front (declared wires, typed messages, supervised lifecycle); the pay-off is that the next substrate — the next emulator generation, the next FPGA prototype, the next vendor-hosted pool — gets a verification environment in days instead of months. The substrate is not glamorous; the discipline is not new; the pay-off is large because it compounds.

The close

Emulation is the third leg of EV not because emulation hardware is new — it has existed for forty years — but because the substrate has finally caught up to what emulation hardware can do. The hand-off model treated emulation as a downstream consumer of frozen RTL because the testbench could not survive the substrate swap. The actor framework’s typed-message bus makes the swap transparent. Emulation becomes operationally cheap, becomes a first-class concurrent lane in the Sprint Spiral, becomes the place where simulation-time and silicon-time finally meet.

The book ends here. The thirteen appendices that follow (A–M) are reference material for the practitioner who picks up the substrate and starts building — the UBUS rewrite (A), the Mini-SoC integration (B), the OpenTitan case study (C), the spec-to-silicon methodology continuum (D), the synthesizable form (E), constrained-random stimulus synthesis (F), the FPGA-emulation conversion reference (G), the AI-driven RTL pipeline (H), the actor-based hardware simulator (I), the SystemC port (J), the pure-C++ actors (K), distributed regression (L), and the GPU / AI cross-domain claim (M). Each appendix demonstrates the substrate at one more boundary.

The argument is complete. There is a path from spec to silicon that does not break, does not transfer between teams, does not rebuild the verification investment at each boundary. The path runs on actors — typed messages on declared wires, with bounded mailboxes and supervised lifecycles. The first leg is formal. The second leg is simulation. The third leg, and the place where the loop closes, is emulation. Pick the model once; pick it to match the silicon; the rest compounds.

The book is finished. The work is just beginning.

Host-bound fraction \(f\)	System rate \(R_{\text{system}}\)	Loss vs. \(R_{\text{emul}}\)
\(0.001\) (0.1 %)	\(\sim 3.3\,\text{MHz}\)	\(1.5\times\)
\(0.01\) (1 %)	\(\sim 835\,\text{kHz}\)	\(6\times\)
\(0.1\) (10 %)	\(\sim 98\,\text{kHz}\)	\(50\times\)
\(0.5\) (50 %)	\(\sim 20\,\text{kHz}\)	\(250\times\)
\(1.0\) (100 %, UVM-style)	\(10\,\text{kHz}\)	\(500\times\)
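The rows of this table can be reproduced with a short calculation. The sketch assumes the usual Amdahl form, in which a fraction \(f\) of the work runs at the host rate and the remainder at the emulator rate, and it assumes \(R_{\text{emul}} = 5\,\text{MHz}\) and a host rate of \(10\,\text{kHz}\). Both values are inferred here from the table's last row (10 kHz at \(f = 1\), a 500-fold loss) rather than quoted from it, and the function name is hypothetical.

```python
def system_rate(f, r_emul_hz=5e6, r_host_hz=10e3):
    """Amdahl-style effective rate when a fraction f of cycles is host-bound.

    Assumed form: R_system = 1 / (f / R_host + (1 - f) / R_emul).
    The default rates are inferred from the table, not stated in it.
    """
    return 1.0 / (f / r_host_hz + (1.0 - f) / r_emul_hz)


R_EMUL = 5e6
rows = []
for f in (0.001, 0.01, 0.1, 0.5, 1.0):
    rate = system_rate(f)
    rows.append((f, rate, R_EMUL / rate))
    print(f"f={f:<6} rate={rate / 1e3:10.1f} kHz  loss={R_EMUL / rate:6.1f}x")
```

Under these assumptions the computed rates land on the table's rounded figures, and the steep drop between \(f = 0.001\) and \(f = 0.01\) shows why even a small host-bound fraction dominates the system rate.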