Emergent Hardware Verification

Appendix I Actor-Based Hardware Simulation: a Working Prototype

Chapters 1–6 made the structural argument that hardware is naturally an actor system: each RTL module is an encapsulated unit of state, each wire is a typed channel between modules, each clock edge is a broadcast event, each flip-flop is a two-phase state updater. Appendix M, §M.6, observed that modern interconnects and accelerators have already converged on the same model in silicon. What the book has been hinting at throughout is the deeper consequence: event-driven simulators implement the actor model indirectly, through a centralized event queue with explicit scheduling regions, rather than directly as a topology of communicating actors. That indirect form is the classic Verilog-XL model, which VCS still runs and which Verilator compiles away for the synthesizable subset (see Section I.1). A simulator written natively against the actor model should be smaller, easier to parallelize, and naturally portable across CPU, FPGA, and GPU substrates.

This appendix demonstrates the claim with a working prototype. The simulator lives at appI_actor_sim/ and ships as part of the framework. It is a header-only C++17 library — 568 lines for the simulator core, 165 lines for VCD output — with a Makefile-driven test suite that exercises seven canonical RTL patterns (mux, D flip-flop, counter, shift register, ALU, Moore-style round-robin arbiter FSM, FIFO with full/empty flags). All 84 self-checks pass. The FIFO test additionally writes a standard VCD waveform that opens in GTKWave for cycle-level debug.

The prototype is not yet a Verilator replacement. It does not parse SystemVerilog; users write their designs as C++ actor classes that map mechanically to the SV they would have written. The point of the prototype is to validate that the actor model captures the semantics of real RTL: two-phase NBA updates, combinational propagation to fixpoint, multi-stage flip-flop chains, FSMs with combinational decoders, dual-output combinational blocks, and stateful FIFOs with synchronous control flags. All seven tests pass. The path from here to an SV front-end is engineering, not research; the model already works.

I.1 The Thesis

A SystemVerilog simulator has to solve three problems:

  • Modeling. Translate the SV source into an executable structure.

  • Scheduling. Decide which processes run when, ensuring SV’s semantics for concurrent assignments, NBA updates, \(\Delta \)-cycles, and event regions.

  • Execution. Run the processes efficiently.

Verilator and VCS solve all three using the event-queue paradigm inherited from Verilog-XL in the 1980s. Every always block, assign, initial, and fork branch becomes a process; processes are queued in a global priority queue keyed by simulation time and IEEE 1800 region; the simulator’s main loop pops the next process, runs it, and propagates effects until the queue is empty for the current time, then advances time. Verilator further compiles cycle-stable RTL into flat C++ that walks all combinational logic once per clock edge, eliminating per-event scheduling for the synthesizable subset.

Three structural facts limit this approach:

  • The event queue is a serialization point. All processes scheduled at the same simulated time must be processed in the order specified by IEEE 1800’s scheduling regions. Multi-threading this is exactly as hard as multi-threading a shared-memory garbage collector; every attempt has yielded modest speedups (Cadence Xcelium parallel, Synopsys VCS multicore: 1.5–3\(\times \) on 16+ cores).

  • The model is fundamentally event-driven, but most real RTL is cycle-driven. Verilator wins by collapsing event-driven semantics down to cycle-driven where it can. But the framework still has to support all SV semantics, including features almost no one uses.

  • There is no native path to FPGA or GPU. Verilator’s output is C++ that targets CPUs only. Emulator vendors require recompilation into an entirely different mapping; GPU simulation is a research project.

The actor model offers a different decomposition. Each module is an actor with private state. Each wire is a typed signal channel. Each clock edge is a broadcast message. Each flip-flop captures its D input on the message and publishes its Q output on the next propagation phase, exactly reproducing SV’s NBA region semantics. There is no centralized event queue. There is no global scheduling state. The model is a graph of actors connected by typed channels — the same model the silicon implements, and the same model the actor framework already provides for verification testbenches in Chapter 6. The simulator is just one more embodiment of the actor model.

I.2 Architecture

The framework header appI_actor_sim/actor_sim.h provides four core types, each a thin wrapper around the actor concepts already developed in Chapter 6:

  • Bits — a width-typed bitvector for \(W \in [1, 1024]\). Internally a std::array; for \(W \le 64\) the array collapses to a single word with no overhead. Bitwise operators, modular add/sub, bit indexing, and hex/binary string output. The compile-time width gives type-checking (\(\texttt {Bits<8>}\) cannot accidentally be assigned to a \(\texttt {Bits<4>}\)) and lets the compiler unroll narrow operations into single-word instructions.

  • Signal — a typed channel. Each Signal has at most one driver and any number of readers (the hardware single-driver rule, enforced at the source level rather than the synthesis level). write() queues a pending change and enqueues the signal onto the simulator’s dirty list; the simulator’s propagate loop then calls commit() (promotes pending to current) and enqueue_readers() (pushes each subscribed module onto the notify queue, deduplicating) for every dirty signal. This two-phase write — queue-then-propagate — is what makes flip-flop NBA semantics work correctly: a clock-edge loop calls write() on every \(Q\) output before any reader sees a new \(Q\).

  • Module — the actor base class with three virtual hooks: on_input_change() for combinational reactions to subscribed signals, on_clock_edge() for sequential updates, on_reset() for reset behavior. The simulator drives each of these phase-by-phase.

  • Sim — the scheduler. Owns the modules and signals; runs tick(), run(N), and reset(). tick() runs three phases: settle external stimulus, broadcast clock edge to all modules, propagate \(Q\) changes through combinational logic to fixpoint. The propagation loop runs to quiescence with a cap (MAX_PROPAGATION_ITERS) that catches non-converging combinational loops, which would be a design bug in real RTL.
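The queue-then-propagate behavior of Signal and the capped fixpoint loop of Sim can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the code in actor_sim.h: MiniSignal, MiniSim, kMaxPropagationIters, and the two demo functions are hypothetical names, values are plain 64-bit words rather than Bits, and readers are bare callbacks with no deduplication.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <vector>

struct MiniSim;  // hypothetical stand-in for Sim

// Hypothetical stand-in for Signal: one pending value, one visible value.
struct MiniSignal {
    uint64_t current = 0;                        // what read() returns
    uint64_t pending = 0;                        // queued by write()
    bool dirty = false;                          // already on the dirty list?
    std::vector<std::function<void()>> readers;  // subscribed reactions

    uint64_t read() const { return current; }
    void write(MiniSim& sim, uint64_t v);
};

struct MiniSim {
    static constexpr int kMaxPropagationIters = 200;
    std::vector<MiniSignal*> dirty_list;

    // Commit every dirty signal, then notify readers; repeat until quiescent.
    // Returns the number of passes taken; throws if the cap is exceeded.
    int propagate() {
        int iters = 0;
        while (!dirty_list.empty()) {
            if (++iters > kMaxPropagationIters)
                throw std::runtime_error("combinational loop: no fixpoint");
            std::vector<MiniSignal*> batch;
            batch.swap(dirty_list);
            std::vector<MiniSignal*> changed;
            for (MiniSignal* s : batch) {                 // commit
                s->dirty = false;
                if (s->pending != s->current) {
                    s->current = s->pending;
                    changed.push_back(s);
                }
            }
            for (MiniSignal* s : changed)                 // notify
                for (auto& reader : s->readers) reader();
        }
        return iters;
    }
};

inline void MiniSignal::write(MiniSim& sim, uint64_t v) {
    pending = v;
    if (!dirty) { dirty = true; sim.dirty_list.push_back(this); }
}

// Usage: b = a + 1 and c = b * 2; returns c after a is driven to 3.
inline uint64_t demo_chain() {
    MiniSim sim; MiniSignal a, b, c;
    a.readers.push_back([&] { b.write(sim, a.read() + 1); });
    b.readers.push_back([&] { c.write(sim, b.read() * 2); });
    a.write(sim, 3);
    assert(a.read() == 0);  // the write stays pending until propagate() commits it
    sim.propagate();
    return c.read();
}

// An inverter feeding itself never settles, so propagate() hits the cap.
inline bool demo_loop_detected() {
    MiniSim sim; MiniSignal x;
    x.readers.push_back([&] { x.write(sim, x.read() ^ 1u); });
    x.write(sim, 1);
    try { sim.propagate(); } catch (const std::runtime_error&) { return true; }
    return false;
}
```

In this sketch demo_chain() settles in three passes and returns 8, while demo_loop_detected() returns true because the self-inverting signal exhausts the iteration cap.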

The standard library primitives map directly onto these:

  • DFF: a D flip-flop with synchronous reset. On the clock edge it captures d_->read() into local state and writes it to q_; the new \(Q\) value stays pending until the propagation phase commits it, so every other flip-flop in the same clock-edge loop reads the pre-edge \(Q\), which is exactly the NBA invariant.

  • CombLogic: a wrapper that takes a std::function. The user constructs it via Sim::comb(name, lambda, {sig1, sig2, ...}); the simulator subscribes the lambda’s module to each signal in the sensitivity list, runs the lambda once initially, and runs it again whenever any subscribed signal changes.

  • Custom modules subclass Module directly. The FIFO test demonstrates: a FifoActor with private mem_, rptr_, wptr_, and count_ fields, with all updates driven from a single on_clock_edge() handler. The framework’s encapsulation means the FIFO’s internal state stays private to its actor instance, exactly as an SV module’s local logic stays inside the module.
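A custom module in the style of the FIFO described above might look like the following sketch. It is not the book’s FifoActor: MiniFifo and demo_fifo are hypothetical, the class does not derive from the real Module base, and the policy for a push that arrives while the FIFO is full (ignored, judged against pre-edge flags) is an assumption made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical sketch of a FIFO actor: 8-bit data, depth 4.
class MiniFifo {
public:
    static constexpr unsigned kDepth = 4;

    // One clock edge. Push and pop are both judged against pre-edge flags,
    // so a simultaneous push+pop on a partly filled FIFO leaves count_ unchanged.
    void on_clock_edge(bool push, uint8_t din, bool pop) {
        const bool do_push = push && !full();
        const bool do_pop  = pop && !empty();
        if (do_pop) {
            dout_ = mem_[rptr_];
            rptr_ = (rptr_ + 1) % kDepth;
        }
        if (do_push) {
            mem_[wptr_] = din;
            wptr_ = (wptr_ + 1) % kDepth;
        }
        count_ = count_ + (do_push ? 1u : 0u) - (do_pop ? 1u : 0u);
        assert(count_ <= kDepth);
    }

    void on_reset() { rptr_ = wptr_ = count_ = 0; dout_ = 0; }

    bool full()  const { return count_ == kDepth; }
    bool empty() const { return count_ == 0; }
    uint8_t dout() const { return dout_; }
    unsigned count() const { return count_; }

private:
    std::array<uint8_t, kDepth> mem_{};        // storage, private to the actor
    unsigned rptr_ = 0, wptr_ = 0, count_ = 0; // read/write pointers and fill level
    uint8_t dout_ = 0;                         // last value popped
};

// Usage: fill, overflow attempt, pop, then simultaneous push+pop.
inline unsigned demo_fifo() {
    MiniFifo f;
    for (uint8_t v = 1; v <= 4; ++v) f.on_clock_edge(true, v, false);  // holds 1,2,3,4
    assert(f.full());
    f.on_clock_edge(true, 9, false);    // push while full is ignored
    f.on_clock_edge(false, 0, true);    // pops 1
    assert(f.dout() == 1);
    f.on_clock_edge(true, 5, true);     // pops 2 and pushes 5 in the same cycle
    return f.dout() * 10u + f.count();  // encodes dout=2, count=3
}
```

All state lives in private members and changes only inside on_clock_edge(), which is the encapsulation property the bullet above describes.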

[Diagram omitted: TikZ figure not reproduced in this text.]

Figure I.1: Actor-native simulator architecture (appI_actor_sim/actor_sim.h). Sim owns the module set and the signal graph; each Module subscribes to the Signals on its sensitivity list. There is no centralized event queue: tick() runs three phases that reproduce SV’s NBA semantics, and the Sim scheduler advances all modules in lock-step rather than draining a priority queue.

I.3 Two-Phase NBA Semantics

The single make-or-break implementation detail is the two-phase clock-edge update. SystemVerilog’s NBA region specifies that on a clock edge, every flip-flop must sample its \(D\) input before any flip-flop’s \(Q\) output updates. A four-stage shift register depends on this: if \(Q\) updates were visible immediately, all four flops would capture the same value in one cycle and the chain would collapse.

The actor model reproduces this exactly with the two-phase update:

// Sim::tick() pseudocode
void tick() {
    propagate_();              // Phase 0: stimulus settles
    for (auto& m : modules_) { // Phase 1: clock-edge broadcast
        m->on_clock_edge();    //   each DFF reads pre-edge D,
    }                          //   writes pending Q (not visible yet)
    propagate_();              // Phase 2: Q values commit, propagate
    ++cycle_;
}

During Phase 1, every flip-flop calls q_->write(state_), which queues a pending change. Other flip-flops in the same loop call d_->read() which returns the current (pre-edge) value, not the queued one. After Phase 1 finishes, Phase 2 commits all pending \(Q\) values atomically and propagates the resulting changes through combinational logic. The propagation runs to fixpoint with a maximum iteration cap that catches combinational loops.

This is the entire scheduling mechanism. There is no priority queue. There is no IEEE 1800 region enumeration. The two phases plus the propagate-to-fixpoint loop are sufficient to reproduce the SV semantics that real RTL designs depend on. The test suite confirms this: the shift register test would fail catastrophically if the two phases were merged, but it passes 12/12 because the model gets the semantics right.
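The effect of merging the two phases can be shown with a stripped-down model of the four-stage shift register. This is a sketch under simplifying assumptions: ShiftReg4, cycles_until_output, and check_shift_semantics are hypothetical names, and the stages are plain integers rather than DFF actors connected by Signals.

```cpp
#include <array>
#include <cassert>

// Hypothetical 4-stage shift register: q[0] is nearest din, q[3] is the output.
struct ShiftReg4 {
    std::array<int, 4> q{};  // committed (visible) Q values

    // Two-phase: every stage samples pre-edge values, then all Qs commit together.
    void tick_two_phase(int din) {
        std::array<int, 4> pending{};
        pending[0] = din;
        for (int i = 1; i < 4; ++i) pending[i] = q[i - 1];  // Phase 1: sample
        q = pending;                                        // Phase 2: commit
    }

    // Merged (incorrect): each Q is visible immediately, so later stages in
    // the same loop read post-edge values and the chain collapses.
    void tick_merged(int din) {
        q[0] = din;
        for (int i = 1; i < 4; ++i) q[i] = q[i - 1];
    }
};

// Injects a single 1 on the first cycle and returns the 1-based cycle at
// which it reaches q[3], or -1 if it never arrives within 8 cycles.
inline int cycles_until_output(bool two_phase) {
    ShiftReg4 sr;
    for (int cycle = 1; cycle <= 8; ++cycle) {
        const int din = (cycle == 1) ? 1 : 0;
        if (two_phase) sr.tick_two_phase(din); else sr.tick_merged(din);
        if (sr.q[3] == 1) return cycle;
    }
    return -1;
}

// Usage: the two-phase model needs four edges; the merged model needs one.
inline void check_shift_semantics() {
    assert(cycles_until_output(true) == 4);
    assert(cycles_until_output(false) == 1);
}
```

With the two-phase update the injected 1 reaches the last stage on the fourth edge; with the merged update it arrives on the first edge, which is the collapse that a shift-register test is designed to catch.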

I.4 The Prototype Test Suite

Seven self-checking tests demonstrate the simulator across the canonical RTL patterns:

  • test_mux2to1 — 2:1 multiplexer. Pure combinational, three input signals, lambda-driven output. Validates that combinational logic re-evaluates correctly on any input change. 7/7 checks.

  • test_dff — single D flip-flop with synchronous reset. Validates basic sequential element, reset behavior, and the requirement that \(Q\) doesn’t change without a clock edge. 6/6 checks.

  • test_counter — 4-bit synchronous counter with enable. Validates the \(Q \to \text {combo} \to D \to Q\) feedback loop and the wrap-around at 16 cycles. 7/7 checks.

  • test_shiftreg — 4-stage shift register. Validates two-phase NBA semantics: a 1 injected at din appears at q3 exactly 4 cycles later. 12/12 checks.

  • test_fsm_arbiter — 3-client round-robin arbiter FSM (the one from Chapter 2). One DFF holding state, one CombLogic computing the round-robin transition, one CombLogic decoding the grant. Validates Moore-style FSM with combinational decoder. 10/10 checks.

  • test_alu — 4-bit ALU with four operations and a zero flag. Validates two CombLogic blocks chained: ALU result feeds the zero flag, which propagates correctly when either operand changes. 22/22 checks.

  • test_fifo — 8-bit FIFO, depth 4, with full and empty flags, push/pop control, and simultaneous push+pop. Validates a single-module stateful design with internal arrays and pointers. Additionally emits test_fifo.vcd for waveform inspection. 20/20 checks.

$ make test
================================================================
Running actor-sim prototype tests
================================================================


--- test_mux2to1 ---
[mux2to1] 7/7 checks passed    PASS


--- test_dff ---
[dff] 6/6 checks passed    PASS


--- test_counter ---
[counter4] 7/7 checks passed      PASS


--- test_shiftreg ---
[shiftreg] 12/12 checks passed      PASS


--- test_fsm_arbiter ---
[fsm_arbiter] 10/10 checks passed        PASS


--- test_alu ---
[alu] 22/22 checks passed    PASS


--- test_fifo ---
  VCD waveform written to test_fifo.vcd
[fifo4] 20/20 checks passed    PASS


================================================================
ALL TESTS PASSED    (84/84 checks)

Each test file’s top comment shows the SystemVerilog equivalent of the design under test. The C++ actor code is a mechanical transliteration: always_ff becomes DFF, always_comb becomes a Sim::comb lambda, module bodies with internal state become custom Module subclasses. The mapping is direct enough that an SV front-end would be a straightforward parser-plus-code-generator project.
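The shape of such a transliteration, together with a self-checking harness, can be sketched as follows. The SV fragment in the comment is illustrative, and MiniHarness, Counter4, and run_counter_test are hypothetical stand-ins: the real tests use DFF, Sim::comb, and TestHarness, whose exact signatures are not reproduced here, and the check count of this sketch differs from the 7 checks in test_counter.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// SystemVerilog being transliterated (illustrative):
//   always_comb d = en ? q + 4'd1 : q;
//   always_ff @(posedge clk) if (rst) q <= '0; else q <= d;

// Hypothetical stand-in for TestHarness: counts checks, reports mismatches.
struct MiniHarness {
    const char* name;
    int passed = 0, total = 0;

    void expect_eq(unsigned actual, unsigned expected, int cycle) {
        ++total;
        if (actual == expected) { ++passed; return; }
        std::printf("[%s] cycle %d: expected %u, got %u\n",
                    name, cycle, expected, actual);
    }

    // Returns a process exit code: 0 only when every check passed.
    int finish() const {
        assert(total > 0);  // a test with no checks is a harness bug
        std::printf("[%s] %d/%d checks passed\n", name, passed, total);
        return passed == total ? 0 : 1;
    }
};

struct Counter4 {
    uint8_t q = 0;                                   // flip-flop state

    uint8_t comb(bool en) const {                    // plays the always_comb role
        return en ? static_cast<uint8_t>((q + 1) & 0xF) : q;
    }
    void on_clock_edge(bool rst, bool en) {          // plays the always_ff role
        if (rst) q = 0; else q = comb(en);
    }
};

inline int run_counter_test() {
    MiniHarness h{"counter4"};
    Counter4 c;
    c.on_clock_edge(true, false);                    // synchronous reset
    h.expect_eq(c.q, 0, 0);
    for (int cycle = 1; cycle <= 16; ++cycle) {
        c.on_clock_edge(false, true);
        h.expect_eq(c.q, static_cast<unsigned>(cycle % 16), cycle);  // wraps at 16
    }
    c.on_clock_edge(false, false);                   // enable low: value holds
    h.expect_eq(c.q, 0, 17);
    return h.finish();
}
```

Returning the harness result from main() is what lets a Makefile or CI job treat any failed check as a failed build step.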

I.5 Production-Grade Capabilities

The prototype is small enough to read in an afternoon, but the framework already implements the capabilities a hardware engineer expects from a simulator:

  • Wide bitvectors. Bits supports widths up to 1024 bits, with the compile-time width attached to every signal and module. Bitwise operations are word-parallel; modular add/sub propagates carry across words. The narrow-bit fast path is single-word and emits the same machine code as plain uint64_t arithmetic.

  • VCD waveform output. vcd_writer.h produces standard IEEE 1364 Value Change Dump traces. The FIFO test demonstrates: register each signal under a hierarchical scope, write the header, dump value changes at each cycle. The resulting .vcd opens directly in GTKWave, ModelSim, or Verdi. This matters because debug is the long pole of any new simulator; supporting a standard waveform format means the existing tool ecosystem is reusable.

  • Self-checking tests. TestHarness provides expect_eq with file/cycle context. Failing tests print expected vs. actual values; the harness returns a non-zero exit code so CI pipelines catch regressions.

  • Determinism. The scheduler is purely single-threaded with deterministic order over modules and signals. Two runs with the same stimulus produce bitwise-identical VCDs — the gold standard for regression testing.

  • Combinational loop detection. The propagation phase caps at MAX_PROPAGATION_ITERS (200) and throws if quiescence is not reached. Real RTL with unintentional combinational feedback would otherwise hang the simulator; this catches the bug at run-time with a clean error.

These are the table-stakes features. The framework already has them. What remains to bring it to production parity with Verilator are the items in Section I.7.
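To make the waveform capability concrete, the sketch below emits a small Value Change Dump by hand. It is an assumption-laden illustration, not the interface of vcd_writer.h: MiniVcd and demo_vcd are hypothetical, only a single scope is supported, widths are limited to 64 bits, and the 1 ns timescale is chosen arbitrarily.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal VCD emitter (single scope, widths up to 64 bits).
class MiniVcd {
    struct Var {
        std::string name;
        int width;
        uint64_t value = 0;   // value staged for the next dump
        uint64_t last = 0;    // value most recently written to the trace
        bool dumped = false;  // forces an initial value on the first dump
    };
    std::string scope_;
    std::vector<Var> vars_;
    std::ostringstream out_;

    // VCD identifier codes are printable ASCII characters starting at '!'.
    static char id(std::size_t i) { return static_cast<char>('!' + i); }

public:
    explicit MiniVcd(std::string scope) : scope_(std::move(scope)) {}

    // Registers a signal and returns its handle; call before header().
    std::size_t add(const std::string& name, int width) {
        assert(width >= 1 && width <= 64 && vars_.size() < 94);
        vars_.push_back(Var{name, width});
        return vars_.size() - 1;
    }

    void header() {
        out_ << "$timescale 1ns $end\n$scope module " << scope_ << " $end\n";
        for (std::size_t i = 0; i < vars_.size(); ++i) {
            out_ << "$var wire " << vars_[i].width << ' ' << id(i) << ' ' << vars_[i].name;
            if (vars_[i].width > 1) out_ << " [" << (vars_[i].width - 1) << ":0]";
            out_ << " $end\n";
        }
        out_ << "$upscope $end\n$enddefinitions $end\n";
    }

    void set(std::size_t handle, uint64_t v) { vars_[handle].value = v; }

    // Writes a timestamp, then only the signals whose value changed.
    void dump(uint64_t time) {
        out_ << '#' << time << '\n';
        for (std::size_t i = 0; i < vars_.size(); ++i) {
            Var& v = vars_[i];
            if (v.dumped && v.value == v.last) continue;
            if (v.width == 1) {
                out_ << (v.value & 1u) << id(i) << '\n';          // scalar: <bit><id>
            } else {
                out_ << 'b';                                      // vector: b<bits> <id>
                for (int bit = v.width - 1; bit >= 0; --bit) out_ << ((v.value >> bit) & 1u);
                out_ << ' ' << id(i) << '\n';
            }
            v.last = v.value;
            v.dumped = true;
        }
    }

    std::string str() const { return out_.str(); }
};

// Usage: two signals, two timestamps; only "full" changes at time 1.
inline std::string demo_vcd() {
    MiniVcd vcd("fifo");
    const std::size_t din = vcd.add("din", 8);
    const std::size_t full = vcd.add("full", 1);
    vcd.header();
    vcd.set(din, 0xA5); vcd.set(full, 0); vcd.dump(0);
    vcd.set(full, 1);                     vcd.dump(1);
    return vcd.str();
}
```

Because only changed values are written after the first timestamp, the trace stays compact on designs where most signals are idle in a given cycle.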

I.6 Performance: First Numbers

A 568-line prototype against an industrial simulator is not a fair fight, but the measurement is instructive. The benchmark directory appI_actor_sim/bench/ ships two SV designs paired with matched C++ harnesses for Verilator and actor_sim:

  • lfsr16 — a 16-bit linear feedback shift register: one DFF, one combinational XOR. Tiny: two-actor topology, one clocked element, one fan-in.

  • counter_array — 64 parallel 16-bit counters with a 22-bit sum reducer: 64 DFFs, 64 increment-combinational blocks, one fan-in reducer. Mid-size: structurally dense, with a single observer that fans in from every counter.

Each design is run for \(N\) cycles with no I/O; the harness reports wall-clock cycles per second on the same machine for both simulators.
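A cycles-per-second measurement of this kind can be sketched as below. The sketch makes several assumptions: lfsr16_step, BenchResult, and run_lfsr_bench are hypothetical names, the design is evaluated as a plain function rather than through either simulator, and the tap positions (16, 14, 13, 11) are a commonly cited maximal-length choice picked for illustration, since the polynomial used by lfsr16.sv is not given in the text.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// One step of a 16-bit Fibonacci LFSR. Taps 16, 14, 13, 11 are an
// illustrative assumption; the bench design may use a different polynomial.
inline uint16_t lfsr16_step(uint16_t q) {
    const unsigned bit = ((q >> 0) ^ (q >> 2) ^ (q >> 3) ^ (q >> 5)) & 1u;
    return static_cast<uint16_t>((q >> 1) | (bit << 15));
}

struct BenchResult {
    uint16_t final_q;        // checked for agreement between simulators
    double mcycles_per_sec;  // wall-clock throughput in millions of cycles/s
};

// Runs the design for the requested number of cycles with no I/O and
// reports throughput from a monotonic clock.
inline BenchResult run_lfsr_bench(uint64_t cycles, uint16_t seed = 0xACE1) {
    assert(seed != 0);  // the all-zero state is a fixed point of the LFSR
    using clock = std::chrono::steady_clock;
    uint16_t q = seed;
    const auto t0 = clock::now();
    for (uint64_t i = 0; i < cycles; ++i) q = lfsr16_step(q);
    const auto t1 = clock::now();
    const double secs = std::chrono::duration<double>(t1 - t0).count();
    return {q, secs > 0 ? static_cast<double>(cycles) / secs / 1e6 : 0.0};
}
```

Reporting the final register value alongside the rate is what allows two harnesses to be compared for bit-exact agreement as well as speed.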

$ make -C appI_actor_sim/bench run


LFSR16 benchmark (10,000,000 cycles)
  verilator   10.01 Mcyc/s    final_q   = 0x2351
  actor_sim   21.41 Mcyc/s    final_q   = 0x2351      <- 2.1x faster


Counter array benchmark (64 counters, 1,000,000 cycles)
  verilator    7.62 Mcyc/s    final_sum = 2052064
  actor_sim    0.39 Mcyc/s    final_sum = 2052064     <- 20x slower

Both designs produce bit-exact final values across the two simulators — the model is correct on real RTL. The performance picture is the interesting part:

  • On the tiny design, the prototype already beats Verilator by 2.1\(\times \). Verilator’s eval() wrapper, command-arg processing, and signal-coercion path are fixed costs per cycle and dominate when the design has only two actors. actor_sim has no such fixed cost: it walks two modules and ends the cycle.

  • On the mid-size design, the prototype is 20\(\times \) slower. Verilator ahead-of-time-compiles the whole design into a single flat C++ function and walks it at memory speed; the actor prototype dispatches each of the 129 modules (64 DFFs, 64 increment blocks, one reducer) per cycle through a virtual call plus a std::function invocation, costing roughly 20 ns per module that the AOT-compiled inner loop never pays.
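The per-module figure can be checked against the benchmark output. The arithmetic below assumes that the whole cycle time of the counter_array run is attributed to the 129 module dispatches, which makes it an upper bound on dispatch cost, since signal commits and loop bookkeeping are included in the same time.

```latex
\[
t_{\text{cycle}}^{\text{actor}}
  = \frac{1}{0.39 \times 10^{6}\ \text{s}^{-1}} \approx 2564\ \text{ns},
\qquad
\frac{t_{\text{cycle}}^{\text{actor}}}{129\ \text{modules}} \approx 19.9\ \text{ns per module}
\]
\[
t_{\text{cycle}}^{\text{Verilator}}
  = \frac{1}{7.62 \times 10^{6}\ \text{s}^{-1}} \approx 131\ \text{ns},
\qquad
\frac{t_{\text{cycle}}^{\text{actor}}}{t_{\text{cycle}}^{\text{Verilator}}}
  = \frac{7.62}{0.39} \approx 19.5
\]
```

The second line is the source of the rounded 20\(\times \) figure: Verilator spends about 1 ns per module-equivalent on this design, against roughly 20 ns for the prototype.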

The gap is not in the model. It is in dispatch overhead. The architectural property that matters, the absence of a centralized event queue, already pays off as soon as Verilator’s per-cycle fixed cost dominates, even with naïve dispatch and zero compile-time specialization. Closing the remaining gap on larger designs is a known engineering ladder, with estimated rather than measured gains at each rung: CRTP-templated module dispatch (inline the lambda body, drop std::function) at roughly 3\(\times \), work-stealing multi-threaded scheduling at 5–8\(\times \) on commodity cores, and ahead-of-time topology codegen to reach Verilator parity and beyond. None of these steps requires a new idea.
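The first rung of that ladder, replacing a virtual call plus std::function with CRTP dispatch, can be sketched as follows. All names here (DynModule, DynComb, StaticModule, Incrementer, demo_dispatch_equivalence) are hypothetical and do not come from actor_sim.h; the sketch shows only that the two dispatch styles compute the same result and makes no claim about the size of the speedup.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Baseline: a virtual call into a module that then invokes a std::function.
struct DynModule {
    virtual ~DynModule() = default;
    virtual void on_input_change() = 0;
};
struct DynComb : DynModule {
    std::function<void()> fn;
    explicit DynComb(std::function<void()> f) : fn(std::move(f)) {}
    void on_input_change() override { fn(); }
};

// CRTP: the base reaches the derived body through a static_cast, so the
// target is known at compile time and can be inlined.
template <class Derived>
struct StaticModule {
    void on_input_change() { static_cast<Derived*>(this)->eval(); }
};
struct Incrementer : StaticModule<Incrementer> {
    uint16_t d = 0, q = 0;
    void eval() { d = static_cast<uint16_t>(q + 1); }
};

// Runs the same 16-bit increment through both dispatch paths and reports
// whether the final register values agree.
inline bool demo_dispatch_equivalence(int cycles) {
    assert(cycles >= 0);
    uint16_t dyn_q = 0, dyn_d = 0;
    std::vector<std::unique_ptr<DynModule>> mods;
    mods.push_back(std::make_unique<DynComb>(
        [&] { dyn_d = static_cast<uint16_t>(dyn_q + 1); }));
    Incrementer inc;
    for (int i = 0; i < cycles; ++i) {
        for (auto& m : mods) m->on_input_change();  // virtual + std::function
        dyn_q = dyn_d;                              // clock edge
        inc.on_input_change();                      // direct, inlinable call
        inc.q = inc.d;                              // clock edge
    }
    return dyn_q == inc.q;
}
```

The design trade-off is that CRTP modules no longer share a runtime base class, so the scheduler has to know concrete types at compile time, which is why this rung pairs naturally with generated topology code.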

The deeper observation is that the numbers above measure the worst case for the actor model: a single CPU running a single-threaded simulator against another single CPU running a single-threaded simulator that has been hand-tuned for four decades. The same actor topology, once expressed, lowers to FPGA via the synthesizable-form rules of Appendix E, or to a GPU via per-actor CUDA kernels, with the clock-edge broadcast becoming a stream synchronization point. The event-queue removal is what makes those backends mechanical to build; it is also what gives the model a path to beat any event-queue simulator on a sufficiently parallel substrate. The CPU prototype is the evidence that the model is right. The FPGA and GPU backends, which Section I.7 scopes but this prototype does not yet include, are where the model is expected to win.

I.7 Path to Verilator Parity

The prototype validates the model. Closing the gap to a production simulator is engineering, not research, and the path divides cleanly into four projects:

  • SystemVerilog front-end. The biggest piece: parse SV source into an AST, lower to actor topology. The synthesizable form is bounded and well-understood (Appendix E formalizes the synthesizable form of the actor framework and synthesizes a worked example through Yosys to a gate-level netlist). The mapping rules are the ones the test files already demonstrate: module \(\to \) Module subclass, always_ff \(\to \) DFF, always_comb \(\to \) Sim::comb lambda, logic [W-1:0] \(\to \) Signal. Tooling: tree-sitter for parsing, an IR similar to Verilator’s tree, a C++ code emitter. Project size: roughly the same as Verilator’s front-end (50K–100K lines); years not weeks, but bounded.

  • Multi-threaded scheduler. The model is already concurrent; the prototype runs single-threaded for testability. Multi-threading means giving each actor its own mailbox (the framework’s existing actor::cpp::Actor infrastructure) and using a barrier-based clock-edge broadcast across worker threads. The propagate-to-fixpoint loop becomes a quiescence detection across the actor pool. With work-stealing across cores, the speedup should scale near-linearly on large designs, beating Verilator’s single-threaded inner loop.

  • Four-state logic. Real SV uses \(\{0, 1, X, Z\}\); the prototype uses two-state. The lift is per-signal: a second parallel Bits carrying the X-mask, with combinational operators that propagate X correctly. Pure functional change; no scheduler impact. Optional flag at compile time so two-state runs remain fast.

  • FPGA and GPU backends. The same actor topology that runs on the CPU simulator can be lowered to RTL via Appendix E’s synthesizable-form rules, then synthesized to FPGA. For GPU, each actor’s body compiles to a CUDA kernel; mailboxes live in GPU memory; the clock-edge broadcast is a CUDA stream synchronization point. The cross-backend transport bridge (the TransportBridgeActor base in actor_distributed_pkg, deployed as the FPGA-proxy pattern of Appendix G) handles the boundary — CPU testbench actor talking to FPGA-resident DUT actor across PCIe, or GPU ML-stimulus actor feeding into a CPU scoreboard. Same source, different deployment targets, same observability through trace IDs and structured logs.
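The four-state lift described in the list above can be sketched for single-word signals. This is an illustration under assumptions, not framework code: Logic4, and4, or4, not4, format4, and demo_four_state are hypothetical names, only X is tracked through the mask (Z is not modeled separately), and widths are limited to 64 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical four-state value: bit i is X when xmask bit i is set,
// otherwise it is the known 0/1 stored in val.
template <unsigned W>
struct Logic4 {
    static_assert(W >= 1 && W <= 64, "single-word sketch only");
    static constexpr uint64_t kMask = ~0ull >> (64 - W);
    uint64_t val = 0;
    uint64_t xmask = 0;

    uint64_t known0() const { return ~val & ~xmask & kMask; }
    uint64_t known1() const { return  val & ~xmask & kMask; }
};

// AND: a known 0 on either side forces 0; 1 needs both sides known 1.
template <unsigned W>
Logic4<W> and4(Logic4<W> a, Logic4<W> b) {
    const uint64_t k0 = a.known0() | b.known0();
    const uint64_t k1 = a.known1() & b.known1();
    return {k1, ~(k0 | k1) & Logic4<W>::kMask};
}

// OR: a known 1 on either side forces 1; 0 needs both sides known 0.
template <unsigned W>
Logic4<W> or4(Logic4<W> a, Logic4<W> b) {
    const uint64_t k1 = a.known1() | b.known1();
    const uint64_t k0 = a.known0() & b.known0();
    return {k1, ~(k0 | k1) & Logic4<W>::kMask};
}

// NOT: known bits invert; X stays X.
template <unsigned W>
Logic4<W> not4(Logic4<W> a) {
    return {a.known0(), a.xmask & Logic4<W>::kMask};
}

// Renders most-significant bit first, using 0, 1, and X.
template <unsigned W>
std::string format4(Logic4<W> v) {
    std::string s;
    for (int i = static_cast<int>(W) - 1; i >= 0; --i)
        s += ((v.xmask >> i) & 1u) ? 'X' : (((v.val >> i) & 1u) ? '1' : '0');
    return s;
}

// Usage: a = 01XX, b = 1101.
inline std::string demo_four_state() {
    Logic4<4> a{0b0100, 0b0011};
    Logic4<4> b{0b1101, 0b0000};
    assert((a.val & a.xmask) == 0);  // X bits keep val at 0 by convention
    return format4(and4(a, b)) + " " + format4(or4(a, b)) + " " + format4(not4(a));
}
```

In this sketch demo_four_state() yields "010X 11X1 10XX": an X input is masked by a controlling 0 under AND and by a controlling 1 under OR, and otherwise propagates. Every operator is a pure function of its operands, consistent with the claim that the change has no scheduler impact.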

None of these projects is novel research. Each has been done in pieces by separate vendor tools (Verilator for CPU; Synopsys ZeBu and Cadence Palladium for hardware emulation, with ZeBu’s FPGA-array and Palladium’s custom-processor-array bracketing the architectural choices; Synopsys HAPS and Cadence Protium for FPGA prototyping; NVIDIA CUDA for GPU simulation). What the actor framework does is unify the underlying model so the projects compose — one design, one verification environment, multiple deployment targets, with the same observability stack across all of them.

I.8 Files

The prototype ships as part of appI_actor_sim/:

  • actor_sim.h — the simulator core. 568 lines. Bits, Signal, Module, DFF, CombLogic, Sim, TestHarness.

  • vcd_writer.h — VCD output. 165 lines.

  • test_mux2to1.cpp, test_dff.cpp, test_counter.cpp, test_shiftreg.cpp, test_alu.cpp, test_fsm_arbiter.cpp, test_fifo.cpp — seven self-checking tests, each with an SV equivalent at the top of the file.

  • Makefile — builds and runs all tests; integrated with the parent actor_pkg Makefile as the sim target.

  • bench/ — benchmark designs paired between Verilator and actor_sim. lfsr16.sv and counter_array.sv are the SV designs; *_verilator_tb.cpp and *_actor_sim.cpp are the matched harnesses; make run reports cycles per second for both simulators on the same machine and verifies bit-exact final values.

  • README.md — short orientation document for users opening the directory.

I.9 What This Appendix Has Established

The actor model is the natural source-level representation of hardware. Event-driven simulators implement it indirectly through a centralized event queue with explicit scheduling regions; an actor-native simulator implements it directly through a topology of communicating modules connected by typed signal channels. The prototype shipped with actor_pkg demonstrates that the model captures real RTL semantics — two-phase NBA updates, combinational propagation to fixpoint, multi-stage flip-flop chains, FSMs with combinational decoders, stateful FIFOs with synchronous control flags — on seven canonical designs that span the patterns most real SystemVerilog uses. The 84 self-checks all pass; the FIFO test’s VCD output opens directly in GTKWave, confirming compatibility with the existing waveform-debug ecosystem.

The prototype is not a Verilator replacement yet. It does not parse SV; the user writes their design as C++ actor code that maps mechanically to the SV they would have written. What the prototype establishes is that the model works, the implementation is small (a 568-line simulator core), and the first benchmark numbers already show the architectural advantage materializing: on designs small enough that Verilator’s per-cycle fixed cost dominates, the prototype is already faster, even before any of the dispatch-specialization or multi-threading work it can absorb.

This appendix leaves the prototype here, as a seed. The path forward divides cleanly. On the engineering side, three well-understood projects close the gap to Verilator on dense CPU designs: templated dispatch to remove the per-module std::function cost, work-stealing multi-threaded scheduling across the actor pool, and ahead-of-time topology codegen for designs where every cycle is critical. None of these requires new ideas. On the architectural side, the deeper win is the substrate: the same actor topology that runs as a CPU simulator lowers to FPGA via the synthesizable form, or to a GPU via per-actor kernels, with no change to the source. The simulator industry has spent forty years optimizing a centralized event queue. The actor framework’s contribution is to remove it — and once it is removed, the model can be synthesized to FPGA or GPU, and beat any event-queue simulator on a sufficiently parallel substrate. The CPU prototype is the proof of concept; the substrate portability is the larger payoff.