Emergent Hardware Verification

Appendix K Pure C++ Actors for General Concurrent Software

Most software is not hardware. Most concurrent and distributed systems built today — microservices, data pipelines, real-time event processors, ML inference services, regression infrastructure, dashboards, control planes, embedded firmware running on host CPUs, robotics middleware, network protocol stacks — have nothing to do with simulated time, RTL co-simulation, or HLS. The actor methodology developed in this book applies to all of them; for those use cases the SystemC port (Appendix J) is the wrong deployment, because its kernel, timing model, and HLS coupling are not needed.

This appendix is the pure-C++ deployment. The framework lives at actor_pkg_cpp/ and ships in two tiers, each targeting a different point in the simplicity-vs-scale trade-off:

• actor.h is the basic tier, one std::thread per actor: ~310 lines of header-only C++17. It is the easiest to read and debug, and it fits hundreds of long-lived actors comfortably (each actor is its own OS-scheduled thread of execution).
• coro_actor.h is the coroutine tier, C++20 stackless coroutines on an M:N green-thread scheduler: ~390 lines of header-only C++20. Each actor’s run() is a coroutine; co_await mbox.recv() suspends until a token arrives. Many thousands of actors share a small worker pool.

The coroutine tier uses the same execution model as Erlang/BEAM, Go’s goroutines, and Java Project Loom virtual threads.

Both tiers share the same methodology: typed messages, fixed-topology composition, encapsulated state, and no shared mutable data. They differ only in scheduling. A team assigns a tier to each actor based on its lifetime and concurrency needs: long-lived, heavy subsystems use the basic tier; lightweight, short-lived workers use the coroutine tier; and the two interoperate via the same typed message envelopes.

The claim of this appendix is broader than “the framework also works in pure C++.” It is that for the vast majority of concurrent software systems, actors composed with Data-Oriented Design (DOD) are the natural way to build them, and that the combination is closer to what every concurrent C++ system should reach for first than the alternatives currently dominant in the C++ ecosystem (CAF, async/await libraries, thread pools with mutexes, OOP class hierarchies with inheritance-based concurrency).

The other half of the claim: this is the C++ rendering of the actor. The pure-C++ tier is not only the right substrate for general concurrent software; it is one of the per-substrate renderings of the single authored actor the book’s hardware argument turns on (Appendix E). The same actor is rendered as a SystemVerilog class for in-simulator verification (Chapter 6), as synthesizable RTL for an FPGA or emulator (Appendix E), and as the C++ object of this appendix for the host side — and the three are the same authored definition, not three rewrites. Concretely, the verification actors of the substrate-swap example (appG_firesim_substrate_swap) — a stimulus generator, a scoreboard with a golden model and an expected-value FIFO, a coverage collector — are written in this tier and run unchanged against both a software device under test and the same device rendered as synthesizable RTL (Appendix J §J.6). So the pure-C++ actor is simultaneously the general-software substrate and the host-language meeting point of the hardware flow; §K.5 returns to it once the tiers are in hand.

K.1 The Coroutine Tier as the Default

The actor pattern matches green threads exactly. Each actor is conceptually an independent process: encapsulated state, communicates only through messages, runs forever until shut down. C++20 stackless coroutines give us this directly:

class PongActor : public CoroActor {
 public:
     Task run() override {
         while (alive()) {
             auto msg = co_await mbox_.recv();                 // suspend if mailbox empty
             auto p = std::dynamic_pointer_cast<TypedMsg<Ping>>(msg);  // envelope name TypedMsg is assumed
             if (!p) continue;                                 // ignore other types
             publish(Pong{p->payload.seq + 1});                // route to Pong subscribers
         }
         co_return;
     }
};

That co_await mbox_.recv() is the counterpart of SystemVerilog’s mbox.get(msg). The coroutine suspends when the mailbox is empty; the scheduler picks another ready actor; when a producer publishes, the runtime resumes this actor where it left off. There is no std::thread per actor. Each suspended actor’s state is a heap-allocated coroutine frame, typically 100–300 bytes, versus an OS thread’s stack, taken here as a conservative 8 KB (default stack reservations are typically far larger). At those sizes, 100 000 actors fit in ~25 MB of frames against ~800 MB of OS-thread stacks.
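The memory comparison is plain arithmetic and can be checked at compile time. The sketch below assumes a 250-byte frame (a mid-range value inside the 100–300 byte span quoted above), the 8 KB per-thread figure from the text, and decimal units (1 MB = 1 000 000 bytes); the constant and function names are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// Assumed inputs; the names are illustrative and the values come from the text.
constexpr std::uint64_t kActors           = 100'000;
constexpr std::uint64_t kFrameBytes       = 250;        // assumed mid-range coroutine frame
constexpr std::uint64_t kThreadStackBytes = 8'000;      // the 8 KB per-thread figure, decimal
constexpr std::uint64_t kBytesPerMB       = 1'000'000;  // decimal megabyte

// Total footprint in megabytes for a given per-actor cost.
constexpr std::uint64_t footprint_mb(std::uint64_t per_actor_bytes) {
    return kActors * per_actor_bytes / kBytesPerMB;
}

static_assert(footprint_mb(kFrameBytes) == 25, "coroutine frames: ~25 MB");
static_assert(footprint_mb(kThreadStackBytes) == 800, "OS-thread stacks: ~800 MB");
```

Changing kFrameBytes to 100 or 300 gives the 10 MB and 30 MB ends of the range, so the order-of-magnitude gap does not depend on the exact frame size.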

This matches Erlang/BEAM, Go goroutine, and Java virtual-thread semantics. The actor methodology is the same across all four; the C++20 coroutine implementation is what brings those ergonomics to native C++ code.

Worked example. actor_pkg_cpp/examples/coro_pipeline.cpp composes four coroutine actors — ingester, parser, aggregator, sink — each forever-looping on co_await of its inbound mailbox. Lines flow through the topology cooperatively; the framework’s scheduler runs whichever actor has work. Output (truncated):

[coro_pipeline] starting ingestion of 30 lines
[sink] ERROR count=5
[sink] WARN   count=5
[sink] DEBUG count=5
[sink] ERROR count=10
[coro_pipeline] parsed=30 sink_updates=4

Throughput. actor_pkg_cpp/examples/coro_pingpong.cpp runs the canonical actor benchmark — two actors bouncing a counter back and forth on an M:N scheduler:

[coro_pingpong] 1000000 rounds = 2000000 messages on 2 threads

The run reports about 1.4 M messages/sec for a single pair on the coroutine tier, measured on an idle Intel Core i7-8850H at 2.6 GHz (the binary prints the exact rate as a raw msg/s figure, and the number varies by host). Throughput is bounded by the scheduler’s coordination cost (a std::mutex plus condition_variable handoff per resume), not by the coroutine machinery itself. The figure therefore tracks single-thread clock speed and degrades under CPU contention (a loaded or slower host reads several times lower); it is not a property of the coroutine API. Production-grade frameworks (Erlang/BEAM at 10–30 M/s, Akka at 5–10 M/s) reach higher numbers by using lock-free dispatch and per-thread work-stealing queues instead of a mutex-and-condition-variable scheduler. The framework as shipped trades that complexity for clarity: the scheduler is 60 lines of readable code, the coordination is obvious, and the user-level API is what matters. For workloads that need higher throughput, the scheduler internals can be replaced without changing user code.

K.2 The Basic Tier as the Fallback

When an actor has long-running blocking work that cannot easily be expressed as a coroutine (e.g. calling into a third-party library that blocks an OS thread, or holding state that should live on a single core for cache locality), use the basic tier. Each actor gets its own std::thread; the framework’s run_loop() calls the private mbox.get() and dispatches each message to the actor’s act() override. The user-level API is otherwise identical.
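The thread-per-actor shape described above can be sketched as follows. This is an illustration under assumptions, not the contents of actor.h: BlockingMailbox, ThreadActor, LogLine, and ErrorCounter are hypothetical, and only the names run_loop(), act(), and get() echo the text.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <optional>
#include <thread>
#include <utility>

// Blocking mailbox: get() sleeps on a condition variable until a message
// arrives or the mailbox is closed.
template <typename T>
class BlockingMailbox {
 public:
    void put(T v) {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push_back(std::move(v));
        }
        cv_.notify_one();
    }
    void close() {
        {
            std::lock_guard<std::mutex> lk(m_);
            closed_ = true;
        }
        cv_.notify_all();
    }
    std::optional<T> get() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;  // closed and drained
        T v = std::move(q_.front());
        q_.pop_front();
        return v;
    }

 private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<T> q_;
    bool closed_ = false;
};

struct LogLine { int level; };  // assumed encoding: 0 = DEBUG ... 3 = ERROR

// Base class: owns one thread and one mailbox; subclasses supply act().
class ThreadActor {
 public:
    virtual ~ThreadActor() { assert(!thread_.joinable()); }  // call stop() first
    void start() { thread_ = std::thread([this] { run_loop(); }); }
    void send(LogLine m) { mbox_.put(m); }
    void stop() {
        mbox_.close();
        if (thread_.joinable()) thread_.join();
    }

 protected:
    virtual void act(const LogLine& m) = 0;

 private:
    // Drains the mailbox on the actor's own thread until it is closed.
    void run_loop() {
        while (auto m = mbox_.get()) act(*m);
    }
    BlockingMailbox<LogLine> mbox_;
    std::thread thread_;
};

// Example actor: its counter is touched only by its own thread.
class ErrorCounter : public ThreadActor {
 public:
    int errors() const { return errors_; }  // read after stop()

 protected:
    void act(const LogLine& m) override {
        if (m.level >= 3) ++errors_;
    }

 private:
    int errors_ = 0;
};
```

A caller constructs an ErrorCounter, calls start(), sends messages, calls stop(), and then reads errors(). The join inside stop() is what makes that final read safe without a lock on the counter. On older toolchains the build may need -pthread.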

actor_pkg_cpp/examples/log_pipeline.cpp is the basic-tier worked example: four actors composed via declarative wire() calls, each running on its own thread, processing 30 synthetic log lines. Run output:

[log_pipeline] starting ingestion of 30 lines
[sink trace=5] WARN    count=3 unique=3
[sink trace=10] ERROR count=3 unique=3
... (4 more lines)
[log_pipeline] parsed 30 records
[aggregator] final counts:
  DEBUG : 4 events (4 unique)
  INFO   : 6 events (6 unique)
  WARN   : 10 events (10 unique)
  ERROR : 10 events (10 unique)

The basic tier is the right choice when:

• Actor count is small (<1000) so the OS-thread stack overhead is acceptable.
• Actor work is heavy enough that std::thread scheduling overhead is amortized.
• Code clarity matters more than scaling to many actors.
• The actor needs to call into a blocking third-party library that does not work cleanly with coroutines.

For everything else, the coroutine tier is the better default.

K.3 Comparison with Existing C++ Concurrency Approaches

The C++ Actor Framework (CAF, libcaf)

CAF is the closest existing C++ project to what this appendix advocates: a typed actor library with distributed messaging. CAF is mature, well-documented, and in production. The choice between CAF and the framework in this book is not arbitrary; the two have different design centers.

Table K.1: CAF vs. this framework: two design centers.
Aspect	CAF	This framework

Lines of code (library)	~100 000	~700 header-only (basic + coro tiers)
Runtime	Thread pool + scheduler + BASP network protocol	`std::thread` per actor or M:N coroutine pool
Message dispatch	Pattern matching macros + `typed_actor` templates	Producer-side type-indexed routing via `wire()`; receiver casts `Msg*` from a shared `Mailbox<std::shared_ptr<Msg>>`
Behavior model	Erlang-style become / unbecome, behavior pattern matching	`act()` method override or coroutine `run()` loop
Distributed messaging	Built-in BASP (Binary Actor System Protocol)	ZMQ / NATS / iceoryx bridges as actor subclasses
Build complexity	CMake + library build	One `#include`
DOD alignment	Heap-allocated message objects, virtual dispatch on behavior matching	Plain Old Data (POD) message values, no virtual dispatch in hot path

CAF is general-purpose and rich; this framework is narrow-purpose and small. For systems that need typed-actor compile-time enforcement, behavior pattern matching, and built-in distributed networking, CAF is the right choice. For systems that want the DOD-aligned simplicity of Plain Old Data (POD) messages, the smallness of a header-only library, and a transport-agnostic distributed model layered on top through actor subclasses, this framework fits better.

Thread Pools with Shared Queues

The default “modern C++ concurrent system” shape is a std::vector<std::thread> pool, a shared work queue, and components communicating through callbacks or shared mutable state. This works but accumulates problems:

• Shared state requires locks; locks contend; contention is a tax on every core.
• No structural separation between components; the system shape is not visible from any one place.
• Adding a new component means finding all the places its callers and callees live.
• Backpressure is ad hoc; queue-full conditions cause hangs or unbounded growth.
• Lineage and observability are bolt-on; tracing a request through the system requires correlation IDs threaded by hand.

The actor framework addresses these by construction: each actor owns its state (no locks across actors); wire() edges declared from the parent make the topology visible at a glance; backpressure is the try_publish return value; and lineage is automatic via trace_id propagation.
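Two of these properties, backpressure as a return value and lineage carried with the message, are easy to show in isolation. The sketch below is single-threaded and hypothetical: Envelope, BoundedMailbox, try_take, RawLine, Parsed, and parse are invented for illustration, and only the names try_publish and trace_id come from the text. The framework's real signatures may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Envelope pairing a lineage id with a payload (field names assumed).
template <typename T>
struct Envelope {
    std::uint64_t trace_id;
    T payload;
};

// Fixed-capacity mailbox: a full queue is reported to the producer
// instead of blocking it or growing without bound.
template <typename T>
class BoundedMailbox {
 public:
    explicit BoundedMailbox(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }
    bool try_publish(const Envelope<T>& e) {
        if (q_.size() >= capacity_) return false;  // backpressure signal
        q_.push_back(e);
        return true;
    }
    bool try_take(Envelope<T>& out) {
        if (q_.empty()) return false;
        out = q_.front();
        q_.pop_front();
        return true;
    }

 private:
    std::size_t capacity_;
    std::deque<Envelope<T>> q_;
};

struct RawLine { int length; };
struct Parsed  { int tokens; };

// A stage derives its output from its input and copies trace_id forward,
// so lineage follows the data without hand-threaded correlation ids.
inline Envelope<Parsed> parse(const Envelope<RawLine>& in) {
    return Envelope<Parsed>{in.trace_id, Parsed{in.payload.length / 2}};
}
```

A producer that sees try_publish return false can drop, retry, or slow down; the decision is local and explicit rather than hidden in a queue that hangs or grows.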

Async/Await without the Actor Topology

Plain C++20 coroutines without an actor framework on top are a different concurrency model: cooperative scheduling on a small set of OS threads, with the runtime managing scheduling between continuations. Frameworks like folly::Future, ASIO’s coroutine support, and libunifex do this.

The actor model and the bare coroutine model are not in opposition — the coroutine tier in this appendix uses C++20 coroutines underneath — but they encode different defaults. Coroutines emphasize call composition: “await this future, then this one, then transform”; the system shape is in the call graph. Actors emphasize topology composition: “A publishes to B; B publishes to C and D”; the system shape is in the wiring.

For systems with stable topologies (data pipelines, microservice meshes, anything where “which components talk to which” changes rarely), the actor topology is easier to reason about. For systems with one-shot request/response patterns where each request takes a different path, coroutines fit better. The coroutine tier in this appendix gets both: actor topology in the user-level API, coroutine machinery underneath.
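The “A publishes to B; B publishes to C and D” shape can be written down as wiring. The sketch below is an assumption-laden illustration of type-indexed routing, not the framework's wire() implementation: Bus, Raw, Parsed, Totals, and run_topology are hypothetical, and delivery is a synchronous call rather than a mailbox hop.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <typeindex>
#include <typeinfo>
#include <unordered_map>
#include <utility>
#include <vector>

// Routes each published message to the handlers wired for its static type.
class Bus {
 public:
    template <typename M>
    void wire(std::function<void(const M&)> handler) {
        assert(handler);  // an empty handler would throw on delivery
        routes_[std::type_index(typeid(M))].emplace_back(
            [h = std::move(handler)](const void* p) {
                h(*static_cast<const M*>(p));
            });
    }
    // Returns the number of subscribers the message was delivered to.
    template <typename M>
    std::size_t publish(const M& msg) const {
        auto it = routes_.find(std::type_index(typeid(M)));
        if (it == routes_.end()) return 0;
        for (const auto& deliver : it->second) deliver(&msg);
        return it->second.size();
    }

 private:
    std::unordered_map<std::type_index,
                       std::vector<std::function<void(const void*)>>> routes_;
};

struct Raw    { int value; };
struct Parsed { int value; };
struct Totals { int sum = 0; int count = 0; };

// The whole topology is visible in one place: the three wire() calls.
Totals run_topology(const std::vector<int>& inputs) {
    Bus bus;
    Totals t;
    bus.wire<Raw>([&bus](const Raw& r) { bus.publish(Parsed{r.value * 10}); });  // B
    bus.wire<Parsed>([&t](const Parsed& p) { t.sum += p.value; });               // C
    bus.wire<Parsed>([&t](const Parsed&) { t.count += 1; });                     // D
    for (int v : inputs) bus.publish(Raw{v});                                    // A
    return t;
}
```

All wiring happens before the first publish, which matches the fixed-topology assumption: the route table is read-only while messages flow.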

Object-Oriented Component Hierarchies

This is what UVM does in hardware verification (Chapter 5), and what most large-scale component frameworks do in software (Spring- and Java EE-style patterns ported to C++, the COM/CORBA family). Components inherit from a common base class with lifecycle phases, factory configuration, and dependency injection. The same critique Chapter 6 levels at UVM applies to the software equivalents:

• Inheritance imposes a hierarchy on a problem (lateral coordination between equals) that has no hierarchy.
• Factory patterns and dependency injection accumulate boilerplate that obscures the dataflow.
• Virtual method dispatch on the hot path is a per-call cost that DOD-aligned code avoids.
• The class hierarchy’s model of computation (MoC) does not match the actual concurrent system, so the framework fights the problem instead of helping to solve it.

Actors with DOD-aligned messages are the alternative in the same way they are the alternative to UVM. The argument is the same; the application surface is broader.
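One way to keep virtual dispatch off the hot path is to make the message set a closed, tagged union and let the compiler generate the dispatch. The sketch below uses std::variant and std::visit for that; it is a swapped-in standard-library technique shown for illustration, not the mechanism the framework itself uses, and Ping, Stop, CounterState, and handle are hypothetical names.

```cpp
#include <cassert>
#include <type_traits>
#include <variant>

// Plain message structs; the variant's index is the type tag.
struct Ping { int seq; };
struct Stop {};
using Message = std::variant<Ping, Stop>;

// State owned by one actor.
struct CounterState {
    int pings = 0;
    int last_seq = -1;
    bool stopped = false;
};

// Dispatch is resolved per alternative at compile time: no base class,
// no virtual call, no heap allocation per message.
inline void handle(CounterState& s, const Message& m) {
    std::visit(
        [&s](const auto& msg) {
            using M = std::decay_t<decltype(msg)>;
            if constexpr (std::is_same_v<M, Ping>) {
                s.pings += 1;
                s.last_seq = msg.seq;
            } else {
                s.stopped = true;
            }
        },
        m);
}
```

Adding a message type means adding an alternative to Message and a branch in the handler, with no change to a class hierarchy.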

K.4 The Actor + DOD Synergy

Data-Oriented Design’s central premise is that the hardware reality of caches, memory bandwidth, and SIMD execution units should drive the structure of your data, not abstractions inherited from object-oriented programming. Common DOD prescriptions:

• Prefer struct-of-arrays to array-of-structs when a loop reads only a few hot fields, so that cold fields do not occupy the cache lines being streamed.
• Pack hot fields together; keep cold fields elsewhere.
• Avoid pointer chasing; prefer flat layouts.
• Avoid virtual dispatch on the hot path; prefer static dispatch via templates or branchless tag-based dispatch.
• Process data in batches, not one element at a time.
• Predict cache misses by understanding data lifetimes.
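The first two prescriptions can be made concrete with a small layout comparison. In this sketch the event type and its fields are invented for illustration: latency_us is the hot field that an aggregation loop reads, and the two strings are cold.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Array-of-structs: iterating latency_us also pulls the cold strings
// through the cache, one element at a time.
struct EventAoS {
    std::uint32_t latency_us;  // hot
    std::string   source;      // cold
    std::string   detail;      // cold
};

// Struct-of-arrays: the hot field is contiguous and densely packed.
struct EventsSoA {
    std::vector<std::uint32_t> latency_us;
    std::vector<std::string>   source;
    std::vector<std::string>   detail;

    void push(std::uint32_t l, std::string s, std::string d) {
        latency_us.push_back(l);
        source.push_back(std::move(s));
        detail.push_back(std::move(d));
        assert(latency_us.size() == source.size());  // columns stay aligned
    }
};

// Same computation over both layouts; only the memory traffic differs.
inline std::uint64_t total_latency(const std::vector<EventAoS>& events) {
    std::uint64_t sum = 0;
    for (const auto& e : events) sum += e.latency_us;
    return sum;
}

inline std::uint64_t total_latency(const EventsSoA& events) {
    std::uint64_t sum = 0;
    for (std::uint32_t l : events.latency_us) sum += l;
    return sum;
}
```

Both functions return the same total. The SoA version walks a dense array of 4-byte values, while the AoS version strides over whole elements, strings included.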

The actor pattern aligns with each of these by construction:

• Messages are Plain Old Data (POD) structs (RawLogLine, ParsedRecord, AggregateUpdate in the worked example). They are flat, value-copied, cache-friendly, memcpy-able, easy to log and replay.
• Each actor owns its data; no actor reaches into another’s state. Data locality is automatic — an actor’s hot data lives in the actor’s class members, contiguous in memory, accessed only by one thread (its OS thread in the basic tier; one of the worker pool threads at a time in the coroutine tier).
• Subclassing is structural, not a vehicle for runtime polymorphism: an actor supplies its act() or run() override, and modern compilers can devirtualize the call when the concrete type is known statically.
• Mailbox draining is naturally batched.
• Topology is fixed at startup; subscriber lists become read-only. No subscriber-traversal contention at runtime.
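Two of the points above can be checked mechanically: that a message type really is plain data, and that draining a mailbox hands the actor a contiguous batch. The sketch below is a hedged illustration. ParsedRecord is named in the text but its fields here are assumed, and BatchMailbox and total_length are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <type_traits>
#include <vector>

// Field layout is assumed; only the type name appears in the worked example.
struct ParsedRecord {
    std::uint64_t trace_id;
    std::uint32_t level;
    std::uint32_t length;
};

// Compile-time evidence that the message is flat, copyable data.
static_assert(std::is_trivially_copyable_v<ParsedRecord>, "memcpy-able");
static_assert(std::is_standard_layout_v<ParsedRecord>, "predictable layout");

// Mailbox whose consumer takes everything queued so far in one swap,
// paying for one lock acquisition per batch rather than per message.
template <typename T>
class BatchMailbox {
 public:
    void put(const T& v) {
        std::lock_guard<std::mutex> lk(m_);
        pending_.push_back(v);
    }
    std::vector<T> drain() {
        std::vector<T> batch;
        {
            std::lock_guard<std::mutex> lk(m_);
            batch.swap(pending_);
        }
        return batch;
    }

 private:
    std::mutex m_;
    std::vector<T> pending_;
};

// The actor's work is then a tight loop over contiguous records.
inline std::uint64_t total_length(const std::vector<ParsedRecord>& batch) {
    std::uint64_t sum = 0;
    for (const auto& r : batch) sum += r.length;
    return sum;
}
```

The swap leaves the mailbox empty for producers and gives the consumer exclusive ownership of the batch, so processing happens outside the lock.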

The DOD prescription “don’t fight the hardware” and the actor prescription “match concurrent message-passing on wires” converge on the same code. POD messages on wire() edges between independent actors, each with its own thread of execution, are what both models recommend, in the same shape.

This convergence is the deeper argument the appendix makes: actors and DOD are not separate good ideas; they are two views of the same good idea. Concurrent systems built around encapsulated state and typed message passing inherit DOD’s performance properties for free, and DOD-aligned data layouts inherit the actor model’s compositional clarity for free. The combination is what every concurrent C++ system should reach for first.

K.5 The Same Actor, in C++, Crossing into Hardware

The comparisons above place the pure-C++ tier against other ways of building software. Its other role is the one that ties this appendix to the rest of the book: it is the host-language rendering of an actor that also has a hardware rendering, and the two are the same authored definition.

The verification actors are in this tier, and they cross substrates unchanged. The substrate-swap example (appG_firesim_substrate_swap) authors its stimulus, scoreboard, and coverage actors once, in the basic tier of this appendix (actor.h), and runs them against two renderings of the device under test: an accumulator as a C++ actor, and the same accumulator as synthesizable RTL hosted by Verilator. The verification actors do not change between the two runs — same wire() topology, same handlers, same golden model — and both runs produce identical results (256 transactions, zero mismatches, full coverage). This is the host-language-agnostic claim made runnable, and it is the property Chapter 7’s automatic substrate swap rests on: the C++ actor is the artifact that crosses the substrate boundary.
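The structure of that experiment can be sketched in miniature. Everything below is a stand-in under stated assumptions: the accumulator is assumed to add an 8-bit input to a 32-bit running sum, the “clocked” rendering is an ordinary C++ model with an explicit register update rather than Verilator-hosted RTL, and Dut, Report, and run_verification are hypothetical names. It shows only that the verification side is written once and reused.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>

// A device under test maps one input byte to the running sum after it.
using Dut = std::function<std::uint32_t(std::uint8_t)>;

// Rendering 1: behavioral accumulator.
inline Dut make_behavioral_dut() {
    return [sum = std::uint32_t{0}](std::uint8_t x) mutable {
        sum += x;
        return sum;
    };
}

// Rendering 2: register-style model with an explicit clock step,
// standing in for the RTL rendering.
struct ClockedAccumulator {
    std::uint32_t acc_q = 0;  // register output
    std::uint32_t acc_d = 0;  // next-state value
    void drive(std::uint8_t x) { acc_d = acc_q + x; }
    void clock() { acc_q = acc_d; }
};

inline Dut make_clocked_dut() {
    return [r = ClockedAccumulator{}](std::uint8_t x) mutable {
        r.drive(x);
        r.clock();
        return r.acc_q;
    };
}

struct Report {
    int transactions = 0;
    int mismatches = 0;
    int values_covered = 0;  // distinct input values observed
};

// Stimulus, golden model, expected-value FIFO, comparator, and coverage:
// identical for every rendering of the device.
inline Report run_verification(Dut dut) {
    Report rep;
    std::uint32_t golden = 0;
    std::deque<std::uint32_t> expected;
    bool seen[256] = {};
    for (int i = 0; i < 256; ++i) {
        const auto x = static_cast<std::uint8_t>(i);  // stimulus
        golden += x;
        expected.push_back(golden);                   // golden model -> FIFO
        const std::uint32_t actual = dut(x);
        assert(!expected.empty());
        if (actual != expected.front()) ++rep.mismatches;
        expected.pop_front();
        if (!seen[x]) { seen[x] = true; ++rep.values_covered; }
        ++rep.transactions;
    }
    return rep;
}
```

Passing make_behavioral_dut() and then make_clocked_dut() to the same run_verification gives matching reports, which is the small-scale analogue of the two runs described above.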

When the C++ actor must ride the fabric, it is rendered to RTL, not re-coded. A verification actor that must run at emulator speed beside the design is rendered to synthesizable RTL by the same mechanical lowering Appendix E demonstrates — an actor is a latency-insensitive bounded dataflow network node, and Golden Gate automates the lowering with a formal cycle-exactness guarantee (Appendix E §E.5). The substrate-swap example does exactly this: the same loop runs as C++ actors and as a synthesized fabric, and Yosys lowers the scoreboard — golden model, FIFO, comparator — to 231 flip-flops (Appendix E §E.4). The C++ tier is the authoring and host-side substrate; the RTL is the other rendering of the same actor.

The host side of every hardware seam is a pure-C++ actor. When part of the graph is on hardware and part on the host — the workload read-out, an external-I/O endpoint, a link to a remote emulator pool — the seam is a bridge actor (Appendix G §G.6), and the host side of that bridge is a C++ object: a bridge_driver_t for FireSim, a SCE-MI proxy for a commercial emulator, a ZmqBridge for a distributed run. All three are pure-C++ host-side actors, realizations of one TransportBridgeActor pattern (Appendix L §L.4). So this appendix’s framework is not only where general software lives; it is where the host side of the entire hardware-verification flow lives — the dashboard, the regression aggregator, the bridge drivers — all C++ actors on the same wire() bus as the software the team also builds here.

K.6 Distributed Deployment

The same actor topology spans processes and machines through transport-bridge actors, exactly the pattern Appendix L demonstrated for the SystemC port. A pure-C++ port mirrors the same bridge pattern: a ZmqPublisherBridge actor would subscribe to local typed messages and forward them out over ZMQ; a ZmqSubscriberBridge actor would receive over ZMQ and republish locally. Such bridges are themselves actors; they participate in the framework’s lifecycle, supervision, and observability the same way any other actor does. There is no special “distributed mode” to enable; choosing a transport is choosing how a particular pair of actors communicate.
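The core of such a bridge pair is small because the messages are plain data. The sketch below replaces ZMQ with an in-memory queue of byte frames so that it stays self-contained: ByteChannel, Frame, bridge_send, and bridge_recv are hypothetical, AggregateUpdate's fields are assumed, and the raw memcpy encoding presumes both ends share the same ABI and byte order. A real transport would also need framing, versioning, and error handling.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <deque>
#include <type_traits>
#include <utility>
#include <vector>

// Field layout assumed; only the type name appears in the worked example.
struct AggregateUpdate {
    std::uint32_t level;
    std::uint32_t count;
};
static_assert(std::is_trivially_copyable_v<AggregateUpdate>);

using Frame = std::vector<unsigned char>;
using ByteChannel = std::deque<Frame>;  // in-memory stand-in for a socket

// Publisher-bridge side: copy the message bytes into a frame and hand the
// frame to the transport.
inline void bridge_send(ByteChannel& ch, const AggregateUpdate& m) {
    Frame f(sizeof m);
    std::memcpy(f.data(), &m, sizeof m);
    ch.push_back(std::move(f));
}

// Subscriber-bridge side: take one frame, reject it if the size is wrong,
// otherwise rebuild the typed message for local republication.
inline bool bridge_recv(ByteChannel& ch, AggregateUpdate& out) {
    if (ch.empty()) return false;
    Frame f = std::move(ch.front());
    ch.pop_front();
    if (f.size() != sizeof out) return false;
    std::memcpy(&out, f.data(), sizeof out);
    return true;
}
```

Swapping the deque for a socket changes the carrier and nothing else: the actors on either side still see a typed AggregateUpdate.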

The same bridge actor that crosses a process boundary crosses a software-to-hardware boundary. A TransportBridgeActor in this tier would be the universal seam primitive (Appendix L §L.4): its distributed-transport form is the ZmqBridge above, its commercial-emulator form is a SCE-MI transactor’s host proxy, and its FireSim form is a bridge_driver_t over a token channel — three realizations of one pattern, the carrier the only difference (Appendix G §G.6). The pure-C++ tier is where those host-side bridge drivers belong, which is why a regression dashboard, a cloud-emulator client, and an in-process scoreboard are all the same kind of object here.

K.7 Files

• actor_pkg_cpp/include/actor.h — basic tier, std::thread-per-actor framework. ~310 lines.
• actor_pkg_cpp/include/coro_actor.h — coroutine tier, C++20 co_await + M:N green-thread scheduler. ~390 lines.
• actor_pkg_cpp/examples/log_pipeline.cpp — four-stage pipeline on the basic tier.
• actor_pkg_cpp/examples/coro_pingpong.cpp — coroutine-tier ping-pong throughput benchmark.
• actor_pkg_cpp/examples/coro_pipeline.cpp — four-stage coroutine pipeline.
• actor_pkg_cpp/Makefile — builds all examples with g++ -std=c++20 -lpthread.

K.8 What This Appendix Has Established

• The actor methodology developed in this book is not specific to hardware verification. The same primitives — typed messages, wire() topology, encapsulated state, distributed transports — are the natural way to structure most concurrent and distributed software systems.
• Pure C++ is the right deployment for everything that does not need simulated time. The framework is ~700 lines (basic + coroutine tiers combined), header-only, depends only on the C++ standard library and pthreads. Compile times are sub-second; distribution is one #include.
• The coroutine tier matches Erlang/BEAM, Go goroutines, and Java virtual threads at the user-level ergonomics: each actor’s run() is a forever-loop that co_awaits its mailbox. std::thread per actor is no longer required; thousands of actors fit on a small worker pool.
• The basic tier is the readable fallback for actors with heavy, long-lived state, where one OS thread per actor is appropriate.
• Compared to CAF, this framework is two orders of magnitude smaller, more aligned with Data-Oriented Design (POD message values, no behavior pattern matching, and a scheduler of about 60 lines rather than a large runtime), and more transparent to standard C++ debugging and profiling tools.
• Compared to thread pools with shared queues, bare async/await, and OOP component hierarchies, the actor pattern wins on system-shape clarity, freedom from cross-actor locks, backpressure handling, and lineage observability, all by construction.
• The actor + DOD combination is not two good ideas combined; it is two views of the same good idea. Plain Old Data (POD) as messages, encapsulated state, no shared mutation, no virtual dispatch in the hot path, fixed topology — both methodologies converge on the same code shape, and that code shape is closer to what the hardware reality of modern CPUs rewards than what most concurrent C++ code currently looks like.
• The pure-C++ port and the SystemC port (Appendix J) share the actor methodology but target different deployments. A team deploys the pure-C++ port for general software, the SystemC port for hardware verification, and connects them via the framework’s distributed transports when the two deployments need to coordinate.
• This tier is also where the host side of the hardware flow lives (§K.5). The substrate-swap example’s verification actors are written here and cross the software/RTL substrate boundary unchanged — one of the book’s three shipped renderings of a single authored actor: a SystemVerilog actor placed-and-routed onto iCE40 silicon (Appendix E), the pure-C++ actors of this appendix crossing substrates, and the SystemC actor that reaches RTL through HLS (Appendix J). One authored definition, several renderings, no rewrite between them.
• Beyond general concurrent software, the pure-C++ port is also the natural front-end substrate for the AI hardware systems work in Appendix M. A GPU-targeting code generator that lowers actor topologies to CUDA, HIP, or oneAPI starts here: the header-only library expresses the topology, the codegen pass walks the wire() graph, and the GPU runtime becomes the assembly layer. The transport-bridge pattern that Appendix L demonstrates with ZMQ generalizes to NCCL with the same shape.
• For most software systems being built today, this is the natural way to build them. Not the only way, not always the best way, but the default that makes the structural shape of the system visible, the data flow obvious, and the path from prototype to production short.