Emergent Hardware Verification

Chapter F Synthesizing Constrained-Random Stimulus

Appendix E established that an actor is a finite state machine and therefore synthesizes, and carried the claim to gates — a counter, then a whole verification loop with its scoreboard. Appendix G carries the loop onto the emulator with no manual rewrite. Both left one component unaddressed, and it is the component every engineer points to first: the constraint solver. Constrained-random stimulus is the part of a testbench that looks least like hardware — a declarative set of relations handed to an NP-complete search that runs, at simulation time, on a SAT/SMT engine (Chapter 3 and §4.4). If anything in the verification graph cannot ride the fabric, surely it is this.

This appendix shows that it can, and proves it on a production corpus rather than a toy. The argument is the one planted in §4.4: randomize() does not need the expensive half of a proof engine. It never refutes; it only finds a model. The model-finding half is constructive, a constructive result can be settled once at compile time, and what it compiles to is ordinary synthesizable logic. We take the full constraint set of an open-source RISC-V instruction generator, compile every constraint shape it contains to fabric, and check each synthesized sampler against the original solver running the same constraint — register for register, both directions. The whole generator then synthesizes as one clocked design and is placed and routed on an open FPGA, and a Lean development certifies the samplers and, in one case, produces one smaller and faster than the hand-written form. For the harder tier this corpus never reaches — a constraint that genuinely needs a satisfiability search — a real DPLL-through-CDCL engine is itself compiled onto the fabric, so the method has no gap where it quietly falls back to a host solver.

F.1 The model-finding half is a datapath

A constraint block describes a set: the assignments that satisfy it. A simulator’s solver explores that set at runtime, one query at a time, paying the NP-complete cost on every call. But the set itself does not change between calls — only the seed does. For a constraint whose variables are bounded (every rand field in a testbench is), the legal set is a finite object that can be computed once, ahead of the run, and the act of drawing from it is then a table read or a short datapath, not a search.

This is the difference between the two halves of a proof engine, made concrete. To prove a property, an engine must establish that no counterexample exists anywhere in an unbounded state space — it must refute, and that is where the cost of formal verification lives (Chapter 3 develops it). To generate stimulus, an engine must exhibit one legal assignment — it must find a model, and a model, once found, is a witness it can hand back. The constructive content of “this set is non-empty and here is a member, indexed by a seed” is exactly a function from seed to member. That function is the sampler (Figure F.1). Synthesizing the solver means emitting that function.

Reachable set equals solution set. The property that makes this sound is that the sampler’s reachable set — every value it can emit over all seeds — equals the constraint’s solution set, with nothing illegal reachable and nothing legal omitted. A runtime solver achieves this by searching and rejecting; the compiled sampler achieves it by construction, with no rejection and no backtracking on the fabric. Establishing that equality for each constraint is the whole of the validation, and §F.2 does it both directions against the reference solver.

Three tiers of construction. How a constraint becomes a sampler depends on its shape, and the shapes fall into three tiers:

• Tier 0 — the constructive datapath. The constraint is already a function of its inputs: a sign-extension, a field width set by a format, an address masked into a power-of-two page. There is no search to remove; the “solver” was always a datapath, and the sampler is that datapath written out.
• Tier 1 — structure the legal set. The constraint is relational — all-different, a range, a bit-slice equality — and the legal set is enumerated or, better, structured so that a seed indexes a member directly. A binary-decision-diagram walk, or a factorial-number-system unrank, turns “the \(n\)-th legal tuple” into a short combinational path; no member is ever generated and rejected.
• Tier 2 — invert, do not bit-blast. The constraint contains arithmetic that would explode if flattened to Boolean gates — a multiply, a modulo. Instead of bit-blasting the relation and handing it to a SAT engine, the sampler inverts it: a certified divider or a residue construction produces a satisfying input directly. This is the only tier that needs more than wires and tables; §F.2 reports how rarely a real corpus reaches it, and §F.6 builds and synthesizes a satisfiability engine — DPLL through CDCL — for the corpus that does.

A theme that recurs below is that the tier a constraint lands in is not predicted by the operator it is written with (Figure F.2). The most expensive family in the corpus carries no arithmetic at all; the only family written with multiplication is among the cheapest.

A sampler, concretely. A small, self-contained sampler shows the pattern. The register-allocation constraint asks for a register not in a reserved (or already-used) set; given a seed idx, the sampler walks the register file once and returns the idx-th register not excluded:

module reg_select_ex (input logic [31:0] excluded, input logic [4:0] idx,
                        output logic [4:0] reg_out);
  always_comb begin
    logic [5:0] c; logic done; reg_out = 5'd0; c = 6'd0; done = 1'b0;
    for (int r = 0; r < 32; r++) if (!excluded[r]) begin
        if (!done)              reg_out = r[4:0];   // keep the latest legal register
        if (c == {1'b0, idx})   done    = 1'b1;     // stop once we have passed idx
        c = c + 6'd1;
    end
  end
endmodule

There is no search and no rejection: every seed yields a legal register, the reachable set is exactly the legal set, and the whole thing is one always_comb block — ordinary combinational logic. The richer constraints of the capstone are this idea composed: a chain of such selectors for an all-different set, a cumulative-threshold walk for a weighted dist, a factorial-number-system unrank for a permutation (§F.7). This is the sampler of Figure F.1 as gates.

F.2 The capstone: a production constraint corpus on fabric

The demonstration target is riscv-dv, the open-source RISC-V instruction-stream generator. It is the right test because it is real: an industrial constrained-random generator with a register-allocation discipline, dependent address calculations, format-specific instruction encodings, a weighted return-address distribution, and a vector-extension register model. Its constraint set is an inventory of \(140\) constraint blocks spanning every operator class SystemVerilog offers — Boolean and relational predicates, ranges and inside sets, bit-slice equalities, solve … before dependency chains, foreach array constraints, unique sets, weighted dist, and arithmetic. Of all \(140\) blocks, exactly three are written with multiplication, and §F.3 shows even those collapse to shifts.

The validation is direct equivalence, not re-derivation. A subtle methodological point makes the result strong. Verilator 5.049 supports rand, constraint, and randomize() — including unique, dist, and inside over a queue — and the UVM 1.2 base. So the original riscv-dv constraint can be compiled and run as the reference, and the synthesized sampler is checked against the solver’s own output: every value the sampler emits is accepted by the original constraint (soundness), and the two cover the same legal set (completeness). This is equivalence against the production artifact, not against a second model of it.

Every constraint shape, synthesized and checked. Each constraint family was built as a constructive sampler, validated both directions against the reference solver in Verilator, and synthesized with Yosys for the iCE40 (SB_LUT4 cells). Table F.1 collects the result. Every shape passes with zero illegal outputs and identical legal coverage; no family required a runtime solver on the fabric.

Table F.1: Each riscv-dv constraint shape as a synthesized sampler, checked both directions against the reference solver.

Constraint family	Tier	Synthesized (iCE40 LUT4)
R-type operand / register allocation	1	289
Immediates (`imm_c`, format widths)	0	19
Load/store address chain (`addr_c`)	0/1	16
Instruction assembly (all six formats)	0	59
Unique register list (`avail_regs_c`)	1	4302
Vector LMUL groups (the only random-variable multiplies)	0	33
Weighted distribution (`ra_c`, `dist`)	1	28

The operator does not predict the cost. Two rows in Table F.1 make the point that opened this appendix. The unique register list — written with unique, no arithmetic — is the largest at \(4302\) LUT4, because it is ten dependent selections, each a priority scan against a growing exclusion mask. The vector LMUL family — the only constraints in the entire corpus written with multiplication — is among the smallest at \(33\) LUT4, because the multiplier (vlmul) is always a power of two, so every “multiply” and “modulo” is a shift or a mask (§F.3). What a constraint costs in gates is a property of its dependency structure, not its syntax. The general invert-don’t-bit-blast divider of Tier 2 exists for the case a corpus needs it; this corpus never needs it at width.

F.3 The only multiplications are shifts

The vector register-group constraints — narrowing_instr_c, widening_instr_c, nfields_c — are the only blocks in the corpus whose constraints multiply random variables: a source register aligned to vlmul * 2, a destination group that must not overlap it, a field count bounded by (nfields+1) * vlmul \(\le \) 8. Because the vector length multiplier is one of \(\{1,2,4,8\}\), \(\texttt {vlmul}\times 2\) is a power of two, so the alignment modulo is a bit-mask, the multiply is a left shift, and the bound is a shifted comparison. The synthesized sampler is \(33\) LUT4 of mask-and-shift, and it covers exactly the aligned register groups the reference solver covers, for every legal LMUL. The only arithmetic in the corpus is, on inspection, not arithmetic at all.

F.4 The whole generator, one design

A real emulation run is one integrated design, not a drawer of separate experiments. The per-family samplers were therefore wired into a single clocked generator that, each cycle, selects one of eight instruction classes (ALU register and immediate forms, load, store, branch, upper-immediate, jump, and a vector class) and routes its operands through the families that class needs, all sharing one piece of cross-instruction state: a live-register set that grows as instructions write their destinations, so a source register is never read before it has been written. Across eight thousand instructions spanning all eight classes the generator produced no illegal instruction — every destination legal by its class’s rule, every source previously written — and synthesized as one design at \(1065\) LUT4 plus \(32\) flip-flops, the flip-flops being exactly the live-register state (Figure F.3).

One design, or a distributed graph — the same thing. The monolith is the flattened view; the natural form is the one the rest of the book develops — a network of independently-synthesizable actors composed by ‘WIRE (Appendix E). Each family is an actor that synthesizes on its own (the per-family rows of Table F.1 are their footprints); the wiring is the integration; and an emulator realizes the design exactly that way, as distributed logic across the fabric rather than one block. The flattened single-module synthesis is the proof the actors compose; the distributed graph is how they are placed.

Frequency, and what gates it. Placed and routed on an iCE40 (nextpnr), this \(1065\)-LUT4 generator reaches \(28\) MHz on its own; wired end to end with the register allocator of §F.5 into a single design, it places at \(22.8\) MHz — roughly \(23\) million legal instructions per second, one per cycle, on the lowest-end FPGA in the open tool chain, with no host and no solver in the loop. That the allocator barely moves the frequency (\(28\) to \(22.8\) MHz) is itself the point: the critical path is the per-instruction source-selection loop, and the allocator — despite being the largest block — is not on it, for the reason §F.5 develops. On emulation-class fabric the same netlist clocks far higher, and the limiting selector is a textbook target for pipelining: splitting its scan into two registered stages lifts it from \(25.8\) to \(44.9\) MHz — \(1.7\times \), with the function bit-for-bit unchanged and a stimulus stream absorbing the one-cycle latency without losing throughput. The point here is not the absolute number but that the constraint-bound testbench runs as a clocked machine at all.

F.5 Solve once, reuse

The register allocator is the largest sampler in the corpus, and at first that looks like a problem for a generator that emits an instruction every cycle. It is not, because the allocation is a per-stream decision, not a per-instruction one. A directed instruction stream chooses its set of available registers once, when it is created, and reuses that set across every instruction it emits. The allocator is therefore one-time setup, off the per-instruction critical path entirely.

Recognizing this changes the hardware. Instead of ten selectors unrolled into one combinational cycle, the allocator becomes a small sequential machine: one selector, time-multiplexed over ten cycles, with the exclusion mask held in a register — \(428\) LUT4, a tenth the area of the unrolled form, and its ten-cycle latency is paid once per stream and amortized to nothing. A later draw that wants a different assignment does not re-solve; it shuffles the held set — a certified permutation of ten elements, not a fresh search over thirty-two registers. This is precisely the shape of a verification actor: solve at initialization, hold the result, and answer each request from the held solution. The expensive construction happens once; the steady state is cheap.

F.6 The hard tier, runnable: a satisfiability engine on the fabric

The riscv-dv corpus is Tier 0 and Tier 1 throughout — not one of its \(140\) blocks needs a genuine satisfiability search, and even its three “multiply” constraints are shifts (§F.3). That is a fact about this corpus, not a limit of the method, and it raises the obvious question: when a constraint does need a real solver — a residue, an arithmetic relation a sampler cannot invert in closed form — does the argument still hold? It does, because Tier 2 is not a hand-wave. The research built a real satisfiability engine and put it on the fabric.

The engine is a model-finder, not a prover: it searches for one satisfying assignment and, exactly like randomize(), never has to prove a formula unsatisfiable — it is Figure F.1’s right-hand half, now with a backtracking search behind it instead of a table. It was built in stages, each synthesized and measured on a worked residue constraint:

• DPLL on the fabric. The classical backtracking search — a decision stack, unit propagation, chronological backtrack — with the trail held in registers. It averages \(0.31\) backtracks per sample (the search is shallow on this constraint) and synthesizes to \(4519\) logic cells at roughly \(16\) MHz on a small iCE40; a Verilog-2005 twin confirms the lowering is tool-independent.
• DPLL(T). A theory propagator in the loop — the “T” of SMT — so arithmetic facts prune the Boolean search directly, soundly, with no bit-blasting. It deepens the search (\(0.31 \to 1.15\) backtracks), which is the point: it is solving a harder problem correctly rather than flattening it to gates.
• CDCL. Conflict-driven clause learning with first-UIP conflict analysis, on a flat-logic clause store held in block RAM. Learning the reason for each conflict cuts backtracks by about \(74\%\), and — the result that matters for hardware — the logic is flat in the store’s depth: a deeper learned-clause memory costs block RAM, not logic levels, so the engine grows without lengthening its critical path.

So the tier that looks least like hardware — a clause-learning satisfiability search — is a clocked machine on the fabric, measured end to end. The riscv-dv generator never reaches it; Tier 2 exists, and runs, for the corpus that does.

F.7 Certified samplers, and when the proof outranks the solver

The samplers above are validated against the reference solver, which is the right check when the solver is trusted. But the model-finding half of a constraint also admits something stronger than validation: proof. A Lean development builds each sampler as a function unrank : Fin \(N\) \(\to \) {a // P a} — an index into the legal set whose result type carries a proof that the value satisfies the constraint. Because the type is inhabited only by legal values, the sampler is sound by construction; companion theorems establish that it is complete (every legal value is reachable) and uniform (distinct indices give distinct values), all in a constructive core with no classical axioms. The constructive content of those proofs is the sampler; the proof and the program are one object, and the same development emits the SystemVerilog. The development is a sixteen-step arc in pure Lean 4 — from a termination bound, through certified code generation, to a real riscv-dv block (its solution count reproduced independently of the production solver) and the improved allocator below — machine-checked end to end, with no sorry.

The proof can outrank the solver. Two results show why this is more than rigor for its own sake. First, the certified development was used not merely to check the hand-written register allocator but to improve it: a Lehmer (factorial-number-system) unrank, proved to produce distinct in-pool registers, reconstructs the allocation by a five-bit bump-and-insert instead of a thirty-two-wide priority scan, and synthesizes at \(1478\) LUT4 against the hand-written \(4302\) — \(2.9\times \) smaller, and faster — with uniqueness discharged by a theorem rather than enforced by a runtime constraint. Second, the certification proved more trustworthy than the reference solver: the same Verilator that solves the integer form of the weighted dist correctly mis-weights the enum-named form of the identical distribution. A certified sampler emits a distribution that is correct by proof, independent of any solver’s bugs. When the reference can be wrong, the proof is the ground truth, and validation inverts: the solver should match the sampler.

F.8 What compile-time reasoning buys at run time

Compiling the model-finding half ahead of the run is not only a way to remove a solver; it changes what the hardware must do and what can be known about it before the design ever reaches the fabric. Three advantages follow, and together they are why a dependently-typed development is more than bookkeeping.

• The search is gone, so the runtime is a datapath. Whatever reasoning a constraint needs — enumeration, a binary-decision-diagram walk, a factorial rank, even the backtracking satisfiability search of §F.6 — happens once, in the compiler. What remains on the wire is a table read or a short combinational path, with no backtracking and no rejection. The NP-complete cost is paid by the compiler, once, not by the fabric, every cycle.
• Bounded reasoning, bounded hardware. The development proves not only that a sampler is correct but that its construction terminates in a bounded number of steps; a bounded construction lowers to bounded logic depth. The area and the critical path are knowable before place-and-route, not discovered after — where a runtime solver offers no such bound, and its worst case is one every verification team has met: the randomize() call whose runtime swings unpredictably from seed to seed.
• Correct by proof, not by trust. Because the sampler’s result type carries a proof that its output is legal (§F.7), the reachable set equals the legal set by construction: there is nothing to check at run time, and nothing a buggy reference solver can get wrong. The certification is the verification, and the same development that proves the sampler emits it.

This is the answer to the question a dependently-typed prover invites — whether compile-time reasoning can improve runtime synthesizability. It can, and concretely: it removes the search, bounds the logic, and discharges the legality check, all before the design reaches the fabric.

F.9 The actual generator classes, in an open simulator

A final check closes the gap between “the constraint, transcribed” and “the constraint, as shipped.” The actual riscv-dv source — its real riscv_instr_pkg, every instruction class, and the configuration object that holds the constraints — compiles and links under UVM 1.2 in Verilator 5.049, into a running model. The real return-address constraint, run verbatim from that build with the production register enumeration, enforces its support exactly and, in the integer form, matches the certified sampler’s distribution. The synthesized stimulus is checked not against a paraphrase of riscv-dv but against riscv-dv.

F.10 What this establishes, and what it does not

It establishes that the constraint solver — the component most assumed to strand a testbench on the host — has a synthesizable form, that the form is reached by compiling the model-finding half of the constraint ahead of the run, and that on a production corpus every constraint shape compiles, checks equivalent to the reference solver both directions, and synthesizes; that the whole generator runs as one clocked design on an open FPGA; that even the hard tier, a clause-learning satisfiability search, is itself a clocked machine on the fabric (§F.6); and that the samplers can be certified, the certification bounding the hardware, bettering the hand-written result, and outranking a buggy solver (§F.7–§F.8).

It does not claim that an arbitrary constraint is cheap: a genuinely wide variable-by-variable multiply or a deep combinational dependency chain costs gates, and Tier 2 — built and synthesized in §F.6 — is where that cost is paid. It does not replace the simulator’s solver for interactive debug, where re-elaborating a sampler per edit would be absurd; the compiled form is for the run that has settled and is ready to ride the emulator at scale. And the corpus, though real and complete in its operator classes, is one generator; the claim is that the method covers the shapes a constrained-random testbench is built from, demonstrated end to end on one that matters.

F.11 Files

The development lives under appF_synthesize_constraints/ in the companion repository, organized as numbered directories with a top-level README mapping them to the sections above. The frontend that parses production SystemVerilog constraints and compiles them through a BDD is in 01_constraint_compiler/; the constructive-sampler tiers are in 02_constructive_samplers/ and 03_reactive_constraints/; the runnable satisfiability engine of §F.6 — DPLL, DPLL(T), and CDCL on the fabric — is in 04_sat_engine/; the sixteen-step Lean certification arc, including the certified Lehmer allocator, is in 05_lean_certified/; the riscv-dv capstone — the per-family samplers, the integrated generator, the pipelined selector, the place-and-route scripts, the \(140\)-block coverage ledger, and the UVM oracle build — is in 06_riscvdv_capstone/, with a per-slice README for each constraint family. The riscv-dv corpus the compiler reads is vendored at a pinned commit, so the samplers and engines build without a separate clone; only the UVM oracle build cross-checks against an external riscv-dv and UVM checkout. Every sampler and every engine is the exact source validated in Verilator and synthesized in Yosys; the numbers in this appendix are reproduced by the scripts named in those READMEs. The complete from-scratch research record — the algorithms built from nothing (SAT, DPLL, CDCL, BDDs, SMT, DPLL(T), constraint propagation), every proof-of-concept, and the honest negative results — is preserved alongside under synthesize_constraints/.