Emergent Hardware Verification

Chapter 3 Formal Verification

Chapter 2 introduced the foundational temporal properties the formal world reasons over — safety, liveness, fairness, and precedence — and grounded them in SystemVerilog Assertions (§2.4, §2.4). The six design examples that followed each applied a specific formal technique to a real RTL problem: cover/assume/assert with safety, liveness, and round-robin precedence on the round-robin arbiter and its symbolic-client generalization (§2.5); assume–guarantee composition across clock domains on the asynchronous handshake controller (§2.6); Wolper’s data-independence symbolic-token abstraction on the FIFO (§2.7); sequential equivalence checking on the pipelined ALU (§2.8); exclusivity with local-node and hand predicate abstraction — plus the CEGAR loop that discovers predicates automatically — on the MSI cache coherence controller (§2.9); and counter and memory abstraction on the memory controller (§2.10). Each example used the right tool for the job and moved on.

This chapter zooms out to the end-to-end flow. Rather than introducing another technique, it assembles the techniques the previous chapter demonstrated into the pipeline a working team uses — specification capture, block-level proof, module composition, subsystem and SoC scope, and high-confidence RTL freeze — and connects each stage to the underlying proof engines, abstractions, and industry tooling that make it run. The aim is operational: which engine runs at which stage, which industry tool gets pointed at which scope, where modern LLM-assisted assume–guarantee partitioning fits, what the formal sign-off artifact actually looks like, and how the same flow turns into a fast bug hunter when the design is still maturing.

Simulation is like walking through a massive dark maze with a flashlight (Figure 3.1). You see only the path under the beam, and if you miss a turn you miss the bug. Formal verification is turning on the stadium lights and looking at the maze from above. The next section makes that difference precise — why it is a difference in kind, not degree, and what simulation and emulation, however fast, structurally cannot do. The rest of the chapter is then the story of how the industry built those stadium lights: the workflow that puts them in place, the proof engines underneath, the SMT lift that handles datapath, and the orchestrator that runs them in parallel. The engineer types one command and gets back a mathematical answer about every reachable state. The chapter then climbs past the automation ceiling into theorem proving, and closes where every flow must: the app portfolio a practitioner actually runs, formal as a bug hunter, and the signoff artifact that ships.

3.1 Why Formal Verification: What Simulation and Emulation Cannot Do

The flashlight and the stadium lights of Figure 3.1 are not merely an illustration; they name a difference in kind. Simulation, constrained-random, and emulation are one family: each establishes truth by executing the design on some stimulus and watching what happens. They differ in speed and scale, not in method. Every one of them answers an existential question — does there exist a stimulus on which the design misbehaves? — so each can only ever find a bug, never certify its absence. Correctness is a universal claim: for every reachable state, the property holds. No quantity of “we ran it and nothing broke” ever adds up to “nothing can break,” because a pile of existential observations is not a universal proof. This is not an efficiency gap that faster machines close; it is a gap in what the two kinds of activity can say.

Formal verification closes it by asking the same question the bug hunt asks and answering it completely. A model checker searches for a counterexample across all reachable states at once; when that exhaustive search comes back empty, the emptiness is the proof, because “there is no reachable counterexample” and “the property holds on every reachable state” are the same sentence. Formal is not an exotic alternative to simulation — it is bug hunting carried to completion, and the completion is what turns a found-or-not-found search into a yes-or-no guarantee.

That guarantee is the first of two reasons a modern flow cannot do without formal. The second is cost: even for the bugs simulation does find, the assertions formal shares with simulation make them far cheaper to localize and fix — the executable specification developed later in this section. Either reason alone would justify the gate; together they make it essential. This section develops both — the capability gap first, in its two sharpest forms, then the efficiency gain.

Presence, absence, and the finite certificate

A finite test can reveal the presence of a bug but never its absence — a point as old as structured programming. Simulation’s own metric concedes it: functional coverage measures only the bins the team thought to write. A behavior nobody imagined is one nobody wrote a coverpoint for, so a coverage report can read 100% while the unimagined corner ships unverified. Coverage closes the holes you can see; it is silent about the ones you cannot.

The natural objection to any universal claim is that it seems impossible to make in finite time: a block with a few hundred state bits has more reachable states than the observable universe has atoms, and no machine can visit them. Formal does not visit them. It produces a finite certificate that entails the whole space. For a safety property the certificate is an inductive invariant: a predicate (a Boolean condition on the state) that holds at reset, is preserved by every transition, and lies entirely inside the good region. If such a predicate exists, induction over time does the rest — every reachable state, for all time, satisfies the property — and the predicate is a finite formula, not a list of states. It is the fence this chapter returns to when it reaches interpolation (§3.3): a boundary that contains reset, that nothing can step across, and that never touches the bug.

The defining property of a certificate is that it is hard to find but cheap to check. Confirming that a candidate predicate is an inductive invariant is three bounded queries — true at reset, closed under one step, inside the good region — none of which unrolls time. Discovering the predicate is the whole art, and it is what the proof engines of §3.3 automate. This asymmetry is why a formal result can be trusted however it was produced: soundness rests on the small, simple checker, never on the elaborate search that proposed the certificate. The same asymmetry is what now lets a flow turn an unreliable assistant loose on the hard part without putting the proof at risk — the point this section closes on.

Failure in time: deadlock, livelock, and starvation

The clearest place the gap bites is concurrency. A deadlock is not a wrong value at a known cycle; it is the absence of a future — the system has entered a state from which no progress is ever again possible. A finite run cannot express that. After a million idle cycles a simulator cannot tell “stuck forever” from “about to proceed,” so teams fall back on a watchdog timeout: if nothing completes for $N$ cycles, flag a hang. A timeout is a guess, wrong in both directions — it fires on a merely slow path and stays silent on a deadlock that needs a stimulus the run never produced, or that lies deeper than the run reaches. It is not a proof of deadlock-freedom; it is an admission that sampling cannot produce one.

Formal supplies the proof, because in a finite-state system an infinite bad behavior has a finite witness. Any infinite run must revisit a state, so it must eventually cycle; if a progress property can fail, it fails on a lasso: a finite path from reset into a cycle that repeats forever without ever making progress. The lasso compresses an infinite failure into a stem and a loop an engineer can read. The three classic concurrency failures are one object under this lens, differing only in what “progress” means on the cycle — a deadlock is a cycle with no activity at all, a livelock is a cycle with activity but no completion, and starvation is a cycle in which the system as a whole advances while one client is never served. (Coffman’s circular wait is exactly a cycle in the wait-for graph; freedom from deadlock is the statement that the graph never closes into one.)

The certificate for the other direction — that progress does happen — is a ranking function: a measure on states, drawn from a well-founded order such as the natural numbers, that strictly decreases on every step until the awaited event. A well-founded order has no infinite descending chain, so the measure cannot fall forever, so the event must arrive. The round-robin arbiter of §2.5 carries one in plain sight: the number of grants the rotating pointer issues before it reaches a given client is a natural number, bounded by the client count, that drops every cycle the client waits and reaches zero when it is served — “fair because it rotates” made into a finite certificate of bounded wait. Such progress proofs almost always rest on fairness assumptions (every continuously enabled agent eventually acts), which is one more reason the property resists simulation: a testbench cannot assume fairness, it must construct fair schedules, while the formal tool simply restricts the search to them.

Emulation does not help here, and tends to hurt. Its purpose is to run realistic software at speed, and realistic traffic is precisely the well-behaved traffic engineered to avoid the adversarial interleaving a deadlock needs; the extra speed buys more well-behaved cycles, not the one pathological schedule. The interleaving that wedges the system is found by the engine that searches for it, not by the workload that steers around it.

Failure in space: arithmetic and the datapath

Concurrency defeats sampling because the property reaches across infinite time. Arithmetic defeats it for the opposite reason: the input space is too large to sample a meaningful fraction of. Exhaustive simulation of a datapath block is not slow but impossible, and the arithmetic is worth seeing (Table 3.1).

Table 3.1: Exhaustive simulation of a datapath block at a generous $10^{9}$ input vectors per second — emulator-class throughput, far above an RTL simulator. Universe age taken as ${\approx }\,4.4\times 10^{17}$ seconds.

Block	Input bits	Combinations	Time to exhaust
32-bit adder	64	$1.8\times 10^{19}$	${\sim }580$ years
64-bit multiplier	128	$3.4\times 10^{38}$	${\sim }8\times 10^{11}$ universe lifetimes
FP64 adder ($\times 5$ rounding modes)	128	$1.7\times 10^{39}$	${\sim }4\times 10^{12}$ universe lifetimes

A 32-bit adder already needs more than five centuries of continuous stimulus; a 64-bit multiplier, more vectors than could be applied in hundreds of billions of lifetimes of the universe.

Floating point is worse than its raw count suggests, for three reasons. It is not one operation but a family: each result depends on the rounding mode, of which IEEE-754 defines five, and on flush-to-zero and denormal controls besides. Its “correct answer” is itself a structured specification — the infinitely precise real result, correctly rounded, with exact handling of subnormals, signed zeros, infinities, quiet and signaling NaNs, and five exception flags — so even a checker that could keep up would have to be a correct floating-point unit in its own right. And the bugs do not live in the bulk a random sample hits first; they live in the measure-zero corners: the halfway cases that round to even, the boundary between the largest subnormal and the smallest normal, the catastrophic cancellation when two close values subtract, the carry that crosses the exponent field. Random stimulus reaches those corners with probability indistinguishable from zero.

This is not hypothetical. The 1994 Pentium FDIV defect was a handful of missing entries in a lookup table; it returned wrong results for roughly one floating-point divide in nine billion, escaped extensive testing precisely because the failing inputs were so sparse, and cost on the order of half a billion dollars to correct. A needle that small cannot be found by sampling a haystack that large, and it is why industrial floating-point units are signed off by proof rather than by vectors.

Formal does not win by enumerating faster; it declines to enumerate. Exhaustive simulation is, in effect, a certificate by brute force, and Table 3.1 is the size of that certificate. Formal replaces it with a far smaller one. Equivalence against a reference is settled at the word level — treating a 64-bit value as a single algebraic quantity and recognizing multiplication and addition as operations with known algebra, rather than bit-blasting them into hundreds of thousands of clauses (§3.5, and the sequential equivalence checking of §3.5). The deepest cases — a full IEEE unit, a novel algorithm — are closed once and for all by proving the algorithm correct by induction over its structure (§3.6 and its Booth-multiplier proof). One structural proof covers every input, every width, and every rounding mode at once.

The other questions sampling cannot answer

Time and space are the two sharpest cases, but the same existential ceiling blocks several everyday sign-off questions, each of which formal answers with a certificate and simulation cannot answer at all:

• Equivalence. “These two representations compute the same function for every input” — RTL against its synthesized netlist, a block before and after an engineering change, a pipelined implementation against its reference model. Re-running vectors after a change raises confidence; it never establishes equivalence (§3.5).
• Reachability and the negative claim. “This state is unreachable,” “these one-hot bits are never both set,” “this code is dead,” “this assertion can never fire.” Sampling can only fail to reach a state; it cannot prove the state unreachable. These are the everyday formal apps — connectivity, X-propagation, dead-code and range checks (§3.7).
• Security and non-interference. “A secret register never flows to an observable output under any sequence,” “a privileged region is unreachable from an unprivileged path.” Here the attack space is the entire input space — exactly what an exhaustive engine covers and a directed test cannot enumerate (§3.7).

Why emulation’s speed does not close the gap

It is tempting to think emulation, being orders of magnitude faster than simulation, eventually samples enough of the space to count as exhaustive. It does not, for the reason the datapath table already shows: emulation’s speed-up is a constant factor, and the spaces in question grow exponentially. A thousandfold faster machine turns four trillion universe-lifetimes into four billion — still not a verification. As designs grow, the reachable state space outruns any throughput improvement, so the gap widens with each generation rather than closing.

The deeper point is that emulation answers a different question. It is an instrument of validation — does the real system, running real firmware and operating-system code, behave acceptably across realistic scenarios? — and that is a legitimately existential question, asked of exactly the workloads that matter. Verification — does the design meet its specification on every input? — is universal. Using a validation instrument to make a verification claim (“we ran fifty workloads and saw nothing”) is a category error: excellent validation, and no verification at all.

This is why the flow keeps all three engines rather than choosing one — the division of labor introduced with Chapter 1’s three-gate pipeline (§1.6). Formal owns block- and subsystem-level exhaustiveness and the categorical claims above; constrained-random simulation owns the dynamic interaction of components at moderate scale; emulation owns full-SoC behavior under real software at speed. Each does what the others cannot. The boundary between them moves — formal’s reach has been climbing the hierarchy, as this section’s closing subsection describes — but it moves; it does not collapse to a single tool.

The executable specification: one artifact across every gate

Everything above describes formal as a prover. Its quieter and larger payoff is what it leaves behind: an executable temporal specification. The properties written for the proof — the assertions a design must uphold and the assumptions it may rely on — are not consumed by the formal run. The same SystemVerilog Assertions execute as monitors in simulation and, in their synthesizable subset, as hardware checkers in emulation. One artifact, written once, is checked three ways: proven exhaustively by formal, watched continuously in simulation, carried into emulation at speed. This is Assertion-Based Verification, and it is what makes the three gates one system rather than three silos.

It speeds every form of verification, because debugging is dominated not by the fix but by localization — finding where and when a run first went wrong. An end-to-end check fails far downstream of its cause, thousands of cycles and many blocks away, and the engineer traces backward across that whole distance. An assertion is a tripwire planted at the cause: it fires at the cycle the rule breaks, in the block where it breaks, naming the rule that broke, and the long backward search collapses to reading the first assertion that fired. The same property set also sorts the two kinds of failure automatically: reused in simulation, an assumption that fires indicts the testbench — illegal stimulus, a mis-driven input, a wrong reference model — while an assertion that fires indicts the design. The bug is classified as environment or device the moment it appears, which is exactly the time a long emulation run would otherwise spend discovering that the “failure” was a bad test. And when formal does the finding, the trace it returns is the shortest one from reset, free of unrelated activity: the witness arrives pre-localized.

The same artifact is what makes late change cheap to sign off again. A specification captured as a stable set of contracts does not move when the architecture beneath it does. When a block is re-pipelined, a clock domain is added, or an interface is re-timed late in the schedule, the properties are re-proven and the assertions re-run against the unchanged intent; what still holds is signed off automatically, and what breaks points directly at the contract the change violated. Confidence is re-established by re-checking a certificate, not by rebuilding a testbench — the difference between an afternoon and a milestone. This is the property that lets the near-zero-bug flow of Chapter 1 absorb continuous specification change rather than slipping under it, and day to day it is the executable specification, more than any single proof, that does the work.

Why formal now reaches the subsystem and the SoC

For most of its history formal was confined to blocks — not because the engines could not check larger designs, but because the proven context that makes a large proof converge could not be carried across the cuts by hand. A block proved in isolation sees its inputs as unconstrained and wanders an environment the real system never produces; hand it the already-proven facts of the blocks around it and the same proof closes. Reaching the subsystem and the SoC is therefore less about cutting the design into pieces — that decomposition is only the scaffold — than about propagating proven knowledge along it: threading each closed property into the blocks that depend on it, and holding a protocol’s contract and a design’s global assumptions consistent across hundreds of interfaces while the RTL churns underneath. That bookkeeping was the bottleneck: its cost grew with the size of the design and the team, not with the engines. It is exactly the cross-interface labor large language models now take over — proposing the cuts and helper invariants, propagating the proven facts, and keeping the whole contract web consistent as the design moves — while the engineer keeps the strategy and the spec. The automation adds reach without adding risk, because the prover still checks every step: a propagated assumption the context does not justify cannot be quietly absorbed — it surfaces as an undischarged obligation, never a silent pass. The unreliable proposer cannot corrupt the proof, because soundness rests on the checker; §3.2 develops this propagation in full.

This is the same pairing of logic and probability that Chapter 1 draws between formal methods and modern AI (§1.6), developed in full as the two siblings of mathematics at the close of §3.6: two models of inference with complementary strengths — probability reaches under ambiguity but cannot guarantee, logic guarantees but cannot guess — composed so each covers the other’s weakness. The result is that formal sign-off has begun to extend from units and blocks to integration-level subsystems and full-SoC scope. The shift is recent and accelerating, and its character matters: the bottleneck it removes scaled with scarce engineering hours, so formal’s reach now tracks the capability of the models rather than the size of the team.

One worry deserves a direct answer: a proof can itself be wrong — through an over-tight assumption that passes a property vacuously, a specification that states the wrong intent, or an unsound abstraction. But these failures are named, bounded, and auditable: the assumptions are written down and reviewable, vacuity is detectable, and the proof’s scope is declared. Sampling’s residual risk is the opposite — unnamed, unbounded, and invisible, hiding in the part of the space no run reached while the coverage report reads complete. The honest comparison is not infallible proof against fallible test, but risk one can see and sign off against risk one cannot point to. It is why the near-zero-bug tape-out of Chapter 1 is at bottom a universal claim — no bug of any known class remains in the verified scope — and why only formal can make it. Without formal the goal is not merely harder to reach; it cannot even be stated.

3.2 The Model-Checking Verification Flow

Formal verification is not a single activity. It is a pipeline that starts at the natural-language specification and ends at sign-off, with each stage producing artifacts the next consumes (Figure 3.2).

Stage 1 — Specification capture. A natural-language specification is parsed into a small number of intent statements: the protocol guarantees the design must uphold (every request is eventually granted; in-flight transactions are never dropped; reset disables all outputs; the arbiter is fair). The intent statements are not yet machine-checkable — they are the bridge between English and SVA. Modern flows use LLMs to translate intent statements into candidate SystemVerilog Assertions (SVA) or PSL (Property Specification Language) properties, which a human reviews and a tool elaborates against the RTL signal names.

The artifact of this stage is more than a formal-tool input; it is an executable specification, and the most reusable thing the whole flow produces. Once a property is bound to the design, the same SVA fires in both formal and simulation contexts, giving the team Assertion-Based Verification (ABV) without additional infrastructure. That one artifact is what shortens the debug cycle, sorts testbench bugs from design bugs at the instant of failure, and makes late architectural change cheap to re-sign-off — the payoff developed in §3.1. It also doubles as living documentation: properties say what the design promises in a form that does not rot the way English docs do.

Stage 2 — Block-level proof. Each RTL block — a FIFO, an arbiter, a state machine, a datapath unit — gets its property set elaborated against the module boundary. The formal engine (§3.3) runs in three escalating modes: first BMC (Bounded Model Checking) to a bounded depth (find bugs fast), then $k$-induction with helper invariants where induction does not close, then IC3/PDR (Property-Directed Reachability) for unbounded proofs; liveness obligations such as deadlock- and starvation-freedom take the complementary fair-cycle and ranking-function route of §3.4. The artifact of this stage is a per-block proof certificate or a counterexample trace. A counterexample is also valuable — it is a bug report with a minimal stimulus sequence and zero noise.

Stage 3 — Module composition with assume–guarantee. Once each block proves locally, neighboring blocks compose by mutual contracts: block A’s outputs are the assumptions of block B’s environment, and block B’s inputs are the guarantees of block A. Chapter 2’s asynchronous handshake controller showed the pattern in miniature. At sub-system scale it becomes a chain: producer’s guarantee $\to $ consumer’s assumption $\to $ consumer’s guarantee $\to $ next-consumer’s assumption, and so on. The artifact of this stage is a composition certificate showing every assumption is discharged by a proven guarantee somewhere upstream.

Stage 4 — Subsystem and SoC. At subsystem and SoC scale the state space is large enough that direct property checking fails. Two moves rescue the flow — abstraction (Move 1) collapses irrelevant detail, and LLM-propagated contracts (Move 2) thread each proven fact to the blocks that need it, over an assume–guarantee decomposition of the proof into a tree of interfaces. The two are partners, not alternatives: the decomposition fixes the proof boundaries; the abstractions make each piece tractable inside those boundaries. Chapter 2’s design examples have already shown most of the toolbox in context; Move 1 names it as a single catalogue, and Move 2 develops the SoC-scale composition flow. Read Move 1’s catalogue as a map of ground already walked, not new terrain — most are moves Chapter 2’s examples already made, gathered here so the rest of the book can cite them by name.

• Move 1 — Abstraction:
- – Cone-of-influence (COI) — drop every signal and register that does not fan in to the property’s cone. Every modern formal tool applies COI automatically before any other engine runs; see Figure 3.3.
  
  Figure 3.3: Cone-of-influence abstraction.
- – Data abstraction (Wolper symbolic token) — replace a wide data path with a single symbolic value tracked through the design. Introduced in the FIFO of §2.7 (deep model) with the soundness argument.
- – Symmetry reduction — prove the property for a symbolic representative of an $n$-way replicated structure rather than for all $n$ concrete instances. Introduced in the round-robin arbiter of §2.5.
- – Counter abstraction — replace deep initialization / refresh / watchdog counters with an expected-counter FSM of $\mathcal {O}(1)$ buckets. Demonstrated on three interacting counters in the memory controller of §2.10; see Figure 2.29.
- – Memory abstraction — prove read-back properties without representing every entry: four rigid free variables track four arbitrary-but-stable addresses, and the conclusion generalizes by universal quantification. Worked SystemVerilog in §2.10.
- – Symbolic constants / free variables (localization) — replace a wide upstream signal or a generate-loop index with a single arbitrary-but-stable witness; the proof generalizes by symmetry. The arbiter’s symbolic client and the memory controller’s fv_active_addr are both instances.
- – Predicate abstraction — the most general form: instead of a register’s exact value, track only the truth of a few Boolean predicates (yes/no questions about the state) the property depends on, collapsing every concrete state into the pattern of answers it gives. Done by hand in the MSI coherence controller of §2.9, whose M, S, and I states are the chosen predicates and whose multi-node coherence proof is reduced to a single representative node by local-node abstraction; that section also develops CEGAR, the counterexample-guided loop that discovers the predicates automatically when they are not obvious (Figure 2.27).
• Move 2 — Probability proposes, logic verifies (LLM-driven propagation of knowledge across interfaces). It is worth being explicit about why an unreliable engine can be trusted with the hard part of this move. An LLM is probabilistic — its proposals, however fluent, can be wrong — and formal verification is the sibling that makes leaning on it safe: a logic engine that cannot guess but also cannot be fooled, so a hallucinated cut, invariant, or contract is corrected by the proof rather than shipped. That corrector is the advantage no dynamic method has: in simulation and emulation the oracle is the testbench and the reference model, themselves as fallible as the design — and as fallible as anything an LLM would write to build them — so nothing sound stands ready to catch the model when it errs. The asymmetry compounds. As the prover’s reach climbs with the model doing the proposing, formal absorbs the property-checkable work that dynamic verification used to carry; the trajectory points to a smaller slice of DV — much of it repurposed to feed emulation rather than to hunt bugs itself — while emulation carries the proven design through system workloads to sign-off. The mechanism that does the absorbing is the rest of this move.

The decomposition framework is assume–guarantee, shown in miniature by the asynchronous handshake controller of §2.6: two blocks across one interface, each proving what it guarantees under what it is allowed to assume. At SoC scale, though, the decomposition is not where the leverage lies — it is in what flows across the cuts. A block proved in isolation sees its inputs as unconstrained, so the engine wanders an over-approximated environment full of stimuli the real system never generates: spurious counterexamples to chase, reachable depths that never settle. Hand that same block its proven context — the end-to-end intent it must meet, the contracts of the IP it talks to, the already-discharged properties of the blocks beneath it — as assumptions on its inputs and lemmas inside its proof, and the environment collapses to what the system can actually do. The local proof now runs with its context in place, and what would not converge in isolation closes in seconds. This propagation of proven knowledge — instantiating a protocol’s assume/guarantee set at every interface that speaks it, threading a fact established in one block into every block that relies on it, carrying each closed property forward as a lemma for the next — is what converges a subsystem, and then an SoC. The cut structure is just the scaffold the knowledge travels along.

Built by hand, that propagation was the wall. The per-block proofs were seldom the hard part; the cost was the human labor around them — choosing where to cut, planning the divide-and-conquer, writing each assumption correctly, and holding assumptions and assertions consistent across hundreds of interfaces while the design churned underneath, labor that left teams fighting to converge even block-level proofs. That bookkeeping is what an LLM now takes over: it proposes the cuts and the helper invariants, propagates the proven facts, and — the part that used to break under churn — keeps the whole contract web consistent as the RTL moves. The engineer still owns the strategy and the spec; the model does the cross-interface labor that used to cap how far formal could scale.

None of this loosens the proof — it changes where the rigor sits. The classic assume–guarantee discipline was a bottom-up handshake, the pre-LLM idea: prove each block’s guarantee, then discharge each neighbor’s assumption against it, one interface at a time, by hand. Propagation subsumes that handshake by letting knowledge travel in whatever direction the proof needs it, from wherever a specification is anchored. The SoC’s end-to-end intent flows down into each block as exactly the assumptions and assertions it must meet; a well-specified inner IP propagates its contract outward to every block that interfaces with it; and a deep inner block’s specification can propagate up into the subsystems built on top of it. A block’s proven assertions become the context its neighbors inherit, and the same proven knowledge carries across multiple blocks, through their subsystems, and eventually closes the SoC — with no separate guarantee step to negotiate and discharge along the way. The propagated context is what confirms each proof: an assumption that context does not actually justify cannot be quietly absorbed; it surfaces as an undischarged obligation, an unclosed proof, never a silent pass. Closing each proof from the complete end-to-end context, rather than stitching it together from locally-negotiated guarantees, is what converges an SoC proof in practice. What changed in the last couple of years is the tractability: propagating the full specification across hundreds of interfaces — once the bottleneck — is exactly the cross-interface labor the LLM now makes routine, and a cut that once took a senior engineer an afternoon of trace-staring closes in minutes.

Together the two moves are what take formal sign-off from individual blocks to subsystem and SoC scope — the trajectory developed in Stage 5 below.

Stage 5 — High-confidence RTL freeze. The flow ends when every intent statement has a discharged proof and every assumption traces back to a guarantee. The result is the near-zero-bug tape-out that Chapter 1’s Emergent Verification thesis (§1.6) is built around — not absolute zero, but a sign-off bar where every known class of bug is provably absent within the verified scope, and the residual risk is small enough, and benign enough, that first-pass silicon success becomes the norm rather than the exception.

Formal verification is central to making that goal real, and the trajectory across the last several years has been a marked step up in what it can reach. With LLM-assisted property generation, helper-invariant synthesis, counterexample triage, and partition proposal, the engineer’s time stops being absorbed by the proof-bring-up overhead that historically capped how much formal a team could afford. With the abstraction toolbox of Stage 4 — COI, data abstraction, symmetry, counter and memory abstraction, symbolic constants and free variables, predicate abstraction — the same engine reaches state spaces that were out of reach a few years ago. With LLM-propagated contracts composing the block proofs into a system proof, formal sign-off has moved from just units and blocks to integration-level sub-systems and full SoC scope.

Concurrency bugs at the SoC level — protocol races between coherence-controller nodes, cross-clock-domain handshake ordering, fabric back-pressure deadlocks, late-pipeline forwarding hazards under exotic stall sequences — are now captured by formal with comparatively modest effort. In the previous generation of methodology these bugs were either out of formal’s reach entirely or were caught only by lucky constrained-random simulation that happened to stumble on the exact stimulus pattern — and frequently they were not caught at all and showed up as silicon escapes.

The same propagation is beginning to reach past the gates entirely. A proven fact does not care whether the layer above it is more hardware or the firmware, compiler, and software stacked on top: the RTL’s discharged guarantees become the contract a custom driver is checked against, that driver’s contract becomes what a workload may assume, and a specification at any layer can be checked for consistency with the contracts above and below it rather than trusted on faith. None of this is free — software and firmware carry their own hardness (unbounded state, pointers, concurrency), and a layer is only ever as sound as the proofs beneath it — but every layer already proven becomes context that makes the next one tractable instead of harder. Prove the hardware, and a system-level misbehavior that survives must live in the driver or in the contract between them: formal checks the custom driver against the now-trusted hardware guarantee and localizes the bug to the line that breaks it. That role is a natural continuation of one the formal team already holds: they already own the executable specification, captured as temporal logic, so the contracts already live with them in formal form. What changes is the span — less “prove this block,” more keeper of consistency across every layer, from the workload down to the gates — which is exactly the cross-boundary bookkeeping an LLM makes tractable.

The leftover risk after a clean freeze is in the spec (did the team name every intent statement that mattered?) and at the boundary between formal and emulation, where the system-level verification gate of Chapter 1’s pipeline takes over. The signoff artifact (§3.9) makes both of these boundaries explicit, so the residual risk is documented rather than hidden.

3.3 Proof Engines: SAT, BMC, $k$-Induction, Interpolation, IC3/PDR

The workflow above names when each engine runs; this section opens the hood and shows what each engine does. Five algorithms close the proofs that modern formal tools depend on — CDCL (Conflict-Driven Clause Learning) for combinational SAT (Boolean Satisfiability), BMC (Bounded Model Checking) for finite-depth bug hunting, $k$-induction for the first attempt at unbounded proof, Craig interpolation for the fence $k$-induction was guessing at, and IC3/PDR (Property-Directed Reachability) for the backward-search engine that closes most industrial proofs without ever unrolling the design. We walk all five on two running examples, a matched pair: an elevator interlock whose safety is easy to prove, and a FIFO overflow guard whose safety is hard — and the gap between them is what forces each next engine into existence.

Satisfiability and validity — one duality, two questions. Before the engines, one observation that unifies everything in this and the next two sections. A formula $F$ is valid (true under every assignment) if and only if its negation $\neg F$ is unsatisfiable (true under no assignment). Verification engineers care about validity — the property must hold for every input — but solvers chase satisfiability, because finding one witness is mechanical while ruling out all assignments is not. The verification flow exploits the duality: take the property $P$, negate it, hand $\neg P$ to a SAT/SMT engine, and read UNSAT as the proof that $P$ holds and SAT as the counterexample witness. Every engine in this section is fundamentally a witness-finder; the framework just runs each one on the negation of what we actually want to prove.

CDCL: the detective inside every SAT solver

A SAT solver answers exactly one question: is there an assignment of true and false to the variables that makes a given Boolean formula evaluate to true? For a hardware engineer that question reads more naturally as: is there an input pattern that drives this combinational circuit’s output to one? A gate-level netlist is a Boolean formula. A Boolean formula is a gate-level netlist. The two pictures are interchangeable.

The algorithm that does the search is CDCL — Conflict-Driven Clause Learning. It works like a detective with four moves:

• Decide. Pick an unassigned variable and guess a value.
• Propagate. If any clause now has all literals false except one, the survivor is forced. Cascade these forced assignments through the formula until nothing more is forced.
• Conflict. If a clause becomes all-false, the current assignment is dead.
• Learn. Walk back from the conflict through the chain of implications. Find the smallest subset of decisions that caused the conflict. Add the negation of that subset as a new permanent clause to the formula. Backtrack past the bad decision.

The trick is the learn step — the one move that separates CDCL from its predecessor. The older DPLL (Davis–Putnam–Logemann–Loveland) algorithm shares the first three moves exactly — decide, propagate, detect a conflict — but on a conflict it simply flips the bad decision and tries the other branch, learning nothing. CDCL instead extracts a permanent rule from every failure — it accumulates lessons the way an experienced designer accumulates rules of thumb.

CDCL works on exactly one shape of formula, conjunctive normal form (CNF): an AND of clauses, where each clause is an OR of literals — a variable or its negation, as in $(x \vee \neg y \vee z)$. A clause demands that at least one of its literals hold; the whole formula is satisfied only when every clause does. A single gate is where clauses come from. Take a two-input AND, $y \leftrightarrow (a \wedge b)$; it becomes clauses by a chain of equivalences:

\[ \begin {aligned} y &\leftrightarrow (a \wedge b) && [\,\text {the gate}\,]\\[2pt] &\equiv (y \to (a \wedge b)) \;\wedge \; ((a \wedge b) \to y) && [\,{\leftrightarrow }\ \text {as two}\ {\to }\,]\\[2pt] &\equiv (\neg y \vee (a \wedge b)) \;\wedge \; (\neg (a \wedge b) \vee y) && [\,p \to q \equiv \neg p \vee q\,]\\[2pt] &\equiv (\neg y \vee a) \wedge (\neg y \vee b) \wedge (y \vee \neg a \vee \neg b) && [\,\text {distribute, De Morgan}\,] \end {aligned} \]

A satisfying assignment makes all three clauses true at once — $a$ and $b$ both on, with $y$ on — and a clause set with no satisfying assignment is UNSAT, the answer a proof wants: the forbidden combination cannot occur.

Why CNF, not some other form. Why target this particular form rather than the equally expressive alternatives? Two properties make CNF the shape every SAT engine wants, independent of how the formula is produced:

• CDCL is defined on clauses. A unit clause (one literal) immediately forces an assignment; propagation cascades these forcings through the clause set; and the learn step produces the new fact by resolution between two clauses on a shared literal. Every piece of the engine — unit propagation, watched-literal indexing, conflict analysis — assumes the formula is a flat conjunction of disjunctions. Replace CNF with anything else and the algorithm has to be reinvented.
• CNF favors UNSAT, which is what verification asks. The dual form DNF (disjunction of conjunctions) makes satisfiability trivial — pick any conjunctive cube and you are done — but makes unsatisfiability hard. Verification queries are the other way round: most of them are UNSAT proofs (“the property holds, the bug state is unreachable”), and CDCL’s resolution-based proof system lives natively in CNF. We want the form where “impossible” is the cheap answer.

From netlist to clauses: the Tseitin transformation. A gate-level netlist already is a Boolean formula, but not yet in CNF — and the obvious way to get there is a trap. Converting an arbitrary formula to CNF by the distributive law $A \vee (B \wedge C) \equiv (A \vee B) \wedge (A \vee C)$ blows up exponentially in formula depth, hopeless for any real design.

The Tseitin transformation avoids the blow-up with a linear-size encoding: introduce one Boolean variable per wire, and for every gate emit a handful of clauses that pin the output variable to the gate’s truth-table relation — exactly as the AND gate above became three clauses. A two-input AND $y = a \wedge b$ is three clauses ($\neg y \vee a$, $\;\neg y \vee b$, $\;y \vee \neg a \vee \neg b$); an XOR is four; a 2:1 mux $y = (\neg s \wedge a) \vee (s \wedge b)$ is six. One fresh variable per gate and a constant number of clauses per gate, so a million-gate netlist Tseitin-encodes to a few-million-clause CNF, not $2^{n}$ clauses — the only reason industrial SAT is feasible at all, and exactly the form CDCL eats. The fresh per-gate variables make the CNF equisatisfiable with the design rather than logically equivalent to it: it has a satisfying assignment exactly when the design does, which is all a SAT query needs.

The two examples we will use throughout

Every engine in this section needs a design to work on, so we fix two and carry both through all five, beginning with the CDCL runs below and then the four engines that build on them. Two rather than one, because what makes a proof easy or hard is exactly what differs between them. The elevator is safe for a local reason — its property holds because of what the logic does in a single step — so it is the easiest to picture, and every engine settles it at once. The FIFO is safe for a global reason: its property rests on a relationship the RTL never states, one that holds in every reachable state yet that nothing in the single-step logic enforces — so the simpler engines stall on it, and only the heavier ones close it. That gap is the whole subject of the section, and no single design shows both sides of it. Both are deliberately tiny: real proofs close on CPUs, cache-coherence protocols, and floating-point datapaths, and against those even a production block is modest — these two are about the smallest designs on which each engine still has something to show. The elevator is the one to picture, the FIFO the one to run.

The one to picture: an elevator interlock (Figure 3.6). Everyone already knows the rule a lift obeys, which makes it the cleanest thing to hold in your head while an engine works. The controller keeps four state elements — moving (the motor is engaged), door_open, and the current and requested floors floor[1:0], target[1:0]. It moves toward its target only while the door is shut, and opens the door only once stopped at the target. The property is the interlock every passenger trusts: the car never moves with the doors open — $\neg (\mathtt {moving} \wedge \mathtt {door\_open})$ at every cycle. It holds for a reason you can see at a glance: the next-state of moving requires the car has not yet arrived ($\overline {\texttt {at\_target}}$), the next-state of door_open requires it has ($\texttt {at\_target}$), and one wire cannot be both. The floor registers decide only whether the car has arrived; their actual values never bear on the interlock — a detail generalization will use later.

A real conflict, from the elevator’s own gates. Run Tseitin on the next-state logic of Figure 3.6 and ask the natural question: is there any state that makes the next state unsafe — both $\texttt {moving\_nxt}$ and $\texttt {door\_open\_nxt}$ on at once? If the answer is UNSAT, the interlock cannot be broken in one step — which is exactly the property we want.

The transformation draws a clean line through the signals. The current state — moving, door_open, floor[1:0] and target[1:0] — stays as the original variables: the free inputs the solver searches over. Each gate output then gets one fresh Tseitin variable naming its wire, and here there are exactly three (a $\neg $ gate needs none of its own — a negated wire is already free in CNF, an inlining the per-gate packets of Figure 3.5 did not yet apply):

• $\mathtt {at\_target}$ — the comparator floor == target, shared by both next-state gates.
• $\mathtt {moving\_nxt} = \neg \,\mathtt {at\_target} \wedge \neg \,\mathtt {door\_open}$ — move only while not yet arrived and the door is shut.
• $\mathtt {door\_open\_nxt} = \mathtt {at\_target} \wedge \neg \,\mathtt {moving}$ — open only once arrived and stopped.

Each gate contributes the handful of clauses that pin its variable to that relation, and the negated safety property adds two unit clauses — the goal, moving_nxt$=1$ and door_open_nxt$=1$. The derivation DAG below carries each gate’s Tseitin variable in brackets. Both next-state gates read at_target, and that single shared variable is where the proof turns.

CDCL closes the question by pure propagation — no decision is ever made. The chain reads as a hardware story:

1. Assume the next state is unsafe: moving_nxt $=1$ and door_open_nxt $=1$ (the two goal units).
2. Move logic fires the first lead: for moving_nxt to be $1$, the car must not yet be at its target, so $\texttt {at\_target} = 0$.
3. Door logic fires the second lead: for door_open_nxt to be $1$, the car must have arrived, so $\texttt {at\_target} = 1$.
4. Conflict. Step 2 demanded $\texttt {at\_target} = 0$; step 3 demands $\texttt {at\_target} = 1$. The same wire cannot be both at the same instant.

The detective walks the implication graph backward to the assumptions that drove the conflict. Both forced values of at_target trace back to the two goal units, and to nothing else the solver chose for itself. CDCL learns the clause

\[ \neg \mathtt {moving\_nxt} \;\vee \; \neg \mathtt {door\_open\_nxt} \]

which reads, in English: the next state can never be unsafe. The clause is permanent. Any future SAT query that reuses this clause base will see $\neg (\texttt {moving\_nxt} \wedge \texttt {door\_open\_nxt})$ as a hard constraint and never re-derive the chain.

That conflict arose at decision level zero — no decision was ever made — so there is nothing to backtrack and the search is already over. What certifies the result is the refutation: the learned clause resolves against the two goal units, pivot by pivot, down to the empty clause, the clause that no assignment can satisfy. Reaching $\square $ is what “UNSAT” means.

That single learned clause is the whole interlock proof: because the bad combination cannot be produced from any current state, it cannot be produced from a reachable one either — the elevator is safe, and CDCL settled it in one conflict with no search at all. IC3/PDR will confirm exactly this fact in a single frame at the end of the section; same conflict, same fact, different orchestration.

The one to run: a FIFO overflow guard (Figure 3.10). The elevator is a picture; the FIFO is RTL you can type into a file, hand to a formal tool, and watch the verdict come back.

A FIFO of depth $4$ keeps a count[2:0] of live entries, a registered full flag, an array of stored payloads, and the read/write pointers that index it. A write is accepted only when $\overline {\texttt {full}}$, a read only when $\overline {\texttt {empty}}$, and a write and a read may arrive in the same cycle. The property is the FIFO never overflows — $\mathtt {count} \le 4$ at every cycle. The part worth watching is the full flag: for timing it is registered and maintained incrementally — set when a write fills the last slot, cleared when a read drains it — not recomputed as “$\mathtt {count}{=}4$” each cycle. Safety then rests on a relationship the RTL never states outright: $\mathtt {full} \Leftrightarrow (\mathtt {count}{=}4)$. Let it slip — $\mathtt {full}{=}0$ while $\mathtt {count}{=}4$ — and a write sails through at capacity and the count rolls past the depth. The stored payloads and their pointers, exactly like the elevator’s floor numbers, have nothing to do with whether the guard holds.

The same question, a different answer: the FIFO. Run the identical one-step query on the FIFO of Figure 3.10 — is there any state from which one step pushes count past $4$? The encoding follows the same recipe as the elevator: one Tseitin variable per gate, each gate adding the clauses that say what it computes. Read as the hardware constraints a designer would see on a review:

• Accept logic: if push_ok $=1$, then push $=1$ and full $=0$. (A write goes through only when one is requested and the FIFO reads not-full.)
• Count update: count_nxt $=$ count $+$ push_ok. (The full update is $+\,\texttt {push\_ok}-\texttt {pop\_ok}$; a pop only lowers the count, so it cannot overflow and the query drops it.)
• Overflow: if overflow $=1$, then count_nxt $\ge 5$. (Overflow is the next count stepping past the depth-$4$ capacity.)
• Goal (one unit clause): overflow $=1$.

The derivation tree exposes the trap the elevator’s encoding did not hide. The accept gate push_ok reads push and $\neg $full, but never count — no edge runs from count into it. Nothing in a single frozen instant ties the full flag to the actual occupancy, so the solver is free to hold full$=0$ while count already sits at $4$, and a write sails through at capacity.

That missing dependence is the whole difference. Where the elevator’s shared comparator forced a contradiction, the FIFO’s freedom leaves room for a satisfying assignment: the query comes back SAT, with the witness $\mathtt {count}{=}4 \wedge \mathtt {full}{=}0$ — the write accepted at capacity, count rolling to $5$. The gates permit it. But is that state reachable? CDCL is combinational; it sees one frozen instant, with no memory of how the design got there, and cannot say. That gap — a SAT witness that may or may not be reachable — is exactly the crack the sequential engines spend the rest of this section trying to close. The elevator’s safety was settled in one step; the FIFO’s is still an open question.

Getting to that witness is the more interesting run. The elevator was closed by propagation alone — no choices, one conflict at decision level $0$. The FIFO query is satisfiable, but the solver does not simply see the answer; it has to search, hitting conflicts, learning from them, and backjumping before it lands on the model. The shape of that run is the part worth drawing.

The run splits into two conflicts. The first is a wrong guess: the solver tries $\mathtt {count[2]}{=}0$, which holds count below $4$ and makes overflow impossible, so the assumed goal collides with the propagated $\mathtt {overflow}{=}0$ at the conflict node $\kappa _1$. The analysis learns the one-literal clause $(\mathtt {count[2]})$ — count’s high bit must be set — and backjumps; $\mathtt {count[2]}{=}1$ is now forced, not guessed. It enters the second conflict at the top right, directly beneath the lesson that produced it: there the solver guesses $\mathtt {full}{=}1$ and then the low bits to assemble $\mathtt {count}{=}4$ — the reachable configuration, with the full flag correctly set. Propagation drives $\mathtt {push\_ok}{=}0$, $\mathtt {count\_nxt}{=}4$ and $\mathtt {overflow}{=}0$ into a second collision with the goal, at $\kappa _2$.

Where CDCL learns from $\kappa _2$ turns on the unique implication point (UIP): an assignment at the current decision level that every path from that level’s decision down to the conflict must pass through — a bottleneck that dominates the conflict. A conflict can have more than one, numbered outward from the conflict; the figure marks the two that matter. $\mathrm {UIP}_1$ is $\mathtt {count\_nxt}{=}4$, the propagated assignment nearest $\kappa _2$ that drives the contradiction; $\mathrm {UIP}_2$ is the decision $\mathtt {count[0]}{=}0$, further back. CDCL learns at the first, $\mathrm {UIP}_1$ — the one closest to $\kappa $ — because it is the most local explanation of the conflict and gives the shortest backjump: the learned clause sends the solver back only to level $1$, asserting $\mathtt {count\_nxt}{\ne }4$ — under which the choice $\mathtt {full}{=}1$ no longer leads anywhere and the search revises full to $0$. The remaining assignment is then forced — $\mathtt {count}{=}4 \wedge \mathtt {full}{=}0$ rolls count to $5$, and the derived $\mathtt {overflow}{=}1$ now agrees with the assumed goal instead of colliding with it. Nothing is left violated, so the same goal unit that produced a conflict above closes the search the other way: SAT.

That is the heart of CDCL. The elevator’s small formula closed by propagation alone; the FIFO needed the full search — decide, conflict, learn, backjump — and on industrial netlists with millions of gates that same search carries everything, every conflict a permanent fact that prunes a region of the space. Tens of thousands of learned clauses later, what looks combinatorially infeasible becomes seconds of wall time. CDCL is the combinational core every other engine in this chapter is built on.

The FIFO has left a question CDCL cannot answer. Its witness $\mathtt {count}{=}4 \wedge \mathtt {full}{=}0$ satisfies the one-step query — yet CDCL sees only that one frozen instant, with no way to say whether the design can ever reach that state from reset. The elevator needed no such follow-up: its bad combination was impossible from any state, reachable or not, so one conflict closed it. The FIFO’s is possible from one specific state, and whether the design can get there is a question about time — exactly what the next four engines answer. Each wraps a different temporal argument around the same combinational core, and each, underneath, hands its query back to CDCL. We take them in order of increasing reach: BMC (bounded and exact), $k$-induction (the first unbounded attempt), Craig interpolation (the fence carved from the refutation), and IC3/PDR (backward reachability that never unrolls).

BMC: unroll time, hand the result to CDCL

The most direct way to settle the FIFO’s open question — can the design actually reach $\mathtt {count}{=}4 \wedge \mathtt {full}{=}0$? — is to walk forward from reset and look. CDCL cannot do that on its own: it answers combinational questions (given a circuit with no registers, is there an input that drives the output to one?), while reachability is sequential — the FIFO’s count at cycle $j$ depends on its value at $j-1$, and the guard must hold across all cycles, not one frozen instant.

Bounded Model Checking (BMC) bridges the two with one trick: copy the combinational logic once per clock cycle and chain the flop outputs of cycle $i$ into the flop inputs of cycle $i+1$ (Figure 3.14). The unrolled chain is one large combinational circuit spanning $k$ cycles — exactly the shape CDCL eats. The question BMC hands it is sharp: starting from reset, can the design take $k$ legal steps and stand in a bad state at the end? CDCL reads back SAT as a concrete $k$-cycle counterexample trace, UNSAT as a clean bill of health for the first $k$ cycles.

BMC anchors every trace at reset — and that single anchoring is what makes its answer differ from the combinational one-step query above. From reset both designs are safe, so both grind out the same unsatisfiable answer at every bound, and neither can be closed.

• The elevator. Reset leaves the car stopped at floor $0$ with the door shut. At bound $0$, is $\texttt {moving} \wedge \texttt {door\_open}$ true in the reset state? No. At bounds $1, 2, 3, \ldots $ the car travels, arrives, opens, closes, and travels again — the interlock holds at every step, so every query is UNSAT. The bug exists at no bound.
• The FIFO. Reset leaves the FIFO empty: $\texttt {count}{=}0$, $\texttt {full}{=}0$ — a consistent state, where the flag and the count agree. Unrolled forward from there, the incremental full logic keeps $\texttt {full}\Leftrightarrow (\texttt {count}{=}4)$ true cycle after cycle, so a write is blocked exactly at capacity and count never reaches $5$. Every query is UNSAT. Note what BMC never reaches: the garbage state $\texttt {count}{=}4 \wedge \texttt {full}{=}0$ that the one-step query produced so easily. It is unreachable from reset — so BMC, anchored at reset, never finds the overflow the combinational query found. Good news for the design; cold comfort for the proof.

Both designs are now stuck for the same reason. Each query at bound $k$ creates one more copy of the design in memory. If a bug existed at cycle $100$, BMC would find it in milliseconds; if at cycle $10{,}000$, in seconds. But both properties are true — the bug exists at no cycle — and BMC has no way to know it can stop. The engine just keeps escalating $k$. By bound $1{,}000{,}000$ the chained SAT formula contains tens of millions of variables and the server runs out of RAM. State-space explosion.

The explosion has a precise algorithmic shape (Figure 3.15). Building the BMC formula is cheap: each unroll step appends one more copy of the combinational logic to the SAT formula, so the formula grows linearly in the bound $k$ — size $O(k \cdot |\mathrm {design}|)$. Solving it is the expensive direction: in the worst case, the SAT solver’s work grows roughly as $2^{(k+1)\,n_{\mathrm {state}} + k\,n_{\mathrm {input}}}$ — the number of assignments to the $(k{+}1)\,n_{\mathrm {state}} + k\,n_{\mathrm {input}}$ Boolean variables the unroll introduces ($k{+}1$ copies of the $n_{\mathrm {state}}$ state bits, $k$ copies of the $n_{\mathrm {input}}$ inputs). CDCL’s learned clauses prune most of that space in practice, but the unbounded-$k$ limit is unwinnable in practice: a finite bound past which UNSAT would be a proof does exist — the recurrence diameter — but it is as hard to compute as the property itself, so no flow waits for it. BMC simply runs until it hits either available memory or available time, whichever comes first — and the orchestrator hands the rest of the proof to one of the engines that follow.

BMC is the fastest bug-finder there is when the bug is shallow. For unbounded proofs it cannot finish.

$k$-induction: the first attempt at infinite proof

If we cannot unroll to infinity, can we close the proof by induction? Mathematical induction proves a fact about every natural number with two steps: show it for $0$, and show that if it holds for $n$ then it holds for $n+1$. $k$-induction (Sheeran–Singh–Stålmarck, FMCAD 2000) applies the same logic to hardware:

• Base case. The property holds in the reset state. Checked by BMC at depth $0$.
• Inductive step. If the property holds in some arbitrary state $s$, it holds in $s' = T(s)$. Checked by one SAT query on a single transition, starting from an unconstrained state.

If both queries return UNSAT, the property holds forever. No unrolling required beyond depth $k$.

Where the $k$ comes in. Those two bullets are the heart of it; the $k$ is a strengthening knob on the inductive step. At $k=1$ the step looks back at just one prior state; raising it assumes the property held for the last $k$ consecutive states before proving the next (and discharges the base case to depth $k$ by BMC). A longer assumed-good window rules out short excursions through illegal states, so raising $k$ sometimes closes a proof that plain induction cannot — it is the engine’s first knob before anyone reaches for helper invariants.

The elevator: $k$-induction wins at $k=1$. The base case is trivial — reset leaves the car stopped, so $\neg (\texttt {moving}\wedge \texttt {door\_open})$. The inductive step starts from an arbitrary state — any combination of moving, door_open, floor, target — assumes the interlock holds there, runs one transition, and asks whether the next state is unsafe. But that is exactly the one-step query CDCL already closed: $\texttt {moving\_nxt}$ needs $\overline {\texttt {at\_target}}$, $\texttt {door\_open\_nxt}$ needs $\texttt {at\_target}$, and the two collide. UNSAT. Even if the step picks a garbage start — say $\texttt {moving}{=}1 \wedge \texttt {door\_open}{=}1$, unreachable but the inductive step is free to choose it — the next state still cannot be unsafe, because the cross-gating forbids it wherever you begin. The interlock is $1$-inductive: safety the control enforces locally needs no history at all. $k$-induction proves the elevator at $k=1$, no helper invariant, done.

The FIFO: $k$-induction loses at every $k$. The base case is again trivial — reset is empty and consistent. The inductive step is where it falls apart. It starts from an arbitrary FIFO state, and nothing stops it from picking the garbage $\texttt {count}{=}4 \wedge \texttt {full}{=}0$: a state that satisfies the property ($\texttt {count}{=}4 \le 4$) yet steps straight to overflow the moment a write arrives. The SAT solver finds it, returns SAT, and $k$-induction reports a spurious counterexample — a fake bug born of a state real operation never reaches (Figure 3.16).

Raise $k$? It does not help, and the reason goes to the heart of what $k$-induction can and cannot do. $k$-induction at depth $k$ assumes the property for $k$ consecutive states before the violation, so a longer window can rule out short garbage excursions. But the FIFO’s garbage need not be short.

Leave the cell idle and $\texttt {count}{=}4 \wedge \texttt {full}{=}0$ steps to exactly itself; run a write and a read in the same cycle (net change zero) and it steps to a sibling in the same band — count and flag pinned, only the stored payload churning. Either way the incremental full logic neither sets nor clears the flag, so the desync persists for as many cycles as you like before a write-without-read finally overflows. For any $k$ there is a chain of $k$ consecutive property-satisfying garbage states leading to the bug, and no fixed window catches it.

The standard strengthening that bars repeated states from the window does not rescue it either: because the payload churns, the desync band holds on the order of $2^{4w}$ distinct states (depth $4$, $w$-bit words), and a repeat-free chain can wander the whole band before it overflows — so the window would have to exceed the entire band. The refinement helps in principle and never in practice; $k$-induction cannot prove this property at any feasible depth.

The textbook cure is a helper invariant — here, exactly $\texttt {full}\Leftrightarrow (\texttt {count}{=}4)$ — supplied by hand to fence the garbage out. Finding such invariants by hand is the labor bottleneck of large formal projects. The two engines that follow find them automatically: interpolation carves the fence from a proof, and IC3 builds it clause by clause.

Craig interpolation: the perfect fence

Craig’s theorem (William Craig, 1957) is a tool from pure logic that turns out to solve $k$-induction’s garbage-state problem cleanly. The theorem says: if two formulas $A$ and $B$ are jointly unsatisfiable ($A \wedge B$ has no satisfying assignment), then there exists a third formula $I$ called the interpolant with three properties:

1. $A \implies I$ — every model of $A$ satisfies $I$.
2. $I \wedge B \implies \bot $ — no model of $I$ satisfies $B$.
3. $I$ uses only the variables that appear in both $A$ and $B$.

$I$ is a fence drawn in the shared vocabulary — snug against $A$ on one side, excluding $B$ on the other. Ken McMillan adapted Craig’s theorem to model checking in 2003. The mapping to hardware is direct:

• $A$ = the reset state followed by one transition — its models are exactly the states reachable from reset in one clock cycle.
• $B$ = “the design is in a state that leads to the bug within some bounded number of cycles.”
• Shared variables = the state of the design at the partition point.

If the BMC formula at depth $k$ is UNSAT (no counterexample of length $k$ exists), then $A \wedge B$ is unsatisfiable and the interpolant $I$ exists. The SAT solver’s UNSAT proof tells the tool how to extract $I$ in linear time. By Craig’s three properties, $I$ over-approximates the one-step reachable states, lives in the design’s state variables, and never overlaps the bug region.

What lives at the partition. The third Craig property — $I$ uses only the variables shared by $A$ and $B$ — is the one that earns its keep in hardware. In the BMC partition, $A$ ranges over the cycle-$0$ initial-state predicate (a formula over the state variables) plus the combinational gates and primary inputs that drive one transition; $B$ ranges over the cycle-$k$ bad-state predicate plus the combinational gates leading up to it. The only variables that appear on both sides are the flop bits at the partition cycle: the state-holding elements that the prefix writes to and the suffix reads from. Combinational wires of cycle $0$ and cycle $k$ appear on only one side, and so do primary inputs. Craig’s theorem forces those out of $I$. What comes back is a clean predicate over the design’s flop values — exactly the right shape to plug back into the next iteration. The interpolant lives in the same vocabulary the engineer thinks in.

Where the fence comes from. Interpolation does not compute the fence separately — it is carved from the SAT solver’s own UNSAT proof. When CDCL refutes $A \wedge B$, it produces a resolution proof tree: the chain of pairwise clause resolutions that derived $\bot $ from the input clauses. The interpolant is extracted from that tree in linear time by walking the proof bottom-up and projecting each step onto the shared variables. Clauses derived purely from $A$-side gates contribute $A$-side facts; clauses derived purely from $B$-side gates contribute $B$-side facts; resolution steps that mix the two sides reduce to a clause over the shared (flop) variables only. Two consequences fall out. First, interpolation is essentially free — it piggybacks on the refutation CDCL already produced. Second, the interpolant is literally the solver’s own explanation of why $A$ and $B$ could not both be true — an autopsy of a dead path, projected onto the partition’s flop bits.

This is the fence $k$-induction was trying to guess (Figure 3.17). But where $k$-induction guessed blindly and got tripped up by garbage states, interpolation carves the fence from the UNSAT proof itself. The mathematics guarantees the fence contains every reachable state and excludes every state from which the bug is still reachable — so the garbage states that step to the bug, exactly the ones that tripped $k$-induction’s inductive step, fall outside by construction. The fence is an over-approximation: some harmless unreachable states may sit inside it, but a bug-reaching state never can.

Iterate: extract $I_1$ over-approximating the one-step image, fold it into the reachable frontier ($R \leftarrow R \vee I_1$), and re-image from there to get $I_2$, then $I_3, I_4, \ldots $ — each $I_{n+1}$ the interpolant of the current frontier against the fixed-depth bad suffix. The fence grows. Because each fence over-approximates, it can overshoot — enclose an unreachable state that does reach the bug, turning the next query SAT even though no real counterexample exists; the cure is a deeper BMC bound, whose longer suffix forces tighter interpolants until the overshoot is squeezed out (or a genuine counterexample surfaces). When $I_{n+1} = I_n$ we have reached a fixed point. The interpolant is now an inductive invariant: a predicate that contains every reachable state, is preserved by the transition relation, and never touches the bug. The property is proved for all time.

On our two designs the contrast is sharp. The elevator’s property is already inductive, so its first interpolant is the property itself and the iteration halts at once — there is no garbage to fence out. The FIFO is where interpolation earns its keep: the fence the iteration converges to is $\neg (\mathtt {count}{=}4 \wedge \mathtt {full}{=}0)$ — the safety-critical half of the $\texttt {full}\Leftrightarrow (\texttt {count}{=}4)$ helper $k$-induction needed supplied by hand, here carved for free from the solver’s own refutation.

IC3/PDR: working backward, never unrolling

Interpolation alone can drive the proof, but in 2011 Aaron Bradley introduced IC3 (Incremental Construction of Inductive Clauses for Indubitable Correctness; also called PDR, Property-Directed Reachability) which works from the opposite direction and more efficiently. IC3 never unrolls the design at all. It works backward from the bug.

Containment rings: forward frames, backward queries. The single confusion that trips readers on IC3 is an apparent contradiction between two of its features: the frames $F_0, F_1, F_2, \ldots $ are forward-looking (each $F_i$ over-approximates what is reachable in $i$ cycles) while the engine’s SAT queries run backward from the bug. Two metaphors, one for each direction, dissolve the contradiction; the panels that follow put them to work step by step (Figures 3.18 and 3.19).

For the forward half, picture a forest fire at a campsite (the reset state, $F_0$) and a house we want to protect (the bug). $F_1$ is a rough ring around the campsite holding every point the fire could reach in one hour; $F_2$ holds every point in two; the rings grow outward by definition. For the backward half, picture a detective who never watches the fire spread forward: she stands at the house and reasons back from it, asking the nearest ring a single one-step question, “can a spark jump directly from you to me?” When the answer is yes, she carves a firebreak into that ring to block the jump, pulling the ring back just enough to exclude the dangerous slice. The rings are the map; the detective is the motion across it. Forward frames, backward queries — both at once, no contradiction. The fire metaphor carries the rest of the engine; the detective’s sharpest move — generalization — is the one the FIFO will force, a few paragraphs below.

The architecture is a sequence of frames $F_0, F_1, F_2, \ldots $ each a CNF predicate over the design’s state variables (a formula describing a set of states). Five invariants hold throughout the engine’s run:

• $F_0 = $ initial state.
• $F_i$ contains every state reachable in $i$ cycles.
• $F_i \subseteq F_{i+1}$ (frames only grow).
• $F_i \wedge T \implies F_{i+1}$ (consecutive frames respect the transition).
• $F_i \implies P$ (every frame is safe with respect to the property).

IC3 alternates two moves (Figures 3.18 and 3.19). Strengthen — pull the ring back where it is dangerous: each $F_k$ starts life as a loose over-approximation, a containment ring drawn generously around the fire, enclosing ground the fire will never actually reach. (That slack is the over-approximation: the ring claims more than the fire can do.) The engine asks the SAT solver whether any state in $F_k$ can transition to a bad state — a patch of that over-included ground from which a spark could still jump to the house. If it finds one, it carves a firebreak: it generalizes the offending state into a small clause and removes that clause’s region from $F_k$, pulling the ring back just enough to exclude the danger. Repeat until no spark can jump from $F_k$ to the bug.

Push — drag a firebreak outward: a wall that stops the fire at ring $i$ may also stop it at ring $i{+}1$. For each clause in $F_i$, the engine checks whether the transition relation preserves it; if so, the clause is promoted into $F_{i+1}$ — the way CDCL’s learned clauses live forever in the SAT base, a firebreak once carved is dragged forward through the whole stack of rings.

When two consecutive rings become identical ($F_k = F_{k+1}$), the engine has built a permanent firebreak the fire can never cross: an inductive invariant, and the proof is complete.

On our two designs IC3 behaves as differently as every engine before it. Both runs ask single-cycle SAT questions only — no unrolling — and both start from the bug and work backward.

The elevator: one frame — and generalization for free. IC3 asks the backward question: can any state in the current over-approximation step into $\texttt {moving}\wedge \texttt {door\_open}$? That is the very query CDCL closed — $\texttt {at\_target}$ cannot be both $0$ and $1$ — so it is UNSAT at once, and $F_1 = P$ is inductive on the first try: no firebreak need be carved.

The lesson here is not the strengthening (there is none) but the shape of the invariant. The clause that proves the interlock, $\neg (\texttt {moving} \wedge \texttt {door\_open})$, is written over the two control bits and nothing else — it never names floor or target. So that single clause clears all $4{\times }4 = 16$ floor/target settings at once: the proof keeps the load-bearing control bits and drops the irrelevant data wholesale. That is exactly what generalization buys — discharging a whole class of states with one clause instead of one state at a time — and it is why the floor registers were called irrelevant from the start. On the elevator the data drops out for free, because the interlock never reads it. The FIFO is where IC3 must earn the same move: it has to discover which bits are load-bearing and drop the rest itself.

The FIFO: strengthen, generalize, push, converge. Now the engine has to work (Figure 3.20).

• Strengthen. The backward query does find a spark: a state in the loose frame that steps to overflow — the garbage $\texttt {count}{=}4 \wedge \texttt {full}{=}0$, exactly the witness CDCL produced. IC3 must block it, so it asks, recursively, whether that cube has a predecessor in the frame below ($F_0$, the empty FIFO). The desync recurs without bound — which is exactly why $k$-induction choked — but $F_0$ cannot step to $\texttt {count}{=}4$, so the cube is unreachable from reset, and IC3 carves it out of the frame.
• Generalize (MIC). The blocked cube names the entire state — $\texttt {count}{=}4 \wedge \texttt {full}{=}0$ and every payload bit and the pointers. IC3 keeps only the load-bearing pair, leaving the clause $\texttt {count}{=}4 \Rightarrow \texttt {full}$: one firebreak across a whole region of garbage at once.
• Push. That clause is preserved by the transition, so it is promoted forward through the frame stack — the firebreak dragged outward ring by ring.
• Converge. When two consecutive frames coincide, the surviving clause $\texttt {count}{=}4 \Rightarrow \texttt {full}$ — the safety-critical half of the $\texttt {full}\Leftrightarrow (\texttt {count}{=}4)$ relationship the design maintains — together with $\texttt {count}\le 4$, is the inductive invariant nobody wrote down, and the proof is complete.

Generalization is where IC3’s speed comes from. The backward query hands back the entire offending state — the count, the flag, every payload byte, the read and write pointers. A naive engine would block exactly that state and meet its near-twin (same count and flag, one payload bit different) on the next query, for all $2^{\text {payload}}$ of them. IC3 instead drops a literal from the cube, re-issues the SAT call, and keeps the drop while the cube stays unreachable: the payload bits fall away first — overflow does not care what is stored — then the pointers, until only $\mathtt {count}{=}4 \wedge \mathtt {full}{=}0$ remains. The wall it then builds bars not this fully-specified state but any state with $\mathtt {count}{=}4 \wedge \mathtt {full}{=}0$, whatever else it holds — and that surviving clause, $\neg (\mathtt {count}{=}4 \wedge \mathtt {full}{=}0)$, is one half of $\mathtt {full}\Leftrightarrow (\mathtt {count}{=}4)$. Generalization did not merely prune the search; it discovered the invariant nobody wrote down.

Bradley’s inductive generalization (the MIC procedure — Minimal Inductive Clause) does exactly this, mechanically: drop a literal from the bad cube, re-issue the SAT call, keep the drop while the cube stays unreachable, repeat until nothing more can go. Later variants refine it in two directions. Eén–Mishchenko–Brayton’s PDR (FMCAD 2011) keeps inductive generalization but derives the blocking clause from the SAT solver’s UNSAT core (with ternary simulation to lift the predecessor cube), making each generalization nearly free; interpolation-based tools — Vizel–Gurfinkel’s AVY (CAV 2014) and IC3-with-interpolation generally — instead substitute a Craig interpolant extracted from the SAT proof, a more aggressive generalization at the cost of a heavier per-call computation. Either route, each SAT call seals off a region rather than a single point, and that is what makes the engine close hard properties in seconds on industrial designs.

Push-button automatic model checking

What the property became before the race. Before any engine wakes, one translation happens that the tool never shows. Hardware engineers write properties in SystemVerilog Assertions (SVA), but every engine in this section consumes (internally) a temporal-logic formula — SVA is a syntactic frontend to a fragment of Linear Temporal Logic (LTL), augmented with sequence operators (##, [*], [->]) that compile down to LTL with bounded auxiliary state. The model checker compiles the property into an $\omega $-automaton, takes its product with the design’s transition system, and reduces verification to “does the product have an accepting run?” — the automata-theoretic formulation of Vardi and Wolper (1986). Industrial tools hide the translation; the engineer sees SVA in, pass/fail out. The engines do not care which dialect the property started in.

Modern formal tools run all of these engines in parallel. The orchestrator wakes up, analyzes the property, and spawns four workers: BMC, $k$-induction, McMillan-style interpolation-only model checking (IMC), and IC3/PDR (Bradley’s frame structure, with MIC or interpolation-based generalization). They race.

The race plays out exactly as the five subsections predicted: on a shallow bug BMC wins in milliseconds with a replayable waveform; on a deep unbounded proof IC3/PDR closes it without unrolling while BMC crashes and $k$-induction tangles in garbage states. The orchestrator takes whichever engine finishes first, kills the rest, and returns proven (unbounded) or a counterexample.

This is what makes hardware formal verification routine in 2026. The engineer writes a property in SVA, types one command, and the orchestrator runs four engines in parallel underneath. The hand-authored helper invariants that dominated block-level proofs in the 2010s are now largely carved automatically by IC3 and interpolation. For a single block-level property, the engineer’s job is the property catalogue of §3.2 — naming the intent of the design — and everything below it is push-button. At SoC scale, invariants still drive the proof, but the LLM-plus-prover loop of §3.2 does the authoring rather than a human.

3.4 Proving Liveness: Fair Cycles and Ranking Functions

Every engine in the race just described answers a safety question — is any bad state reachable? — and returns the same kind of certificate, an inductive invariant: the fence that contains reset, holds under every transition, and never touches the bug. A large class of properties does not fit that mould. Chapter 2 wrote them as liveness properties — “something good eventually happens,” the arbiter’s promise that no client is starved, a handshake’s promise that every request is answered — using the strong s_eventually operator. This section is how an engine discharges them, because the shift away from safety is larger than the syntax suggests: a safety violation is a finite trace to a bad state, but a liveness violation is an infinite trace on which the awaited event never arrives, and the previous section’s engines find counterexamples as finite traces of bounded length.

The counterexample is a loop: the lasso. A finite-state design cannot run forever without repeating itself. By the pigeonhole principle, any infinite run eventually re-enters a state it has already visited, and from there it can cycle forever. So if a liveness property can fail, it fails on an ultimately periodic path: a finite stem from reset into a cycle that repeats with no further progress. This shape is a lasso (Figure 3.21), and it is exactly the “accepting run” that the $\omega $-automaton product of §3.3 is built to find — for a liveness property, an accepting run is a reachable cycle on which the awaited event never occurs. The lasso makes a liveness counterexample as concrete as a safety one: the engineer gets a stem-and-loop waveform to replay, not an open-ended “still running.”

Fairness: ruling out the schedules that cheat. Most liveness properties are false exactly as written, because there is always a degenerate run that withholds progress forever — an arbiter that is simply never clocked, an input that never arrives. These are scheduling artifacts, not design bugs. Fairness assumptions rule them out: weak fairness says an action that stays continuously enabled is eventually taken; strong fairness says an action enabled infinitely often is taken infinitely often — Chapter 2’s fairness property, a client requesting infinitely often is granted infinitely often, is the strong form. The engine then searches not for any non-progress cycle but for a fair one: a lasso that honors every fairness assumption and still starves the awaited event. A property that fails only on unfair cycles is, correctly, reported as proven.

The certificate is a measure: the ranking function. Safety’s proof certificate is the inductive invariant; liveness’s is its dual, a ranking function. Where the invariant is a region that proves nothing escapes, the ranking function is a clock that proves something must finish: a map from each state into a well-founded order — the natural numbers suffice — that strictly decreases on every step until the awaited event. A well-founded order has no infinite descending chain, so the measure cannot fall forever; the only way for the descent to stop is for the event to arrive. Like the invariant, it is hard to find but cheap to check — confirming that a candidate measure decreases each step is a single-step query, the same certificate asymmetry drawn in §3.1.

Worked example: starvation freedom of the round-robin arbiter. The arbiter of §2.5 carries its ranking function in plain sight. Its starvation-freedom property — written in Chapter 2 as req[i] |-> s_eventually gnt[i] — says no client waits forever. The measure is the rotational distance from the priority pointer to client $i$: how many positions the round-robin pointer must advance before it reaches $i$. Under the fairness assumption that the arbiter is clocked, the pointer advances one position per cycle, so the distance strictly decreases; it is bounded above by the client count $N$ and reaches zero exactly when client $i$ is selected and granted. A measure bounded by $N$ that drops every cycle is a proof that no client waits more than $N$ cycles — starvation freedom, with the bound $N$ falling out as a worst-case latency for free. The fence proved the arbiter’s safety half (no two clients granted at once); the ranking function proves its liveness half (no client starved). The same design needs both certificates because it makes both kinds of promise.

Liveness in practice: the reduction to safety. Searching for fair cycles costs more than the safety race, so industrial flows usually convert liveness back into safety and reuse the engines already built. The standard liveness-to-safety transformation adds a little bookkeeping state — a nondeterministic snapshot of a candidate loop-start state plus a flag that the awaited event has not occurred since — which turns “there exists a fair non-progress cycle” into “this safety monitor is violated,” a question BMC and IC3 answer directly. A coarser but common practice is bounded liveness: assert that the event occurs within a fixed window, request[i] |-> ##[1:k] grant[i], which is a pure safety property and is what much block-level sign-off actually checks. Bounded liveness is the right tool when a real deadline exists — the deadlock and starvation concerns of §3.1 usually do carry one; the unbounded ranking-function proof is the right tool when no fixed bound should be frozen into the contract.

3.5 SMT: Lifting SAT to Theories

The four-engine race of §3.3 has a blind spot, and every engine in it shares it: they all think in bits. To CDCL a 32-bit adder is two hundred clauses and a 32-bit multiplier nearly twenty thousand; datapath designs speak in words — sums, products, addresses, array reads — and bit-blasting every word-level fact down to clauses is where the four-engine stack starts to drown. This section climbs one rung up: a solver that accepts $a \times b$ as a single atom and reasons about it algebraically. The lift is called SMT, and the first thing to understand is the ladder it climbs.

From gates to words: a logic ladder. The whole chapter so far has been about propositional logic (sometimes called Boolean logic, or zeroth-order logic): every variable is true or false; every wire in a netlist carries a $0$ or a $1$; every gate is a propositional operator. SAT chases one question — is there an assignment to these Boolean variables that makes the formula true? — and CDCL is how it chases. That is enough when the question is about gates.

The questions verification engineers actually want to ask are bigger:

• For every $32$-bit input pair $(a, b)$, does the multiplier output match the spec?
• Is there any $32$-bit address where a store and a subsequent read return different values?
• For every cycle, if req is high, does gnt go high within five cycles?

These questions have a structural feature propositional logic does not capture: variables that range over a domain (the 32-bit integers, the address space, the natural numbers) and quantifiers ($\forall $ “for every,” $\exists $ “there exists”) that bind them. The logic that admits both is first-order logic (FOL).

A real-world analogy: the speed limit. The structure of FOL is obvious in plain English. The law is a universal first-order statement: “for every car on the road, $\mathrm {speed}(\mathrm {car}) \le 65$.” Cars are the domain; $\mathrm {speed}(\mathrm {car})$ is a function from a car to a number; the comparison $\le 65$ is a numeric predicate. The cop with a radar gun asks the negated existential: “is there a car with $\mathrm {speed}(\mathrm {car}) > 65$?” Same content, flipped so that a single witness (one speeder) settles the question. The radar gun is the solver. It does not check the law’s universal claim directly — it looks for one violation that disproves it.

Hardware verification has exactly this shape. The spec is universal: “for every input, design $A$ matches design $B$.” The miter (introduced as a worked example later in this section) is the negated existential: “is there any input where $A$ and $B$ produce different outputs?” SAT and SMT solvers chase the existential form, because witnesses are concrete and tractable while universals over a $2^{32}$-element domain are not.

Quantifier-free first-order logic: where SMT lives. Quantifier-free means the formula has no $\forall $ or $\exists $ left in it after the negation flip. The variables still range over a typed domain — 32-bit integers, byte arrays, mathematical integers — but no explicit quantifier binds them. The formula is a Boolean combination of typed atomic predicates (relations that are each true or false) like (= a b), (bvult x y), or (= (select mem i) v), with the variables free. The solver’s job is exactly what the radar gun does: find values, if any, that satisfy this conjunction.

The QF_ prefix on every SMT-LIB logic the rest of this chapter uses — QF_BV, QF_ABV, QF_LIA, QF_UF — is exactly that promise: the formula has already been moved to the quantifier-free rung. The verification flow does the lifting (universal spec $\to $ negated existential; temporal quantifier $\to $ bounded prefix); the SMT engine chases the witness (Figure 3.22).

SMT and DPLL(T). Satisfiability Modulo Theories (SMT) is exactly SAT lifted from gates to words: typed variables (32-bit integers, byte arrays, mathematical integers), theory atoms (a < b + 3, select(mem, i) == v, f(x) == f(y)), and a theory solver sitting on top of an unchanged CDCL engine that enforces the rules of the underlying domain. The algorithm is DPLL(T) — “DPLL with theory”; the name is historical, kept from the framework’s lineage, and the Boolean engine inside it is the modern CDCL solver. Each theory atom becomes a fresh Boolean variable, and the two solvers play ping-pong:

1. SAT proposes. The SAT engine picks a Boolean assignment over the atoms.
2. Theory checks. The theory solver tests whether the chosen atoms are jointly consistent in the theory.
3. Theory learns. On a conflict it returns a short explanation — a theory lemma — that the SAT engine adds as a new clause.

They repeat until they agree on a satisfying assignment or jointly prove unsatisfiability.

SMT theories useful in hardware verification. The SMT-LIB standard defines a catalogue of theories; only a few matter day-to-day for RTL work:

• QF_BV (Quantifier-Free Bit-Vectors) — the workhorse. Variables are fixed-width bit-vectors with operations bvadd, bvsub, bvmul, bvand, bvor, bvxor, bvnot, bvshl, bvlshr, signed and unsigned compares (bvult, bvslt, bvuge, bvsge), concatenation, and bit-slicing. A 32-bit adder in pure SAT requires hundreds of clauses; in QF_BV it is a single bvadd atom that the bit-vector solver handles internally.
• QF_AX (Quantifier-Free Arrays with extensionality) — models memories and register files. The signature is select(a, i) (read) and store(a, i, v) (write); combined with QF_BV it becomes QF_ABV, the standard theory for memory-system verification.
• QF_LIA (Quantifier-Free Linear Integer Arithmetic) — atoms are linear inequalities over mathematical integers: 2x + 3y <= 7. Used for spec-level reasoning where counters and addresses are treated as unbounded, and for software-side verification.
• QF_UF (Quantifier-Free Uninterpreted Functions) — function symbols with no fixed interpretation, only the congruence axiom: $f(x) = f(y)$ whenever $x = y$. Used for abstraction — replace a complex circuit’s behavior by an uninterpreted function and prove the surrounding logic correct against the abstraction.
• Combinations. QF_ABV (arrays + bit-vectors) is the default theory most hardware tools target. QF_AUFBV adds uninterpreted functions for partial abstraction.

Why SAT alone does not scale to datapath

The four-engine stack from §3.3 runs on top of CDCL, and CDCL runs on top of CNF. Every theory atom written at the word level — a + b, a * b, select(mem, addr) — ultimately compiles down to a Boolean formula the SAT solver chews on. The cost of formal verification is therefore the cost of the encoded CNF, not the visible size of the property. And the cost is sharply uneven across operations (Table 3.2).

Table 3.2: CNF cost of common 32-bit operations after Tseitin / bit-blasting. The cost is per operator; a typical datapath property chains several. “QF_BV atoms” counts the corresponding SMT-LIB constructs — always small, because the theory solver does the bit-blasting internally with optimizations that are not visible to the encoder. The adder and comparator rows reflect the compact carry-chain encodings solvers actually emit; the textbook 2-XOR/2-AND/1-OR full-adder expansion priced in Table 3.3 is looser.

Operation	Width	SAT (Tseitin CNF)	QF_BV atoms
`bvand`, `bvor`, `bvxor`	32	$\sim 100$ clauses	1
`bvnot`	32	free (negated literals)	1
`bveq`, `bvult`, `bvslt`	32	$\sim 100$ clauses	1
`bvshl`, `bvlshr` (constant)	32	free (wiring)	1
`bvadd`, `bvsub` (ripple-carry)	32	$\sim 200$ clauses	1
`bvmul`	32	$19\,616$ clauses	1
`bvudiv`, `bvurem`	32	$> 50\,000$ clauses	1

The first three rows are well-behaved: bitwise operators and ripple-carry arithmetic produce a few hundred clauses each, growing linearly with bit width. CDCL closes them fast. Multiplication is the inflection point. A 32-bit bvmul produces nearly twenty thousand clauses, and the partial-product tree has the kind of dense connectivity that misleads the solver’s branching heuristics (VSIDS, CDCL’s activity-based decision scoring) into unproductive decisions. The clause count grows quadratically with bit width — $n^2$ partial products summed by an array of $n^2 - 2n$ full adders, each a fixed five variables and seventeen clauses under Tseitin (Table 3.3 works the count out to 64 bits). Division and modulo are worse still — the bit-blasted CNF can reach hundreds of thousands of clauses for a single 64-bit operation. This is the multiplier wall: the point at which a pure SAT encoding stops being a practical answer to a datapath property.

Table 3.3: The multiplier wall. An $n$-bit array multiplier bit-blasts to $n^2$ partial-product AND gates plus a reduction array of $n^2 - 2n$ full adders and $n$ half adders — every full adder removes one bit, carrying the $n^2$ partial-product bits down to the $2n$-bit product. Tseitin charges 1 variable / 3 clauses per AND, 5 variables / 17 clauses per full adder (two XORs, two ANDs, an OR), and 2 variables / 7 clauses per half adder, so the CNF has $n^2 + 5(n^2{-}2n) + 2n = 6n^2 - 8n$ variables and $3n^2 + 17(n^2{-}2n) + 7n = 20n^2 - 27n$ clauses — both $\mathcal {O}(n^2)$, tabulated below. The 64-bit row is where eager bit-blasting routinely times out in CDCL even on modern solvers.

multiplier width $n$	CNF variables	CNF clauses
$8$	$320$	$1\,064$
$16$	$1\,408$	$4\,688$
$24$	$3\,264$	$10\,872$
$32$	$5\,888$	$19\,616$
$64$	$24\,064$	$80\,192$

QF_BV is built exactly for this wall. The bit-vector solver receives the operator symbol bvmul and applies word-level decision procedures internally — modular reasoning, Booth recoding, dedicated multiplier procedures depending on the back-end — before falling back to bit-blasting only when it must. When word-level reasoning is not enough, the back-end uses incremental bit-flattening: emit the propositional skeleton plus the cheap operators first, omit the expensive multiplier constraints, ask the SAT solver. If unsat, the formula is unsat without the multiplier; if sat, check whether the assignment also respects the omitted constraints, and if not, add the missing constraints for the conflicting terms and re-solve. This is abstraction–refinement applied at the operator level — what lets industrial SMT solvers close datapath properties an eager bit-blasting solver cannot finish. A 32-bit bvmul property that times out in eager SAT often closes in milliseconds in QF_BV; 64-bit properties out of reach for SAT alone are routine.

Equivalence checking: SMT’s industrial workhorse

The largest single application of SMT in production hardware verification is equivalence checking after synthesis. Synthesis is the most-trusted transformation in the silicon pipeline. It converts a few thousand lines of RTL into a million-gate netlist that an FPGA or ASIC fabric can implement, applying clock-gating, retiming, resource sharing, and pipeline restructuring along the way. The trust is necessary because no team writes the netlist by hand; the trust is not free because a synthesis bug, an aggressive optimization, or a misapplied user attribute can introduce a functional divergence between the RTL the designer wrote and the netlist that goes to the fab. Equivalence checking is the formal proof that the two designs compute the same function, run after every synthesis pass and before tape-out. By run count, it is the most-deployed formal tool in the industry — more than BMC, model checking, and theorem proving combined — because it is the gate that synthesis has to pass to be shipped. It is also the place where SAT alone visibly fails and SMT visibly rescues, which is why it lands in this section as the canonical worked-out application.

The miter circuit. The construction is strikingly simple (Figure 3.23). Wire identical inputs to design $A$ (the RTL) and design $B$ (the netlist). XOR each pair of corresponding output bits. OR the XOR results into a single signal called differ. Hand the circuit to a SAT or SMT solver and ask one question: is there any input that makes differ go high? UNSAT proves that $A$ and $B$ compute the same Boolean function for every input combination, on every cycle the analysis spans. SAT returns a concrete counterexample — a specific input pattern producing different outputs — which is a synthesis bug with replay instructions.

Combinational vs sequential EC. The miter approach divides into two regimes by how the flops are arranged:

• Combinational Equivalence Checking (CEC). Both designs have the same number of flops in corresponding positions. Each flop in $A$ has a partner in $B$; the combinational cones between flops can be compared independently. The miter degenerates into one independent SAT query per flop boundary. Industrial CEC tools handle billion-gate designs by exploiting structural matching: corresponding internal points are identified by name or by topological hash, equivalences propagate through the netlist using SAT-based engines on small slices, and the global proof is the conjunction of the local ones.
• Sequential Equivalence Checking (SEC). The two designs have different flop structures. Retiming may have moved flops across logic; pipelining may have added a flop boundary; resource sharing may have collapsed two parallel multipliers into one. There is no per-flop structural match. The comparison is over the design’s whole input–output trajectory: for every reachable starting state pair $(s_A, s_B)$, prove that $A$ and $B$ produce identical outputs forever. SEC is the four-engine stack from §3.3 (BMC, $k$-induction, Craig interpolation, IC3/PDR) applied to the miter, with abstraction essential when the cross-product state space is large.

CEC is push-button on industrial designs. SEC inherits the full proof-engine machinery, applied not to a property but to a “function preserved” obligation. The two-design state space is the cross-product of $A$’s and $B$’s state spaces; abstraction (§3.2 Stage 4) is what makes that tractable.

SEC arises in three regimes, all sharing the miter that Chapter 2’s pipelined ALU (§2.8) earns: transformation equivalence (RTL against an aggressively-synthesized netlist — retiming, pipelining, resource sharing — the case this section develops), arithmetic equivalence (datapath restructuring such as Booth vs. shift-and-add, the worked example below), and model equivalence (RTL against a C-model or Lean reference, the SLEC case taken up in §3.6).

Where SMT earns its keep in EC. The synthesis transformations that break CEC’s structural matching are mostly arithmetic. A Booth multiplier and a shift-and-add multiplier produce the same product but completely different gate-level structures — bit-by-bit, the netlists look unrelated. A balanced 32-bit adder tree differs gate-for-gate from a ripple-carry adder but computes the same sum. CEC’s bit-blast-and-compare can be technically correct but practically intractable: the SAT solver does not know that two combinational cones implement the same arithmetic function until it has searched the entire Boolean cross-product space.

SMT with the QF_BV theory closes this gap. Encode both sides at the word level — (bvmul a b) for one, (bvmul b a) for the other — and assert that the outputs differ. The bit-vector solver recognizes the operators as arithmetic and applies word-level decision procedures (Booth recoding, modular reasoning, dedicated multiplier procedures depending on the back-end) without bit-blasting the multipliers into hundreds of thousands of clauses. The same query that exhausts a pure SAT solver closes in milliseconds in QF_BV.

A worked example: rebalanced adder tree. Consider a 32-bit four-operand adder, written two ways. Design $A$ is a linear chain: ((a + b) + c) + d. Design $B$ is a balanced tree: (a + b) + (c + d). The two designs compute the same final sum but route through different intermediate values — CEC’s per-flop matching fails because the intermediates differ. The miter in QF_BV is direct:

(set-logic QF_BV)
(declare-fun a () (_ BitVec 32))
(declare-fun b () (_ BitVec 32))
(declare-fun c () (_ BitVec 32))
(declare-fun d () (_ BitVec 32))


; Design A: linear chain (((a+b)+c)+d)
(define-fun sumA () (_ BitVec 32)
  (bvadd (bvadd (bvadd a b) c) d))


; Design B: balanced tree (a+b)+(c+d)
(define-fun sumB () (_ BitVec 32)
  (bvadd (bvadd a b) (bvadd c d)))


; Miter: is there any input where the two structures differ?
(assert (not (= sumA sumB)))


(check-sat)         ; reports: unsat

Z3 returns unsat in milliseconds, mathematically proving the rebalanced tree is functionally identical to the linear chain. The same query bit-blasted to pure SAT (Boolector with eager bit-blasting, or CBMC’s bit-vector front-end) produces several thousand clauses; CDCL still closes it, but on a 64-bit version the bit-blasted SAT can time out while QF_BV finishes in under a second. The asymmetry grows fast with bit width, and it is sharper for multiplication than for addition — which is why production EC tools running on aggressively-synthesized datapaths reach for QF_BV by default.

Table 3.4: Decidability and complexity of common SMT theories. Full first-order is mostly undecidable; the quantifier-free fragment is the practical sweet spot. Theory names follow SMT-LIB. Bit-vector rows stay decidable even under quantifiers because fixed-width domains are finite; uninterpreted functions and unbounded-index arrays do not.
Theory	Full FOL	QF fragment	Practical complexity

Propositional (no theory)	decidable	decidable	NP-complete
QF_UF (equality + UF)	undecidable	decidable	NP-complete
QF_LIA (linear integer)	decidable	decidable	NP-complete
QF_LRA (linear real)	decidable	decidable	NP-complete
QF_BV (bit-vectors)	decidable	decidable	NP-complete
QF_AX (arrays)	undecidable	decidable	NP-complete
QF_ABV (arrays + BV)	undecidable	decidable	NP-complete
T_R (real algebra)	decidable (Tarski)	decidable	$2^{2^{O(\|F\|)}}$ (CAD)
T_N (Peano arithmetic)	undecidable	undecidable	—

Industrial use. Equivalence checking runs on every silicon project, mostly unremarked. Every nightly synthesis run produces a netlist; every netlist goes through CEC before place-and-route. Every ECO (Engineering Change Order) — the targeted post-tape-out fix that swaps a small piece of the netlist without re-synthesizing the whole chip — goes through CEC to prove the patched netlist still matches the (patched) RTL. SEC kicks in when the synthesis tool is configured aggressively (retiming across clock boundaries, multi-cycle path optimization, clock-gating insertion), when the designer wants to verify a custom microarchitectural transformation, or when low-power flows restructure the flop layout. The proof obligation never changes — $A$ and $B$ compute the same function — but the engines underneath are the four-engine stack from §3.3 lifted to the bit-vector theory of this section.

Decidability and cost at a glance

The reach and cost of each theory differ sharply. Table 3.4 summarizes (after Bradley & Manna, The Calculus of Computation, Tables 3.1 and 3.3) what is decidable in the full theory, what is decidable in the quantifier-free fragment, and the practical complexity class of the quantifier-free decision procedure. The pattern that drives 2026 formal verification is the third column: the QF fragment is decidable for every theory hardware cares about, and the worst-case complexity is reasonable for everything except elementary algebra. Production tools restrict the user to the QF fragment by construction.

Looking up the ladder: why the climb costs decidability

The engines walked through in §3.3 and this section each target one rung of the logic ladder of Figure 3.24: SAT lives on the bottom rung (propositional formulas, variables in $\{0, 1\}$); SMT lifts one rung up (quantifier-free first-order logic over bit-vectors, arrays, linear arithmetic); the SVA spec language sits another rung up (full first-order logic with $\forall $ and $\exists $ over typed state sequences), reduced back down by BMC and IC3 for the model checker to decide; theorem proving (the subject of the next section) handles the top rung, higher-order logic. Each rung adds expressive power and loses some decidability; each engine pays exactly for the rung it targets. The shape of the ladder is not engineering convenience — it is a theorem from 1936.

Each rung adds quantification over a larger domain, and a decision procedure has to terminate on every input. Propositional logic has no quantifiers; the domain is finite Booleans, and SAT decides every formula in finite time (exponential worst case, tractable in practice). Quantifier-free first-order logic still has no $\forall $ or $\exists $ in the formula itself; SMT decides it because the standard theories the SMT-LIB committee chose — bit-vectors, linear arithmetic, arrays — happen to be decidable. Non-linear integer arithmetic, also quantifier-free, is already undecidable (Hilbert’s tenth problem, settled by Matiyasevich in 1970); so the QF rung is safe by choice of theory, not by structure. Full first-order logic adds $\forall $ over an infinite domain: the engine cannot enumerate the integers, and the best it can do is search for a proof. Validity becomes only semi-decidable — Church 1936. Higher-order logic adds quantification over functions and predicates (Boolean-valued functions): $\forall f \colon \mathbb {N} \to \mathbb {N}$ is uncountable, and validity ceases even to be semi-decidable.

The verification consequence is load-bearing: a model checker on a finite state space reduces full-FOL property checking back down the ladder (BMC unrolls to QF-FOL, IC3 to SAT) and stays decidable; on an unbounded state space ($\forall N$-core caches, $\forall $-depth FIFOs) the reduction has no terminus and the engine cannot decide. This is the structural reason the top rung needs human-or-AI guidance — not because the engineering is harder, but because no algorithm decides what lives there in general. The kernel does not decide HOL validity; it mechanically checks whether a specific proof is sound. Undecidable to find a proof automatically; decidable to verify one once written — that asymmetry is what makes the top rung work at all, and it is the architectural opening every AI integration in the next section exploits.

The engines, in runnable form. Every engine of §3.3 and the SMT lift of §3.5 is implemented as a small, dependency-free companion under ch3_fv_examples/ in the repository. The engines and a from-scratch CDCL solver they all call as their decision oracle live in 01_proof_engines/, and a minimal SystemVerilog frontend there reads the actual RTL of the two running examples — elevator.sv (in 02_elevator_proof/) and fifo.sv (in 03_fifo_proof/) — builds each one’s transition system, and Tseitin-encodes it, exactly as a real flow does: the model the engines prove is derived from the design, not hand-written. Running make in either example directory reproduces the verdicts stated here: the elevator’s one-step unsat and its $k{=}1$ proof; the FIFO’s $\mathtt {count}{=}4 \wedge \mathtt {full}{=}0$ witness and its stall at every $k$; the fence $\neg (\mathtt {count}{=}4 \wedge \mathtt {full}{=}0)$ that IC3 and interpolation each discover; and — in 04_adder_equiv_smt_proof/, from SMT-LIB miters read by a small QF_BV frontend — the rebalanced adder-tree equivalence of §3.5, plus a bvmul-commutativity miter that DPLL(T) closes word-level instead of paying the multiplier wall’s bit-blast cost. Every run is quiet by default and narrates on demand: --trace prints the transition system, the Tseitin encoding, and each engine step — every IC3 obligation, ternary lift, and generalization, literals by name — and --deep adds the CDCL search itself under each query, decision by decision, so a reader can watch precisely how the solver reaches the verdict each section describes.

The Chapter 2 contracts, machine-closed. The same companion closes every design of Chapter 2 from its book files: in ch2_rtl_fv_examples/, make check reads each design together with its bound checker — both exactly as this book prints them — and proves every contract; make mutations confirms that one realistic bug injected into each design is caught. Three ingredients carried the reach from the two running examples to the full set. The frontend grew to what the checkers actually write: packed structs and bind, the bounded-queue reference model of the FIFO’s shallow checker, user functions, generate loops, $past and $stable as synthesized shadow registers, and the bounded delay window ##[lo:hi] lowered exactly — one monitor bit per cycle of horizon — so a bounded stand-in is proved as the safety property it really is, and only a genuine s_eventually becomes a liveness obligation. The IC3 implementation gained the two moves named in §3.3: ternary-simulation lifting of every SAT witness, and queries that encode only the cone they read — which is what lets the MSI cache node close unbounded, in three frames at a single set, rather than stall as a bounded check. And cegar.py makes the CEGAR loop of Chapter 2 runnable at scale: localization cuts every register outside a kept set free, an abstract counterexample replays as a deterministic simulation, and the full-geometry cache — four ways by sixteen sets, 288 state bits, where a direct run drowns in the all-sets snoop cone — proves single residence in one round keeping sixteen bits: the probed set’s state and tag pairs, the checker’s own predicates. Where a proof runs at a reduced data or tag width, the reduction is the data abstraction of the catalogue above — the contract is width-independent, and a bit-level invariant over data equality enumerates values — stated per target in the Makefile. The remaining companions in the fv/ directories are assume-only environment files — the input contracts any formal setup supplies: coin pacing, stable requests, a set index that addresses a real set, FIFO flow control — the bug-injected twins, and single-property restatements for the full-geometry and word-level runs.

3.6 Theorem Proving

The chapter opened with the maze metaphor: simulation walks the corridors with a flashlight, the four push-button engines turn on the stadium lights and reveal the whole maze from above (Figure 3.1). Theorem proving is the third step on the same arc — reading the wiring rules of the maze and proving by induction over its size that no $N$-room maze built to those rules can contain a path from start to bug, including the case $N = \infty $ (Figure 3.25).

The search engines walk one specific maze of bounded size; theorem proving reasons over the family of mazes parameterized by $N$, closing infinitely many instances with a single argument.

The cost shifts from machine time to human reasoning, and the payoff is the case the search engines structurally cannot reach: properties that hold for every $N$, including the $N$ that ships in silicon after tape-out, beyond any bound a finite-state engine can enumerate.

The walls are dominoes, not barriers. The lemmas $L_1, L_2, L_3, L_4$ annotated on the blueprint are not static obstacles in one specific maze — they are propagation rules between sizes. $L_1$ proves the property holds for the smallest case (room 1, or 2 cores, or an empty FIFO); each subsequent $L_k$ proves the wiring rule if room $k$ is safe, room $k{+}1$ inherits the same safety from the wiring rule that connects them (Figure 3.26). Once the chain closes, the first domino tips the second, the second the third, and the property propagates without bound. This is the structural difference between theorem proving and the four push-button engines of §3.3: BMC, IC3, and $k$-induction each verify one fixed $N$ — pick $N{=}4$, they finish; pick $N{=}1024$, they may not — whereas a closed induction proof covers every $N$ in a single argument. That is the only way to discharge a property whose quantifier ranges over a parameter the design will be instantiated at after tape-out.

The wall: knowing when push-button has stopped

The four-engine stack of §3.3 and the SMT lift of §3.5 close most of what hardware verification asks. Engineers cross into theorem proving’s territory when push-button gives one of four signals:

• The solver returns unknown. An SMT query that mixes non-linear arithmetic (multiplication, division, square root) or transcendentals exits with unknown rather than sat/unsat. The decision procedure does not exist or is incomplete. The canonical case is the multiplier wall: a 64-bit FPU has $2^{128}$ input pairs, and bit-blasting two multiplied operands produces a partial-product tree no SAT solver finishes; this is exactly why the landmark FPU-multiplier proofs were done in ACL2 (AMD’s K7/K8) and HOL Light (Intel’s post-FDIV work) rather than with SMT-backed model checking.
• BMC and IC3 run out of memory. The CNF after bit-blasting a wide multiplier or array of FP operations exceeds RAM. IC3’s frame frontier collapses or grows without convergence. Cache-coherence at scale is the recurring example: a MESI/MOESI protocol on 4 cores finishes in seconds, on 8 cores in minutes, on 32 cores stops finishing at all, and on 1024 cores the model checker cannot even build the model. The protocol is the same — only the parameter changed — so hoping the next core-count bump is still tractable is not a verification plan. Abstraction softens the blow — the CEGAR loop of Chapter 2 closes any fixed node count by predicate abstraction — but each new $N$ is a fresh proof; the family $\forall N$ has no finite-state reduction at all.
• $k$-induction never closes. The engine asks the engineer for an ever-larger family of helper invariants and still does not prove the unbounded property. The reachable-state set has too much arithmetic structure for the inductive step to follow. An unbounded FIFO is the canonical case: $k$-induction proves no-packet-drop for depth $D=16$, then for $D=32$, then for $D=64$, and never collapses to a single argument that covers every $D$. The proof obligation is a list of incomparable bounded proofs, not the unbounded result the spec actually asks for.
• The property mentions “every” over a domain SAT cannot enumerate. “For every IEEE 754 input pair, the FP multiplier produces the correctly-rounded result,” “for every $N$-core configuration of this coherence protocol, no deadlock occurs,” and “for every FIFO depth $D$, no packet is dropped” all quantify over a parameter the search engines cannot enumerate. No SMT encoding closes the universal in finite time, because the proof obligation is not about a witness but about the algebraic structure of the design as the parameter varies. Induction over the parameter is the only proof shape that fits.

These four signals are the wall. Hitting one of them is the cue to switch flow: stop trying to make push-button finish, build a theorem-proving project, and prove the property by reasoning over the algebraic structure of the operation. The transition is irreversible in the same way moving from simulation to formal is: once a property crosses into theorem-proving territory, it stays there for the rest of the design’s life.

The theorem-proving project lifecycle

A theorem-proving project runs as a staged pipeline of its own — seven stages, from picking a tool to ongoing maintenance, each producing an artifact the next consumes (Figure 3.27). Its shape echoes the model-checking flow of §3.2 — capture intent, prove, sign off — but the stages and their artifacts are different.

The rest of this section walks each stage on a single concrete running example: a 4-bit signed Booth multiplier. The example is small enough to fit in a textbook and large enough to exhibit every stage of an industrial-flavored proof, and we will return to it as we discuss the AI revolution and the case studies. The reader leaving this section should be in a position to clone a small Lean project, restate the example in Mathlib, and have a closed proof on their own machine within an hour.

Stage 1: pick the tool

The theorem-proving world has four serious tools with industrial pedigree, three of which are commonly applied to hardware and the fourth of which — Isabelle/HOL — sits adjacent (system-software verification rather than RTL, but with hardware-relevant lessons):

ACL2 (A Computational Logic for Applicative Common Lisp) carries the largest industrial portfolio. Russinoff’s team verified the AMD K7/K8 floating-point units in ACL2 (1998–2003). Centaur used ACL2 to verify x86 FPU correctness for over a decade, shipping silicon. The tradition descends from the Boyer–Moore prover; the language is Common Lisp with an automated rewrite engine and induction tactic; proof automation for arithmetic and recursion is mature. ACL2 remains the right choice for an FP-hardware project today if you can build on Russinoff’s published libraries; for a new project without that legacy, Lean is the modern default.

Coq (renamed Rocq in 2025) is a dependent-type-theory prover with a long mathematical pedigree — the Four Color Theorem (Gonthier, 2005), the verified C compiler CompCert (Leroy, 2006), the Feit–Thompson theorem (2012). It is the home of much programming-language and cryptography verification — AWS verified parts of its s2n TLS implementation in Coq, Apple verified HomeKit cryptography there.

Isabelle/HOL is the fourth serious prover, with a higher-order-logic foundation and the largest single-project industrial result in formal methods — the seL4 microkernel proof, around 200,000 lines of Isabelle against $\sim $9,000 lines of C. The Archive of Formal Proofs (AFP) is its mathematical library. The same HOL family includes John Harrison’s HOL Light, the prover behind Intel’s post-FDIV floating-point verification — the Intel counterpart to AMD’s ACL2 tradition. Isabelle has historically been more widely used in system-software verification than in RTL, and for new hardware-FV work Lean is the usual starting point, but Isabelle’s HOL-style approach and its proof-script idiom remain a reference that engineers reading older verification literature will encounter.

Lean 4 (with the Mathlib library) is the modern, AI-friendly choice. Mathlib has grown to multi-million-line scale; the language is dependently typed but reads like a modern functional language; the tooling (lake build, VSCode integration, fast type-checker) is genuinely good. Most importantly, Lean is where the AI-assisted theorem-proving frontier lives: AlphaProof + AlphaGeometry 2 (DeepMind, 2024) jointly achieved IMO silver-medal performance, with AlphaProof proving its solutions in Lean, and Seed-Prover (ByteDance, 2025) is a whole-proof search system that refines candidate Lean proofs against kernel feedback.

We use Lean for the worked example that follows because modern Lean is the easiest prover for a hardware engineer to get running in 2026: elan installs the toolchain, lake new creates a project, import Mathlib pulls in Mathlib’s millions of lines of supporting library, and the VSCode interaction loop closes within minutes of installation. The conceptual moves are the same in ACL2 and Coq; the syntax differs.

Stage 2: model the design

The first concrete artifact of a theorem-proving project is the model — a description of the hardware written in the prover’s language. The model is not the Verilog (which mixes timing, pipelining, and clock-gating with the arithmetic) and not the spec (which is mathematical). It sits between them: an executable, side-effect-free description of what the hardware computes, stripped of timing.

For our running example, the hardware is a 4-bit signed Booth multiplier (Figure 3.28). Booth’s algorithm is what every FPU and DSP multiplier descends from: it inspects overlapping bit-triplets of one operand, encodes each as a digit in $\{-2, -1, 0, +1, +2\}$, and sums the multiplicand scaled by those digits. Two triplets at radix-4 suffice to multiply a 4-bit operand.

The Lean model is direct:

-- File: BoothProof/Basic.lean
import Mathlib.Tactic.Ring
import Mathlib.Tactic.NormNum


-- Booth digit encoding for triplet (b_{i+1}, b_i, b_{i-1})
def boothDigit : Bool -> Bool -> Bool -> Int
  | false, false, false =>      0
  | false, false, true     =>   1
  | false, true,    false =>    1
  | false, true,    true   =>   2
  | true,   false, false => -2
  | true,   false, true    => -1
  | true,   true,   false => -1
  | true,   true,   true   =>   0


-- The hardware: two Booth digits, two partial products, shifted sum
def boothMul (a : Int) (b3 b2 b1 b0 : Bool) : Int :=
  let d0 := boothDigit b1 b0 false       -- triplet (b1, b0, 0)
  let d1 := boothDigit b3 b2 b1          -- triplet (b3, b2, b1)
  a * d0 + 4 * a * d1

That is the entire hardware model: 17 lines, mechanically translatable to Verilog by a script (or written first in Verilog and translated to Lean by a tool like RAC’s translator). Critically the model is executable — you can call boothMul 3 false true true false and Lean computes 18, which is $3 \times 6$. The reference model is also the simulation oracle. One modeling choice carries weight downstream: the multiplicand a is an unbounded Int, so the theorem will quantify over every multiplier input at once — the case no bit-blasting engine can take in a single query — while the 4-bit operand b is pinned to its two’s-complement value by signedValue. The proof therefore discharges the algebraic identity over $\mathbb {Z}$; the fixed-width output encoding and overflow behavior are not in this model at all, and transfer to the production RTL through the sequential-equivalence leg of Stage 6.

Stage 3: specify the property

The property is the mathematical claim we want to prove. Booth’s correctness obligation is: for every multiplicand $a$ and every 4-bit signed $b$, the hardware produces the integer product $a \times b$. The spec needs one more definition — the integer interpretation of a 4-bit signed bit-vector — and then the theorem itself:

-- 4-bit two's-complement interpretation
def signedValue (b3 b2 b1 b0 : Bool) : Int :=
    (if b3 then -8 else 0) + (if b2 then 4 else 0)
  + (if b1 then   2 else 0) + (if b0 then 1 else 0)


-- The proof obligation: hardware = mathematics
theorem boothMul_correct (a : Int) (b3 b2 b1 b0 : Bool) :
    boothMul a b3 b2 b1 b0 = a * signedValue b3 b2 b1 b0 := by
  sorry       -- to be filled in by Stages 4-5

The sorry keyword is Lean’s placeholder for an unfinished proof. The theorem is well-typed and the project compiles; the obligation just is not yet discharged. The discipline is to land the theorem statement before writing the proof — you spec what you want, then you build the argument.

Stage 4: build the lemma library

The theorem’s proof is a tree of justifications (Figure 3.29). At the leaves sit axioms and library facts — “$x + 0 = x$,” “Booth’s encoding is faithful,” Lean’s built-in integer arithmetic. Above them sit lemmas the engineer proves: small intermediate facts that each turn into a single rewriting step in the main proof. At the top sits the theorem itself — the cathedral whose mortar joints are checked by the kernel.

Why a programming language can be a prover: Curry–Howard. A natural question at this point: why is Lean, which is technically a functional programming language with definitions and pattern-matching functions, able to act as a strict math professor? The answer is one of the deepest connections in twentieth-century logic, the Curry–Howard correspondence (Curry 1934, Howard 1969): types are propositions, programs are proofs. A mathematical theorem like “$x + 0 = x$ for every natural number” is, in Lean’s type theory, the type (x : Nat) -> add x 0 = x; a proof of that theorem is exactly a function that returns a value of that type. The Lean kernel does not have a separate “theorem checker” distinct from its “function checker” — it has one type-checker, and theorem proving is a special case of type-checking. This is also why Lean’s proof syntax reads like programming: induction x with | zero => rfl | succ n ih => ... is mechanically a recursive function over the natural numbers that produces a value of the goal type. The tool is not magic; it is the type-checker of a sufficiently expressive functional language, reading proofs as programs and refusing to compile illegal arguments.

For the Booth proof, two engineer-proven lemmas and one Mathlib lemma form the structure:

Lemma 1 (boothDigit_arith): The Booth digit of a triplet $(b_{i+1}, b_i, b_{i-1})$ equals the signed sum $b_i + b_{i-1} - 2 b_{i+1}$. In English: the Booth recoding faithfully turns a 3-bit window into the algebraically-correct contribution to the result. The proof is exhaustive over the eight possible triplets:

theorem boothDigit_arith (bHi bMid bLo : Bool) :
    boothDigit bHi bMid bLo =
      (if bMid then 1 else 0) + (if bLo then 1 else 0)
      - (if bHi then 2 else 0) := by
  cases bHi <;> cases bMid <;> cases bLo <;> rfl

The proof reads: case-split on each of the three Booleans, then close every case by reflexivity. Once the three bits are concrete, both sides reduce to the same integer literal, so rfl discharges each of the eight cases.

Lemma 2 (signedValue_decomp): The 4-bit signed integer decomposes as the sum of the two Booth-digit contributions. Concretely: $\textsc {signedValue}(b_3, b_2, b_1, b_0) = \textsc {boothDigit}(b_1, b_0, 0) + 4 \cdot \textsc {boothDigit}(b_3, b_2, b_1)$. The proof rewrites each digit through Lemma 1, leaving a boolean-weighted arithmetic identity that the sixteen-case evaluation then closes:

theorem signedValue_decomp (b3 b2 b1 b0 : Bool) :
    signedValue b3 b2 b1 b0
    = boothDigit b1 b0 false + 4 * boothDigit b3 b2 b1 := by
  simp only [boothDigit_arith]   -- rewrite each digit via Lemma 1
  cases b3 <;> cases b2 <;> cases b1 <;> cases b0 <;> simp [signedValue]

Lemma 3 (Mathlib.mul_add): Standard distributivity, $x(y + z) = xy + xz$. Already in Mathlib; we cite it.

These three lemmas are the entire library for this proof, and they compose: Lemma 1 is the per-digit fact, Lemma 2 consumes it, and the theorem rests on Lemma 2 with distributivity. At four bits the sixteen-case evaluation could have discharged Lemma 2 on its own — but that shortcut is exactly what does not scale. A 53-bit mantissa multiplier has $2^{53}$ operand values and no proof enumerates them; there, the per-digit lemma is proved once and reused at every position, so the case explosion never happens. The fact is small here and load-bearing at width. In a real industrial multiplier proof the library has dozens of supporting facts (truncation-of-summand, carry-bit propagation, sign-extension correctness, denormal handling); the structure is the same and the lemma library is the lasting asset.

Stage 5: drive the proof

With the library built, the main theorem closes in three lines:

theorem boothMul_correct (a : Int) (b3 b2 b1 b0 : Bool) :
    boothMul a b3 b2 b1 b0 = a * signedValue b3 b2 b1 b0 := by
  unfold boothMul
  rw [signedValue_decomp]
  ring

“Unfold the hardware definition; rewrite the right-hand side using Lemma 2; let ring discharge the resulting algebraic identity by distributivity (Lemma 3) and commutativity.” That is the entire proof: 17 lines of model, 16 lines of lemmas, 3 lines of main theorem. The Lean kernel checks every step in under a second.

The math-textbook analogy. The proof above can be read at three registers, each for a different audience. The Lean script is for the machine. The terse English version is for the reviewer: “By the signed-value decomposition lemma, the spec equals $a \cdot d_0 + 4 a \cdot d_1$; the hardware computes that directly.” The textbook version is for the architect: “By construction $b = d_0 + 4 d_1$, where $d_0$ and $d_1$ are the Booth digits of the lower and upper triplets. Therefore $a b = a d_0 + 4 a d_1$, which is exactly the hardware output. Q.E.D.” All three say the same thing. The Lean kernel checks the first; the second and third are what you put in a design review document.

Autopilot vs co-pilot. Model checking is an autopilot: write an SVA property, press run, and walk away. The tool returns a green check or a counterexample trace. Theorem proving is a co-pilot: there is no run button; the engineer drives the proof, the machine flags every illegal step. The interaction looks like a dialogue:

Engineer: “I claim the Booth multiplier is correct.”
Machine: “I do not see how. Prove it.”
Engineer: “Unfold the hardware definition.”
Machine: “The goal is now a*d0 + 4*a*d1 = a*sv(b3,b2,b1,b0).”
Engineer: “Rewrite using signedValue_decomp.”
Machine: “The goal is now a*d0 + 4*a*d1 = a*(d0 + 4*d1).”
Engineer: “Apply ring.”
Machine: “Closed. Q.E.D.”

Nipkow’s phrase for this relationship — “a surgical tool for formal proofs about computer science artifacts” — captures the register well. Michael Norrish’s three-line epigraph in Concrete Semantics catches the daily mood:

It’s blatantly clear / You stupid machine, that what / I tell you is true.

The fix, in practice, is never to argue with the machine but to find the missing lemma it is implicitly demanding.

Where the induction is. The Booth proof closed by algebra alone — unfold, rw, ring — with no induction in sight, and that is no accident: the design is a fixed size, four bits and sixteen cases, so the whole obligation fits in a single algebraic identity. The domino argument of Figure 3.26 only engages when a dimension becomes a parameter — $N$ bits wide, a $D$-deep FIFO, $k$ pipeline stages — and one proof must cover every instance at once. Then rewriting is not enough: the proof inducts over the parameter, and meets the move that decides whether a parametric proof closes at all — strengthening the induction hypothesis. The smallest honest example is an accumulator.

A failing proof and how to fix it. If the engineer tries to prove that accumulator correct without generalizing the accumulator, the induction stalls:

def mul_acc : Nat -> Nat -> Nat -> Nat
  | 0,      _, acc => acc
  | n+1, y, acc => mul_acc n y (acc + y)


-- First attempt: looks plausible
theorem mul_acc_zero (x y : Nat) : mul_acc x y 0 = x * y := by
  induction x with
  | zero         => simp [mul_acc]
  | succ n ih     =>
    -- goal: mul_acc (n+1) y 0 = (n+1) * y
    -- unfolds to: mul_acc n y (0+y) = (n+1)*y
    -- but the IH says: mul_acc n y 0 = n*y -- WRONG accumulator
    sorry           -- proof stuck

The induction hypothesis ih fixes the accumulator to 0, but after one step the recursive call passes 0 + y. The hypothesis is the wrong shape for the goal. The fix is to strengthen the property so the accumulator becomes a universal variable:

-- Strengthened theorem: holds for ANY accumulator
theorem mul_acc_eq (x y acc : Nat) :
    mul_acc x y acc = x * y + acc := by
  induction x generalizing acc with
  | zero       => simp [mul_acc]
  | succ n ih => rw [mul_acc, ih]; ring


-- Original follows as a one-line corollary
theorem mul_acc_zero (x y : Nat) : mul_acc x y 0 = x * y := by
  rw [mul_acc_eq]; simp

The key move is induction x generalizing acc: keep the accumulator universally quantified in the hypothesis. The same pattern recurs constantly in hardware proofs — pipelined designs, FIFO depth invariants, divider iteration counts — and is identical in structure to the inductive-invariant strengthening that IC3 (§3.3) performs at the SAT level. When IC3’s MIC procedure carves a stronger blocking clause, it is doing what the engineer just did to the IH: the unstrengthened form was too weak; generalize to a form that survives the transition relation.

Five induction principles the engineer will use. Before the heuristics, the principles themselves. The induction the previous lemmas applied is one of five flavors, each suited to a different shape of design or property:

• Structural induction — on an inductively-defined data type: the natural numbers, Booth digit triplets, lists of pipeline stages.
• Computation induction — on the shape of a recursive function’s call graph; the right principle when the natural variable is the recursion depth, not the index.
• Rule induction — on the derivation of an inductively-defined relation: operational semantics, transition systems, the very structure RTL reasoning naturally fits.
• Well-founded induction — on any relation with no infinite descending chain; the principle behind “for every $N$-core configuration” and “for every FIFO depth $D$” — the proofs the engines hit a wall on.
• Co-induction — for reactive systems and infinite traces, where the property is about behavior that does not terminate: LTL safety, fairness, infinite-trace bisimilarity.

Hardware proofs tend to use structural induction on the data the design processes and well-founded induction on the parametric size of the design. The Booth lemma above is structural; the “every $N$-core” proofs the wall subsection above described are well-founded.

Three heuristics for picking the induction variable. The rules, taught in Nipkow’s Concrete Semantics and applicable directly:

1. Induct on the variable the function recurses on. If mul_acc recurses on its first argument, induct on the first argument.
2. Generalize constants in the goal to variables before inducting. The original mul_acc x y 0 = x * y fails because 0 is a constant fixed in the IH; replacing it with a variable and strengthening the property unblocks the proof.
3. Keep all relevant free variables universally quantified in the IH. Lean’s generalizing keyword, ACL2’s :hints :induct clause, and Coq’s revert tactic all do this.

These three rules close more proof attempts than any other single piece of advice in interactive theorem proving.

Daily automation: aesop and hammers. Long before AI tactic search arrived, Lean and Isabelle engineers had a layer of automation between manual tactics and full proof search. In Lean 4 the canonical tool is aesop (Automated Extensible Search for Obvious Proofs) — a goal-directed proof search that follows a configurable rule database and closes a substantial fraction of routine subgoals in one shot. The engineer’s natural reach when stuck is to type aesop before reaching for a tactic-by-tactic plan; if it fails, the residual goal narrows the manual work. In Isabelle/HOL the equivalent is sledgehammer, which translates the current goal into solver formats (SMT-LIB for Z3 and CVC5, first-order TPTP for the resolution provers Vampire and E), ships it to all of them in parallel, and translates a successful proof back into Isabelle’s native tactic language. Engineers nickname this “the hammer” because it closes goals nobody had time to discharge by hand. Together, aesop-class search and SMT hammers form the pre-AI layer of automation: they sit between the manual tactics above (simp, ring, omega) and the AI tools described next, handling the routine cases so the engineer’s attention focuses on the genuinely creative steps.

Daily workflow with AI in the loop. A real proof session in 2026 looks like this. The engineer opens VSCode with the Lean InfoView pane visible. The current goal is highlighted; previous goals stack underneath. The engineer types a tactic, the kernel updates the goal state in real time. When manual moves stall, the next reach is aesop or a hammer; when those stall, the engineer triggers Lean Copilot (the LeanDojo team’s free LLM-in-Lean, models hosted on Hugging Face), which proposes the next tactic based on the current goal state and a semantic search over Mathlib. The engineer reviews the proposal, accepts or rejects it, the kernel verifies. A productive day closes 5–10 lemmas; a typical project closes a few hundred over a few months. The three layers — manual, automated, AI — compose: AI proposes a tactic, the kernel checks it; if the kernel rejects it, aesop runs on the residual; if aesop fails, the engineer steps in.

Stage 6: close and connect to RTL

A Lean proof of boothMul_correct establishes that the Lean model computes the right answer. The next obligation is to prove that the Lean model matches the actual Verilog. The standard pattern is Sequential Equivalence Checking (§3.5) between a synthesized version of the Lean model (or, in Russinoff’s flow, the RAC modeling language that compiles to both ACL2 and Verilog) and the production RTL. Commercial sequential equivalence checkers all close this leg by SAT-based equivalence. Once both legs close — Lean proves the model correct, SLEC proves the model equivalent to RTL — the chain is unbroken and the RTL inherits the proof.

The final sign-off artifact (§3.9) records this chain explicitly: the property statement, the Lean proof certificate, the SLEC equivalence certificate, and the spec-to-property delta. A safety-critical audit (DO-254, ISO 26262, IEC 62304) inspects the certificate chain; the verdict alone is not accepted.

The other direction: correct-by-construction via extraction. The flow above is retro-verification — write the Verilog, write the Lean model, prove they match. The high-assurance academic alternative is extraction (or correct-by-construction synthesis): write the algorithm in Lean (or Coq), prove it correct, and let the prover generate the Verilog or C automatically. Coq’s Kami, developed at MIT, is the canonical example. Kami lets you write hardware as a Bluespec-style DSL inside Coq, prove the design against its specification, and extract Verilog from the proven design. Several research RISC-V cores have been verified and carried to FPGA implementation this way. Lean 4’s ecosystem is catching up — experimental hardware DSLs (Sloth and adjacent macro-driven hardware-modeling projects) build the same machinery on top of Lean’s macro system, and Lean macros allow a DSL that looks exactly like Verilog while parsing as Lean terms. There is no production-grade “Chisel-for-Lean” in 2026; for a project that wants extraction today, Kami in Coq is the most-deployed answer. The trade-off matches the rest of this section: retro-verification works on existing Verilog (every team can adopt it gradually); extraction requires re-writing the design in the prover (higher initial cost, but the design and the proof stay structurally aligned forever).

If you only have Verilog: deep vs shallow embedding. The reverse problem — taking existing Verilog and getting it into Lean (or any prover) — has no push-button standard. Practitioners use one of two routes. The deep embedding route uses Yosys or an equivalent synthesis front-end to parse Verilog into an AST or intermediate representation (FIRRTL, btor2); the Lean side then ingests the AST and reasons about it as a tree of gate-level operators. This is what most academic Verilog-verification projects do. The shallow embedding route writes the Verilog functionality as a Lean function directly (as our boothMul does above); a separate tool (Centaur Technology’s VL SystemVerilog-to-ACL2 translator — since open-sourced into the ACL2 community books — was the canonical example; in 2026 LLM-assisted transpilers increasingly fill the same role) handles the Verilog-to-Lean translation. Shallow embeddings are easier to prove against but require maintenance as the Verilog changes; deep embeddings are mechanical to produce but more painful to write proofs against. The 2026 rule of thumb: choose deep when the Verilog is large and stable; choose shallow when the design is still evolving and you want fast iteration on the proof.

Stage 7: maintain

A theorem-proving project’s true lifetime begins when the first proof closes. The lemma library accumulated during the project — in our case the three Booth lemmas plus whatever supporting facts they pulled from Mathlib — becomes a reusable asset. The next project (the 8-bit Booth multiplier, the 53-bit FP mantissa multiplier, the FMA fused-multiply-add) starts from this library rather than from scratch.

Maintenance has three concrete activities. First, CI regression: every RTL change re-runs the Lean proofs (and the SLEC equivalence) overnight; a break is caught the morning of the commit. Second, library curation: when the engineer notices a lemma is used in three projects, it gets renamed, generalized, and moved to a shared layer. Third, Mathlib upstreaming: lemmas that turn out to be general mathematics (modular arithmetic, integer divisibility, polynomial identities) get contributed to Mathlib itself, where they become permanent fixtures available to every Lean project worldwide.

Defensive proof engineering: surviving RTL churn. A naive theorem-proving project produces a 100 000-line proof rigidly bound to one specific design. When the designer adds a clock-gating signal or tweaks a pipeline stage, large sections of the proof shatter and have to be rebuilt by hand. Industrial proof teams architect their proofs defensively against this churn. The standard tools are module systems, parameterized proofs (functors in Coq, section/namespace machinery in Lean), typeclasses, and abstraction layers. Instead of proving “this specific 32-bit multiplier satisfies the spec,” the engineer proves “any algebraic ring with these operations satisfies the spec” and then independently proves “the synthesized multiplier is such a ring.” When the RTL changes, only the second proof has to be updated; the first survives unchanged. This is identical in spirit to separating an interface from its implementation in software — one of the few areas where industrial software-engineering culture transfers directly to proof engineering. Mature TP projects spend roughly 20–30% of their effort on this architecture work, and it is what makes the difference between a proof that survives one design generation and one that becomes part of the company’s permanent verification asset.

The lasting product of a theorem-proving program is not the individual proof; it is the curated lemma library it builds. The end-game — already visible in 2026 — is a hardware-arithmetic Mathlib: a comprehensive library of carefully-stated lemmas about RTL primitives (adders, multipliers, dividers, comparators, encoders, FIFOs, memory hierarchies, NoCs) that AI-driven theorem provers can search and compose, applying them to whatever proof obligation the current project poses.

Scaling up: the industrial FPU pyramid

The 4-bit Booth multiplier proof is small. A production FP64 multiplier is the same shape at a different scale — a five-layer pyramid (the cathedral of Figure 3.29 is the same hierarchical proof-tree shape at small scale), following the structure of Russinoff’s Part V:

1. The modeling-language layer. Verilog $\to $ RAC (a bounded-recursion, fixed-width C++ subset) $\to $ ACL2 (or Lean). $\sim $140 kB of Verilog reduces to $\sim $20 kB of RAC by stripping pipelining and clock-gating.
2. The structural-decomposition lemmas. Booth-recoded partial-product tree, $\sim $27 3:2 compressors that reduce 29 partial products to 2, final 106-bit carry-propagate add. Our 4-bit example has 2 partial products and an implicit shift-add; the production version has the same shape with more layers.
3. The rounding-and-encoding lemmas. Normal-case rounding (GRS bits), denormal-case rounding (drnd), underflow/overflow flag generation. Each invokes the generic rounding lemmas in the arithmetic library — in Russinoff’s case Lemmas 6.95 (constant injection) and 6.97 (sticky-bit rounding, six preconditions across four rounding modes).
4. The architectural-spec correspondence. One lemma per output port: the encoded bit-vector equals the IEEE encoding of the mathematically-rounded real result.
5. The RAC–RTL equivalence, done outside the theorem prover by SLEC.

Total per FPU: $\sim $30 lemmas in the dedicated FMUL chapter, resting on $\sim $100+ generic lemmas in the underlying arithmetic library. The generic library is the asset that lives across projects.

One concrete lemma worth meeting. The single most-cited lemma in an industrial multiplier proof is the truncation-of-summand identity (Russinoff Lemma 6.14): if $x \ge 0$, $y \ge 0$, and $x$ has at most $k + \mathrm {expo}(x) - \mathrm {expo}(y)$ significant bits, then

\[ x + \mathrm {RTZ}(y, k) \;=\; \mathrm {RTZ}\bigl (x + y, \; k + \mathrm {expo}(x{+}y) - \mathrm {expo}(y)\bigr ). \]

In English: truncating $y$ before adding to $x$ gives the same result as truncating after, provided $x$ has enough headroom. Every FPU designer knows this intuitively — it justifies the universal hardware trick of pre-truncating one operand to save area. Russinoff’s lemma is that intuition made stringent enough that the proof of the entire FMUL chapter can use it as a single rewriting step.

Why pay the cost: three case studies

Three industrial projects justify the field’s existence and map the cost frontier from two months to twenty person-years.

1. AMD K5 and the FDIV scare (1994–1995). In late 1994 a Lynchburg College mathematician discovered the Intel FDIV bug — five missing entries in the SRT division lookup table. Intel’s recall cost $475 million in cash and immense reputational damage. AMD threw out its tested FDIV hardware, reimplemented division in microcode using FP add/sub/mul, brought in J Strother Moore and Matt Kaufmann from Computational Logic, Inc. (CLI), working with AMD’s lead FPU architect Tom Lynch, on a hard nine-week deadline, and constructed a mathematical proof that the K5’s division microcode would execute correctly for every IEEE 754 input pair before the silicon was fabricated; David Russinoff verified the companion square-root microcode. The K5 shipped on time. Moore’s later observation captures the value: in weekly meetings he would present a lemma he had just discharged and the FPU designers would respond “we knew that.” His reply: “There is a big difference between ‘knowing’ something and writing it down precisely so that it is always true even when interpreted by the most malicious of readers.”

2. Centaur Technology’s daily proof pipeline (late 2000s–2020). Where AMD K5 was a targeted rescue, Centaur (a VIA Technologies x86 subsidiary), whose formal-verification effort Warren A. Hunt Jr. led, turned theorem proving into a continuous-integration workflow. Centaur built infrastructure to automatically translate Verilog netlists into ACL2 logic, ran thousands of proof scripts nightly, and verified execution-unit correctness against the x86 ISA. When a designer checked in an RTL change that broke an invariant, the proof failed and the change was reverted before the next morning’s standup. Theorem proving became daily engineering practice on shipping commercial microprocessors.

3. seL4 and the hardware-software boundary (2009). An NICTA team led by Gerwin Klein published the formal proof of the seL4 microkernel using Isabelle/HOL — the world’s first fully-proven OS kernel. The proof established process isolation, no buffer overruns, no information leaks for $\sim $9,000 lines of C, in about 20 person-years. seL4 became the gold standard for high-assurance systems in defense, aviation, and autonomous vehicles, and demonstrated that the same theorem-proving machinery extends naturally across the hardware-software boundary.

The AI revolution

The cost calculus is changing rapidly, and Lean is at the center. The bottleneck of theorem proving has always been tactic selection — which lemma to apply next, which case to split on, which intermediate goal to introduce. Neural language models trained on millions of proofs are good at exactly this.

AlphaProof + AlphaGeometry 2 (DeepMind, 2024) jointly achieved IMO silver-medal performance — AlphaProof proving its algebra and number-theory solutions in Lean, AlphaGeometry 2 solving the geometry problem with its own symbolic engine. AlphaProof trains a neural policy on synthetic Lean proofs; the policy proposes tactics; the kernel verifies. Failed tactics generate negative training signal; successful proofs become new training data.

Seed-Prover (ByteDance, 2025) pushed the same idea to whole-proof scale: the engineer states a theorem, the system searches Mathlib for relevant lemmas via semantic retrieval, proposes a sequence of tactics, and presents a candidate proof; the engineer reviews; the kernel checks.

LeanDojo and ReProver (Caltech and NVIDIA, 2023) built the foundational retrieval-augmented stack: semantic embeddings of Mathlib lemmas combined with a fine-tuned LLM, autonomously proving a substantial fraction of held-out Mathlib theorems on the LeanDojo benchmark.

Lean Copilot (2024) put the same idea into the engineer’s editor. When the proof goal is stuck, the engineer triggers a completion; an LLM trained on Lean proofs suggests the next tactic; the kernel validates in real time. This is the day-to-day tool that turns the research results above into a workflow an engineer actually uses, in the same way GitHub Copilot turned LLM code completion into a daily tool for software engineers.

Integration architecture. AI-assisted theorem proving differs fundamentally from AI code generation in one critical respect: the kernel is the truth oracle. A few thousand lines of audited Lean type-checker decide what is and is not a valid proof. If the AI proposes a tactic that does not actually prove the goal, the kernel rejects it — the AI’s hallucination is harmless. The engineer can use a black-box LLM with full safety guarantees because the kernel’s verification is independent. This is structurally different from prompting an LLM to write RTL or testbenches, where the verifier is itself fallible.

The chip-scale parallel. This is also the loop the model-checking flow runs at chip scale (§3.2): the LLM proposes the cut, the invariant, the interface contract, and the model checker — not the Lean kernel, but every bit as sound — discharges it. Seed-Prover’s reuse of a growing pool of proven lemmas is precisely the propagation of proven knowledge that converges an SoC, run inside a single proof rather than across a hundred interfaces; AlphaGeometry’s auxiliary construction is the same where to cut, proposed by a net and verified by a sound engine. Probability proposes, logic verifies — the same loop whether the artifact it closes is a theorem or a chip.

Real-world analogy. The AI is not the carpenter who builds the cathedral. The AI is the master apprentice who has read every cathedral blueprint ever written and can whisper “have you tried this dovetail joint?” while the engineer drives the proof. The Lean kernel remains the structural inspector; the cost of finding the right joint collapses.

What this means for hardware verification. AI-assisted Lean is already collapsing FPU and crypto-primitive proofs that took Russinoff’s team years down toward weeks or days. As the tooling matures, SoC-level architectural correctness proofs — a RISC-V ISA implementation, a coherent memory hierarchy, the correctness of an AI accelerator’s INT4/INT8/FP8 quantization paths — move from out of reach toward economically feasible. The same loop then applies to AI accelerators themselves: the model proposes proofs about the quantization hardware, the kernel verifies, the engineer audits.

One caution. The AMD K5 nine-week effort produced a working proof but also, in Moore’s later words, “an ad hoc collection of inelegant formulas — valid, but far from a useful theory of floating-point.” Russinoff started over and followed a disciplined approach to build the reusable arithmetic library on which every subsequent FPU proof now rests. The same trap waits at the AI scale-up: LLMs can produce proofs of individual lemmas cheaply, and cheapness invites a flood of one-off scripts that close their goal and contribute nothing to the next project. The lasting artifact of a theorem-proving program is the curated lemma library, not the count of theorems closed. AI accelerates the labor but does not replace the discipline of factoring the theory.

The unification of engines

Modern industrial proofs do not choose between the engines on the logic ladder — they orchestrate them. The classical orchestration is TP + SMT: an engineer uses a theorem prover (ACL2, Coq, Lean) to lay out the macro architecture of the proof — the high-level lemmas, case splits, induction structures — and the theorem prover dispatches leaf-level arithmetic and bit-vector obligations to an SMT solver (Z3, CVC5, Boolector) that closes them by CDCL search. Lean’s omega (a native linear-arithmetic decision procedure), ACL2’s gl bit-vector package (SAT-backed symbolic execution), and Coq’s SMTCoq bridge to external SMT solvers are all realizations of the same pattern: TP for the structural argument, decision procedures for the leaves.

The four proof systems behind the engines. Why does this dispatch work at all? Every proof tool in the chapter — SAT, SMT, IC3, Lean, Isabelle, Coq — ultimately reduces to one of four classical proof systems, and the choice of system shapes what the tool feels like to use.

• Natural deduction introduces and eliminates connectives directly (assume $A$ to prove $A \to B$); it is the proof-script idiom of Lean, Coq, and Isabelle.
• Sequent calculus reasons about sequents $\Gamma \vdash \Delta $ symmetrically and underlies the analytic structure many automated tactics rely on.
• Hilbert-style calculus has minimal axioms and one inference rule (modus ponens); it is closest to the kernel-checking discipline (small trusted core) but verbose to write proofs in.
• Resolution reduces every proof to refutation of clauses, generates the proof tree by resolving complementary literals, and is the engine inside every SAT solver and every SMT solver’s propositional backbone.

The reason theorem provers and SMT solvers dispatch to each other cleanly is that resolution at the SAT level and natural deduction at the ITP level are interchangeable for the fragments where both apply; the prover’s tactic translates a sub-goal into a clause set, the SMT solver returns SAT or UNSAT, and the prover incorporates the result back into its natural-deduction structure. Knowing the four systems gives the vocabulary for why interactive and automated proof tools cooperate at all.

The 2026 development is the addition of an AI proposal layer that sits above every engine in the stack, not just above TP. The pattern is the same at every rung: the LLM proposes the next move; the underlying engine verifies it; failures are silently discarded.

• SAT. LLM proposes branching, restart strategies, and learnt-clause guidance; the solver verifies satisfiability by CDCL.
• SMT. LLM proposes theory choices, abstraction patterns, and quantifier instantiations; the solver verifies QF-FOL satisfiability over the chosen theory.
• Model checking. LLM proposes assume–guarantee decompositions, helper invariants, abstraction predicates, and refinement strategies under failure; the model checker verifies reachability via IC3 / BMC. The biggest practical unlock of 2026 lives here: SoC-scale model checking that was a multi-year manual campaign in 2022 closes in weeks, because the algorithms (IC3, BMC) are unchanged but the combinatorial orchestration around them — where to cut, which invariants to inject, how to refine when a local proof fails — is now LLM-driven. This is §3.2’s Move 2: propagation of proven knowledge across interfaces, carried over an assume–guarantee decomposition that LLM divide-and-conquer now makes tractable.
• Theorem proving. LLM proposes tactics, premise selections, and induction structures; the kernel verifies each step against the type-theoretic rules.

In every case the engine is mechanical, sound, and independent of the LLM. The LLM cannot fool it; hallucinations are caught and discarded. This is the truth-oracle property, and it applies to the entire formal-verification stack, not to theorem proving alone. It is the realization, at the proof-engine level, of the Emergent Verification thesis that runs through the rest of this book: a substrate of focused components, composed by the engineer to match the shape of the problem rather than wedged into a single monolithic tool.

When to reach for theorem proving

The practical short-list, 2026:

• Floating-point and divide/sqrt units. IEEE 754 conformance, transcendental approximations, denormal handling.
• Cryptographic primitives. AWS s2n (TLS), Apple HomeKit, Cryptol/SAW for AES, SHA, signature schemes.
• Compiler-like transformations. CompCert (C), CakeML (ML), verified-LLVM parts. Same pattern for hardware code generators and HLS verification.
• Safety-critical certification. DO-254 (avionics), ISO 26262 (automotive), IEC 61508 (industrial), MIL-STD-882 (defense system safety).
• AI accelerator numerics. As accelerators ship INT4, INT8, FP8, and custom number formats, proving the hardware quantization matches the framework’s mathematical semantics is exactly the kind of property theorem proving handles and the four-engine stack does not.

For everything else — arbitration, fairness, FIFO depth, protocol handshakes, equivalence after synthesis — the four-engine stack remains the answer. Theorem proving is the ceiling, not the default. But the ceiling is rising as AI-assisted Lean reaches industrial maturity.

Logic and Probability — the siblings

The pattern this chapter has walked — LLM proposes, formal engine verifies — has a deeper name. Formal verification and large language models are siblings of mathematics. The seventy-year estrangement that ended in the late 2020s is the unifying story behind everything in the chapter, and it explains why this works here and only here.

Formal verification models mathematics as logic: discrete symbols, exact rules, sound deduction, decidability where the algebra permits. Frege wrote down predicate logic in 1879; Hilbert proposed the verification program around 1900; Gödel established its limits in 1931; Turing operationalized the engine in 1936; Boyer and Moore built it as a tool in the 1970s. The output is a proof. The verification is mechanical.

Large language models model mathematics as probability: continuous distributions over token sequences, pattern recognition, plausible inference. Shannon laid the information-theoretic foundation in 1948; Bayes had the inference framework two centuries earlier; Rosenblatt built the first perceptron in 1958; the transformer architecture arrived in 2017. The output is a candidate. The verification is whatever oracle the candidate can be compared against. The discipline is fast, creative, scalable, and structurally unreliable.

For seventy years the two siblings worked separately. Logic was rigorous but did not scale — proof discovery is exponential, the inductive case-split is creative work, the lemma library is human labor. Probability scaled but was unreliable — hallucination is structural, not a bug to be patched. The formal community would not trust a neural net; the machine-learning community would not engage with proofs. The two literatures barely cited each other.

The reconciliation, the architectural insight of the late 2020s, is exactly the pattern this chapter has named: let probability drive proposal, let logic drive verification. The LLM is allowed to be fast, creative, and occasionally wrong. The formal engine is allowed to be slow, mechanical, and always right. Pure logic is precise but does not scale; pure probability scales but is unreliable; probability proposes, logic verifies scales and is sound.

Why FV gets there first. This is why formal verification is positioned to be among the first AI domains to reach genuinely human-out-of-the-loop autonomy in production. Not because the AI is smarter here than elsewhere — it is not. Because the logical sibling was already on the shelf, mature, waiting. Other AI domains (code generation, scientific discovery, autonomous decision-making) have to invent their logical half before the pairing works: sound program semantics for code generation, structured ground truth for science, formal safety envelopes for autonomous systems. Formal verification does not have to invent its half. Fifty years of logical machinery were already there.

What changes and what does not. The hard limits remain: a property at the HOL rung is in general undecidable; a property at the FOL rung over an unbounded state space is only semi-decidable; finding a proof can require resources beyond economic justification. The sibling architecture does not remove these limits. What it does is collapse the cost of working within them. Proofs that took a senior FPU verifier driving ACL2 for a year drop to weeks. An SoC model-checking campaign that once stretched across years now closes in weeks. Refinement chains spanning four levels (spec $\to $ algorithmic $\to $ RTL $\to $ gate-level) close end-to-end in days instead of years. The cost frontier moves; the engineer’s role moves with it, from constructor to director to, for the most established domains, auditor.

This is the chapter’s underlying argument and the book’s: the engines were built decades ago; the libraries grew alongside; the languages matured. What changed in the late 2020s is that the second sibling arrived ready — a proposal layer good enough to drive the engines. Logic supplies the soundness, probability supplies the reach, and the cost of working within the known limits finally drops.

3.7 Formal Verification Apps and Use Cases

From the theorem-proving summit, back to the factory floor. A modern formal verification flow exposes its shared engine stack (SAT/SMT, BMC, $k$-induction, IC3/PDR, CEC, SEC) through a set of named apps. The engine layer is largely the same across commercial tools; the differentiation is in workflow, debug, and which apps are mature. What matters for this book is not the catalogue for its own sake but a structural property the most consequential apps share.

The most consequential apps in modern production share one structural property: each removes a specific dependency on the UVM testbench infrastructure and lets formal run earlier in the block’s schedule.

• Formal Property Verification (FPV) — the general-purpose app. Bind SVA properties to the RTL, run the engine portfolio (BMC, $k$-induction, IC3/PDR), collect proofs and counterexamples. Every other app below is a specialization of this one.
• Auto-property / structural checks — the tool auto-generates a set of structural properties (no deadlock, no X-propagation, no arithmetic overflow, no protocol violations) without the user writing any SVA. The fastest path to first-day bug discovery on a new RTL block.
• Datapath / SEC — proves the arithmetic equivalence and transformation equivalence cases of §2.8. Word-level reasoning makes industrial-width arithmetic provable where bit-level CDCL alone times out.
• Clock-domain crossing (CDC) and reset-domain crossing (RDC) — structural and protocol verification of every asynchronous boundary in the design. Catches synchronizer misuses, missing handshakes, gray-code violations.
• Connectivity — proves top-level pins reach the correct internal IP-block boundaries through the right MUX path under the right control-register values; blackboxes most of the chip and runs at SoC scope. The pad-mux verification of every modern SoC handoff.
• Low-power / UPF formal — proves the RTL is consistent with the UPF/CPF power-intent spec: level shifters at every voltage-domain crossing, isolation cells where retention demands them, retention-register policy, power-state legality, isolation-clamp control. Removes the dependency on power-cycling stimulus that would otherwise have to be authored in UVM.
• Register verification — ingests the IP-XACT or SystemRDL register map and emits assertions proving every CSR obeys its declared semantics: RO, RW, W1C, W1S, WC, RC, sticky, lock bits, side-effect-on-read. Removes the dependency on the UVM RAL model and the register-access transactor: the app runs the moment the RTL is partially built.
• X-propagation — drives every uninitialized flop as X and proves no design-relevant output ever becomes X. Removes the dependency on simulation stimulus that has to coincidentally exercise the uninitialized path, and catches a class of post-silicon bugs that otherwise surface only after weeks of bring-up.
• Security / information-flow — proves access-control properties at SoC scope: privileged regions unreachable from non-privileged paths, secret registers cannot leak into observable signals, debug interfaces gated by fuses. Removes the dependency on an exhaustive attack stimulus that simulation cannot generate, because the attack space is the entire input space.
• Coverage closure (formal) — the tool proves which simulation coverage points are theoretically unreachable, eliminating wasted simulation cycles that would otherwise be spent chasing impossible targets.
• Mutation analysis — operates one rung up the meta-ladder, verifying that the assertion set itself catches synthetic bugs introduced into the RTL — the closure check on every other formal effort, developed in §3.8.

The shared pattern is not accidental: formal applied as a named app reaches each verification target without first standing up a full UVM testbench, and the cost of running the app is small once the RTL is compilable. Combined with bug hunting (§3.8), this is the trajectory of §3.1’s division of labor, by which formal and emulation absorb scope that previously belonged to constrained-random simulation: formal closes block-level and many integration-level claims with bounded effort and high confidence; emulation runs production software at near-silicon speed; simulation contracts to the integration layer between them. A few of the most formal-forward production teams are pushing toward this limit — formal-plus-emulation as the primary verification engines, with simulation increasingly the bridge rather than the workhorse. The LLM-assisted assume–guarantee layer (§3.2) is the active integration frontier; as Chapter 1 noted, general-purpose frontier models have proven at least competitive with the EDA-vendor add-ons on property generation and helper-lemma synthesis.

3.8 Bug Hunting with Formal

Formal is usually framed as a proof tool: a property either holds or it does not, and the win condition is a closed proof. Production teams use formal differently — as the fastest bug hunter in the verification toolkit. The bug-hunting use is independent of whether any property eventually proves: even when a property is too large to close, the bounded engine still finds whatever bugs are reachable in the first 50–100 cycles, and finds them fast.

Why formal hunts bugs faster than simulation. A constrained-random simulation explores narrow random walks through the state space. The probability of hitting a specific corner-case bug depends on how well the constraints and the random seed cover the path to that bug. A formal BMC engine, by contrast, explores all states reachable in $k$ cycles — exhaustively, not stochastically. When a property fails, the engine produces a minimal counterexample: the shortest stimulus sequence from reset that triggers the bug. Debug time is constant rather than linear in the simulation queue length.

Bug classes formal catches that simulation often misses.

• Concurrency bugs. Race conditions, ordering issues, and deadlocks happen in narrow timing windows that constrained-random rarely hits. The MSI cache controller is the canonical example: protocol races emerge only when two nodes execute interleaved transactions in a specific order. Formal’s exhaustive interleaving catches them; simulation regressions usually do not.
• Liveness bugs. Deadlock, livelock, and starvation are absence of progress — nothing happens when something should. Simulation runs out of time; the absence of a fire condition looks indistinguishable from “test passed”. Formal liveness properties (s_eventually) state the progress claim and find the witness for its violation — the lasso and ranking-function machinery of §3.4.
• Reset and X-propagation. An uninitialized register produces an X that propagates through downstream logic and corrupts outputs — often invisibly until the corrupted output is observed many cycles later. Formal X-prop apps trace every X back to its source.
• Datapath corner cases. Overflow, underflow, sign-extension errors, partial decoder gaps — the kinds of bugs that hide in the arithmetic that directed simulation tests miss. Formal datapath verification (DPV apps) proves equivalence between the RTL arithmetic and a C-model reference.
• Unreachable code and dead states. Cover properties answer “is this state reachable?”. If it is, formal returns the shortest stimulus that reaches it — useful for directed testing. If it is not, the user has found dead code or an over-constrained spec.
• Spec ambiguity. Writing the property to check forces the spec to be precise. A spec like “the arbiter is fair” becomes req[i] && !gnt[i] |-> s_eventually gnt[i], and writing s_eventually forces the team to decide whether they mean an unbounded promise or a concrete deadline ##[1:k]. The act of writing the property exposes the ambiguity in the spec — which is a bug, in the spec.

When formal does not converge. Most non-trivial properties on non-trivial designs do not close cleanly the first time. The proof times out, or returns “inconclusive” because the engine’s frame frontier grew too large. Three responses:

1. Bound it. Accept a $k$-bounded proof (“property holds for 100 cycles from reset”) instead of an unbounded one. For many properties this is enough — if the bug exists, it usually exhibits within a small bound.
2. Abstract. Apply the techniques of §3.2: data abstraction, symmetry reduction, counter abstraction. Each move trades precision for scalability; pick the move whose precision loss does not affect the property.
3. Decompose. Cut the design at a contract boundary (§3.2) and prove the two halves separately. LLM-assisted partitioning is the current best practice for picking the cut.

A property that resists all three is genuinely too hard for current technology, and the team falls back to simulation and emulation for that scope. The residue is small in current flows: with abstraction, decomposition, and LLM assistance, the fraction of properties that close has gone from “mostly fails” a decade ago to “mostly succeeds” today. Bug hunting with formal works in both regimes — whether or not the property eventually proves, the bounded engine has already found the bugs that simulation would have found after weeks of regression.

Mutation testing: verifying the verification. Bug hunting with formal asks “does the design contain bugs?” — but a useful sibling question is “does my assertion set actually catch bugs if they were present?” A clean formal proof tells the engineer that what was written has been satisfied; it does not tell them whether what was written is enough. A sparse assertion set on a complex block can produce a clean proof that means nothing of value.

Mutation testing closes this meta-gap. The tool synthetically introduces small mutations into the RTL — flip a bit, swap an operator, weaken a guard, invert a comparison, drop an assignment — and re-runs the existing assertion set against each mutated design. A mutation caught by an assertion is evidence the assertion set detects that bug class; a mutation that passes silently is evidence the assertion set has a gap, and a real bug of the same shape would also pass silently. The quality metric is the mutation-kill rate, and modern formal signoff criteria commonly require it to exceed a project-set threshold — often in the 90s — before the freeze is accepted. Mutation analysis is now a built-in capability of the major formal tools.

The conceptual move is worth stating directly: every other formal activity verifies the design against properties; mutation testing verifies the properties against the design. It is the formal analogue of testbench coverage in the simulation world — but stronger, because every mutation is a concrete candidate bug rather than a code line that happened to be reached by random stimulus. A signoff package that includes a high mutation-kill rate is qualitatively different from one that only includes clean proofs: it carries evidence that the proofs are about the right things.

3.9 The Signoff Artifact

The formal verification flow of §3.2 ends at a high-confidence RTL freeze, but a freeze is not a green checkmark on a status slide. It is a structured artifact that downstream consumers — safety auditors, integration teams, silicon validation, and the next-stage emulation flow — can inspect, replay, and trust. Naming the components of that artifact is what makes formal verification a recognized engineering deliverable rather than an internal activity.

Five components make up the freeze deliverable.

The closed property list. Every intent statement extracted in Stage 1 of the flow has a row in a table whose columns are: the natural-language intent, the SVA/PSL property it elaborates to, the proof regime (unbounded, $k$-bounded with $k$ recorded, or sim-handoff), and the engine that closed it (BMC, $k$-induction, IC3/PDR, AVY, CEC/SEC). A row marked “sim-handoff” is not a failure — it is a documented scope-of-formal boundary, and the simulation/emulation gate of Chapter 1’s pipeline is expected to cover it.

The per-block formal-coverage table. Independent of property closure, every block in the design has a coverage row recording: which intent statements applied to it, what abstraction was used (cone-of-influence, data-independence, symmetry, counter, predicate, localization), the proof bound reached, and any helper invariants the proof relied on. Reading this table answers “how much of block $X$ is formally covered?” without re-running the proof.

The assumption-discharge tree. Every assumption used anywhere in the proof must trace either (a) to a property proven elsewhere in the propagation web — the proven context that justified it — or (b) to a documented environment assumption with a named source (a protocol spec, an interface IP datasheet, a verified-by-construction input constraint). An assumption with no documented source is the verification team’s open issue; the freeze is not closed until every leaf of the tree terminates in a proven fact or a named environment fact.

The certificate. Modern formal engines emit a checkable proof artifact alongside the verdict. For SAFE properties this is an inductive invariant in AIGER format (or equivalent); for UNSAFE it is the minimal counterexample trace. The artifact is of size linear in the design, and independently checkable by a small, auditable checker — a handful of SAT queries for a SAFE inductive invariant, a simulation pass for an UNSAFE trace — so the trust boundary moves from “the whole prover” to “the certificate checker.” The Hardware Model Checking Competition made certificates mandatory in 2024, and commercial flows are moving the same direction. For safety-critical industries (automotive ISO 26262, avionics DO-254, medical IEC 62304), the certificate is the artifact the auditor actually inspects — the verdict alone is no longer accepted.

The spec-to-property delta. A short summary recording, for each natural-language intent statement: was it formalized? If not, why (out of scope, deferred to simulation, requires emulation), and where does the responsibility transfer? This is the bridge between the executable specification (the SVA the formal tool checked) and the original specification (the English document the architects wrote). The delta is small in mature flows but never zero, and naming it explicitly is what prevents a hidden gap at the boundary between formal and downstream verification.

The five components together turn the freeze from a single boolean into a closed proof tree with named leaves. A handoff to an integration team, a safety audit, or a silicon-bringup engineer reads the artifact and knows exactly what was proved, under what assumptions, by which engine, with what certificate. The remaining risk is exactly what the spec-to-property delta declares it to be — no more, no less.

That artifact closes the formal leg of the EV pipeline. The next chapters turn to the second leg: Chapter 4 develops the object-oriented and constraint primitives a dynamic testbench is built from, and Chapter 5 assembles them into UVM as it is practiced today.

multiplier width \(n\)	CNF variables	CNF clauses
\(8\)	\(320\)	\(1\,064\)
\(16\)	\(1\,408\)	\(4\,688\)
\(24\)	\(3\,264\)	\(10\,872\)
\(32\)	\(5\,888\)	\(19\,616\)
\(64\)	\(24\,064\)	\(80\,192\)