Emergent Hardware Verification

Chapter 2 RTL Design and Its Formal Contract

This chapter is about the silicon under verification: how SystemVerilog models hardware at Register-Transfer Level (RTL), and which RTL idioms a verification engineer must understand before writing meaningful tests. Chapter 1 framed the verification problem at the methodology level. Chapters 4–6 develop the testbench side. Before any of that is useful, we have to be specific about the design under test.

Beneath TLM and the emergent test plans of Chapter 1 sits the cycle-by-cycle engine of the hardware: the RTL itself. To verify a complex system, the engineer cannot just observe it from the outside — they need a working understanding of how it is constructed at the clock-cycle level. SystemVerilog fuses design, assertions, and verification into a single language. This chapter develops RTL modeling first — concurrent processes, event scheduling, combinational and sequential logic, FSMs — and introduces SystemVerilog Assertions (SVA) only at the level needed to motivate the formal-verification work that follows in subsequent chapters.

2.1 The Concurrency Paradigm

Silicon is concurrent. A hardware chip is a network of signals that all evaluate at the same time across millions of gates. Modeling this concurrency requires independent simulator processes running in parallel to produce a coherent result.

The concurrency model creates two requirements that pull in opposite directions. Non-deterministic execution means the exact order in which concurrent processes wake up and evaluate is impossible to predict. Deterministic outcomes require that the system reach the same steady-state value regardless of execution order. When the final result depends on scheduling order, the design has a race condition — a class of bug that gets baked into silicon and cannot be fixed after fabrication.

SystemVerilog handles this with two complementary paradigms: event-driven programming and RTL modeling.

In an event-driven paradigm, program execution is triggered by specific events — a clock edge, a signal transition — rather than stepping linearly through lines of code. The simulator’s event scheduler manages process scheduling and signal propagation; without it, modeling parallel hardware behavior would be intractable.

RTL modeling provides a cycle-accurate framework for computation and communication. The architecture splits into two domains: registers (edge-triggered flip-flops) capture and store state at each clock event, and combinational logic (clouds of gates) reacts to input changes to compute the next value within the clock cycle. The new value is latched into the registers on the next clock edge. Because data is transferred from one register to another exactly once per clock cycle, this style is called Register-Transfer Level (RTL) modeling. Figure 2.1 shows the basic structure.

The next section examines the building blocks of an RTL model in detail: nets, variables, data types, combinational and sequential logic, and the simulator’s event-scheduling regions that make all of this behave deterministically. A note on what comes after that: once the RTL is written and synthesized into a gate-level netlist, agile verification teams anchor it with formal contracts (assertions and assumptions) that act as early executable specifications, accelerating downstream integration. Those contracts are introduced later in this chapter (§2.4), proven in Chapter 3, and reused in Chapter 6’s actor framework.

A note on framing: Chapter 1’s Models-of-Computation section separated computational power (combinational, FSM, pushdown, Turing) from structural model of computation (sequential, threaded shared-memory, message-passing, actors). The concurrent paradigm this chapter walks through — independent always-blocks, fixed signal topology, no shared memory between modules, communication through wires — is precisely the structural MoC of hardware. Modules are actors; signals are typed messages; the always-block is the message handler. Verifying hardware in software with the same MoC means picking the actor model on the testbench side. SystemVerilog’s class-based UVM uses a different MoC (OOP class hierarchy) and pays the price for the mismatch in factory glue and config-DB plumbing; Chapter 6’s actor framework uses the matching MoC and shows what falls out. The RTL primer in this chapter is the first half of the picture; the actor framework is the second half, and they meet at the same structural shape.

2.2 RTL Design Building Blocks

SystemVerilog provides a structural hierarchy for designs that contain millions of gates. Modules are the primary unit: each module encapsulates a piece of concurrent hardware behavior — registers, combinational logic, sub-module instances — behind a clean interface of input and output ports. Larger designs are built by instantiating modules inside other modules, recursively, until the top-level module represents the whole SoC. Read this section the way you will use it: straight through once, then as a reference — every construct here returns in the design examples of the second half of the chapter, where it earns its place in a proof.

The design elements an RTL engineer composes (Figure 2.2) are: modules as the structural unit; datapaths built from combinational logic; sequential control logic (typically finite state machines) that orchestrates the datapath; memory elements such as SRAMs and FIFOs for storage; and the nets that wire these components together. Building real designs by composing these elements requires precision at the bit level, which is what the next subsection covers.

The hierarchy in Figure 2.2 is fractal. The SoC contains modules; modules contain FSMs and datapaths; datapaths contain registers and logic gates. At every level, each element holds private state, communicates through fixed wire connections, and runs concurrently with its peers — there is no shared global memory and no sequential main thread anywhere in the structure. Chapter 6 (§6.3) develops this observation formally as the hardware-actor isomorphism: each level of the RTL hierarchy is an actor, and the actor model is the structural model of hardware itself. The practical consequence for this chapter is that the modular composition the figure shows is the shape of the silicon, not merely an organizing convenience — the verification methodology the book proposes follows that shape directly.

Nets vs. Variables

When connecting modules, every port (input, output, or inout) corresponds to a metal trace — a net.

SystemVerilog’s logic type is the only data type the modern designer needs to know in most situations. Throughout this book we use logic consistently for both single-driver wires and procedurally-assigned variables; the older Verilog-1995 wire / reg distinction survives only in specific cases described below.

When you must use wire instead of logic: when a single signal is driven by multiple continuous drivers (e.g., a tri-state bus shared by multiple modules). The simulator then resolves contention with a built-in resolution function (a 1-and-0 collision produces x, etc.). logic cannot have multiple drivers; the elaborator will reject it.

When you might still see reg: legacy Verilog-1995 code uses reg to mean “a variable assigned inside an always block” — not, despite the name, a flip-flop. Modern SystemVerilog code should write logic in those positions; reg survives only for compatibility. The rest of this chapter uses logic.

Fundamental Data Types

Hardware models wires and transistors at the bit level. All data in an RTL design is represented either as a single bit or as a bit vector (a bus).

A wire can be modeled with a 2-state value (0, 1) for pure logic, or a 4-state value (0, 1, x, z) that also captures uninitialized state (x) and the high-impedance / floating state (z). Figure 2.3 summarizes SystemVerilog’s core data types together with their bit values, sign-extension behavior, and widths.

Beyond raw bits, SystemVerilog provides constructs that make intent explicit in the source. Enumerated types (enum) define a fixed set of named values, so a state machine can use readable labels instead of magic numbers. Constant parameters are configured at elaboration time, before simulation begins, and let one generic module serve different bus widths; generate blocks then consume those parameters to replicate structure.

The string data type is heavily used in testbenches. SystemVerilog treats it as a built-in primitive — one value, indexable as bytes — rather than requiring users to manage a raw unpacked array of bytes.

Arrays: Packed vs. Unpacked

Arrays in SystemVerilog are either packed or unpacked. A contiguous bit vector — a parallel hardware bus — is a packed array: its bits are stored together and manipulated as a single value. An unpacked array is a sequence of independent elements, each addressed by index.

For multi-dimensional arrays, consider logic [15:0][7:0] d[3:0][15:0];. This is a 2D unpacked array of 2D packed vectors. Indexing a single bit such as d[0][9][5][6] walks from the outermost unpacked dimensions inward to the innermost packed bits. Figure 2.4 shows each form geometrically, and Table 2.1 summarizes when slicing is legal.

Type	Description	Example Declaration
Packed	Treated as a single scalar. Bit-slices are legal.	`logic [7:0] data;`
Unpacked	Array of discrete elements. Bit-slices are illegal.	`logic arr [0:7];`
Multidimensional	Combines both. Unpacked indices map first, then packed.	`logic [3:0][7:0] d [0:15];`

Table 2.1: Comparison of SystemVerilog array types.

User-Defined Types: Structs, Unions, and Packages

As designs grow, basic data types alone leave the source fragmented and harder to maintain. SystemVerilog provides three constructs for grouping and sharing related declarations: struct, union, and package.

Packages for Global Encapsulation

A package is a compilation unit that stores shared declarations: parameter constants, typedef aliases, user-defined structs, and enums. Modules import what they need, which keeps the global namespace clean and lets the same definitions be reused across many modules.

Structs, Unions, and Synthesis Bounds

• Synthesizability of unpacked vs. packed elements. Hardware is built from parallel wires and registers. Unpacked arrays and unpacked structs are synthesizable — the tools map them to separate registers — but unpacked unions are not. A flip-flop cannot have two different sizes simultaneously, so when a union overlays data types onto the same memory, the union must be declared packed. Packing forces all union members into one contiguous bit vector that matches the underlying silicon.
• Structs (struct). A packed struct bundles several variables (e.g., an opcode and its operands) into a single contiguous vector that can be routed across the chip in one cycle. Because it acts as a bit vector, a packed struct can be assigned as a whole, used in arithmetic, or bit-sliced (e.g., instr_data[15:0]). An unpacked struct keeps its members as separate variables, so it cannot be used in arithmetic or bit-sliced; it can still be assigned as a whole (aggregate assignment) to a struct of the same type.
• Unions (union). A packed union allocates one shared contiguous memory location for multiple variables. The same bit array can then be interpreted through different field views without extra logic — useful for casting between a raw bus and a structured instruction format, for example.

ALU Instruction Decoder Example: Leveraging Type Encapsulation

An ALU (Arithmetic Logic Unit) Instruction Decoder is a typical use case for a union. The example below defines a 32-bit union that overlays a structured instruction format (opcode plus two register addresses) on the same 32-bit vector that arrives from memory. The alu_decoder module assigns the raw memory packet into the union and reads out the decoded fields through the structured view.

Definition File: alu_pkg.sv

// 1. Define the Global Package
package alu_pkg;
 parameter int DATA_W = 8;


  typedef enum logic [7:0] {
  OP_NOP     = 8'h00,
  OP_ADD     = 8'h01,
  OP_SUB     = 8'h02,
  OP_MULT = 8'h03
} opcode_e;


  typedef struct packed {
  opcode_e                   opcode;
  logic [DATA_W-1:0]         regA;
  logic [DATA_W-1:0]         regB;
  logic [DATA_W-1:0]         pad;
} format_r_s;


  typedef union packed {
  format_r_s                 rtype;
  logic [(DATA_W*3)+7:0] raw_data;
} instr_u;


endpackage : alu_pkg

Hardware Implementation: alu_decoder.sv

// 2. Implementation Module bridging the Package types
module alu_decoder
     import alu_pkg::*;
(
     input   logic [(DATA_W*3)+7:0] raw_packet_in,
    output opcode_e                exec_opcode,
    output logic [DATA_W-1:0]      exec_regA,
    output logic [DATA_W-1:0]      exec_regB
);


    instr_u decoded_instr;


     always_comb begin
     decoded_instr.raw_data = raw_packet_in;


     exec_opcode = decoded_instr.rtype.opcode;
     exec_regA    = decoded_instr.rtype.regA;
     exec_regB    = decoded_instr.rtype.regB;
    end


endmodule : alu_decoder

Data Abstraction and Data-Oriented Design

The ALU decoder example shows more than RTL organization — it shows the value of data abstraction. In legacy Verilog, a designer or verification engineer would have to route, drive, and monitor 32 individual wires. Encapsulating those wires into a struct or union promotes the raw bits into a named transaction that carries meaning.

This abstraction boundary is the foundation of Data-Oriented Design (DOD) in verification — the design philosophy Chapter 6 (§6.5) develops in full. With transactions as the unit of data, a testbench no longer needs cycle-by-cycle knowledge of how each bit maps to a pin. Verification components can generate, randomize, inject, and scoreboard whole packet_s or instr_u objects. If the underlying bus width or protocol changes later, the data-oriented tests — which operate on the abstraction — continue to work without modification.

Combinational Logic

Combinational logic computes outputs from inputs with no internal state. SystemVerilog offers two styles for modeling it: dataflow continuous assignments, and behavioral procedural blocks.

For simple wire routing, the continuous assign statement (the dataflow style) is sufficient. For decision logic that is easier to express procedurally, always blocks are the natural choice — specifically always@(*) or the modern always_comb. All three constructs infer their sensitivity lists automatically, re-evaluating whenever any input changes.

Although always@(*) and always_comb look functionally identical, always_comb is purpose-built for synthesis. Two specific differences matter. First, an always_comb block executes once at simulation time 0, propagating constants through the logic cloud at startup — the same behavior real hardware exhibits when power is applied. Second, always_comb extends its sensitivity list into any functions called from inside the block, so a missed signal inside a function does not become a silent simulation mismatch.

Combinational logic always uses blocking assignments (=). Blocking statements execute sequentially within the active event region, so each one computes its output before any downstream process reads it. Non-blocking assignments inside combinational logic create simulation/synthesis mismatches and should be reserved for sequential blocks.

To compute within these combinational clouds, SystemVerilog provides a rich set of operators. While many mirror standard C-language constructs, others are uniquely tailored for hardware design:

• Bitwise (&, |, ~, ̂): Perform parallel boolean operations on each corresponding bit of the combined vectors.
Example: assign out_vec = in_vec_A & in_vec_B; (Bitwise AND).
• Logical (&&, ||, !): Evaluate multi-bit vectors as a single boolean truth value (Non-zero is True, All-zero is False).
Example: if (req_vec && !stall) out = 1’b1;
• Reduction (&, |, ̂): Compress a single multi-bit vector down to a 1-bit result.
Example: assign all_ones_flag = &vec; (Recursively ANDs all bits).
• Arithmetic (+, -, *, /): Synthesize adders, subtractors, and multipliers.
Example: assign next_addr = base_addr + offset;

Beyond standard operators, SystemVerilog provides streaming operators ({>>{}} and {<<{}}) for packing, unpacking, and reversing bit order in streams — the everyday tools for reformatting network packets between protocol layers or reorganizing data for memory storage. The four figures that follow walk the four modes one at a time.

The left-to-right pack {>>{A, B}} concatenates its operands in the order written, preserving bit order: A’s bits land in the high end of the destination and B’s bits follow (Figure 2.6).

The right-to-left operator {<<4{A, B}} reverses the chunks: the stream is emitted back-to-front, so B occupies the high end and A the low end — the operands swap as wholes while the bits inside each one keep their order (Figure 2.7).

Adding a slice size makes the reversal granular. With slice size 1, {<<1{A}} reverses A bit by bit — A[0] becomes the most significant bit — the classic bit-mirror used in CRC and LFSR datapaths (Figure 2.8).

With slice size 8, {<<8{A}} reverses byte-sized chunks while leaving the bits inside each byte untouched — exactly the endianness swap that network and memory interfaces perform on every word (Figure 2.9).

Sequential Logic

Where combinational logic computes, sequential logic remembers. It stores state values across clock boundaries using flip-flops and latches. This ability to hold information across time is what makes Finite State Machines (FSMs) possible — the control engines that drive nearly every non-trivial RTL design.

SystemVerilog provides two synthesis-aware constructs for sequential logic: always_ff for edge-triggered flip-flops, and always_latch for level-sensitive latches. Using the right construct lets the synthesis tool warn when a different element is accidentally inferred — a common bug source in legacy code that uses bare always blocks.

Sequential logic always uses non-blocking assignments (<=). The non-blocking operator evaluates the right-hand side immediately but defers the left-hand side update until after all blocking operations in the current time step have completed. This deferred-update behavior matches the way real flip-flops latch in parallel on a clock edge, and prevents the simulation race conditions that bare blocking assignments would produce in sequential code.

The correctness of the load-or-hold register above depends on the simulator ordering blocking and non-blocking updates correctly. The next subsection looks at how the SystemVerilog event scheduler enforces that ordering.

SystemVerilog Event Scheduling Regions

Translating concurrent flip-flop and combinational behavior onto a software simulator running on a single CPU requires a deterministic execution model. The IEEE 1800 SystemVerilog standard defines this model as the Event Scheduling Regions: a centralized scheduler that orders thousands of concurrent always blocks into a single, repeatable execution timeline. Every conformant simulator gives the same result for the same race-free source — which is what makes RTL portable across vendors.

These regions matter because they determine how concurrent assignments are ordered. The scheduler separates the Active queue from the NBA (Non-Blocking Assignment) queue. Blocking assignments (=) evaluate and update inside the Active region, making them the right choice for combinational cascades. Non-blocking assignments (<=) evaluate their right-hand sides in the Active region but defer their left-hand updates to the NBA region. This deferred update is what mirrors how real flip-flops latch in parallel on a clock edge, and it is the mechanism that prevents the simulation race conditions blocking assignments would create in sequential code (Figure 2.11).

The same scheduler also helps with delta-cycle debugging. When several signals appear to transition at the same nanosecond on a waveform viewer, what is actually happening is multiple iterations through the Active loop within a single time step. The simulator does not advance internal time until the Active queue settles, so combinational glitches bounce through the loop until the network reaches a stable value. Knowing this is what lets an engineer debug zero-delay race conditions and combinational loops that look identical to a real edge on the waveform.

The Decline of Program Blocks

SystemVerilog originally introduced program blocks to push all testbench execution into the outer Reactive regions. The intent was to guarantee that the testbench evaluated assertions and drove new stimulus only after the RTL (executing in the Active regions) had settled, eliminating one class of race condition between testbench and DUT.

In modern verification frameworks (UVM and the agile-style EV framework of Chapter 6), program blocks have fallen out of favor. The industry found that segregating the testbench into a separate scheduling region caused synchronization problems, hurt hierarchical reuse, and reduced debug visibility. Today’s testbenches use standard module-based architectures that execute in the same Active and NBA regions as the RTL they verify.

Race-free synchronization between the testbench and RTL is then provided by a different construct: the clocking block.

Clocking Blocks: Synchronizing the Testbench

A clocking block is a declarative construct, typically declared inside a SystemVerilog interface, that synchronizes communication between a testbench module and the RTL Device Under Test (DUT).

Rather than delaying the whole testbench into a separate scheduling region, a clocking block applies setup and hold semantics relative to a specified clock edge. The two halves of the contract are input sampling (#1step), where signals read from the DUT are sampled one delta cycle before the active clock edge so the testbench sees the stable pre-transition values rather than mid-transition values — the analog of a flip-flop’s setup window — and output driving (#0 or delay), where signals driven by the testbench update after the active clock edge resolves, so testbench stimulus is applied only after the DUT has finished evaluating the previous cycle — the analog of a flip-flop’s hold window.

Together these two rules give the testbench race-free communication with the RTL while running in the same scheduling regions as the design — which is why program blocks are no longer needed.

The Simulation Bottleneck and Hardware Emulation

The same scheduling precision that gives software simulation its determinism also caps its speed. Because every concurrent RTL event in the SoC must pass through a centralized event queue, simulation does not scale across CPU cores: a 128-core machine cannot distribute the queue without synchronization overhead that exceeds the benefits of parallelism.

As designs reach into the billions of gates, simulation alone is no longer sufficient. Hardware emulation fills the gap by mapping the RTL onto FPGAs or custom emulator processors. Bypassing the software event queue entirely, emulation runs concurrent gates in real parallel, executing orders of magnitude faster than the software scheduler can. Because the actor testbench is itself synthesizable RTL (Chapter 6; Appendix E), the whole verification environment — not just the DUT — maps onto the emulator with the design, which is what makes the simulation-to-emulation move automatic; Chapter 7 develops this. Chapter 1 introduced this Formal / Simulation / Emulation pipeline; here, it appears as a direct consequence of the scheduler’s serial structure.

2.3 Finite State Machines

The previous subsection introduced sequential logic — registers that hold state across clock edges. State alone is static; the work of digital design comes from transitioning between states in response to inputs. A Finite State Machine (FSM) is the construct that does this. Chapter 1’s Models of Computation section placed FSMs in the computational hierarchy: more capable than combinational logic (which has no memory), less capable than a Turing machine (which has unbounded memory).

It is worth making that comparison concrete. Figure 2.12 shows the Turing-machine model: a finite-state controller driving a read/write head that scans an unbounded tape. Mathematically, that pairing can compute any algorithm a modern computer can. Real silicon cannot have an unbounded tape, but it can pair an FSM (or a network of FSMs) with a very large, addressable system memory — which is what every modern CPU, GPU, and DMA controller comes down to.

On a real chip, the FSM transitions on a clock edge, memory has finite latency, and the address space is bounded. Within those bounds, the FSM coordinates the surrounding combinational datapath and memory traffic (e.g., AXI read/write transactions). The heaviest workloads in modern SoC design — CPU pipelines, cache controllers, DMA engines, and bus arbiters — are all built around FSMs that cooperate with datapaths and memories. Because an FSM that enters an illegal state or skips a transition tends to wedge the block around it, FSMs are among the highest-value targets for functional verification.

Formally, an FSM is the 5-tuple $\langle Q, \Sigma , \delta , s, F \rangle $ shown in Figure 2.13: a state set $Q$, an input alphabet $\Sigma $, a transition function $\delta : Q \times \Sigma \rightarrow Q$ that says which next state each (state, input) pair leads to, a start state $s \in Q$, and a set of accepting states $F \subseteq Q$. In hardware, every FSM has two pieces: a combinational logic cloud that computes the next state (and the outputs) from the current state and inputs, and a set of sequential registers that store the current state. FSMs come in two architectural flavors: Moore machines, where the output depends only on the current state, and Mealy machines, where the output depends on both the current state and the inputs — which lets a Mealy machine react in the same cycle, at the cost of routing inputs into the output cone (Figure 2.14).

Decision rule (Mealy vs. Moore): use Moore unless you specifically need a 0-cycle output reaction. Moore outputs come from registered state and have a clean, single-clock-edge timing path; Mealy outputs blend combinational input into the output cone, which costs timing margin on aggressive frequency targets. Mealy is the right choice for control paths that must respond combinationally to an input on the same cycle (a vending-machine dispense signal that should fire immediately when the last coin lands), but Moore is the safer default. In practice the answer is almost always “Moore, unless timing analysis says otherwise.”

To illustrate the architectural difference, we examine a classic sequence-recognition FSM: the Soda Vending Machine. The machine accumulates state from user inputs and dispenses a soda once at least 15 cents have been deposited using nickels (5¢) and dimes (10¢).

Before writing either implementation, we can define the machine’s behavioral contract once using a SystemVerilog Temporal Specification. The contract is the same for either Mealy or Moore — the implementations differ in cycle latency and gate count, not in the input/output relationship being checked.

SystemVerilog Temporal Specification: abstract_soda_props.sv

// Formal Temporal Properties for a Vending Machine
property p_safety_funds;
 @(posedge clk)
  // Always: A dispense NEVER occurs unless at least 15 cents are registered
  dispense |-> (accumulated_cents >= 15);
endproperty


property p_liveness_dispense;
 @(posedge clk)
  // Liveness: Once 15 cents are cleanly accumulated, a dispense eventually happens
 (accumulated_cents >= 15) |-> s_eventually (dispense);
endproperty


// System Model Execution
a_funds_safe: assert property (p_safety_funds);
a_will_dispense: assert property (p_liveness_dispense);

The Mealy Machine Implementation

Because a Mealy machine routes the current inputs into its output combinational logic, it can react in the same cycle. When the machine holds C10 and a nickel arrives, the combinational logic asserts dispense immediately and the FSM transitions back to IDLE on the next clock edge — no dedicated 15-cent state is needed (Figure 2.15).

The RTL implementation derives dispense inside the always_comb block from both the current state and the inputs (nickel, dime).

Hardware Implementation: soda_machine_mealy.sv

module soda_machine_mealy (
     input       logic clk,
     input       logic rst_n,
     input       logic nickel,
     input       logic dime,
     output logic dispense
);
     // Only 3 physical register states needed for Mealy
     typedef enum logic [1:0] {IDLE=0, C5=1, C10=2} state_t;
     state_t state, next_state;


     always_comb begin
       next_state = state;
       dispense        = 1'b0; // Default off to prevent latches


       case (state)
           IDLE: begin
              if (nickel)           next_state = C5;
              else if (dime)        next_state = C10;
           end
           C5: begin
              if (nickel)           next_state = C10;
              else if (dime) begin
                  dispense      = 1'b1; // COMBINATIONAL OUTPUT
                  next_state = IDLE;
              end
           end
           C10: begin
              if (nickel || dime) begin
                  dispense      = 1'b1; // COMBINATIONAL OUTPUT
                  next_state = IDLE;
              end
           end
       endcase
     end


     always_ff @(posedge clk or negedge rst_n) begin
       if (!rst_n) state <= IDLE;
       else             state <= next_state;
     end


endmodule

The Moore Machine Implementation

A Moore machine derives its outputs from the registered state only. This isolates the outputs from input glitches, giving a clean single-clock-edge timing path at the cost of one extra cycle of latency and one extra state. To assert dispense, the Moore machine has to clock into a dedicated C15 state before the output logic responds on the following cycle (Figure 2.16).

The RTL implementation isolates the dispense computation into its own combinational block that inspects only the registered state variable.

Hardware Implementation: soda_machine_moore.sv

module soda_machine_moore (
     input       logic clk,
     input       logic rst_n,
     input       logic nickel,
     input       logic dime,
     output logic dispense
);
     // Requires a full 4 physical register states for Moore
     typedef enum logic [1:0] {IDLE=0, C5=1, C10=2, C15=3} state_t;
     state_t state, next_state;


     always_comb begin
       next_state = state;
       case (state)
           IDLE: if (nickel) next_state = C5;
                 else if (dime) next_state = C10;
           C5:     if (nickel) next_state = C10;
                 else if (dime) next_state = C15;
           C10:    if (nickel || dime) next_state = C15;
           C15:     next_state = IDLE; // Auto-reset after dispense
       endcase
     end


     always_ff @(posedge clk or negedge rst_n) begin
       if (!rst_n) state <= IDLE;
       else             state <= next_state;
     end


     assign dispense = (state == C15);


endmodule

One vending machine is one FSM, and one FSM can be designed — and checked — by inspection. A real SoC carries thousands of them, all running at once, and no amount of inspection survives that multiplication. The next section confronts the multiplicity head-on and finds the abstraction that tames it.

2.4 Concurrent Distributed Temporal Properties

A modern SoC contains thousands of finite state machines, sequential processes, datapaths, and shared resources running concurrently. Trying to verify such a system by tracking every individual state transition leads to state-space explosion: the cross-product of all local states grows exponentially with the number of concurrent components.

Verification works at a higher level than tracking individual state transitions. Instead, we declare behavioral relationships across time using temporal properties — assertions that name a small number of significant events and the causal or temporal relationships between them. Figure 2.17 shows the picture: three concurrent FSMs, each with its own state trajectory; selected events along each timeline; and dashed causality arrows linking events that one FSM produces and another responds to.

Causality and State Space Abstraction

The leverage of this abstraction is in the explicit ordering and causality of events. Rather than verifying that $FSM_1$ is in $S_{1,2}$ at the same instant $FSM_2$ is in $S_{2,1}$, we verify that event $e_1$ causes event $e_3$. The example: if a network packet arrives ($e_1$), the system must eventually issue a memory write ($e_3$). The thousands of intermediate micro-transitions inside the datapath and memory controller drop out of the property entirely.

By defining temporal properties on causal events rather than full system state, the verification problem reduces from an exponential cross-product to a tractable ordered sequence. Verification engineers can reason about, simulate, and formally prove the correctness of complex behaviors without being overwhelmed by implementation-level state.

Emergence of Hierarchical State Machines

A consequence of this event-driven abstraction is that high-level events and their causal ordering form a top-level state machine of their own.

When one event occurs, and the system begins waiting for its causal response, that expectation is itself a distinct macro-state at the system level. The arrival of the response event triggers the transition out of that macro-state. Verifying causality this way overlays a higher-order Hierarchical State Machine (HSM) across the design.

Consider a traffic-intersection controller (developed in detail in the next subsection). At the implementation level, there are independent FSMs for the Road 1 car sensor, the Road 2 car sensor, a countdown timer, and the light controllers. Enumerating every permutation of those low-level states (Timer=5, Sensor1=Active, Sensor2=Inactive, etc.) is unscalable.

If verification focuses on causal events instead — “Road 1 Sensor Detects Car” ($e_\text {detect}$) causes “Road 1 Light Turns Green” ($e_\text {green}$) — the timer ticks and intermediate sensor micro-transitions drop out. The period between $e_\text {detect}$ and $e_\text {green}$ becomes a single macro-state expressing the system’s intent to serve Road 1. The thousands of internal FSM permutations collapse into one Hierarchical State Machine.

In an Emergent Verification flow, this hierarchical top-state machine — formalized as temporal properties — becomes the living, executable specification of the system. Static documentation is replaced with temporal contracts (“Code is Document”), and the implementation is checked against the architectural intent in both simulation and formal verification.

Interleaved Abstraction

To execute and reason about concurrent FSMs, verification tools rely on an interleaved abstraction. Rather than tracking each local process on its own independent timeline, we model a single global progression of discrete time (synchronous clock ticks, for example). At every discrete time step, the aggregated state of all concurrent FSMs is evaluated together, giving the synchronization points needed to evaluate transitions of the top-level HSM.

By defining and relying on these high-level sequences of causal events as a unified foundation, we can formally derive three essential classifications of temporal properties to describe and constrain systemic behavior: Safety, Liveness, and Fairness. We will rigorously define and formalize these categories in Section 2.4.

Before formalizing those categories, a note on how assertions execute. During a simulation, the assertions used as monitors run concurrently alongside the RTL design. They silently observe the executing signals, looking for specified sequences and properties. If a property is violated, the assertion fails, and the simulator reports the error. We cannot prove the same properties hold along simulation paths the trace did not take.

In a formal verification flow such as model checking, the properties are proven mathematically across all reachable states rather than by sampling under stimulus. State-space explosion bites only monolithic checking — one proof over a whole large design at once — not formal’s reach as such. The escape is what the rest of this chapter earns one design at a time: abstraction collapses the irrelevant state (the FIFO’s data-independence token, the MSI controller’s predicate abstraction, the memory controller’s counter and memory abstractions), and assume–guarantee composition propagates each block’s discharged guarantees as its neighbors’ assumptions — carrying the proof from blocks up to subsystems and full-SoC scope (Chapter 1, §1.6; Chapter 3, §3.1). Block-level proof still comes first, but as a foundation, not a ceiling: those foundational blocks establish the contracts — executable specifications — that let downstream integration and simulation skip re-debugging unit logic.

For matching implementation to specification in traditional simulation, verification engineers often write reference models (in C++ or SystemVerilog) that predict the correct output. The difference between a reference model and an assertion is procedural vs. declarative: a reference model says how to compute the expected result, while an assertion says what must be true. Because assertions are bound close to the design signals, they isolate failures at the moment they happen rather than at a downstream output mismatch, which shortens debug time.

In assertions, we use temporal logic to reason about the time and capture the temporal properties in a succinct and concise declarative form. A general form of SystemVerilog assertion binds a property to a clock, gates it on reset, and links an antecedent to a consequent over a time interval:

assert property (@(posedge clk) disable iff (rst)
                 antecedent |-> ##[1:N] consequent);

Verification Properties

The verification properties capture the intent of the specification and serve as the binding contract for the RTL implementation. We use temporal logic to define these contracts abstractly, without reference to specific implementation details. This separation of “what” from “how” means we can reuse these executable specifications even as the underlying RTL implementation evolves to improve performance or synthesizability.

To build these formal specifications, we rely on a core alphabet of temporal operators that govern behavior across a timeline of clock ticks:

• always: Indicates the condition holds true in every single state.
• eventually: Indicates the condition will hold true in some future state. Interestingly, we can define $eventually$ purely in terms of $always$:

\[ eventually~p~~~~~ \equiv \quad !~always~ (~ !p~ ) \]
• nexttime (or ##1): Indicates the condition will hold true precisely in the next clock tick.
• until: Indicates a condition holds continuously until a new condition is met. In its strong form — where the awaited condition must eventually arrive —

\[ true~ until~ p \quad \equiv \quad eventually~p. \]

Using these foundational operators, we classify the verification properties into the following fundamental categories:

• Invariance / Safety Properties

A Safety Property asserts that “something bad never happens.” It stipulates that across the entire interleaved progression of time, the system never inadvertently enters a specified invalid or forbidden global state. If an error occurs, it can be identified by looking at a finite prefix of the execution trace that terminates precisely when the violation occurs. To reason about such invariance properties conceptually, we use the temporal logic operator “$always$” to indicate the property holds at every moment:

\[\text {always } ( ! \text {bad} ) \]

For instance, returning to our traffic intersection abstraction from earlier in the chapter, a critical safety property dictates that the traffic lights for perpendicular intersecting roads must never both be green simultaneously. In a discrete timeline, this mutually exclusive constraint is evaluated at every single moment in time:

\[ \text {always } ( ! ( \text {road1.green} \text { \&\& } \text {road2.green} ) ) \]

In practical SystemVerilog Assertions (SVA) code, this executable contract is a Boolean safety invariant checked on every clock edge:

SVA: Safety Property (Mutual Exclusion)
```
assert property (@(posedge clk)
     !(road1.green && road2.green)
);
```
• Progress / Liveness Properties

A Liveness Property asserts that “something good eventually happens.” It guarantees the system makes progress in response to stimuli rather than deadlocking or stalling forever. A safety property can be refuted by the first bad state encountered, but a violation of an unbounded liveness property is only fully proven at the limit (no good state ever occurs). The conceptual operator is “$eventually$”.

\[ \text {always } ( \text {condition} \rightarrow \text {eventually } \text {good} ) \]

For example, if a car sensor detects a waiting vehicle on Road 1, a liveness property ensures the traffic controller must eventually change the lights to grant Road 1 a green signal:

\[ \text {always } ( \text {cars.waiting} \rightarrow \text {eventually } \text {cars.crossing} )\]

In SVA, this progress contract ensures that a waiting request is eventually serviced using the strong eventually operator (s_eventually):

SVA: Liveness Property (Eventual Progress)
```
assert property (@(posedge clk)
     cars.waiting |-> s_eventually cars.crossing
);
```
The s_ prefix matters. Bare eventually defaults to a weak operator that can never fail — an infinite trace in which the awaited event never arrives is treated as inconclusive rather than as a counterexample, so the property is unfalsifiable. The strong variant s_eventually is what actually catches the bug: it requires the awaited event to occur, and an infinite trace without it is a genuine falsification. Every liveness assertion in this book uses the strong operator for this reason.
• Fairness Properties

A Fairness Property specifies rules for mediating shared resources, often asserting that “if an entity requests a resource infinitely often, it will be granted that resource infinitely often.” It is essentially a conditioned restriction on the global interleaving, preventing scenarios where one fast component starves out slower processes.

In our traffic analogy, a fairness property guarantees that even if a major highway (Road 1) has continuous, heavy traffic, a waiting car on the intersecting minor street (Road 2) will not be starved infinitely; the minor street will eventually receive its fair turn to cross.

In formal model checking, fairness properties are usually expressed as assumptions: they restrict the input scenarios the tool considers, ruling out the pathological cases where a resource is held forever and excluding those traces from the liveness analysis.

SVA: Fairness Property (Anti-Starvation)
```
assume property (@(posedge clk)
     // Assume Road 2 eventually gets a turn
     s_eventually (intersection.turn == ROAD2)
);
```
• Precedence Properties

While Safety, Liveness, and Fairness are the foundational classifications, we often combine them to form more complex relationships. Precedence properties capture the sequential ordering of events using a temporal operator “$until$”. This is generally a combination of safety and liveness properties (sometimes called safe-liveness) that ensures nothing bad happens until something good happens.

Conceptually:

\[ \text {always } ( \text {request until grant} ) \]

In SVA, the until or throughout operators translate this sequential logic:

SVA: Precedence Property (Safe-Liveness)
```
assert property (@(posedge clk)
     req s_until gnt
     // strong until: req holds every cycle until gnt, and gnt must arrive
);
```

SystemVerilog Assertions

SystemVerilog Assertions (SVA) are the constructs that turn temporal properties into executable code. SVA extends Boolean logic with temporal operators and sequence regular expressions. It also introduces local variables that remember past values across cycles, giving sequence computations more expressive power than plain finite automata.

The Checker Module Methodology

While SVA can technically be embedded directly inside RTL design modules, this practice is strongly discouraged in modern agile verification. Mixing complex procedural design code with declarative property statements creates cluttered, hard-to-maintain files and risks inadvertently altering synthesis behavior.

Instead, the industry standard is to strictly isolate verification intent. All associated assertions, assumptions, and necessary auxiliary modeling logic are encapsulated within a dedicated, separate checker module. This standalone verification module is then virtually instantiated inside the Design Under Test (DUT) hierarchy using the SystemVerilog bind construct.

This separation of concerns offers critical advantages:

• Zero Intrusion: The original RTL source files remain untouched and clean for synthesis.
• Reusability: The identical checker module can be dynamically bound to multiple instances of a component across different block-level and system-level testbenches.
• Formal Readiness: Auxiliary state tracking (like counters or history buffers) needed purely to simplify complex temporal properties is kept cleanly segregated in the checker, avoiding pollution of the silicon design space.

2.5 Round-Robin Arbiter RTL Design

Now that we have covered design primitives, finite state machines, and verification properties, we can put them together into production-ready RTL designs. In an agile hardware flow, writing an executable specification is often the first step — before any synthesizable code is written. The examples that follow walk the full lifecycle: state the architectural intent, build the synthesizable SystemVerilog implementation, and pin the behavior down with formal SystemVerilog Assertion contracts. Six designs follow, and each earns its own tools for the formal kit: the arbiter a symbolic client, the handshake an assume–guarantee contract, the FIFO a data-independence token, the pipelined ALU a sequential-equivalence miter, the MSI controller local-node and predicate abstraction with the CEGAR loop, and the memory controller the counter and memory abstractions. By the end of the chapter, the abstraction toolbox that Chapter 3 catalogues has been earned one design at a time. And none of it stays on paper: every design and every checker these sections print is formally verified, exactly as written, by the proof engines Chapter 3 builds from scratch — each example directory in the companion’s ch2_rtl_fv_examples/ closes its contracts with make prove and refutes a bug-injected twin with make bug.

A round-robin arbiter is a common mechanism used to safely and fairly distribute access to a shared resource among multiple clients. It guarantees that no single client is starved (a Fairness property) and that no two clients are granted simultaneously (a Safety property) by strictly rotating the priority.

Before writing the state machine, we capture these requirements as a SystemVerilog temporal specification. The s_eventually operator lets us express starvation freedom without committing to a cycle-by-cycle implementation.

SystemVerilog Temporal Specification: abstract_arbiter_props.sv

// Formal Temporal Properties for an Abstract Arbiter
property p_safety_mutex;
 @(posedge clk)
  // Always: No two grants are ever issued simultaneously (One-Hot or Zero)
  $onehot0(gnt);
endproperty


property p_liveness_fairness(client_id);
 @(posedge clk)
  // Fairness: If a client requests access, they eventually receive a grant
  req[client_id] |-> s_eventually gnt[client_id];
endproperty


// System Model Execution
a_mutex: assert property (p_safety_mutex);
// (Instantiated in a generate block for all clients)

Figure 2.20 shows the rotating priority as an FSM. The RTL implements that rotation directly, routing the next grant from the current state. Each grant state also holds while only its own request is asserted (e.g. Grant 0 stays put on req==001); these self-loops are omitted from the figure for clarity.

Hardware Implementation: round_robin_arbiter.sv

module round_robin_arbiter (
     input       logic       clk,
     input       logic       rst_n,
     input       logic [2:0] req,
     output logic [2:0] gnt
);
     typedef enum logic [1:0] {IDLE=0, G0=1, G1=2, G2=3} state_t;
     state_t state, next_state;


     // FSM State transitions prioritizing round-robin order
     always_comb begin
       next_state = state; // Default stay
       case (state)
           IDLE: if (req[0]) next_state = G0;
                 else if (req[1]) next_state = G1;
                 else if (req[2]) next_state = G2;
           G0:     if (req[1]) next_state = G1;
                 else if (req[2]) next_state = G2;
                 else if (!req[0]) next_state = IDLE;
           G1:     if (req[2]) next_state = G2;
                 else if (req[0]) next_state = G0;
                 else if (!req[1]) next_state = IDLE;
           G2:     if (req[0]) next_state = G0;
                 else if (req[1]) next_state = G1;
                 else if (!req[2]) next_state = IDLE;
       endcase
     end


     always_ff @(posedge clk or negedge rst_n) begin
       if (!rst_n) state <= IDLE;
       else              state <= next_state;
     end


     // Output decode
     assign gnt[0] = (state == G0);
     assign gnt[1] = (state == G1);
     assign gnt[2] = (state == G2);
endmodule

Starvation Freedom and Structural Fairness

To verify the implementation against the specification without simulating millions of access patterns, we translate the design’s intent into an executable contract. Following common practice, we put the auxiliary tracking state and the SVA properties into a dedicated checker module bound to the DUT. The s_eventually operator covers starvation freedom across unbounded wait times — something that would explode the state space if expressed as a bounded counter.

Formal Contract: round_robin_arbiter_checker.sv

module round_robin_arbiter_checker (
     input logic          clk,
     input logic          rst_n,
     input logic [2:0] req,
     input logic [2:0] gnt
);


     // -------------------------------------------------------------
     // Auxiliary Formal State
     // -------------------------------------------------------------
     // Tracks the exact precedence to make the property trivial
     logic [1:0] last_gnt_id;
     always_ff @(posedge clk or negedge rst_n) begin
          if (!rst_n) last_gnt_id <= 2'd3; // Init
          else if (gnt[0]) last_gnt_id <= 2'd0;
          else if (gnt[1]) last_gnt_id <= 2'd1;
          else if (gnt[2]) last_gnt_id <= 2'd2;
     end


     // -------------------------------------------------------------
     // SVA Contracts
     // -------------------------------------------------------------
     // 1. Safety Property: Mutual Exclusion. Guarantee that two clients never access the resource simultaneously.
     assert property (@(posedge clk)
          $onehot0(gnt)
     );


     // 2. Liveness & Fairness Property: No Starvation. Guarantee that if a client requests, it is eventually granted.
     assert property (@(posedge clk)
          req[0] |-> s_eventually gnt[0]
     );


     // 3. Precedence Property: Strict Round Robin Ordering. Simplified using formal auxiliary logic: if Client 0 went last, and Client 1 is requesting, Client 2 MUST NOT illegally bypass Client 1.
     assert property (@(posedge clk)
          (last_gnt_id == 2'd0 && req[1]) |-> !gnt[2]
     );


endmodule


// Bind the checker to the DUT in the verification environment
bind round_robin_arbiter round_robin_arbiter_checker chk (.*);

The formal-workflow loop applied to these properties. The three assertions above — mutual exclusion, no-starvation, and round-robin precedence — are the recurring shape of every formal proof at the block level. The workflow on this arbiter is the workflow on every other block in the chapter: cover the interesting behaviors first to confirm the proof environment exercises real traffic (a cover for “a request was eventually granted” must hit, otherwise the assumption set is vacuous); assume the input contract (a request, once raised, holds until granted — so the arbiter is not racing a synchronous deassertion); assert the three safety and liveness contracts; bound the proof at a tractable depth if the unbounded engine stalls (a $k$-bounded proof at $k$ deep enough to cover the longest cover trace is sign-off-acceptable per §3.9); close on convergence or escalate to the abstraction techniques of §3.2. With LLM-assisted flows the same loop tightens further — the model proposes the assume and cover set from the spec, drafts the auxiliary tracking state, and triages the first counterexample — so the engineer’s time goes to the assertions that genuinely encode design intent rather than to the boilerplate of standing the proof up. The same loop, with the same five moves, reappears in the asynchronous handshake’s assume–guarantee composition, the FIFO’s data-independence proof, the pipelined ALU’s sequential-equivalence proof, the MSI coherence controller, and the memory controller’s abstraction-driven closure — only the shape of the assertions changes.

Scalability and Symbolic Abstraction

Proving fairness for a 3-client arbiter is trivial, but scaling the same approach to $N = 128$ clients exposes the standard limitation of formal verification: state-space explosion. Instantiating 128 fairness properties via a generate block forces the solver to track the temporal state of 128 concurrent requestors at once, which quickly exhausts memory and runtime.

To get around this, formal engineers combine two moves from the abstraction toolbox: a symbolic constant (a free variable the tool holds arbitrary-but-stable) and symmetry reduction. Because a round-robin arbiter is symmetrical — it treats Client 0 the same as Client 127 — we do not need to prove fairness for all 128 clients individually. Instead, we have the formal tool pick one arbitrary, symbolic client ID.

If the solver proves starvation freedom for this one unconstrained tracked_client_id without ever knowing which specific client it is observing, the property holds for all $N$ clients. This abstraction reduces the formal tracking complexity from $N$ state machines to one.

A single symbolic variable is enough for liveness (no starvation), but proving strict round-robin ordering requires observing the priority handoff between two requestors. We introduce two symbolic variables: client_A and client_B. If the arbiter correctly handles the handoff between any two clients the tool picks, the ordering property holds for all $N$ clients.

The symbolic-property listings in this subsection and the next are sketches for an $N$-client arbiter (here $N = 128$), not standalone-compilable modules: they show the shape of the liveness and precedence properties under the symbolic abstraction, and helper signals such as priority_tracker_A_higher_than_B stand in for state the surrounding formal harness maintains.

Symbolic Abstraction: scalable_fairness.sv

// For N = 128 clients, we define TWO completely symbolic IDs using IEEE standard $stable. The formal tool will choose two arbitrary values out of the 128 possibilities and evaluate the mathematical relationship between them.
logic [6:0] client_A;
logic [6:0] client_B;


// Force the solver to pick a value at time 0 and hold it perfectly constant
always_ff @(posedge clk) begin
  assume($stable(client_A));
  assume($stable(client_B));
end


// 1. Single-Symbol Liveness: Prove that ANY arbitrary client eventually gets served
property p_liveness_scalable_fairness;
  @(posedge clk)
  req[client_A] |-> s_eventually gnt[client_A];
endproperty


// 2. Dual-Symbol Precedence: Prove strict round-robin ordering between ANY two clients. Assume a helper `priority_tracker` tells us Client A has strictly higher priority than Client B. This proves that Client B can NEVER
      jump the queue ahead of Client A.
property p_precedence_scalable_ordering;
  @(posedge clk)
  (req[client_A] && req[client_B] && priority_tracker_A_higher_than_B)
  |-> !gnt[client_B];
endproperty


a_scalable_fairness: assert property (p_liveness_scalable_fairness);
a_scalable_ordering: assert property (p_precedence_scalable_ordering);

Advanced Symbolic Asymmetrical Arbiters

SoC arbiters often mix schemes. A common asymmetrical topology gives Client 0 a strict high-priority control thread while Clients 1–3 form a round-robin pool for bulk data.

Hardcoded properties for four clients are easy enough, but verifying an N-client variant requires an architecture that scales. The technique combines symbolic abstraction with an auxiliary expected-grant computation.

Instead of writing properties that read raw bus signals like req[2] or gnt[3], we add auxiliary formal logic that computes the boolean “expected” outcome using symbolic client IDs. The final assertion is the implication: if some other client is granted, our expected-grant condition must have been false. This reduces multi-tier arbitration logic to a small set of scalable implications.

Formal Contract: advanced_asymmetric_props.sv

// -------------------------------------------------------------
// Scalable Auxiliary Formal State: Computed Expected Outcomes
// -------------------------------------------------------------
// Priority Isolation: Track Round-Robin token, ignoring Client 0
logic [1:0] last_rr_gnt; // Tracks RR clients (e.g., 1 to 3)
always_ff @(posedge clk or negedge rst_n) begin
  if (!rst_n) last_rr_gnt <= 2'd3;
  else if (gnt[1]) last_rr_gnt <= 2'd1;
  else if (gnt[2]) last_rr_gnt <= 2'd2;
  else if (gnt[3]) last_rr_gnt <= 2'd3;
  // ... gracefully scales to N clients
end


// Select arbitrary RR clients to prove universal scalability using IEEE standard $stable
logic [1:0] rr_client_A;
logic [1:0] rr_client_B;


// Force the solver to pick a value at time 0 and hold it perfectly constant
always_ff @(posedge clk) begin
  assume($stable(rr_client_A));
  assume($stable(rr_client_B));
end


// Auxiliary computation: Calculate "A" (The Expected Mathematics)
logic expected_high_priority_grant;
logic expected_client_A_grant;


// Expectation 1: If High Priority requests, it strictly preempts RR Client A
assign expected_high_priority_grant = (req[0] && req[rr_client_A]);


// Expectation 2: If HP is off, and both RR clients request, A must beat B if B was the very last client served (Proving strict anti-starvation).
assign expected_client_A_grant = (!req[0] && req[rr_client_A] && req[rr_client_B] &&
                   (last_rr_gnt == rr_client_B));


// -------------------------------------------------------------
// SVA Contracts: "Other Grant Implies NOT Expected Grant"
// -------------------------------------------------------------
// Proof 1: High Priority with any low priority round robin. If the RR client is granted, it implies HP was NOT expected to beat it.
property p_hp_preempts;
  @(posedge clk)
  gnt[rr_client_A] |-> !expected_high_priority_grant;
endproperty


// Proof 2: When High Priority is off, any two round robin group works as expected. If Client B is granted, it implies Client A was NOT expected to beat Client B.
property p_rr_strict_order;
  @(posedge clk)
  gnt[rr_client_B] |-> !expected_client_A_grant;
endproperty


a_hp_preempts:       assert property (p_hp_preempts);
a_rr_strict_order: assert property (p_rr_strict_order);

2.6 Asynchronous Handshake Controller RTL Design

The arbiter’s proof stayed inside one clock domain; the second design crosses two. When data must travel between two asynchronous clock domains (Clock A and Clock B), we cannot rely on cycle-by-cycle setup and hold times. Instead, we use a dedicated synchronizer. A standard 4-phase Request/Acknowledge (REQ/ACK) handshake transfers a single signal safely by bouncing level-sensitive control signals through a 2-flop synchronizer on each domain crossing.

Why two flops, not one — the metastability MTBF argument. A flip-flop sampling an input that violates its setup/hold window may enter a metastable state where its output sits between the legal logic levels for an unpredictable amount of time before resolving. The probability that the flop has not resolved after a time $t$ falls exponentially:

\[ \text {Failure probability} \propto \exp \!\left (-\,t / \tau \right ), \]

where $\tau $ is the technology-specific resolution time constant (sub-nanosecond for modern CMOS). Routing the metastable output directly into combinational logic risks propagating an undecided value into the design state, producing intermittent and unreproducible bugs. The fix is to add a second flop on the destination clock: the first flop catches the metastable event, the second flop sees an output that has had one full destination clock period to settle. Mean Time Between Failures (MTBF) for a properly designed two-flop synchronizer is on the order of years to centuries; one flop is typically inadequate beyond toy designs. Three flops are used when the destination clock is much faster than $1/\tau $ allows. A CDC design without explicit MTBF analysis is ceremony without justification — always state the synchronizer depth and the assumed $\tau $.

Before looking at the hardware, we define the protocol constraints as a SystemVerilog temporal specification. Because the receiver (Domain B) has no control over when the sender (Domain A) asserts a request, we formalize that boundary as an assumption (assume property). Figure 2.21 shows the topology that this specification governs.

SystemVerilog Temporal Specification: abstract_cdc_props.sv

// Formal Temporal Properties for Async Handshake (Black-Box Interface)


// ASSUMPTION: The external sender (Domain A) must eventually assert Request
property p_assume_environment_liveness;
 @(posedge clk_rx)
  s_eventually (req_in);
endproperty


// ASSERTION: Data Stability (Safety)
// If we are actively processing a request, the sender's data bus MUST NOT change
property p_assert_data_stable;
 @(posedge clk_rx)
 (req_in && !ack_out) |=> $stable(data_payload);
endproperty


// System Model Execution
a_env_live:   assume property (p_assume_environment_liveness);
a_data_safe: assert property (p_assert_data_stable);

The RTL establishes these boundaries by isolating the synchronization flops from the main control state machines so metastability cannot leak into the FSM.

Hardware Implementation: cdc_handshake_rx.sv

module cdc_handshake_rx (
     input    logic clk_b,
     input    logic rst_b_n,


     // Async inputs from Domain A
     input    logic async_req_a,
     input    logic [31:0] async_data_a,


     // Sync outputs back to Domain A and local Domain B
     output logic sync_ack_b,
     output logic [31:0] data_out_b,
     output logic data_valid_b
);


     // 1. 2-Flop Meta-stability Synchronizer
     logic req_meta, sync_req;
     always_ff @(posedge clk_b or negedge rst_b_n) begin
       if (!rst_b_n) {sync_req, req_meta} <= 2'b00;
       else             {sync_req, req_meta} <= {req_meta, async_req_a};
     end


     // 2. Rx State Machine (4-Phase)
     typedef enum logic [1:0] {IDLE=0, CAPTURE=1, WAIT_REQ_FALL=2} state_t;
     state_t state, next_state;


     always_comb begin
       next_state = state;
       case (state)
           IDLE: if (sync_req) next_state = CAPTURE;
           CAPTURE: next_state = WAIT_REQ_FALL;
           WAIT_REQ_FALL: if (!sync_req) next_state = IDLE;
       endcase
     end


     always_ff @(posedge clk_b or negedge rst_b_n) begin
       if (!rst_b_n) begin
           state <= IDLE;
           sync_ack_b <= 1'b0;
           data_valid_b <= 1'b0;
       end else begin
           state <= next_state;
           sync_ack_b <= (next_state != IDLE);
           data_valid_b <= (state == CAPTURE);


           // Safe to latch data ONLY during the capture tick
           if (state == CAPTURE) data_out_b <= async_data_a;
       end
     end


endmodule

Assume-Guarantee Reasoning

Simulators cannot directly model the timing windows where metastability arises, so simulation alone cannot prove freedom from rare resolution failures. Formal verification reasons about the incoming signals as unbounded variables — every legal arrival pattern is considered.

To prevent the formal tool from generating spurious counter-examples (for example, a trace where the transmitter drops its request after a single tick), we establish an assume–guarantee perimeter. assume property constrains the external inputs to follow the 4-phase protocol. Under those assumptions, the formal engine then proves that the receiver always latches stable data.

Formal Contract: cdc_handshake_rx_checker.sv

module cdc_handshake_rx_checker (
     input logic clk_b,
     input logic rst_b_n,
     input logic async_req_a,
     input logic sync_ack_b
);


     // --- ASSUMPTIONS (The Environment Contract) ---
     // The TX domain promises to hold REQ high until it sees ACK go high
     property a_tx_holds_req_until_ack;
      @(posedge clk_b) disable iff (!rst_b_n)
       async_req_a && !sync_ack_b |=> async_req_a;
     endproperty
     assume property (a_tx_holds_req_until_ack);


     // The TX domain promises to completely drop REQ once it sees ACK
     property a_tx_drops_req_after_ack;
      @(posedge clk_b) disable iff (!rst_b_n)
       async_req_a && sync_ack_b |=> !async_req_a;
     endproperty
     assume property (a_tx_drops_req_after_ack);



     // --- GUARANTEES (The Receiver Contract) ---
     // The RX domain guarantees it will eventually raise ACK if REQ arrives
     property g_rx_answers_req_with_ack;
      @(posedge clk_b) disable iff (!rst_b_n)
       async_req_a |-> s_eventually (sync_ack_b);
     endproperty
     assert property (g_rx_answers_req_with_ack);


     // The RX domain guarantees it will drop ACK once REQ drops
     property g_rx_drops_ack_after_req_drop;
      @(posedge clk_b) disable iff (!rst_b_n)
      !async_req_a |-> s_eventually (!sync_ack_b);
     endproperty
     assert property (g_rx_drops_ack_after_req_drop);


endmodule


// Bind statement connecting the formal checker structurally
bind cdc_handshake_rx cdc_handshake_rx_checker chk (.*);

2.7 FIFO RTL Design

The first two designs were control logic; their proofs never touched a payload. The FIFO brings data — and with it the chapter’s first state-explosion wall. A First-In-First-Out (FIFO) buffer is perhaps the most ubiquitous concurrent data structure in hardware design. It safely transports data across boundaries — whether crossing clock domains (asynchronous) or absorbing bursty throughput (synchronous). To verify a FIFO, we cannot just look at waveforms; we must first define its abstract mathematical behavior before proving it in silicon.

We start with an executable specification. Before any synthesizable RTL exists, we model the FIFO’s requirements as a SystemVerilog temporal specification. Using SVA temporal operators (always, s_eventually, the implied nexttime/|=>) rather than the LTL operator symbols ($\Box $, $\Diamond $) keeps the specification executable in both simulation and formal flows. We can specify the atomic boundaries (push and pop) without committing to the underlying hardware implementation.

SystemVerilog Temporal Specification: abstract_fifo_props.sv

// Formal Temporal Properties for an Abstract FIFO
property p_safety_enqueue;
  @(posedge clk)
  // Always: A valid push transitions the sequential state
  (push && !full) |=> (occupancy == $past(occupancy) + 1);
endproperty


property p_safety_dequeue;
  @(posedge clk)
  // Always: A valid pop yields the oldest data and shrinks occupancy
  (pop && !empty) |=> (occupancy == $past(occupancy) - 1);
endproperty


property p_liveness_dequeue;
  @(posedge clk)
  // Fairness/Liveness: If the buffer is not empty, eventually it is popped
  (!empty) |-> s_eventually (pop);
endproperty


// System Model Execution
a_safe_enq: assert property (p_safety_enqueue);
a_safe_deq: assert property (p_safety_dequeue);
a_live_deq: assert property (p_liveness_dequeue);

This specification states the push and pop state transitions as executable temporal properties. The implicit always that wraps every concurrent assert property carries the safety reading: every step the system takes lies in the legal next-state set. The s_eventually (strong eventually) operator carries the liveness reading: if the queue is non-empty, it is eventually popped rather than stalling forever.

Figure 2.22 captures the abstract model. We now move from that abstraction to a synthesizable SystemVerilog implementation, which manages the RAM array and wrap-around pointers explicitly.

Hardware Implementation: sync_fifo.sv

     typedef struct packed {
       logic push;
       logic pop;
       logic [31:0] wdata;
     } fifo_req_t;


     typedef struct packed {
       logic [31:0] rdata;
       logic full;
       logic empty;
     } fifo_rsp_t;


module sync_fifo #(
     parameter DEPTH = 8,
     parameter DATA_W = 32
)(
     input    logic       clk, rst_n,
     input    fifo_req_t req,
     output fifo_rsp_t rsp
);
     localparam AW = $clog2(DEPTH);


     logic [DATA_W-1:0] mem [DEPTH];
     logic [AW:0]          wr_ptr, rd_ptr;


     assign rsp.full     = (wr_ptr[AW] != rd_ptr[AW]) &&
              (wr_ptr[AW-1:0] == rd_ptr[AW-1:0]);
     assign rsp.empty = (wr_ptr == rd_ptr);


     always_ff @(posedge clk or negedge rst_n) begin
       if (!rst_n) begin
           wr_ptr <= '0;
           rd_ptr <= '0;
       end else begin
           if (req.push && !rsp.full)
             wr_ptr <= wr_ptr + 1'b1;
           if (req.pop   && !rsp.empty)
             rd_ptr <= rd_ptr + 1'b1;
       end
     end


     always_ff @(posedge clk) begin
       if (req.push && !rsp.full)
           mem[wr_ptr[AW-1:0]] <= req.wdata;
     end


     assign rsp.rdata = mem[rd_ptr[AW-1:0]];


endmodule

Once the executable specification is embodied in RTL, we must mathematically prove that the silicon matches the abstract sequence specification. However, verifying FIFOs in formal tools runs into a classic obstacle: the state-space explosion problem.

The Shallow Model: Bounded Circular Buffer

If our FIFO is relatively shallow (e.g., $N \le 32$ entries), modern formal property checkers can usually exhaustively explore the entire state space. The most robust way to verify a shallow FIFO is to build an independent, abstract circular buffer model inside the SVA checker module. We mimic the RTL’s behavior using an array, but we write the logic as simply and sequentially as possible, prioritizing mathematical clarity over synthesizability. We then assert that the RTL’s flags and read data exactly match our abstract model at all times.

Formal Contract (Shallow): fifo_bounded_checker.sv

module fifo_bounded_checker #(
     parameter DEPTH = 8,
     parameter DATA_W = 32
)(
     input logic           clk, rst_n,
     input fifo_req_t req,
     input fifo_rsp_t rsp
);
     // Abstract Reference Model
     logic [DATA_W-1:0] model_q [$];


     always_ff @(posedge clk or negedge rst_n) begin
       if (!rst_n) begin
           model_q = {};
       end else begin
           if (req.pop && model_q.size() >0)
             void'(model_q.pop_front());
           if (req.push && model_q.size() <DEPTH)
             model_q.push_back(req.wdata);
       end
     end


     // Equivalence Checking Properties
     a_empty_match: assert property (@(posedge clk)
       rsp.empty == (model_q.size() == 0));
     a_full_match:    assert property (@(posedge clk)
       rsp.full == (model_q.size() == DEPTH));


     a_data_match:    assert property (@(posedge clk)
       (!rsp.empty) |-> rsp.rdata == model_q[0]);


endmodule


// Bind the shallow checker; DEPTH tracks the FIFO under test.
bind sync_fifo fifo_bounded_checker #(.DEPTH(DEPTH), .DATA_W(DATA_W)) b_chk (.*);

The Deep Model: Wolper’s Data Independence (Symbolic Token)

When a FIFO becomes extremely deep (e.g., $N = 1024$ or $4096$), the state space of the memory array explodes asymptotically. The formal tool will time out or run out of memory attempting to prove the circular buffer equivalence.

To break this complexity wall, we use Pierre Wolper’s data independence. A FIFO’s control logic behaves the same regardless of the specific data values flowing through it ($\mathtt {0x00}$ versus $\mathtt {0xFF}$). So instead of tracking all $N$ elements, we inject a single symbolic token — an unconstrained data value placed at an arbitrary point in time. We track only this token’s position in the FIFO, and prove that whenever the token reaches the front of the queue, it emerges uncorrupted.

Formal Contract (Deep): fifo_symbolic_checker.sv

module fifo_symbolic_checker #(
     parameter DEPTH = 1024,
     parameter DATA_W = 32
)(
     input logic         clk, rst_n,
     input fifo_req_t req,
     input fifo_rsp_t rsp
);
     // The arbitrary symbolic token we want to track
     wire [DATA_W-1:0] symbolic_token;
     assume property (@(posedge clk) $stable(symbolic_token));


     // Auxiliary Occupancy Tracker required for Token Injection
     logic [$clog2(DEPTH):0] current_occupancy;
     always_ff @(posedge clk or negedge rst_n) begin
       if (!rst_n) current_occupancy <= '0;
       else current_occupancy <= current_occupancy + (req.push && !rsp.full) - (req.pop && !rsp.empty);
     end


     // Token Tracking Logic
     logic token_in_flight;
     logic [$clog2(DEPTH)-1:0] token_pos;


     always_ff @(posedge clk or negedge rst_n) begin
       if (!rst_n) begin
           token_in_flight <= 1'b0;
           token_pos <= '0;
       end else begin
           // 1. Inject the token at a non-deterministic time
           if (req.push && !rsp.full && !token_in_flight &&
             (req.wdata == symbolic_token)) begin
             token_in_flight <= 1'b1;
             // It enters at the back of the queue; a concurrent pop shifts every element one slot toward the head, so discount it from the position
             token_pos <= ($clog2(DEPTH))'(current_occupancy - (req.pop && !rsp.empty));
           end
           // 2. Token advances forward when elements ahead are popped
           else if (req.pop && !rsp.empty && token_in_flight) begin
             if (token_pos == 0) // Token popped out!
                 token_in_flight <= 1'b0;
             else
                 token_pos <= token_pos - 1'b1;
           end
       end
     end


     // 3. The key property: when the token hits position 0, the RTL must be outputting the exact uncorrupted token value
     a_data_integrity: assert property (@(posedge clk)
       (token_in_flight && token_pos == 0) |->
       (rsp.rdata == symbolic_token));


endmodule


// Bind the deep checker; DEPTH tracks the FIFO under test.
bind sync_fifo fifo_symbolic_checker #(.DEPTH(DEPTH), .DATA_W(DATA_W)) chk (.*);

2.8 Pipelined ALU with Data Forwarding RTL Design

The earlier examples focused on FSMs and control paths. The bulk of the datapath logic inside a modern CPU or GPU lives in deeply pipelined Arithmetic Logic Units (ALUs). A pipeline runs multiple instructions concurrently across different stages to maximize throughput, which introduces data hazards: an instruction in the Execute stage may need the result of a preceding instruction still in flight that has not yet written back to the register file. The hardware resolves this with data forwarding (also called bypassing).

To verify the forwarding network, we start by separating the specification from the cycle-by-cycle implementation. A specification should describe the input/output relation, not RTL syntax.

Languages like TLA+ exist for that kind of abstraction, but SVA’s liveness operators let us write the specification at a comparable level inside the SystemVerilog source. Instead of hardcoding cycle-accurate $past delays tied to a specific pipeline depth, we simply assert that if a valid instruction enters the pipeline, the abstract “golden” result eventually appears at the writeback stage.

SystemVerilog Temporal Specification: abstract_pipeline_props.sv

// Formal Temporal Properties for Pure Mathematical Equivalence


// The "Golden" instantaneous mathematical result
logic [31:0] golden_result;
assign golden_result = abstract_alu_math(fetch_operand_a, fetch_operand_b);


// ASSERTION: Cycle-Agnostic Sequential Equivalence
// We ignore the depth of the pipeline. A valid input must eventually produce the golden result computed for THAT input: the local variable latches the golden at issue, so the eventual match refers to the in-flight instruction rather
    than whatever golden_result happens to hold several cycles later.
property p_equivalence_liveness;
  logic [31:0] expected;
 @(posedge clk)
 (valid_inst, expected = golden_result) |-> s_eventually (writeback_result == expected);
endproperty


// System Model Execution
a_pipe_matches_golden: assert property (p_equivalence_liveness);

Figure 2.23 shows the pipeline and the two bypass paths. The synthesizable RTL implementation has to track the hazard conditions on those bypasses explicitly.

Hardware Implementation: pipelined_alu.sv

// Shared datapath types. The request carries register ADDRESSES; operand VALUES are read from the pipeline's internal register file by address.
package alu_pkg;
     localparam int unsigned XLEN               = 32;
     localparam int unsigned NUM_REGS = 32;
     localparam int unsigned AWIDTH              = 5;           // $clog2(NUM_REGS)


     typedef struct packed {
       logic                      valid;
       logic [AWIDTH-1:0] rs1_addr;
       logic [AWIDTH-1:0] rs2_addr;
       logic [AWIDTH-1:0] rd_addr;
     } alu_req_t;


     typedef struct packed {
       logic                      valid;
       logic [AWIDTH-1:0] rd_addr;
       logic [XLEN-1:0]           alu_result;
     } alu_rsp_t;
endpackage


module pipelined_alu
     import alu_pkg::*;
(
     input       logic     clk,
     input       logic     rst_n,
     input       alu_req_t req,
     output alu_rsp_t rsp
);
     logic [XLEN-1:0] rf [NUM_REGS];              // architectural register file (x0 = 0)
     alu_req_t             ex_reg;                // EX   stage: decoded addresses
     alu_rsp_t             mem_reg;               // MEM stage: carries the computed result


     function automatic logic [XLEN-1:0] rf_read(logic [AWIDTH-1:0] a);
       return (a == '0) ? '0 : rf[a];                             // x0 always reads as zero
     endfunction


     // Operand read with two forwarding paths. MEM (one stage ahead) is more recent than WB (two ahead), so it wins; both override the register file. Together they cover every read-after-write hazard distance.
     function automatic logic [XLEN-1:0] fwd(logic [AWIDTH-1:0] src);
       if (src == '0)                                                             return '0;
       if (mem_reg.valid && mem_reg.rd_addr != '0 && mem_reg.rd_addr == src)
           return mem_reg.alu_result;
       if (rsp.valid && rsp.rd_addr != '0 && rsp.rd_addr == src)
           return rsp.alu_result;
       return rf_read(src);
     endfunction


     // Continuous assignments -- NOT "logic op_a = fwd(...)", which would be a one-time static initializer rather than live combinational logic.
     logic [XLEN-1:0] op_a, op_b;
     assign op_a = fwd(ex_reg.rs1_addr);
     assign op_b = fwd(ex_reg.rs2_addr);


     always_ff @(posedge clk or negedge rst_n) begin
       if (!rst_n) begin
           ex_reg <= '0; mem_reg <= '0; rsp <= '0;
           for (int i = 0; i < NUM_REGS; i++) rf[i] <= XLEN'(i);             // known start state
       end else begin
           ex_reg                   <= req;                     // issue -> EX
           mem_reg.valid            <= ex_reg.valid;            // EX -> MEM: the ADD
           mem_reg.rd_addr          <= ex_reg.rd_addr;
           mem_reg.alu_result <= op_a + op_b;
           rsp                      <= mem_reg;                 // MEM -> WB (output)
           if (rsp.valid && rsp.rd_addr != '0) // commit to the register file
             rf[rsp.rd_addr] <= rsp.alu_result;
       end
     end
endmodule

Sequential Equivalence Checking

Finding a bug in deeply nested if (mem_reg.rd_addr == ex_reg.rs1_addr) forwarding logic using constrained random simulation takes millions of cycles to hit the exact consecutive instruction sequence required to trigger the hazard.

Using sequential equivalence checking, the formal tool compares the pipeline against an unpipelined reference: an architectural register file that does one add per cycle, in order. Because a result appears exactly three cycles after its instruction issues, the equivalence is rsp.alu_result == $past(golden_result, 3) — the in-flight forwarding network must reproduce, at every hazard distance, exactly what the sequential reference computes. A single missing or mis-prioritized bypass makes the two diverge, and the tool exhibits the precise instruction sequence that exposes it.

Formal Contract: pipelined_alu_checker.sv

// Golden reference: a sequential, in-order register file -- one add per cycle, no pipeline, no forwarding. This is the architectural truth the pipeline must match. arch_rf updates one cycle after issue, so each result is visible to the
       next instruction, exactly as forwarding makes it.
module pipelined_alu_checker
     import alu_pkg::*;
(
     input   logic      clk,
     input   logic      rst_n,
     input    alu_req_t req,
     output alu_rsp_t rsp
);
     logic [XLEN-1:0] arch_rf [NUM_REGS];


     function automatic logic [XLEN-1:0] arch_read(logic [AWIDTH-1:0] a);
       return (a == '0) ? '0 : arch_rf[a];
     endfunction


     logic [XLEN-1:0] golden_result;                   // combinational, at issue
     assign golden_result = arch_read(req.rs1_addr) + arch_read(req.rs2_addr);


     always_ff @(posedge clk or negedge rst_n) begin
       if (!rst_n)
           for (int i = 0; i < NUM_REGS; i++) arch_rf[i] <= XLEN'(i);
       else if (req.valid && req.rd_addr != '0)
           arch_rf[req.rd_addr] <= golden_result;      // one result per cycle, in order
     end


     // (a) fixed latency: the result lands exactly three cycles after issue.
     property p_latency;
       @(posedge clk) disable iff (!rst_n)
       req.valid |-> ##3 rsp.valid;
     endproperty
     a_latency: assert property (p_latency);


     // (b) data equivalence: the pipeline output equals the sequential reference computed for the SAME instruction, three cycles earlier.
     property p_data_equiv;
       @(posedge clk) disable iff (!rst_n)
       rsp.valid |-> rsp.alu_result == $past(golden_result, 3);
     endproperty
     a_data_equiv: assert property (p_data_equiv);
endmodule


// Bind the checker onto every instance of the pipeline.
bind pipelined_alu pipelined_alu_checker chk (.*);

Three classes of sequential equivalence. The pipelined-ALU example above is one form of SEC: transformation equivalence, where two RTL models implement the same function but differ in pipeline depth, retiming, clock-gating, or other timing-preserving rewrites. Production SEC tools handle two further classes that are equally important in modern flows.

Arithmetic functional equivalence. The class where simulation cannot close at all. A 32-bit adder has $2^{64}$ distinct input pairs — about $1.8\times 10^{19}$ vectors — already beyond exhaustive simulation. A 32-bit multiplier has the same $2^{64}$ input pairs, but with far harder combinational logic to exercise. Floating-point is worse still: a 64-bit IEEE-754 add has $2^{128}$ input pairs, the result depends on the rounding mode, denormal handling, and exception flags, and “correct” allows several rounding outcomes per case — a simulation strategy of “try lots of random inputs and compare” simply cannot deliver confidence. SEC closes this gap by constructing a miter: the implementation and a mathematical reference (a single-cycle combinational adder, a known-good arithmetic IP, or a closed-form spec) are wired in parallel with a comparator on their outputs, and the formal tool proves (impl_out == ref_out) for every input. Datapath-specialized SEC engines use word-level reasoning — treating a $w$-bit vector as a single algebraic variable rather than $w$ Booleans — to keep these proofs tractable even on industrial-width arithmetic.

C-model versus RTL equivalence. The architectural reference comparison that Chapter 1’s Emergent Verification pipeline (§1.6, “Formal Logical Equivalence” in Figure 1.21) classifies as a sign-off step. The C-model encodes the architect’s intent at the algorithm level — a sequence of operations expressed as a C/C++ function; the RTL is the cycle-accurate implementation. SEC proves that for every input, the RTL output equals the C-model output, after the appropriate latency mapping (the RTL takes $N$ cycles to produce what the C-model computes in zero). The miter construction is the same as the arithmetic case, with the C-model imported through DPI-C or a transaction-level wrapper. The hard work the tool does is in mapping the RTL’s pipeline-internal state onto the C-model’s variables and in proving the equivalence under realistic interface assumptions (back-pressure, valid/ready handshakes, reset behavior).

What ties them together. All three classes share the same engine substrate — a miter, a state-mapping, an unbounded equivalence proof, developed in Chapter 3 (§3.5) — but they sit at different points in the design flow: arithmetic SEC at the datapath block level, transformation SEC after each synthesis or pipelining rewrite, C-model SEC at the architecture-to-implementation handoff. Each one is what unblocks formal sign-off for a class of designs that pure model checking on properties cannot reach: arithmetic blocks have too few interesting properties to enumerate but a single equivalence claim that is exactly what the engineer wants; pipeline rewrites preserve function by definition but tooling has to confirm it; the C-model contains the spec the team wrote and there is no other reference to compare against. The simulation effort to verify these classes by exhaustive stimulus is exponentially out of reach; the SEC effort is a single proof obligation per class, and the proof regime is the same as any other formal sign-off (§3.9).

2.9 MSI Cache Coherence Controller RTL Design

The four designs so far each owned their correctness outright; the fifth confronts a protocol, where correctness lives across multiple agents. Multi-core architectures rely on cache coherence protocols, such as MSI (Modified, Shared, Invalid), to keep distributed caches consistent. Verifying a complete coherence interconnect is a large undertaking, but formal verification works well at the level of an individual cache controller — proving that the controller adheres to the protocol on every legal bus transaction.

The interface separates local read/write commands from system coherence bus broadcasts. On the local CPU side we use standard memory transactions (read, write). On the system coherence bus side those same operations are translated into broadcast snoops, because other caches may hold newer modified copies of the data. A local cache miss broadcasts a Bus Read (snoop_rd); a local write broadcasts a Bus Read Exclusive (snoop_rd_exclusive). The snoop_rd_exclusive broadcast invalidates all other shared copies, so the requesting node has exclusive ownership before it writes.

Before implementing the node structure, we capture the core coherence invariant as a SystemVerilog temporal specification. The defining rule of MSI is that no node can stay in the Modified (exclusive) state once another node issues a read for that address.

SystemVerilog Temporal Specification: abstract_msi_props.sv

// Formal Temporal Properties for Cache Coherence
property p_safety_exclusivity_downgrade;
  @(posedge clk)
  // Always: If a snoop read hits a Modified line, it MUST immediately downgrade
  (snoop_hit && state_is_modified && external_bus_read) |=>
  (state_is_shared || state_is_invalid);
endproperty


property p_safety_flush_dirty_data;
  @(posedge clk)
  // Always: A downgrade from Modified to Shared/Invalid MUST trigger a bus flush
  (snoop_hit && state_is_modified && external_bus_read) |-> bus_flush_req;
endproperty


// System Model Execution
a_must_downgrade: assert property (p_safety_exclusivity_downgrade);
a_must_flush:      assert property (p_safety_flush_dirty_data);

In this capstone example, we design a single node representing a Cache Coherence Controller managing an internal 4-way set-associative cache.

Figure 2.24 shows the node-level architecture. The coherence FSM at the center reacts to local CPU events (cpu_req.read, cpu_req.write) and external bus snoops (bus_req.snoop_rd, bus_req.snoop_rd_exclusive).

Figure 2.25 captures the per-line state machine. The listing below gives the node in full, so it compiles and runs as-is: the module interface, the coherence types, the 4-way combinational tag-match and victim selection, and the sequential update logic in which a snooped bus read downgrades a Modified line to Shared and a snooped read-exclusive invalidates it outright — writing the dirty data back in both cases. The fill path is single-cycle and the victim pointer is a round-robin stand-in for a true PLRU, but the M/S/I transitions are real RTL, and they are exactly what the checker that follows pins down.

Hardware Implementation: msi_cache_node.sv

// Global encapsulation of system-level coherence types
package cache_pkg;
     typedef enum logic [1:0] {I=2'b00, S=2'b01, M=2'b10} msi_t;
     parameter TAG_W = 24;


     typedef struct packed {
       logic                       read;
       logic                       write;
       logic [TAG_W-1:0] tag;
       logic [3:0]                 set;
       logic [31:0]                wdata;
     } cpu_req_t;


     typedef struct packed {
       logic [31:0]                rdata;
       logic                       stall;
     } cpu_rsp_t;


     typedef struct packed {
       logic                       snoop_rd;
       logic                       snoop_rd_exclusive;
       logic [TAG_W-1:0] tag;
       logic [3:0]                 set;
       logic [31:0]                fill_data;
     } bus_req_t;


     typedef struct packed {
       logic [31:0]                evict_data;
       logic                       flush;
     } bus_rsp_t;


     typedef struct packed {
       msi_t                       state;
       logic [TAG_W-1:0] tag;
       logic [31:0]                data;
     } cache_line_t;
endpackage


import cache_pkg::*;


module msi_cache_node #(
     parameter SETS = 16
)(
     input       logic clk,
     input       logic rst_n,


     // Dual Aggregate Bus Interface Maps
     input       cpu_req_t cpu_req,
     output cpu_rsp_t cpu_rsp,


     input       bus_req_t bus_req,
     output bus_rsp_t bus_rsp
);


     // 4-Way Associative Cache Memory Map (Cleanly encapsulated)
     cache_line_t cache_array [0:3][0:SETS-1];


     // Simple 2-bit round-robin victim pointer per set
     logic [1:0] lru_array [0:SETS-1];


     // Combinational Lookup
     logic [3:0] cpu_hit_way;
     logic              cpu_hit;
     logic [3:0] snoop_hit_way;
     logic              snoop_hit;


     always_comb begin
       // Parallel tag match: a way hits only when it holds a VALID line (state != I) whose tag equals the requested tag. Gating on validity is what makes the single-residence (no multi-hit) invariant hold.
       cpu_hit_way            = 4'b0;
       snoop_hit_way = 4'b0;
       for (int w = 0; w < 4; w++) begin
           if (cache_array[w][cpu_req.set].state != I &&
                  cache_array[w][cpu_req.set].tag          == cpu_req.tag)
             cpu_hit_way[w] = 1'b1;
           if (cache_array[w][bus_req.set].state != I &&
                  cache_array[w][bus_req.set].tag          == bus_req.tag)
             snoop_hit_way[w] = 1'b1;
       end
       cpu_hit          = |cpu_hit_way;
       snoop_hit = |snoop_hit_way;
     end


     // Victim way from the round-robin LRU pointer
     logic [1:0] victim_way;
     assign victim_way = lru_array[cpu_req.set];


     // Main Controller Logic
     always_ff @(posedge clk or negedge rst_n) begin
       if (!rst_n) begin
           // Reset: every line Invalid, LRU pointers cleared
           for (int w = 0; w < 4; w++)
             for (int s = 0; s < SETS; s++)
                  cache_array[w][s] <= '{state: I, tag: '0, data: '0};
           for (int s = 0; s < SETS; s++)
             lru_array[s] <= 2'd0;
           cpu_rsp <= '{rdata: '0, stall: 1'b0};
           bus_rsp <= '{evict_data: '0, flush: 1'b0};
       end else begin
           // Default clear (single-cycle response pulses)
           bus_rsp.flush <= 1'b0;
           cpu_rsp.stall <= 1'b0;


           // 1. Snoop requests first (coherence has priority)
           if (snoop_hit) begin
             for (int w = 0; w < 4; w++) begin
                  if (snoop_hit_way[w]) begin
                    // A remote read-exclusive revokes ownership outright (-> I); a remote read only downgrades exclusivity (M -> S). Handle exclusive first so a both-bits-set request lands on I.
                    if (bus_req.snoop_rd_exclusive) begin
                        if (cache_array[w][bus_req.set].state == M) begin
                          bus_rsp.flush            <= 1'b1;
                          bus_rsp.evict_data <= cache_array[w][bus_req.set].data;
                        end
                        cache_array[w][bus_req.set].state <= I;
                    end else if (bus_req.snoop_rd) begin
                        if (cache_array[w][bus_req.set].state == M) begin
                          bus_rsp.flush            <= 1'b1;
                          bus_rsp.evict_data <= cache_array[w][bus_req.set].data;
                        end
                        cache_array[w][bus_req.set].state <= S;
                    end
                  end
             end
           end


           // 2. Process CPU Requests
           else if (cpu_req.read || cpu_req.write) begin
             if (cpu_hit) begin
                  for (int w = 0; w < 4; w++) begin
                    if (cpu_hit_way[w]) begin
                        if (cpu_req.write) begin
                          // Local write upgrades to Modified
                          cache_array[w][cpu_req.set].state <= M;
                          cache_array[w][cpu_req.set].data        <= cpu_req.wdata;
                          cpu_rsp.rdata <= cpu_req.wdata;
                        end else begin
                          cpu_rsp.rdata <= cache_array[w][cpu_req.set].data;
                        end
                        lru_array[cpu_req.set] <= 2'(w + 1);
                    end
                  end
             end else begin
                  // Miss: allocate into the victim way (flush first if it is dirty). cpu_hit == 0 means the tag is in no valid way, so the fill leaves exactly one resident copy -- which is what preserves single residence.
                  automatic logic [31:0] fill = cpu_req.write ? cpu_req.wdata : bus_req.fill_data;
                  if (cache_array[victim_way][cpu_req.set].state == M) begin
                    bus_rsp.flush              <= 1'b1;
                    bus_rsp.evict_data <= cache_array[victim_way][cpu_req.set].data;
                  end
                  cache_array[victim_way][cpu_req.set].tag            <= cpu_req.tag;
                  cache_array[victim_way][cpu_req.set].data           <= fill;
                  cache_array[victim_way][cpu_req.set].state <= cpu_req.write ? M : S;
                  cpu_rsp.rdata <= fill;
                  cpu_rsp.stall <= 1'b1; // miss costs a fill cycle
                  lru_array[cpu_req.set] <= 2'(victim_way + 2'd1);
             end
           end
       end
     end
endmodule

Exclusivity and Local-Node Abstraction

Formal verification is well-suited to coherence protocols. Rather than running long multi-core simulations to hit rare races, we apply local-node abstraction: prove that a single node never violates protocol invariants against an arbitrary, unconstrained bus environment. Combined with a separate compositional argument over the interconnect, the per-node proof scales to the full system without modeling all the cores simultaneously.

The two safety properties are predicate abstraction by hand. A predicate is simply a yes/no question asked of the design’s state — is the line valid?, is it dirty?, is it the only copy? A register holds a value; a predicate is a question about that value whose answer is one bit. Figure 2.26 shows what that buys: read through three such questions, a ${\sim }60$-bit cache line reduces to the three answer bits that name its M/S/I state, and lines that answer alike collapse into the same abstract state no matter how many tag and data bits separate them. The two safety properties above are written entirely over such answers (state_is_modified, snoop_hit, external_bus_read) — never over the data word or the tag. Keeping the answers and discarding the values is predicate abstraction, the most general abstraction in the formal toolbox; here we did it by hand, because the protocol handed us the predicates, and that is exactly why the single-node proof closes over M, S, and I alone.

When the predicates are not obvious: CEGAR. At single-node scope we knew the predicates in advance. Lift the goal to the whole interconnect — no two nodes may hold the same cache line Modified at the same time — and first price the direct proof. Each node carries $4\ \mathrm {ways} \times 16\ \mathrm {sets} = 64$ cache lines of ${\sim }60$ bits each (state, tag, data), about $3{,}700$ bits of state; four nodes are ${\sim }15{,}000$ bits before counting the in-flight bus messages, a reachable space bounded by $2^{15{,}000}$ — a number with some four and a half thousand decimal digits, and that is with this example’s 32-bit data block; an industrial 512-bit line widens it further. No engine can take this head-on: BMC (Bounded Model Checking) pays those $15{,}000$ bits again for every unrolled cycle, and IC3 (Incremental Construction of Inductive Clauses for Indubitable Correctness) must generalize clauses over $15{,}000$ candidate literals — the engines themselves are developed in Chapter 3. And the genuinely new difficulty is that we no longer know which handful of predicates would shrink it: the protocol names each node’s local states, but the facts that decide the global invariant are cross-node facts nobody wrote down.

CEGAR (counterexample-guided abstraction refinement; Clarke et al., CAV 2000) is the loop of Figure 2.27 that discovers those predicates, one per round, from the model checker’s own wrong guesses. Four moves. Abstract: pick a few predicates and build the model that tracks only their answers, under one soundness rule — where the abstraction cannot tell, it must allow both behaviors, so it always over-approximates the RTL (existential abstraction: an abstract step exists wherever any concrete step it could stand for does). Check: model-check the abstract model; with $k$ predicates it has at most $2^k$ states, so this is cheap. Conclude or replay: a proof on the abstraction is final, because the abstraction has every behavior the RTL has and more; a counterexample, though, may live in the “and more,” so it is replayed on the concrete design. Refine: if the trace replays, it is a real bug with a minimal stimulus; if it does not, it is spurious, and the failed replay names the exact distinction the abstraction dropped — add that one predicate and go again.

Now run the loop on the interconnect. Round 1. Seed the predicates the protocol hands us: is node $i$ Modified? — four bits, addresses ignored. The checker answers instantly: counterexample, node 0 and node 1 Modified at once. Replayed on the RTL the trace is legal — the two nodes own different addresses, and the invariant only forbids two writers of the same line. Spurious, and the failed replay says exactly what was dropped: the address. We do not add the 24-bit tags to the abstraction; we add one symbolic address $A$ — arbitrary but held stable, the same witness move the arbiter’s fairness proof used (§2.5) — and prove exclusivity at $A$; symmetry extends the conclusion to every address.

Round 2. The abstraction now tracks each node’s M/S/I state at $A$: eight bits. The checker returns a finer counterexample: node 1 reaches Modified-at-$A$ while node 0 still holds it. The replay fails again — on the real bus, node 1’s upgrade must broadcast snoop_rd_exclusive, node 0 snoops it, and the single-node contract we proved above — a snooped Modified line is surrendered at once, the M$\rightarrow $I edge of the FSM — forces node 0 out before node 1’s write completes. The local proof has just become a lemma of the global one. What the abstraction dropped this time is the message in transit, so the round contributes one more predicate: an invalidation for $A$ is in flight.

Round 3. Nine predicate bits now — $2^9 = 512$ abstract states standing in for $2^{15{,}000}$ concrete ones. The checker sweeps them in milliseconds and finds no counterexample: with the in-flight bit visible, node 1 cannot reach Modified-at-$A$ until node 0 has left it. The property holds on the abstraction, and therefore on the RTL. The proof is closed — and the whole effort was the discovery of which nine questions to ask. Everything else in the $2^{15{,}000}$-state space was irrelevant to exclusivity, and CEGAR is the procedure that finds that out without ever visiting it.

The loop is also runnable. cegar.py in the Chapter 3 companion applies it to this chapter’s cache node with localization as the abstraction: every register outside a kept set becomes a free input the abstract model may drive arbitrarily, and a spurious counterexample names the first freed bit whose invented value the real run contradicts — the replay is a deterministic simulation, since the trace fixes reset and every input. At the node’s full geometry — four ways $\times $ sixteen sets, 288 state bits, where a direct unbounded run drowns in the all-sets snoop cone — single residence closes in one round keeping sixteen bits: the probed set’s state/tag pairs, which are the checker’s own predicates. The same discovery the three rounds above narrate, performed by the machine.

The engineer still owns the proof. Round 1’s seeding was an engineer’s move: known predicates go in up front so the loop does not spend rounds rediscovering the protocol. Two more controls keep it in hand: the iteration budget is capped, so a refinement that will not converge is caught rather than left spinning; and if the loop diverges — predicates multiplying without the proof closing — the engineer breaks it with a manual abstraction or a helper invariant. The loop discovers predicates; the engineer chooses the strategy. And it is the same predicate abstraction the single node received by hand — only the source of the predicates has changed: there the protocol named them, here the spurious counterexamples do.

We build the formal checker module and bind it as in the earlier examples. Notice what the properties read: each way’s state and tag, snoop_hit, snoop_hit_way — never the data word. These are the predicates of Figure 2.26 reappearing as code, and the checker’s first always_comb exists only to name two more of them. The properties encode the snoop contracts whose single-node proof the CEGAR run above borrowed as its lemma — a snooped read downgrades a Modified line, a snooped read-exclusive revokes it outright — plus a structural law: a tag can hit in at most one valid way per set.

Formal Contract: msi_cache_node_checker.sv

import cache_pkg::*;


module msi_cache_node_checker #(
     parameter SETS = 16,
     parameter TAG_W = 24
)(
     input       logic clk,


     // Directly bonded Dual-Bus aggregates
     input       cpu_req_t cpu_req,
     input       bus_req_t bus_req,


     // Probing internal architecture structurally
     input       cache_line_t         cache_array [0:3][0:SETS-1],
     input       logic [3:0]          snoop_hit_way,
     input       logic                snoop_hit
);


     // -------------------------------------------------------------
     // SVA Contracts
     // -------------------------------------------------------------


     // Name the predicates the properties read: a snooped hit per way, split by snoop type (read vs read-exclusive)
     logic valid_snoop_read [0:3];
     logic valid_snoop_excl [0:3];
     always_comb begin
       for (int w=0; w<4; w++) begin
           valid_snoop_read[w] = snoop_hit && snoop_hit_way[w] && bus_req.snoop_rd;
           valid_snoop_excl[w] = snoop_hit && snoop_hit_way[w] && bus_req.snoop_rd_exclusive;
       end
     end


     // 1. Downgrade contract: if an external Bus Read (snoop) hits a line held in M, the line MUST NOT still be M on the next cycle.
     property p_exclusivity_downgrade(way);
       @(posedge clk)
       valid_snoop_read[way] |=>
       (cache_array[way][$past(bus_req.set)].state == S ||
       cache_array[way][$past(bus_req.set)].state == I);
     endproperty


     generate
       for (genvar w=0; w<4; w++) begin : gen_snoop_chk
           assert property (p_exclusivity_downgrade(w));
       end
     endgenerate


     // 2. Invalidate contract: a snooped Read-Exclusive revokes the line outright. Whatever the way held, it MUST be Invalid next cycle. This M->I edge is the lemma Round 2 of the CEGAR run borrowed.
     property p_exclusivity_invalidate(way);
       @(posedge clk)
       valid_snoop_excl[way] |=>
       (cache_array[way][$past(bus_req.set)].state == I);
     endproperty


     generate
       for (genvar w=0; w<4; w++) begin : gen_excl_chk
           assert property (p_exclusivity_invalidate(w));
       end
     endgenerate


     // 3. Structure Property: No Multi-Hit (Single Residence). A fundamental structural law of associative caches: a specific address (Tag + Set) can only reside in AT MOST ONE valid way at a time. If this fails, the replacement FSM logic
           has catastrophically failed.
     always_comb begin
       // Meaningful local variables for structural checking
       logic                     way_valid, other_valid;
       logic [TAG_W-1:0] way_tag, other_tag;


       for (int s=0; s<SETS; s++) begin
           for (int w=0; w<4; w++) begin
             way_valid = (cache_array[w][s].state != I);
             way_tag        = cache_array[w][s].tag;


             if (way_valid) begin
                  for (int w2=w+1; w2<4; w2++) begin
                    other_valid = (cache_array[w2][s].state != I);
                    other_tag     = cache_array[w2][s].tag;


                    if (other_valid) begin
                         assert(way_tag != other_tag) else
                           $error("Structural Coherence Failure: Multi-Hit in Set %0d", s);
                    end
                  end
             end
           end
       end
     end


endmodule


// Bind statement connecting the formal checker structurally
bind msi_cache_node msi_cache_node_checker #(.SETS(SETS)) chk (.*);

2.10 Memory Controller with Refresh RTL Design

A memory controller is the design class where the formal verification challenge is least about which property to write and most about which abstraction makes the proof close. The controller has three counters running at three time scales — a refresh counter that ticks every few thousand cycles, a per-burst refresh-phase counter of around a hundred cycles, and a per-row activation counter of around fifty cycles — and a backing store of thousands of addressable entries. A naive unbounded proof would have to enumerate the state-space diameter of all three counters multiplied with all reachable memory contents: of order $\mathrm {REFRESH\_PERIOD} \times \mathrm {tRFC} \times \mathrm {tRAS} \times 2^{\,2^{\mathrm {ADDR\_W}} \cdot \mathrm {DATA\_W}}$, which is comfortably out of reach for any solver. The same proof with counter abstraction on the three counters and memory abstraction on the backing store closes in seconds on a production formal engine (the chapter’s companion closes it too, liveness included — make abs). Chapter 3 names these techniques in the abstract; this section instantiates them on a concrete design that every SoC ships with.

The interface is the standard memory-mapped contract: a command transaction carries the read/write operation, address, and (for writes) data; a handshake cmd_ready gates new commands; a response transaction returns read data; a side-band refresh_busy signals upstream when the controller is unavailable. Internally the controller cycles a small FSM through IDLE, ACTIVATING, ACTIVE, PRECHARGING, and REFRESHING, gated by the three counters.

The timing parameters, the FSM state encoding, the operation type, and the two transaction structs are collected in a shared package so the design module, the verification checker, and any future testbench code use the same types. The refactor is small but pays back several times over once the proof environment grows.

Definition File: mem_ctrl_pkg.sv

package mem_ctrl_pkg;
 // Default sizing + timing (override per-instance as needed)
  parameter int ADDR_W              = 12;    // 4K-entry backing store
  parameter int DATA_W              = 32;
  parameter int REFRESH_PERIOD = 8192;       // tREFI (cycles between refreshes)
  parameter int TRFC                = 100;   // refresh duration (cycles)
  parameter int TRAS                = 50;    // row activate hold (cycles)


 // FSM states
  typedef enum logic [2:0] {
   IDLE, ACTIVATING, ACTIVE, PRECHARGING, REFRESHING
 } state_e;


 // Operation type
  typedef enum logic { OP_READ = 1'b0, OP_WRITE = 1'b1 } op_e;


 // Command transaction
  typedef struct packed {
   logic                    valid;
   op_e                     op;
   logic [ADDR_W-1:0]       addr;
   logic [DATA_W-1:0]       data;
 } cmd_t;


 // Response transaction
  typedef struct packed {
   logic                    valid;
   logic [DATA_W-1:0]       data;
 } rsp_t;
endpackage : mem_ctrl_pkg

The five-state FSM and the three counters that gate its transitions are shown in Figure 2.28. Each non-IDLE state has a counter associated with it (active_counter during ACTIVE, refresh_phase_counter during REFRESHING, and so on); each transition is gated on either a counter expiring or the global cycles_since_refresh reaching REFRESH_PERIOD.

With these definitions in place, we capture the rules the controller must obey as a SystemVerilog temporal specification. These four properties are the contract any compliant memory controller satisfies; the implementation below is one realization of that contract.

SystemVerilog Temporal Specification: abstract_mem_ctrl_props.sv

import mem_ctrl_pkg::*;


// Formal Temporal Properties for the Memory Controller (observational)


// 1. Refresh periodicity: between successive refresh events, at most REFRESH_PERIOD cycles pass.
property p_periodicity_refresh;
  @(posedge clk) disable iff (!rst_n)
  $fell(refresh_busy) |-> ##[1:REFRESH_PERIOD] $rose(refresh_busy);
endproperty


// 2. Mutual exclusion: no response while a refresh is in flight.
property p_safety_no_rw_during_refresh;
  @(posedge clk) disable iff (!rst_n)
  refresh_busy |-> !rsp.valid;
endproperty


// 3. Write-then-read data integrity: a read of an address returns the last value written there. The write-to-read distance is unbounded, so the expected value is held in a small shadow register -- auxiliary verification state, not
    design logic. A plain register (no SVA local variable) is what every formal tool handles cleanly. Shown here for address 0; the formal harness below generalizes it to an arbitrary address by symmetry.
logic [DATA_W-1:0] wr_shadow;       // last value written to address 0
logic                 wr_seen;      // a write to address 0 has been accepted
always_ff @(posedge clk or negedge rst_n)
  if (!rst_n)
    wr_seen <= 1'b0;
  else if (cmd.valid && cmd_ready && cmd.op == OP_WRITE && cmd.addr == '0) begin
    wr_shadow <= cmd.data;
    wr_seen      <= 1'b1;
  end


property p_safety_write_then_read;
  @(posedge clk) disable iff (!rst_n)
  (cmd.valid && cmd_ready && cmd.op == OP_READ && cmd.addr == '0 && wr_seen)
  |-> ##[1:$] (rsp.valid && rsp.data == wr_shadow);
endproperty


// 4. Liveness: every accepted command eventually produces a response.
property p_liveness_command_completes;
  @(posedge clk) disable iff (!rst_n)
  (cmd.valid && cmd_ready) |-> s_eventually(rsp.valid);
endproperty


// System Model Execution
a_periodicity:        assert property (p_periodicity_refresh);
a_mutex_refresh:      assert property (p_safety_no_rw_during_refresh);
a_write_then_read: assert property (p_safety_write_then_read);
a_liveness:           assert property (p_liveness_command_completes);

Property 3 is written for a single concrete address 0; the data-integrity proof generalizes over all addresses by the memory-abstraction argument developed below, so the controller does not need a separate assertion per address.

The implementation is a single-cycle abstraction of DRAM timing — one cycle per FSM transition, with the three counters carrying the multi-scale state-diameter shape that the verification will have to abstract away.

Hardware Implementation: mem_ctrl.sv

import mem_ctrl_pkg::*;


module mem_ctrl #(
     parameter int REFRESH_PERIOD = mem_ctrl_pkg::REFRESH_PERIOD,
     parameter int TRFC                   = mem_ctrl_pkg::TRFC,
     parameter int TRAS                   = mem_ctrl_pkg::TRAS
)(
     input       logic clk,
     input       logic rst_n,
     input       cmd_t cmd,               // command channel (struct from mem_ctrl_pkg)
     output logic cmd_ready,
     output rsp_t rsp,                    // response channel
     output logic refresh_busy
);


     state_e state, next_state;


     logic [$clog2(REFRESH_PERIOD):0] cycles_since_refresh;
     logic [$clog2(TRFC)-1:0]                      refresh_phase_counter;
     logic [$clog2(TRAS)-1:0]                      active_counter;


     // Backing store (the memory-abstraction target in the proof environment)
     logic [DATA_W-1:0] mem [0:(1<<ADDR_W)-1];


     // Latched copy of the accepted command (see the capture below)
     cmd_t cmd_q;


     // refresh is due when the long counter reaches REFRESH_PERIOD
     logic refresh_due;
     assign refresh_due = (cycles_since_refresh == REFRESH_PERIOD);


     always_comb begin
       next_state = state;
       unique case (state)
           IDLE:              next_state = refresh_due ? REFRESHING :
                                           cmd.valid     ? ACTIVATING : IDLE;
           ACTIVATING:        next_state = ACTIVE;
           ACTIVE:            next_state = (active_counter == 0) ? PRECHARGING : ACTIVE;
           PRECHARGING: next_state = IDLE;
           REFRESHING:
             next_state = (refresh_phase_counter == 0) ? IDLE : REFRESHING;
           default:           next_state = IDLE;
       endcase
     end


     always_ff @(posedge clk or negedge rst_n) begin
       if (!rst_n) begin
           state                       <= IDLE;
           cycles_since_refresh        <= '0;
           refresh_phase_counter <= '0;
           active_counter             <= '0;
           rsp                         <= '{valid: 1'b0, data: '0};
       end else begin
           state        <= next_state;
           rsp.valid <= 1'b0;


           // Capture the command at the accept cycle (IDLE -> ACTIVATING); the producer may change cmd once the handshake completes, so the access tRAS cycles later must use the latched copy, not the live bus.
           if (state == IDLE && next_state == ACTIVATING)
             cmd_q <= cmd;


           // Long counter: tick every cycle; clear when a refresh completes
           if (state == REFRESHING && next_state == IDLE)
             cycles_since_refresh <= '0;
           else if (!refresh_due)
             cycles_since_refresh <= cycles_since_refresh + 1'b1;


           // Refresh-phase counter: load tRFC on entry; decrement in REFRESHING
           if (state == IDLE && next_state == REFRESHING)
             refresh_phase_counter <= TRFC - 1;
           else if (state == REFRESHING && refresh_phase_counter != 0)
             refresh_phase_counter <= refresh_phase_counter - 1'b1;


           // Active (row-open) counter: load tRAS on entry; decrement in ACTIVE
           if (state == ACTIVATING && next_state == ACTIVE)
             active_counter <= TRAS - 1;
           else if (state == ACTIVE && active_counter != 0)
             active_counter <= active_counter - 1'b1;


           // Memory port: on ACTIVE with row open, perform the access using the command captured at acceptance (not the live, possibly-changed bus)
           if (state == ACTIVE && active_counter == 0) begin
             if (cmd_q.op == OP_WRITE) mem[cmd_q.addr] <= cmd_q.data;
             rsp.valid <= 1'b1;
             rsp.data     <= (cmd_q.op == OP_READ) ? mem[cmd_q.addr] : '0;
           end
       end
     end


     // Status outputs
     assign refresh_busy = (state == REFRESHING);
     assign cmd_ready           = (state == IDLE) && !refresh_due;


endmodule

The implementation is a single-cycle abstraction of DRAM timing: $\mathrm {tRCD}$, $\mathrm {tRP}$, and the bank-conflict checks of a real datasheet collapse to one cycle per FSM transition, while the three counters (cycles_since_refresh, refresh_phase_counter, active_counter) preserve the multi-scale state-diameter shape that makes the verification challenge realistic. The four properties of abstract_mem_ctrl_props.sv hold for this implementation by inspection, but proving them formally is what the next subsection demonstrates.

Counter Abstraction and Memory Abstraction Applied

The four properties cannot be proven directly. Property 1 (refresh periodicity) requires the engine to traverse the full REFRESH_PERIOD cycles of cycles_since_refresh; Property 3 (write-then-read) requires the engine to track every entry of the mem array across every refresh phase; Property 4 (liveness) compounds both, because the liveness loop must walk the full $\mathrm {REFRESH\_PERIOD} \times \mathrm {tRFC} \times \mathrm {tRAS}$ state product before it can return to IDLE. Two abstractions, applied to the formal checker rather than to the design itself, collapse this proof obligation to something modern engines close in seconds.

Counter abstraction on cycles_since_refresh, refresh_phase_counter, and active_counter. Each concrete counter is cut and replaced with a small expected-counter FSM whose states are exactly the values the properties refer to: { $0$, $0 < x < \mathrm {REFRESH\_PERIOD}$, $\mathrm {REFRESH\_PERIOD}$ } for the long refresh counter, { $0$, $\mathrm {tRFC}$ } for the refresh-phase counter, { $0$, $\mathrm {tRAS}$ } for the activate counter. The three concrete counter cones drop to a constant total state count regardless of how the parameters scale.

Figure 2.29 shows the three abstract counter FSMs that replace the three concrete counters in the proof environment. Each one has the minimum number of states the corresponding property mentions: the refresh counter needs to distinguish $0$, “below the limit,” and “at the limit”; the phase and active counters only need “done” versus “busy.” The product of the three abstract state spaces is a small constant, regardless of the concrete REFRESH_PERIOD, TRFC, or TRAS values.

Memory abstraction on the mem array. Four rigid free variables fv_active_addr[0..3] track four arbitrary-but-stable addresses across the proof; a four-slot ABSTRACT_MEM captures the values written to those addresses; reads to other addresses return a free fv_garbage_data and set a garbage flag the assertion gates on. The proof obligation for Property 3 becomes “for these four rigid addresses, every write is later read back correctly,” and by the universal-quantification-over-rigid-variables argument the conclusion generalizes to all $2^{\mathrm {ADDR\_W}}$ addresses.

Formal Contract with Abstractions: mem_ctrl_abs_checker.sv

import mem_ctrl_pkg::*;


module mem_ctrl_abs_checker (
     input logic        clk,
     input logic        rst_n,
     input cmd_t        cmd,
     input logic        cmd_ready,
     input rsp_t        rsp,
     input logic        refresh_busy,
     // Probing the concrete refresh counter structurally (via bind .*): the abstraction's advance is GLUED to it, which is what "non-deterministic but consistent with the concrete counter" means operationally.
     input logic [$clog2(REFRESH_PERIOD):0] cycles_since_refresh
);


     // ----------------------------------
     // Counter abstraction: replace the three concrete counters with small expected-counter state machines. Each abstract counter has only the values that any property refers to; the formal tool sees O(1) states regardless of the concrete
            REFRESH_PERIOD / TRFC / TRAS parameters.
     // ----------------------------------
     typedef enum logic [1:0] {
       REF_ZERO,                   // cycles_since_refresh == 0
       REF_BELOW,                  // 0 < cycles_since_refresh < REFRESH_PERIOD
       REF_AT_LIMIT                // cycles_since_refresh == REFRESH_PERIOD
     } abs_refresh_e;
     abs_refresh_e abs_refresh_state;


     typedef enum logic { PHASE_DONE, PHASE_BUSY } abs_phase_e;
     abs_phase_e abs_phase_state;


     typedef enum logic { ROW_DONE, ROW_BUSY } abs_active_e;
     abs_active_e abs_active_state;


     // Abstract FSMs (non-deterministic but consistent with concrete counter)
     logic fv_refresh_advance, fv_phase_advance, fv_active_advance;
     always_ff @(posedge clk or negedge rst_n) begin
       if (!rst_n) begin
            abs_refresh_state <= REF_ZERO;
            abs_phase_state        <= PHASE_DONE;
            abs_active_state       <= ROW_DONE;
       end else begin
            unique case (abs_refresh_state)
              REF_ZERO:          if (fv_refresh_advance) abs_refresh_state <= REF_BELOW;
              REF_BELOW:         if (fv_refresh_advance) abs_refresh_state <= REF_AT_LIMIT;
              REF_AT_LIMIT: if (refresh_busy)            abs_refresh_state <= REF_ZERO;
            endcase
            // (similar for abs_phase_state and abs_active_state, omitted for space; the companion file fills in the two arms)
       end
     end


     // Consistency glue: the abstract refresh state tracks the concrete counter's milestones. Without it the abstraction free-runs ahead of the design and the periodicity property is vacuously refutable.
     assume property (@(posedge clk) disable iff (!rst_n)
       (abs_refresh_state == REF_AT_LIMIT) == (cycles_since_refresh == REFRESH_PERIOD));
     assume property (@(posedge clk) disable iff (!rst_n)
       (abs_refresh_state == REF_ZERO) |-> (cycles_since_refresh < REFRESH_PERIOD));


     // ----------------------------------
     // Memory abstraction: track 4 arbitrary-but-stable addresses; reads to other addresses return garbage and set a flag for the assertion to gate on. The proof generalizes by universal quantification over fv_active_addr[].
     // ----------------------------------
     logic [ADDR_W-1:0] fv_active_addr [3:0];
     logic [DATA_W-1:0] fv_garbage_data;
     logic [DATA_W-1:0] ABSTRACT_MEM           [3:0];
     logic                     garbage;


     // Active addresses are rigid (constant across the proof). $stable needs a singular expression, so apply it per element.
     generate
       for (genvar k = 0; k < 4; k++) begin : g_addr_rigid
            assume property (@(posedge clk) $stable(fv_active_addr[k]));
       end
     endgenerate


     always @(posedge clk) begin
       garbage <= 1'b0;
       for (int i = 0; i < 4; i++) begin
            if (cmd.valid && cmd_ready
                && cmd.op == OP_WRITE
                && cmd.addr == fv_active_addr[i])
              ABSTRACT_MEM[i] <= cmd.data;
       end
       // Read to a non-tracked address: garbage flag asserted; tracked reads should satisfy the data-integrity assertion.
       if (rsp.valid
              && cmd.addr != fv_active_addr[0]
              && cmd.addr != fv_active_addr[1]
              && cmd.addr != fv_active_addr[2]
              && cmd.addr != fv_active_addr[3])
            garbage <= 1'b1;
     end


     // ----------------------------------
     // Properties, expressed against the abstract counter / memory state
     // ----------------------------------


     // Periodicity: when the abstract counter reaches the limit, refresh must fire promptly -- but the implementation may need to complete the in-flight command first, which takes up to ACTIVATING(1) + ACTIVE(TRAS) + PRECHARGING(1) cycles
            before the IDLE -> REFRESHING transition. TRAS+3 is that worst case, and it is tight: TRAS+2 is refutable on this design.
     a_periodicity: assert property (@(posedge clk) disable iff (!rst_n)
       (abs_refresh_state == REF_AT_LIMIT) |-> ##[1:TRAS+3] refresh_busy);


     // Mutual exclusion: no response while refreshing.
     a_mutex_refresh: assert property (@(posedge clk) disable iff (!rst_n)
       refresh_busy |-> !rsp.valid);


     // Write-then-read data integrity, gated on the address being tracked. For every i in 0..3, if a write to fv_active_addr[i] is accepted, then the next accepted read to the same address must return the recorded value. The assertion
            fires only on the tracked-address subset (so the garbage flag does not matter for soundness; it is informational).
     genvar i;
     generate
       for (i = 0; i < 4; i++) begin : g_mem_integrity
            a_write_then_read_i: assert property (@(posedge clk)
              disable iff (!rst_n)
              (cmd.valid && cmd_ready && cmd.op == OP_WRITE
               && cmd.addr == fv_active_addr[i])
              ##[1:$] (cmd.valid && cmd_ready
                         && cmd.op == OP_READ
                         && cmd.addr == fv_active_addr[i])
              |-> ##[1:$] (rsp.valid && rsp.data == ABSTRACT_MEM[i]));
       end
     endgenerate


     // Liveness: every accepted command eventually completes. With the counter abstraction in place, the engine sees a small number of abstract-FSM transitions rather than ~REFRESH_PERIOD * TRFC * TRAS concrete cycles, so this property
            closes under standard IC3/PDR.
     a_liveness: assert property (@(posedge clk) disable iff (!rst_n)
       (cmd.valid && cmd_ready) |-> s_eventually(rsp.valid));


     // ----------------------------------
     // Sanity covers (run these first; if they do not hit, the assumption set is vacuous and the proofs above are not checking anything).
     // ----------------------------------
     c_refresh_fires: cover property (@(posedge clk) refresh_busy);
     c_read_completes: cover property (@(posedge clk)
       (cmd.valid && cmd_ready && cmd.op == OP_READ) ##[1:$] rsp.valid);
     c_write_then_read_for_addr0: cover property (@(posedge clk)
       (cmd.valid && cmd_ready && cmd.op == OP_WRITE
           && cmd.addr == fv_active_addr[0])
       ##[1:$] (cmd.valid && cmd_ready && cmd.op == OP_READ
                      && cmd.addr == fv_active_addr[0])
       ##[1:$] (rsp.valid && rsp.data == ABSTRACT_MEM[0]));


endmodule


// Bind the checker into the verification environment
bind mem_ctrl mem_ctrl_abs_checker chk (.*);

The proof obligation, after the two abstractions, is over a state space of: three abstract-counter FSMs with two to three states each, four tracked memory slots, the implementation’s bank-state FSM, and the command/response handshake variables. That product is small enough for IC3/PDR to close — and the companion proves it as printed: make abs in ch2_rtl_fv_examples/07_mem_ctrl/ confirms the three sanity covers non-vacuous, proves both safety contracts (the periodicity bound ##[1:TRAS+3] is tight — TRAS+2 is refutable, because the in-flight command must complete first), and closes all five liveness obligations (the four unbounded write-then-read chains and the s_eventually completion) through the liveness-to-safety reduction, SAFE at eight frames each — where a concrete proof would have faced a millions-of-cycles state diameter. The same two abstractions, named in the abstract by §3.2 (Stage 4 Move 1), are what reduces every real industrial memory-controller proof to something a formal team can run nightly.

2.11 Summary and What Comes Next

This chapter built the RTL foundation the rest of the book stands on:

• SystemVerilog’s concurrency model and the simulator event-scheduling regions that make parallel always blocks behave deterministically.
• RTL primitives — nets vs. variables, the logic type, packed/unpacked arrays, structs, enums, packages.
• Combinational and sequential modeling, including the blocking vs. non-blocking assignment rule and the Mealy/Moore decision.
• Worked RTL examples (round-robin arbiter, asynchronous handshake / CDC, FIFO, pipelined ALU, MSI cache coherence, memory controller) that exercise arbitration, clock-domain crossing, storage, datapath, coherence, and abstraction-driven proof respectively. Every one is also a runnable artifact in ch2_rtl_fv_examples/, each in its own directory with its own Makefile and README: the design, its bound checker, a Verilator testbench (make sim) — and the formal proof itself, make prove, which closes every design–checker pair as printed with the from-scratch engines of Chapter 3’s companion (the fv/ subdirectories add only the assume-only environment contracts a formal run supplies, and a bug-injected mutation twin per design, make bug).
• An abstraction toolbox earned one design at a time — symbolic clients, assume–guarantee contracts, data-independence tokens, sequential-equivalence miters, local-node and predicate abstraction with the CEGAR loop, counter and memory abstraction — the catalogue Chapter 3 cites by name.
• SystemVerilog Assertions at the level needed to express a property and bind it to a design, including the metastability MTBF argument that motivates the two-flop synchronizer in CDC.

What comes next. Chapter 3 takes the SVA properties written alongside the RTL above and shows how a tool proves them: the end-to-end formal flow, the proof engines (CDCL, BMC, $k$-induction, Craig interpolation, IC3/PDR), SMT for datapaths, abstraction for SoC scale, theorem proving for the properties that outrun every push-button engine, and the signoff artifact a tape-out review will ask for. Chapter 4 then ascends into the verification language: SystemVerilog OOP, parameterized classes, virtual interfaces, constraint-driven randomization, and functional coverage — the toolkit a testbench needs to stimulate the designs above. Chapter 5 assembles those primitives into UVM. Chapter 6 takes the same primitives and assembles them differently, producing the actor-based framework this book proposes. The connection between this chapter’s per-module properties and the per-topology properties Chapter 6 will introduce is direct: every operator we used here (|->, |=>, ##n) generalizes from one module to a network of mailbox-connected actors, and that generalization is what lets the actor model carry temporal contracts at integration scale.