Skip to content

cache: fix LLC AMO writeback collision; L2/L3 writeback, per-cache MPM classes, atomic tests#363

Open
tinebp wants to merge 9 commits into
masterfrom
fix_llc_amo_gating
Open

cache: fix LLC AMO writeback collision; L2/L3 writeback, per-cache MPM classes, atomic tests#363
tinebp wants to merge 9 commits into
masterfrom
fix_llc_amo_gating

Conversation

@tinebp

@tinebp tinebp commented Jun 13, 2026

Copy link
Copy Markdown
Collaborator

Summary

Fixes a data-corruption bug in last-level-cache atomics, enables L2/L3 writeback with eviction tracking, restructures performance counters into per-cache MPM classes, and adds atomic benchmarks plus a Synopsys back-end flow.

Cache / AMO

  • Fix different-line AMO writeback collision. The single LLC AMO writeback register could be clobbered by a second atomic to a different line before its writeback drained, silently dropping the write (tripped the VX_cache_amo writeback assertion under back-to-back AMOs). Replaced it with a depth-2 coalescing writeback queue: forwards in-flight values to dependent reads, coalesces same-line updates, and is deadlock-free (off the critical path — no timing or response-latency change). Validated 13/13 across base + dcache-writeback + 8-thread configs.
  • L2/L3 writeback + eviction tracking with a new evictions field in cache_perf_t.
  • AMO over L2+L3 uses a write-through L2 (-DVX_CFG_L2_WRITEBACK=0): every cache above the LLC must be write-through, else SC can spuriously succeed.

Performance counters (MPM)

  • Split memory counters into per-cache MPM classes (icache / dcache / L2 / L3 / mem), selected by DCR.
  • Folded VM/MMU (TLB/PTW) counters into the MEM class.
  • Fixed vx_dump_perf gating so it no longer prints values for disabled features (e.g. VM) or unconfigured caches.

Tests & back-end flow

  • OpenCL and HIP atomic benchmarks (atomicreduce, histogram).
  • Synopsys post-synthesis layout / power / GDS flow (run_icc2.tcl, dump_power.tcl, powergrid.py, gds_render.py, SAIF filter).
  • Cache synthesis DUT gains an AMO_ENABLE parameter (tracks the A extension by default).

Validation

  • Full regression --regression --unittest --amo on both XLEN=32 and XLEN=64: 267/267 PASS.
  • Standalone cache DUT synthesis closes timing at 300 MHz on Alveo U55C with AMO enabled.

🤖 Generated with Claude Code

tinebp and others added 9 commits June 9, 2026 19:54
Atomic support had spread across the VX_cache_bank pipeline: writeback
state, commit gating, byte-offset alignment, response muxing, and the
non-LLC passthrough were threaded through arbitration, the sel mux, and
the response path, with the LLC commit datapath only partly gated.

Consolidate all bank-side AMO logic into one VX_cache_amo module that
selects its role on IS_LLC:
  - IS_LLC=1: RMW commit (wraps VX_amo_unit) + per-hart reservation table
    + single-outstanding writeback FSM + response formatting
  - IS_LLC=0: forward/replay passthrough + same-hart age-ordering

The bank instantiates it once under `if (AMO_ENABLE)` with a single
compact tie-off, dropping VX_cache_bank from 1244 to ~1000 lines and
keeping the whole AMO datapath out of the netlist when atomics are off.
Behavior-preserving.

Validated under rtlsim with EXT_A (single-core L1=LLC commit branch,
cycle-identical; writeback+dirtybytes; threads=8; multi-core L2 and
L2+L3 passthrough branch), hw unit-test elaboration, and Yosys synthesis
with AMO enabled.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The AMO ALU was a fixed 64-bit adder/comparator/logic block selected at
runtime by the .W/.D width bit. On the 32-bit-word L1 dcache (RV32) the
.D path can never occur, yet the 64-bit datapath was still synthesized,
putting a 64-bit add + 64-bit signed compares on the single-cycle
BRAM-read -> align -> RMW -> writeback path that misses 300 MHz timing.

Parameterize VX_amo_alu by DATA_WIDTH (= cache word width, capped at 64)
and build the datapath at that width; VX_cache_amo passes AMO_OLD_BITS so
a 32-bit-word cache synthesizes a 32-bit RMW. The .W/.D distinction is
kept only when DATA_WIDTH > 32. Behavior-preserving.

Validated under rtlsim with EXT_A (all atomic ops, signed/unsigned):
single-core L1=LLC, threads=8, and multi-core L2 all pass cycle-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The atomic commit did the whole RMW in one cycle on the S1 critical path:
cache_data BRAM read -> byte-align shift -> ALU -> re-align shift ->
writeback register. At 250 MHz this was the sole failing path (all 10
worst endpoints, WNS -1.302 ns, 19 logic levels), and narrowing the ALU
alone wasn't enough.

Split it into two cycles: S1 latches the byte-aligned operands into a
compute stage; the ALU + re-align + writeback-register update run the
next cycle, off the S1 path. The AMO response (old value) still fires at
S1 from the aligned read, so load-use latency is unchanged.

Correctness around the single-outstanding writeback:
  - Same-line chained AMOs (coalesced MSHR replays stream back-to-back)
    are paced one cycle by chain_stall so the prior result is in
    wb_data_r and forwards cleanly; replays are NOT blocked, so the
    MSHR's coalesced-replay accounting is untouched.
  - The response is suppressed while an op is chain-stalled at S1, so a
    held read enqueues its response exactly once.
  - Runtime assertions guard against a store-AMO displacing an in-flight
    compute or a different-line pending writeback.

Strictly AMO-gated: commit_busy/chain_stall are 0 for all non-AMO
traffic. Baseline cache performance is unaffected -- sgemm under rtlsim
(EXT_A) is cycle-identical with and without this change (1,201,350
cycles). AMO correctness re-validated: single-core L1=LLC, threads=8,
multi-core L2, and writeback+dirtybytes all pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lasses

- Default L2/L3 to write-back and add a dirty-line `evictions` perf
  counter (cache_perf_t + bank/cache/wrap wiring).
- Split the memory MPM counters into per-cache classes (icache/dcache/
  L2/L3) re-based at the standard hpmcounter addresses; fold the VM/MMU
  (TLB/PTW) counters into the MEM class.
- Runtime per-class perf dump with the new evicts field and
  capability-correct gating; coalescer printed unconditionally.
- Docs + blackbox usage updated for the new class mapping / VM location.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…h L2

- Add atomicreduce and histogram benchmarks under tests/opencl and
  tests/hip (atomic_add -> RVA amo*.w); OpenCL passes simx/rtlsim, HIP
  excluded pending pocl-vortex generic-AS atomics.
- The 4-core L2+L3 amo regression must request write-through L2: atomics
  resolve at the LLC, so caches above it must be write-through.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ueue

A second AMO to a different line could reach the commit stage while the
prior writeback was still draining, overwriting its pending state and
dropping the writeback. Replace the single writeback register with a
depth-2 coalescing queue: completed AMOs push instead of clobbering,
same-line results coalesce into their entry (only the latest reaches the
array; earlier ones forward), different-line results queue. No replay is
stalled (coalescer-safe) and the drain is not pipe_stall-gated
(deadlock-free); closes 300 MHz on U55C unchanged (WNS +0.067 ns).

Also tidies the AMO ALU/unit ports (compute_unsigned / is_unsigned).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Synopsys backend flow: layout images, spatial power-density map, and
  GDS render (run_icc2.tcl / dump_power.tcl / powergrid.py / gds_render.py
  + Makefile targets + ASAP7 physical libs).
- SAIF subtree filter for power annotation across hw/syn flows.
- Synopsys timing-summary report + parameterized XRT kernel frequency.
- Wire AMO_ENABLE (a parameter defaulting to VX_CFG_EXT_A_ENABLED, per the
  wrapper's convention) into VX_cache_top so the standalone cache DUT
  synthesizes the AMO datapath when the A extension is enabled.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The evictions field added to cache_perf_t was not summed in the
per-cache perf aggregation macro, leaving cache_perf[evictions]
undriven in VX_cache_cluster. Verilator rejected every rtlsim build
with PERF_ENABLE (UNDRIVEN). Add the missing field to the macro.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The simx cache modeled a single dirty bit per line and wrote back the
full line with an all-ones byte-enable. With write-back L2/L3 and false
sharing across clusters, the last writer's full-line writeback clobbered
words it never modified (filled as poison), corrupting the shared lower
level. Mirror the RTL DIRTY_BYTES behavior: accumulate a per-byte
dirty_mask on every store and use it as the writeback byte-enable so only
modified bytes propagate. Reset the mask on fill and invalidate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@tinebp

tinebp commented Jun 14, 2026

Copy link
Copy Markdown
Collaborator Author

Pushed two regression fixes surfaced while validating the --dxa, --cache, and --debug suites on XLEN 32/64 with L2/L3 write-back now default:

  • cache: aggregate the evictions counter in PERF_CACHE_ADD — the new evictions field in cache_perf_t wasn't summed in the per-cache aggregation macro, leaving those bits undriven in VX_cache_cluster. Verilator rejected every rtlsim build with PERF_ENABLE. (Caught by the dxa suite.)
  • simx: track per-byte dirty bits for write-back cache writebacks — simx modeled a single dirty bit per line and wrote back the full line with an all-ones byte-enable. Under write-back L2/L3 with cross-cluster false sharing, the last writer clobbered words it never modified, corrupting the shared lower level (half-line baadf00d). Now mirrors the RTL DIRTY_BYTES behavior — only modified bytes propagate. (Caught by diverge over L2+L3 in the cache suite.)

Validated: XLEN=32 dxa 36/36 and cache 34/34 green. XLEN=32 --debug and all XLEN=64 suites still in progress.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant