cache: fix LLC AMO writeback collision; L2/L3 writeback, per-cache MPM classes, atomic tests by tinebp · Pull Request #363 · vortexgpgpu/vortex

tinebp · 2026-06-13T23:29:10Z

Summary

Fixes a data-corruption bug in last-level-cache atomics, enables L2/L3 writeback with eviction tracking, restructures performance counters into per-cache MPM classes, and adds atomic benchmarks plus a Synopsys back-end flow.

Cache / AMO

Fix different-line AMO writeback collision. The single LLC AMO writeback register could be clobbered by a second atomic to a different line before its writeback drained, silently dropping the write (tripped the VX_cache_amo writeback assertion under back-to-back AMOs). Replaced it with a depth-2 coalescing writeback queue: forwards in-flight values to dependent reads, coalesces same-line updates, and is deadlock-free (off the critical path — no timing or response-latency change). Validated 13/13 across base + dcache-writeback + 8-thread configs.
L2/L3 writeback + eviction tracking with a new evictions field in cache_perf_t.
AMO over L2+L3 uses a write-through L2 (-DVX_CFG_L2_WRITEBACK=0): every cache above the LLC must be write-through, else SC can spuriously succeed.

Performance counters (MPM)

Split memory counters into per-cache MPM classes (icache / dcache / L2 / L3 / mem), selected by DCR.
Folded VM/MMU (TLB/PTW) counters into the MEM class.
Fixed vx_dump_perf gating so it no longer prints values for disabled features (e.g. VM) or unconfigured caches.

Tests & back-end flow

OpenCL and HIP atomic benchmarks (atomicreduce, histogram).
Synopsys post-synthesis layout / power / GDS flow (run_icc2.tcl, dump_power.tcl, powergrid.py, gds_render.py, SAIF filter).
Cache synthesis DUT gains an AMO_ENABLE parameter (tracks the A extension by default).

Validation

Full regression --regression --unittest --amo on both XLEN=32 and XLEN=64: 267/267 PASS.
Standalone cache DUT synthesis closes timing at 300 MHz on Alveo U55C with AMO enabled.

🤖 Generated with Claude Code

Atomic support had spread across the VX_cache_bank pipeline: writeback state, commit gating, byte-offset alignment, response muxing, and the non-LLC passthrough were threaded through arbitration, the sel mux, and the response path, with the LLC commit datapath only partly gated. Consolidate all bank-side AMO logic into one VX_cache_amo module that selects its role on IS_LLC: - IS_LLC=1: RMW commit (wraps VX_amo_unit) + per-hart reservation table + single-outstanding writeback FSM + response formatting - IS_LLC=0: forward/replay passthrough + same-hart age-ordering The bank instantiates it once under `if (AMO_ENABLE)` with a single compact tie-off, dropping VX_cache_bank from 1244 to ~1000 lines and keeping the whole AMO datapath out of the netlist when atomics are off. Behavior-preserving. Validated under rtlsim with EXT_A (single-core L1=LLC commit branch, cycle-identical; writeback+dirtybytes; threads=8; multi-core L2 and L2+L3 passthrough branch), hw unit-test elaboration, and Yosys synthesis with AMO enabled. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The AMO ALU was a fixed 64-bit adder/comparator/logic block selected at runtime by the .W/.D width bit. On the 32-bit-word L1 dcache (RV32) the .D path can never occur, yet the 64-bit datapath was still synthesized, putting a 64-bit add + 64-bit signed compares on the single-cycle BRAM-read -> align -> RMW -> writeback path that misses 300 MHz timing. Parameterize VX_amo_alu by DATA_WIDTH (= cache word width, capped at 64) and build the datapath at that width; VX_cache_amo passes AMO_OLD_BITS so a 32-bit-word cache synthesizes a 32-bit RMW. The .W/.D distinction is kept only when DATA_WIDTH > 32. Behavior-preserving. Validated under rtlsim with EXT_A (all atomic ops, signed/unsigned): single-core L1=LLC, threads=8, and multi-core L2 all pass cycle-identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The atomic commit did the whole RMW in one cycle on the S1 critical path: cache_data BRAM read -> byte-align shift -> ALU -> re-align shift -> writeback register. At 250 MHz this was the sole failing path (all 10 worst endpoints, WNS -1.302 ns, 19 logic levels), and narrowing the ALU alone wasn't enough. Split it into two cycles: S1 latches the byte-aligned operands into a compute stage; the ALU + re-align + writeback-register update run the next cycle, off the S1 path. The AMO response (old value) still fires at S1 from the aligned read, so load-use latency is unchanged. Correctness around the single-outstanding writeback: - Same-line chained AMOs (coalesced MSHR replays stream back-to-back) are paced one cycle by chain_stall so the prior result is in wb_data_r and forwards cleanly; replays are NOT blocked, so the MSHR's coalesced-replay accounting is untouched. - The response is suppressed while an op is chain-stalled at S1, so a held read enqueues its response exactly once. - Runtime assertions guard against a store-AMO displacing an in-flight compute or a different-line pending writeback. Strictly AMO-gated: commit_busy/chain_stall are 0 for all non-AMO traffic. Baseline cache performance is unaffected -- sgemm under rtlsim (EXT_A) is cycle-identical with and without this change (1,201,350 cycles). AMO correctness re-validated: single-core L1=LLC, threads=8, multi-core L2, and writeback+dirtybytes all pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…lasses - Default L2/L3 to write-back and add a dirty-line `evictions` perf counter (cache_perf_t + bank/cache/wrap wiring). - Split the memory MPM counters into per-cache classes (icache/dcache/ L2/L3) re-based at the standard hpmcounter addresses; fold the VM/MMU (TLB/PTW) counters into the MEM class. - Runtime per-class perf dump with the new evicts field and capability-correct gating; coalescer printed unconditionally. - Docs + blackbox usage updated for the new class mapping / VM location. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…h L2 - Add atomicreduce and histogram benchmarks under tests/opencl and tests/hip (atomic_add -> RVA amo*.w); OpenCL passes simx/rtlsim, HIP excluded pending pocl-vortex generic-AS atomics. - The 4-core L2+L3 amo regression must request write-through L2: atomics resolve at the LLC, so caches above it must be write-through. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ueue A second AMO to a different line could reach the commit stage while the prior writeback was still draining, overwriting its pending state and dropping the writeback. Replace the single writeback register with a depth-2 coalescing queue: completed AMOs push instead of clobbering, same-line results coalesce into their entry (only the latest reaches the array; earlier ones forward), different-line results queue. No replay is stalled (coalescer-safe) and the drain is not pipe_stall-gated (deadlock-free); closes 300 MHz on U55C unchanged (WNS +0.067 ns). Also tidies the AMO ALU/unit ports (compute_unsigned / is_unsigned). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- Synopsys backend flow: layout images, spatial power-density map, and GDS render (run_icc2.tcl / dump_power.tcl / powergrid.py / gds_render.py + Makefile targets + ASAP7 physical libs). - SAIF subtree filter for power annotation across hw/syn flows. - Synopsys timing-summary report + parameterized XRT kernel frequency. - Wire AMO_ENABLE (a parameter defaulting to VX_CFG_EXT_A_ENABLED, per the wrapper's convention) into VX_cache_top so the standalone cache DUT synthesizes the AMO datapath when the A extension is enabled. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The evictions field added to cache_perf_t was not summed in the per-cache perf aggregation macro, leaving cache_perf[evictions] undriven in VX_cache_cluster. Verilator rejected every rtlsim build with PERF_ENABLE (UNDRIVEN). Add the missing field to the macro. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The simx cache modeled a single dirty bit per line and wrote back the full line with an all-ones byte-enable. With write-back L2/L3 and false sharing across clusters, the last writer's full-line writeback clobbered words it never modified (filled as poison), corrupting the shared lower level. Mirror the RTL DIRTY_BYTES behavior: accumulate a per-byte dirty_mask on every store and use it as the writeback byte-enable so only modified bytes propagate. Reset the mask on fill and invalidate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

tinebp · 2026-06-14T10:28:24Z

Pushed two regression fixes surfaced while validating the --dxa, --cache, and --debug suites on XLEN 32/64 with L2/L3 write-back now default:

cache: aggregate the evictions counter in PERF_CACHE_ADD — the new evictions field in cache_perf_t wasn't summed in the per-cache aggregation macro, leaving those bits undriven in VX_cache_cluster. Verilator rejected every rtlsim build with PERF_ENABLE. (Caught by the dxa suite.)
simx: track per-byte dirty bits for write-back cache writebacks — simx modeled a single dirty bit per line and wrote back the full line with an all-ones byte-enable. Under write-back L2/L3 with cross-cluster false sharing, the last writer clobbered words it never modified, corrupting the shared lower level (half-line baadf00d). Now mirrors the RTL DIRTY_BYTES behavior — only modified bytes propagate. (Caught by diverge over L2+L3 in the cache suite.)

Validated: XLEN=32 dxa 36/36 and cache 34/34 green. XLEN=32 --debug and all XLEN=64 suites still in progress.

tinebp and others added 9 commits June 9, 2026 19:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cache: fix LLC AMO writeback collision; L2/L3 writeback, per-cache MPM classes, atomic tests#363

cache: fix LLC AMO writeback collision; L2/L3 writeback, per-cache MPM classes, atomic tests#363
tinebp wants to merge 9 commits into
masterfrom
fix_llc_amo_gating

tinebp commented Jun 13, 2026

Uh oh!

tinebp commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tinebp commented Jun 13, 2026

Summary

Cache / AMO

Performance counters (MPM)

Tests & back-end flow

Validation

Uh oh!

tinebp commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant