cache: fix LLC AMO writeback collision; L2/L3 writeback, per-cache MPM classes, atomic tests#363
Open
tinebp wants to merge 9 commits into
Open
cache: fix LLC AMO writeback collision; L2/L3 writeback, per-cache MPM classes, atomic tests#363tinebp wants to merge 9 commits into
tinebp wants to merge 9 commits into
Conversation
Atomic support had spread across the VX_cache_bank pipeline: writeback
state, commit gating, byte-offset alignment, response muxing, and the
non-LLC passthrough were threaded through arbitration, the sel mux, and
the response path, with the LLC commit datapath only partly gated.
Consolidate all bank-side AMO logic into one VX_cache_amo module that
selects its role on IS_LLC:
- IS_LLC=1: RMW commit (wraps VX_amo_unit) + per-hart reservation table
+ single-outstanding writeback FSM + response formatting
- IS_LLC=0: forward/replay passthrough + same-hart age-ordering
The bank instantiates it once under `if (AMO_ENABLE)` with a single
compact tie-off, dropping VX_cache_bank from 1244 to ~1000 lines and
keeping the whole AMO datapath out of the netlist when atomics are off.
Behavior-preserving.
Validated under rtlsim with EXT_A (single-core L1=LLC commit branch,
cycle-identical; writeback+dirtybytes; threads=8; multi-core L2 and
L2+L3 passthrough branch), hw unit-test elaboration, and Yosys synthesis
with AMO enabled.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The AMO ALU was a fixed 64-bit adder/comparator/logic block selected at runtime by the .W/.D width bit. On the 32-bit-word L1 dcache (RV32) the .D path can never occur, yet the 64-bit datapath was still synthesized, putting a 64-bit add + 64-bit signed compares on the single-cycle BRAM-read -> align -> RMW -> writeback path that misses 300 MHz timing. Parameterize VX_amo_alu by DATA_WIDTH (= cache word width, capped at 64) and build the datapath at that width; VX_cache_amo passes AMO_OLD_BITS so a 32-bit-word cache synthesizes a 32-bit RMW. The .W/.D distinction is kept only when DATA_WIDTH > 32. Behavior-preserving. Validated under rtlsim with EXT_A (all atomic ops, signed/unsigned): single-core L1=LLC, threads=8, and multi-core L2 all pass cycle-identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The atomic commit did the whole RMW in one cycle on the S1 critical path:
cache_data BRAM read -> byte-align shift -> ALU -> re-align shift ->
writeback register. At 250 MHz this was the sole failing path (all 10
worst endpoints, WNS -1.302 ns, 19 logic levels), and narrowing the ALU
alone wasn't enough.
Split it into two cycles: S1 latches the byte-aligned operands into a
compute stage; the ALU + re-align + writeback-register update run the
next cycle, off the S1 path. The AMO response (old value) still fires at
S1 from the aligned read, so load-use latency is unchanged.
Correctness around the single-outstanding writeback:
- Same-line chained AMOs (coalesced MSHR replays stream back-to-back)
are paced one cycle by chain_stall so the prior result is in
wb_data_r and forwards cleanly; replays are NOT blocked, so the
MSHR's coalesced-replay accounting is untouched.
- The response is suppressed while an op is chain-stalled at S1, so a
held read enqueues its response exactly once.
- Runtime assertions guard against a store-AMO displacing an in-flight
compute or a different-line pending writeback.
Strictly AMO-gated: commit_busy/chain_stall are 0 for all non-AMO
traffic. Baseline cache performance is unaffected -- sgemm under rtlsim
(EXT_A) is cycle-identical with and without this change (1,201,350
cycles). AMO correctness re-validated: single-core L1=LLC, threads=8,
multi-core L2, and writeback+dirtybytes all pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lasses - Default L2/L3 to write-back and add a dirty-line `evictions` perf counter (cache_perf_t + bank/cache/wrap wiring). - Split the memory MPM counters into per-cache classes (icache/dcache/ L2/L3) re-based at the standard hpmcounter addresses; fold the VM/MMU (TLB/PTW) counters into the MEM class. - Runtime per-class perf dump with the new evicts field and capability-correct gating; coalescer printed unconditionally. - Docs + blackbox usage updated for the new class mapping / VM location. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…h L2 - Add atomicreduce and histogram benchmarks under tests/opencl and tests/hip (atomic_add -> RVA amo*.w); OpenCL passes simx/rtlsim, HIP excluded pending pocl-vortex generic-AS atomics. - The 4-core L2+L3 amo regression must request write-through L2: atomics resolve at the LLC, so caches above it must be write-through. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ueue A second AMO to a different line could reach the commit stage while the prior writeback was still draining, overwriting its pending state and dropping the writeback. Replace the single writeback register with a depth-2 coalescing queue: completed AMOs push instead of clobbering, same-line results coalesce into their entry (only the latest reaches the array; earlier ones forward), different-line results queue. No replay is stalled (coalescer-safe) and the drain is not pipe_stall-gated (deadlock-free); closes 300 MHz on U55C unchanged (WNS +0.067 ns). Also tidies the AMO ALU/unit ports (compute_unsigned / is_unsigned). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Synopsys backend flow: layout images, spatial power-density map, and GDS render (run_icc2.tcl / dump_power.tcl / powergrid.py / gds_render.py + Makefile targets + ASAP7 physical libs). - SAIF subtree filter for power annotation across hw/syn flows. - Synopsys timing-summary report + parameterized XRT kernel frequency. - Wire AMO_ENABLE (a parameter defaulting to VX_CFG_EXT_A_ENABLED, per the wrapper's convention) into VX_cache_top so the standalone cache DUT synthesizes the AMO datapath when the A extension is enabled. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The evictions field added to cache_perf_t was not summed in the per-cache perf aggregation macro, leaving cache_perf[evictions] undriven in VX_cache_cluster. Verilator rejected every rtlsim build with PERF_ENABLE (UNDRIVEN). Add the missing field to the macro. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The simx cache modeled a single dirty bit per line and wrote back the full line with an all-ones byte-enable. With write-back L2/L3 and false sharing across clusters, the last writer's full-line writeback clobbered words it never modified (filled as poison), corrupting the shared lower level. Mirror the RTL DIRTY_BYTES behavior: accumulate a per-byte dirty_mask on every store and use it as the writeback byte-enable so only modified bytes propagate. Reset the mask on fill and invalidate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Collaborator
Author
|
Pushed two regression fixes surfaced while validating the
Validated: XLEN=32 dxa 36/36 and cache 34/34 green. XLEN=32 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a data-corruption bug in last-level-cache atomics, enables L2/L3 writeback with eviction tracking, restructures performance counters into per-cache MPM classes, and adds atomic benchmarks plus a Synopsys back-end flow.
Cache / AMO
VX_cache_amowriteback assertion under back-to-back AMOs). Replaced it with a depth-2 coalescing writeback queue: forwards in-flight values to dependent reads, coalesces same-line updates, and is deadlock-free (off the critical path — no timing or response-latency change). Validated 13/13 across base + dcache-writeback + 8-thread configs.evictionsfield incache_perf_t.-DVX_CFG_L2_WRITEBACK=0): every cache above the LLC must be write-through, else SC can spuriously succeed.Performance counters (MPM)
vx_dump_perfgating so it no longer prints values for disabled features (e.g. VM) or unconfigured caches.Tests & back-end flow
atomicreduce,histogram).run_icc2.tcl,dump_power.tcl,powergrid.py,gds_render.py, SAIF filter).AMO_ENABLEparameter (tracks the A extension by default).Validation
--regression --unittest --amoon both XLEN=32 and XLEN=64: 267/267 PASS.🤖 Generated with Claude Code