[AMD] refactor: engine-neutral aiperf plotter + fill sglang panels #1774

Open
AMD-yanfeiwang wants to merge 1 commit into SemiAnalysisAI:main from AMD-yanfeiwang:aiperf-plots-engine-neutral-metrics

@AMD-yanfeiwang AMD-yanfeiwang commented Jun 15, 2026

Summary

  • Replace the vLLM-anchored flat alias table in utils/generate_aiperf_plots.py with a semantic Metric enum + per-engine ENGINE_METRICS registry. Panels reference engine-neutral keys; the engine is auto-detected from the metric namespace and the registry resolves each key to the concrete series that engine exports. Adding a backend (e.g. TensorRT-LLM) is now a single new table with no panel changes.
  • Support composite metrics via lightweight adapters in the registry (e.g. a ratio of two counters), dispatched transparently by aggregate_timeseries.
  • Fill previously-blank sglang panels using upstream (main-branch) metrics only:
    • External/host KV usage from hicache_host_used_tokens / hicache_host_total_tokens (ratio adapter) → drives the External line in the KV Cache Utilization panel.
    • KV offload transfer rate + cumulative from backuped_tokens (GPU→CPU) and load_back_tokens (CPU→GPU). sglang exposes no per-token KV byte size upstream, so these render in tokens; offload panels are now unit-aware (bytes/MB/GB for vLLM, tokens for sglang).
  • The per-tier prefix-cache hit-rate split (GPU/External/Combined) and the prefill-source breakdown stackplot remain blank for sglang because they depend on fork-only counters (cache_hit_tokens_l1/l2/l3, cache_miss_tokens) that are not yet merged upstream.
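
For reviewers, the registry shape described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Metric` and `ENGINE_METRICS` are named in the summary, but the `Ratio` adapter class, the `resolve` helper, the `sglang:` prefix on the HiCache series, and the vLLM placeholder string are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Dict, Optional, Union


class Metric(Enum):
    """Engine-neutral keys that panels reference."""
    CPU_KV_CACHE_USAGE = auto()
    KV_OFFLOAD_G2C = auto()
    KV_OFFLOAD_C2G = auto()


@dataclass(frozen=True)
class Ratio:
    """Hypothetical composite adapter: numerator series / denominator series."""
    numerator: str
    denominator: str


# A registry entry is either a concrete series name or a composite adapter.
Source = Union[str, Ratio]

ENGINE_METRICS: Dict[str, Dict[Metric, Source]] = {
    "sglang": {
        # Assumption: HiCache series carry the same "sglang:" prefix
        # as the offload counters.
        Metric.CPU_KV_CACHE_USAGE: Ratio(
            "sglang:hicache_host_used_tokens",
            "sglang:hicache_host_total_tokens",
        ),
        Metric.KV_OFFLOAD_G2C: "sglang:backuped_tokens",
        Metric.KV_OFFLOAD_C2G: "sglang:load_back_tokens",
    },
    "vllm": {
        # Placeholder name for illustration only, not a real vLLM series.
        Metric.KV_OFFLOAD_G2C: "vllm:placeholder_offload_g2c_bytes",
    },
}


def resolve(engine: str, metric: Metric) -> Optional[Source]:
    """Return the engine-specific source, or None so the panel stays blank."""
    return ENGINE_METRICS.get(engine, {}).get(metric)
```

Under this layout, a new backend is one more entry in `ENGINE_METRICS`, and a panel that asks for a key the engine does not map gets `None` and renders empty.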

Notes

  • vLLM behavior is unchanged: all vLLM metrics stay string mappings, aggregation goes through the same code path, and offload units remain bytes — verified byte-identical output on an existing vLLM-style export.

Test plan

  • Re-ran generate_aiperf_plots.py on an sglang result dir; engine auto-detected as sglang, new panels (External KV usage + 3 KV offload panels) populate with real data, fork-dependent panels stay blank.
  • Confirmed Metric resolution: CPU_KV_CACHE_USAGE → ratio adapter, KV_OFFLOAD_G2C → sglang:backuped_tokens, KV_OFFLOAD_C2G → sglang:load_back_tokens.
  • Sanity-check on a vLLM export in CI/local to reconfirm unchanged rendering.

Note

Low Risk
Offline plotting utility only; changes affect visualization labels/units for sglang and refactor metric resolution without touching runtime inference or auth.

Overview
Refactors generate_aiperf_plots.py so panels use a semantic Metric enum and per-engine ENGINE_METRICS registry instead of hard-coded vllm:* names. Engine detection infers vLLM vs sglang from metric namespace prefixes (cached on the export); aggregate_timeseries resolves metrics and can dispatch ratio adapters for composite series (e.g. HiCache host used/total → external KV usage %).
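
As a rough illustration of the two mechanisms in this overview (namespace-based engine detection and a ratio adapter producing a usage percentage), a sketch might look like the following. The function names, the prefix table, and the choice to emit `None` for a zero denominator are assumptions for the example, not the behaviour of `aggregate_timeseries` in this PR.

```python
from typing import Dict, Iterable, List, Optional

# Assumed namespace prefixes; the PR only says detection uses the prefix.
ENGINE_PREFIXES: Dict[str, str] = {"vllm": "vllm:", "sglang": "sglang:"}


def detect_engine(metric_names: Iterable[str]) -> Optional[str]:
    """Pick the engine whose prefix matches the most exported series."""
    counts = {engine: 0 for engine in ENGINE_PREFIXES}
    for name in metric_names:
        for engine, prefix in ENGINE_PREFIXES.items():
            if name.startswith(prefix):
                counts[engine] += 1
    # On a tie, max() returns the first engine in table order.
    best = max(counts, key=lambda engine: counts[engine])
    return best if counts[best] > 0 else None


def ratio_percent(
    used: List[float], total: List[float]
) -> List[Optional[float]]:
    """Pointwise used/total as a percentage; None where total is not positive."""
    return [
        100.0 * u / t if t > 0 else None
        for u, t in zip(used, total)
    ]
```

For example, host-used samples `[50, 0]` against host-total samples `[200, 0]` would give `[25.0, None]`, so a gap rather than a division error appears where the host pool reports zero capacity.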

sglang gains upstream-only mappings: external KV line, queue/throughput/preemptions, aggregate prefix hit rate via PREFIX_CACHE_HIT_RATE when hit/query counters are missing, and HiCache offload panels using token units. KV offload panels are unit-aware (MB/s & GB for vLLM, tokens/s & M tokens for sglang). Panels that need fork-only sglang counters (GPU/External prefix split, prefill source stackplot) still render empty. Figure suptitle is generalized to "LLM Server Metrics During Benchmark"; vLLM string mappings are intended to preserve prior vLLM plot behavior.
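
The unit-aware offload panels could be driven by a small per-engine table along these lines. This is a sketch under stated assumptions: the labels come from the summary above, while the `OffloadUnits` class, the `scale_offload` helper, and the decimal divisors (1e6 bytes per MB, 1e9 bytes per GB) are hypothetical and may differ from what the script uses.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class OffloadUnits:
    rate_label: str       # axis label for the transfer-rate panel
    rate_divisor: float   # raw units per displayed rate unit
    total_label: str      # axis label for the cumulative panel
    total_divisor: float  # raw units per displayed cumulative unit


# Assumption: decimal MB/GB; vLLM raw values are bytes, sglang raw values are tokens.
OFFLOAD_UNITS: Dict[str, OffloadUnits] = {
    "vllm": OffloadUnits("MB/s", 1e6, "GB", 1e9),
    "sglang": OffloadUnits("tokens/s", 1.0, "M tokens", 1e6),
}


def scale_offload(
    engine: str, rate: float, total: float
) -> Tuple[float, str, float, str]:
    """Convert raw rate and cumulative values into the engine's display units."""
    units = OFFLOAD_UNITS[engine]
    return (
        rate / units.rate_divisor,
        units.rate_label,
        total / units.total_divisor,
        units.total_label,
    )
```

Keeping the labels and divisors together in one table means the panel code never branches on the engine name; it only reads the units record for the detected engine.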

Reviewed by Cursor Bugbot for commit 4f33930. Bugbot is set up for automated code reviews on this repo.

Replace the vLLM-anchored flat alias table in generate_aiperf_plots.py with
a semantic Metric enum plus a per-engine ENGINE_METRICS registry. Panels now
reference engine-neutral metric keys and the registry resolves them to the
concrete series each engine exports (detected from the metric namespace).
Adding a new backend becomes a single new table with no panel changes.

Also fill previously-blank panels for sglang using upstream (main-branch)
metrics only:
- External/host KV usage from hicache_host_used/total_tokens (ratio adapter)
- KV offload transfer + cumulative from backuped_tokens / load_back_tokens,
  rendered in tokens (sglang exposes no per-token KV byte size upstream;
  offload panels are now unit-aware: bytes for vLLM, tokens for sglang)

The per-tier prefix-cache hit-rate split and prefill-source breakdown stay
blank for sglang since they depend on fork-only counters.

vLLM behavior is unchanged (string mappings + identical byte units).
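
The transfer-rate panels mentioned above are derived from cumulative counters. One common way to turn a cumulative counter into a per-interval rate is sketched below; the function name and the treatment of counter resets (a negative delta is reported as a zero rate) are assumptions for illustration, and the script may handle these cases differently.

```python
from typing import List, Sequence


def counter_to_rate(
    timestamps: Sequence[float], values: Sequence[float]
) -> List[float]:
    """Per-interval rate from a cumulative counter.

    Returns one value per consecutive sample pair. Intervals with a
    non-positive time step or a negative delta (e.g. a server restart
    resetting the counter) are reported as 0.0.
    """
    rates: List[float] = []
    for i in range(1, len(values)):
        dt = timestamps[i] - timestamps[i - 1]
        delta = values[i] - values[i - 1]
        if dt <= 0 or delta < 0:
            rates.append(0.0)
        else:
            rates.append(delta / dt)
    return rates
```

For sglang the resulting series is in tokens per second, since `backuped_tokens` and `load_back_tokens` count tokens rather than bytes.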

cquil11 commented Jun 15, 2026

Collaborator

@AMD-yanfeiwang can you rebase this onto the agentx-v0.4 branch please?
