[AMD] refactor: engine-neutral aiperf plotter + fill sglang panels by AMD-yanfeiwang · Pull Request #1774 · SemiAnalysisAI/InferenceX

AMD-yanfeiwang · 2026-06-15T06:48:47Z

Summary

Replace the vLLM-anchored flat alias table in utils/generate_aiperf_plots.py with a semantic Metric enum + per-engine ENGINE_METRICS registry. Panels reference engine-neutral keys; the engine is auto-detected from the metric namespace and the registry resolves each key to the concrete series that engine exports. Adding a backend (e.g. TensorRT-LLM) is now a single new table with no panel changes.
Support composite metrics via lightweight adapters in the registry (e.g. a ratio of two counters), dispatched transparently by aggregate_timeseries.
Fill previously-blank sglang panels using upstream (main-branch) metrics only:
- External/host KV usage from hicache_host_used_tokens / hicache_host_total_tokens (ratio adapter) → drives the External line in the KV Cache Utilization panel.
- KV offload transfer rate + cumulative from backuped_tokens (GPU→CPU) and load_back_tokens (CPU→GPU). sglang exposes no per-token KV byte size upstream, so these render in tokens; offload panels are now unit-aware (bytes/MB/GB for vLLM, tokens for sglang).
The per-tier prefix-cache hit-rate split (GPU/External/Combined) and the prefill-source breakdown stackplot remain blank for sglang because they depend on fork-only counters (cache_hit_tokens_l1/l2/l3, cache_miss_tokens) that are not yet merged upstream.
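The registry-plus-adapter design described above might look roughly like the following. This is a minimal sketch, not the code in this PR: the `Ratio` class, the `resolve_series` helper, the `sglang:` prefix on the HiCache series, and the vLLM series name are assumptions made for illustration; only the `Metric` keys and the sglang counter names come from the description.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Dict, List, Optional, Union


class Metric(Enum):
    """Engine-neutral keys that panels refer to."""
    CPU_KV_CACHE_USAGE = auto()
    KV_OFFLOAD_G2C = auto()
    KV_OFFLOAD_C2G = auto()


@dataclass(frozen=True)
class Ratio:
    """Hypothetical adapter: numerator / denominator, sample by sample."""
    numerator: str
    denominator: str


Source = Union[str, Ratio]

ENGINE_METRICS: Dict[str, Dict[Metric, Source]] = {
    "sglang": {
        # Assumed to carry the "sglang:" namespace like the offload counters.
        Metric.CPU_KV_CACHE_USAGE: Ratio(
            "sglang:hicache_host_used_tokens",
            "sglang:hicache_host_total_tokens",
        ),
        Metric.KV_OFFLOAD_G2C: "sglang:backuped_tokens",
        Metric.KV_OFFLOAD_C2G: "sglang:load_back_tokens",
    },
    "vllm": {
        # Placeholder series name, NOT a real vLLM metric.
        Metric.CPU_KV_CACHE_USAGE: "vllm:example_cpu_kv_usage",
    },
}


def resolve_series(
    engine: str, metric: Metric, export: Dict[str, List[float]]
) -> Optional[List[float]]:
    """Return the series for `metric`, or None so the panel stays blank."""
    source = ENGINE_METRICS.get(engine, {}).get(metric)
    if source is None:
        return None
    if isinstance(source, str):
        return export.get(source)
    num = export.get(source.numerator)
    den = export.get(source.denominator)
    if num is None or den is None:
        return None
    # Guard against a zero denominator (e.g. host cache not yet sized).
    return [n / d if d else 0.0 for n, d in zip(num, den)]
```

With this shape, a new backend is one more entry in `ENGINE_METRICS`, and a panel that asks for an unmapped key simply gets `None` back and renders empty.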

Notes

vLLM behavior is unchanged: all vLLM metrics stay string mappings, aggregation goes through the same code path, and offload units remain bytes — verified byte-identical output on an existing vLLM-style export.
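One way to reproduce a byte-identical check like the one mentioned above is to render the same export before and after the change and compare the files. The sketch below is an assumption about how that could be done, not the procedure used for this PR; `outputs_identical` and the file paths are hypothetical.

```python
import filecmp
from pathlib import Path


def outputs_identical(before: Path, after: Path) -> bool:
    """Compare two rendered files byte by byte (not just size and mtime)."""
    return filecmp.cmp(before, after, shallow=False)
```

A comparison like this is only meaningful when both renders come from the same environment, since plotting library versions can change the output bytes.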

Test plan

Re-ran generate_aiperf_plots.py on an sglang result dir; engine auto-detected as sglang, new panels (External KV usage + 3 KV offload panels) populate with real data, fork-dependent panels stay blank.
Confirmed Metric resolution: CPU_KV_CACHE_USAGE → ratio adapter, KV_OFFLOAD_G2C → sglang:backuped_tokens, KV_OFFLOAD_C2G → sglang:load_back_tokens.
Sanity-check on a vLLM export in CI/local to reconfirm unchanged rendering.

Note

Low Risk
Offline plotting utility only; changes affect visualization labels/units for sglang and refactor metric resolution without touching runtime inference or auth.

Overview
Refactors generate_aiperf_plots.py so panels use a semantic Metric enum and per-engine ENGINE_METRICS registry instead of hard-coded vllm:* names. Engine detection infers vLLM vs sglang from metric namespace prefixes (cached on the export); aggregate_timeseries resolves metrics and can dispatch ratio adapters for composite series (e.g. HiCache host used/total → external KV usage %).
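Namespace-based engine detection can be sketched as a prefix count over the exported metric names. The function name `detect_engine` and the `KNOWN_PREFIXES` table are illustrative assumptions, and the caching on the export mentioned above is left out for brevity.

```python
from collections import Counter
from typing import Iterable, Optional

# Assumed namespace prefixes, one per supported engine.
KNOWN_PREFIXES = {"vllm": "vllm:", "sglang": "sglang:"}


def detect_engine(metric_names: Iterable[str]) -> Optional[str]:
    """Pick the engine whose namespace prefix appears most often."""
    counts: Counter = Counter()
    for name in metric_names:
        for engine, prefix in KNOWN_PREFIXES.items():
            if name.startswith(prefix):
                counts[engine] += 1
    if not counts:
        return None  # no recognised namespace in this export
    return counts.most_common(1)[0][0]
```

Counting rather than returning on the first match keeps the result stable if an export happens to contain a stray series from another namespace.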

sglang gains upstream-only mappings: external KV line, queue/throughput/preemptions, aggregate prefix hit rate via PREFIX_CACHE_HIT_RATE when hit/query counters are missing, and HiCache offload panels using token units. KV offload panels are unit-aware (MB/s & GB for vLLM, tokens/s & M tokens for sglang). Panels that need fork-only sglang counters (GPU/External prefix split, prefill source stackplot) still render empty. Figure suptitle is generalized to "LLM Server Metrics During Benchmark"; vLLM string mappings are intended to preserve prior vLLM plot behavior.
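The unit-aware offload panels could be driven by a small per-engine unit table plus a counter-to-rate conversion. This is a hedged sketch: `OffloadUnits`, `OFFLOAD_UNITS`, and `counter_to_rate` are hypothetical names, and the decimal divisors (1e6 for MB, 1e9 for GB) are an assumption about how the labels are scaled.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class OffloadUnits:
    rate_label: str       # y-axis label for the transfer-rate panel
    rate_divisor: float   # raw units per displayed rate unit
    total_label: str      # y-axis label for the cumulative panel
    total_divisor: float  # raw units per displayed cumulative unit


# Assumed scaling: vLLM counters are bytes, sglang counters are tokens.
OFFLOAD_UNITS: Dict[str, OffloadUnits] = {
    "vllm": OffloadUnits("MB/s", 1e6, "GB", 1e9),
    "sglang": OffloadUnits("tokens/s", 1.0, "M tokens", 1e6),
}


def counter_to_rate(
    timestamps: Sequence[float], values: Sequence[float]
) -> List[float]:
    """Per-interval rate of a cumulative counter, in raw units per second."""
    rates = []
    for i in range(1, len(values)):
        dt = timestamps[i] - timestamps[i - 1]
        dv = values[i] - values[i - 1]
        # Clamp negative deltas (counter reset) and skip zero-length intervals.
        rates.append(max(dv, 0.0) / dt if dt > 0 else 0.0)
    return rates
```

A panel would then divide the rate by `rate_divisor` and the raw counter by `total_divisor`, and take its axis labels from the same table entry.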

Reviewed by Cursor Bugbot for commit 4f33930. Bugbot is set up for automated code reviews on this repo.

Replace the vLLM-anchored flat alias table in generate_aiperf_plots.py with a semantic Metric enum plus a per-engine ENGINE_METRICS registry. Panels now reference engine-neutral metric keys and the registry resolves them to the concrete series each engine exports (detected from the metric namespace). Adding a new backend becomes a single new table with no panel changes.

Also fill previously-blank panels for sglang using upstream (main-branch) metrics only:
- External/host KV usage from hicache_host_used/total_tokens (ratio adapter)
- KV offload transfer + cumulative from backuped_tokens / load_back_tokens, rendered in tokens (sglang exposes no per-token KV byte size upstream; offload panels are now unit-aware: bytes for vLLM, tokens for sglang)

The per-tier prefix-cache hit-rate split and prefill-source breakdown stay blank for sglang since they depend on fork-only counters. vLLM behavior is unchanged (string mappings + identical byte units).

cquil11 · 2026-06-15T16:26:26Z

@AMD-yanfeiwang can you rebase this onto agentx-v0.4 branch please?

AMD-yanfeiwang requested a review from a team June 15, 2026 06:48

github-project-automation Bot added this to InferenceMAX Board Jun 15, 2026

functionstackx requested a review from cquil11 June 15, 2026 16:20

AMD-yanfeiwang wants to merge 1 commit into SemiAnalysisAI:main from AMD-yanfeiwang:aiperf-plots-engine-neutral-metrics
