perf(vllm): compact MiniMax M3 EP decode routes on MI300X #1782
Oseltamivir wants to merge 23 commits
Signed-off-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com> Co-authored-by: OpenAI Codex <codex@openai.com>
Claude finished @Oseltamivir's task in 7m 0s
Review of PR #1782
LGTM: no blocking issues found.
Summary: This is a well-structured incremental optimization for MiniMax M3 EP8 on MI300X. The patch adds two complementary paths (short-context native/BF16 dispatch via the oracle, long-context sparse BF16 emulation with fused SwiGLU), and the gating logic correctly ensures each path is only reached for the profiled configuration. Key observations:
Claude finished @Oseltamivir's task in 4m 57s
Review of PR #1782
LGTM: no blocking issues found.
Summary: This is a well-scoped incremental EP8 decode optimization stacked on #1753. The patch adds two complementary paths: short-context EP8 routes to
Summary
`TritonExperts` path
This PR is stacked on #1753 and contains only the incremental EP8 runtime
optimization. It does not include the profiling branch, temporary benchmark
configuration, AITER work, or `perf-changelog.yaml` changes.
Regression analysis
The original fused candidate improved 1k/c256 but regressed 8k/c256 against
main:
At 8k/c256, mean TTFT rose from 46.55 s to 57.33 s and mean TPOT rose from
185.39 ms to 223.16 ms. GPU power also fell from about 712 W to 652 W while
gfx utilization remained near 100%, indicating inefficient MFMA execution
rather than a bandwidth bottleneck.
The regression had two causes:
products serially. Removing an activation launch did not repay the lost
matrix-core efficiency.
larger prefill configuration selected by the existing generic path.
Bad run:
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27569397626/attempts/7
Main comparison:
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27510667862
Profile-based optimization
At c256, the profiled generation batch has about 216 active tokens and top-k
4, or 864 global routes. EP8 retains about 108 local routes across 16 experts,
roughly 6.75 rows per local expert. Once remote routes are removed, a 64-row M
tile wastes most of each active expert block. A 16-row tile better matches the
actual local occupancy.
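The occupancy argument above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not code from the patch: `tile_padding` is a hypothetical helper, and it assumes routes are spread uniformly over ranks and local experts and that each active expert launches whole M tiles.

```python
import math


def tile_padding(rows_per_expert: float, block_m: int) -> tuple[int, float]:
    """Return (padded_rows, utilization) for one expert block.

    Hypothetical helper: assumes each expert's rows are padded up to a
    whole number of block_m-row tiles.
    """
    padded = math.ceil(rows_per_expert / block_m) * block_m
    return padded, rows_per_expert / padded


# Figures quoted in the PR description.
active_tokens = 216
top_k = 4
ep_size = 8
local_experts = 16

global_routes = active_tokens * top_k           # routes across all ranks
local_routes = global_routes / ep_size          # assumes uniform routing
rows_per_expert = local_routes / local_experts  # average rows per local expert

for block_m in (16, 32, 64):
    padded, util = tile_padding(rows_per_expert, block_m)
    print(f"BM{block_m}: {padded} padded rows, {util:.1%} of the tile is useful")
```

Under these assumptions a 16-row tile is still less than half full, but the useful fraction is four times that of a 64-row tile, which is consistent with the direction of the measured result.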
Across the 57 sparse layers in the selected decode step:
The relevant MoE path falls from about 29.80 ms to 28.06 ms, a 5.8% reduction.
BM32 and BM64 controls were slower, confirming that local-route padding, not
the generic total-token selector, should drive this specialized decode tile.
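To put the saving in per-layer terms, the quoted totals can be divided by the layer count. This sketch assumes the 29.80 ms and 28.06 ms figures are sums over the 57 sparse layers of the selected decode step; the variable names are illustrative.

```python
# Figures quoted in the PR description; treating them as totals over
# 57 sparse layers is an assumption of this sketch.
sparse_layers = 57
moe_before_ms = 29.80
moe_after_ms = 28.06

saved_ms = moe_before_ms - moe_after_ms
reduction_pct = saved_ms / moe_before_ms * 100.0
saved_per_layer_us = saved_ms / sparse_layers * 1000.0

print(f"total saving: {saved_ms:.2f} ms ({reduction_pct:.1f}%)")
print(f"per-layer saving: {saved_per_layer_us:.1f} us")
```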
Profile controls:
Scope
The path is gated to the exact gfx94x MiniMax M3 EP8 BF16 shape and decode
batches of at most 256 tokens. Prefill and larger mixed batches use the
existing generic implementation and its established MI300X configurations.
Other models and platforms are unchanged.
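The gating described above amounts to a conjunction of checks with a fallback to the generic path. The following is a hedged sketch only: `MoeLaunchContext`, its fields, the `"minimax-m3"` identifier, and `use_compact_decode_path` are hypothetical names for illustration, and the actual patch presumably matches on the concrete tensor shapes rather than a model string.

```python
from dataclasses import dataclass

MAX_DECODE_TOKENS = 256  # decode batches above this use the generic path


@dataclass(frozen=True)
class MoeLaunchContext:
    """Hypothetical description of one fused-MoE launch."""
    gfx_arch: str    # e.g. "gfx942" on MI300X
    model: str       # hypothetical model identifier
    ep_size: int     # expert-parallel world size
    dtype: str       # activation dtype
    is_decode: bool  # False for prefill or mixed batches
    num_tokens: int  # tokens in this batch


def use_compact_decode_path(ctx: MoeLaunchContext) -> bool:
    """Return True only for the profiled configuration; otherwise fall back."""
    return (
        ctx.gfx_arch.startswith("gfx94")
        and ctx.model == "minimax-m3"
        and ctx.ep_size == 8
        and ctx.dtype == "bfloat16"
        and ctx.is_decode
        and ctx.num_tokens <= MAX_DECODE_TOKENS
    )


decode = MoeLaunchContext("gfx942", "minimax-m3", 8, "bfloat16", True, 216)
prefill = MoeLaunchContext("gfx942", "minimax-m3", 8, "bfloat16", False, 8192)
print(use_compact_decode_path(decode), use_compact_decode_path(prefill))
```

Keeping every condition in one predicate makes it straightforward to confirm that any configuration outside the profiled one falls through to the existing implementation.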
The prepared vLLM follow-up branch is stacked on the native gfx94x MXFP8 MoE
work in vLLM #45726 and has not yet been opened as a PR. It will be rebased onto main
after that prerequisite merges:
https://github.com/Oseltamivir/vllm/tree/codex/minimax-m3-mi300x-ep-mxfp8
Validation
- `python -m pytest utils/matrix_logic/ -q`: 156 passed
- `compileall`, and `git diff --check`
- expert-map-aware reduction, including skipped remote rows
The requested MI300X serving matrix completed at c1, c16, and c256 for 1k1k
and 8k1k:
Values are total throughput in tok/s/GPU. At 8k1k c256, mean TTFT improves
from 46.55 s on main to 45.36 s, and mean TPOT improves from 185.39 ms to
179.37 ms. Average GPU power is unchanged at about 712 W, whereas the
regressed fused candidate drew about 652 W.
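For a side-by-side view, the 8k c256 figures quoted in this description can be expressed as relative changes against main. This is a small illustrative sketch; `pct_change` and the dictionary keys are hypothetical names, and the numbers are the rounded values given in the text.

```python
def pct_change(baseline: float, value: float) -> float:
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# 8k c256 figures as quoted in the PR description (power is approximate).
main = {"ttft_s": 46.55, "tpot_ms": 185.39, "power_w": 712.0}
regressed = {"ttft_s": 57.33, "tpot_ms": 223.16, "power_w": 652.0}
final = {"ttft_s": 45.36, "tpot_ms": 179.37, "power_w": 712.0}

for name, run in (("regressed fused candidate", regressed), ("this PR", final)):
    deltas = {key: pct_change(main[key], run[key]) for key in main}
    summary = ", ".join(f"{key} {delta:+.1f}%" for key, delta in deltas.items())
    print(f"{name} vs main: {summary}")
```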
Benchmark runs:
The final runs checked out benchmark commit `efe99e11`, whose runtime patch
matches this PR. An earlier 8k1k c256 attempt failed before container startup
because the assigned node could not create an enroot user namespace; no
benchmark result from that attempt is included above.