[Experimental][DNM till upstream PR merges][AMD] perf: load-time block FP8 MoE for MiniMax M3 on MI300X #1753

Open: Oseltamivir wants to merge 4 commits into main from feat/m3-mi300x-mxfp8

@Oseltamivir Oseltamivir commented Jun 14, 2026

Collaborator

Summary

  • Apply a runtime patch to the pinned ROCm image revision that converts
    checkpoint MXFP8 MoE weights once at load time into 128x128 block FP8.
  • Normalize OCP E4M3 values to the gfx942 FNUZ representation and run the
    regular Triton block-FP8 backend.
  • Use measured low-token TP tiles and a tuned E16 local-expert table.
  • Preserve the existing TP8 and TP8+EP8 parallelism and concurrency matrix.
  • Fail the launcher if patch application fails or the expected backend marker
    is absent.

The patch is scoped to
benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh. The separate
MTP recipe remains unchanged.

Validation

  • Runtime patch applies cleanly to image revision
    4a560dd8db67c270f5e2afb614558271b76f2294.
  • Patched Python files compile and both tuning JSON files parse.
  • Upstream changed-file pre-commit hooks pass, including mypy.
  • Targeted MI300X kernel module: 48 passed, 5 skipped.
  • python -m pytest utils/matrix_logic/ utils/test_process_changelog.py -q:
    158 passed.
  • MI300X matrix generation includes c1 through c128 TP8, c256 TP8+EP8,
    both sequence lengths, and the expected eval points.
  • Launcher shell validation and patch reverse-application checks pass.
  • Full MI300X TP8 and TP8+EP8 sweep:
    https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27725228435
    • All 18 serving points and both eval jobs passed after retrying three
      node-local Pyxis failures that occurred before model startup.
  • Independent accuracy run:
    https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27725256963
    • Across the two full 1,319-example GSM8K runs, strict exact match was
      95.53%-95.98% for TP8 and 95.30%-95.91% for TP8+EP8.
  • MI355X matched control/patched guard:
    https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27733137495
    • Both variants retained the native gfx950 MXFP8 backend; the numerical
      smoke test reported 0.037185 relative error.
    • Patched aggregate throughput was within 0.43% at concurrency 4
      (453.13 vs. 455.10 tok/s) and 0.84% at concurrency 64
      (4018.45 vs. 4052.69 tok/s) of the matched control.

The E16 table was measured twice. Its independent 100-iteration rerun reduced
kernel latency by 13.5% to 26.4% versus the built-in fallback for every tested
batch size from 64 through 8192.

End-to-End Interpretation

The unofficial chart overlay is not a before/after comparison. The branch
series uses MI300X with TP8 or TP8+EP8, while the adjacent MI355X series uses
TP4 on a different GPU generation.

Against the previous MI300X result at the same 8K/1K sequence lengths, TP/EP
configurations, and concurrency levels
(https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27510667862),
the patched path improves total throughput per GPU:

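| Parallelism | Concurrency | Previous (tok/s/GPU) | Patched (tok/s/GPU) | Change |
|---|---|---|---|---|
| TP8 | 1 | 99.88 | 104.94 | +5.1% |
| TP8 | 8 | 463.69 | 502.55 | +8.4% |
| TP8 | 32 | 880.10 | 981.21 | +11.5% |
| TP8 | 64 | 976.30 | 1236.57 | +26.7% |
| TP8+EP8 | 128 | 1110.16 | 1273.92 | +14.8% |
| TP8+EP8 | 256 | 1199.22 | 1469.05 | +22.5% |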
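
The Change column can be rechecked from the two throughput columns. A minimal sketch, assuming the figures are copied from the table above; `rows` and `pct_change` are illustrative names and not part of the PR:

```python
# Recompute the "Change" column from the previous and patched tok/s/GPU values.
# `rows` and `pct_change` are illustrative names, not part of the PR.
rows = [
    ("TP8", 1, 99.88, 104.94),
    ("TP8", 8, 463.69, 502.55),
    ("TP8", 32, 880.10, 981.21),
    ("TP8", 64, 976.30, 1236.57),
    ("TP8+EP8", 128, 1110.16, 1273.92),
    ("TP8+EP8", 256, 1199.22, 1469.05),
]


def pct_change(previous: float, patched: float) -> float:
    """Relative change of patched over previous, in percent."""
    return (patched / previous - 1.0) * 100.0


changes = [round(pct_change(prev, new), 1) for _, _, prev, new in rows]
for (par, conc, prev, new), change in zip(rows, changes):
    print(f"{par:8s} c{conc:<4d} {prev:8.2f} -> {new:8.2f}  {change:+.1f}%")
```

Rounded to one decimal place, the recomputed values agree with the percentages reported in the table.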

These are real same-hardware gains, but they do not close the end-to-end gap
to the MI355X TP4 curve in the throughput-oriented region. An earlier MI300X
TP4/DP2 experiment
(https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27664746568)
reached 1393.42 tok/s/GPU at 8K/1K concurrency 256, below the patched
TP8+EP8 result of 1469.05 tok/s/GPU, so changing parallelism alone does not
close that gap.

The runtime patch optimizes MoE weight representation and MoE kernel dispatch.
The chart also includes attention, sparse indexing, KV-cache handling,
scheduling, prefill/decode balance, collectives, and hardware differences.
The supported performance claim is therefore improved MI300X performance
relative to its previous path, not parity with MI355X.

MI300X end-to-end serving results, in total tokens/s/GPU:

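| Sequence | Parallelism | Concurrency | Total tok/s/GPU |
|---|---|---|---|
| 1K/1K | TP8 | 1 | 24.37 |
| 1K/1K | TP8 | 2 | 46.85 |
| 1K/1K | TP8 | 4 | 83.22 |
| 1K/1K | TP8 | 8 | 136.60 |
| 1K/1K | TP8 | 16 | 219.91 |
| 1K/1K | TP8 | 32 | 346.28 |
| 1K/1K | TP8 | 64 | 522.52 |
| 1K/1K | TP8 | 128 | 738.25 |
| 1K/1K | TP8+EP8 | 256 | 917.31 |
| 8K/1K | TP8 | 1 | 104.94 |
| 8K/1K | TP8 | 2 | 197.59 |
| 8K/1K | TP8 | 4 | 336.79 |
| 8K/1K | TP8 | 8 | 502.55 |
| 8K/1K | TP8 | 16 | 715.23 |
| 8K/1K | TP8 | 32 | 981.21 |
| 8K/1K | TP8 | 64 | 1236.57 |
| 8K/1K | TP8+EP8 | 128 | 1273.92 |
| 8K/1K | TP8+EP8 | 256 | 1469.05 |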

Note

Medium Risk
Runtime patching of vLLM quantization/MoE affects numerical behavior on MI300X serving paths until upstream lands; benchmark scope is limited but load-time requantization is accuracy-sensitive.

Overview
Enables the MiniMax-M3 MXFP8 MI300X vLLM benchmark by patching the pinned ROCm image at job start instead of waiting on upstream. minimaxm3_fp8_mi300x.sh locates the installed vllm tree, idempotently applies minimaxm3_mi300x_mxfp8.patch, verifies it, and fails the run if patch state is ambiguous.
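
The apply-once-then-verify behaviour described here can be sketched as follows. This is a Python illustration of the control flow only, not the launcher's shell code: `ensure_patched`, `target`, and `apply_patch` are hypothetical names, and the marker string is taken from the script snippet quoted later in this thread, so it may differ in later commits.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_patched(target: Path, marker: str, apply_patch: Callable[[], None]) -> str:
    """Apply a patch at most once and verify the expected marker afterwards.

    `target`, `marker`, and `apply_patch` are hypothetical stand-ins for the
    patched vLLM file, the backend marker string, and the `patch` invocation.
    """
    if marker in target.read_text():
        return "already-applied"  # idempotent: a second run is a no-op
    apply_patch()  # may raise; the caller should abort the run
    if marker not in target.read_text():
        raise RuntimeError(f"marker {marker!r} missing after patching {target}")
    return "applied"


# Demo against a temporary file instead of a real vLLM install.
with tempfile.TemporaryDirectory() as tmp:
    oracle = Path(tmp) / "mxfp8.py"
    oracle.write_text("# unpatched\n")
    marker = "Using fused CDNA3 (gfx94x)"

    def fake_patch() -> None:
        # Stand-in for the real patch: append a line containing the marker.
        oracle.write_text(oracle.read_text() + f'# log: "{marker}"\n')

    first = ensure_patched(oracle, marker, fake_patch)
    second = ensure_patched(oracle, marker, fake_patch)

    # A patch step that silently does nothing must fail the run.
    broken = Path(tmp) / "broken.py"
    broken.write_text("# unpatched\n")
    try:
        ensure_patched(broken, marker, lambda: None)
        failed = False
    except RuntimeError:
        failed = True
```

Checking the marker both before and after the patch step is what separates "already patched", "patched now", and "patch step ran but left the file in an unexpected state".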

The patch adds load-time conversion of checkpoint MXFP8 MoE weights to 128×128 block FP8 on gfx942 (OCP → FNUZ), routes MoE through the Triton block-FP8 backend, and ships retuned MI300X fused-MoE JSON (including a new E=16,N=3072 table). BF16 KV and the existing TP8 / TP8+EP8 sweep matrix are unchanged; amd-master.yaml comments now describe this path.

Also records the change in perf-changelog.yaml and hardens get_added_lines so append-only changelog edits (e.g. missing trailing newline) are detected correctly, with new tests in utils/test_process_changelog.py.
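
Why a missing trailing newline matters for append-only detection: a line-based diff reports the old last line as removed and re-added once text is appended after it. A minimal sketch of one way to tolerate this, assuming a prefix comparison on newline-normalised text; `appended_lines` is a hypothetical helper for illustration and not the repo's `get_added_lines`:

```python
def appended_lines(old_text: str, new_text: str) -> list[str]:
    """Return lines appended to old_text, tolerating a missing final newline.

    Hypothetical helper for illustration; it is not the repo's get_added_lines.
    """
    # Normalise both sides so that a file lacking a trailing newline does not
    # make its last line look modified.
    old = old_text if (not old_text or old_text.endswith("\n")) else old_text + "\n"
    new = new_text if (not new_text or new_text.endswith("\n")) else new_text + "\n"
    if not new.startswith(old):
        raise ValueError("changelog edit is not append-only")
    return new[len(old):].splitlines()


# The old file has no trailing newline; the new file appends one entry.
old_changelog = "- config: a\n  pr-link: https://example.invalid/1"
new_changelog = old_changelog + "\n- config: b\n  pr-link: https://example.invalid/2\n"
added = appended_lines(old_changelog, new_changelog)

# Editing an existing line is rejected rather than treated as an append.
edited = old_changelog.replace("/1", "/9") + "\n- config: b\n"
try:
    appended_lines(old_changelog, edited)
    rejected = False
except ValueError:
    rejected = True
```

The example URLs are placeholders. A real implementation working from `git diff` output would need the equivalent normalisation on diff hunks rather than on whole-file text.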

Reviewed by Cursor Bugbot for commit 6f5a399.

@github-actions

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@Oseltamivir Oseltamivir marked this pull request as ready for review June 14, 2026 06:42
@Oseltamivir Oseltamivir requested a review from a team June 14, 2026 06:42
@claude

claude Bot commented Jun 14, 2026

Contributor

Claude finished @Oseltamivir's task in 3m 50s.


Review of PR #1753

  • Gather context and read changed files
  • Validate master config changes
  • Validate perf-changelog entries
  • Validate benchmark script
  • Review runtime patch (Triton kernels, normalization, oracle changes)
  • Post review results

LGTM — no blocking issues found.

Summary: This PR adds a fused CDNA3 MXFP8 MoE backend for MiniMax-M3 on MI300X via a runtime vLLM patch. I reviewed all four changed files and the 656-line Triton kernel patch in detail. Key findings:

  • E4M3FN → E4M3FNUZ normalization (normalize_mxfp8_e4m3fn_to_e4m3fnuz): The math is correct — FNUZ's exponent bias is 1 higher, so incrementing the E8M0 scale compensates; negative zero (0x80) is properly canonicalized. The overflow guard (scales.max() >= 254) correctly prevents incrementing into E8M0 NaN (255).
  • FNUZ kernel scale computation (tl.exp2(asc + bsc - 254.0)): Correct — two E8M0 dequants 2^(a-127) * 2^(b-127) = 2^(a+b-254).
  • Split-K: Occupancy-aware selection, FP32 accumulation buffer when split_k > 1, tl.atomic_add for reduction — all consistent.
  • Grid bound (min(sorted_token_ids.shape[0], M_routed * block_m)): Valid upper bound since active experts ≤ M_routed, and the kernel's num_post guard handles any overestimate.
  • Benchmark script: Expert parallelism correctly conditioned on EP_SIZE, vllm serve arguments properly formatted on separate lines, patch application is idempotent via the grep guard + --forward.
  • Perf-changelog: New entry correctly appended at the end of the file.
  • Master config: Only the comment was updated; no functional config or image changes.
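
The two scale identities discussed in the review can be checked numerically. The sketch below assumes the usual bit layouts (E4M3FN with bias 7 and NaN at 0x7F/0xFF; E4M3FNUZ with bias 8 and NaN at 0x80; E8M0 as 2^(s-127) with 255 as NaN). The function names are illustrative and are not the ones used in the patch, and the FN NaN encodings are deliberately left unhandled.

```python
def decode_e4m3fn(byte: int) -> float:
    """Decode an OCP E4M3FN byte (bias 7; 0x7F and 0xFF are NaN)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** (1 - 7)  # subnormal
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def decode_e4m3fnuz(byte: int) -> float:
    """Decode an E4M3FNUZ byte (bias 8; 0x80 is NaN, no negative zero)."""
    if byte == 0x80:
        return float("nan")
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** (1 - 8)  # subnormal
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 8)


def e8m0(scale_byte: int) -> float:
    """Decode an E8M0 scale as 2^(s - 127); 255 encodes NaN."""
    if scale_byte == 255:
        return float("nan")
    return 2.0 ** (scale_byte - 127)


def normalize(byte: int, scale_byte: int) -> tuple[int, int]:
    """Reinterpret an E4M3FN byte as FNUZ and raise the E8M0 scale by one.

    Illustrative only: FN NaN inputs (0x7F, 0xFF) are not handled here.
    """
    if scale_byte >= 254:
        raise OverflowError("incrementing the scale would reach E8M0 NaN (255)")
    if byte == 0x80:  # FN negative zero would read as NaN in FNUZ
        byte = 0x00
    return byte, scale_byte + 1


# 0x38 is 1.0 in E4M3FN; the same bits read as 0.5 in FNUZ, so the scale
# exponent goes from 127 (2^0) to 128 (2^1) to preserve the product.
fnuz_byte, fnuz_scale = normalize(0x38, 127)
before = decode_e4m3fn(0x38) * e8m0(127)
after = decode_e4m3fnuz(fnuz_byte) * e8m0(fnuz_scale)

# Product of two E8M0 dequantization factors, as used for the kernel scale.
combined = e8m0(120) * e8m0(130)
```

Because the same bit pattern decodes to half the value under the higher bias, adding one to the E8M0 exponent leaves every finite dequantized value unchanged, which is the compensation the review describes.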

@functionstackx functionstackx left a comment

Collaborator

plz create upstream PR and have it reviewed before merging this patch

Comment thread benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh
@Oseltamivir

Collaborator Author

Opened the requested upstream vLLM PR: vllm-project/vllm#45567. It is stacked on the active MiniMax M3 model branch/PR (#45381), includes the tested gfx94x MXFP8 kernel and benchmark, and passes all vLLM pre-commit hooks. The InferenceX patch has also been updated to the optimized tile selection and no longer uses split-K.

@Oseltamivir Oseltamivir changed the title from "[AMD] feat: native MXFP8 MoE for MiniMax M3 on MI300X" to "[AMD] perf: hybrid MXFP8 MoE for MiniMax M3 on MI300X" Jun 14, 2026

MXFP8_ORACLE="$VLLM_PACKAGE_ROOT/vllm/model_executor/layers/fused_moe/oracle/mxfp8.py"
if ! grep -q "Using fused CDNA3 (gfx94x)" "$MXFP8_ORACLE"; then
    patch --batch --forward -d "$VLLM_PACKAGE_ROOT" -p1 < "$MXFP8_PATCH"
fi

MTP script skips MXFP8 patch

Medium Severity

Runtime MXFP8 patching was added only to the non-MTP MI300X benchmark script. launch_mi300x-amds.sh runs minimaxm3_fp8_mi300x_mtp.sh for spec-decoding: mtp configs, so those jobs never apply minimaxm3_mi300x_mxfp8.patch despite the MTP script claiming it mirrors this recipe.

Reviewed by Cursor Bugbot for commit c3cdc37.

@Oseltamivir

Collaborator Author

Packed-scale follow-up is pushed in 7678b0bc (merge refresh 684b6a3d). Matched local results, with parallelism unchanged:

  • 1K/1K TP8: 214.239 / 329.709 / 491.056 / 707.749 tok/s/GPU at concurrency 16/32/64/128
  • 8K/1K TP8 c64: 1199.146 tok/s/GPU
  • 8K c64 is +5.60% vs the previous hybrid sweep and +22.74% vs BF16 emulation

Full validation sweep: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27511311644

@functionstackx

Collaborator

@Oseltamivir's AI agent, remember to have ur search space start at conc=1, like i am fixing rn in #1760

@functionstackx functionstackx changed the title from "[AMD] perf: hybrid MXFP8 MoE for MiniMax M3 on MI300X" to "[Experimental][DNM till upstream PR merges][AMD] perf: hybrid MXFP8 MoE for MiniMax M3 on MI300X" Jun 14, 2026
@Oseltamivir Oseltamivir reopened this Jun 15, 2026
@Oseltamivir Oseltamivir force-pushed the feat/m3-mi300x-mxfp8 branch from d1638a0 to 465ff47 June 17, 2026 20:51
@Oseltamivir Oseltamivir changed the title from "[Experimental][DNM till upstream PR merges][AMD] perf: hybrid MXFP8 MoE for MiniMax M3 on MI300X" to "[Experimental][DNM till upstream PR merges][AMD] perf: load-time block FP8 MoE for MiniMax M3 on MI300X" Jun 17, 2026
@Oseltamivir Oseltamivir force-pushed the feat/m3-mi300x-mxfp8 branch 2 times, most recently from 3d35ece to 6abc0eb June 17, 2026 20:56
Comment thread utils/process_changelog.py Fixed
Comment thread benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh
Comment thread benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh
@Oseltamivir Oseltamivir force-pushed the feat/m3-mi300x-mxfp8 branch 2 times, most recently from 9031110 to 0a01999 June 17, 2026 21:20
@Oseltamivir Oseltamivir force-pushed the feat/m3-mi300x-mxfp8 branch from 0a01999 to 95e79da June 17, 2026 21:30

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 95e79da.

Comment thread perf-changelog.yaml
- "Convert checkpoint MXFP8 MoE weights once at load time to 128x128 block FP8 on gfx942."
- "Normalize OCP E4M3 values to FNUZ and use the regular Triton block-FP8 backend."
- "Use measured low-token TP tiles and a tuned local-expert table without changing the TP8/TP8+EP8 matrix."
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1753

Changelog fix triggers deletion error

Medium Severity

This commit replaces the malformed glm5-fp4-gb300-dynamo-trt pr-link line and adds the minimaxm3-fp8-mi300x-vllm entry. process_changelog.get_added_lines treats the removed pr-link line as a non-whitespace deletion and raises, so labeled PR sweep setup can fail before benchmarks run.

Reviewed by Cursor Bugbot for commit 95e79da.

Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com>
@Oseltamivir Oseltamivir force-pushed the feat/m3-mi300x-mxfp8 branch from 95e79da to 27510c4 June 17, 2026 21:47

Signed-off-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com>

@Oseltamivir

Collaborator Author

/reuse-sweep-run 27725228435
