[Klaud Cold][Experimental][DNM] minimaxm3-fp8-mi355x-vllm-disagg: day-zero MoRI-IO disagg smoke test (1P TP8 + 1D TP8, conc 1) by functionstackx · Pull Request #1762 · SemiAnalysisAI/InferenceX

functionstackx · 2026-06-14T23:12:31Z

What

MiniMax-M3 MXFP8 MI355X vLLM disaggregated (prefill/decode) smoke test on the day-zero ROCm image (vllm/vllm-openai-rocm:minimax-m3):

1 prefill worker (TP8) + 1 decode worker (TP8) at conc 1, ISL/OSL 1k/1k
Validates the MoRI-IO KV-transfer disagg pipeline end-to-end for M3 before scaling out the search space
New config key: minimaxm3-fp8-mi355x-vllm-disagg

Layered on #1585 (remove vLLM-disagg MoRI patches)

This PR brings in #1585's MoRI-patch-removal infra (that PR is very stale vs main, so the changes are applied selectively rather than by merge):

amd_utils/{setup_deps.sh, server_vllm.sh, submit.sh, models_vllm.yaml}: taken from #1585 ([Fix] Remove MoRI-IO patches from vLLM Disagg benchmarks). main has not changed these files since the merge-base, so they equal main plus the MoRI removal. Includes the --all2all-backend mori → mori_low_latency switch for the existing M2.5/Kimi entries.
amd_utils/job.slurm: #1585's two vLLM-disagg hunks applied onto current main (keeping main's atom-disagg support): the vllm-router image bump nightly-20260511-e667ebb → nightly-20260603-e667ebb, and dropping the VLLM_MORIIO_CONNECTOR_READ_MODE env from the vllm-disagg container block.

M3 recipe

benchmarks/multi_node/minimaxm3_fp8_mi355x_vllm-disagg.sh — model-agnostic disagg boilerplate (byte-identical to the M2.5 disagg script; the launcher resolves the per-SKU script by name).
models_vllm.yaml MiniMax-M3-MXFP8 — per-worker serve flags: --block-size 128 (MSA sparse/index cache), --language-model-only (text-only benchmark), --kv-cache-dtype fp8 (gfx950), --attention-backend TRITON_ATTN, minimax_m3 tool/reasoning parsers; no EP (TP8, MoE experts TP-sharded as in the single-node M3 TP8 recipe). Env: VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_USE_BREAKABLE_CUDAGRAPH=0 VLLM_ENGINE_READY_TIMEOUT_S=3600.

Scope guard

perf-changelog.yaml and .github/configs/amd-master.yaml contain only M3 changes vs main.

Validation

bash -n on the disagg script ✓
YAML parses (models_vllm / amd-master / perf-changelog) ✓
generate_sweep_configs test-config → exactly 1 disagg config (exp-name minimaxm3_1k1k, runner mi355x-disagg, 1P TP8 + 1D TP8, conc 1) ✓
launcher routes minimaxm3 / fp8 / vllm-disagg → benchmarks/multi_node/minimaxm3_fp8_mi355x_vllm-disagg.sh ✓
process_changelog.py selects minimaxm3-fp8-mi355x-vllm-disagg ✓
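
The launcher routing checked above can be pictured with a minimal sketch. It assumes the script name is simply the SKU fields joined with underscores, a pattern inferred from the single path quoted in this PR; `disagg_script_path` and its parameter names are hypothetical and not the launcher's real code.

```python
from pathlib import PurePosixPath


def disagg_script_path(model: str, precision: str, hardware: str, framework: str,
                       root: str = "benchmarks/multi_node") -> str:
    """Hypothetical resolver: build the per-SKU script path from its fields.

    Assumption: the file name is "<model>_<precision>_<hardware>_<framework>.sh".
    """
    return str(PurePosixPath(root) / f"{model}_{precision}_{hardware}_{framework}.sh")


print(disagg_script_path("minimaxm3", "fp8", "mi355x", "vllm-disagg"))
```

Because the disagg script itself is model-agnostic boilerplate, a name-based lookup like this is all that distinguishes one SKU from another at launch time.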

🤖 Generated with Claude Code

Note

Medium Risk
Removes MoRI-IO workarounds for all vllm-disagg jobs and changes router/MoE backend settings, which could regress existing Kimi/M2.5 disagg runs if the pinned images lack equivalent fixes.

Overview
Adds minimaxm3-fp8-mi355x-vllm-disagg on vllm/vllm-openai-rocm:minimax-m3: 1 prefill + 1 decode (TP8, no EP), conc 1–16 at 1k/1k and 8k/1k, plus a launcher that points MODEL_PATH at /it-share/hf-hub-cache for the staged M3 checkpoint. models_vllm.yaml gains MiniMax-M3-MXFP8 serve flags (block-size 128, language-model-only, FP8 KV, TRITON_ATTN).

vLLM-disagg MoRI-IO no longer applies runtime Python patches in setup_deps.sh (timeouts, scheduler fixes, idle KV reaper); READ mode is enabled via read_mode: true in server_vllm.sh --kv-transfer-config instead of VLLM_MORIIO_CONNECTOR_READ_MODE in job.slurm / submit.sh. Default vllm-router image is bumped; Kimi/M2.5 disagg entries switch --all2all-backend from mori to mori_low_latency. perf-changelog.yaml documents the new config.

^{Reviewed by Cursor Bugbot for commit 549fb1b. Bugbot is set up for automated code reviews on this repo. Configure here.}

github-actions · 2026-06-14T23:12:40Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

github-actions · 2026-06-14T23:18:26Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27515117946
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27515117946

github-actions · 2026-06-15T01:37:24Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27515119215
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27515119215

functionstackx · 2026-06-15T01:42:49Z

First sweep failure — diagnosed & fixed

The first disagg sweep (run 27515119215) failed — not a recipe bug. The day-zero MiniMax-M3-MXFP8 checkpoint isn't staged on the MI355X disagg cluster, and the disagg path only searches pre-staged shared-storage paths (no in-container hf download like the single-node recipes):

FATAL: Model 'MiniMax-M3-MXFP8' not found. Searched:
  - /it-share/data/models--MiniMaxAI--MiniMax-M3-MXFP8
  - /it-share/data/MiniMax-M3-MXFP8
  - /nfsdata/hf_hub_cache-0/models--MiniMaxAI--MiniMax-M3-MXFP8
  - /nfsdata/hf_hub_cache-0/MiniMax-M3-MXFP8

server.sh exited immediately; the step then polled the (queued-then-dead) slurm job ~2h before failing.
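
The four paths in the FATAL message follow a regular pattern: each search root is tried with the HF cache directory name and then with the bare model name. A minimal sketch that reproduces the list, assuming exactly that pattern (`model_search_candidates` is a hypothetical helper, not the function used in job.slurm):

```python
from typing import List


def model_search_candidates(repo_id: str, roots: List[str]) -> List[str]:
    """Hypothetical helper: list the directories a pre-staged model could live in."""
    org, name = repo_id.split("/", 1)
    candidates = []
    for root in roots:
        # HF cache layout first, then the plain model directory.
        candidates.append(f"{root}/models--{org}--{name}")
        candidates.append(f"{root}/{name}")
    return candidates


for path in model_search_candidates(
    "MiniMaxAI/MiniMax-M3-MXFP8", ["/it-share/data", "/nfsdata/hf_hub_cache-0"]
):
    print(path)
```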

Fix: amd_utils/job.slurm now auto-downloads the checkpoint when it isn't pre-staged, instead of a hard FATAL:

derives the HF repo id from hf_dir (models--org--name → org/name)
downloads into MODEL_DIR in HF cache layout (keeps MODEL_PATH under the -v ${MODEL_DIR}:/models mount / DOCKER_MODEL_PATH remap)
runs in a one-shot container of the serving image (host has no hf CLI), flock-serialized across prefill/decode nodes, idempotent re-check, 3 retries, huggingface-cli fallback, HF_TOKEN passthrough
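
For illustration, the repo-id derivation in the first bullet (models--org--name → org/name) could be sketched as below. The function name and error handling are assumptions; the actual job.slurm change was shell, and this only mirrors the mapping it describes.

```python
def repo_id_from_hf_dir(hf_dir: str) -> str:
    """Hypothetical helper: map an HF cache directory name to a repo id."""
    prefix = "models--"
    if not hf_dir.startswith(prefix):
        raise ValueError(f"not an HF cache directory name: {hf_dir!r}")
    # Split on the first "--" after the prefix: org on the left, name on the right.
    org, sep, name = hf_dir[len(prefix):].partition("--")
    if not sep or not org or not name:
        raise ValueError(f"cannot split org/name from {hf_dir!r}")
    return f"{org}/{name}"


print(repo_id_from_hf_dir("models--MiniMaxAI--MiniMax-M3-MXFP8"))
```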

Scoped to the vllm-disagg branch; pre-staged models (M2.5/Kimi) never reach this path. Re-running the sweep.

github-actions · 2026-06-15T02:34:56Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27519206250
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27519206250

github-actions · 2026-06-15T02:54:45Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27520697241
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27520697241

github-actions · 2026-06-15T04:50:41Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27521167091
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27521167091

…tness) The conc-1 1k1k smoke test never triggered an eval — the multi-node eval policy only marks 8k1k entries with conc >= MIN_EVAL_CONC (16). Add an 8k1k conc-16 row (same 1P TP8 + 1D TP8 layout) so mark_eval_entries marks it run-eval=true (eval-conc=16), running lm-eval through the MoRI-IO disagg pipeline to validate correctness. The conc-1 1k1k row stays the latency smoke test. Run with non-canary-full-sweep-enabled so the (non-min-conc) eval entry runs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
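
A rough sketch of the eval-marking policy described in this commit message, for readers unfamiliar with it. Only `MIN_EVAL_CONC`, `run-eval` and `eval-conc` come from the text above; the entry fields (`isl`, `osl`, `conc`), the 8192/1024 token counts for "8k1k", and the function itself are assumptions, and the real `mark_eval_entries` may apply further rules.

```python
from typing import Dict, List

MIN_EVAL_CONC = 16  # threshold quoted in the commit message


def mark_eval_entries_sketch(entries: List[Dict]) -> List[Dict]:
    """Hypothetical re-implementation: flag 8k1k entries at or above MIN_EVAL_CONC."""
    marked = []
    for entry in entries:
        entry = dict(entry)  # copy so the caller's data is left untouched
        is_8k1k = entry["isl"] == 8192 and entry["osl"] == 1024
        entry["run-eval"] = is_8k1k and entry["conc"] >= MIN_EVAL_CONC
        if entry["run-eval"]:
            entry["eval-conc"] = entry["conc"]
        marked.append(entry)
    return marked


rows = [
    {"isl": 1024, "osl": 1024, "conc": 1},   # latency smoke test, never evaluated
    {"isl": 8192, "osl": 1024, "conc": 16},  # the row added by this commit
]
for row in mark_eval_entries_sketch(rows):
    print(row)
```

Under this reading, the original conc-1 1k1k row can never be marked, which is why a separate 8k1k conc-16 row was needed.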

cursor · 2026-06-15T05:27:40Z

-                logger.error("Transfer %s failed: %s", status, e)
-                raise"""
-
-    new_wait = """    def waiting_for_transfer_complete(self):


Patches removed for pinned images

Medium Severity

This PR drops all runtime MoRI-IO vLLM patches from setup_deps.sh and switches Kimi/M2.5 serve flags to all2all-backend mori_low_latency, while amd-master.yaml still pins minimaxm2.5-fp8-mi355x-vllm-disagg and kimik2.5-fp4-mi355x-vllm-disagg to older nightly digests. Those jobs share the same vllm-disagg path, so they may hit unfixed hangs/assertions or unsupported CLI values without an image bump in this change.

Additional Locations (1)

benchmarks/multi_node/amd_utils/models_vllm.yaml#L28-L37

^{Reviewed by Cursor Bugbot for commit 01ed5b8. Configure here.}

Widen the 1k1k disagg latency/throughput sweep from conc 1 to conc 1,2,4,8,16 (1P TP8 + 1D TP8). The 8k1k conc-16 eval row is unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Widen the disagg sweep from conc 1 to conc 1,2,4,8,16 for both seq-len scenarios (1P TP8 + 1D TP8). The 8k1k conc-16 point keeps the multi-node eval marked (eval-conc=16) so lm-eval still validates the MoRI-IO disagg pipeline. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions · 2026-06-15T05:31:12Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27525928087
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27525928087

github-actions · 2026-06-15T10:53:26Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27526046669
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27526046669

github-actions · 2026-06-15T17:20:28Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27526046669
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27526046669

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…g smoke test

MiniMax-M3 MXFP8 MI355X vLLM disaggregated (prefill/decode) smoke test on the day-zero ROCm image (vllm/vllm-openai-rocm:minimax-m3): 1 prefill (TP8) + 1 decode (TP8) at conc 1, validating the MoRI-IO KV-transfer disagg pipeline end-to-end for M3.

Layered on the MoRI-IO patch-removal infra (#1585): brings in that PR's amd_utils changes (setup_deps.sh / server_vllm.sh / submit.sh / models_vllm.yaml mori -> mori_low_latency) and the two job.slurm hunks (vllm-router image bump nightly-20260511 -> nightly-20260603, drop VLLM_MORIIO_CONNECTOR_READ_MODE env), while keeping main's atom-disagg support intact.

Per-worker serve flags (models_vllm.yaml MiniMax-M3-MXFP8): --block-size 128 (MSA), --language-model-only, --kv-cache-dtype fp8, --attention-backend TRITON_ATTN, minimax_m3 tool/reasoning parsers; no EP (TP8, MoE experts TP-sharded as in the single-node M3 TP8 recipe).

perf-changelog.yaml and amd-master.yaml contain only M3 changes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The first MI355X disagg sweep (run 27515119215) failed: the day-zero MiniMax-M3-MXFP8 checkpoint is not staged on the disagg cluster's shared FS, so job.slurm's model search hit a hard FATAL ("Model 'MiniMax-M3-MXFP8' not found. Searched: ...") before the engine ever started. The single-node recipes hf-download inside the serving container, but the disagg path historically required ops to pre-stage checkpoints.

Add an on-demand fallback to the vllm-disagg model-resolution block: when the checkpoint isn't found, derive the HF repo id from the hf_dir (models--org--name -> org/name) and download into MODEL_DIR in HF cache layout, then resolve the snapshot as MODEL_PATH. Staging into MODEL_DIR keeps MODEL_PATH under the dir that is bind-mounted into the serving container as /models, so the existing -v ${MODEL_DIR}:/models mount and DOCKER_MODEL_PATH (/models) remap both resolve.

Implementation notes:
- The host has no hf CLI, so the download runs in a one-shot container of the serving image (DOCKER_IMAGE_NAME), which ships huggingface_hub.
- flock on a lockfile in MODEL_DIR serializes the prefill/decode nodes; a re-check of snapshots/ under the lock makes it idempotent (resumable).
- hf download with a huggingface-cli fallback; 3 retries; HF_TOKEN passed through for gated repos.
- Scoped to the vllm-disagg branch only; pre-staged models never reach this path (the search finds them first), so sglang/atom and existing vLLM disagg models (M2.5/Kimi) are unaffected.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
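
The locking pattern in the implementation notes (take the lock, re-check, download with retries) can be sketched as follows. This is a Python illustration of the idea only: the real change was shell inside job.slurm, `ensure_staged` and the `.download.lock` file name are hypothetical, and the `download` callable stands in for the containerized hf download. Whether flock is honored across nodes depends on the shared filesystem.

```python
import fcntl
import time
from pathlib import Path
from typing import Callable, Optional


def _has_snapshot(snapshots: Path) -> bool:
    return snapshots.is_dir() and any(snapshots.iterdir())


def ensure_staged(model_dir: Path, hf_dir: str, download: Callable[[], None],
                  retries: int = 3, delay_s: float = 0.0) -> Path:
    """Hypothetical helper: stage a checkpoint once, even if two nodes race."""
    snapshots = model_dir / hf_dir / "snapshots"
    model_dir.mkdir(parents=True, exist_ok=True)
    with open(model_dir / ".download.lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks while another node holds the lock
        try:
            # Re-check under the lock: the other node may have finished already.
            if _has_snapshot(snapshots):
                return snapshots
            last_err: Optional[Exception] = None
            for _ in range(retries):
                try:
                    download()
                except Exception as err:  # retry on any download failure
                    last_err = err
                    time.sleep(delay_s)
                    continue
                if _has_snapshot(snapshots):
                    return snapshots
            raise RuntimeError(f"download failed after {retries} attempts") from last_err
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

The re-check is what makes the second node a no-op: it waits on the lock, then finds snapshots/ populated and returns without downloading.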

The disagg auto-download reached hf download but failed all 3 attempts: the one-shot `docker run "$DOCKER_IMAGE_NAME" bash -lc "hf download ..."` did not override the image ENTRYPOINT, so the vllm-openai API server ran with the bash command as its args and died with "Failed to infer device type" (no GPU mounted in the download container). Add --entrypoint "" (as the serving container does) so bash actually runs hf download. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
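
To make the entrypoint fix concrete, here is a small sketch of how the one-shot command line could be assembled. `--entrypoint ""` is the documented Docker way to clear an image's ENTRYPOINT; the helper name, the `--rm` flag, and the mount handling are assumptions rather than the exact invocation in job.slurm.

```python
from typing import Dict, List


def one_shot_argv(image: str, shell_cmd: str, mounts: Dict[str, str]) -> List[str]:
    """Hypothetical helper: build a docker run argv that really executes bash."""
    # Without --entrypoint "", the image ENTRYPOINT (the API server) would
    # receive "bash -lc ..." as its arguments instead of bash running.
    argv = ["docker", "run", "--rm", "--entrypoint", ""]
    for host_path, container_path in mounts.items():
        argv += ["-v", f"{host_path}:{container_path}"]
    argv += [image, "bash", "-lc", shell_cmd]
    return argv


print(one_shot_argv("vllm/vllm-openai-rocm:minimax-m3", "hf download ...",
                    {"/it-share/data": "/models"}))
```

Passing the command as an argv list (for example to `subprocess.run`) keeps the empty entrypoint string intact, which is easy to lose with shell quoting.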

…wnload

Per maintainer direction, point the MiniMax-M3 disagg model dir at the cluster's shared HF cache where the ~414 GB MXFP8 checkpoint is already staged (/it-share/hf-hub-cache/models--MiniMaxAI--MiniMax-M3-MXFP8), instead of the launcher default /it-share/data. Scoped to M3 only via the M3 disagg script:

  export MODEL_PATH=/it-share/hf-hub-cache

submit.sh exports MODEL_DIR=$MODEL_PATH and job.slurm resolves the snapshot under it (search path #1) and bind-mounts MODEL_DIR into the prefill/decode serving containers. Other disagg models keep /it-share/data.

This supersedes the earlier job.slurm auto-download approach, which is reverted: job.slurm now differs from main only by the #1585 mori-removal hunks (router image bump + dropping VLLM_MORIIO_CONNECTOR_READ_MODE).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
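
As a rough illustration of "resolves the snapshot under it", the sketch below walks the standard Hugging Face cache layout (models--org--name/snapshots/<revision>/). `resolve_snapshot` is hypothetical, and picking the lexicographically last revision when several exist is an assumption, not job.slurm's documented behavior.

```python
from pathlib import Path
from typing import Optional


def resolve_snapshot(model_dir: Path, hf_dir: str) -> Optional[Path]:
    """Hypothetical helper: find a snapshot directory inside an HF cache root."""
    snapshots = model_dir / hf_dir / "snapshots"
    if not snapshots.is_dir():
        return None
    revisions = sorted(p for p in snapshots.iterdir() if p.is_dir())
    # Assumption: with several revisions staged, take the last one by name.
    return revisions[-1] if revisions else None


# Example call with the paths named in this commit; returns None on a machine
# where the checkpoint is not staged.
print(resolve_snapshot(Path("/it-share/hf-hub-cache"), "models--MiniMaxAI--MiniMax-M3-MXFP8"))
```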

vllm/vllm-router only retains ~16 recent nightlies on Docker Hub; older dated tags are garbage-collected (manifest unknown), which makes `docker run` fail with exit 125 on any node that has not already cached the image.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

^{Reviewed by Cursor Bugbot for commit 549fb1b. Configure here.}

cursor · 2026-06-17T16:57:08Z

-  prefill_flags: "--max-num-batched-tokens 4K --tensor-parallel-size 8 --enable-expert-parallel --all2all-backend mori --no-enable-prefix-caching --gpu-memory-utilization 0.95 --block-size 32"
-  decode_flags: "--max-num-batched-tokens 4K --tensor-parallel-size 8 --enable-expert-parallel --all2all-backend mori --no-enable-prefix-caching --gpu-memory-utilization 0.95 --block-size 32"
+  prefill_flags: "--max-num-batched-tokens 4K --tensor-parallel-size 8 --enable-expert-parallel --all2all-backend mori_low_latency --no-enable-prefix-caching --gpu-memory-utilization 0.95 --block-size 32"
+  decode_flags: "--max-num-batched-tokens 4K --tensor-parallel-size 8 --enable-expert-parallel --all2all-backend mori_low_latency --no-enable-prefix-caching --gpu-memory-utilization 0.95 --block-size 32"


Backend rename breaks old vLLM

Medium Severity

Kimi-K2.5-MXFP4 and MiniMax-M2.5 decode/prefill flags now pass --all2all-backend mori_low_latency instead of mori, but MI355X vLLM disagg configs for those models still use older pinned ROCm images that may only register the legacy mori backend. Serve startup can fail with an invalid backend before the new M3 disagg path is reached.

^{Reviewed by Cursor Bugbot for commit 549fb1b. Configure here.}

…age to atom0.1.4 (#1717)

* [AMD] dsv4-fp4-mi355x-atom: enable DPA TBO at high concurrency, update image to atom0.1.4
  - Enable --enable-tbo for ISL=1024/OSL=1024 at CONC>=1024 and ISL=8192/OSL=1024 at CONC>=256
  - Update image to atom0.1.4_20260612
  - Update ISL=8192 search-space to start at conc=4 and use DPA from conc=128
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] perf-changelog: dsv4-fp4-mi355x-atom DPA TBO + image atom0.1.4 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] perf-changelog: add PR link #1717 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4_fp4_mi355x_atom.sh: disable prefix caching Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4-fp4-mi355x-atom: add max-model-len, eval context, extend conc range
  - Pass --max-model-len to server using SERVE_MAX_MODEL_LEN
  - Add EVAL_ONLY path: compute eval context length via compute_eval_context_length
  - Extend conc-end to 8192 (isl=1024) and 4096 (isl=8192) in amd-master.yaml
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4-fp4-mi355x-atom: narrow eval to single conc=1024 point, disable max-model-len Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4_fp4_mi355x_atom.sh: add cudagraph-capture-sizes and max-num-seqs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4-fp4-mi355x-atom: bump to nightly image, expand search space, enable max-model-len Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] set GPU_MAX_HW_QUEUES=5 in dsv4_fp4_mi355x_atom.sh Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4-fp4-mi355x-atom: disable TBO, add TP4 rows for isl=8192, cap conc ranges Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4_fp4_mi355x_atom.sh: quote SERVER_LOG variable Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4_fp4_mi355x_atom.sh: comment out dense cudagraph sizes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4_fp4_mi355x_atom.sh: fix --hf-overrides JSON escaping Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4_fp4_mi355x_atom.sh: comment out dense cudagraph sizes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4-fp4-mi355x-atom: expand search space, restore isl=1024 rows Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] perf-changelog: update dsv4-fp4-mi355x-atom image and search-space description Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4_fp4_mi355x_atom.sh: restore sparse cudagraph capture sizes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] perf-changelog: revert dsv4-fp4-mi355x-atom image/search-space, remove stale entries Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] perf-changelog: add dsv4-fp4-mi355x-sglang entry for PR #1762 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* update dsv4-fp4-mi355x-atom: bump image, enable TBO conditionally, fix mem frac Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* expand dsv4-fp4-mi355x-atom search space: restore ISL1024 scenarios, add TP4/TP8 conc lists for ISL8192 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Update perf-changelog.yaml
* Update perf-changelog.yaml
* Update perf-changelog.yaml
* Update perf-changelog.yaml
* update perf-changelog: move dsv4-fp4-mi355x-atom entry to end Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* narrow dsv4-fp4-mi355x-atom to DPA conc=256-2048 ISL8192, fix TBO branch override Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* restore full dsv4-fp4-mi355x-atom search space: ISL1024 + ISL8192 TP4/TP8/DPA Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Update perf-changelog.yaml Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Update perf-changelog.yaml Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: resolve PR 1717 changelog conflict

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com>

* [AMD] dsv4-fp4-mi355x-atom: enable DPA TBO at high concurrency, update image to atom0.1.4
  - Enable --enable-tbo for ISL=1024/OSL=1024 at CONC>=1024 and ISL=8192/OSL=1024 at CONC>=256
  - Update image to atom0.1.4_20260612
  - Update ISL=8192 search-space to start at conc=4 and use DPA from conc=128
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] perf-changelog: dsv4-fp4-mi355x-atom DPA TBO + image atom0.1.4 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] perf-changelog: add PR link #1717 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4_fp4_mi355x_atom.sh: disable prefix caching Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4-fp4-mi355x-atom: add max-model-len, eval context, extend conc range
  - Pass --max-model-len to server using SERVE_MAX_MODEL_LEN
  - Add EVAL_ONLY path: compute eval context length via compute_eval_context_length
  - Extend conc-end to 8192 (isl=1024) and 4096 (isl=8192) in amd-master.yaml
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4-fp4-mi355x-atom: narrow eval to single conc=1024 point, disable max-model-len Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4_fp4_mi355x_atom.sh: add cudagraph-capture-sizes and max-num-seqs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4-fp4-mi355x-atom: bump to nightly image, expand search space, enable max-model-len Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] set GPU_MAX_HW_QUEUES=5 in dsv4_fp4_mi355x_atom.sh Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4-fp4-mi355x-atom: disable TBO, add TP4 rows for isl=8192, cap conc ranges Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4_fp4_mi355x_atom.sh: quote SERVER_LOG variable Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4_fp4_mi355x_atom.sh: comment out dense cudagraph sizes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4_fp4_mi355x_atom.sh: fix --hf-overrides JSON escaping Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4_fp4_mi355x_atom.sh: comment out dense cudagraph sizes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4-fp4-mi355x-atom: expand search space, restore isl=1024 rows Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] perf-changelog: update dsv4-fp4-mi355x-atom image and search-space description Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] dsv4_fp4_mi355x_atom.sh: restore sparse cudagraph capture sizes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] perf-changelog: revert dsv4-fp4-mi355x-atom image/search-space, remove stale entries Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [AMD] perf-changelog: add dsv4-fp4-mi355x-sglang entry for PR #1762 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* update dsv4-fp4-mi355x-atom: bump image, enable TBO conditionally, fix mem frac Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* expand dsv4-fp4-mi355x-atom search space: restore ISL1024 scenarios, add TP4/TP8 conc lists for ISL8192 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Update perf-changelog.yaml
* Update perf-changelog.yaml
* Update perf-changelog.yaml
* Update perf-changelog.yaml
* update perf-changelog: move dsv4-fp4-mi355x-atom entry to end Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* narrow dsv4-fp4-mi355x-atom to DPA conc=256-2048 ISL8192, fix TBO branch override Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* restore full dsv4-fp4-mi355x-atom search space: ISL1024 + ISL8192 TP4/TP8/DPA Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: retrigger dsv4 atom benchmark sweep

---------

Co-authored-by: seungrokj <seungrok.jung@amd.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: seungrokj <144636725+seungrokj@users.noreply.github.com>

github-actions · 2026-06-18T02:14:51Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27705604170
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27705604170

functionstackx requested a review from a team June 14, 2026 23:12

functionstackx requested review from billishyahao and chunfangamd as code owners June 14, 2026 23:12

github-project-automation Bot added this to InferenceMAX Board Jun 14, 2026

functionstackx requested review from 1am9trash, seungrokj and yctseng0211 as code owners June 14, 2026 23:12

functionstackx added the sweep-enabled label Jun 14, 2026

functionstackx force-pushed the feat/minimax-m3-mi355-disagg branch from 8118fa3 to a4f66bd Compare June 15, 2026 01:45

functionstackx changed the title ~~[Klaud Cold] minimaxm3-fp8-mi355x-vllm-disagg: day-zero MoRI-IO disagg smoke test (1P TP8 + 1D TP8, conc 1)~~ [Klaud Cold][Experimental][DNM] minimaxm3-fp8-mi355x-vllm-disagg: day-zero MoRI-IO disagg smoke test (1P TP8 + 1D TP8, conc 1) Jun 15, 2026

cursor Bot reviewed Jun 15, 2026

Comment thread benchmarks/multi_node/amd_utils/job.slurm Outdated

functionstackx force-pushed the feat/minimax-m3-mi355-disagg branch from a4f66bd to 409561f Compare June 15, 2026 02:37

cursor Bot mentioned this pull request Jun 15, 2026

[AMD] dsv4-fp4-mi355x-sglang: switch fixed-seq-len search space to TP4 #1768

Merged

2 tasks

functionstackx added the non-canary-full-sweep-enabled label (run the full sweep without the canary gate: full search space, no trim) and removed the sweep-enabled label Jun 15, 2026

functionstackx force-pushed the feat/minimax-m3-mi355-disagg branch from 7b33cf1 to 01ed5b8 Compare June 15, 2026 05:25

cursor Bot reviewed Jun 15, 2026

cursor Bot mentioned this pull request Jun 16, 2026

[AMD] perf-changelog: duplicate dsv4-fp4-mi355x-sglang TP4 fixed-seq-len entry #1795

Merged

1 task

seungrokj added a commit that referenced this pull request Jun 16, 2026

[AMD] perf-changelog: add dsv4-fp4-mi355x-sglang entry for PR #1762

a4828cb

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

functionstackx mentioned this pull request Jun 17, 2026

[Bug]: ROCm MiniMax M3 MXFP8 Disagg not working vllm-project/vllm#45885

Open

1 task

functionstackx and others added 9 commits June 17, 2026 12:55

perf-changelog: fill in PR link for minimaxm3-fp8-mi355x-vllm-disagg

cb783d8

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

disagg #1762: sweep conc 1,2,4,8,16 (not just conc 1)

c17db5d

Update the vLLM external router container

549fb1b

functionstackx force-pushed the feat/minimax-m3-mi355-disagg branch from 0af6316 to 549fb1b Compare June 17, 2026 16:55

cursor Bot reviewed Jun 17, 2026
