Fix for https://github.com/sgl-project/sglang/issues/22072#1806
Fix for https://github.com/sgl-project/sglang/issues/22072#1806davzhuAMD wants to merge 2 commits into
Conversation
| if [[ -n "${_ibdev_sorted}" ]]; then | ||
| echo "[ibdev-auto] $(hostname -s): IBDEVICES (orig) = ${_ibdev_orig}" | ||
| echo "[ibdev-auto] $(hostname -s): IBDEVICES (sorted) = ${_ibdev_sorted}" | ||
| IBDEVICES="${_ibdev_sorted}" |
There was a problem hiding this comment.
IB sort drops device list
Medium Severity
When RoCE GID sorting finds any matching bnxt_re device, it replaces IBDEVICES with only sysfs devices whose GID index 3 has a non-zero subnet, ignoring the launcher’s original list. Active HCAs without that GID, or non-bnxt_re names, are dropped, which can under-provision MoRI/NCCL and cause init hangs.
Reviewed by Cursor Bugbot for commit 7821a3a. Configure here.
… mentioned in #22072.
| PREFILL_IP="$(resolve_ip "${PREFILL_SSH}")" | ||
| DECODE1_IP="$(resolve_ip "${DECODE_SSH_1}")" | ||
| DECODE2_IP="$(resolve_ip "${DECODE_SSH_2}")" | ||
| fi |
There was a problem hiding this comment.
Partial IP env vars overwritten
Medium Severity
When only one of PREFILL_IP, DECODE_IP (or DECODE1_IP / DECODE2_IP in the 1P+2D launcher) is unset, the script still re-resolves every address and overwrites any IPs the operator already exported.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit c8a0e03. Configure here.
| PREFILL_EP="${PREFILL_EP:-1}" | ||
| PREFILL_DP_ATTN="${PREFILL_DP_ATTN:-false}" | ||
| DECODE_TP="${DECODE_TP:-8}" | ||
| DECODE_EP="${DECODE_EP:-1}" |
There was a problem hiding this comment.
TP16 launcher wrong default TP
High Severity
This script always starts three nodes for a two-node TP16 decode worker, but DECODE_TP defaults to 8, so DECODE_NODES_PER_WORKER stays 1 unless the caller overrides it—rank 2 is not part of the intended TP16 worker layout.
Reviewed by Cursor Bugbot for commit c8a0e03. Configure here.
| export REBUILD_LIBBNXT_IN_CONTAINER=1 | ||
| export PATH_TO_BNXT_TAR_PACKAGE=/workspace/driver/libbnxt_re-237.1.137.0.tar.gz | ||
|
|
||
| export PREFILL_NODE="45.63.71.103" |
There was a problem hiding this comment.
The start_test_*.sh scripts hardcode many cluster-specific values (real node IPs like 45.63.71.103 / 137.220.60.12 and Docker image tags) directly in a public repo. Maybe we could parameterize these (e.g. via env vars) instead of committing concrete cluster details.
There was a problem hiding this comment.
Thanks for your comments. These test scripts themselves (including the new start_test_, run_1p, and start_sglang*) don't necessarily need to make it into the final repo, but I added them to illustrate how we investigated and reproduced the original issue on our side. If it makes things clearer, I can remove the helper scripts from the commit and attach them somewhere else instead, leaving just the actual fixes in this PR.
| # IPADDRS order: prefill_ip,decode1_ip,decode2_ip. NNODES=3, xP=1, yD=1. | ||
| # | ||
| # Prerequisites: | ||
| # - Passwordless SSH from this machine to both nodes |
There was a problem hiding this comment.
Should this say 3 nodes? This script launches 1 prefill + 2 decode nodes (NNODES=3), but the comments/header still say "both nodes", which were copy-pasted from the single-decode launcher and read as if there are only 2 nodes total.
…er concurrencies. - Increase the decode dispatch buffer size from 512 to 4096 and correctly pass it along for the DP and EP cases, avoiding a stall under load. - Correctly set MORI parameters when prioritizing traffic as lossless, which otherwise interferes with the data transfers that the tests generate.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 4 total unresolved issues (including 3 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 2e4cce0. Configure here.
| # Override for a cluster with a different GID layout / PFC class. | ||
| export NCCL_IB_GID_INDEX="${NCCL_IB_GID_INDEX:-3}" | ||
| export NCCL_IB_TC="${NCCL_IB_TC:-96}" | ||
| export NCCL_IB_SL="${NCCL_IB_SL:-3}" |
There was a problem hiding this comment.
NCCL QoS not aligned MoRI
Medium Severity
New defaults pin NCCL_IB_TC/NCCL_IB_SL to 96/3 for all runs, while MoRI traffic class still comes from hostname or MORI_RDMA_TC (e.g. 104 on mia1* hosts). NCCL and MoRI can then use different lossless QoS classes on the same bnxt fabric, reproducing stalled ncclCommInitRank or RDMA errors.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 2e4cce0. Configure here.


Fix some SGLang MORI disagg tests hanging on MI325X.
Uses derived helper scripts from https://github.com/JohnQinAMD/InferenceX/tree/sglang-pd-mi300-mi325x to aid investigation and repro.
To validate, update the PREFILL_MODEL_HOST_DIR, DECODE_MODEL_HOST_DIR, PREFILL_NODE, and DECODE_NODE IPs per test in these scripts, then run
Thanks to Akash Dhaka who investigated this issue with me.
Note
Medium Risk
Changes multi-node RDMA device ordering, NCCL/MoRI QoS defaults, and decode concurrency heuristics—cluster-specific mis-tuning could still cause init stalls or memory pressure; new launch scripts embed example node IPs.
Overview
Addresses SGLang MoRI PD disaggregation hangs on MI300X/MI325X (Broadcom
bnxt_reRoCE) by tuning RDMA/NCCL/MoRI env defaults, fixing decode DP+EP sizing, and adding SSH + Docker orchestration with repro scripts.Runtime / tuning fixes in
env.shandserver_sglang.sh: default NCCL RoCE settings (NCCL_IB_GID_INDEX,NCCL_IB_TC,NCCL_IB_SL); raiseMORI_MAX_DISPATCH_TOKENS_DECODEto 4096; derive MoRI RDMA SL/IO TC+SL from TC so DSCP/SL match lossless PFC; stop shrinking decode dispatch tokens when DP+EP are on—usemax_running_requests = conc × dp_ranks(floored atdp_ranks) while keeping env dispatch/MoE limits.DeepSeek-R1-0528prefillmem_fraction_static→ 0.9.New cluster helpers:
detect_ibdevices_bnxt.sh;rebuild_bnxt.sh+ container hook for in-image libbnxt rebuild;_disagg_ssh_remote_inner.shsortsIBDEVICESby RoCEv2 GID subnet per node;_disagg_container_entry.sh/install_mori_in_container.shfor optional MoRI rebuild. Root launchersrun_1p1d_*,run_1p2d_*,run_1p1d_tp16_*,start_test_1–5, docsSGLANG_PD_MI300_MI325X.md.Reviewed by Cursor Bugbot for commit 2e4cce0. Bugbot is set up for automated code reviews on this repo. Configure here.