Skip to content

Fix for https://github.com/sgl-project/sglang/issues/22072#1806

Open
davzhuAMD wants to merge 2 commits into
SemiAnalysisAI:mainfrom
davzhuAMD:main
Open

Fix for https://github.com/sgl-project/sglang/issues/22072#1806
davzhuAMD wants to merge 2 commits into
SemiAnalysisAI:mainfrom
davzhuAMD:main

Conversation

@davzhuAMD

@davzhuAMD davzhuAMD commented Jun 16, 2026

Copy link
Copy Markdown

Fix some SGLang MORI disagg tests hanging on MI325X.

Uses derived helper scripts from https://github.com/JohnQinAMD/InferenceX/tree/sglang-pd-mi300-mi325x to aid investigation and repro.

To validate, update the PREFILL_MODEL_HOST_DIR, DECODE_MODEL_HOST_DIR, PREFILL_NODE, and DECODE_NODE IPs per test in these scripts, then run

  • start_test_1.sh
  • start_test_2.sh
  • start_test_3.sh
  • start_test_4.sh
  • start_test_5.sh

Thanks to Akash Dhaka who investigated this issue with me.


Note

Medium Risk
Changes multi-node RDMA device ordering, NCCL/MoRI QoS defaults, and decode concurrency heuristics—cluster-specific mis-tuning could still cause init stalls or memory pressure; new launch scripts embed example node IPs.

Overview
Addresses SGLang MoRI PD disaggregation hangs on MI300X/MI325X (Broadcom bnxt_re RoCE) by tuning RDMA/NCCL/MoRI env defaults, fixing decode DP+EP sizing, and adding SSH + Docker orchestration with repro scripts.

Runtime / tuning fixes in env.sh and server_sglang.sh: default NCCL RoCE settings (NCCL_IB_GID_INDEX, NCCL_IB_TC, NCCL_IB_SL); raise MORI_MAX_DISPATCH_TOKENS_DECODE to 4096; derive MoRI RDMA SL/IO TC+SL from TC so DSCP/SL match lossless PFC; stop shrinking decode dispatch tokens when DP+EP are on—use max_running_requests = conc × dp_ranks (floored at dp_ranks) while keeping env dispatch/MoE limits. DeepSeek-R1-0528 prefill mem_fraction_static → 0.9.

New cluster helpers: detect_ibdevices_bnxt.sh; rebuild_bnxt.sh + container hook for in-image libbnxt rebuild; _disagg_ssh_remote_inner.sh sorts IBDEVICES by RoCEv2 GID subnet per node; _disagg_container_entry.sh / install_mori_in_container.sh for optional MoRI rebuild. Root launchers run_1p1d_*, run_1p2d_*, run_1p1d_tp16_*, start_test_15, docs SGLANG_PD_MI300_MI325X.md.

Reviewed by Cursor Bugbot for commit 2e4cce0. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread utils/find_reusable_sweep_run.py
Comment thread benchmarks/multi_node/amd_utils/server.sh
Comment thread run_1p1d_tp16_sglang_mi300_mi325x.sh
Comment thread runners/launch_gb300-nv.sh
if [[ -n "${_ibdev_sorted}" ]]; then
echo "[ibdev-auto] $(hostname -s): IBDEVICES (orig) = ${_ibdev_orig}"
echo "[ibdev-auto] $(hostname -s): IBDEVICES (sorted) = ${_ibdev_sorted}"
IBDEVICES="${_ibdev_sorted}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IB sort drops device list

Medium Severity

When RoCE GID sorting finds any matching bnxt_re device, it replaces IBDEVICES with only sysfs devices whose GID index 3 has a non-zero subnet, ignoring the launcher’s original list. Active HCAs without that GID, or non-bnxt_re names, are dropped, which can under-provision MoRI/NCCL and cause init hangs.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 7821a3a. Configure here.

PREFILL_IP="$(resolve_ip "${PREFILL_SSH}")"
DECODE1_IP="$(resolve_ip "${DECODE_SSH_1}")"
DECODE2_IP="$(resolve_ip "${DECODE_SSH_2}")"
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial IP env vars overwritten

Medium Severity

When only one of PREFILL_IP, DECODE_IP (or DECODE1_IP / DECODE2_IP in the 1P+2D launcher) is unset, the script still re-resolves every address and overwrites any IPs the operator already exported.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit c8a0e03. Configure here.

PREFILL_EP="${PREFILL_EP:-1}"
PREFILL_DP_ATTN="${PREFILL_DP_ATTN:-false}"
DECODE_TP="${DECODE_TP:-8}"
DECODE_EP="${DECODE_EP:-1}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TP16 launcher wrong default TP

High Severity

This script always starts three nodes for a two-node TP16 decode worker, but DECODE_TP defaults to 8, so DECODE_NODES_PER_WORKER stays 1 unless the caller overrides it—rank 2 is not part of the intended TP16 worker layout.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit c8a0e03. Configure here.

Comment thread start_test_1.sh
export REBUILD_LIBBNXT_IN_CONTAINER=1
export PATH_TO_BNXT_TAR_PACKAGE=/workspace/driver/libbnxt_re-237.1.137.0.tar.gz

export PREFILL_NODE="45.63.71.103"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The start_test_*.sh scripts hardcode many cluster-specific values (real node IPs like 45.63.71.103 / 137.220.60.12 and Docker image tags) directly in a public repo. Maybe we could parameterize these (e.g. via env vars) instead of committing concrete cluster details.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments. These test scripts themselves (including the new start_test_, run_1p, and start_sglang*) don't necessarily need to make it into the final repo, but I added them to illustrate how we investigated and reproduced the original issue on our side. If it makes things clearer, I can remove the helper scripts from the commit and attach them somewhere else instead, leaving just the actual fixes in this PR.

# IPADDRS order: prefill_ip,decode1_ip,decode2_ip. NNODES=3, xP=1, yD=1.
#
# Prerequisites:
# - Passwordless SSH from this machine to both nodes

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this say 3 nodes? This script launches 1 prefill + 2 decode nodes (NNODES=3), but the comments/header still say "both nodes", which were copy-pasted from the single-decode launcher and read as if there are only 2 nodes total.

…er concurrencies.

- Increase the decode dispatch buffer size from 512 to 4096 and correctly pass it along for the DP and EP cases, avoiding a stall under load.
- Correctly set MORI parameters when prioritizing traffic as lossless, which otherwise interferes with the data transfers that the tests generate.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 2e4cce0. Configure here.

# Override for a cluster with a different GID layout / PFC class.
export NCCL_IB_GID_INDEX="${NCCL_IB_GID_INDEX:-3}"
export NCCL_IB_TC="${NCCL_IB_TC:-96}"
export NCCL_IB_SL="${NCCL_IB_SL:-3}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NCCL QoS not aligned MoRI

Medium Severity

New defaults pin NCCL_IB_TC/NCCL_IB_SL to 96/3 for all runs, while MoRI traffic class still comes from hostname or MORI_RDMA_TC (e.g. 104 on mia1* hosts). NCCL and MoRI can then use different lossless QoS classes on the same bnxt fabric, reproducing stalled ncclCommInitRank or RDMA errors.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 2e4cce0. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

2 participants