
Add Qwen3.5-FP8 GB200 SGLang disaggregated benchmark #1810

Open
RohitNagraj wants to merge 4 commits into main from qwen3.5-fp8-gb200-dynamo-sglang

RohitNagraj commented Jun 16, 2026

Collaborator

Adds qwen3.5-fp8-gb200-dynamo-sglang: Qwen3.5-397B-A17B-FP8 disaggregated SGLang-via-Dynamo on GB200.

  • 6 topologies across 1k/1k and 8k/1k: 1P1D TP4 STP plus wide-EP (DEP4 prefill / DEP16 decode), from 1P1D up to 8P1D
  • Recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/qwen3.5/gb200-fp8/
  • Image: lmsysorg/sglang:nightly-dev-cu13-20260608-303757cc
  • Adds the qwen3.5/fp8 model-path branch to launch_gb200-nv.sh

Note

Low Risk
Benchmark and CI launch configuration only; no application runtime or auth/data-path changes.

Overview
Introduces qwen3.5-fp8-gb200-dynamo-sglang in nvidia-master.yaml for Qwen/Qwen3.5-397B-A17B-FP8 on GB200 with disaggregated multinode SGLang via Dynamo, covering 1k/1k and 8k/1k fixed-seq scenarios.

The search space adds six topologies: 1P1D TP4 (STP) and wide-EP layouts (DEP4 prefill / DEP16 decode), scaling from 1P1D through 2P1D, 4P1D, and 8P1D, each wired to a CONFIG_FILE under the new recipe tree.
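
To make the scaling concrete, here is a rough GPU-count sketch for the layouts named above. It assumes that TP4, DEP4, and DEP16 each denote the number of GPUs per worker, that the STP layout uses TP4 on both prefill and decode, and that a GB200 node exposes 4 GPUs. None of these is stated in the PR, and the `Topology` class and helper names are illustrative only.

```python
from dataclasses import dataclass

# Assumption: 4 GPUs per GB200 node; adjust for the actual cluster.
GPUS_PER_NODE = 4


@dataclass(frozen=True)
class Topology:
    """Hypothetical description of one xPyD layout (not a type from the repo)."""
    name: str
    prefill_workers: int
    prefill_gpus_per_worker: int
    decode_workers: int
    decode_gpus_per_worker: int

    @property
    def total_gpus(self) -> int:
        return (self.prefill_workers * self.prefill_gpus_per_worker
                + self.decode_workers * self.decode_gpus_per_worker)

    @property
    def total_nodes(self) -> int:
        # Ceiling division so a partially used node still counts as one node.
        return -(-self.total_gpus // GPUS_PER_NODE)


def wide_ep(prefill_workers: int) -> Topology:
    # Assumption: DEP4 prefill = 4 GPUs per prefill worker,
    # DEP16 decode = 16 GPUs for the single decode worker.
    return Topology(f"{prefill_workers}P1D wide-EP", prefill_workers, 4, 1, 16)


# Assumption: the STP layout runs TP4 for both its prefill and decode worker.
TOPOLOGIES = [Topology("1P1D TP4", 1, 4, 1, 4)] + [wide_ep(p) for p in (1, 2, 4, 8)]


def summarize(topologies):
    """Map each layout name to (total GPUs, total nodes)."""
    return {t.name: (t.total_gpus, t.total_nodes) for t in topologies}


if __name__ == "__main__":
    for name, (gpus, nodes) in summarize(TOPOLOGIES).items():
        print(f"{name}: {gpus} GPUs, {nodes} nodes")
```

Under these assumptions the 8P1D wide-EP layout works out to 8 × 4 + 16 = 48 GPUs, which is useful as a sanity check against the Slurm node counts in the recipes.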

Adds six Slurm recipe YAMLs under benchmarks/multi_node/srt-slurm-recipes/sglang/qwen3.5/gb200-fp8/ (Dynamo frontend, disagg prefill/decode, sa-bench concurrencies). launch_gb200-nv.sh maps qwen3.5 + fp8 to Lustre weights and overlays those recipes into srt-slurm like other dynamo-sglang models. perf-changelog.yaml documents the new config key.

Reviewed by Cursor Bugbot for commit de00324. Bugbot is set up for automated code reviews on this repo.

Qwen3.5-397B-A17B-FP8 GB200 disaggregated SGLang-via-Dynamo, 6 topologies
across 1k/1k and 8k/1k (1P1D TP4 STP plus wide-EP DEP4 prefill / DEP16
decode from 1P1D up to 8P1D). Adds the recipe set, the nvidia-master entry,
the gb200 launch-script model-path and recipe-copy branches, and the
perf-changelog entry.

cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


osl: 1024
req_rate: "inf"
random_range_ratio: 0.8
concurrencies: "2048x4096"


4096 concurrency exceeds decode cap

Medium Severity

The 8P1D recipe and the master config both sweep concurrency 4096, but the decode worker's max-running-requests and max-mamba-cache-size stay at 2048. By contrast, the 2P1D recipe sets both to 4096 when it benchmarks at 4096. With the 8P1D caps left at 2048, the 4096 sweep point cannot reach the intended load.
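
A check along these lines could catch this kind of mismatch before a run. This is only a sketch: it assumes the `concurrencies` string is an `x`-separated list of sweep points, and the function names and arguments are hypothetical rather than anything defined in the repo or in sa-bench.

```python
def parse_concurrencies(spec: str) -> list[int]:
    """Split a sweep string such as "2048x4096" into integer sweep points.

    Assumption: 'x' is the separator between sweep points.
    """
    return [int(part) for part in spec.split("x") if part]


def points_over_cap(spec: str,
                    max_running_requests: int,
                    max_mamba_cache_size: int) -> list[int]:
    """Return the sweep points that exceed the tighter of the two decode caps."""
    cap = min(max_running_requests, max_mamba_cache_size)
    return [c for c in parse_concurrencies(spec) if c > cap]


if __name__ == "__main__":
    # Values taken from the review comment: caps at 2048, sweep up to 4096.
    print(points_over_cap("2048x4096", 2048, 2048))
    # Caps raised to 4096, as the 2P1D recipe reportedly does.
    print(points_over_cap("2048x4096", 4096, 4096))
```

Taking the minimum of the two caps is a deliberate simplification; whether both limits bind the same way on the decode worker would need to be confirmed against SGLang's behavior.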

Additional Locations (1)

