chore(runners): add TensorWave MI300X docker runners (mi300x-tw) #1793
cquil11 wants to merge 1 commit into
Hand-off of two TensorWave MI300X nodes (tw018, tw032). Unlike the
amds/325x/355x AMD fleets, these are standalone docker nodes, not Slurm,
so they need a `docker run`-based launch script rather than salloc+enroot.
- runners/launch_mi300x-tw.sh: ROCm docker launcher modeled on the
recovered AMD docker templates and launch_h100-cr.sh. Passes the GPUs via
/dev/kfd,/dev/dri,/dev/mem; auto-uses `sudo docker` (runner user isn't in
the docker group but has passwordless sudo); node-local HF cache under
$HOME; full env passthrough (the slurm launchers get this via
`srun --export=ALL`); dispatches to
benchmarks/single_node/${SCENARIO_SUBDIR}<model>_<prec>_mi300x[_mtp].sh.
- .github/configs/runners.yaml: register mi300x-tw_00 (tw018) and
mi300x-tw_01 (tw032) in the mi300x group and a dedicated mi300x-tw
subfleet, mirroring the b200-dgxc/h200-dgxc dual-listing pattern.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
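A minimal sketch of how the `sudo docker` fallback and the GPU device flags described above might be assembled. `pick_docker` and `GPU_FLAGS` are illustrative names, not taken from `launch_mi300x-tw.sh`, and the flag spellings are assumed from the standard docker CLI rather than copied from the script:

```shell
#!/usr/bin/env bash
# Sketch only: pick_docker and GPU_FLAGS are hypothetical names.

# Print the docker invocation to use: plain `docker` when the current user can
# reach the daemon, otherwise `sudo docker` (assumes passwordless sudo).
pick_docker() {
  if docker ps >/dev/null 2>&1; then
    echo "docker"
  else
    echo "sudo docker"
  fi
}

# Device and security flags listed in the commit message and PR description.
GPU_FLAGS=(
  --device=/dev/kfd
  --device=/dev/dri
  --device=/dev/mem
  --privileged
  --cap-add=SYS_PTRACE
  --security-opt seccomp=unconfined
)
```

Probing with `docker ps` means the launcher would switch to plain `docker` on its own if the runner user is later added to the `docker` group.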
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 02869cd.
    "${ENV_FLAGS[@]}" \
    --entrypoint=/bin/bash \
    "$IMAGE" \
    benchmarks/single_node/${SCENARIO_SUBDIR}"${EXP_NAME%%_*}_${PRECISION}_mi300x${SPEC_SUFFIX}.sh"
Pre-run cleanup skips sudo docker
Medium Severity
The launcher falls back to sudo docker when the runner user cannot run docker ps, but it never removes a leftover bmk-server container before docker run. The benchmark workflow’s pre-run Docker cleanup only invokes non-sudo docker, so on these nodes that step is skipped and a stale fixed-name container can make the next job fail immediately.
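One possible way to close this gap, sketched under assumptions: the launcher removes the fixed-name container itself, using the same docker invocation it selected for `docker run`. `remove_stale_container` and `DOCKER_CMD` are hypothetical names; only the container name `bmk-server` comes from the report above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove a leftover fixed-name container before `docker run`.
# DOCKER_CMD is assumed to hold whichever invocation the launcher selected,
# e.g. (docker) or (sudo docker).
remove_stale_container() {
  local name="$1"
  # `rm -f` force-removes a running or stopped container. The `|| true` keeps
  # the launcher going when no such container exists.
  "${DOCKER_CMD[@]}" rm -f "$name" >/dev/null 2>&1 || true
}
```

In the launcher this would run just before `docker run`, for example `DOCKER_CMD=(sudo docker)` followed by `remove_stale_container bmk-server`.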
    RESULT_DIR RESULT_FILENAME RUNNER_TYPE RUN_EVAL EVAL_ONLY
    EVAL_CONTEXT_ARGS EVAL_MAX_MODEL_LEN
    PROFILE SGLANG_TORCH_PROFILER_DIR VLLM_TORCH_PROFILER_DIR VLLM_RPC_TIMEOUT
    )
MODEL_PREFIX not passed to container
Low Severity
The explicit PASS_ENV list omits MODEL_PREFIX even though the benchmark workflow sets it on the runner host. Slurm MI300X jobs inherit it via --export=ALL, but this docker launcher only forwards named variables, so benchmark_lib.sh metadata will record infmax_model_prefix as unknown for TensorWave runs.
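A rough sketch of how a named-variable passthrough of this kind could work, with `MODEL_PREFIX` added to the list. `PASS_ENV` and `ENV_FLAGS` are the names visible in the diff; `build_env_flags` is a hypothetical helper, and the list below is a shortened subset for illustration, not the script's actual list.

```shell
#!/usr/bin/env bash
# Illustrative subset of the forwarded variables, with MODEL_PREFIX included.
PASS_ENV=(MODEL_PREFIX RESULT_DIR RESULT_FILENAME RUNNER_TYPE)

# Hypothetical helper: fill ENV_FLAGS with `-e NAME=value` pairs for every
# variable in PASS_ENV that is set on the host.
build_env_flags() {
  ENV_FLAGS=()
  local var
  for var in "${PASS_ENV[@]}"; do
    # ${!var+x} expands to "x" only when the variable named by $var is set.
    if [ -n "${!var+x}" ]; then
      # Pass the value explicitly, so it does not depend on the docker
      # client's environment (sudo may reset it).
      ENV_FLAGS+=(-e "$var=${!var}")
    fi
  done
}
```

Unset variables are skipped rather than forwarded as empty strings, so scripts inside the container can still apply their own defaults.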


What
Onboards the two TensorWave MI300X nodes that were just handed off (tw018 / 64.139.222.218 → `mi300x-tw_00`, tw032 / 64.139.222.212 → `mi300x-tw_01`) as self-hosted GitHub Actions runners, following `utils/runner_setup/RUNNER_SETUP.md`.

These are docker nodes, not Slurm, which makes them different from every current AMD fleet (amds/325x/355x), which run `salloc` + `enroot` squash on a Slurm worker. So they need a `docker run`-based launch script, in the spirit of the docker nodes we used to run (the deleted `launch_mi300x-amd.sh` / `launch_mi300x-cr.sh`, and the still-present NVIDIA `launch_h100-cr.sh`).

Changes
- `runners/launch_mi300x-tw.sh` (new), a ROCm `docker run` launcher:
  - Passes the GPUs via `--device=/dev/kfd --device=/dev/dri --device=/dev/mem` + `--privileged` / `SYS_PTRACE` / `seccomp=unconfined` (standard AMD docker flags).
  - Auto-uses `sudo docker`: the runner user (`cam`) isn't in the `docker` group but has passwordless sudo (verified on both nodes). Falls back to plain `docker` if a future setup adds the group.
  - Node-local HF cache under `$HOME/hf_hub_cache/`: these nodes have a single ~14T local root and no shared/NFS mount, so weights download on first use (stage them ahead of time for big models).
  - Adjusts `numa_balancing` (AMD throughput tuning), like the old AMD docker launchers.
  - Full env passthrough (the `--export=ALL` equivalent) covering everything the `*_mi300x.sh` scripts + `benchmark_lib.sh` read.
  - Dispatches to `benchmarks/single_node/${SCENARIO_SUBDIR}<model>_<prec>_mi300x[_mtp].sh` (SCENARIO_SUBDIR + SPEC_SUFFIX, parity with the h100/b200 launchers).
- `.github/configs/runners.yaml`: register `mi300x-tw_00` (tw018) and `mi300x-tw_01` (tw032) in the `mi300x` group, plus a dedicated `mi300x-tw` subfleet (mirrors the `b200-dgxc` / `h200-dgxc` dual-listing).

Validation
- `bash -n` on the launch script ✅
- `mi300x` and `mi300x-tw` runner groups ✅
- Launcher lookup (`mi300x-tw_NN` → `launch_mi300x-tw.sh` exists) ✅
- `utils/matrix_logic/test_generate_sweep_configs.py`: 91 passed ✅
- `sudo docker` works ✅

On-node registration: DONE ✅
Runners are registered and live (runner v2.335.1), one per node, started under `tmux` (session `github-actions`) via `start_runners.sh`.

Notes / follow-ups:
- Runners run as `./run.sh` in tmux (not systemd), so they don't survive a reboot; re-run `start_runners.sh` after node maintenance (standard for this fleet).
- Add `cam` to the `docker` group on both nodes to drop the `sudo` in the launcher.
- Weights download to `~/hf_hub_cache` on first run unless pre-staged.

🤖 Generated with Claude Code
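For reference, the dispatch path in the description can be reproduced with plain parameter expansion. This is a sketch, not the launcher's code: `benchmark_script` is a hypothetical function, and the experiment name `dsr1_1k1k` used below is an invented example value.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compose the benchmark script path from launcher inputs.
# Arguments: EXP_NAME, PRECISION, optional SCENARIO_SUBDIR, optional SPEC_SUFFIX.
benchmark_script() {
  local exp_name="$1" precision="$2" scenario_subdir="${3:-}" spec_suffix="${4:-}"
  # ${exp_name%%_*} removes everything from the first underscore onward,
  # leaving only the leading model token.
  echo "benchmarks/single_node/${scenario_subdir}${exp_name%%_*}_${precision}_mi300x${spec_suffix}.sh"
}
```

With these assumed inputs, `benchmark_script dsr1_1k1k fp8` prints `benchmarks/single_node/dsr1_fp8_mi300x.sh`, and passing `_mtp` as the fourth argument yields the `_mi300x_mtp.sh` variant.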