chore(runners): add TensorWave MI300X docker runners (mi300x-tw) by cquil11 · Pull Request #1793 · SemiAnalysisAI/InferenceX

cquil11 · 2026-06-16T00:23:49Z

What

Onboards the two TensorWave MI300X nodes that were just handed off (tw018 / 64.139.222.218 → mi300x-tw_00, tw032 / 64.139.222.212 → mi300x-tw_01) as self-hosted GitHub Actions runners, following utils/runner_setup/RUNNER_SETUP.md.

These are docker nodes, not Slurm, unlike every current AMD fleet (amds/325x/355x), each of which runs salloc + enroot squash on a Slurm worker. They therefore need a docker run-based launch script, in the spirit of the launchers for the docker nodes we used to run (the deleted launch_mi300x-amd.sh / launch_mi300x-cr.sh, and the still-present NVIDIA launch_h100-cr.sh).

Changes

runners/launch_mi300x-tw.sh (new) — ROCm docker run launcher:
- GPUs via --device=/dev/kfd --device=/dev/dri --device=/dev/mem + --privileged/SYS_PTRACE/seccomp=unconfined (standard AMD docker flags).
- Auto-selects sudo docker: the runner user (cam) isn't in the docker group but has passwordless sudo (verified on both nodes). Falls back to plain docker if a future setup adds the group.
- Node-local HF cache under $HOME/hf_hub_cache/ — these nodes have a single ~14T local root and no shared/NFS mount, so weights download on first use (stage them ahead of time for big models).
- Disables numa_balancing (AMD throughput tuning), like the old AMD docker launchers.
- Explicit env passthrough array (docker has no --export=ALL equivalent) covering everything the *_mi300x.sh scripts + benchmark_lib.sh read.
- Dispatches to benchmarks/single_node/${SCENARIO_SUBDIR}<model>_<prec>_mi300x[_mtp].sh (SCENARIO_SUBDIR + SPEC_SUFFIX, parity with the h100/b200 launchers).
.github/configs/runners.yaml — register mi300x-tw_00 (tw018) and mi300x-tw_01 (tw032) in the mi300x group, plus a dedicated mi300x-tw subfleet (mirrors the b200-dgxc/h200-dgxc dual-listing).
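
A minimal sketch of three of the launcher mechanics described above: choosing between docker and sudo docker, building the explicit env passthrough, and resolving the benchmark script path. The helper names (pick_docker, build_env_flags, bench_script) are hypothetical and not taken from runners/launch_mi300x-tw.sh, and the example values for EXP_NAME and PRECISION are placeholders. Writing each variable as NAME=value rather than a bare name is an assumption made here because sudo may reset the environment before docker reads it.

```shell
#!/usr/bin/env bash
# Sketch only. pick_docker, build_env_flags and bench_script are hypothetical
# helper names, not functions from runners/launch_mi300x-tw.sh.

# Print "docker" if the current user can reach the daemon, else "sudo docker".
pick_docker() {
    if docker ps >/dev/null 2>&1; then
        echo "docker"
    else
        echo "sudo docker"
    fi
}

# Fill ENV_FLAGS with "-e NAME=value" for every named variable that is set.
# The value is spelled out because sudo may scrub the environment before
# docker sees it, in which case a bare "-e NAME" would forward nothing.
build_env_flags() {
    ENV_FLAGS=()
    local name
    for name in "$@"; do
        if [[ -n "${!name+x}" ]]; then
            ENV_FLAGS+=(-e "${name}=${!name}")
        fi
    done
}

# Resolve the benchmark script path from the launcher's inputs.
# ${EXP_NAME%%_*} keeps everything before the first underscore.
bench_script() {
    echo "benchmarks/single_node/${SCENARIO_SUBDIR:-}${EXP_NAME%%_*}_${PRECISION}_mi300x${SPEC_SUFFIX:-}.sh"
}
```

With the placeholder values EXP_NAME=dsr1_1k1k, PRECISION=fp8 and SPEC_SUFFIX=_mtp, bench_script prints benchmarks/single_node/dsr1_fp8_mi300x_mtp.sh. A launcher could then run something like `$(pick_docker) run ... "${ENV_FLAGS[@]}" ...`, keeping in mind that NAME=value flags are visible in the host process list.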

Validation

bash -n on the launch script ✅
YAML parses; both nodes present in mi300x and mi300x-tw ✅
Launch-script-name invariant holds (mi300x-tw_NN → launch_mi300x-tw.sh exists) ✅
utils/matrix_logic/test_generate_sweep_configs.py — 91 passed ✅
Both nodes: confirmed gfx942 (MI300X), 192 CPUs, Docker 29.1.3, passwordless sudo, sudo docker works ✅
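
The launch-script-name invariant above can be sketched as a small check. This is an illustration under assumptions, not the repository's actual test: launch_script_for and check_launcher are hypothetical names, and the rule "drop the trailing _NN index and prefix with launch_" is inferred from the single mapping stated in this description.

```shell
#!/usr/bin/env bash
# Sketch only: launch_script_for and check_launcher are hypothetical helpers.

# Map a runner name such as "mi300x-tw_00" to "launch_mi300x-tw.sh" by
# dropping the trailing "_NN" index.
launch_script_for() {
    echo "launch_${1%_*}.sh"
}

# Return 0 if the launcher for runner $1 exists in directory $2.
check_launcher() {
    [ -f "$2/$(launch_script_for "$1")" ]
}
```

For example, `check_launcher mi300x-tw_01 runners` would succeed only if runners/launch_mi300x-tw.sh is present.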

On-node registration — DONE ✅

Runners are registered and live (runner v2.335.1), one per node, started under tmux (session github-actions) via start_runners.sh:

mi300x-tw_00  status=online  busy=false  labels=[self-hosted,Linux,X64,mi300x,mi300x-tw,mi300x-tw_00]   # tw018
mi300x-tw_01  status=online  busy=false  labels=[self-hosted,Linux,X64,mi300x,mi300x-tw,mi300x-tw_01]   # tw032

Notes / follow-ups:

Runners run via ./run.sh in tmux (not systemd) — they don't survive a reboot; re-run start_runners.sh after node maintenance (standard for this fleet).
Optionally add cam to the docker group on both nodes to drop the sudo in the launcher.
Big-model weights will download to ~/hf_hub_cache on first run unless pre-staged.
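
One possible way to pre-stage weights, sketched under assumptions: prestage_model and the HF_CACHE_DIR override are hypothetical, the repo id is a placeholder, and whether the launcher expects exactly the layout that `huggingface-cli download --cache-dir` produces under ~/hf_hub_cache has not been verified here.

```shell
#!/usr/bin/env bash
# Sketch only: prestage_model and HF_CACHE_DIR are hypothetical, and the cache
# layout the launcher expects under ~/hf_hub_cache is an assumption.
# Usage: prestage_model REPO_ID   (set DRY_RUN=1 to print the command instead)
prestage_model() {
    local repo="$1"
    local cache="${HF_CACHE_DIR:-$HOME/hf_hub_cache}"
    local cmd=(huggingface-cli download "$repo" --cache-dir "$cache")
    if [[ "${DRY_RUN:-0}" == "1" ]]; then
        # Print the command without touching the network or the disk.
        printf '%s\n' "${cmd[*]}"
    else
        mkdir -p "$cache" && "${cmd[@]}"
    fi
}
```

Running `DRY_RUN=1 prestage_model org/model` on a node shows the download command before committing to a large transfer.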

🤖 Generated with Claude Code

Hand-off of two TensorWave MI300X nodes (tw018, tw032). Unlike the amds/325x/355x AMD fleets these are standalone docker nodes, not Slurm, so they need a `docker run`-based launch script rather than salloc+enroot.

- runners/launch_mi300x-tw.sh: ROCm docker launcher modeled on the recovered AMD docker templates and launch_h100-cr.sh. Passes the GPUs via /dev/kfd,/dev/dri,/dev/mem; auto-uses `sudo docker` (runner user isn't in the docker group but has passwordless sudo); node-local HF cache under $HOME; full env passthrough (the slurm launchers get this via `srun --export=ALL`); dispatches to benchmarks/single_node/${SCENARIO_SUBDIR}<model>_<prec>_mi300x[_mtp].sh.
- .github/configs/runners.yaml: register mi300x-tw_00 (tw018) and mi300x-tw_01 (tw032) in the mi300x group and a dedicated mi300x-tw subfleet, mirroring the b200-dgxc/h200-dgxc dual-listing pattern.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{Reviewed by Cursor Bugbot for commit 02869cd. Configure here.}

cursor · 2026-06-16T00:25:12Z

+"${ENV_FLAGS[@]}" \
+--entrypoint=/bin/bash \
+"$IMAGE" \
+benchmarks/single_node/${SCENARIO_SUBDIR}"${EXP_NAME%%_*}_${PRECISION}_mi300x${SPEC_SUFFIX}.sh"


Pre-run cleanup skips sudo docker

Medium Severity

The launcher falls back to sudo docker when the runner user cannot run docker ps, but it never removes a leftover bmk-server container before docker run. The benchmark workflow’s pre-run Docker cleanup only invokes non-sudo docker, so on these nodes that step is skipped and a stale fixed-name container can make the next job fail immediately.
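
A minimal sketch of one way to address this inside the launcher: remove any leftover bmk-server container using whichever docker invocation the launcher has already selected. remove_stale_container is a hypothetical helper, not code from this PR.

```shell
#!/usr/bin/env bash
# Sketch only: remove_stale_container is a hypothetical helper.
# Usage: remove_stale_container NAME DOCKER_CMD...
remove_stale_container() {
    local name="$1"
    shift
    # "rm -f" force-removes the container even if it is still running.
    # A missing container is not treated as an error here.
    "$@" rm -f "$name" >/dev/null 2>&1 || true
}

# Example wiring, commented out so the sketch has no side effects:
# if docker ps >/dev/null 2>&1; then
#     remove_stale_container bmk-server docker
# else
#     remove_stale_container bmk-server sudo docker
# fi
```

Calling this just before docker run keeps the cleanup on the same sudo or non-sudo path as the launch itself.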


cursor · 2026-06-16T00:25:13Z

+    RESULT_DIR RESULT_FILENAME RUNNER_TYPE RUN_EVAL EVAL_ONLY
+    EVAL_CONTEXT_ARGS EVAL_MAX_MODEL_LEN
+    PROFILE SGLANG_TORCH_PROFILER_DIR VLLM_TORCH_PROFILER_DIR VLLM_RPC_TIMEOUT
+)


MODEL_PREFIX not passed to container

Low Severity

The explicit PASS_ENV list omits MODEL_PREFIX even though the benchmark workflow sets it on the runner host. Slurm MI300X jobs inherit it via --export=ALL, but this docker launcher only forwards named variables, so benchmark_lib.sh metadata will record infmax_model_prefix as unknown for TensorWave runs.
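
A sketch of the fix this comment implies, assuming PASS_ENV is a plain bash array of variable names. The list below is abbreviated to a few names visible in the quoted diff, and in_pass_env is a hypothetical helper for checking membership.

```shell
#!/usr/bin/env bash
# Sketch only. The list is abbreviated to names visible in the quoted diff;
# in_pass_env is a hypothetical helper.
PASS_ENV=(
    RESULT_DIR RESULT_FILENAME RUNNER_TYPE RUN_EVAL EVAL_ONLY
    MODEL_PREFIX   # the addition the review comment implies
)

# Return 0 if the named variable is listed in PASS_ENV.
in_pass_env() {
    local want="$1" v
    for v in "${PASS_ENV[@]}"; do
        [[ "$v" == "$want" ]] && return 0
    done
    return 1
}
```

A check like `in_pass_env MODEL_PREFIX` could guard against the list drifting from what the workflow sets on the runner host.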


cquil11 requested a review from a team June 16, 2026 00:23

github-project-automation Bot added this to InferenceMAX Board Jun 16, 2026

cursor Bot reviewed Jun 16, 2026

cquil11 wants to merge 1 commit into main from chore/add-new-mi300x-tw
