chore(runners): add TensorWave MI300X docker runners (mi300x-tw) #1793

Open: cquil11 wants to merge 1 commit into main from chore/add-new-mi300x-tw

cquil11 (Collaborator) commented Jun 16, 2026

What

Onboards the two TensorWave MI300X nodes that were just handed off (tw018 / 64.139.222.218 → mi300x-tw_00, tw032 / 64.139.222.212 → mi300x-tw_01) as self-hosted GitHub Actions runners, following utils/runner_setup/RUNNER_SETUP.md.

These are docker nodes, not Slurm — different from every current AMD fleet (amds/325x/355x), which run salloc + enroot squash on a Slurm worker. So they need a docker run-based launch script, in the spirit of the docker nodes we used to run (the deleted launch_mi300x-amd.sh / launch_mi300x-cr.sh, and the still-present NVIDIA launch_h100-cr.sh).

Changes

  • runners/launch_mi300x-tw.sh (new) — ROCm docker run launcher:
    • GPUs via --device=/dev/kfd --device=/dev/dri --device=/dev/mem + --privileged/SYS_PTRACE/seccomp=unconfined (standard AMD docker flags).
    • Auto-selects sudo docker: the runner user (cam) isn't in the docker group but has passwordless sudo (verified on both nodes). Falls back to plain docker if a future setup adds the group.
    • Node-local HF cache under $HOME/hf_hub_cache/ — these nodes have a single ~14T local root and no shared/NFS mount, so weights download on first use (stage them ahead of time for big models).
    • Disables numa_balancing (AMD throughput tuning), like the old AMD docker launchers.
    • Explicit env passthrough array (docker has no --export=ALL equivalent) covering everything the *_mi300x.sh scripts + benchmark_lib.sh read.
    • Dispatches to benchmarks/single_node/${SCENARIO_SUBDIR}<model>_<prec>_mi300x[_mtp].sh (SCENARIO_SUBDIR + SPEC_SUFFIX, parity with the h100/b200 launchers).
  • .github/configs/runners.yaml — register mi300x-tw_00 (tw018) and mi300x-tw_01 (tw032) in the mi300x group, plus a dedicated mi300x-tw subfleet (mirrors the b200-dgxc/h200-dgxc dual-listing).
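
As an illustration of two of the launcher behaviours listed above (the `sudo docker` auto-selection and the benchmark script dispatch), here is a minimal sketch. It is not the contents of runners/launch_mi300x-tw.sh: the function names `pick_docker_cmd` and `benchmark_script` are hypothetical, and the sample values in the usage note (`dsr1_1k1k`, `fp8`, `agentic/`) are made up for the example. The `docker ps` probe and the `${EXP_NAME%%_*}` path pattern follow what this PR describes.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not taken from launch_mi300x-tw.sh.

# Print "docker" if the current user can reach the Docker daemon,
# otherwise print "sudo docker" (assumes passwordless sudo on the node).
pick_docker_cmd() {
    if docker ps >/dev/null 2>&1; then
        echo "docker"
    else
        echo "sudo docker"
    fi
}

# Build the benchmark script path from the experiment name and precision.
# ${exp_name%%_*} keeps everything before the first underscore (the model).
# Arguments 3 and 4 stand in for SCENARIO_SUBDIR and SPEC_SUFFIX; both default to empty.
benchmark_script() {
    local exp_name="$1" precision="$2" scenario_subdir="${3:-}" spec_suffix="${4:-}"
    echo "benchmarks/single_node/${scenario_subdir}${exp_name%%_*}_${precision}_mi300x${spec_suffix}.sh"
}
```

With these definitions, `benchmark_script dsr1_1k1k fp8` would print `benchmarks/single_node/dsr1_fp8_mi300x.sh`, and `benchmark_script dsr1_1k1k fp8 agentic/ _mtp` would print `benchmarks/single_node/agentic/dsr1_fp8_mi300x_mtp.sh`.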

Validation

  • bash -n on the launch script ✅
  • YAML parses; both nodes present in mi300x and mi300x-tw
  • Launch-script-name invariant holds (mi300x-tw_NN → launch_mi300x-tw.sh exists) ✅
  • utils/matrix_logic/test_generate_sweep_configs.py — 91 passed ✅
  • Both nodes: confirmed gfx942 (MI300X), 192 CPUs, Docker 29.1.3, passwordless sudo, sudo docker works ✅

On-node registration — DONE ✅

Runners are registered and live (runner v2.335.1), one per node, started under tmux (session github-actions) via start_runners.sh:

mi300x-tw_00  status=online  busy=false  labels=[self-hosted,Linux,X64,mi300x,mi300x-tw,mi300x-tw_00]   # tw018
mi300x-tw_01  status=online  busy=false  labels=[self-hosted,Linux,X64,mi300x,mi300x-tw,mi300x-tw_01]   # tw032

Notes / follow-ups:

  • Runners run via ./run.sh in tmux (not systemd) — they don't survive a reboot; re-run start_runners.sh after node maintenance (standard for this fleet).
  • Optionally add cam to the docker group on both nodes to drop the sudo in the launcher.
  • Big-model weights will download to ~/hf_hub_cache on first run unless pre-staged.

🤖 Generated with Claude Code

Hand-off of two TensorWave MI300X nodes (tw018, tw032). Unlike the
amds/325x/355x AMD fleets these are standalone docker nodes, not Slurm,
so they need a `docker run`-based launch script rather than salloc+enroot.

- runners/launch_mi300x-tw.sh: ROCm docker launcher modeled on the
  recovered AMD docker templates and launch_h100-cr.sh. Passes the GPUs via
  /dev/kfd,/dev/dri,/dev/mem; auto-uses `sudo docker` (runner user isn't in
  the docker group but has passwordless sudo); node-local HF cache under
  $HOME; full env passthrough (the slurm launchers get this via
  `srun --export=ALL`); dispatches to
  benchmarks/single_node/${SCENARIO_SUBDIR}<model>_<prec>_mi300x[_mtp].sh.
- .github/configs/runners.yaml: register mi300x-tw_00 (tw018) and
  mi300x-tw_01 (tw032) in the mi300x group and a dedicated mi300x-tw
  subfleet, mirroring the b200-dgxc/h200-dgxc dual-listing pattern.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cquil11 requested a review from a team June 16, 2026 00:23

cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 02869cd.

"${ENV_FLAGS[@]}" \
--entrypoint=/bin/bash \
"$IMAGE" \
benchmarks/single_node/${SCENARIO_SUBDIR}"${EXP_NAME%%_*}_${PRECISION}_mi300x${SPEC_SUFFIX}.sh"

Pre-run cleanup skips sudo docker

Medium Severity

The launcher falls back to sudo docker when the runner user cannot run docker ps, but it never removes a leftover bmk-server container before docker run. The benchmark workflow’s pre-run Docker cleanup only invokes non-sudo docker, so on these nodes that step is skipped and a stale fixed-name container can make the next job fail immediately.
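
One possible way to address this, sketched under assumptions: remove any leftover container with the same docker command the launcher selected, right before `docker run`. The helper name `remove_stale_container` is hypothetical and this is not the fix adopted in the PR. The container name `bmk-server` comes from the comment above, and `docker rm -f` is the standard force-remove subcommand.

```shell
#!/usr/bin/env bash
# Sketch only: remove_stale_container is a hypothetical helper name.
# Usage: remove_stale_container NAME DOCKER_CMD...
remove_stale_container() {
    local name="$1"
    shift
    # "$@" now holds the docker command, e.g. "docker" or "sudo" "docker".
    # rm -f stops and removes the container; a failure (for example, no such
    # container) is deliberately ignored so the launch can proceed.
    "$@" rm -f "$name" >/dev/null 2>&1 || true
}
```

On these nodes the call would look like `remove_stale_container bmk-server sudo docker`. Passing the command as trailing arguments keeps the two-word `sudo docker` form intact without relying on word splitting.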

RESULT_DIR RESULT_FILENAME RUNNER_TYPE RUN_EVAL EVAL_ONLY
EVAL_CONTEXT_ARGS EVAL_MAX_MODEL_LEN
PROFILE SGLANG_TORCH_PROFILER_DIR VLLM_TORCH_PROFILER_DIR VLLM_RPC_TIMEOUT
)

MODEL_PREFIX not passed to container

Low Severity

The explicit PASS_ENV list omits MODEL_PREFIX even though the benchmark workflow sets it on the runner host. Slurm MI300X jobs inherit it via --export=ALL, but this docker launcher only forwards named variables, so benchmark_lib.sh metadata will record infmax_model_prefix as unknown for TensorWave runs.
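
To make the passthrough mechanism concrete, here is a hedged sketch of how a named-variable list can be turned into `docker run` flags, with `MODEL_PREFIX` included. The `PASS_ENV` list below is a shortened, illustrative subset and `build_env_flags` is a hypothetical helper. The launcher's real list and loop may differ. Values are written out as `-e NAME=VALUE` on the assumption that `sudo` may not preserve the caller's environment, in which case a bare `-e NAME` would forward nothing.

```shell
#!/usr/bin/env bash
# Sketch only: PASS_ENV is a shortened, illustrative subset of the real list,
# and build_env_flags is a hypothetical helper name.
PASS_ENV=(MODEL_PREFIX RESULT_DIR RESULT_FILENAME RUNNER_TYPE)

# Fill the global ENV_FLAGS array with "-e NAME=VALUE" pairs for every
# variable in PASS_ENV that is currently set (exported or not).
build_env_flags() {
    ENV_FLAGS=()
    local name
    for name in "${PASS_ENV[@]}"; do
        # ${!name+x} expands to "x" only if the variable named by $name is set.
        if [ -n "${!name+x}" ]; then
            ENV_FLAGS+=(-e "${name}=${!name}")
        fi
    done
}
```

After calling `build_env_flags`, the array can be expanded as `"${ENV_FLAGS[@]}"` in the `docker run` invocation, matching the usage in the snippet under review. Unset variables are skipped rather than forwarded as empty strings, so scripts inside the container can still apply their own defaults.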
