Skip to content

Fix multi-wire recv prefill deadlock in jaccl ring backend#3654

Open
jasonpaulso wants to merge 1 commit into
ml-explore:mainfrom
jasonpaulso:fix/ring-recv-prefill-upstream
Open

Fix multi-wire recv prefill deadlock in jaccl ring backend#3654
jasonpaulso wants to merge 1 commit into
ml-explore:mainfrom
jasonpaulso:fix/ring-recv-prefill-upstream

Conversation

@jasonpaulso

Copy link
Copy Markdown

Proposed changes

RingImpl::recv's prefill loop compares the relative chunk count (N * buff) against the absolute end offset of the wire's region (limits[lw]):

while (N * buff < limits[lw] && buff < PIPELINE) {

For every wire after the first, limits[lw] is large even when the wire's own region holds fewer than PIPELINE chunks — or zero chunks, for messages smaller than lw * bytes_per_wire. The prefill then posts receives that no matching send will ever fill (the send side correctly advances an absolute read_offset[lw] toward limits[lw]), so in_flight never drains and recv spins forever.

In practice: point-to-point recvs of small messages deadlock whenever a rank has more than one connection per direction. We hit this running distributed inference (exo) with MLX_JACCL_RING=1 over 3 Thunderbolt 5 links between two Macs — pipeline-parallel activation transfers (small tensors) hung 100% of the time, while all_reduce was unaffected because it falls back to a single wire for messages ≤ 64 KiB. Single-wire setups never see it because write_offset[0] == 0 makes the two comparisons coincide.

This fixes the comparison to use the region size, mirroring the send-side prefill:

while (N * buff < limits[lw] - write_offset[lw] && buff < PIPELINE) {

Verified on a 2-node M-series cluster with 3 TB5 links: pipeline-parallel send/recv over the ring backend now completes warmup and generates correctly; tensor-parallel throughput unchanged.

Checklist

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

🤖 Generated with Claude Code

RingImpl::recv's prefill compared the relative chunk count (N * buff)
against the absolute end offset of the wire's region (limits[lw]).
For every wire after the first, limits[lw] is large even when the
wire's own region holds fewer than PIPELINE chunks (or zero, for
messages smaller than lw * bytes_per_wire), so the prefill posted
receives that no matching send would ever fill. in_flight then never
drained and recv spun forever: point-to-point recvs of small messages
deadlocked whenever a rank had more than one connection per direction.

Compare against the region size (limits[lw] - write_offset[lw])
instead, mirroring the send-side prefill which advances the absolute
read_offset[lw] toward limits[lw].

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@zcbenz zcbenz left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks correct to me, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants