Fix multi-wire recv prefill deadlock in jaccl ring backend#3654
Open
jasonpaulso wants to merge 1 commit into
RingImpl::recv's prefill compared the relative chunk count (N * buff) against the absolute end offset of the wire's region (limits[lw]). For every wire after the first, limits[lw] is large even when the wire's own region holds fewer than PIPELINE chunks (or zero, for messages smaller than lw * bytes_per_wire), so the prefill posted receives that no matching send would ever fill. in_flight then never drained and recv spun forever: point-to-point recvs of small messages deadlocked whenever a rank had more than one connection per direction.

Compare against the region size (limits[lw] - write_offset[lw]) instead, mirroring the send-side prefill which advances the absolute read_offset[lw] toward limits[lw].

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
zcbenz
approved these changes
Jun 11, 2026
zcbenz
left a comment
Collaborator
Looks correct to me, thanks!
Proposed changes
`RingImpl::recv`'s prefill loop compares the relative chunk count (`N * buff`) against the absolute end offset of the wire's region (`limits[lw]`).

For every wire after the first, `limits[lw]` is large even when the wire's own region holds fewer than `PIPELINE` chunks, or zero chunks for messages smaller than `lw * bytes_per_wire`. The prefill then posts receives that no matching send will ever fill (the send side correctly advances an absolute `read_offset[lw]` toward `limits[lw]`), so `in_flight` never drains and `recv` spins forever.

In practice: point-to-point recvs of small messages deadlock whenever a rank has more than one connection per direction. We hit this running distributed inference (exo) with `MLX_JACCL_RING=1` over 3 Thunderbolt 5 links between two Macs: pipeline-parallel activation transfers (small tensors) hung 100% of the time, while `all_reduce` was unaffected because it falls back to a single wire for messages ≤ 64 KiB. Single-wire setups never see it because `write_offset[0] == 0` makes the two comparisons coincide.

This fixes the comparison to use the region size (`limits[lw] - write_offset[lw]`), mirroring the send-side prefill.
Verified on a 2-node M-series cluster with 3 TB5 links: pipeline-parallel send/recv over the ring backend now completes warmup and generates correctly; tensor-parallel throughput unchanged.
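
For readers without the diff at hand, the mismatch described above can be reproduced in a small standalone model. This is only a sketch under stated assumptions, not the code in `RingImpl`: the names `limits`, `write_offset`, `read_offset`, `PIPELINE`, `buff` and `bytes_per_wire` come from the description, while the helper functions, the `PIPELINE` value of 2, and the way a message is split into clamped per-wire regions are hypothetical and chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pipeline depth; the real PIPELINE value is not stated in this PR.
constexpr std::size_t PIPELINE = 2;

// Per-wire regions of one message: wire lw owns [write_offset[lw], limits[lw]).
struct Regions {
  std::vector<std::size_t> write_offset;
  std::vector<std::size_t> limits;
};

// Assumed split: each wire gets up to bytes_per_wire bytes, clamped to the
// message size, so later wires may own an empty region.
Regions split_message(std::size_t total, std::size_t bytes_per_wire,
                      std::size_t n_wires) {
  Regions r;
  for (std::size_t lw = 0; lw < n_wires; lw++) {
    r.write_offset.push_back(std::min(lw * bytes_per_wire, total));
    r.limits.push_back(std::min((lw + 1) * bytes_per_wire, total));
  }
  return r;
}

// Receives posted by the prefill on wire lw. The buggy variant compares the
// relative count n * buff against the absolute end offset limits[lw]; the
// fixed variant compares it against the region size.
std::size_t prefill_recvs(const Regions& r, std::size_t lw, std::size_t buff,
                          bool fixed) {
  assert(lw < r.limits.size());
  const std::size_t bound =
      fixed ? r.limits[lw] - r.write_offset[lw] : r.limits[lw];
  std::size_t n = 0;
  while (n < PIPELINE && n * buff < bound) n++;
  return n;
}

// Sends posted by the send-side prefill, which advances an absolute
// read_offset toward limits[lw].
std::size_t prefill_sends(const Regions& r, std::size_t lw, std::size_t buff) {
  assert(lw < r.limits.size());
  std::size_t read_offset = r.write_offset[lw];
  std::size_t n = 0;
  while (n < PIPELINE && read_offset < r.limits[lw]) {
    read_offset += buff;
    n++;
  }
  return n;
}

// Receives posted during prefill that no send will ever match. Any nonzero
// result models an in_flight count that can never drain.
std::size_t unmatched_recvs(std::size_t total, std::size_t bytes_per_wire,
                            std::size_t n_wires, std::size_t buff, bool fixed) {
  const Regions r = split_message(total, bytes_per_wire, n_wires);
  std::size_t unmatched = 0;
  for (std::size_t lw = 0; lw < n_wires; lw++) {
    unmatched += prefill_recvs(r, lw, buff, fixed) - prefill_sends(r, lw, buff);
  }
  return unmatched;
}
```

In this model, a 1024-byte message over 3 wires with `bytes_per_wire = 4096` and `buff = 512` leaves wires 1 and 2 with empty regions. The buggy comparison still posts `PIPELINE` receives on each of them, giving 4 unmatched receives, while the region-size comparison posts none. With a single wire both variants agree, which matches the observation that single-wire setups never hit the hang.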
Checklist
`pre-commit run --all-files` to format my code / installed pre-commit prior to committing changes

🤖 Generated with Claude Code