[SPARK-36082][SQL] Restrict null-aware anti joins to broadcastable right sides #55678
Open
jbharadw-oai wants to merge 1 commit into apache:master from
e6c0c68 to
bcc5b7a
Compare
sunchao
approved these changes
May 5, 2026
What changes were proposed in this pull request?
This PR restricts the single-column null-aware anti join optimization to cases where the right
side can actually be broadcast, following up on the earlier proposal in #33289.
It also makes adaptive query stage reuse mode-aware for hashed broadcast exchanges, so that a null-aware hashed relation is never reused in place of a regular one, or vice versa.
Why are the changes needed?
Single-column null-aware anti joins build the right side as a broadcast hash relation, but the
planner currently selects that path unconditionally once the logical pattern matches. That can
choose a broadcast hash join even when the right side is above the broadcast threshold.
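The gating described above can be sketched as a small decision function. This is a simplified model, not Spark's planner code: `JoinCandidate`, `choose_strategy`, and the returned strategy names are hypothetical, and the size check assumes that a negative threshold disables broadcasting.

```python
from dataclasses import dataclass


# Hypothetical stand-in for what the planner knows about a join.
@dataclass(frozen=True)
class JoinCandidate:
    is_single_column_null_aware_anti_join: bool
    right_size_in_bytes: int


def choose_strategy(join: JoinCandidate, broadcast_threshold: int) -> str:
    """Pick the null-aware broadcast path only if the right side fits."""
    # Assumption: a negative threshold means broadcasting is disabled.
    right_fits = 0 <= join.right_size_in_bytes <= broadcast_threshold
    if join.is_single_column_null_aware_anti_join and right_fits:
        return "null_aware_broadcast_hash_join"
    # Otherwise defer to whatever ordinary join selection would choose.
    return "normal_join_planning"
```

Before this change, the equivalent of the `right_fits` check was missing for the null-aware pattern, so the first branch was taken whenever the logical pattern matched.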
In addition, adaptive planning was treating hashed broadcast stages as interchangeable without
checking whether they were null-aware. Null-aware and regular hashed relations have different
semantics, so they should not be reused across each other.
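To illustrate why the two kinds of relation are not interchangeable, here is a minimal sketch on plain Python lists, with `None` standing in for SQL NULL. The function and class names (`left_anti`, `null_aware_anti`, `BroadcastMode`, `StageReuseCache`) are hypothetical and only model the idea; they are not Spark's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Column = List[Optional[int]]


def left_anti(left: Column, right: Column) -> Column:
    """Regular anti join: NULL never equals anything, so NULL left keys are kept."""
    keys = {r for r in right if r is not None}
    return [l for l in left if l is None or l not in keys]


def null_aware_anti(left: Column, right: Column) -> Column:
    """NOT IN semantics: a NULL on the right side rejects every left row."""
    if not right:
        return list(left)
    if any(r is None for r in right):
        return []
    keys = set(right)
    return [l for l in left if l is not None and l not in keys]


# Hypothetical description of how a broadcast relation is built.
@dataclass(frozen=True)
class BroadcastMode:
    key: str
    is_null_aware: bool


class StageReuseCache:
    """Reuses a built stage only when both the plan and the mode match."""

    def __init__(self) -> None:
        self._stages: Dict[Tuple[str, BroadcastMode], object] = {}
        self.builds = 0

    def get_or_build(self, plan_key: str, mode: BroadcastMode,
                     build: Callable[[], object]) -> object:
        cache_key = (plan_key, mode)  # the mode is part of the reuse key
        if cache_key not in self._stages:
            self._stages[cache_key] = build()
            self.builds += 1
        return self._stages[cache_key]
```

With `left = [1, 2, None]` and `right = [2, None]`, the regular anti join keeps `1` and the NULL row, while the null-aware version returns nothing. Keying the cache on the mode as well as the plan keeps a relation built for one semantics from being served to a join that expects the other.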
Does this PR introduce any user-facing change?
Yes. Queries that match the single-column null-aware anti join optimization no longer force a
broadcast hash join when the right side exceeds the broadcast threshold; they fall back to normal
join planning instead.
How was this patch tested?
Added regression coverage for:
JoinSelectionHelper.canPlanAsBroadcastHashJoin
Ran:
./build/sbt "catalyst/testOnly org.apache.spark.sql.catalyst.optimizer.JoinSelectionHelperSuite -- -z single-column"
./build/sbt "sql/testOnly org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- -z hashed"
./build/sbt "sql/testOnly org.apache.spark.sql.JoinSuite -- -z SPARK-36082"
./build/sbt "sql/testOnly org.apache.spark.sql.SparkSessionExtensionSuite"
./build/sbt "sql/testOnly org.apache.spark.sql.SubquerySuite -- -z SingleColumn"
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Codex GPT-5