[SPARK-36082][SQL] Restrict null-aware anti joins to broadcastable right sides #55678

Open
jbharadw-oai wants to merge 1 commit into apache:master from jbharadw-oai:dev/jbharadw/spark-36082-null-aware-anti-join-oss

@jbharadw-oai

What changes were proposed in this pull request?

This PR restricts the single-column null-aware anti join optimization to cases where the right
side can actually be broadcast, following up on the earlier proposal in #33289.

It also makes adaptive query stage reuse mode-aware for hashed broadcast exchanges:

  • regular equi joins only reuse non-null-aware hashed broadcast stages
  • null-aware anti joins only reuse null-aware hashed broadcast stages
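The reuse rule in the two bullets above can be pictured with a toy model. This is only a sketch in plain Python, not Spark code: `HashedBroadcastMode`, `StageCache`, and the string plan fingerprint are hypothetical stand-ins, and the only idea taken from this PR is that the null-aware flag is part of what decides whether a stage may be reused.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class HashedBroadcastMode:
    """Hypothetical stand-in for a hashed broadcast mode (not Spark's class)."""
    keys: Tuple[str, ...]
    null_aware: bool


class StageCache:
    """Toy reuse cache keyed by (plan fingerprint, broadcast mode)."""

    def __init__(self) -> None:
        self._stages: Dict[Tuple[str, HashedBroadcastMode], int] = {}
        self._next_id = 0

    def get_or_create(self, plan_fingerprint: str, mode: HashedBroadcastMode) -> int:
        # The mode, including its null_aware flag, is part of the key, so a
        # null-aware stage is never handed to a regular equi join or vice versa.
        key = (plan_fingerprint, mode)
        if key not in self._stages:
            self._stages[key] = self._next_id
            self._next_id += 1
        return self._stages[key]


cache = StageCache()
regular = HashedBroadcastMode(keys=("b",), null_aware=False)
null_aware = HashedBroadcastMode(keys=("b",), null_aware=True)

first = cache.get_or_create("scan t2", regular)
second = cache.get_or_create("scan t2", regular)    # same mode: reused
third = cache.get_or_create("scan t2", null_aware)  # different mode: new stage
```

In this model, two regular joins over the same right side share one stage, while a null-aware anti join over that same right side gets its own.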

Why are the changes needed?

Single-column null-aware anti joins build the right side as a broadcast hash relation, but the
planner currently selects that path unconditionally once the logical pattern matches. As a
result, it can choose a broadcast hash join even when the right side exceeds the broadcast threshold.
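A minimal sketch of the intended gating, written as a toy Python model rather than the actual planner code. The names `RightSide`, `can_broadcast_by_size`, and `plan_single_column_naaj` are hypothetical, and the sketch assumes that "can actually be broadcast" reduces to a size check against the broadcast threshold (it ignores hints and any other conditions the real planner may apply).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RightSide:
    """Hypothetical summary of the right side's statistics."""
    size_in_bytes: int


def can_broadcast_by_size(right: RightSide, threshold_bytes: int) -> bool:
    # Assumption: a negative threshold disables size-based broadcasting.
    return 0 <= right.size_in_bytes <= threshold_bytes


def plan_single_column_naaj(matches_pattern: bool,
                            right: RightSide,
                            threshold_bytes: int) -> str:
    # Old behaviour (as described above): matching the pattern was enough.
    # Sketched new behaviour: the pattern must match AND the right side
    # must be broadcastable; otherwise fall back to normal join planning.
    if matches_pattern and can_broadcast_by_size(right, threshold_bytes):
        return "null-aware broadcast hash join"
    return "normal join planning"


TEN_MB = 10 * 1024 * 1024
small = RightSide(size_in_bytes=1024 * 1024)
large = RightSide(size_in_bytes=1024 * 1024 * 1024)

small_plan = plan_single_column_naaj(True, small, TEN_MB)
large_plan = plan_single_column_naaj(True, large, TEN_MB)
disabled_plan = plan_single_column_naaj(True, small, -1)
```

Here only `small_plan` takes the null-aware broadcast path; the oversized right side and the disabled threshold both fall through to normal planning.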

In addition, adaptive planning treated hashed broadcast stages as interchangeable without
checking whether they were null-aware. Null-aware and regular hashed relations have different
semantics, so a stage built for one must not be reused for the other.
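To illustrate why the two kinds of relation are not interchangeable, the sketch below models both join flavours over single-column keys in plain Python, with `None` standing in for SQL NULL. The function names are hypothetical; the behaviour follows standard SQL semantics for `NOT IN` (null-aware) versus an equality-based left anti join, not Spark's internal implementation.

```python
from typing import List, Optional

Key = Optional[int]


def regular_anti_join(left: List[Key], right: List[Key]) -> List[Key]:
    # Equality-based left anti join: NULL never equals anything, so NULLs on
    # the right are ignored and NULL keys on the left are always kept.
    right_keys = {k for k in right if k is not None}
    return [k for k in left if k is None or k not in right_keys]


def null_aware_anti_join(left: List[Key], right: List[Key]) -> List[Key]:
    # NOT IN semantics:
    #   - empty right side: every left row qualifies, including NULL keys
    #   - any NULL on the right: no left row qualifies
    #   - otherwise: keep non-NULL left keys that have no match
    if not right:
        return list(left)
    if any(k is None for k in right):
        return []
    right_keys = set(right)
    return [k for k in left if k is not None and k not in right_keys]


left_keys: List[Key] = [1, 2, None]

regular_with_null = regular_anti_join(left_keys, [2, None])        # [1, None]
null_aware_with_null = null_aware_anti_join(left_keys, [2, None])  # []
regular_no_null = regular_anti_join(left_keys, [2])                # [1, None]
null_aware_no_null = null_aware_anti_join(left_keys, [2])          # [1]
```

The same right-side data yields different results depending on the flavour, which is why the broadcast side has to be built with the matching mode.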

Does this PR introduce any user-facing change?

Yes. Queries that match the single-column null-aware anti join optimization no longer force a
broadcast hash join when the right side exceeds the broadcast threshold; they fall back to normal
join planning instead.

How was this patch tested?

Added regression coverage for:

  • JoinSelectionHelper.canPlanAsBroadcastHashJoin
  • physical planning when a null-aware anti join right side is above the broadcast threshold
  • adaptive query stage reuse for null-aware vs regular hashed broadcast modes

Ran:

  • ./build/sbt "catalyst/testOnly org.apache.spark.sql.catalyst.optimizer.JoinSelectionHelperSuite -- -z single-column"
  • ./build/sbt "sql/testOnly org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- -z hashed"
  • ./build/sbt "sql/testOnly org.apache.spark.sql.JoinSuite -- -z SPARK-36082"
  • ./build/sbt "sql/testOnly org.apache.spark.sql.SparkSessionExtensionSuite"
  • ./build/sbt "sql/testOnly org.apache.spark.sql.SubquerySuite -- -z SingleColumn"

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex GPT-5

@jbharadw-oai force-pushed the dev/jbharadw/spark-36082-null-aware-anti-join-oss branch from e6c0c68 to bcc5b7a on May 5, 2026 00:40
@sunchao (Member) left a comment

(internally reviewed)

cc @cloud-fan @viirya
