[SPARK-54554][SQL] Enable Dynamic Partition Pruning with CommandResult #53263
+44
−1
What changes were proposed in this pull request?
This PR enables Dynamic Partition Pruning (DPP) optimization when joining with CommandResult nodes (e.g., results from SHOW PARTITIONS).
Changes made to sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala:
Added test coverage in sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala to verify DPP works correctly with CommandResult.
Also built and tested locally against tag v4.0.1 to verify both the query results and the Spark plan.
https://issues.apache.org/jira/browse/SPARK-54554
Why are the changes needed?
Previously, when using SHOW PARTITIONS results in a broadcast join, Spark would perform full table scans instead of applying Dynamic Partition Pruning.
Example scenario where this matters:
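import org.apache.spark.sql.functions.{col, max}
// Largest partition id, derived from the SHOW PARTITIONS output (a CommandResult)
val partitions = spark.sql("SHOW PARTITIONS fact_table")
  .selectExpr("cast(split(partition, '=')[1] as int) as partition_id")
  .agg(max("partition_id"))
// Join the fact table against that single-row result
spark.table("fact_table")
  .join(partitions, col("partition_id") === col("max(partition_id)"))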
Before this fix: Full table scan of all partitions
After this fix: DPP prunes to only the relevant partition(s)
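One way to observe the before/after difference is to capture the text that `explain(true)` prints for the joined DataFrame and look for a dynamic pruning marker. The following is a minimal sketch rather than the check used in this PR: `DppPlanCheck`, `captureStdout`, and `mentionsDynamicPruning` are hypothetical helper names, and the marker string `dynamicpruning` is an assumption about how the pruning subquery is rendered in plan output, which may differ between Spark versions.

```scala
import java.io.ByteArrayOutputStream

object DppPlanCheck {
  // Hypothetical helper: run `body` and return everything it printed
  // through Scala's Console (which is what println writes to).
  def captureStdout(body: => Unit): String = {
    val buffer = new ByteArrayOutputStream()
    Console.withOut(buffer) { body }
    buffer.toString("UTF-8")
  }

  // Assumption: a plan with DPP applied contains the text "dynamicpruning"
  // (case-insensitive) somewhere in its rendered form.
  def mentionsDynamicPruning(planText: String): Boolean =
    planText.toLowerCase.contains("dynamicpruning")
}
```

With the join from the example scenario bound to a value, say `joined`, the check would be `DppPlanCheck.mentionsDynamicPruning(DppPlanCheck.captureStdout(joined.explain(true)))`. Under the assumptions above, this should be false before the fix and true after it.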
Does this PR introduce any user-facing change?
Yes. Queries that join partitioned tables with SHOW PARTITIONS results (or other commands returning CommandResult) will now benefit from Dynamic Partition Pruning, potentially improving performance by scanning fewer partitions.
The behavior change is transparent to users: affected queries can run faster with no code changes required.
How was this patch tested?
Added new test case "DPP with CommandResult from SHOW PARTITIONS in broadcast join" in DynamicPartitionPruningSuite that verifies:
- DPP is applied when joining with CommandResult
- Correct query results are returned
- Plan contains DynamicPruningSubquery operator
Ran full DynamicPartitionPruning test suite (73 tests total) - all passed
Tested manually with local Spark build using various CommandResult scenarios
Was this patch authored or co-authored using generative AI tooling?
No