[WIP][SPARK-55715][SQL] Keep outputOrdering when GroupPartitionsExec coalesces partitions #55116
Draft: peter-toth wants to merge 2 commits into apache:master
Force-pushed 87b70e7 to c6e685a:
…canPartitioningAndOrdering

### What changes were proposed in this pull request?

`V2ScanPartitioningAndOrdering.ordering` was calling `V2ExpressionUtils.toCatalystOrdering` without the `funCatalog` argument. This meant that function-based sort expressions reported by a data source via `SupportsReportOrdering` (e.g. transform functions like `bucket(n, col)`) could not be resolved against the function catalog and were silently dropped. The fix passes `relation.funCatalog` as the third argument, consistent with how `toCatalystOpt` is already called in the `partitioning` rule of the same object.

### Why are the changes needed?

Without the function catalog, sort orders involving catalog functions reported by `SupportsReportOrdering` are not resolved, so the planner ignores them even when the data source correctly reports them.

### Does this PR introduce _any_ user-facing change?

Yes. Data sources implementing `SupportsReportOrdering` with function-based sort expressions that require the function catalog will now have those sort orders correctly recognized by Spark, potentially eliminating unnecessary sort operations.

### How was this patch tested?

`WriteDistributionAndOrderingSuite` already covers this because `InMemoryBaseTable` is updated to use `InMemoryBatchScanWithOrdering` (a new inner class implementing `SupportsReportOrdering`) when a table ordering is configured.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6
Force-pushed c6e685a to 521449b:
…sces partitions

### What changes were proposed in this pull request?

#### Background

`GroupPartitionsExec` coalesces multiple input partitions that share the same partition key into a single output partition. Before this PR, `outputOrdering` was always discarded after coalescing: even when the child reported ordering (e.g. via `SupportsReportOrdering`) or when ordering was derived from `KeyedPartitioning` key expressions (via `spark.sql.sources.v2.bucketing.partitionKeyOrdering.enabled`), coalescing by simple concatenation destroyed the within-partition ordering. This forced `EnsureRequirements` to inject an extra `SortExec` before `SortMergeJoinExec`, defeating the purpose of using a storage-partitioned join.

#### k-way merge: SortedMergeCoalescedRDD

This PR introduces `SortedMergeCoalescedRDD`, a new RDD that coalesces partitions by performing a k-way merge instead of simple concatenation. When multiple input partitions share the same key, a priority-queue-based merge interleaves their rows in sorted order, producing a single output partition whose row order matches the child's `outputOrdering`.

`GroupPartitionsExec.doExecute()` uses `SortedMergeCoalescedRDD` when all of the following hold:

1. `spark.sql.sources.v2.bucketing.preserveOrderingOnCoalesce.enabled` is `true`.
2. The child reports a non-empty `outputOrdering`.
3. The child subtree is safe for concurrent partition reads (`childIsSafeForKWayMerge`).
4. At least one output partition actually coalesces multiple input partitions.

When the config is enabled, the k-way merge is always applied regardless of whether the parent operator actually requires the ordering. Making this dynamic (only merge-sort when required) will be addressed in a follow-up ticket.
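For readers unfamiliar with the technique, the priority-queue merge of k sorted inputs can be sketched in plain Java. This is only an illustration of the general algorithm, not Spark's actual `SortedMergeCoalescedRDD` code; the class and method names here are hypothetical:

```java
import java.util.*;

public class KWayMerge {
    // Merge k already-sorted iterators into one sorted iterator using a
    // priority queue keyed on each iterator's current head element.
    static <T> Iterator<T> merge(List<Iterator<T>> inputs, Comparator<T> cmp) {
        // Each queue entry pairs a head element with its source iterator.
        PriorityQueue<Map.Entry<T, Iterator<T>>> pq =
            new PriorityQueue<>(Math.max(1, inputs.size()),
                (a, b) -> cmp.compare(a.getKey(), b.getKey()));
        for (Iterator<T> it : inputs) {
            if (it.hasNext()) {
                pq.add(new AbstractMap.SimpleEntry<>(it.next(), it));
            }
        }
        return new Iterator<T>() {
            @Override public boolean hasNext() { return !pq.isEmpty(); }
            @Override public T next() {
                Map.Entry<T, Iterator<T>> top = pq.poll();
                T result = top.getKey();
                Iterator<T> src = top.getValue();
                if (src.hasNext()) {  // refill from the same source iterator
                    pq.add(new AbstractMap.SimpleEntry<>(src.next(), src));
                }
                return result;
            }
        };
    }

    public static void main(String[] args) {
        // Three "partitions", each sorted internally, sharing the same key.
        List<Iterator<Integer>> parts = Arrays.asList(
            Arrays.asList(1, 4, 7).iterator(),
            Arrays.asList(2, 5, 8).iterator(),
            Arrays.asList(3, 6, 9).iterator());
        List<Integer> out = new ArrayList<>();
        merge(parts, Comparator.naturalOrder()).forEachRemaining(out::add);
        System.out.println(out);  // [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```

Note that, exactly as the section above describes, all k source iterators are open simultaneously and are advanced interleaved on one thread, which is why the child subtree must be safe for concurrent partition reads.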
#### Why k-way merge safety matters: SafeForKWayMerge

Unlike `CoalescedRDD`, which processes input partitions sequentially, `SortedMergeCoalescedRDD` opens all N input partition iterators upfront and interleaves reads across them, all on a single JVM thread within a single Spark task. A `SparkPlan` object is shared across all partition computations, so any plan node that stores per-partition mutable state in an instance field rather than inside the partition's iterator closure is aliased across all N concurrent computations. The last writer wins, and any computation that reads or frees state based on its own earlier write will operate on incorrect state (a use-after-free).

To avoid this class of bugs, `GroupPartitionsExec` uses a whitelist approach via a new marker trait `SafeForKWayMerge`. Nodes implementing this trait guarantee that all per-partition mutable state is captured inside the partition's iterator closure (e.g. via the `PartitionEvaluatorFactory` pattern), never in shared plan-node instance fields. Unknown node types fall through to unsafe, causing a silent fallback to simple sequential coalescing. The following nodes implement `SafeForKWayMerge`:

- `DataSourceV2ScanExecBase` (leaf nodes reading from V2 sources)
- `ProjectExec`, `FilterExec` (stateless row-by-row operators)
- `WholeStageCodegenExec`, `InputAdapter` (code-gen wrappers that delegate to the above)

#### GroupPartitionsExec.outputOrdering

`GroupPartitionsExec.outputOrdering` is updated to reflect what ordering is preserved:

1. **No coalescing** (all groups ≤ 1 partition): `child.outputOrdering` is passed through unchanged.
2. **Coalescing with k-way merge** (config enabled + `childIsSafeForKWayMerge`): `child.outputOrdering` is returned in full; the k-way merge produces a globally sorted partition.
3. **Coalescing without k-way merge, no reducers**: only sort orders whose expression is a partition key expression are returned. These key expressions evaluate to the same constant value within every merged partition (all merged splits share the same key), so their sort orders remain valid after concatenation. This is the ordering preserved by the existing `spark.sql.sources.v2.bucketing.preserveKeyOrderingOnCoalesce.enabled` config.
4. **Coalescing without k-way merge, with reducers**: `super.outputOrdering` (empty); the reduced key can take different values within the output partition, so no ordering is guaranteed.

#### DataSourceRDD: concurrent-reader metrics support

`SortedMergeCoalescedRDD` opens multiple `PartitionReader`s concurrently within a single Spark task. The existing `DataSourceRDD` assumed at most one active reader per task at a time, so only the last reader's custom metrics were reported (the previous readers' metrics were overwritten and lost). `DataSourceRDD` is refactored to support concurrent readers:

- A new `TaskState` class (one per task) holds an `ArrayBuffer[PartitionIterator[_]]` (`partitionIterators`) tracking all readers opened for the task, Spark input metrics (`InputMetrics`), and a `closedMetrics` map accumulating final metric values from already-closed readers.
- `mergeAndUpdateCustomMetrics()` runs in two phases: (1) drain closed iterators into `closedMetrics`; (2) merge live readers' current values with `closedMetrics` via the new `CustomTaskMetric.mergeWith()` and push the result to the Spark UI accumulators.
- This works correctly in all three execution modes: single partition per task, sequential coalescing (one reader at a time), and concurrent k-way merge (N readers simultaneously).

#### CustomTaskMetric.mergeWith

A new default method `mergeWith(CustomTaskMetric other)` is added to `CustomTaskMetric`. The default implementation sums the two values, which is correct for count-type metrics. Data sources with non-additive metrics (e.g. max, average) should override this method.
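As an illustration of the merge idea, the sketch below uses a simplified stand-in for `CustomTaskMetric` (the real interface exposes `name()` and `value()`; the `mergeWith` signature here takes a raw `long` for brevity, so treat it as an assumption rather than the PR's exact API). It shows the summing default, a non-additive override, and the "fold live readers into closed totals" step:

```java
import java.util.*;

public class MetricsMergeDemo {
    // Simplified stand-in for Spark's CustomTaskMetric. The mergeWith default
    // mirrors the PR's idea: sum the two values (correct for count-type
    // metrics); non-additive metrics override it.
    interface TaskMetric {
        String name();
        long value();
        default long mergeWith(long otherValue) {
            return value() + otherValue;
        }
    }

    static class Counter implements TaskMetric {
        private final String name;
        private final long value;
        Counter(String name, long value) { this.name = name; this.value = value; }
        public String name() { return name; }
        public long value() { return value; }
    }

    static class MaxMetric extends Counter {
        MaxMetric(String name, long value) { super(name, value); }
        @Override public long mergeWith(long otherValue) {
            return Math.max(value(), otherValue);  // non-additive override
        }
    }

    // Analogue of phase 2 of mergeAndUpdateCustomMetrics(): fold live readers'
    // current values into totals already drained from closed readers.
    static Map<String, Long> mergeLive(Map<String, Long> closedMetrics,
                                       List<TaskMetric> liveMetrics) {
        Map<String, Long> result = new HashMap<>(closedMetrics);
        for (TaskMetric m : liveMetrics) {
            result.merge(m.name(), m.value(), (closed, live) -> m.mergeWith(closed));
        }
        return result;
    }

    public static void main(String[] args) {
        // Already-closed readers contributed 15 rows (max batch 64); one live
        // reader has read 3 rows so far with a batch of 128.
        Map<String, Long> totals = mergeLive(
            Map.of("rowsRead", 15L, "maxBatchSize", 64L),
            List.of(new Counter("rowsRead", 3L), new MaxMetric("maxBatchSize", 128L)));
        System.out.println(totals.get("rowsRead"));      // 18
        System.out.println(totals.get("maxBatchSize"));  // 128
    }
}
```

The pull-based shape is the point: no prior values are threaded into reader constructors; totals are combined only at reporting time.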
This replaces the previously proposed `PartitionReader.initMetricsValues` mechanism (which threaded prior metric values into the next reader's constructor) with a cleaner, pull-based merge at reporting time. `PartitionReader.initMetricsValues` is deprecated as it is no longer needed.

### Why are the changes needed?

Without this fix, `GroupPartitionsExec` always discards ordering when coalescing, forcing `EnsureRequirements` to inject an extra `SortExec` before `SortMergeJoinExec` even when the data is already sorted by the join key within each partition. With `SortedMergeCoalescedRDD`, the full child ordering is preserved end-to-end, eliminating these redundant sorts and making storage-partitioned joins with ordering fully efficient.

`spark.sql.sources.v2.bucketing.preserveKeyOrderingOnCoalesce.enabled` (introduced earlier) preserves only sort orders over partition key expressions, which remain constant within a merged partition. This PR goes further: by performing a k-way merge, the full `outputOrdering`, including secondary sort columns beyond the partition key, is preserved end-to-end.

### Does this PR introduce _any_ user-facing change?

Yes. A new SQL configuration is added:

- `spark.sql.sources.v2.bucketing.preserveOrderingOnCoalesce.enabled` (default: `false`): when enabled, `GroupPartitionsExec` uses a k-way merge to coalesce partitions while preserving the full child ordering, avoiding extra sort steps for operations like `SortMergeJoin`.

### How was this patch tested?

- **`SortedMergeCoalescedRDDSuite`**: unit tests for the new RDD covering correctness of the k-way merge, empty partitions, single partition, and ordering guarantees.
- **`GroupPartitionsExecSuite`**: unit tests covering all four branches of `outputOrdering` (no coalescing; k-way merge enabled; key-expression ordering only; reducers present).
- **`KeyGroupedPartitioningSuite`**: SQL-level tests verifying that no extra `SortExec` is injected when `SortedMergeCoalescedRDD` is used, and a new test (`SPARK-55715: Custom metrics of sorted-merge coalesced partitions`) that verifies per-scan custom metrics are correctly reported across concurrent readers in the k-way merge case.
- **`BufferedRowsReader` hardening**: the test-framework reader in `InMemoryBaseTable` now tracks a `closed` flag and throws `IllegalStateException` for reads, double-closes, or metric fetches on a closed reader. This ensures future tests catch reader lifecycle bugs that were previously hidden by the no-op `close()`.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6
Force-pushed 521449b to 1c9e6ff.
Comment from peter-toth (author): This PR is draft as it requires the changes from #55137. Once that PR is merged I will rebase on