[WIP][SPARK-55715][SQL] Keep outputOrdering when GroupPartitionsExec coalesces partitions#55116

Draft
peter-toth wants to merge 2 commits into apache:master from
peter-toth:SPARK-55715-keep-outputordering-when-grouping-partitions

Conversation


@peter-toth peter-toth commented Mar 31, 2026

### What changes were proposed in this pull request?

#### Background

`GroupPartitionsExec` coalesces multiple input partitions that share the same partition key into a single output partition. Before this PR, `outputOrdering` was always discarded after coalescing: even when the child reported ordering (e.g. via `SupportsReportOrdering`) or when ordering was derived from `KeyedPartitioning` key expressions (via `spark.sql.sources.v2.bucketing.partitionKeyOrdering.enabled`), coalescing by simple concatenation destroyed the within-partition ordering. This forced `EnsureRequirements` to inject an extra `SortExec` before `SortMergeJoinExec`, defeating the purpose of using a storage-partitioned join.

#### k-way merge: SortedMergeCoalescedRDD

This PR introduces `SortedMergeCoalescedRDD`, a new RDD that coalesces partitions by performing a k-way merge instead of simple concatenation. When multiple input partitions share the same key, a priority-queue-based merge interleaves their rows in sorted order, producing a single output partition whose row order matches the child's `outputOrdering`.
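The priority-queue merge itself is the classic algorithm. Below is an illustrative Python sketch, not the PR's Scala implementation; `k_way_merge` and its signature are made up for this example:

```python
import heapq

def k_way_merge(iterators, key=lambda row: row):
    """Merge already-sorted inputs into one sorted stream.

    A priority queue holds one head row per live iterator; repeatedly
    popping the smallest head and refilling from the same iterator
    yields a sorted interleaving, analogous to how
    SortedMergeCoalescedRDD combines input partitions sharing a key.
    """
    heap = []
    for idx, it in enumerate(map(iter, iterators)):
        first = next(it, None)
        if first is not None:
            # idx breaks ties so rows/iterators are never compared.
            heapq.heappush(heap, (key(first), idx, first, it))
    while heap:
        _, idx, row, it = heapq.heappop(heap)
        yield row
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (key(nxt), idx, nxt, it))

# list(k_way_merge([[1, 4, 7], [2, 5], [3, 6, 9]]))
# -> [1, 2, 3, 4, 5, 6, 7, 9]
```

Because each input is consumed lazily, the merge streams rows without buffering whole partitions, at the cost of holding all k iterators open at once.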

`GroupPartitionsExec.doExecute()` uses `SortedMergeCoalescedRDD` when all of the following hold:
1. `spark.sql.sources.v2.bucketing.preserveOrderingOnCoalesce.enabled` is `true`.
2. The child reports a non-empty `outputOrdering`.
3. The child subtree is safe for concurrent partition reads (`childIsSafeForKWayMerge`).
4. At least one output partition actually coalesces multiple input partitions.

When the config is enabled, the k-way merge is always applied regardless of whether the parent operator actually requires the ordering. Making this dynamic (merge-sorting only when required) will be addressed in a follow-up ticket.

#### Why k-way merge safety matters: SafeForKWayMerge

Unlike `CoalescedRDD`, which processes input partitions sequentially, `SortedMergeCoalescedRDD` opens all N input partition iterators upfront and interleaves reads across them — all on a single JVM thread within a single Spark task. A `SparkPlan` object is shared across all partition computations, so any plan node that stores per-partition mutable state in an instance field rather than inside the partition's iterator closure has that state aliased across all N concurrent computations. The last writer wins, and any computation that reads or frees state based on its own earlier write will operate on incorrect state (a use-after-free).
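The hazard is easy to reproduce outside Spark. In this hypothetical Python sketch, `UnsafeScan` plays the role of a plan node that keeps its reader in an instance field: sequential consumption works, but interleaved consumption silently returns another partition's rows:

```python
class UnsafeScan:
    """Toy node that stores its reader in an instance field, mimicking
    a plan node that is NOT safe for k-way merge (illustrative only)."""
    def __init__(self):
        self.reader = None  # shared across all partition computations

    def partition_iter(self, data):
        self.reader = iter(data)        # last writer wins
        for _ in range(len(data)):
            yield next(self.reader)     # reads the shared field

scan = UnsafeScan()
# Sequential consumption (CoalescedRDD style) is fine:
assert list(scan.partition_iter([1, 2, 3])) == [1, 2, 3]

# Interleaved consumption (k-way merge style) corrupts the output:
a = scan.partition_iter([1, 2, 3])
b = scan.partition_iter([10, 20, 30])
interleaved = [next(a), next(b), next(a), next(b)]
# a's second read pulls from b's reader: [1, 10, 20, 30]
assert interleaved == [1, 10, 20, 30]
```

Moving the reader into the iterator closure (a local variable of `partition_iter`) makes the same code safe, which is exactly the property the `SafeForKWayMerge` whitelist certifies.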

To avoid this class of bugs, `GroupPartitionsExec` uses a whitelist approach via a new marker trait `SafeForKWayMerge`. Nodes implementing this trait guarantee that all per-partition mutable state is captured inside the partition's iterator closure (e.g. via the `PartitionEvaluatorFactory` pattern), never in shared plan-node instance fields. Unknown node types fall through to unsafe, causing a silent fallback to simple sequential coalescing. The following nodes implement `SafeForKWayMerge`:
- `DataSourceV2ScanExecBase` (leaf nodes reading from V2 sources)
- `ProjectExec`, `FilterExec` (stateless row-by-row operators)
- `WholeStageCodegenExec`, `InputAdapter` (code-gen wrappers that delegate to the above)
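The whitelist check can be sketched as follows, with hypothetical stand-in classes rather than Spark's: a subtree is safe only if the node itself opts in and every child subtree is safe, so unknown node types default to unsafe:

```python
# Illustrative stand-ins for plan nodes; names echo the PR's whitelist
# but this is not Spark code.
class SafeForKWayMerge:           # marker "trait"
    pass

class ScanExec(SafeForKWayMerge):
    children = ()

class FilterExec(SafeForKWayMerge):
    def __init__(self, child):
        self.children = (child,)

class SortExec:                   # not whitelisted -> unsafe
    def __init__(self, child):
        self.children = (child,)

def is_safe_for_k_way_merge(node):
    """Whole subtree must opt in; anything unknown defaults to unsafe."""
    return isinstance(node, SafeForKWayMerge) and all(
        is_safe_for_k_way_merge(c) for c in node.children)

assert is_safe_for_k_way_merge(FilterExec(ScanExec()))
assert not is_safe_for_k_way_merge(SortExec(ScanExec()))
```

Defaulting to unsafe trades some missed optimization opportunities for correctness: a new operator only joins the fast path after someone audits its per-partition state handling.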

#### GroupPartitionsExec.outputOrdering

`GroupPartitionsExec.outputOrdering` is updated to reflect what ordering is preserved:
1. **No coalescing** (all groups ≤ 1 partition): `child.outputOrdering` is passed through unchanged.
2. **Coalescing with k-way merge** (config enabled + `childIsSafeForKWayMerge`): `child.outputOrdering` is returned in full — the k-way merge produces a globally sorted partition.
3. **Coalescing without k-way merge, no reducers**: only sort orders whose expression is a partition key expression are returned. These key expressions evaluate to the same constant value within every merged partition (all merged splits share the same key), so their sort orders remain valid after concatenation. This is the ordering preserved by the existing `spark.sql.sources.v2.bucketing.preserveKeyOrderingOnCoalesce.enabled` config.
4. **Coalescing without k-way merge, with reducers**: `super.outputOrdering` (empty) — the reduced key can take different values within the output partition, so no ordering is guaranteed.
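The four branches can be summarized with an illustrative decision function (parameter and function names are mine, not Spark's; sort orders are reduced to plain column names):

```python
def output_ordering(child_ordering, partition_keys, *,
                    coalesces, kway_merge, has_reducers):
    """Sketch of the outputOrdering decision in GroupPartitionsExec."""
    if not coalesces:
        return child_ordering            # 1. nothing merged
    if kway_merge:
        return child_ordering            # 2. merge preserves full order
    if not has_reducers:
        # 3. key columns are constant within each merged partition
        return [c for c in child_ordering if c in partition_keys]
    return []                            # 4. no guarantee

ordering, keys = ["k", "ts"], {"k"}
assert output_ordering(ordering, keys, coalesces=True,
                       kway_merge=True, has_reducers=False) == ["k", "ts"]
assert output_ordering(ordering, keys, coalesces=True,
                       kway_merge=False, has_reducers=False) == ["k"]
```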

#### DataSourceRDD: concurrent-reader metrics support

`SortedMergeCoalescedRDD` opens multiple `PartitionReader`s concurrently within a single Spark task. The existing `DataSourceRDD` assumed at most one active reader per task at a time, causing only the last reader's custom metrics to be reported (the previous readers' metrics were overwritten and lost).

`DataSourceRDD` is refactored to support concurrent readers:
- A new `TaskState` class (one per task) holds an `ArrayBuffer[PartitionIterator[_]]` (`partitionIterators`) tracking all readers opened for the task, Spark input metrics (`inputMetrics`), and a `closedMetrics` map accumulating final metric values from already-closed readers.
- `mergeAndUpdateCustomMetrics()` runs in two phases: (1) drain closed iterators into `closedMetrics`; (2) merge live readers' current values with `closedMetrics` via the new `CustomTaskMetric.mergeWith()` and push the result to the Spark UI accumulators.
- This works correctly in all three execution modes: single partition per task, sequential coalescing (one reader at a time), and concurrent k-way merge (N readers simultaneously).
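A minimal sketch of the two-phase merge, assuming a stub reader and additive metrics; the real code tracks `PartitionIterator`s and pushes results to SQL metric accumulators, and all names here are illustrative:

```python
class StubReader:
    """Hypothetical stand-in for a live PartitionReader exposing its
    current custom-metric values."""
    def __init__(self, metrics):
        self._metrics = metrics

    def current_metrics(self):
        return self._metrics

def merge_custom_metrics(live_readers, closed_metrics):
    # Phase 1: start from the final values of already-closed readers.
    totals = dict(closed_metrics)
    # Phase 2: fold in each live reader's current values (additive
    # merge, matching the CustomTaskMetric.mergeWith default).
    for reader in live_readers:
        for name, value in reader.current_metrics().items():
            totals[name] = totals.get(name, 0) + value
    return totals

live = [StubReader({"rowsRead": 7}), StubReader({"rowsRead": 3})]
merged = merge_custom_metrics(live, {"rowsRead": 90})
assert merged == {"rowsRead": 100}   # no reader's contribution is lost
```

The key difference from the old behavior: instead of the last reader's snapshot overwriting everything, every reader (closed or live) contributes exactly once per reporting pass.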

#### CustomTaskMetric.mergeWith

A new default method `mergeWith(CustomTaskMetric other)` is added to `CustomTaskMetric`. The default implementation sums the two values, which is correct for count-type metrics. Data sources with non-additive metrics (e.g. max, average) should override this method. This replaces the previously proposed `PartitionReader.initMetricsValues` mechanism (which threaded prior metric values into the next reader's constructor) with a cleaner, pull-based merge at reporting time. `PartitionReader.initMetricsValues` becomes deprecated as it is no longer needed.
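The semantics can be illustrated in Python (the real `mergeWith` is a Java default method on the `CustomTaskMetric` interface); `MaxMetric` is a hypothetical override for a non-additive metric:

```python
class CustomTaskMetric:
    """Sketch of the interface: default merge is addition."""
    def __init__(self, name, value):
        self.name, self.value = name, value

    def merge_with(self, other):
        # Default: sum the two values (correct for count-type metrics).
        return type(self)(self.name, self.value + other.value)

class MaxMetric(CustomTaskMetric):
    """Hypothetical non-additive metric that overrides the default."""
    def merge_with(self, other):
        return MaxMetric(self.name, max(self.value, other.value))

rows = CustomTaskMetric("rowsRead", 10).merge_with(
    CustomTaskMetric("rowsRead", 5))
peak = MaxMetric("peakMemoryBytes", 10).merge_with(
    MaxMetric("peakMemoryBytes", 5))
assert (rows.value, peak.value) == (15, 10)
```

Pulling the merge to reporting time means readers stay ignorant of each other, which is what makes the approach compose with concurrent readers.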

### Why are the changes needed?

Without this fix, `GroupPartitionsExec` always discards ordering when coalescing, forcing `EnsureRequirements` to inject an extra `SortExec` before `SortMergeJoinExec` even when the data is already sorted by the join key within each partition. With `SortedMergeCoalescedRDD`, the full child ordering is preserved end-to-end, eliminating these redundant sorts and making storage-partitioned joins with ordering fully efficient.

`spark.sql.sources.v2.bucketing.preserveKeyOrderingOnCoalesce.enabled` (introduced earlier) preserves only sort orders over partition key expressions, which remain constant within a merged partition. This PR goes further: by performing a k-way merge, the full `outputOrdering` — including secondary sort columns beyond the partition key — is preserved end-to-end.

### Does this PR introduce _any_ user-facing change?

Yes. A new SQL configuration is added:
- `spark.sql.sources.v2.bucketing.preserveOrderingOnCoalesce.enabled` (default: `false`): when enabled, `GroupPartitionsExec` uses a k-way merge to coalesce partitions while preserving the full child ordering, avoiding extra sort steps for operations like `SortMergeJoin`.

### How was this patch tested?

- **`SortedMergeCoalescedRDDSuite`**: unit tests for the new RDD covering correctness of the k-way merge, empty partitions, single partition, and ordering guarantees.
- **`GroupPartitionsExecSuite`**: unit tests covering all four branches of `outputOrdering` (no coalescing; k-way merge enabled; key-expression ordering only; reducers present).
- **`KeyGroupedPartitioningSuite`**: SQL-level tests verifying that no extra `SortExec` is injected when `SortedMergeCoalescedRDD` is used, and a new test (`SPARK-55715: Custom metrics of sorted-merge coalesced partitions`) that verifies per-scan custom metrics are correctly reported across concurrent readers in the k-way merge case.
- **`BufferedRowsReader` hardening**: the test-framework reader in `InMemoryBaseTable` now tracks a `closed` flag and throws `IllegalStateException` for reads, double-closes, or metric fetches on a closed reader. This ensures future tests catch reader lifecycle bugs that were previously hidden by the no-op `close()`.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6

@peter-toth peter-toth force-pushed the SPARK-55715-keep-outputordering-when-grouping-partitions branch from 87b70e7 to c6e685a Compare March 31, 2026 19:19
…canPartitioningAndOrdering

### What changes were proposed in this pull request?

`V2ScanPartitioningAndOrdering.ordering` was calling `V2ExpressionUtils.toCatalystOrdering` without the `funCatalog` argument. This meant that function-based sort expressions reported by a data source via `SupportsReportOrdering` (e.g. transform functions like `bucket(n, col)`) could not be resolved against the function catalog and would be silently dropped.

The fix passes `relation.funCatalog` as the third argument, consistent with how `toCatalystOpt` is already called in the `partitioning` rule of the same object.

### Why are the changes needed?

Without the function catalog, sort orders involving catalog functions reported by `SupportsReportOrdering` are not resolved, causing them to be ignored by the planner even when the data source correctly reports them.

### Does this PR introduce _any_ user-facing change?

Yes. Data sources implementing `SupportsReportOrdering` with function-based sort expressions that require the function catalog will now have those sort orders correctly recognized by Spark, potentially eliminating unnecessary sort operations.

### How was this patch tested?

`WriteDistributionAndOrderingSuite` already covers this because `InMemoryBaseTable` is updated to use `InMemoryBatchScanWithOrdering` (a new inner class implementing `SupportsReportOrdering`) when a table ordering is configured.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6
@peter-toth peter-toth force-pushed the SPARK-55715-keep-outputordering-when-grouping-partitions branch from c6e685a to 521449b Compare April 1, 2026 12:04
@peter-toth peter-toth force-pushed the SPARK-55715-keep-outputordering-when-grouping-partitions branch from 521449b to 1c9e6ff Compare April 1, 2026 12:49

peter-toth commented Apr 1, 2026

This PR is draft as it requires the changes from #55137. Once that PR is merged I will rebase on master.

