
Spark 4.1: Implement SupportsReportOrdering DSv2 API#14948

Open
anuragmantri wants to merge 1 commit into apache:main from anuragmantri:supports-report-ordering

Conversation

@anuragmantri (Contributor) commented Dec 31, 2025

This PR implements the Spark DSv2 SupportsReportOrdering API, enabling Spark's sort elimination optimization when reading partitioned Iceberg tables that have a defined sort order and whose files were written respecting that order.

Sort order reporting can be enabled via a new flag (default: false):

SET spark.sql.iceberg.planning.preserve-data-ordering = true;

Implementation summary:

  1. Ordering Validation: SortOrderAnalyzer validates two conditions before SparkPartitioningAwareScan.outputOrdering() reports ordering to Spark:

    • all files carry the current sort order ID and
    • each partition key maps to exactly one task group.

    If either condition fails, no ordering is reported.

  2. Merging Sorted Files: Since sorted files within a partition may have overlapping ranges, MergingSortedRowDataReader merges rows from multiple sorted files using a k-way merge with a min-heap. The read schema is augmented with any sort key columns absent from Spark's projection so that the comparator can always access sort key fields, even when they are pruned by Spark's column-pruning optimizer.

  3. Row Comparison: Uses the SortOrderComparators.forSchema() Iceberg API with InternalRowWrapper to bridge Spark InternalRow to Iceberg StructLike. This correctly handles all transform types (identity, bucket, truncate), ASC/DESC directions, and null ordering.
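
To make the merge step concrete, here is a minimal, self-contained sketch of a min-heap k-way merge over locally sorted streams. This is an illustration only: the class and method names are hypothetical stand-ins, not the actual MergingSortedRowDataReader internals, and plain iterators of integers stand in for per-file row readers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {
  // Pairs the next value from a stream with the stream it came from.
  record Head<T>(T value, Iterator<T> source) {}

  static <T> List<T> merge(List<Iterator<T>> inputs, Comparator<T> cmp) {
    PriorityQueue<Head<T>> heap =
        new PriorityQueue<>((a, b) -> cmp.compare(a.value(), b.value()));
    for (Iterator<T> it : inputs) {
      if (it.hasNext()) {
        heap.add(new Head<>(it.next(), it));
      }
    }
    List<T> merged = new ArrayList<>();
    while (!heap.isEmpty()) {
      Head<T> head = heap.poll(); // smallest current value across all streams
      merged.add(head.value());
      if (head.source().hasNext()) { // refill from the stream just consumed
        heap.add(new Head<>(head.source().next(), head.source()));
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    // Three "files", each locally sorted, with overlapping ranges.
    List<Iterator<Integer>> files = List.of(
        List.of(1, 4, 7).iterator(),
        List.of(2, 5, 8).iterator(),
        List.of(3, 6, 9).iterator());
    System.out.println(merge(files, Comparator.naturalOrder()));
  }
}
```

Because each input is already sorted, the heap only ever holds one entry per stream, so each row is emitted with O(log k) work for k files rather than re-sorting the whole partition.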

Constraints

  1. When preserve-data-ordering is enabled, bin-packing of large partitions is disabled: all files within a partition are placed into a single Spark task. This is a known limitation of the current KeyGroupedPartitioning-based approach and is expected to be addressed in a future improvement (SPARK-56241).
  2. Vectorized reads are disabled for partitions with more than one file, since the k-way merge is row-based.
  3. This implementation only reports sort order if files are sorted in the current table sort order. We could extend this to report any historical sort order later.
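
For readers unfamiliar with the comparator semantics in the Row Comparison step above, the ASC/DESC and null-ordering behavior can be illustrated with plain java.util.Comparator building blocks. The enum and method names below are hypothetical, for illustration only; they are not the Iceberg SortOrderComparators API:

```java
import java.util.Comparator;

public class SortKeyComparator {
  enum Direction { ASC, DESC }
  enum NullOrder { NULLS_FIRST, NULLS_LAST }

  // Builds a per-field comparator honoring sort direction and null ordering.
  static <T extends Comparable<T>> Comparator<T> forField(Direction dir, NullOrder nulls) {
    Comparator<T> base = Comparator.naturalOrder();
    if (dir == Direction.DESC) {
      base = base.reversed();
    }
    // Null placement is specified independently of the field direction.
    return nulls == NullOrder.NULLS_FIRST
        ? Comparator.nullsFirst(base)
        : Comparator.nullsLast(base);
  }

  public static void main(String[] args) {
    Comparator<Integer> cmp = forField(Direction.DESC, NullOrder.NULLS_LAST);
    System.out.println(cmp.compare(5, 3) < 0);    // DESC: 5 sorts before 3
    System.out.println(cmp.compare(null, 3) > 0); // NULLS_LAST: null sorts after 3
  }
}
```

A full sort-order comparator is then the lexicographic composition of such per-field comparators over the (transformed) sort key columns.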

Sort elimination examples

  1. For MERGE INTO

Without sort order reporting:

CommandResult <empty>
   +- WriteDelta
      +- *(4) Sort [_spec_id#287 ASC NULLS FIRST, _partition#288 ASC NULLS FIRST, _file#285 ASC NULLS FIRST, _pos#286L ASC NULLS FIRST, static_invoke(org.apache.iceberg.spark.functions.BucketFunction$BucketInt.invoke(4, c1#282)) ASC NULLS FIRST, c1#282 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(_spec_id#287, _partition#288, static_invoke(org.apache.iceberg.spark.functions.BucketFunction$BucketInt.invoke(4, c1#282)), 200), REBALANCE_PARTITIONS_BY_COL, 402653184, [plan_id=394]
            +- MergeRowsExec[__row_operation#281, c1#282, c2#283, c3#284, _file#285, _pos#286L, _spec_id#287, _partition#288]
               +- *(3) SortMergeJoin [c1#257], [c1#260], RightOuter
                  :- *(1) Sort [c1#257 ASC NULLS FIRST], false, 0
                  :  +- *(1) Filter isnotnull(c1#257)
                  :     +- *(1) Project [c1#257, _file#265, _pos#266L, _spec_id#263, _partition#264, true AS __row_from_target#273, monotonically_increasing_id() AS __row_id#274L]
                  :        +- *(1) ColumnarToRow
                  :           +- BatchScan testhadoop.default.table[c1#257, _file#265, _pos#266L, _spec_id#263, _partition#264] testhadoop.default.table (branch=null) [filters=, groupedBy=c1_bucket] RuntimeFilters: []
                  +- *(2) Sort [c1#260 ASC NULLS FIRST], false, 0
                     +- *(2) ColumnarToRow
                        +- BatchScan testhadoop.default.table_source[c1#260, c2#261, c3#262] testhadoop.default.table_source (branch=null) [filters=, groupedBy=c1_bucket] RuntimeFilters: []

With sort order reporting:

CommandResult <empty>
   +- WriteDelta
      +- *(4) Sort [_spec_id#80 ASC NULLS FIRST, _partition#81 ASC NULLS FIRST, _file#78 ASC NULLS FIRST, _pos#79L ASC NULLS FIRST, static_invoke(org.apache.iceberg.spark.functions.BucketFunction$BucketInt.invoke(4, c1#75)) ASC NULLS FIRST, c1#75 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(_spec_id#80, _partition#81, static_invoke(org.apache.iceberg.spark.functions.BucketFunction$BucketInt.invoke(4, c1#75)), 200), REBALANCE_PARTITIONS_BY_COL, 402653184, [plan_id=255]
            +- MergeRowsExec[__row_operation#74, c1#75, c2#76, c3#77, _file#78, _pos#79L, _spec_id#80, _partition#81]
               +- *(3) SortMergeJoin [c1#50], [c1#53], RightOuter
                  :- *(1) Filter isnotnull(c1#50)
                  :  +- *(1) Project [c1#50, _file#58, _pos#59L, _spec_id#56, _partition#57, true AS __row_from_target#66, monotonically_increasing_id() AS __row_id#67L]
                  :     +- *(1) ColumnarToRow
                  :        +- BatchScan testhadoop.default.table[c1#50, _file#58, _pos#59L, _spec_id#56, _partition#57] testhadoop.default.table (branch=null) [filters=, groupedBy=c1_bucket] RuntimeFilters: []
                  +- *(2) Project [c1#53, c2#54, c3#55]
                     +- BatchScan testhadoop.default.table_source[c1#53, c2#54, c3#55] testhadoop.default.table_source (branch=null) [filters=, groupedBy=c1_bucket] RuntimeFilters: []
  2. For JOIN

Without sort order reporting:

*(3) Project [c1#118, c2#119, c2#122]
+- *(3) SortMergeJoin [c1#118], [c1#121], Inner
   :- *(1) Sort [c1#118 ASC NULLS FIRST], false, 0
   :  +- *(1) ColumnarToRow
   :     +- BatchScan testhadoop.default.table[c1#118, c2#119] testhadoop.default.table (branch=null) [filters=c1 IS NOT NULL, groupedBy=c1_bucket] RuntimeFilters: []
   +- *(2) Sort [c1#121 ASC NULLS FIRST], false, 0
      +- *(2) ColumnarToRow
         +- BatchScan testhadoop.default.table_source[c1#121, c2#122] testhadoop.default.table_source (branch=null) [filters=c1 IS NOT NULL, groupedBy=c1_bucket] RuntimeFilters: []

With sort order reporting:

*(3) Project [c1#36, c2#37, c2#40]
+- *(3) SortMergeJoin [c1#36], [c1#39], Inner
   :- *(1) ColumnarToRow
   :  +- BatchScan testhadoop.default.table[c1#36, c2#37] testhadoop.default.table (branch=null) [filters=c1 IS NOT NULL, groupedBy=c1_bucket] RuntimeFilters: []
   +- *(2) ColumnarToRow
      +- BatchScan testhadoop.default.table_source[c1#39, c2#40] testhadoop.default.table_source (branch=null) [filters=c1 IS NOT NULL, groupedBy=c1_bucket] RuntimeFilters: []

AI Usage: I used Claude Sonnet 4.6 for code generation and writing tests. I manually reviewed the generated code.

@anuragmantri anuragmantri changed the title [WIP] Spark 4.0: Implement SupportsReportOrdering DSv2 API Spark 4.0: Implement SupportsReportOrdering DSv2 API Dec 31, 2025
@anuragmantri anuragmantri force-pushed the supports-report-ordering branch from cc08ff2 to b4fde94 Compare January 21, 2026 22:08
@anuragmantri anuragmantri changed the title Spark 4.0: Implement SupportsReportOrdering DSv2 API Spark 4.1: Implement SupportsReportOrdering DSv2 API Jan 21, 2026
@anuragmantri anuragmantri changed the title Spark 4.1: Implement SupportsReportOrdering DSv2 API [WIP] Spark 4.1: Implement SupportsReportOrdering DSv2 API Jan 21, 2026
@anuragmantri (Contributor Author)

Moved the changes to Spark 4.1 since it is now the latest version. Marked this PR as WIP since prerequisite PR #14683 is still in review.

@peter-toth commented Feb 19, 2026

My concern from the Spark PoV is that unnecessary partition grouping can cause performance degradation. SPARK-55092 tracks the problem, and the apache/spark#53859 / apache/spark#54330 PRs try to fix it.

If this PR disables bin packing then the above PRs won't be able to fix the issue.

> Bin-packing of file scan tasks is disabled when ordering is required since Spark will discard ordering if multiple input partitions exist with the same grouping key.

So I would suggest keeping bin packing and reporting sort order for those packed partitions (i.e. the partitions might not be unique by key, but they are locally sorted), and when partition grouping is actually needed then Spark should merge the sorted partitions with the same key using k-way merge.

@peter-toth commented Feb 19, 2026

As we discussed offline, a long term (after apache/spark#54330) solution could be to improve the new GroupPartitionsExec operator to not only coalesce partitions with the same key, but k-way merge them to keep their sorted order.

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Mar 22, 2026
@anuragmantri (Contributor Author)

This PR is not stale. We are waiting for the prerequisite PR to be merged. I will update this PR after that one is merged.

@anuragmantri anuragmantri force-pushed the supports-report-ordering branch from b4fde94 to dbf3ea9 Compare March 31, 2026 00:47
@anuragmantri anuragmantri changed the title [WIP] Spark 4.1: Implement SupportsReportOrdering DSv2 API Spark 4.1: Implement SupportsReportOrdering DSv2 API Mar 31, 2026
@anuragmantri (Contributor Author)

I rebased the PR after #15150. This is ready for review now.

@RussellSpitzer @aokolnychyi @szehon-ho - Could you take a look please?

@anuragmantri (Contributor Author)

> As we discussed offline, a long term (after apache/spark#54330) solution could be to improve the new GroupPartitionsExec operator to not only coalesce partitions with the same key, but k-way merge them to keep their sorted order.

Thanks @peter-toth, this makes sense. I think this PR is still needed and would still be valuable for tables with decently sized partitions. Since it's gated by a flag, I think it's safe to implement.

@peter-toth commented Mar 31, 2026

> As we discussed offline, a long term (after apache/spark#54330) solution could be to improve the new GroupPartitionsExec operator to not only coalesce partitions with the same key, but k-way merge them to keep their sorted order.
>
> Thanks @peter-toth, this makes sense. I think this PR is still needed and would still be valuable for tables with decently sized partitions. Since it's gated by a flag, I think it's safe to implement.

Absolutely.
FYI apache/spark#54330 has been merged. apache/spark#55116 will do the Spark side k-way merge to keep full ordering, but it requires this PR to report ordering and provide decently sized partitions.
