
Spark 4.1: Implement SupportsReportOrdering DSv2 API#14948

Open
anuragmantri wants to merge 1 commit into apache:main from anuragmantri:supports-report-ordering

Conversation

@anuragmantri (Contributor) commented Dec 31, 2025

This PR implements the Spark DSv2 SupportsReportOrdering API, enabling Spark's sort elimination optimization when reading partitioned Iceberg tables that have a defined sort order and whose files were written respecting that order.

Sort order reporting can be enabled via a new flag (default: false):

SET spark.sql.iceberg.planning.preserve-data-ordering = true;

Implementation summary:

  1. Ordering Validation: SortOrderAnalyzer validates two conditions before SparkPartitioningAwareScan.outputOrdering() reports ordering to Spark:

    • all files carry the current sort order ID and
    • each partition key maps to exactly one task group.

    If either condition fails, no ordering is reported.

  2. Merging Sorted Files: Since sorted files within a partition may have overlapping ranges, MergingSortedRowDataReader merges rows from multiple sorted files using a k-way merge with a min-heap. The read schema is augmented with any sort key columns absent from Spark's projection so that the comparator can always access sort key fields, even when they are pruned by Spark's column-pruning optimizer.

  3. Row Comparison: Uses the SortOrderComparators.forSchema() Iceberg API with InternalRowWrapper to bridge Spark InternalRow to Iceberg StructLike. This correctly handles all transform types (identity, bucket, truncate), ASC/DESC directions, and null ordering.
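
To make the merge step concrete, here is a minimal, self-contained sketch of a min-heap k-way merge over locally sorted streams. This is an illustration only: the class and method names are hypothetical stand-ins, not the actual MergingSortedRowDataReader internals, and plain iterators of integers stand in for per-file row readers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {
  // Pairs the next value from a stream with the stream it came from.
  record Head<T>(T value, Iterator<T> source) {}

  static <T> List<T> merge(List<Iterator<T>> inputs, Comparator<T> cmp) {
    PriorityQueue<Head<T>> heap =
        new PriorityQueue<>((a, b) -> cmp.compare(a.value(), b.value()));
    for (Iterator<T> it : inputs) {
      if (it.hasNext()) {
        heap.add(new Head<>(it.next(), it));
      }
    }
    List<T> merged = new ArrayList<>();
    while (!heap.isEmpty()) {
      Head<T> head = heap.poll(); // smallest current value across all streams
      merged.add(head.value());
      if (head.source().hasNext()) { // refill from the stream just consumed
        heap.add(new Head<>(head.source().next(), head.source()));
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    // Three "files", each locally sorted, with overlapping ranges.
    List<Iterator<Integer>> files = List.of(
        List.of(1, 4, 7).iterator(),
        List.of(2, 5, 8).iterator(),
        List.of(3, 6, 9).iterator());
    System.out.println(merge(files, Comparator.naturalOrder()));
  }
}
```

Because each input is already sorted, the heap only ever holds one entry per stream, so each row is emitted with O(log k) work for k files rather than re-sorting the whole partition.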

Constraints

  1. When preserve-data-ordering is enabled, bin-packing of large partitions is disabled: all files within a partition are placed into a single Spark task. This is a known limitation of the current KeyGroupedPartitioning-based approach and is expected to be addressed in a future improvement (SPARK-56241).
  2. Vectorized reads are disabled for partitions with more than one file, since the k-way merge is row-based.
  3. This implementation only reports sort order if files are sorted in the current table sort order. We could extend this to report any historical sort order later.
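
For readers unfamiliar with the comparator semantics in the Row Comparison step above, the ASC/DESC and null-ordering behavior can be illustrated with plain java.util.Comparator building blocks. The enum and method names below are hypothetical, for illustration only; they are not the Iceberg SortOrderComparators API:

```java
import java.util.Comparator;

public class SortKeyComparator {
  enum Direction { ASC, DESC }
  enum NullOrder { NULLS_FIRST, NULLS_LAST }

  // Builds a per-field comparator honoring sort direction and null ordering.
  static <T extends Comparable<T>> Comparator<T> forField(Direction dir, NullOrder nulls) {
    Comparator<T> base = Comparator.naturalOrder();
    if (dir == Direction.DESC) {
      base = base.reversed();
    }
    // Null placement is specified independently of the field direction.
    return nulls == NullOrder.NULLS_FIRST
        ? Comparator.nullsFirst(base)
        : Comparator.nullsLast(base);
  }

  public static void main(String[] args) {
    Comparator<Integer> cmp = forField(Direction.DESC, NullOrder.NULLS_LAST);
    System.out.println(cmp.compare(5, 3) < 0);    // DESC: 5 sorts before 3
    System.out.println(cmp.compare(null, 3) > 0); // NULLS_LAST: null sorts after 3
  }
}
```

A full sort-order comparator is then the lexicographic composition of such per-field comparators over the (transformed) sort key columns.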

Sort elimination examples

  1. For MERGE INTO

Without sort order reporting:

CommandResult <empty>
   +- WriteDelta
      +- *(4) Sort [_spec_id#287 ASC NULLS FIRST, _partition#288 ASC NULLS FIRST, _file#285 ASC NULLS FIRST, _pos#286L ASC NULLS FIRST, static_invoke(org.apache.iceberg.spark.functions.BucketFunction$BucketInt.invoke(4, c1#282)) ASC NULLS FIRST, c1#282 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(_spec_id#287, _partition#288, static_invoke(org.apache.iceberg.spark.functions.BucketFunction$BucketInt.invoke(4, c1#282)), 200), REBALANCE_PARTITIONS_BY_COL, 402653184, [plan_id=394]
            +- MergeRowsExec[__row_operation#281, c1#282, c2#283, c3#284, _file#285, _pos#286L, _spec_id#287, _partition#288]
               +- *(3) SortMergeJoin [c1#257], [c1#260], RightOuter
                  :- *(1) Sort [c1#257 ASC NULLS FIRST], false, 0
                  :  +- *(1) Filter isnotnull(c1#257)
                  :     +- *(1) Project [c1#257, _file#265, _pos#266L, _spec_id#263, _partition#264, true AS __row_from_target#273, monotonically_increasing_id() AS __row_id#274L]
                  :        +- *(1) ColumnarToRow
                  :           +- BatchScan testhadoop.default.table[c1#257, _file#265, _pos#266L, _spec_id#263, _partition#264] testhadoop.default.table (branch=null) [filters=, groupedBy=c1_bucket] RuntimeFilters: []
                  +- *(2) Sort [c1#260 ASC NULLS FIRST], false, 0
                     +- *(2) ColumnarToRow
                        +- BatchScan testhadoop.default.table_source[c1#260, c2#261, c3#262] testhadoop.default.table_source (branch=null) [filters=, groupedBy=c1_bucket] RuntimeFilters: []

With sort order reporting:

CommandResult <empty>
   +- WriteDelta
      +- *(4) Sort [_spec_id#80 ASC NULLS FIRST, _partition#81 ASC NULLS FIRST, _file#78 ASC NULLS FIRST, _pos#79L ASC NULLS FIRST, static_invoke(org.apache.iceberg.spark.functions.BucketFunction$BucketInt.invoke(4, c1#75)) ASC NULLS FIRST, c1#75 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(_spec_id#80, _partition#81, static_invoke(org.apache.iceberg.spark.functions.BucketFunction$BucketInt.invoke(4, c1#75)), 200), REBALANCE_PARTITIONS_BY_COL, 402653184, [plan_id=255]
            +- MergeRowsExec[__row_operation#74, c1#75, c2#76, c3#77, _file#78, _pos#79L, _spec_id#80, _partition#81]
               +- *(3) SortMergeJoin [c1#50], [c1#53], RightOuter
                  :- *(1) Filter isnotnull(c1#50)
                  :  +- *(1) Project [c1#50, _file#58, _pos#59L, _spec_id#56, _partition#57, true AS __row_from_target#66, monotonically_increasing_id() AS __row_id#67L]
                  :     +- *(1) ColumnarToRow
                  :        +- BatchScan testhadoop.default.table[c1#50, _file#58, _pos#59L, _spec_id#56, _partition#57] testhadoop.default.table (branch=null) [filters=, groupedBy=c1_bucket] RuntimeFilters: []
                  +- *(2) Project [c1#53, c2#54, c3#55]
                     +- BatchScan testhadoop.default.table_source[c1#53, c2#54, c3#55] testhadoop.default.table_source (branch=null) [filters=, groupedBy=c1_bucket] RuntimeFilters: []
  2. For JOIN

Without sort order reporting:

*(3) Project [c1#118, c2#119, c2#122]
+- *(3) SortMergeJoin [c1#118], [c1#121], Inner
   :- *(1) Sort [c1#118 ASC NULLS FIRST], false, 0
   :  +- *(1) ColumnarToRow
   :     +- BatchScan testhadoop.default.table[c1#118, c2#119] testhadoop.default.table (branch=null) [filters=c1 IS NOT NULL, groupedBy=c1_bucket] RuntimeFilters: []
   +- *(2) Sort [c1#121 ASC NULLS FIRST], false, 0
      +- *(2) ColumnarToRow
         +- BatchScan testhadoop.default.table_source[c1#121, c2#122] testhadoop.default.table_source (branch=null) [filters=c1 IS NOT NULL, groupedBy=c1_bucket] RuntimeFilters: []

With sort order reporting:

*(3) Project [c1#36, c2#37, c2#40]
+- *(3) SortMergeJoin [c1#36], [c1#39], Inner
   :- *(1) ColumnarToRow
   :  +- BatchScan testhadoop.default.table[c1#36, c2#37] testhadoop.default.table (branch=null) [filters=c1 IS NOT NULL, groupedBy=c1_bucket] RuntimeFilters: []
   +- *(2) ColumnarToRow
      +- BatchScan testhadoop.default.table_source[c1#39, c2#40] testhadoop.default.table_source (branch=null) [filters=c1 IS NOT NULL, groupedBy=c1_bucket] RuntimeFilters: []

AI Usage: I used Claude Sonnet 4.6 for code generation and writing tests. I manually reviewed the generated code.

@anuragmantri anuragmantri changed the title [WIP] Spark 4.0: Implement SupportsReportOrdering DSv2 API Spark 4.0: Implement SupportsReportOrdering DSv2 API Dec 31, 2025
@anuragmantri anuragmantri force-pushed the supports-report-ordering branch from cc08ff2 to b4fde94 Compare January 21, 2026 22:08
@anuragmantri anuragmantri changed the title Spark 4.0: Implement SupportsReportOrdering DSv2 API Spark 4.1: Implement SupportsReportOrdering DSv2 API Jan 21, 2026
@anuragmantri anuragmantri changed the title Spark 4.1: Implement SupportsReportOrdering DSv2 API [WIP] Spark 4.1: Implement SupportsReportOrdering DSv2 API Jan 21, 2026
@anuragmantri (Contributor Author)

Moved the changes to Spark 4.1 since it is now the latest version. Marked this PR as WIP since prerequisite PR #14683 is still in review.

@peter-toth commented Feb 19, 2026

My concern from the Spark PoV is that unnecessary partition grouping can cause performance degradation. SPARK-55092 tracks the problem, and the apache/spark#53859 / apache/spark#54330 PRs try to fix it.

If this PR disables bin packing then the above PRs won't be able to fix the issue.

> Bin-packing of file scan tasks is disabled when ordering is required since Spark will discard ordering if multiple input partitions exist with the same grouping key.

So I would suggest keeping bin packing and reporting sort order for those packed partitions (i.e. the partitions might not be unique by key, but they are locally sorted), and when partition grouping is actually needed then Spark should merge the sorted partitions with the same key using k-way merge.

@peter-toth commented Feb 19, 2026

As we discussed offline, a long term (after apache/spark#54330) solution could be to improve the new GroupPartitionsExec operator to not only coalesce partitions with the same key, but k-way merge them to keep their sorted order.

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Mar 22, 2026
@anuragmantri (Contributor Author)

This PR is not stale. We are waiting for the prerequisite PR to be merged. I will update this PR after that one is merged.

@anuragmantri anuragmantri force-pushed the supports-report-ordering branch from b4fde94 to dbf3ea9 Compare March 31, 2026 00:47
@anuragmantri anuragmantri changed the title [WIP] Spark 4.1: Implement SupportsReportOrdering DSv2 API Spark 4.1: Implement SupportsReportOrdering DSv2 API Mar 31, 2026
@anuragmantri (Contributor Author)

I rebased the PR after #15150. This is ready for review now.

@RussellSpitzer @aokolnychyi @szehon-ho - Could you take a look please?

@anuragmantri (Contributor Author)

> As we discussed offline, a long term (after apache/spark#54330) solution could be to improve the new GroupPartitionsExec operator to not only coalesce partitions with the same key, but k-way merge them to keep their sorted order.

Thanks @peter-toth, this makes sense. I think this PR is still needed and would still be valuable for tables with decently sized partitions. Since it's gated by a flag, I think it's safe to implement.

@peter-toth commented Mar 31, 2026

> As we discussed offline, a long term (after apache/spark#54330) solution could be to improve the new GroupPartitionsExec operator to not only coalesce partitions with the same key, but k-way merge them to keep their sorted order.
>
> Thanks @peter-toth, this makes sense. I think this PR is still needed and would still be valuable for tables with decently sized partitions. Since it's gated by a flag, I think it's safe to implement.

Absolutely.
FYI apache/spark#54330 has been merged. apache/spark#55116 will do the Spark side k-way merge to keep full ordering, but it requires this PR to report ordering and provide decently sized partitions.
