feat(datafusion): parallel file scanning with eager task bucketing #43
Merged
phillipleblanc merged 1 commit on Jun 15, 2026
Port of apache#2298 onto the spiceai-0.9.0 fork. IcebergTableProvider::scan() now plans files eagerly and distributes FileScanTasks into min(target_partitions, n_files) buckets, one bucket per DataFusion partition, so file reads are scheduled concurrently instead of streaming through a single UnknownPartitioning(1) partition. When the table is identity-partitioned (single spec, supported column types, partition columns projected), the scan declares Partitioning::Hash so downstream joins/aggregates can skip a RepartitionExec.

- TableScan::to_arrow_from_tasks: replays pre-collected FileScanTasks through the Arrow reader; preserves the spice fork's ArrowReaderBuilder (file_io, runtime) signature and row-selection config.
- IcebergTableScan gains new_with_tasks (eager) alongside new (lazy, used by IcebergStaticTableProvider); execute(i) streams buckets[i]. Constructors are now pub; with_new_children now errors on children.
- New table/bucketing.rs: identity-hash bucketing via REPARTITION_RANDOM_STATE + create_hashes, with a fallback to hashing data_file_path.
- Spice limit pushdown preserved: with_limit is threaded into the planning builder and build_table_scan.
- Drop the unused convert_filters_to_predicate re-export.
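The bucket-count rule and the path-based fallback described above can be illustrated with a small standalone sketch. It is not the code from table/bucketing.rs: the `Task` struct and both function names are hypothetical, the standard library's `DefaultHasher` stands in for DataFusion's REPARTITION_RANDOM_STATE + create_hashes, and the guard that keeps one empty bucket for an empty table is an assumption.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a planned file scan task; only the path matters here.
#[derive(Debug, Clone, PartialEq)]
struct Task {
    data_file_path: String,
}

/// min(target_partitions, n_files), clamped to at least 1 so that an empty
/// table still yields a single (empty) bucket. The clamp is an assumption.
fn bucket_count(target_partitions: usize, n_files: usize) -> usize {
    target_partitions.min(n_files).max(1)
}

/// Fallback bucketing: hash each task's data file path and place the task in
/// bucket `hash % n`. Uses std's DefaultHasher purely for illustration.
fn bucket_by_path(tasks: Vec<Task>, target_partitions: usize) -> Vec<Vec<Task>> {
    let n = bucket_count(target_partitions, tasks.len());
    let mut buckets: Vec<Vec<Task>> = vec![Vec::new(); n];
    for task in tasks {
        let mut hasher = DefaultHasher::new();
        task.data_file_path.hash(&mut hasher);
        let idx = (hasher.finish() % n as u64) as usize;
        buckets[idx].push(task);
    }
    buckets
}
```

Because placement is hash-based, individual buckets may end up uneven or empty; the sketch only guarantees that every task lands in exactly one bucket and that the bucket count never exceeds the number of files.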
Pull request overview
This PR updates the iceberg-datafusion integration to plan FileScanTasks eagerly during TableProvider::scan(), bucket them into multiple DataFusion partitions for parallel file reads, and (when safe) declare Partitioning::Hash for identity-partitioned tables to avoid redundant repartition steps downstream.
Changes:
- Add eager file planning + bucketing in IcebergTableProvider::scan(), with optional Partitioning::Hash for eligible identity-partitioned tables.
- Add TableScan::to_arrow_from_tasks and wire IcebergTableScan to replay pre-planned task buckets per partition.
- Update sqllogictest + integration tests to reflect new scan display and partition behavior.
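The eligibility check behind the optional Partitioning::Hash declaration could look roughly like the sketch below. All types and names here are hypothetical stand-ins rather than the iceberg-datafusion API; the three conditions (single spec, supported column types, partition columns projected) come from the PR description, while treating an unpartitioned table as ineligible is an assumption.

```rust
/// Hypothetical summary of one partition field of the table's spec.
struct PartitionField {
    source_column: String,
    is_identity: bool,
    type_supported: bool,
}

/// Hypothetical summary of the table metadata the decision needs.
struct TableInfo {
    spec_count: usize,
    fields: Vec<PartitionField>,
}

/// Simplified stand-in for the partitioning a scan would declare.
#[derive(Debug, PartialEq)]
enum DeclaredPartitioning {
    Hash { columns: Vec<String>, partitions: usize },
    Unknown(usize),
}

/// Declare hash partitioning only when every condition holds; otherwise keep
/// the same number of buckets but declare the partitioning as unknown.
fn declare_partitioning(
    table: &TableInfo,
    projected: &[&str],
    n_buckets: usize,
) -> DeclaredPartitioning {
    let eligible = table.spec_count == 1
        && !table.fields.is_empty() // assumption: unpartitioned tables are not eligible
        && table.fields.iter().all(|f| {
            f.is_identity
                && f.type_supported
                && projected.iter().any(|c| *c == f.source_column)
        });
    if eligible {
        let columns = table.fields.iter().map(|f| f.source_column.clone()).collect();
        DeclaredPartitioning::Hash { columns, partitions: n_buckets }
    } else {
        DeclaredPartitioning::Unknown(n_buckets)
    }
}
```

The point of the conservative check is that a downstream operator may skip its own repartitioning only if the declared hash layout is actually true of the data, so any doubt should resolve to the unknown variant.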
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| crates/sqllogictest/testdata/slts/df_test/timestamp_predicate_pushdown.slt | Updates physical plan expectations to include buckets/file_count in scan display. |
| crates/sqllogictest/testdata/slts/df_test/like_predicate_pushdown.slt | Updates expected input_partitions and scan display for bucketed scans. |
| crates/sqllogictest/testdata/slts/df_test/boolean_predicate_pushdown.slt | Updates scan display expectations to include buckets/file_count. |
| crates/sqllogictest/testdata/slts/df_test/binary_predicate_pushdown.slt | Updates empty-table scan display expectations (file_count:[0]). |
| crates/sqllogictest/testdata/slts/df_test/basic_queries.slt | Updates scan display to include buckets/file_count alongside limit pushdown. |
| crates/integrations/datafusion/tests/integration_datafusion_test.rs | Adjusts execution partition index to match new partitioning behavior. |
| crates/integrations/datafusion/src/table/mod.rs | Implements eager planning + bucketing in IcebergTableProvider::scan() and adds bucketing-focused tests. |
| crates/integrations/datafusion/src/table/bucketing.rs | New module implementing identity-hash bucketing (DataFusion-compatible) with fallback hashing. |
| crates/integrations/datafusion/src/physical_plan/scan.rs | Adds eager multi-partition scan mode and task replay via to_arrow_from_tasks; improves display output. |
| crates/integrations/datafusion/src/physical_plan/mod.rs | Removes unused re-export. |
| crates/iceberg/src/scan/mod.rs | Adds TableScan::to_arrow_from_tasks and refactors to_arrow() to call it. |
This was referenced Jun 15, 2026
phillipleblanc added a commit to spiceai/spiceai that referenced this pull request on Jun 15, 2026
Fork PR spiceai/iceberg-rust#43 merged; re-pin from the branch head (9e6e1a00) to the merge commit b8fc79d9 on spiceai-0.9.1-df-53.
pull Bot pushed a commit to TheRakeshPurohit/spiceai that referenced this pull request on Jun 15, 2026
… fork branch naming (spiceai#11328)

* chore(deps): bump iceberg-rust to parallel file scanning fork

  Bumps the spiceai/iceberg-rust pin from e519b221 to b652de2e, picking up the eager task-bucketing parallel-scan change (spiceai/iceberg-rust#42, port of apache/iceberg-rust#2298). IcebergTableProvider::scan() now plans files eagerly and distributes FileScanTasks across min(target_partitions, n_files) DataFusion partitions, so Iceberg file reads are scheduled concurrently instead of through a single partition. Identity-partitioned tables additionally declare Partitioning::Hash so downstream joins/aggregates can skip a RepartitionExec.

  No Spice code changes are required: Spice consumes the provider only via IcebergTableProvider::try_new, and the change is internal to the scan path. The fork preserves Spice's existing iceberg limit-pushdown.

* chore(deps): adopt spiceai-<iceberg>-df-<df> fork naming; re-pin to spiceai-0.9.1-df-53

  Re-pins iceberg-rust to the renamed `spiceai-0.9.1-df-53` branch (rev 9e6e1a00) and bumps the version requirement to `=0.9.1` to match the corrected crate version. Adds docs/dev/fork-branch-naming.md documenting the `spiceai-<iceberg>-df-<datafusion>` convention.

  The DF53 fork line was previously named `spiceai-0.9.0` but actually tracks a post-0.9.1 `main` snapshot; the rename + version correction make the branch name reflect its real contents.

* chore(deps): re-pin iceberg-rust to merged spiceai-0.9.1-df-53 commit

  Fork PR spiceai/iceberg-rust#43 merged; re-pin from the branch head (9e6e1a00) to the merge commit b8fc79d9 on spiceai-0.9.1-df-53.

* chore(deps): annotate iceberg pins with fork branch; drop naming doc

  Address Copilot review on spiceai#11328: add `# branch: spiceai-0.9.1-df-53` to the iceberg-rust pins (matching the convention used by delta_kernel/duckdb/arrow). Remove docs/dev/fork-branch-naming.md per review feedback (no separate doc).
github-actions Bot pushed a commit to spiceai/spiceai that referenced this pull request on Jun 16, 2026
… fork branch naming (#11328)
Ports apache/iceberg-rust#2298 onto the DF53 fork line.

IcebergTableProvider::scan() plans files eagerly and distributes FileScanTasks into min(target_partitions, n_files) buckets, one bucket per DataFusion partition, so file reads are scheduled concurrently instead of through a single UnknownPartitioning(1) partition. Identity-partitioned tables (single spec, supported column types, partition cols projected) declare Partitioning::Hash so downstream joins/aggregates can skip a RepartitionExec; everything else falls back to UnknownPartitioning(N) while still bucketing.

Key changes: TableScan::to_arrow_from_tasks (replays pre-collected tasks; keeps this fork line's ArrowReaderBuilder::new(file_io, runtime) + row-selection config); IcebergTableScan::new_with_tasks (eager) alongside new (lazy); new table/bucketing.rs (identity-hash via REPARTITION_RANDOM_STATE + create_hashes, fallback to data_file_path); dropped the unused convert_filters_to_predicate re-export. The fork's limit-pushdown (with_limit) is preserved.

Validation: iceberg-datafusion 86 lib tests (incl. 6 new bucketing tests + existing limit tests), all 9 df_* sqllogictest schedules, integration_datafusion_test::test_provider_plan_stream_schema; clippy + fmt clean. Builds inside the Spice runtime against the Spice DF53 fork (cargo check -p data_components).
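To make the eager/lazy split concrete, here is a minimal sketch of a scan node with two constructors, where the eager variant serves one pre-planned bucket per partition index. The `Scan`, `ScanMode`, and `Task` types are hypothetical and only mirror the shape of new, new_with_tasks, and execute(i) named above; the real IcebergTableScan streams Arrow record batches rather than returning a list of tasks.

```rust
/// Hypothetical stand-in for a planned file scan task.
#[derive(Debug, Clone, PartialEq)]
struct Task {
    data_file_path: String,
}

/// How the scan obtains its tasks.
enum ScanMode {
    /// Plan files only when executed; exposes a single partition.
    Lazy { plan: fn() -> Vec<Task> },
    /// Tasks were planned up front and split into one bucket per partition.
    Eager { buckets: Vec<Vec<Task>> },
}

struct Scan {
    mode: ScanMode,
}

impl Scan {
    /// Lazy constructor: defers planning to execution time.
    fn new(plan: fn() -> Vec<Task>) -> Self {
        Scan { mode: ScanMode::Lazy { plan } }
    }

    /// Eager constructor: takes buckets that were planned during scan().
    fn new_with_tasks(buckets: Vec<Vec<Task>>) -> Self {
        Scan { mode: ScanMode::Eager { buckets } }
    }

    fn partition_count(&self) -> usize {
        match &self.mode {
            ScanMode::Lazy { .. } => 1,
            ScanMode::Eager { buckets } => buckets.len(),
        }
    }

    /// Return the tasks that partition `i` would read, or an error when the
    /// index is out of range for the current mode.
    fn execute(&self, i: usize) -> Result<Vec<Task>, String> {
        match &self.mode {
            ScanMode::Lazy { plan } if i == 0 => Ok(plan()),
            ScanMode::Lazy { .. } => Err(format!("lazy scan has no partition {i}")),
            ScanMode::Eager { buckets } => buckets
                .get(i)
                .cloned()
                .ok_or_else(|| format!("no bucket for partition {i}")),
        }
    }
}
```

In this shape, the number of partitions the engine sees is fixed by the bucket list at construction time, which is what allows each bucket to be scheduled as an independent partition.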