feat(datafusion): parallel file scanning with eager task bucketing by phillipleblanc · Pull Request #43 · spiceai/iceberg-rust

phillipleblanc · 2026-06-15T02:34:33Z

Ports apache/iceberg-rust#2298 onto the DF53 fork line.

IcebergTableProvider::scan() plans files eagerly and distributes FileScanTasks into min(target_partitions, n_files) buckets — one bucket per DataFusion partition — so file reads are scheduled concurrently instead of through a single UnknownPartitioning(1) partition. Identity-partitioned tables (single spec, supported column types, partition cols projected) declare Partitioning::Hash so downstream joins/aggregates can skip a RepartitionExec; everything else falls back to UnknownPartitioning(N) while still bucketing.
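The bucket arithmetic described above can be sketched with the standard library alone. This is a minimal illustration, not the fork's code: `Task`, `bucket_count`, and `bucket_tasks` are hypothetical names, `DefaultHasher` stands in for whatever hash the real module uses, and clamping to at least one bucket is an assumption suggested by the `file_count:[0]` display for empty tables.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a planned file scan task; only the path matters here.
#[derive(Debug, Clone, PartialEq)]
pub struct Task {
    pub data_file_path: String,
}

/// Number of buckets: min(target_partitions, n_files), clamped to at least one
/// (assumption: an empty table still yields a single empty partition).
pub fn bucket_count(target_partitions: usize, n_files: usize) -> usize {
    target_partitions.min(n_files).max(1)
}

/// Distribute tasks into buckets by hashing the data file path.
/// DefaultHasher is a placeholder, not the hash used by the actual PR.
pub fn bucket_tasks(tasks: Vec<Task>, target_partitions: usize) -> Vec<Vec<Task>> {
    let n = bucket_count(target_partitions, tasks.len());
    let mut buckets: Vec<Vec<Task>> = vec![Vec::new(); n];
    for task in tasks {
        let mut hasher = DefaultHasher::new();
        task.data_file_path.hash(&mut hasher);
        // Map the 64-bit hash onto a bucket index in 0..n.
        let idx = (hasher.finish() % n as u64) as usize;
        buckets[idx].push(task);
    }
    buckets
}
```

With 10 files and `target_partitions = 4` this yields 4 buckets; with `target_partitions = 64` it yields 10. Hash-based assignment does not guarantee evenly filled buckets, so some may stay empty.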

Key changes: TableScan::to_arrow_from_tasks (replays pre-collected tasks; keeps this line's ArrowReaderBuilder::new(file_io, runtime) + row-selection config); IcebergTableScan::new_with_tasks (eager) alongside new (lazy); new table/bucketing.rs (identity-hash via REPARTITION_RANDOM_STATE + create_hashes, fallback to data_file_path); dropped the unused convert_filters_to_predicate re-export. The fork's limit-pushdown (with_limit) is preserved.
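As a rough sketch of the key selection in `table/bucketing.rs`: tasks that carry identity partition values are bucketed by those values, and everything else falls back to the data file path. The names below (`PlannedTask`, `hash_of`, `bucket_index`) are hypothetical, and `DefaultHasher` is only a placeholder. The PR hashes with `REPARTITION_RANDOM_STATE` + `create_hashes` so that bucket placement agrees with DataFusion's own hash repartitioning, which this stand-in would not.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical task shape: identity partition values are present only when
/// the table qualifies for identity-hash bucketing.
pub struct PlannedTask {
    pub data_file_path: String,
    pub identity_partition_values: Option<Vec<String>>,
}

/// Hash any hashable value with the placeholder hasher.
fn hash_of<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Pick the bucket for a task: partition values when available,
/// otherwise the data file path.
pub fn bucket_index(task: &PlannedTask, n_buckets: usize) -> usize {
    assert!(n_buckets > 0, "at least one bucket is required");
    let hash = match &task.identity_partition_values {
        Some(values) => hash_of(values),
        None => hash_of(task.data_file_path.as_str()),
    };
    (hash % n_buckets as u64) as usize
}
```

The property that matters is that two files with the same partition values always land in the same bucket regardless of their paths, so every row for a given partition key lives in a single DataFusion partition.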

Validation: iceberg-datafusion 86 lib tests (incl. 6 new bucketing tests + existing limit tests), all 9 df_* sqllogictest schedules, integration_datafusion_test::test_provider_plan_stream_schema; clippy + fmt clean. Builds inside the Spice runtime against the Spice DF53 fork (cargo check -p data_components).

Base branch spiceai-0.9.1-df-53 is the renamed DF53 line (was spiceai-0.9.0), following the spiceai-<iceberg>-df-<datafusion> convention; its first commit corrects the stale version field to 0.9.1.

Port of apache#2298 onto the spiceai-0.9.0 fork.

IcebergTableProvider::scan() now plans files eagerly and distributes FileScanTasks into min(target_partitions, n_files) buckets, one bucket per DataFusion partition, so file reads are scheduled concurrently instead of streaming through a single UnknownPartitioning(1) partition. When the table is identity-partitioned (single spec, supported column types, partition columns projected) the scan declares Partitioning::Hash so downstream joins/aggregates can skip a RepartitionExec.

- TableScan::to_arrow_from_tasks: replay pre-collected FileScanTasks through the Arrow reader; preserves the spice fork's ArrowReaderBuilder (file_io, runtime) signature and row-selection config.
- IcebergTableScan gains new_with_tasks (eager) alongside new (lazy, used by IcebergStaticTableProvider); execute(i) streams buckets[i]. Constructors made pub; with_new_children now errors on children.
- New table/bucketing.rs: identity-hash bucketing via REPARTITION_RANDOM_STATE + create_hashes, fallback to data_file_path.
- Spice limit pushdown preserved: with_limit threaded into the planning builder and build_table_scan.
- Drop the unused convert_filters_to_predicate re-export.
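The eligibility rule for declaring hash partitioning can be written down as a small decision function. This is a sketch under stated assumptions: `DeclaredPartitioning`, `Eligibility`, and `declare_partitioning` are hypothetical names that mirror the three conditions in the description (single spec, supported column types, partition columns projected), not the provider's real types, and the requirement that at least one identity column exists is an added assumption.

```rust
/// Hypothetical mirror of the two outcomes: hash partitioning on the identity
/// columns, or unknown partitioning with the same number of buckets.
#[derive(Debug, PartialEq)]
pub enum DeclaredPartitioning {
    Hash { columns: Vec<String>, partitions: usize },
    Unknown { partitions: usize },
}

/// Hypothetical summary of the facts the scan needs to decide.
pub struct Eligibility {
    pub partition_spec_count: usize,
    pub column_types_supported: bool,
    pub identity_columns: Vec<String>,
    pub projected_columns: Vec<String>,
}

/// Declare hash partitioning only when every condition holds; otherwise keep
/// the bucket count but make no claim about how rows are distributed.
pub fn declare_partitioning(e: &Eligibility, n_buckets: usize) -> DeclaredPartitioning {
    let all_projected = e
        .identity_columns
        .iter()
        .all(|c| e.projected_columns.contains(c));
    if e.partition_spec_count == 1
        && e.column_types_supported
        && !e.identity_columns.is_empty()
        && all_projected
    {
        DeclaredPartitioning::Hash {
            columns: e.identity_columns.clone(),
            partitions: n_buckets,
        }
    } else {
        DeclaredPartitioning::Unknown { partitions: n_buckets }
    }
}
```

Either way the scan still exposes `n_buckets` partitions, so the fallback loses only the chance to skip a downstream repartition, not the parallel reads.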

Copilot

Pull request overview

This PR updates the iceberg-datafusion integration to plan FileScanTasks eagerly during TableProvider::scan(), bucket them into multiple DataFusion partitions for parallel file reads, and (when safe) declare Partitioning::Hash for identity-partitioned tables to avoid redundant repartition steps downstream.

Changes:

Add eager file planning + bucketing in IcebergTableProvider::scan(), with optional Partitioning::Hash for eligible identity-partitioned tables.
Add TableScan::to_arrow_from_tasks and wire IcebergTableScan to replay pre-planned task buckets per partition.
Update sqllogictest + integration tests to reflect new scan display and partition behavior.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| crates/sqllogictest/testdata/slts/df_test/timestamp_predicate_pushdown.slt | Updates physical plan expectations to include `buckets`/`file_count` in scan display. |
| crates/sqllogictest/testdata/slts/df_test/like_predicate_pushdown.slt | Updates expected `input_partitions` and scan display for bucketed scans. |
| crates/sqllogictest/testdata/slts/df_test/boolean_predicate_pushdown.slt | Updates scan display expectations to include `buckets`/`file_count`. |
| crates/sqllogictest/testdata/slts/df_test/binary_predicate_pushdown.slt | Updates empty-table scan display expectations (`file_count:[0]`). |
| crates/sqllogictest/testdata/slts/df_test/basic_queries.slt | Updates scan display to include `buckets`/`file_count` alongside limit pushdown. |
| crates/integrations/datafusion/tests/integration_datafusion_test.rs | Adjusts execution partition index to match new partitioning behavior. |
| crates/integrations/datafusion/src/table/mod.rs | Implements eager planning + bucketing in `IcebergTableProvider::scan()` and adds bucketing-focused tests. |
| crates/integrations/datafusion/src/table/bucketing.rs | New module implementing identity-hash bucketing (DataFusion-compatible) with fallback hashing. |
| crates/integrations/datafusion/src/physical_plan/scan.rs | Adds eager multi-partition scan mode and task replay via `to_arrow_from_tasks`; improves display output. |
| crates/integrations/datafusion/src/physical_plan/mod.rs | Removes unused re-export. |
| crates/iceberg/src/scan/mod.rs | Adds `TableScan::to_arrow_from_tasks` and refactors `to_arrow()` to call it. |

Fork PR spiceai/iceberg-rust#43 merged; re-pin from the branch head (9e6e1a00) to the merge commit b8fc79d9 on spiceai-0.9.1-df-53.

… fork branch naming (spiceai#11328)

* chore(deps): bump iceberg-rust to parallel file scanning fork

  Bumps the spiceai/iceberg-rust pin from e519b221 to b652de2e, picking up the eager task-bucketing parallel-scan change (spiceai/iceberg-rust#42, port of apache/iceberg-rust#2298). IcebergTableProvider::scan() now plans files eagerly and distributes FileScanTasks across min(target_partitions, n_files) DataFusion partitions, so Iceberg file reads are scheduled concurrently instead of through a single partition. Identity-partitioned tables additionally declare Partitioning::Hash so downstream joins/aggregates can skip a RepartitionExec. No Spice code changes are required: Spice consumes the provider only via IcebergTableProvider::try_new, and the change is internal to the scan path. The fork preserves Spice's existing iceberg limit-pushdown.

* chore(deps): adopt spiceai-<iceberg>-df-<df> fork naming; re-pin to spiceai-0.9.1-df-53

  Re-pins iceberg-rust to the renamed `spiceai-0.9.1-df-53` branch (rev 9e6e1a00) and bumps the version requirement to `=0.9.1` to match the corrected crate version. Adds docs/dev/fork-branch-naming.md documenting the `spiceai-<iceberg>-df-<datafusion>` convention. The DF53 fork line was previously named `spiceai-0.9.0` but actually tracks a post-0.9.1 `main` snapshot; the rename + version correction make the branch name reflect its real contents.

* chore(deps): re-pin iceberg-rust to merged spiceai-0.9.1-df-53 commit

  Fork PR spiceai/iceberg-rust#43 merged; re-pin from the branch head (9e6e1a00) to the merge commit b8fc79d9 on spiceai-0.9.1-df-53.

* chore(deps): annotate iceberg pins with fork branch; drop naming doc

  Address Copilot review on spiceai#11328: add `# branch: spiceai-0.9.1-df-53` to the iceberg-rust pins (matching the convention used by delta_kernel/duckdb/arrow). Remove docs/dev/fork-branch-naming.md per review feedback (no separate doc).

Copilot AI review requested due to automatic review settings June 15, 2026 02:34

phillipleblanc mentioned this pull request Jun 15, 2026

feat(datafusion): parallel file scanning with eager task bucketing #42

Closed

Copilot started reviewing on behalf of phillipleblanc June 15, 2026 02:35

Copilot AI reviewed Jun 15, 2026

Comment thread crates/integrations/datafusion/src/table/bucketing.rs

Comment thread crates/integrations/datafusion/src/physical_plan/scan.rs

This was referenced Jun 15, 2026

feat(datafusion): parallel file scanning with eager task bucketing #44

Merged

chore(deps): bump iceberg-rust (parallel file scanning) + standardize fork branch naming spiceai/spiceai#11328

Merged

phillipleblanc self-assigned this Jun 15, 2026

phillipleblanc merged commit b8fc79d into spiceai-0.9.1-df-53 Jun 15, 2026
18 checks passed

phillipleblanc merged 1 commit into spiceai-0.9.1-df-53 from phillip/parallel-file-scanning-df53

2 participants
