feat(datafusion): parallel file scanning with eager task bucketing #43

Merged: phillipleblanc merged 1 commit into spiceai-0.9.1-df-53 from phillip/parallel-file-scanning-df53 on Jun 15, 2026

phillipleblanc commented Jun 15, 2026

Ports apache/iceberg-rust#2298 onto the DF53 fork line.

IcebergTableProvider::scan() plans files eagerly and distributes FileScanTasks into min(target_partitions, n_files) buckets — one bucket per DataFusion partition — so file reads are scheduled concurrently instead of through a single UnknownPartitioning(1) partition. Identity-partitioned tables (single spec, supported column types, partition cols projected) declare Partitioning::Hash so downstream joins/aggregates can skip a RepartitionExec; everything else falls back to UnknownPartitioning(N) while still bucketing.
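
To make the bucketing concrete, here is a minimal sketch using only the Rust standard library. It is not the code in table/bucketing.rs: `bucket_count`, `bucket_by_path`, and the use of `DefaultHasher` are illustrative assumptions (the PR hashes with REPARTITION_RANDOM_STATE + create_hashes), plain path strings stand in for FileScanTasks, and clamping the bucket count to at least 1 for an empty table is also an assumption.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// min(target_partitions, n_files), clamped to at least one bucket.
/// The clamp is an assumption so that an empty table still has one partition.
fn bucket_count(target_partitions: usize, n_files: usize) -> usize {
    target_partitions.min(n_files).max(1)
}

/// Assigns each file path to a bucket by hashing the path.
/// `DefaultHasher` is a stand-in for the hasher used by the real code.
fn bucket_by_path(paths: &[&str], target_partitions: usize) -> Vec<Vec<String>> {
    let n = bucket_count(target_partitions, paths.len());
    let mut buckets: Vec<Vec<String>> = vec![Vec::new(); n];
    for path in paths {
        let mut hasher = DefaultHasher::new();
        path.hash(&mut hasher);
        let idx = (hasher.finish() % n as u64) as usize;
        buckets[idx].push(path.to_string());
    }
    buckets
}

fn main() {
    let paths = ["a.parquet", "b.parquet", "c.parquet", "d.parquet", "e.parquet"];
    // With 8 target partitions and 5 files, at most 5 buckets are created.
    for (i, bucket) in bucket_by_path(&paths, 8).iter().enumerate() {
        println!("bucket {i}: {} file(s)", bucket.len());
    }
}
```

Hashing does not guarantee evenly sized buckets; some may be empty while others hold several files. The sketch only shows that every file lands in exactly one bucket and that the assignment is repeatable.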

Key changes:

- TableScan::to_arrow_from_tasks replays pre-collected tasks and keeps this fork line's ArrowReaderBuilder::new(file_io, runtime) signature and row-selection config.
- IcebergTableScan::new_with_tasks (eager) is added alongside new (lazy).
- New table/bucketing.rs provides identity-hash bucketing via REPARTITION_RANDOM_STATE + create_hashes, with a fallback to data_file_path.
- The unused convert_filters_to_predicate re-export is dropped.
- The fork's limit pushdown (with_limit) is preserved.

Validation: iceberg-datafusion 86 lib tests (incl. 6 new bucketing tests + existing limit tests), all 9 df_* sqllogictest schedules, integration_datafusion_test::test_provider_plan_stream_schema; clippy + fmt clean. Builds inside the Spice runtime against the Spice DF53 fork (cargo check -p data_components).

Base branch spiceai-0.9.1-df-53 is the renamed DF53 line (was spiceai-0.9.0), following the spiceai-<iceberg>-df-<datafusion> convention; its first commit corrects the stale version field to 0.9.1.
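
As an illustration of the spiceai-<iceberg>-df-<datafusion> convention, the sketch below splits a branch name into its two version components. `parse_fork_branch` is a hypothetical helper written for this example; nothing in the PR suggests such a function exists in either repository.

```rust
/// Splits `spiceai-<iceberg>-df-<datafusion>` into (iceberg, datafusion).
/// Returns None when the name does not follow the convention.
fn parse_fork_branch(name: &str) -> Option<(&str, &str)> {
    let rest = name.strip_prefix("spiceai-")?;
    let (iceberg, datafusion) = rest.split_once("-df-")?;
    if iceberg.is_empty() || datafusion.is_empty() {
        return None;
    }
    Some((iceberg, datafusion))
}

fn main() {
    // The renamed branch carries both versions.
    println!("{:?}", parse_fork_branch("spiceai-0.9.1-df-53"));
    // The old name has no DataFusion component, so it does not parse.
    println!("{:?}", parse_fork_branch("spiceai-0.9.0"));
}
```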

Port of apache#2298 onto the spiceai-0.9.0 fork.

IcebergTableProvider::scan() now plans files eagerly and distributes
FileScanTasks into min(target_partitions, n_files) buckets, one bucket
per DataFusion partition, so file reads are scheduled concurrently
instead of streaming through a single UnknownPartitioning(1) partition.
When the table is identity-partitioned (single spec, supported column
types, partition columns projected) the scan declares Partitioning::Hash
so downstream joins/aggregates can skip a RepartitionExec.
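
The eligibility rule in this paragraph can be sketched as a small decision function. Everything here is a simplified stand-in: `DeclaredPartitioning` mimics only the two DataFusion variants named above, and `ScanShape` with its fields is a hypothetical summary of what the planner knows, not a type from the PR.

```rust
/// Stand-in for the two DataFusion partitioning variants named in the text.
#[derive(Debug, PartialEq)]
enum DeclaredPartitioning {
    Hash { columns: Vec<String>, partitions: usize },
    Unknown(usize),
}

/// Hypothetical summary of the table and projection (assumption, not PR code).
struct ScanShape {
    /// Number of partition specs the table has.
    spec_count: usize,
    /// Partition source columns if every partition field is an identity transform.
    identity_columns: Option<Vec<String>>,
    /// Whether all partition column types are supported for hashing.
    types_supported: bool,
    /// Columns selected by the query projection.
    projected: Vec<String>,
}

/// Declares Hash only when all three conditions hold; otherwise Unknown(n).
/// Files are bucketed either way, so the partition count is the same.
fn declare_partitioning(shape: &ScanShape, n_buckets: usize) -> DeclaredPartitioning {
    match &shape.identity_columns {
        Some(cols)
            if shape.spec_count == 1
                && shape.types_supported
                && !cols.is_empty()
                && cols.iter().all(|c| shape.projected.contains(c)) =>
        {
            DeclaredPartitioning::Hash { columns: cols.clone(), partitions: n_buckets }
        }
        _ => DeclaredPartitioning::Unknown(n_buckets),
    }
}

fn main() {
    let shape = ScanShape {
        spec_count: 1,
        identity_columns: Some(vec!["region".to_string()]),
        types_supported: true,
        projected: vec!["id".to_string(), "region".to_string()],
    };
    println!("{:?}", declare_partitioning(&shape, 4));
}
```

Requiring the partition columns to be projected matters because a Hash declaration names output columns; if the query does not select them, there is nothing for the declaration to refer to, so the sketch falls back to Unknown.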

- TableScan::to_arrow_from_tasks: replay pre-collected FileScanTasks
  through the Arrow reader; preserves the spice fork's ArrowReaderBuilder
  (file_io, runtime) signature and row-selection config.
- IcebergTableScan gains new_with_tasks (eager) alongside new (lazy,
  used by IcebergStaticTableProvider); execute(i) streams buckets[i].
  Constructors made pub; with_new_children now errors on children.
- New table/bucketing.rs: identity-hash bucketing via
  REPARTITION_RANDOM_STATE + create_hashes, fallback to data_file_path.
- Spice limit pushdown preserved: with_limit threaded into the planning
  builder and build_table_scan.
- Drop the unused convert_filters_to_predicate re-export.
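
The per-partition behavior in the list above (execute(i) streams buckets[i], and with_new_children rejects children) can be sketched with a toy leaf node. `BucketedScan` is a hypothetical stand-in that holds file paths instead of FileScanTasks and returns slices instead of record-batch streams; reading "errors on children" as "errors when given a non-empty child list" is an interpretation, not something the commit message spells out.

```rust
/// Toy leaf scan node: one bucket of file paths per output partition.
struct BucketedScan {
    buckets: Vec<Vec<String>>,
}

impl BucketedScan {
    /// Number of output partitions equals the number of buckets.
    fn partition_count(&self) -> usize {
        self.buckets.len()
    }

    /// Returns the files for partition `partition`, mirroring execute(i) -> buckets[i].
    fn execute(&self, partition: usize) -> Result<&[String], String> {
        self.buckets
            .get(partition)
            .map(|bucket| bucket.as_slice())
            .ok_or_else(|| {
                format!("partition {partition} out of range (have {})", self.buckets.len())
            })
    }

    /// A leaf node has no inputs, so any supplied child is rejected (assumed behavior).
    fn with_new_children(self, children: Vec<BucketedScan>) -> Result<Self, String> {
        if children.is_empty() {
            Ok(self)
        } else {
            Err(format!("leaf scan takes no children, got {}", children.len()))
        }
    }
}

fn main() {
    let scan = BucketedScan {
        buckets: vec![
            vec!["a.parquet".to_string()],
            vec!["b.parquet".to_string(), "c.parquet".to_string()],
        ],
    };
    for i in 0..scan.partition_count() {
        println!("partition {i}: {:?}", scan.execute(i).unwrap());
    }
    // An empty child list leaves the node unchanged.
    assert!(scan.with_new_children(vec![]).is_ok());
}
```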

Copilot AI left a comment

Pull request overview

This PR updates the iceberg-datafusion integration to plan FileScanTasks eagerly during TableProvider::scan(), bucket them into multiple DataFusion partitions for parallel file reads, and (when safe) declare Partitioning::Hash for identity-partitioned tables to avoid redundant repartition steps downstream.

Changes:

  • Add eager file planning + bucketing in IcebergTableProvider::scan(), with optional Partitioning::Hash for eligible identity-partitioned tables.
  • Add TableScan::to_arrow_from_tasks and wire IcebergTableScan to replay pre-planned task buckets per partition.
  • Update sqllogictest + integration tests to reflect new scan display and partition behavior.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| crates/sqllogictest/testdata/slts/df_test/timestamp_predicate_pushdown.slt | Updates physical plan expectations to include buckets/file_count in scan display. |
| crates/sqllogictest/testdata/slts/df_test/like_predicate_pushdown.slt | Updates expected input_partitions and scan display for bucketed scans. |
| crates/sqllogictest/testdata/slts/df_test/boolean_predicate_pushdown.slt | Updates scan display expectations to include buckets/file_count. |
| crates/sqllogictest/testdata/slts/df_test/binary_predicate_pushdown.slt | Updates empty-table scan display expectations (file_count:[0]). |
| crates/sqllogictest/testdata/slts/df_test/basic_queries.slt | Updates scan display to include buckets/file_count alongside limit pushdown. |
| crates/integrations/datafusion/tests/integration_datafusion_test.rs | Adjusts execution partition index to match new partitioning behavior. |
| crates/integrations/datafusion/src/table/mod.rs | Implements eager planning + bucketing in IcebergTableProvider::scan() and adds bucketing-focused tests. |
| crates/integrations/datafusion/src/table/bucketing.rs | New module implementing identity-hash bucketing (DataFusion-compatible) with fallback hashing. |
| crates/integrations/datafusion/src/physical_plan/scan.rs | Adds eager multi-partition scan mode and task replay via to_arrow_from_tasks; improves display output. |
| crates/integrations/datafusion/src/physical_plan/mod.rs | Removes unused re-export. |
| crates/iceberg/src/scan/mod.rs | Adds TableScan::to_arrow_from_tasks and refactors to_arrow() to call it. |

Comment thread crates/integrations/datafusion/src/table/bucketing.rs
Comment thread crates/integrations/datafusion/src/physical_plan/scan.rs
@phillipleblanc phillipleblanc self-assigned this Jun 15, 2026
@phillipleblanc phillipleblanc merged commit b8fc79d into spiceai-0.9.1-df-53 Jun 15, 2026
18 checks passed
phillipleblanc added a commit to spiceai/spiceai that referenced this pull request Jun 15, 2026
Fork PR spiceai/iceberg-rust#43 merged; re-pin from the branch head
(9e6e1a00) to the merge commit b8fc79d9 on spiceai-0.9.1-df-53.
pull Bot pushed a commit to TheRakeshPurohit/spiceai that referenced this pull request Jun 15, 2026
… fork branch naming (spiceai#11328)

* chore(deps): bump iceberg-rust to parallel file scanning fork

Bumps the spiceai/iceberg-rust pin from e519b221 to b652de2e, picking up
the eager task-bucketing parallel-scan change (spiceai/iceberg-rust#42,
port of apache/iceberg-rust#2298).

IcebergTableProvider::scan() now plans files eagerly and distributes
FileScanTasks across min(target_partitions, n_files) DataFusion
partitions, so Iceberg file reads are scheduled concurrently instead of
through a single partition. Identity-partitioned tables additionally
declare Partitioning::Hash so downstream joins/aggregates can skip a
RepartitionExec.

No Spice code changes are required: Spice consumes the provider only via
IcebergTableProvider::try_new, and the change is internal to the scan
path. The fork preserves Spice's existing iceberg limit-pushdown.

* chore(deps): adopt spiceai-<iceberg>-df-<df> fork naming; re-pin to spiceai-0.9.1-df-53

Re-pins iceberg-rust to the renamed `spiceai-0.9.1-df-53` branch (rev
9e6e1a00) and bumps the version requirement to `=0.9.1` to match the
corrected crate version. Adds docs/dev/fork-branch-naming.md documenting
the `spiceai-<iceberg>-df-<datafusion>` convention.

The DF53 fork line was previously named `spiceai-0.9.0` but actually
tracks a post-0.9.1 `main` snapshot; the rename + version correction make
the branch name reflect its real contents.

* chore(deps): re-pin iceberg-rust to merged spiceai-0.9.1-df-53 commit

Fork PR spiceai/iceberg-rust#43 merged; re-pin from the branch head
(9e6e1a00) to the merge commit b8fc79d9 on spiceai-0.9.1-df-53.

* chore(deps): annotate iceberg pins with fork branch; drop naming doc

Address Copilot review on spiceai#11328: add `# branch: spiceai-0.9.1-df-53`
to the iceberg-rust pins (matching the convention used by
delta_kernel/duckdb/arrow). Remove docs/dev/fork-branch-naming.md per
review feedback (no separate doc).
github-actions Bot pushed a commit to spiceai/spiceai that referenced this pull request Jun 16, 2026
… fork branch naming (#11328)

2 participants