datafusion: relax identity grouping-key gate across partition-spec evolution #2658

@toutane

Is your feature request related to a problem or challenge?

Feature request.

Background

compute_identity_cols in crates/integrations/datafusion/src/table/bucketing.rs (added in #2298) returns None whenever a table has more than one historical partition spec, which forces the eager scan to declare UnknownPartitioning.

This is safe but stricter than iceberg-java, which intersects the identity fields present across all specs (Partitioning.groupingKeyType / commonActiveFieldIds) and still reports a grouping key on the columns that are identity-partitioned in every spec.
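For illustration, the intersection could be computed along these lines. This is a minimal sketch, not the iceberg-rust or iceberg-java implementation: `Transform`, `PartitionField`, `PartitionSpec`, and `common_identity_source_ids` are hypothetical, simplified stand-ins, and keying the intersection on source column ids is an assumption made for the example.

```rust
use std::collections::BTreeSet;

// Hypothetical, simplified stand-ins for the real partition-spec types.
#[derive(Debug, Clone, PartialEq)]
pub enum Transform {
    Identity,
    Bucket(u32),
    Day,
}

#[derive(Debug, Clone)]
pub struct PartitionField {
    pub source_id: i32,
    pub transform: Transform,
}

#[derive(Debug, Clone)]
pub struct PartitionSpec {
    pub spec_id: i32,
    pub fields: Vec<PartitionField>,
}

/// Source column ids that are identity-partitioned in every spec.
/// Returns None when there are no specs or the intersection is empty
/// (an assumed convention, so the caller can fall back to UnknownPartitioning).
pub fn common_identity_source_ids(specs: &[PartitionSpec]) -> Option<BTreeSet<i32>> {
    // One set of identity-partitioned source ids per spec.
    let mut per_spec = specs.iter().map(|spec| {
        spec.fields
            .iter()
            .filter(|f| f.transform == Transform::Identity)
            .map(|f| f.source_id)
            .collect::<BTreeSet<i32>>()
    });

    let first = per_spec.next()?;
    // Keep only the ids that appear in every remaining spec.
    let common = per_spec.fold(first, |acc, ids| {
        let kept: BTreeSet<i32> = acc.intersection(&ids).copied().collect();
        kept
    });

    if common.is_empty() {
        None
    } else {
        Some(common)
    }
}
```

With one spec that has identity on column 1 plus a day transform on column 2, and a later spec that has identity on columns 3 and 1, the sketch yields only column 1 as the grouping key.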

Why it's conservative today

The eager bucketing path hashes each task on the partition-tuple slot that matches the table's default spec. Under spec evolution, older files carry a partition tuple whose slot order does not necessarily align with the default spec, and FileScanTask does not currently carry its own spec id to disambiguate. A per-column intersection was attempted in e0d6add and reverted in f25c911 as out of scope for #2298.

Describe the solution you'd like

Match iceberg-java: compute the intersection of identity-source fields common to every spec and declare Partitioning::Hash on those columns, resolving each task's partition slot via its own spec id rather than assuming the default spec's slot order.
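A rough sketch of the slot resolution, under stated assumptions: every type and function here (`Value`, `SpecLayout`, `Task`, `grouping_key`) is hypothetical, and `Task::spec_id` in particular stands in for a spec id that FileScanTask does not carry today. The idea is to look up each grouping column's slot in the layout of the task's own spec instead of indexing by the default spec's slot order.

```rust
use std::collections::HashMap;

// Hypothetical, simplified partition value.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Int(i64),
    Str(String),
}

/// Hypothetical per-spec layout: for each partition-tuple slot, the source
/// column id if the slot uses the identity transform, otherwise None.
pub struct SpecLayout {
    pub identity_slot_source_ids: Vec<Option<i32>>,
}

/// Hypothetical task that carries its own spec id alongside its partition tuple.
pub struct Task {
    pub spec_id: i32,
    pub partition: Vec<Value>,
}

/// Extracts the grouping-key values for `grouping_source_ids`, resolving each
/// slot through the task's own spec. Returns None if the spec is unknown or a
/// grouping column has no identity slot in that spec.
pub fn grouping_key(
    task: &Task,
    layouts: &HashMap<i32, SpecLayout>,
    grouping_source_ids: &[i32],
) -> Option<Vec<Value>> {
    let layout = layouts.get(&task.spec_id)?;
    grouping_source_ids
        .iter()
        .map(|wanted| {
            // Find the slot that holds this column in the task's spec.
            let slot = layout
                .identity_slot_source_ids
                .iter()
                .position(|id| *id == Some(*wanted))?;
            task.partition.get(slot).cloned()
        })
        .collect()
}
```

Two tasks written under different specs, one with the column in slot 0 and one with it in slot 1, then produce the same key for the same column value, so hashing on that key keeps them in the same partition.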

Follow-up to #2298.

Willingness to contribute

I would be willing to contribute to this feature with guidance from the Iceberg Rust community


Labels: enhancement
