Manifest reader hard-depends on the manifest schema/partition-spec keys; should derive from table metadata (cf. iceberg-java specsById) #2682

@raghav-reglobe

The iceberg-rust issue

When reading a manifest, ManifestMetadata::parse (spec/manifest/metadata.rs) parses the table schema and partition spec from that manifest's own schema / partition-spec Avro key-value metadata, and Manifest::try_from_avro_bytes then uses them to derive the partition type for decoding entries:

serde_json::from_slice::<Schema>(meta.get("schema"))?            // hard error if not a valid Iceberg schema
let partition_type = metadata.partition_spec.partition_type(&metadata.schema)?;

This has two problems, independent of any particular writer:

  1. Redundant dependency. The scan already holds the authoritative TableMetadata (ObjectCache::get_manifest_list takes it, and TableMetadata::{schema_by_id, partition_spec_by_id} exist). A manifest's embedded schema/partition-spec is a redundant copy of what table metadata already provides by id.
  2. Out of step with the ecosystem. Other implementations don't read the manifest's schema key on the scan path:
    • iceberg-java ManifestReader takes specsById (specs from table metadata); reading the schema from manifest file metadata is deprecated and slated for removal (1.12.0) — the warning is literally "Pass specsById to avoid reading from file metadata."
    • pyiceberg decodes via a fixed MANIFEST_ENTRY_SCHEMAS + the table-metadata schema.
    • iceberg-go decodes via the Avro writer schema.

So, of the implementations listed here, iceberg-rust is the only one that hard-depends on the manifest's self-described schema, which is both unnecessary and brittle.

Observable impact (the symptom)

Because of this, manifests whose schema key holds anything other than a valid Iceberg table schema are unreadable in iceberg-rust only — pyiceberg, Apache Doris, and Spark (iceberg-java) all read the same tables. For example, duckdb-iceberg serializes the manifest_entry Avro record schema into the schema key (using Avro type names like array/record), so iceberg-rust fails with:

Fail to parse schema in manifest metadata
  → data did not match any variant of untagged enum SchemaEnum

This is the symptom; the root concern is the redundant, ecosystem-divergent dependency above.

Proposed fix

Derive the schema + partition spec from the table metadata (by the manifest's schema-id / partition-spec-id) rather than the manifest's own keys, falling back to the manifest metadata when no table metadata is available — mirroring iceberg-java's ManifestReader(specsById). The scan already has the TableMetadataRef to thread down. PR opening shortly.
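
To make the proposed lookup order concrete, here is a minimal sketch using only the standard library. Every type and function in it (`Schema`, `PartitionSpec`, `TableMeta`, `ManifestHeader`, `resolve`, `partition_type`) is a hypothetical stand-in for illustration, not the iceberg-rust API, and the eventual PR may be shaped differently. One assumption beyond the text above: the sketch also falls back to the embedded copy when table metadata is present but does not contain the requested id.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for illustration; NOT the iceberg-rust types.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    id: i32,
    fields: Vec<String>,
}

#[derive(Debug, Clone, PartialEq)]
struct PartitionSpec {
    id: i32,
    source_fields: Vec<String>,
}

// Authoritative table metadata, indexed by id.
struct TableMeta {
    schemas: HashMap<i32, Schema>,
    specs: HashMap<i32, PartitionSpec>,
}

// What a manifest header carries: the ids, plus the embedded copies,
// which may have failed to parse (Err) if the writer stored something else.
struct ManifestHeader {
    schema_id: i32,
    partition_spec_id: i32,
    embedded_schema: Result<Schema, String>,
    embedded_spec: Result<PartitionSpec, String>,
}

// Prefer table metadata (looked up by id); fall back to the manifest's own
// copy only when table metadata is absent or lacks that id (an assumption).
fn resolve(
    header: &ManifestHeader,
    table: Option<&TableMeta>,
) -> Result<(Schema, PartitionSpec), String> {
    let schema = match table.and_then(|t| t.schemas.get(&header.schema_id)) {
        Some(s) => s.clone(),
        None => header.embedded_schema.clone()?,
    };
    let spec = match table.and_then(|t| t.specs.get(&header.partition_spec_id)) {
        Some(p) => p.clone(),
        None => header.embedded_spec.clone()?,
    };
    Ok((schema, spec))
}

// Toy analogue of deriving the partition type: every partition source
// field must exist in the resolved schema.
fn partition_type(spec: &PartitionSpec, schema: &Schema) -> Result<Vec<String>, String> {
    spec.source_fields
        .iter()
        .map(|f| {
            if schema.fields.contains(f) {
                Ok(f.clone())
            } else {
                Err(format!("partition source field '{f}' not in schema {}", schema.id))
            }
        })
        .collect()
}
```

With this ordering, a manifest whose embedded schema key is unparseable still resolves as long as the table metadata knows its schema-id and partition-spec-id. The embedded copy is consulted only on the fallback path, so a parse failure there surfaces only when no table metadata can answer.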
