fix(scan): derive manifest schema/spec from table metadata (cf. iceberg-java specsById) #2683

Open
raghav-reglobe wants to merge 1 commit into apache:main from raghav-reglobe:manifest-schema-resilience

@raghav-reglobe

What

When reading a manifest, iceberg-rust parses the table schema and partition spec from that manifest's own schema / partition-spec Avro key-value metadata and uses them to decode entries — hard-failing if schema is not a valid Iceberg schema. This PR derives the schema + spec from the authoritative table metadata (by the manifest's schema-id / partition-spec-id) instead, falling back to the manifest's own keys when no table metadata is available.

Why

  1. Redundant + brittle. The scan already holds TableMetadata (ObjectCache::get_manifest_list takes it; schema_by_id/partition_spec_by_id exist). A manifest's embedded schema is a redundant copy.
  2. Ecosystem alignment. iceberg-java's ManifestReader takes specsById from table metadata and has deprecated reading the schema from manifest file metadata — the warning is literally "Pass specsById to avoid reading from file metadata" (removed in 1.12.0). pyiceberg and iceberg-go don't read it on the scan path either. iceberg-rust is currently the only implementation that hard-depends on it.
  3. Observable impact. Tables whose manifest schema key holds a non-conformant value are readable by pyiceberg, Doris, and Spark (iceberg-java) but not by iceberg-rust. For example, duckdb-iceberg serializes the manifest_entry Avro schema there (Avro type names like array/record), which makes iceberg-rust fail with "data did not match any variant of untagged enum SchemaEnum".

How

ObjectCache::get_manifest now takes &TableMetadataRef and threads it through ManifestFile::load_manifest_with -> Manifest::try_from_avro_bytes_with -> ManifestMetadata::parse_with, which prefers table_metadata.{schema_by_id, partition_spec_by_id}. The existing public parse / parse_avro / load_manifest / try_from_avro_bytes are preserved (they delegate with None), so behaviour is unchanged when no table metadata is available.
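
The prefer-then-fall-back lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Schema`, `TableMetadata`, `parse_embedded_schema`, `resolve_schema_with`, and `resolve_schema` are hypothetical stand-ins, and the toy `iceberg-schema:<id>` string format exists only so the example is self-contained. The sketch also falls back to the embedded key when the id is missing from table metadata, which is an assumption of the sketch rather than something stated here.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for illustration; these are NOT the iceberg-rust types.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    id: i32,
}

struct TableMetadata {
    schemas: HashMap<i32, Schema>,
}

impl TableMetadata {
    // Mirrors the idea of a by-id lookup on table metadata.
    fn schema_by_id(&self, id: i32) -> Option<&Schema> {
        self.schemas.get(&id)
    }
}

// Toy parser for the manifest's embedded `schema` key (assumed format:
// "iceberg-schema:<id>"). Anything else is rejected, like a non-conformant key.
fn parse_embedded_schema(raw: &str) -> Result<Schema, String> {
    raw.strip_prefix("iceberg-schema:")
        .and_then(|id| id.parse::<i32>().ok())
        .map(|id| Schema { id })
        .ok_or_else(|| format!("not a valid Iceberg schema: {raw}"))
}

// Prefer the schema from table metadata; parse the embedded key only when
// table metadata is absent (or, in this sketch, does not contain the id).
fn resolve_schema_with(
    schema_id: i32,
    embedded: &str,
    table_metadata: Option<&TableMetadata>,
) -> Result<Schema, String> {
    if let Some(schema) = table_metadata.and_then(|tm| tm.schema_by_id(schema_id)) {
        return Ok(schema.clone());
    }
    parse_embedded_schema(embedded)
}

// The pre-existing entry point keeps its signature and delegates with `None`,
// so callers without table metadata see unchanged behaviour.
fn resolve_schema(schema_id: i32, embedded: &str) -> Result<Schema, String> {
    resolve_schema_with(schema_id, embedded, None)
}
```

With table metadata supplied, a manifest whose embedded key is unparseable still resolves by id; without it, the same manifest is rejected exactly as before.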

Tests

  • New test_manifest_metadata_parse_prefers_table_metadata_over_bad_schema: a manifest with a non-conformant schema key parses successfully via table metadata, while the manifest-only path rejects it.
  • Full lib suite green (1359 passed), clippy + fmt clean.

Closes #2682.

The manifest reader parses the table schema and partition spec from the
manifest Avro file's own `schema`/`partition-spec` key-value metadata and
hard-fails if `schema` is not a valid Iceberg schema. This makes tables
written by some engines unreadable (e.g. duckdb-iceberg serializes the
manifest_entry Avro schema there, using Avro type names like `array`/
`record`), while pyiceberg, Doris, and Spark (iceberg-java) read them fine.

The manifest's embedded schema is redundant with the authoritative table
metadata. Thread the table metadata through the scan's manifest decode
(ObjectCache::get_manifest -> ManifestFile::load_manifest_with ->
Manifest::try_from_avro_bytes_with -> ManifestMetadata::parse_with) and
prefer table_metadata.{schema_by_id, partition_spec_by_id} (looked up by the
manifest's schema-id / partition-spec-id) over the manifest's own keys,
falling back to the manifest metadata when none is available. Mirrors
iceberg-java's ManifestReader(specsById), whose reading of the schema from
file metadata is deprecated.

Closes apache#2682.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Raghvendra Singh <raghav@cashify.in>