fix(scan): derive manifest schema/spec from table metadata (cf. iceberg-java specsById)#2683
Open
raghav-reglobe wants to merge 1 commit into
Open
fix(scan): derive manifest schema/spec from table metadata (cf. iceberg-java specsById)#2683raghav-reglobe wants to merge 1 commit into
raghav-reglobe wants to merge 1 commit into
Conversation
The manifest reader parses the table schema and partition spec from the
manifest Avro file's own `schema`/`partition-spec` key-value metadata and
hard-fails if `schema` is not a valid Iceberg schema. This makes tables
written by some engines unreadable (e.g. duckdb-iceberg serializes the
manifest_entry Avro schema there, using Avro type names like `array`/
`record`), while pyiceberg, Doris, and Spark (iceberg-java) read them fine.
The manifest's embedded schema is redundant with the authoritative table
metadata. Thread the table metadata through the scan's manifest decode
(ObjectCache::get_manifest -> ManifestFile::load_manifest_with ->
Manifest::try_from_avro_bytes_with -> ManifestMetadata::parse_with) and
prefer table_metadata.{schema_by_id, partition_spec_by_id} (looked up by the
manifest's schema-id / partition-spec-id) over the manifest's own keys,
falling back to the manifest metadata when none is available. Mirrors
iceberg-java's ManifestReader(specsById), whose reading of the schema from
file metadata is deprecated.
Closes apache#2682.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Raghvendra Singh <raghav@cashify.in>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
When reading a manifest, iceberg-rust parses the table schema and partition spec from that manifest's own
schema/partition-specAvro key-value metadata and uses them to decode entries — hard-failing ifschemais not a valid Iceberg schema. This PR derives the schema + spec from the authoritative table metadata (by the manifest'sschema-id/partition-spec-id) instead, falling back to the manifest's own keys when no table metadata is available.Why
TableMetadata(ObjectCache::get_manifest_listtakes it;schema_by_id/partition_spec_by_idexist). A manifest's embedded schema is a redundant copy.ManifestReadertakesspecsByIdfrom table metadata and has deprecated reading the schema from manifest file metadata — the warning is literally "Pass specsById to avoid reading from file metadata" (removed in 1.12.0). pyiceberg and iceberg-go don't read it on the scan path either. iceberg-rust is currently the only implementation that hard-depends on it.schemakey holds a non-conformant value are readable by pyiceberg, Doris, and Spark (iceberg-java) but not iceberg-rust. For example duckdb-iceberg serializes the manifest_entry Avro schema there (Avro type names likearray/record), producingdata did not match any variant of untagged enum SchemaEnum.How
ObjectCache::get_manifestnow takes&TableMetadataRefand threads it throughManifestFile::load_manifest_with→Manifest::try_from_avro_bytes_with→ManifestMetadata::parse_with, which preferstable_metadata.{schema_by_id, partition_spec_by_id}. The existing publicparse/parse_avro/load_manifest/try_from_avro_bytesare preserved (they delegate withNone), so behaviour is unchanged when no table metadata is available.Tests
test_manifest_metadata_parse_prefers_table_metadata_over_bad_schema: a manifest with a non-conformantschemakey parses successfully via table metadata, while the manifest-only path rejects it.Closes #2682.