
DROP or UPDATE on Iceberg metadata is not reflected in AWS Glue #16877

Description

@rafal-r-np

Query engine

AWS Glue + AWS Athena/PyIceberg

Question

I noticed some odd, though apparently expected, behavior when updating an Iceberg table schema by adding, dropping, or updating columns.

When I add a new column via Athena, PyIceberg, or direct AWS calls, the change is reflected in AWS Glue (a new schema version is created, as expected). However, when I drop a column, the Iceberg metadata is updated but the Glue schema remains stale: it still contains the dropped column, and I need to manually update the columns in StorageDescriptor. Updating a column behaves the same way. What's worse is that after the manual fix the schema in Glue is correct, but any subsequent change, e.g. adding a new column, brings the old columns back, so I need to update the Glue schema manually even after an ADD command.

Some research with AI led me to conclude that this is expected behavior in Iceberg (and well known to the Iceberg team). The old columns reappear because they remain in the Glue metadata with the special setting "iceberg.field.current": "false", which tells engines that the column is not part of the current schema. In Glue itself, however, the column stays visible unless StorageDescriptor.Columns is updated manually.
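To make the manual workaround concrete, here is a minimal sketch in Python. It assumes boto3's Glue client (`get_table` / `update_table`) and that stale columns carry the column parameter "iceberg.field.current": "false" as described above. The helper names (`is_current`, `prune_stale_columns`, `sync_glue_columns`) are hypothetical, and the key list is only a subset of what Glue's `TableInput` accepts. It is an illustration of the manual step, not a recommended pattern, and a later Iceberg commit may add the columns back.

```python
from typing import Any, Dict

# Column parameter named in the issue text; "false" marks a column that is
# no longer part of the current Iceberg schema.
FIELD_CURRENT_KEY = "iceberg.field.current"

# A subset of the keys accepted by Glue's TableInput. get_table also returns
# read-only keys (DatabaseName, CreateTime, ...) that must not be sent back.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "Retention",
    "StorageDescriptor", "PartitionKeys", "TableType", "Parameters",
}


def is_current(column: Dict[str, Any]) -> bool:
    """Treat a column as current unless it is explicitly flagged "false"."""
    params = column.get("Parameters") or {}
    return params.get(FIELD_CURRENT_KEY, "true").lower() != "false"


def prune_stale_columns(table: Dict[str, Any]) -> Dict[str, Any]:
    """Build a TableInput dict from a get_table result, without stale columns."""
    table_input = {k: v for k, v in table.items() if k in TABLE_INPUT_KEYS}
    # Copy the descriptor so the caller's dict is not modified.
    descriptor = dict(table_input.get("StorageDescriptor", {}))
    descriptor["Columns"] = [c for c in descriptor.get("Columns", []) if is_current(c)]
    table_input["StorageDescriptor"] = descriptor
    return table_input


def sync_glue_columns(database: str, name: str) -> None:
    """Fetch the Glue table and write it back with stale columns removed."""
    import boto3  # imported lazily so the pure helpers above work without AWS

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=name)["Table"]
    glue.update_table(DatabaseName=database, TableInput=prune_stale_columns(table))
```

The filtering is kept in a pure function so it can be checked against a saved `get_table` response before anything is written to the catalog. The table's `Parameters` are passed through unchanged so the Iceberg metadata pointer stored there is preserved.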

What's also confusing is that Glue uses the same metadata as Athena, yet it does not derive its table structure from the Iceberg metadata directly but from its own StorageDescriptor, apparently because it must remain compatible with Hive. Although query engines like Athena reflect the Iceberg metadata correctly, I need Glue to stay in sync with them because I use Lake Formation, which relies heavily on an up-to-date Glue schema.

  1. Is it really necessary to update the Glue schema manually, or is that the wrong pattern and there are better ways to do it? What are the recommended patterns for evolving the schema so that Glue and the query engines stay in sync?
  2. Does Iceberg always set "iceberg.field.current": "false" for old columns that are no longer available, and can this be prevented somehow, or is it part of the AWS implementation?
  3. Are there plans to change this confusing behavior in the future?

I am not sure whether these questions are for the Iceberg team or rather for AWS, but since the Iceberg team maintains the integration with AWS services, I assume they are relevant here as well.


Labels: question
