DROP or UPDATE on Iceberg metadata is not reflected in AWS Glue

### Query engine

AWS Glue + AWS Athena/PyIceberg

### Question

I noticed a weird, but apparently expected, behavior when evolving an Iceberg table schema by adding, dropping, or updating columns.

When I add a new column via Athena, PyIceberg, or direct AWS calls, the change is reflected in AWS Glue (a new schema version is created, as expected). When I drop a column, however, the Iceberg metadata is updated but the Glue schema remains stale (it still contains the dropped column), so I have to manually update the columns in `StorageDescriptor`. The same happens for updates. What's worse is that after I fix it manually the schema in Glue is correct, but any subsequent change, e.g. adding a new column, brings the old columns back, so I need to update the Glue schema manually even for an ADD command.
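
To make the drift described above concrete, here is a minimal sketch of how I check for it. It only compares names: the dictionary is hypothetical data shaped like the `Table` entry of a Glue `get_table` response, and the list of Iceberg column names is assumed to come from the table's current schema (for instance, loaded through PyIceberg). The function name `stale_glue_columns` is my own, not part of any library.

```python
from __future__ import annotations

from typing import Iterable


def stale_glue_columns(glue_table: dict, iceberg_column_names: Iterable[str]) -> list[str]:
    """Return Glue column names that are absent from the current Iceberg schema."""
    # Compare case-insensitively, since Glue stores column names in lower case.
    current = {name.lower() for name in iceberg_column_names}
    glue_columns = glue_table.get("StorageDescriptor", {}).get("Columns", [])
    return [c["Name"] for c in glue_columns if c["Name"].lower() not in current]


# Hypothetical data: "legacy_flag" was dropped in Iceberg but is still listed in Glue.
example_table = {
    "Name": "events",
    "StorageDescriptor": {
        "Columns": [
            {"Name": "id", "Type": "bigint"},
            {"Name": "legacy_flag", "Type": "boolean"},
        ]
    },
}

print(stale_glue_columns(example_table, ["id"]))  # ['legacy_flag']
```

A non-empty result is what I see after every DROP: the Iceberg schema no longer has the column, but `StorageDescriptor.Columns` still does.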

Some research with AI led me to conclude that this is expected behavior in Iceberg (and well known to the Iceberg team). The reported reason the old columns reappear is that they remain in the Glue metadata with the special setting `"iceberg.field.current": "false"`, which tells engines that the column is not part of the current schema. In Glue itself, however, the column stays visible unless `StorageDescriptor.Columns` is updated manually.
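
For reference, the manual workaround I apply looks roughly like the sketch below. It assumes the flag appears in each column's `Parameters` map exactly as described above, and it works on a plain dictionary shaped like a Glue `get_table` response, so it makes no AWS calls itself. The helper names and the list of kept keys are my own choices; the key list is a subset of what `TableInput` accepts and may need extending.

```python
from __future__ import annotations

import copy

FIELD_CURRENT = "iceberg.field.current"

# Subset of keys that can be sent back as TableInput; read-only keys returned by
# get_table (DatabaseName, CreateTime, VersionId, ...) must be left out.
TABLE_INPUT_KEYS = (
    "Name", "Description", "Owner", "Retention",
    "StorageDescriptor", "PartitionKeys", "TableType", "Parameters",
)


def current_columns(columns: list[dict]) -> list[dict]:
    """Keep columns that are not explicitly flagged as non-current."""
    return [
        c for c in columns
        if c.get("Parameters", {}).get(FIELD_CURRENT, "true").lower() != "false"
    ]


def build_table_input(glue_table: dict) -> dict:
    """Build a TableInput-shaped dict that lists only the current columns."""
    table_input = {
        key: copy.deepcopy(value)
        for key, value in glue_table.items()
        if key in TABLE_INPUT_KEYS
    }
    descriptor = table_input.setdefault("StorageDescriptor", {})
    descriptor["Columns"] = current_columns(descriptor.get("Columns", []))
    return table_input


# Hypothetical get_table-style data with one dropped column still present.
example = {
    "Name": "events",
    "DatabaseName": "analytics",
    "Parameters": {"table_type": "ICEBERG"},
    "StorageDescriptor": {
        "Columns": [
            {"Name": "id", "Type": "bigint", "Parameters": {FIELD_CURRENT: "true"}},
            {"Name": "legacy_flag", "Type": "boolean", "Parameters": {FIELD_CURRENT: "false"}},
        ]
    },
}

cleaned = build_table_input(example)
print([c["Name"] for c in cleaned["StorageDescriptor"]["Columns"]])  # ['id']
```

The resulting dictionary is what I pass as `TableInput` to the boto3 Glue client's `update_table` call. As noted above, the next schema change through Iceberg brings the flagged columns back, so this has to be repeated after every change.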

What's also confusing is that Glue uses the same metadata as Athena, yet it bases its table structure not on the Iceberg metadata directly but on its own `StorageDescriptor`, apparently because it must remain compatible with Hive. Although query engines like Athena reflect the Iceberg metadata correctly, I need Glue to be in sync with them because I use Lake Formation, which relies heavily on an up-to-date Glue schema.

1. Is it really necessary to manually update the Glue schema, or is that the wrong pattern and there are better ways to do it? What are the recommended patterns for evolving the schema so that Glue and the query engines stay in sync?
2. Does Iceberg always set `"iceberg.field.current": "false"` for old, no longer available columns, and can this be prevented somehow, or is it specific to the AWS implementation?
3. Are there plans to change this confusing behavior in the future?

I am not sure whether these questions are for the Iceberg team or rather for AWS, but since the Iceberg team maintains the integration with AWS services, I assume they are relevant here as well.

DROP or UPDATE on Iceberg metadata is not reflected in AWS Glue #16877