Query engine
AWS Glue + AWS Athena/PyIceberg
Question
I noticed a strange, though apparently expected, behavior when updating an Iceberg table schema by adding, dropping, or updating columns.
When I add a new column via Athena/PyIceberg or direct AWS calls, the change is reflected in AWS Glue (a new schema version is created, as expected). However, when I drop a column, the Iceberg metadata is updated but the Glue schema remains stale (it still contains the dropped column), so I have to manually update the columns in StorageDescriptor. The same happens when I update a column. What's worse is that after the manual fix the schema in Glue is correct, but any subsequent change, e.g. adding a new column, brings the old columns back, so I have to update the Glue schema manually even after an ADD command.
Some research with AI led me to conclude that this is expected behavior in Iceberg (and well known to the Iceberg team). The old columns reappear because they remain in the Glue metadata with a special setting, "iceberg.field.current": "false", which tells engines that the column is not part of the current schema. In Glue itself, however, the column stays visible unless StorageDescriptor.Columns is updated manually.
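To illustrate what I mean, here is a minimal sketch of how the stale columns could be told apart from the current ones. It assumes each Glue column is a dict with an optional "Parameters" map carrying the "iceberg.field.current" flag, as described above. The function name `split_columns` and the sample data are hypothetical, and in practice the column list would come from the table's StorageDescriptor.Columns rather than being hard-coded.

```python
# Parameter name as quoted in this post.
FIELD_CURRENT = "iceberg.field.current"


def split_columns(columns):
    """Split Glue column dicts into (current, stale) lists.

    Assumption: a column counts as stale only when its Parameters map
    explicitly sets iceberg.field.current to "false"; a missing flag is
    treated as current.
    """
    current, stale = [], []
    for col in columns:
        params = col.get("Parameters") or {}
        flag = str(params.get(FIELD_CURRENT, "true")).lower()
        if flag == "false":
            stale.append(col)
        else:
            current.append(col)
    return current, stale


# Hypothetical sample shaped like StorageDescriptor.Columns.
sample_columns = [
    {"Name": "id", "Type": "bigint",
     "Parameters": {FIELD_CURRENT: "true"}},
    {"Name": "legacy_code", "Type": "string",
     "Parameters": {FIELD_CURRENT: "false"}},
    {"Name": "created_at", "Type": "timestamp"},
]

current, stale = split_columns(sample_columns)
print("current:", [c["Name"] for c in current])
print("stale:", [c["Name"] for c in stale])
```

Filtering like this only shows which columns Glue still lists as non-current. It does not by itself solve the problem that a later schema change brings them back.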
What's also confusing is that Glue and Athena share the same metadata, yet Glue does not derive its table structure from the Iceberg metadata directly but from its own StorageDescriptor, apparently because it must remain compatible with Hive. Although query engines like Athena reflect the Iceberg metadata correctly, I need Glue to stay in sync with them because I use Lake Formation, which relies heavily on an up-to-date Glue schema.
- Is it really necessary to update the Glue schema manually, or is that the wrong pattern and there are better ways to do it? What are the recommended patterns for evolving the schema so that Glue and the query engines stay in sync?
- Does Iceberg always set "iceberg.field.current": "false" for old, no longer available columns, and can this be prevented somehow, or is it part of the AWS implementation?
- Are there plans to change this confusing behavior in the future?
I am not sure whether these questions are for the Iceberg team or rather for the AWS team, but since the Iceberg team maintains the integration with AWS services, I assume they are relevant here as well.