Skip to content

[Bug]: expireSnapshots fails with IllegalArgumentException: null on object-store-backed Iceberg tables (S3FileIO / OSSFileIO / GCSFileIO / ResolvingFileIO) #4237

@felixhzhu

Description

@felixhzhu

What happened?

When an Iceberg table's underlying FileIO is an object-store-native one (S3FileIO / OSSFileIO / GCSFileIO / ResolvingFileIO, commonly seen with Glue / REST / Polaris catalogs), the AMS-side expireSnapshots task always fails with IllegalArgumentException: null thrown from AuthenticatedFileIOAdapter.asFileSystemIO(). SnapshotsExpiringProcess is marked FAILED for every such table, even though the underlying data/manifest files have already been deleted successfully.

Root cause:
RollingFileCleaner.doCleanFiles() unconditionally calls TableFileUtil.deleteEmptyDirectory(), which invokes io.asFileSystemIO(). For object-store FileIOs, AuthenticatedFileIOAdapter.supportFileSystemOperations() returns false, so the single-arg Preconditions.checkArgument(boolean) throws IllegalArgumentException with a null message — matching the stack trace exactly.

The "delete empty parent directory" step only has meaning under HDFS / Hadoop FS semantics. On object stores there is no directory entity, the prefix disappears automatically when the last object is deleted, so this step is a no-op by definition.

Expected: expireSnapshots should succeed on object-store-backed tables; the directory-cleanup step should be skipped silently when the FileIO does not support filesystem operations.

(There is also a downstream issue: once the process is marked FAILED, DefaultTableProcessStore.tryTransitState rejects the subsequent RETRY_REQUESTED event because currentStatus is FAILED, producing a WARN storm. I will file that as a separate issue since the fix lives in amoro-ams, not amoro-format-iceberg.)

Affects Versions

0.9.0-incubating (also reproducible on the 0.9.x branch HEAD)

What table formats are you seeing the problem on?

Iceberg

What engines are you seeing the problem on?

AMS

How to reproduce

Environment:

  • Amoro 0.9.0-incubating, single AMS node.
  • Iceberg catalog: Glue / REST / Polaris (anything that produces a table whose io-impl is an object-store-native FileIO).
  • Table property io-impl is one of: S3FileIO / OSSFileIO / GCSFileIO / ResolvingFileIO.
  • Table has at least one expirable snapshot.

Steps:

  1. Register the Iceberg catalog into AMS.
  2. Pick a table whose oldest snapshot is older than history.expire.max-snapshot-age-ms.
  3. Trigger expireSnapshots — either via the AMS dashboard, or wait for SnapshotsExpiringExecutor.
  4. AMS log immediately reports IllegalArgumentException: null and the process is marked FAILED.

Observations:

  • table.expireSnapshots()...commit() actually succeeds (old snapshots removed from metadata.json).
  • RollingFileCleaner's bulk delete actually succeeds (S3FileIO.deleteFiles is invoked correctly).
  • Only the trailing "delete empty parent directories" step throws and flips the whole process to FAILED.

Reproduces 100% of the time on every object-store-backed table. HDFS / Hadoop FS tables (HadoopFileIO, including s3a/cosn/oss) are NOT affected.

Relevant log output

ERROR [local-expire-snapshots-7] [org.apache.amoro.server.process.iceberg.SnapshotsExpiringProcess]
- unexpected expire error of table xxx(tableId=92)
java.lang.IllegalArgumentException: null
    at org.apache.amoro.shade.guava32.com.google.common.base.Preconditions.checkArgument(Preconditions.java:129)
    at org.apache.amoro.io.AuthenticatedFileIOAdapter.asFileSystemIO(AuthenticatedFileIOAdapter.java:98)
    at org.apache.amoro.utils.TableFileUtil.deleteEmptyDirectory(TableFileUtil.java:95)
    at org.apache.amoro.formats.iceberg.utils.RollingFileCleaner.doCleanFiles(RollingFileCleaner.java:114)
    at org.apache.amoro.formats.iceberg.utils.RollingFileCleaner.clear(RollingFileCleaner.java:133)
    at org.apache.amoro.formats.iceberg.maintainer.IcebergTableMaintainer.expireSnapshots(IcebergTableMaintainer.java:...)

Anything else

Frequency: every scheduled run, every object-store-backed table. Verified on AWS S3 (Glue) and on a Polaris-backed REST catalog.

Proposed fix (already implemented locally, happy to upstream as a PR):

  1. TableFileUtil.deleteEmptyDirectory: early-return when !io.supportFileSystemOperations(). Object stores have no real directories — this step is a no-op by definition, and it's also where the existing exception originates.

  2. RollingFileCleaner.doCleanFiles:

    • guard the "delete empty parent directories" loop with the same supportFileSystemOperations() check (defense in depth + explicit intent),
    • catch per-directory exceptions as WARN so one bad parent dir does not abort the batch,
    • move parentDirectories.clear() and collectedFiles.clear() into a finally block so a failure in directory cleanup never causes already-deleted paths to be re-submitted on the next 1000-file flush.

No schema / config / Thrift IDL changes. No public API / SPI signature changes. HadoopFileIO-based tables are completely unaffected — the new guards return true and code follows the original path. BasicTableTrashManager and other callers of deleteEmptyDirectory benefit from the same guard for free.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    type:bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions