Spark: Parallelize RemoveOrphanFiles prefix listing across executors by arifazmidd · Pull Request #16933 · apache/iceberg

arifazmidd · 2026-06-22T19:21:37Z

Description

DeleteOrphanFilesSparkAction has two listing strategies, but only the Hadoop one parallelizes:

prefix_listing = false (Hadoop / listStatus) is depth-limited on the driver and fans deep sub-directories out across executors. It parallelizes, but issues ~one LIST call per directory. On tables with hundreds of thousands of partition directories that is too many round-trips to finish in a reasonable time even when distributed.
prefix_listing = true (FileIO / listPrefix) uses a flat recursive listing that needs ~an order of magnitude fewer LIST calls, but it is iterated serially on the driver and then parallelize(matchingFiles, 1) (a single partition). For tables with tens of millions of files the driver never finishes listing, so remove_orphan_files hangs before reaching the deletion phase.

This change gives the prefix-listing path the same executor fan-out the Hadoop path already has, so it gets both the low call count of listPrefix and cluster parallelism.

Changes

Parallelize the usePrefixListing branch of listedFileDS():
- Driver discovers shallow sub-prefixes via the existing FileSystemWalker.listDirRecursivelyWithHadoop depth-limited discovery (SupportsPrefixOperations.listPrefix is recursive-only and cannot enumerate a single level, so discovery needs a delimiter-capable step).
- Sub-prefixes are distributed with parallelize(subDirs).mapPartitions(...); each task runs the existing FileSystemWalker.listDirRecursivelyWithFileIO on its sub-prefix, with FileIO obtained from a broadcast SerializableTableWithSize.
Add a parallel-prefix-listing option (default true); false restores the prior serial driver-side path.
No FileSystemWalker changes, reuses the existing walker methods.

Testing

TestRemoveOrphanFilesAction is already parameterized over usePrefixListing, so the full suite exercises the parallel path (186 tests, all passing, against current main).
Real-world: on an S3-backed table with ~40M files across hundreds of thousands of partitions, prefix_listing => true previously hung indefinitely on the driver in listDirRecursivelyWithFileIO. With this change the listing completed (~10k sub-prefix tasks across 30 executors, finishing in minutes) and the job proceeded to delete the orphan files.

Notes

Targets Spark 3.5 first; happy to replicate to 3.4 / 4.0 / 4.1 once the approach looks right.
Highly concurrent listPrefix can trigger S3 503 throttling; raising s3.retry.num-retries / s3.retry.max-wait-ms mitigates it. Could add a docs note if useful.

The prefix_listing path enumerated the entire table on a single driver thread via one listPrefix iterator, so remove_orphan_files could hang on large object-store tables before ever reaching the deletion phase. Distribute the listing across executors the way the Hadoop path already does: discover shallow sub-prefixes on the driver, then list each sub-prefix in parallel with listPrefix using a broadcast SerializableTable. Gated by a new parallel-prefix-listing option (default true), with the prior serial path as the fallback.

github-actions Bot added the spark label Jun 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Spark: Parallelize RemoveOrphanFiles prefix listing across executors#16933

Spark: Parallelize RemoveOrphanFiles prefix listing across executors#16933
arifazmidd wants to merge 1 commit into
apache:mainfrom
arifazmidd:spark/parallel-prefix-listing

arifazmidd commented Jun 22, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

arifazmidd commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Changes

Testing

Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

arifazmidd commented Jun 22, 2026 •

edited

Loading