Spark: Parallelize RemoveOrphanFiles prefix listing across executors #16933

Open
arifazmidd wants to merge 1 commit into apache:main from arifazmidd:spark/parallel-prefix-listing

@arifazmidd arifazmidd commented Jun 22, 2026

Contributor

Closes #16932

Description

DeleteOrphanFilesSparkAction has two listing strategies, but only the Hadoop one parallelizes:

  • prefix_listing = false (Hadoop / listStatus) is depth-limited on the driver and fans deep sub-directories out across executors. It parallelizes, but issues ~one LIST call per directory. On tables with hundreds of thousands of partition directories that is too many round-trips to finish in a reasonable time even when distributed.
  • prefix_listing = true (FileIO / listPrefix) uses a flat recursive listing that needs roughly an order of magnitude fewer LIST calls, but the listing is iterated serially on the driver and the result is then wrapped with parallelize(matchingFiles, 1) (a single partition). For tables with tens of millions of files the driver never finishes listing, so remove_orphan_files hangs before reaching the deletion phase.

This change gives the prefix-listing path the same executor fan-out the Hadoop path already has, so it gets both the low call count of listPrefix and cluster parallelism.
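To make the call-count gap concrete, here is a back-of-the-envelope sketch. It assumes the object store returns at most 1,000 keys per LIST page (the S3 ListObjectsV2 maximum) and uses a hypothetical count of 400,000 directories as a stand-in for "hundreds of thousands"; the class and method names are illustrative only and are not part of this PR.

```java
public class ListCallEstimate {
  // Assumption: at most 1,000 keys per LIST page (the S3 ListObjectsV2 maximum).
  public static final long KEYS_PER_PAGE = 1_000L;

  // Hierarchical (listStatus-style) listing: at least one LIST call per directory.
  public static long hierarchicalCalls(long directories) {
    return directories;
  }

  // Flat prefix listing: one LIST call per page of keys (ceiling division).
  public static long flatCalls(long files) {
    return (files + KEYS_PER_PAGE - 1) / KEYS_PER_PAGE;
  }

  public static void main(String[] args) {
    long files = 40_000_000L;    // "~40M files", as in the real-world test
    long directories = 400_000L; // hypothetical value for "hundreds of thousands"
    System.out.println("hierarchical LIST calls >= " + hierarchicalCalls(directories));
    System.out.println("flat LIST calls ~= " + flatCalls(files));
  }
}
```

Under these assumptions the hierarchical path needs at least 400,000 calls and the flat path about 40,000. Splitting the flat listing into sub-prefixes adds at least one call per sub-prefix, so the real count is somewhat higher than this lower bound.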

Changes

  • Parallelize the usePrefixListing branch of listedFileDS():
    • Driver discovers shallow sub-prefixes via the existing FileSystemWalker.listDirRecursivelyWithHadoop depth-limited discovery (SupportsPrefixOperations.listPrefix is recursive-only and cannot enumerate a single level, so discovery needs a delimiter-capable step).
    • Sub-prefixes are distributed with parallelize(subDirs).mapPartitions(...); each task runs the existing FileSystemWalker.listDirRecursivelyWithFileIO on its sub-prefix, with FileIO obtained from a broadcast SerializableTableWithSize.
  • Add a parallel-prefix-listing option (default true); false restores the prior serial driver-side path.
  • No FileSystemWalker changes; the new path reuses the existing walker methods.
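The two-phase shape described above (depth-limited discovery on the driver, then a flat recursive listing per sub-prefix in parallel) can be modeled without Spark. The sketch below is not the PR's code: it uses an in-memory sorted key set as a stand-in for the object store and parallelStream() as a stand-in for parallelize(subDirs).mapPartitions(...). All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class ParallelPrefixListingSketch {
  // In-memory stand-in for an object store: a sorted set of file keys.
  private final NavigableSet<String> keys;

  public ParallelPrefixListingSketch(List<String> fileKeys) {
    this.keys = new TreeSet<>(fileKeys);
  }

  // Flat recursive listing: every key under the prefix, in sorted order.
  public List<String> listPrefix(String prefix) {
    List<String> result = new ArrayList<>();
    for (String key : keys.tailSet(prefix, true)) {
      if (!key.startsWith(prefix)) {
        break; // keys sharing a prefix are contiguous in a sorted set
      }
      result.add(key);
    }
    return result;
  }

  // Single-level listing with a "/" delimiter: direct files are appended to
  // `files`, and the immediate child "directories" are returned.
  private List<String> listOneLevel(String prefix, List<String> files) {
    TreeSet<String> children = new TreeSet<>();
    for (String key : listPrefix(prefix)) {
      int slash = key.indexOf('/', prefix.length());
      if (slash < 0) {
        files.add(key);
      } else {
        children.add(key.substring(0, slash + 1));
      }
    }
    return new ArrayList<>(children);
  }

  // "Driver" phase: walk at most `remainingDepth` levels, collecting shallow
  // files directly and recording deeper sub-prefixes for later fan-out.
  private void discover(String prefix, int remainingDepth,
                        List<String> files, List<String> subPrefixes) {
    if (remainingDepth == 0) {
      subPrefixes.add(prefix);
      return;
    }
    for (String child : listOneLevel(prefix, files)) {
      discover(child, remainingDepth - 1, files, subPrefixes);
    }
  }

  // "Executor" phase: list each discovered sub-prefix in parallel.
  public List<String> listAll(String root, int maxDepth) {
    List<String> files = new ArrayList<>();
    List<String> subPrefixes = new ArrayList<>();
    discover(root, maxDepth, files, subPrefixes);
    files.addAll(subPrefixes.parallelStream()
        .flatMap(p -> listPrefix(p).stream())
        .collect(Collectors.toList()));
    return files;
  }
}
```

The point of the sketch is the invariant: files found above the depth limit are collected during discovery, and every deeper file falls under exactly one sub-prefix, so the union matches a single flat listing of the root. Because each sub-prefix ends in "/", a prefix such as a=1/ cannot accidentally match keys under a=10/.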

Testing

  • TestRemoveOrphanFilesAction is already parameterized over usePrefixListing, so the full suite exercises the parallel path (186 tests, all passing, against current main).
  • Real-world: on an S3-backed table with ~40M files across hundreds of thousands of partitions, prefix_listing => true previously hung indefinitely on the driver in listDirRecursivelyWithFileIO. With this change the listing completed (~10k sub-prefix tasks across 30 executors, finishing in minutes) and the job proceeded to delete the orphan files.

Notes

  • Targets Spark 3.5 first; happy to replicate to 3.4 / 4.0 / 4.1 once the approach looks right.
  • Highly concurrent listPrefix can trigger S3 503 throttling; raising s3.retry.num-retries / s3.retry.max-wait-ms mitigates it. Could add a docs note if useful.
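As a rough illustration of the throttling mitigation, the sketch below builds Spark conf entries for the two retry properties named above. It assumes the properties are passed as catalog properties through the spark.sql.catalog.<name>. prefix; the catalog name my_catalog and the values 10 and 60000 are placeholders, not recommendations, and the helper class is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class S3RetryConfSketch {
  // Builds Spark conf entries that forward the retry properties to a catalog.
  // Assumption: catalog properties are set via "spark.sql.catalog.<name>.<property>".
  public static Map<String, String> retryConf(String catalog, int numRetries, long maxWaitMs) {
    String base = "spark.sql.catalog." + catalog + ".";
    Map<String, String> conf = new LinkedHashMap<>();
    conf.put(base + "s3.retry.num-retries", Integer.toString(numRetries));
    conf.put(base + "s3.retry.max-wait-ms", Long.toString(maxWaitMs));
    return conf;
  }

  public static void main(String[] args) {
    // Placeholder catalog name and values; tune them against observed 503 rates.
    retryConf("my_catalog", 10, 60_000L)
        .forEach((key, value) -> System.out.println("--conf " + key + "=" + value));
  }
}
```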

The prefix_listing path enumerated the entire table on a single driver thread
via one listPrefix iterator, so remove_orphan_files could hang on large
object-store tables before ever reaching the deletion phase. Distribute the
listing across executors the way the Hadoop path already does: discover shallow
sub-prefixes on the driver, then list each sub-prefix in parallel with
listPrefix using a broadcast SerializableTable. Gated by a new
parallel-prefix-listing option (default true), with the prior serial path as
the fallback.
@github-actions github-actions Bot added the spark label Jun 22, 2026
Successfully merging this pull request may close these issues.

Spark: RemoveOrphanFiles prefix_listing enumerates the whole table serially on the driver
