
Support retrying non-finished async tasks on startup and periodically #2003


Open
wants to merge 3 commits into main

Conversation

danielhumanmod (Contributor)

Fix #774

Context

Polaris uses async tasks to perform operations such as table and manifest file cleanup. These tasks are executed asynchronously in a separate thread within the same JVM, and retries are handled inline within the task execution. However, this mechanism does not guarantee eventual execution in the following cases:

  • The task fails repeatedly and hits the maximum retry limit.
  • The service crashes or shuts down before retrying.

Implementation

Persist failed tasks and introduce a retry mechanism that runs at Polaris startup and via periodic background checks. The changes include the following (a rough sketch of the recovery flow appears after this list):

  1. Metastore Layer:
    • Exposes a new API getMetaStoreManagerMap
    • Ensures LAST_ATTEMPT_START_TIME is set whenever a task entity is created; this is important for timeout-based filtering when loading tasks from the metastore via loadTasks(), so that multiple executors do not pick up the same task
  2. TaskRecoveryManager: New class responsible for task recovery logic, including:
    • Constructing executionPolarisCallContext
    • Loading tasks from metastore
    • Triggering task execution
  3. QuarkusTaskExecutorImpl: Hook into application lifecycle to initiate task recovery.
  4. Task Retry Strategy: Failed tasks remain persisted in the metastore and are retried by the recovery manager.
  5. Tests: Adjusted existing tests and added new coverage for recovery behavior.
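
As a reviewer aid, here is a hedged sketch of how these pieces could fit together, assuming a per-realm collection of metastore managers exposed through the new getMetaStoreManagerMap API. The types below (Task, MetaStoreManager, TaskExecutor) are simplified stand-ins, not the real Polaris classes, and the method names are illustrative only.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

interface Task {
  long id();
}

interface MetaStoreManager {
  // Returns tasks that are not currently leased by any executor, or whose
  // lease has timed out; returned tasks are leased to the given executor.
  List<Task> loadTasks(String executorId);
}

interface TaskExecutor {
  void execute(String realmId, Task task);
}

final class TaskRecoveryManagerSketch {
  private final TaskExecutor executor;
  private final String executorId;

  TaskRecoveryManagerSketch(TaskExecutor executor, String executorId) {
    this.executor = executor;
    this.executorId = executorId;
  }

  // Called once at startup and again by each periodic background check.
  void recover(Iterator<Map.Entry<String, MetaStoreManager>> metaStoreManagers) {
    while (metaStoreManagers.hasNext()) {
      Map.Entry<String, MetaStoreManager> entry = metaStoreManagers.next();
      String realmId = entry.getKey();
      // Leasing happens inside loadTasks, so tasks returned here are owned
      // by this executor until their lease times out again.
      for (Task task : entry.getValue().loadTasks(executorId)) {
        executor.execute(realmId, task);
      }
    }
  }
}
```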

Recommended Review Order

  1. Metastore Layer related code
  2. TaskRecoveryManager
  3. QuarkusTaskExecutorImpl and TaskExecutorImpl
  4. Task cleanup handlers
  5. Tests

@@ -147,6 +151,8 @@ public void testTableCleanup() throws IOException {

handler.handleTask(task, callContext);

timeSource.add(Duration.ofMinutes(10));
@danielhumanmod (Contributor Author), Jul 6, 2025:

@adnanhemani, continuing the previous comment here:

Can you explain this further - I'm not sure why the tests need this 10m jump? Is it so that tasks are "recovered" by the Quarkus Scheduled method?

metaStoreManager.loadTasks fetches available tasks from the metastore, meaning tasks that are either not leased by any executor or whose lease has already timed out (after 5 minutes).

In this test, the new tasks are created but not executed by their parent task, so to ensure they are fetched we need to simulate a time gap longer than 5 minutes.
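
To make the timing concrete, the snippet below models the lease-timeout filtering with a hypothetical 5-minute lease. The field and method names are assumptions for illustration; only the behavior mirrors the explanation above.

```java
import java.time.Duration;
import java.time.Instant;

final class LeaseFilterSketch {
  static final Duration LEASE_TIMEOUT = Duration.ofMinutes(5);

  // A task is "available" if it has never been leased, or its lease expired.
  static boolean isAvailable(Instant lastAttemptStartTime, Instant now) {
    return lastAttemptStartTime == null
        || now.isAfter(lastAttemptStartTime.plus(LEASE_TIMEOUT));
  }

  public static void main(String[] args) {
    Instant leasedAt = Instant.parse("2025-07-06T00:00:00Z");
    // Immediately after leasing, the task is hidden from other executors.
    System.out.println(isAvailable(leasedAt, leasedAt)); // false
    // After the simulated 10-minute jump the lease has timed out,
    // so loadTasks() would return the task again.
    System.out.println(isAvailable(leasedAt, leasedAt.plus(Duration.ofMinutes(10)))); // true
  }
}
```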

@@ -243,6 +244,11 @@ public synchronized EntityCache getOrCreateEntityCache(RealmContext realmContext
return entityCacheMap.get(realmContext.getRealmIdentifier());
}

@Override
public Iterator<Map.Entry<String, PolarisMetaStoreManager>> getMetaStoreManagerMap() {
@danielhumanmod (Contributor Author):

Made this return Iterator<Map.Entry> as https://github.com/apache/polaris/pull/1585/files#r2121942046 suggests.
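
For context, one plausible shape for such an accessor, sketched with a stand-in manager type rather than the real PolarisMetaStoreManager, and assuming the per-realm managers live in an internal map keyed by realm identifier:

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class ManagerRegistrySketch {
  interface MetaStoreManagerStandIn {}

  // Hypothetical internal map of realm identifier -> metastore manager.
  private final Map<String, MetaStoreManagerStandIn> metaStoreManagerMap =
      new ConcurrentHashMap<>();

  Iterator<Map.Entry<String, MetaStoreManagerStandIn>> getMetaStoreManagerMap() {
    // Expose an iterator over an unmodifiable view so callers can walk the
    // per-realm managers without mutating the underlying map.
    return Collections.unmodifiableMap(metaStoreManagerMap).entrySet().iterator();
  }
}
```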

@@ -188,6 +193,9 @@ private Stream<TaskEntity> getManifestTaskStream(
.withData(
new ManifestFileCleanupTaskHandler.ManifestCleanupTask(
tableEntity.getTableIdentifier(), TaskUtils.encodeManifestFile(mf)))
.withLastAttemptExecutorId(executorId)
.withAttemptCount(1)
@danielhumanmod (Contributor Author):

@adnanhemani, continuing the previous discussion here:

How can we assume this?

This new task (ManifestFileCleanupTask) is created by the current task (TableCleanupTask) and will be executed immediately at the end of it.
Since this is its first execution attempt, we set attemptCount to 1 here.
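
To make that lifecycle concrete, here is a hedged sketch of how the attempt-tracking properties could evolve over a task's life. The type and method names are illustrative stand-ins, not the real TaskEntity API; only the builder properties shown in the diff are taken from the change itself.

```java
import java.time.Instant;

final class TaskEntitySketch {
  String lastAttemptExecutorId;
  int attemptCount;
  Instant lastAttemptStartTime;

  // First creation: the parent task runs the subtask immediately, so the
  // creating executor leases it right away with attemptCount = 1.
  static TaskEntitySketch newSubtask(String executorId, Instant now) {
    TaskEntitySketch t = new TaskEntitySketch();
    t.lastAttemptExecutorId = executorId;
    t.attemptCount = 1;
    t.lastAttemptStartTime = now; // enables the timeout filtering in loadTasks()
    return t;
  }

  // Later recovery attempt, possibly by another executor: take over the lease
  // and bump the attempt count.
  void recordRecoveryAttempt(String executorId, Instant now) {
    this.lastAttemptExecutorId = executorId;
    this.attemptCount += 1;
    this.lastAttemptStartTime = now;
  }
}
```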

configurationStore,
clock);
EntitiesResult entitiesResult =
metaStoreManager.loadTasks(polarisCallContext, executorId, PageToken.readEverything());
@danielhumanmod (Contributor Author):

@adnanhemani, continuing the previous comment here:

I'm not sure I'm understanding the logic here: we are asking for 20 tasks here - but what if there are more than 20 tasks that need recovery?

Good catch; we should update this to read all pending tasks.

Successfully merging this pull request may close these issues: Task handling is incomplete (#774)