Skip to content

Conversation

beilCrxmarkets
Copy link
Contributor

Enhance Flexibility in MaxRetriesExceededHandler Configuration

This pull request introduces modifications to the MaxRetriesFailureHandler to enhance the flexibility and action possibilities within the maxRetriesExceededHandler. Previously, invoking executionOperations.stop() universally, even when a maxRetriesExceededHandler was defined, limited the potential actions, such as task rescheduling.
Key Changes:

  • The maxRetriesExceededHandler can now decide independently whether to remove the task, allowing for more nuanced handling (e.g., rescheduling tasks).
  • These modifications represent a breaking change, affecting the behavior of the MaxRetriesFailureHandler. With no defined handler, the behavior remains unchanged via a default handler executing executionOperations.stop().
  • We considered adding a RetriesFailureHandler, which would function similarly to MaxRetriesFailureHandler but without automatic invocation of executionOperations.stop(). However, to avoid user confusion on choosing between handlers, this was not pursued.

Fixes

#538

Reminders

  • Added/ran automated tests
  • Update README and/or examples
  • Ran mvn spotless:apply

cc @kagkarlsson

@beilCrxmarkets beilCrxmarkets changed the title Improve flexibility of the MaxRetriesFailureHandler feat: Improve flexibility of the MaxRetriesFailureHandler Nov 29, 2024
Comment on lines 96 to 99
executionOperations.stop();
maxRetriesExceededHandler.accept(executionComplete, executionOperations);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unsure about the original intent of the maxRetriesExceededHandler. The handler should not have taken executionOperations as a parameter since the execution had already been stopped.

@kagkarlsson
Copy link
Owner

What is your use-case if I might ask? (what operation rather than stop() are you considering)

@beilCrxmarkets
Copy link
Contributor Author

What is your use-case if I might ask? (what operation rather than stop() are you considering)

If the executionOperation is not stoped jet, we could pass new FailureHandler.OnFailureReschedule<Void>(schedule)::onFailure or new FailureHandler.OnFailureRetryLater<Void>(schedule)::onFailure for the maxRetriesExceededHandler.

Our RecurringTask is triggered once a day. If this fails, we want to repeat it. If we exceed the maxRetires, we still want the recurring task to be triggered the next day.

@kagkarlsson
Copy link
Owner

Not sure I understand. Why are you using MaxRetriesFailureHandler if you do not want the "max"-behavior?

You can always implement your own fully custom failurehandler..

@beilCrxmarkets
Copy link
Contributor Author

Not sure I understand. Why are you using MaxRetriesFailureHandler if you do not want the "max"-behavior?

You can always implement your own fully custom failurehandler..

At the moment we use our own implemented FailureHandler to get the desired behavior.

However, I assume that we are not the only ones who need the described behavior, which is why it makes sense to store it in a central location.

I agree with you that the name “MaxRetriesFailureHandler” implies that after the maximum number of attempts is reached, the instance is terminated. We have therefore also considered proposing to add an additional FailureHandler with the name “RetriesFailureHandler”. However, the existence of a “MaxRetriesFailureHandler” and a “RetriesFailureHandler” can be confusing for beginners at first glance, as the difference is not 100% obvious.

If we were to start this project from scratch, it would make sense to add only the “RetriesFailureHandler” instead of the “MaxRetriesFailureHandler”, as the latter offers more flexibility. Since the library and thus the “MaxRetriesFailureHandler” have been around for a while and are used by several users as they are, the compromise would be to adapt the “MaxRetriesFailureHandler” accordingly so that it offers more flexibility.

@kagkarlsson
Copy link
Owner

Are you looking for behavior like:

  • up until X retries, use this FailureHandler
  • after X retries, use this other FailureHandler
    ?

@beilCrxmarkets
Copy link
Contributor Author

Are you looking for behavior like:

  • up until X retries, use this FailureHandler
  • after X retries, use this other FailureHandler
    ?

Yes, our task will be triggered once a day. If this fails, we want to try again 5 times shortly after (RetriesFailureHandler). After all 5 attempts are used up, we want to try again tomorrow at the usual time (OnFailureReschedule).

However, with the current implementation of the MaxRetriesFailureHandler, we cannot do this because the task is deleted from the DB after the 5 attempts are exhausted.

We also combine this with the ExponentialBackoffFailureHandler to get a delay between the 5 additional attempts. This does not cause any problems.

@beilCrxmarkets
Copy link
Contributor Author

@kagkarlsson How should we continue with this topic?

As stated in #538 (comment) @adamalexandru4 has the same issue.

@kagkarlsson
Copy link
Owner

kagkarlsson commented Mar 8, 2025

I don't want to add this functionality in to MaxRetries...

And it is not meant to handle a secondary FailureHandler, the signature is:

    public MaxRetriesFailureHandler(
        int maxRetries,
        FailureHandler<T> failureHandler,
        BiConsumer<ExecutionComplete, ExecutionOperations<T>> maxRetriesExceededHandler) {
      this.maxRetries = maxRetries;
      this.failureHandler = failureHandler;
      this.maxRetriesExceededHandler = maxRetriesExceededHandler;
    }

i.e. BiConsumer<ExecutionComplete, ExecutionOperations<T>>

We might add a warning if the type supplied is actually a FailureHandler?

(It might be there is a better name for MaxRetries..., not sure but since you both made that assumption...)

Your requirement I think should be captured in its own handler. If you want to supply a standard one for the project that is fine 👍. But what should it be called 😅. CompositeFailureHandler.primary(handler1).afterTries(5, handler2) (.afterTime(Duration.ofHours(1), handler3))` (just toying with ideas)

@beilCrxmarkets
Copy link
Contributor Author

I have reverted the changes made to the MaxRetriesFailureHandler and introduced a new SequenceFailureHandler equipped with builder functionality.

The SequenceFailureHandler facilitates the creation of a sequence involving two FailureHandlers. Initially, the primaryFailureHandler is invoked on the first failure, followed by the secondaryFailureHandler for subsequent failures.

While we can consider alternate names, I believe "sequence" aptly describes the concept of two failure handlers operating sequentially. In contrast, "composite" suggests a combination, which may not convey the intended functionality.

However, even with this setup, we cannot achieve the desired behavior. The limitation arises because the MaxRetriesFailureHandler within the SequenceFailureHandler removes the task execution. As a workaround, I've introduced an optional parameter primaryHandlerRetryCount to allow retries for the primaryFailureHandler.

This can create the wanted behavior via

SequenceFailureHandler.builder()
  .primary(new ExponentialBackoffFailureHandler(Duration.ofSeconds(3), 1.0))
  .afterTries(5, OnFailureRescheduleUsingTaskDataSchedule())
  .build()

Despite this solution, I remain uncertain if it constitutes the optimal API, given its inherent limitations.
An alternative approach might be the adoption of a list instead of the primaryFailureHandler, illustrated as follows:

  class SequenceFailureHandler<T> {

    private final List<FailureHandler<T>> failureHandlerSequence = new ArrayList<>();
    private final FailureHandler<T> fallbackFailureHandler;

    public SequenceFailureHandler(
        List<FailureHandler<T>> failureHandlerSequence, FailureHandler<T> fallbackFailureHandler) {
      this.failureHandlerSequence.addAll(failureHandlerSequence);
      this.fallbackFailureHandler = fallbackFailureHandler;
    }

    public void onFailure(
        ExecutionComplete executionComplete, ExecutionOperations<T> executionOperations) {
      int consecutiveFailures = executionComplete.getExecution().consecutiveFailures;
      if (consecutiveFailures < failureHandlerSequence.size()) {
        FailureHandler<T> failureHandler = failureHandlerSequence.get(consecutiveFailures);
        failureHandler.onFailure(executionComplete, executionOperations);
      } else {
        fallbackFailureHandler.onFailure(executionComplete, executionOperations);
      }
    }

    public static <T> List<FailureHandler<T>> times(int times, FailureHandler<T> failureHandler) {
      List<FailureHandler<T>> failureHandlers = new ArrayList<>();
      for (int i = 0; i < times; i++) {
        failureHandlers.add(failureHandler);
      }
      return failureHandlers;
    }
  }

This can be implemented as demonstrated below:

  private static void main() {
    new SequenceFailureHandler<>(
      times(5, new ExponentialBackoffFailureHandler<>(Duration.ofSeconds(3), 1.0)),
      new OnFailureReschedule<>(new CronSchedule("*****")));
  }

What do you think?

@beilCrxmarkets
Copy link
Contributor Author

I've discovered that our current approach will only partially resolve our issues. We need specific error handling behavior:

Retry the operation 5 times with a delay of 3 seconds between attempts. If the operation still fails, reschedule it for the next execution time, typically the following day, and repeat the process.

Currently, the consecutiveFailures count does not reset to 0 after rescheduling to the next day. As a result, if there is a failure the next day, the task is rescheduled immediately instead of attempting it 5 times.

You don't encounter this issue with the MaxRetriesFailureHandler, as it terminates the task when a certain number of consecutive failures is reached.

This problem isn't resolved by the SequenceFailureHandler, which utilizes a list. Once the end of the list is reached, it simply triggers the fallbackFailureHandler, causing the task to be rescheduled immediately.

I believe we need a LoopFailureHandler which accepts a list of FailureHandler objects and starts over with the first handler once the end of the list is reached.

Alternatively, we could implement a retry functionality using the modulo/remainder operator as follows:

    public void onFailure(
        final ExecutionComplete executionComplete,
        final ExecutionOperations<T> executionOperations) {
      int consecutiveFailures = executionComplete.getExecution().consecutiveFailures;
      int totalNumberOfFailures = consecutiveFailures + 1;
      if (totalNumberOfFailures % maxRetries == 0) {
        LOG.error(
            "Execution has failed {} times for task instance {}. Cancelling execution.",
            totalNumberOfFailures,
            executionComplete.getExecution().taskInstance);
        executionOperations.stop();
        maxRetriesExceededHandler.accept(executionComplete, executionOperations);
      } else {
        this.failureHandler.onFailure(executionComplete, executionOperations);
      }
    }

@JBabinskas
Copy link

JBabinskas commented Apr 24, 2025

Hey,

@kagkarlsson cool library, using it instead of good old Quartz and loving it!

On topic, I think this current MaxRetriesFailureHandler implementation is indeed confusing because it not only confused me, but I went around and asked a few colleagues in the office what would happen if one tried to pass any FailureHandler as argument for maxRetriesExceededHandler. Well, no one said that it would end in an error because MaxRetriesFailureHandler always executes executionOperations.stop();. I think this is limiting the MaxRetriesFailureHandler flexibility for no reason. It is extremely common case to retry some nightly job for X amount of times at night, then give up, and just re-schedule it for next night. And MaxRetriesFailureHandler would fit perfectly for that, but I also recognize that changing this behavior would basically be a breaking change, so not sure what would be the best approach here. Currently for those retry than re-schedule cases we just have a custom FailureHandler where I basically copy pasted MaxRetriesFailureHandler and removed the forced executionOperations.stop(); 😅 I think part of confusion comes from that one might look at MaxRetriesFailureHandler and have an idea that it is a terminating handler, but for most people it just means "retry X amount of times, if all failed, then do X" especially when it takes in something named as maxRetriesExceededHandler. Even if it is a breaking change I would vote that MaxRetriesFailureHandler would be changed, giving more freedom for the user to decide what to do, including terminating the job completely.

@beilCrxmarkets
Copy link
Contributor Author

@kagkarlsson I agree with @JBabinskas.

We should adjust the MaxRetriesFailureHandler. There are multiple ways. How should we proceed?

@omarfi
Copy link

omarfi commented Jun 16, 2025

We have the exact same use case.

I also solved it by using a custom "max" retries handler:

@Override
    public FailureHandler<OrderFulfilmentTaskData> getFailureHandler() {
        return (executionComplete, executionOperations) -> {
            if (executionComplete.getExecution().consecutiveFailures < maxRetries) {
                new OnFailureRetryLater<OrderFulfilmentTaskData>(Duration.ofSeconds(fixedDelayMinutes)).onFailure(executionComplete, executionOperations);
            } else {
                executionOperations.removeAndScheduleNew(TASK_DESCRIPTOR
                        .instance(executionComplete.getExecution().taskInstance.getId())
                        .data((OrderFulfilmentTaskData) executionComplete.getExecution().taskInstance.getData())
                        .scheduledTo(Instant.now().plus(Duration.ofSeconds(24))));
            }
        };
    }

I'm also hoping that the library can be shipped with a "CompositeFailureHandler" as proposed in the thread which would be composed of RetryLater and Reschedule- strategy).

Nice library btw.

@beilCrxmarkets
Copy link
Contributor Author

@kagkarlsson How do we proceed?

As you can see, the problem affects several users.

@tstavinoha
Copy link

My team came to a similar conclusion as some commenters in this thread.
MaxRetries handler stops/removes the execution, making it impossible to reschedule and thus handle this scenario.

Already in this thread we can find different workarounds being used, with using modulos, rescheduling, etc., and there could be other dirtier hacks in the wild - for what is actually a fairly common use case.

While we can and do implement a custom failure handler, having first-class support for this scenario without needing to reinvent the wheel and work around the internal details of the library would be a big improvement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants