Feature Request: Support Parallel Job Execution in Clustered / Multi-Instance Deployments #25606

@Osama94

Summary

The default ABP Background Job Manager uses a single distributed lock (AbpBackgroundJobWorker) to ensure only one server executes the job polling loop at a time. While this correctly prevents duplicate job execution, it also means that all worker instances except one are completely idle during the lock-hold period. In a production environment with multiple application instances, this results in severe under-utilization of available infrastructure and becomes a significant bottleneck for high-throughput workloads.

We propose supporting parallel job execution while preserving FIFO (First-In, First-Out) dispatch order — jobs are always picked up in the order they were enqueued (by priority, then by creation time), but multiple jobs can execute concurrently across servers without blocking each other.


Current Behavior

According to the documentation:

"The default background job manager works as FIFO in a single thread and uses a distributed lock to ensure jobs are executed only in a single application instance at a time."

The root cause is in BackgroundJobWorkerService.DoWorkAsync, where the entire polling cycle is wrapped inside a single distributed lock:

// Current implementation (simplified)
await using (await DistributedLock.TryAcquireAsync(options.DistributedLockName))
{
    // ALL job fetching + execution happens here
    // → every other server is blocked for the entire duration
}

Given a 4-server production deployment:

  1. Server A acquires the AbpBackgroundJobWorker distributed lock.
  2. Server A fetches and executes a batch of jobs (up to MaxJobFetchCount, default 1000).
  3. Servers B, C, and D sit completely idle, waiting for the lock to be released.
  4. Only after Server A finishes its entire execution cycle (which may take seconds or minutes) do the other servers compete for the lock.
  5. During high-load scenarios, the job queue grows faster than a single server can drain it.

This effectively turns a multi-server deployment into a single-server execution model and negates the benefits of horizontal scaling. That includes Kubernetes HPA: it can add pods, but under the current design the extra pods add no background job throughput.


Problem Breakdown

1. No parallel execution across instances

The entire BackgroundJobWorkerService polling cycle is wrapped in a single distributed lock. This prevents any form of work-sharing between instances, even when there are hundreds of queued jobs and idle servers.

2. No per-job-level locking

Instead of locking at the individual job level (which would allow multiple servers to each process different jobs concurrently), the lock is held for the entire worker tick, serializing all execution across the cluster.

3. No configuration option for execution strategy

There is currently no built-in way to configure whether job execution should be:

  • Exclusive (current behavior — one server runs at a time)
  • Parallel (multiple servers each pick up and execute different jobs simultaneously)
  • Hybrid (e.g., limit to N concurrent workers across the cluster)

4. Degraded throughput in time-sensitive workloads

Applications that rely on background jobs for near-real-time processing (e.g., notifications, order processing, report generation) are severely impacted when the single active worker becomes a bottleneck.


Desired Behavior

Jobs are always fetched and dispatched in FIFO order (ordered by priority ASC, then CreationTime ASC — matching the current store query), but multiple servers participate in draining the queue concurrently.

Execution flow (proposed)

[Job Queue — FIFO dispatch order]
  Job #1 (oldest / highest priority)  ──► Server A acquires per-job lock → executes
  Job #2                               ──► Server B acquires per-job lock → executes
  Job #3                               ──► Server C acquires per-job lock → executes
  Job #4                               ──► Server D acquires per-job lock → executes
  Job #5 ... waits for next free slot

Jobs are dispatched strictly in FIFO order. Servers race to claim each job via a per-job distributed lock. The first server to acquire a job's lock executes it; others skip and move to the next unclaimed job in the queue.
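
To make the claim race concrete, here is a minimal, self-contained simulation. It is only a sketch: ClaimRaceDemo is a hypothetical name, and a ConcurrentDictionary stands in for the per-job distributed lock (TryAdd succeeds for exactly one caller per key, much like a lock acquired with a zero timeout). No ABP or Redis APIs are involved.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Four simulated servers drain the same FIFO-ordered list of ten jobs.
var jobs = Enumerable.Range(1, 10).ToList();
var claims = await ClaimRaceDemo.RunAsync(jobs, serverCount: 4);
foreach (var (jobId, server) in claims.OrderBy(c => c.Key))
    Console.WriteLine($"Job #{jobId} executed by server {server}");

public static class ClaimRaceDemo
{
    // Returns jobId -> name of the simulated server that won the claim.
    public static async Task<IReadOnlyDictionary<int, string>> RunAsync(
        IReadOnlyList<int> fifoOrderedJobIds, int serverCount)
    {
        // Stand-in for the per-job distributed lock (an assumption of this sketch).
        var claimedBy = new ConcurrentDictionary<int, string>();

        var servers = Enumerable.Range(0, serverCount).Select(i => Task.Run(async () =>
        {
            var name = ((char)('A' + i)).ToString();
            foreach (var jobId in fifoOrderedJobIds)   // every server walks the list top-down
            {
                if (!claimedBy.TryAdd(jobId, name))
                    continue;                          // lost the race: move to the next job
                await Task.Delay(10);                  // simulated job execution
            }
        }));

        await Task.WhenAll(servers);
        return claimedBy;
    }
}
```

Which server wins each job varies from run to run, but every job ends up with exactly one owner, and each server always tries the oldest unclaimed job first.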

FIFO guarantee — clarification

Aspect                     | Guarantee
---------------------------|----------
Dispatch order             | ✅ Jobs are always picked up in enqueue order (priority + CreationTime)
Execution start order      | ✅ Earlier jobs are always attempted first on every server
Execution completion order | ❌ Not guaranteed: a later job may finish before an earlier one (inherent in any parallel system)

This is the same model used by Hangfire, Quartz, RabbitMQ, and most other production-grade distributed job queues. If strict completion-order guarantees are required, the current sequential default behavior already satisfies that use case.


Describe the Solution

Option A — Per-job distributed locking (preferred)

Move the distributed lock from the worker polling cycle to the individual job execution level. Each server independently polls for pending jobs in FIFO order and attempts to acquire a per-job lock before executing.

This also requires a small addition to IBackgroundJobStore: a SetAsRunningAsync(Guid jobId) method to mark a claimed job as in-progress and exclude it from future polls before the lock is released. Without this, a slow job could be re-fetched and re-attempted by other servers during its execution window.

// Proposed: each server polls independently, with no global worker lock
// NOTE: in a real worker the semaphore should be a field shared across polling cycles
var semaphore = new SemaphoreSlim(options.MaxConcurrentJobsPerInstance);

var pendingJobs = await _store.GetWaitingJobsAsync(
    maxCount: options.MaxJobFetchCount
    // ordered by: Priority ASC, CreationTime ASC  ← FIFO preserved
);

foreach (var job in pendingJobs)
{
    // Back-pressure first: wait for a free slot before claiming anything,
    // so a claimed job never sits idle while marked as running
    await semaphore.WaitAsync();

    var lockKey = $"AbpBackgroundJob:{job.Id}";
    var handle = await _distributedLock.TryAcquireAsync(lockKey, timeout: TimeSpan.Zero);
    if (handle == null)
    {
        semaphore.Release(); // another server already claimed this job
        continue;
    }

    // Mark as running so the job is excluded from future polls
    await _store.SetAsRunningAsync(job.Id);

    _ = Task.Run(async () =>
    {
        // Hold the per-job lock until execution finishes, then free the slot
        await using (handle)
        {
            try { await ExecuteJobAsync(job); }
            finally { semaphore.Release(); }
        }
    });
}

Key properties:

  • FIFO order preserved — all servers query the same ordered list and attempt jobs from the top down.
  • No global bottleneck — lock scope is a single job, not the entire worker cycle.
  • No duplicate execution — per-job distributed lock + Running store status flag together guarantee exactly-once execution.
  • Back-pressure enforced: SemaphoreSlim ensures no instance spawns unbounded concurrent tasks regardless of queue depth.
  • Backward compatible — existing retry, timeout, and failure logic in BackgroundJobExecuter is unchanged.

⚠️ Required interface change: IBackgroundJobStore needs a new SetAsRunningAsync(Guid jobId) method. This is a minor but necessary addition — without it, long-running jobs remain in Waiting state and risk being double-claimed by other polling servers.
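
As a rough illustration of what the store-side change means, the sketch below uses a hypothetical in-memory store. It is not ABP's actual IBackgroundJobStore: InMemoryJobStore, JobRecord, JobStatus, and the Task<bool> return type are all assumptions made for this example, and priority ordering is left out for brevity. The point is only that a job marked as Running drops out of later polls and cannot be claimed a second time.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

var store = new InMemoryJobStore();
var first = store.Enqueue();
var second = store.Enqueue();
await store.SetAsRunningAsync(first);
var waiting = await store.GetWaitingJobsAsync(maxCount: 10);
Console.WriteLine(waiting.Count); // the running job no longer appears in the poll result

public enum JobStatus { Waiting, Running }

public sealed class JobRecord
{
    public JobRecord(Guid id, DateTime creationTime) { Id = id; CreationTime = creationTime; }
    public Guid Id { get; }
    public DateTime CreationTime { get; }
    public JobStatus Status { get; set; } = JobStatus.Waiting;
}

// Hypothetical, simplified store used only for illustration.
public sealed class InMemoryJobStore
{
    private readonly object _gate = new();
    private readonly List<JobRecord> _jobs = new();

    public Guid Enqueue()
    {
        var job = new JobRecord(Guid.NewGuid(), DateTime.UtcNow);
        lock (_gate) { _jobs.Add(job); }
        return job.Id;
    }

    // Only Waiting jobs are returned, oldest first (OrderBy is stable for equal timestamps).
    public Task<List<JobRecord>> GetWaitingJobsAsync(int maxCount)
    {
        lock (_gate)
        {
            return Task.FromResult(_jobs
                .Where(j => j.Status == JobStatus.Waiting)
                .OrderBy(j => j.CreationTime)
                .Take(maxCount)
                .ToList());
        }
    }

    // Returns false if the job is unknown or already running, so a second claim fails.
    public Task<bool> SetAsRunningAsync(Guid jobId)
    {
        lock (_gate)
        {
            var job = _jobs.FirstOrDefault(j => j.Id == jobId);
            if (job is null || job.Status != JobStatus.Waiting) return Task.FromResult(false);
            job.Status = JobStatus.Running;
            return Task.FromResult(true);
        }
    }
}
```

A database-backed store would need the same check-and-set to be atomic, for example a conditional update that only succeeds while the row is still in the Waiting state.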

Option B — Configurable execution strategy via AbpBackgroundJobWorkerOptions

Introduce a new option to let developers choose the execution model:

Configure<AbpBackgroundJobWorkerOptions>(options =>
{
    // Choose: Exclusive (default), Parallel
    options.ExecutionStrategy = JobExecutionStrategy.Parallel;

    // Max concurrent jobs this instance can execute at once (default: 1)
    options.MaxConcurrentJobsPerInstance = 10;
});
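
For illustration, the option surface could look roughly like the sketch below. Nothing here exists in ABP today: JobExecutionStrategy, JobWorkerOptionsSketch, and EffectiveConcurrency are hypothetical names, and the class is deliberately standalone rather than a modified AbpBackgroundJobWorkerOptions. The defaults are chosen so that an unconfigured application keeps the current behavior.

```csharp
using System;

var options = new JobWorkerOptionsSketch
{
    ExecutionStrategy = JobExecutionStrategy.Parallel,
    MaxConcurrentJobsPerInstance = 10
};
Console.WriteLine(options.EffectiveConcurrency);

// Hypothetical enum mirroring the strategy names used in this proposal.
public enum JobExecutionStrategy { Exclusive, Parallel }

// Hypothetical options type; not part of ABP.
public sealed class JobWorkerOptionsSketch
{
    private int _maxConcurrentJobsPerInstance = 1;

    // Exclusive is the default so existing deployments keep today's behavior.
    public JobExecutionStrategy ExecutionStrategy { get; set; } = JobExecutionStrategy.Exclusive;

    public int MaxConcurrentJobsPerInstance
    {
        get => _maxConcurrentJobsPerInstance;
        set
        {
            if (value < 1)
                throw new ArgumentOutOfRangeException(nameof(value), "Must be at least 1.");
            _maxConcurrentJobsPerInstance = value;
        }
    }

    // Number of jobs this instance may run at once under the selected strategy.
    public int EffectiveConcurrency =>
        ExecutionStrategy == JobExecutionStrategy.Exclusive ? 1 : MaxConcurrentJobsPerInstance;
}
```

Rejecting values below 1 in the setter surfaces a misconfiguration at startup instead of leaving a worker that can never take a job.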

Option C — Configurable worker concurrency per instance

Even without cross-instance coordination, allow each instance to execute multiple jobs in parallel internally:

Configure<AbpBackgroundJobWorkerOptions>(options =>
{
    options.MaxConcurrentJobsPerInstance = Environment.ProcessorCount;
});

Required Changes Summary

Area                          | Change
------------------------------|-------
BackgroundJobWorkerService    | Replace global lock with per-job lock + semaphore dispatch loop
IBackgroundJobStore           | Add SetAsRunningAsync(Guid jobId) method
AbpBackgroundJobWorkerOptions | Add IsParallelExecutionEnabled (default: false) and MaxConcurrentJobsPerInstance (default: 1)
Documentation                 | Update background jobs and clustered deployment docs
Tests                         | Add unit/integration tests for parallel dispatch, FIFO ordering, and exactly-once execution

Acceptance Criteria

  • AbpBackgroundJobWorkerOptions.IsParallelExecutionEnabled = false preserves the current behavior exactly (no regression).
  • When IsParallelExecutionEnabled = true, multiple application instances each execute different jobs concurrently.
  • Jobs are always fetched and dispatched in FIFO order (priority ASC, then CreationTime ASC) regardless of execution mode.
  • No job is executed more than once, even when multiple instances poll simultaneously (exactly-once guarantee via per-job lock + Running status).
  • MaxConcurrentJobsPerInstance caps the number of concurrently executing jobs per instance, preventing unbounded task spawning.
  • IBackgroundJobStore is extended with SetAsRunningAsync and all existing store implementations are updated accordingly.
  • Unit and integration tests cover: parallel dispatch correctness, FIFO ordering, duplicate-execution prevention, and semaphore back-pressure behavior.
  • Documentation updated for both the background jobs page and the clustered deployment guide.

Real-World Scenario

Setup: 4 application servers behind a load balancer (or 4 Kubernetes pods with HPA), using Redis for distributed locking.

Workload: 500 queued background jobs, each taking ~2 seconds to execute.

Execution model                           | Estimated completion time
------------------------------------------|--------------------------
Current (single lock, single thread)      | ~1000 seconds (~17 min)
Parallel across 4 servers, 1 thread each  | ~250 seconds (~4 min)
Parallel across 4 servers, 4 threads each | ~63 seconds (~1 min)

The difference is stark for production workloads. Notably, Kubernetes HPA provides zero throughput benefit for background jobs under the current model — pods scale up but only one ever processes jobs at a time.
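
These estimates follow from a simple idealized model: total work (500 jobs × 2 s = 1000 s) divided by the number of concurrent executors. The helper below is a hypothetical sketch of that arithmetic (ThroughputEstimate is not an ABP type). It assumes perfect load balancing and ignores polling intervals, lock round-trips, and I/O contention, so real numbers will be somewhat higher.

```csharp
using System;

Console.WriteLine(ThroughputEstimate.Seconds(500, 2.0, servers: 1, threadsPerServer: 1));
Console.WriteLine(ThroughputEstimate.Seconds(500, 2.0, servers: 4, threadsPerServer: 1));
Console.WriteLine(ThroughputEstimate.Seconds(500, 2.0, servers: 4, threadsPerServer: 4));

public static class ThroughputEstimate
{
    // Idealized drain time: total work divided by the number of concurrent executors.
    public static double Seconds(int jobCount, double secondsPerJob, int servers, int threadsPerServer)
    {
        if (servers < 1 || threadsPerServer < 1)
            throw new ArgumentOutOfRangeException();
        return jobCount * secondsPerJob / (servers * threadsPerServer);
    }
}
```

The three calls correspond to 1000, 250, and 62.5 seconds. The last value is rounded to ~63 seconds in the scenario.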


Impact

  • High for applications running in clustered/Kubernetes environments
  • High for applications with time-sensitive background processing
  • Medium for applications using the default background job manager instead of Hangfire/Quartz integrations (which already handle this natively)

Notes

  • This issue applies specifically to the default Volo.Abp.BackgroundJobs manager.
  • Third-party integrations (Hangfire, Quartz, RabbitMQ, TickerQ) handle concurrency natively and are not affected.
  • A workaround today is to switch to Hangfire or Quartz, but the default manager should be production-grade for clustered deployments out of the box.
  • Backward compatibility must be maintained — Exclusive (current behavior) should remain the default, with Parallel as opt-in.
