# Feature Request: Support Parallel Job Execution in Clustered / Multi-Instance Deployments (#25606)

## Summary

The default ABP Background Job Manager uses a single distributed lock (`AbpBackgroundJobWorker`) to ensure only one server executes the job polling loop at a time. While this correctly prevents duplicate job execution, it also means that **all worker instances except one are completely idle** during the lock-hold period. In a production environment with multiple application instances, this results in severe under-utilization of available infrastructure and becomes a significant bottleneck for high-throughput workloads.

We propose supporting **parallel job execution while preserving FIFO (First-In, First-Out) dispatch order** — jobs are always picked up in the order they were enqueued (by priority, then by creation time), but multiple jobs can execute concurrently across servers without blocking each other.

---

## Current Behavior

According to the [documentation](https://abp.io/docs/latest/framework/infrastructure/background-jobs):

> The default background job manager works as **FIFO in a single thread** and uses a **distributed lock** to ensure jobs are executed only in a single application instance at a time.

The root cause is in [`BackgroundJobWorkerService.DoWorkAsync`](https://github.com/abpframework/abp/blob/dev/framework/src/Volo.Abp.BackgroundJobs/Volo/Abp/BackgroundJobs/BackgroundJobWorkerService.cs), where the **entire polling cycle** is wrapped inside a single distributed lock:

```csharp
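// Current implementation (simplified)
await using (var handle = await DistributedLock.TryAcquireAsync(options.DistributedLockName))
{
    if (handle != null)
    {
        // ALL job fetching + execution happens here
        // → every other server is blocked for the entire duration
    }
    // handle == null: another instance holds the lock, so this tick does nothing
}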
```

Given a 4-server production deployment:

1. Server A acquires the `AbpBackgroundJobWorker` distributed lock.
2. Server A fetches and executes a batch of jobs (up to `MaxJobFetchCount`, default 1000).
3. Servers B, C, and D cannot acquire the lock and sit **completely idle** until it is released.
4. Only after Server A finishes its entire execution cycle (which may take seconds or minutes) do the other servers compete for the lock.
5. During high-load scenarios, the job queue grows faster than a single server can drain it.

This effectively turns a multi-server deployment into a **single-server execution model**, negating the horizontal scaling benefits — including Kubernetes HPA, which will scale pods but gain **zero throughput improvement** for background job processing under the current design.

---

## Problem Breakdown

### 1. No parallel execution across instances
The entire `BackgroundJobWorkerService` polling cycle is wrapped in a single distributed lock. This prevents any form of work-sharing between instances, even when there are hundreds of queued jobs and idle servers.

### 2. No per-job-level locking
Instead of locking at the **individual job** level (which would allow multiple servers to each process different jobs concurrently), the lock is held for the **entire worker tick**, serializing all execution across the cluster.

### 3. No configuration option for execution strategy
There is currently no built-in way to configure whether job execution should be:
- **Exclusive** (current behavior — one server runs at a time)
- **Parallel** (multiple servers each pick up and execute different jobs simultaneously)
- **Hybrid** (e.g., limit to N concurrent workers across the cluster)

### 4. Degraded throughput in time-sensitive workloads
Applications that rely on background jobs for near-real-time processing (e.g., notifications, order processing, report generation) are severely impacted when the single active worker becomes a bottleneck.

---

## Desired Behavior

Jobs are always **fetched and dispatched in FIFO order** (ordered by priority ASC, then `CreationTime` ASC — matching the current store query), but multiple servers participate in draining the queue concurrently.

### Execution flow (proposed)

```
[Job Queue — FIFO dispatch order]
  Job #1 (oldest / highest priority)  ──► Server A acquires per-job lock → executes
  Job #2                               ──► Server B acquires per-job lock → executes
  Job #3                               ──► Server C acquires per-job lock → executes
  Job #4                               ──► Server D acquires per-job lock → executes
  Job #5 ... waits for next free slot
```

Jobs are dispatched strictly in FIFO order. Servers race to claim each job via a per-job distributed lock. The first server to acquire a job's lock executes it; others skip and move to the next unclaimed job in the queue.
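
The claim race described above can be sketched in memory. This is a minimal illustration, not ABP code: `ClaimRaceSketch` is a hypothetical name, and `ConcurrentDictionary.TryAdd` stands in for a per-job distributed lock, since it succeeds for exactly one caller per key.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical in-memory stand-in for per-job distributed locking.
public static class ClaimRaceSketch
{
    // Returns a map of job id -> name of the server that claimed it.
    public static async Task<IReadOnlyDictionary<int, string>> RunAsync(int jobCount, string[] servers)
    {
        var claims = new ConcurrentDictionary<int, string>();
        var queue = Enumerable.Range(1, jobCount).ToArray(); // already in FIFO order

        // Every server walks the same ordered list and tries to claim each job.
        var workers = servers.Select(server => Task.Run(() =>
        {
            foreach (var jobId in queue)
            {
                if (claims.TryAdd(jobId, server))
                {
                    // Claimed: a real worker would execute the job here.
                }
                // Otherwise another server holds the claim; move to the next job.
            }
        }));

        await Task.WhenAll(workers);
        return claims;
    }
}
```

Because the simulated jobs take no time, a single fast server may claim most of the queue. With real execution time and a per-instance concurrency cap, the claims would spread across servers.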

### FIFO guarantee — clarification

| Aspect | Guarantee |
|---|---|
| Dispatch order | ✅ Jobs are always picked up in enqueue order (priority + CreationTime) |
| Execution start order | ✅ Earlier jobs are always attempted first on every server |
| Execution completion order | ❌ Not guaranteed — a later job may finish before an earlier one (inherent in any parallel system) |

Hangfire, Quartz, and RabbitMQ-based consumers make the same trade-off: ordered dispatch with no guarantee on completion order. If strict completion-order guarantees are required, the current sequential default behavior already satisfies that use case.

---

## Describe the Solution

### Option A — Per-job distributed locking (preferred)

Move the distributed lock from the worker polling cycle to the **individual job execution level**. Each server independently polls for pending jobs in FIFO order and attempts to acquire a per-job lock before executing.

This also requires a small addition to `IBackgroundJobStore`: a `SetAsRunningAsync(Guid jobId)` method to mark a claimed job as in-progress and exclude it from future polls before the lock is released. Without this, a slow job could be re-fetched and re-attempted by other servers during its execution window.

```csharp
// Proposed: each server polls independently, with no global worker lock.
// In a real implementation the semaphore would be a long-lived field shared across polling ticks.
var semaphore = new SemaphoreSlim(options.MaxConcurrentJobsPerInstance);

var pendingJobs = await _store.GetWaitingJobsAsync(
    maxCount: options.MaxJobFetchCount
    // ordered by: Priority ASC, CreationTime ASC  ← FIFO preserved
);

foreach (var job in pendingJobs)
{
    // Back-pressure first: claim a job only once this instance has a free slot
    await semaphore.WaitAsync();

    var lockKey = $"AbpBackgroundJob:{job.Id}";
    // No `await using` here: the handle must outlive this iteration and is
    // released by the task below once the job has finished
    var handle = await _distributedLock.TryAcquireAsync(lockKey, timeout: TimeSpan.Zero);
    if (handle == null)
    {
        semaphore.Release(); // another server already claimed this job
        continue;
    }

    _ = Task.Run(async () =>
    {
        try
        {
            // Mark as running before executing, to exclude the job from future polls
            await _store.SetAsRunningAsync(job.Id);
            await ExecuteJobAsync(job);
        }
        finally
        {
            await handle.DisposeAsync(); // release the per-job lock only after execution
            semaphore.Release();
        }
    });
}
```

Key properties:
- **FIFO order preserved** — all servers query the same ordered list and attempt jobs from the top down.
- **No global bottleneck** — lock scope is a single job, not the entire worker cycle.
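- **No duplicate execution**: the per-job distributed lock and the `Running` store status flag together prevent two servers from executing the same job at the same time. Strict exactly-once semantics additionally require that a server re-check the job's status after acquiring the lock, since its fetched list may be stale.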
- **Back-pressure enforced** — `SemaphoreSlim` ensures no instance spawns unbounded concurrent tasks regardless of queue depth.
- **Backward compatible** — existing retry, timeout, and failure logic in `BackgroundJobExecuter` is unchanged.

> ⚠️ **Required interface change:** `IBackgroundJobStore` needs a new `SetAsRunningAsync(Guid jobId)` method. This is a minor but necessary addition — without it, long-running jobs remain in `Waiting` state and risk being double-claimed by other polling servers.
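
To illustrate why the status flag matters, here is a minimal in-memory sketch. All type names (`SketchJobStatus`, `SketchJob`, `InMemoryJobStoreSketch`) are hypothetical and are not part of ABP. Only `SetAsRunningAsync(Guid jobId)` mirrors the method proposed above, and the ordering is reduced to `CreationTime` for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public enum SketchJobStatus { Waiting, Running }

public sealed class SketchJob
{
    public Guid Id { get; init; }
    public DateTime CreationTime { get; init; }
    public SketchJobStatus Status { get; set; } = SketchJobStatus.Waiting;
}

// Hypothetical in-memory store; not the real IBackgroundJobStore.
public sealed class InMemoryJobStoreSketch
{
    private readonly object _gate = new();
    private readonly List<SketchJob> _jobs = new();

    public void Add(SketchJob job)
    {
        lock (_gate) { _jobs.Add(job); }
    }

    // Only jobs still in the Waiting state are returned, oldest first.
    public Task<List<SketchJob>> GetWaitingJobsAsync(int maxResultCount)
    {
        lock (_gate)
        {
            return Task.FromResult(_jobs
                .Where(j => j.Status == SketchJobStatus.Waiting)
                .OrderBy(j => j.CreationTime)
                .Take(maxResultCount)
                .ToList());
        }
    }

    // Proposed addition: flip the job to Running so later polls skip it.
    public Task SetAsRunningAsync(Guid jobId)
    {
        lock (_gate)
        {
            _jobs.Single(j => j.Id == jobId).Status = SketchJobStatus.Running;
        }
        return Task.CompletedTask;
    }
}
```

A database-backed store would presumably implement the same idea as a status column update, and would also need a way to recover jobs left in `Running` by a crashed instance.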

### Option B — Configurable execution strategy via `AbpBackgroundJobWorkerOptions`

Introduce a new option to let developers choose the execution model:

```csharp
Configure<AbpBackgroundJobWorkerOptions>(options =>
{
    // Choose: Exclusive (default), Parallel
    options.ExecutionStrategy = JobExecutionStrategy.Parallel;

    // Max concurrent jobs this instance can execute at once (default: 1)
    options.MaxConcurrentJobsPerInstance = 10;
});
```
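
One possible shape for the types behind this configuration is sketched below. Neither `JobExecutionStrategy` nor these properties exist in ABP today, and `ParallelJobWorkerOptionsSketch` is a hypothetical stand-in for the new members on `AbpBackgroundJobWorkerOptions`. The defaults follow the values stated in this proposal.

```csharp
using System;

// Hypothetical shapes for illustration only.
public enum JobExecutionStrategy
{
    Exclusive, // current behavior: one instance drains the queue at a time
    Parallel   // proposed: instances claim individual jobs concurrently
}

public class ParallelJobWorkerOptionsSketch
{
    private int _maxConcurrentJobsPerInstance = 1;

    // Exclusive stays the default so existing deployments keep their behavior.
    public JobExecutionStrategy ExecutionStrategy { get; set; } = JobExecutionStrategy.Exclusive;

    // Rejects values below 1, which would leave the worker unable to run anything.
    public int MaxConcurrentJobsPerInstance
    {
        get => _maxConcurrentJobsPerInstance;
        set => _maxConcurrentJobsPerInstance = value >= 1
            ? value
            : throw new ArgumentOutOfRangeException(nameof(value), "Must be at least 1.");
    }
}
```

The Required Changes Summary names a boolean `IsParallelExecutionEnabled` instead of an enum. The enum form leaves room for the hybrid mode mentioned under Problem Breakdown, while the boolean is simpler; either would need to be settled before implementation.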

### Option C — Configurable worker concurrency per instance

Even without cross-instance coordination, allow each instance to execute multiple jobs in parallel internally:

```csharp
Configure<AbpBackgroundJobWorkerOptions>(options =>
{
    options.MaxConcurrentJobsPerInstance = Environment.ProcessorCount;
});
```
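
As a rough sketch of what in-instance concurrency could look like, the helper below runs a list of jobs with a fixed cap and reports the highest concurrency it observed. `BoundedRunnerSketch` is a hypothetical name, and the jobs are plain `Func<Task>` delegates rather than ABP job types.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedRunnerSketch
{
    // Runs all jobs with at most maxConcurrency executing at once and
    // returns the peak number of jobs that were running simultaneously.
    public static async Task<int> RunAsync(IEnumerable<Func<Task>> jobs, int maxConcurrency)
    {
        using var semaphore = new SemaphoreSlim(maxConcurrency);
        var gate = new object();
        var running = 0;
        var peak = 0;
        var tasks = new List<Task>();

        foreach (var job in jobs)
        {
            await semaphore.WaitAsync(); // dispatch pauses while all slots are busy
            tasks.Add(Task.Run(async () =>
            {
                lock (gate) { running++; peak = Math.Max(peak, running); }
                try { await job(); }
                finally
                {
                    lock (gate) { running--; }
                    semaphore.Release(); // free the slot for the next job
                }
            }));
        }

        await Task.WhenAll(tasks);
        return peak;
    }
}
```

Jobs are still started in list order, so dispatch order is preserved. Only completion order can vary.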

---

## Required Changes Summary

| Area | Change |
|---|---|
| `BackgroundJobWorkerService` | Replace global lock with per-job lock + semaphore dispatch loop |
| `IBackgroundJobStore` | Add `SetAsRunningAsync(Guid jobId)` method |
| `AbpBackgroundJobWorkerOptions` | Add `IsParallelExecutionEnabled` (default: `false`) and `MaxConcurrentJobsPerInstance` (default: `1`) |
| Documentation | Update background jobs and clustered deployment docs |
| Tests | Add unit/integration tests for parallel dispatch, FIFO ordering, and exactly-once execution |

---

## Acceptance Criteria

- [ ] `AbpBackgroundJobWorkerOptions.IsParallelExecutionEnabled = false` preserves the current behavior exactly (no regression).
- [ ] When `IsParallelExecutionEnabled = true`, multiple application instances each execute different jobs concurrently.
- [ ] Jobs are always fetched and dispatched in FIFO order (priority ASC, then `CreationTime` ASC) regardless of execution mode.
- [ ] No job is executed more than once, even when multiple instances poll simultaneously (exactly-once guarantee via per-job lock + `Running` status).
- [ ] `MaxConcurrentJobsPerInstance` caps the number of concurrently executing jobs per instance, preventing unbounded task spawning.
- [ ] `IBackgroundJobStore` is extended with `SetAsRunningAsync` and all existing store implementations are updated accordingly.
- [ ] Unit and integration tests cover: parallel dispatch correctness, FIFO ordering, duplicate-execution prevention, and semaphore back-pressure behavior.
- [ ] Documentation updated for both the background jobs page and the clustered deployment guide.

---

## Real-World Scenario

**Setup:** 4 application servers behind a load balancer (or 4 Kubernetes pods with HPA), using Redis for distributed locking.

**Workload:** 500 queued background jobs, each taking ~2 seconds to execute.

| Execution model | Estimated completion time |
|---|---|
| Current (single lock, single thread) | ~1000 seconds (~17 min) |
| Parallel across 4 servers, 1 thread each | ~250 seconds (~4 min) |
| Parallel across 4 servers, 4 threads each | ~63 seconds (~1 min) |

These are ideal-case estimates that ignore polling intervals and lock overhead, but the difference is stark for production workloads. Notably, **Kubernetes HPA provides no throughput benefit** for background jobs under the current model: pods scale up, but only one processes jobs at a time.
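
The table's arithmetic can be reproduced with a small helper. This is a back-of-the-envelope model under the assumption that jobs run in equal-length "waves"; `DrainTimeEstimate` is a hypothetical name.

```csharp
using System;

public static class DrainTimeEstimate
{
    // Ideal-case estimate: jobs run in waves of (servers * slotsPerServer).
    // Ignores polling intervals, lock round-trips, and uneven job durations.
    public static double Seconds(int jobs, double secondsPerJob, int servers, int slotsPerServer)
    {
        var parallelism = servers * slotsPerServer;
        var waves = (int)Math.Ceiling(jobs / (double)parallelism);
        return waves * secondsPerJob;
    }
}
```

For 500 jobs at 2 seconds each, this gives 1000 seconds for one worker and 250 seconds for 4 servers with one slot each. With 4 slots per server it gives 64 seconds, because the last partial wave still takes a full 2 seconds; the table's ~63 comes from dividing 1000 by 16 directly.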

---

## Impact

- **High** for applications running in clustered/Kubernetes environments
- **High** for applications with time-sensitive background processing
- **Medium** for applications using the default background job manager instead of Hangfire/Quartz integrations (which already handle this natively)

---

## Notes

- This issue applies specifically to the **default** `Volo.Abp.BackgroundJobs` manager.
- Third-party integrations (Hangfire, Quartz, RabbitMQ, TickerQ) handle concurrency natively and are not affected.
- A workaround today is to switch to Hangfire or Quartz, but the default manager should be production-grade for clustered deployments out of the box.
- Backward compatibility must be maintained: the current exclusive behavior should remain the default, with parallel execution as an opt-in.

---

## References

- Background Jobs docs: https://abp.io/docs/latest/framework/infrastructure/background-jobs
- Distributed locking docs: https://abp.io/docs/latest/framework/infrastructure/distributed-locking
- Clustered deployment docs: https://abp.io/docs/latest/deployment/clustered-environment
- Relevant source file: https://github.com/abpframework/abp/blob/dev/framework/src/Volo.Abp.BackgroundJobs/Volo/Abp/BackgroundJobs/BackgroundJobWorkerService.cs
