
[LTS 9.4] CVE-2025-21786 #406


Draft
wants to merge 1 commit into base: ciqlts9_4

Conversation

pvts-mat
Contributor

@pvts-mat pvts-mat commented Jul 10, 2025

[LTS 9.4]
CVE-2025-21786
VULN-54096

Problem

https://access.redhat.com/security/cve/CVE-2025-21786

A vulnerability was found in the Linux kernel's work queue subsystem, which manages background task execution. The issue stems from improper handling of the "rescuer" thread during the cleanup of unbound work queues.

Background

The workqueue subsystem allows kernel code to defer tasks for asynchronous execution - the "generic async execution mechanism", as kernel/workqueue.c's header comment puts it.

A piece of work to be executed is called a work item. It is represented by a simple struct work_struct, coupling the function that defines the job with some additional data:

struct work_struct {
    atomic_long_t data;
    struct list_head entry;
    work_func_t func;
#ifdef CONFIG_LOCKDEP
    struct lockdep_map lockdep_map;
#endif
};

Work items are submitted through the API to work queues

struct workqueue_struct {

The type of work queue a work item is put on determines how it will be executed.

From there they are distributed to internal pool work queues

struct pool_workqueue {

where they await execution by kernel threads called workers. These are easily observed with any process-listing tool such as ps or top, e.g.

# ps -e w | grep kworker/

      7 ?        I      0:00 [kworker/0:0-events]
      8 ?        I<     0:00 [kworker/0:0H-events_highpri]
      9 ?        I      0:00 [kworker/u22:0-events_unbound]
     11 ?        I      0:00 [kworker/u22:1-events_unbound]
     19 ?        I      0:00 [kworker/0:1-events]
     25 ?        I      0:00 [kworker/1:0-rcu_gp]
     26 ?        I<     0:00 [kworker/1:0H-events_highpri]
     31 ?        I      0:00 [kworker/2:0-events]
…

The workers are gathered in work pools

struct worker_pool {

Each work pool has a single pool work queue and zero or more workers associated with it. Each CPU has two work pools assigned: one for normal work items and the other for high-priority ones. Apart from the CPU-bound pools there are also unbound work pools (with the unbound work queues mentioned in the CVE), whose number is dynamic. This variety of work pools exists to balance the tradeoff between high locality of execution (and thus efficiency) with the CPU-bound pools and much simpler load balancing with the unbound ones.

It's possible for the work items in a work pool to become deadlocked. For this reason the work queue contains a rescue worker

struct worker *rescuer; /* MD: rescue worker */

which can pick up any work item from the work pool, break the deadlock and push execution forward. The rescuer's thread function rescuer_thread is the subject of the CVE's fix e769461 in the mainline kernel.

Analysis

The bug

Following the KASAN logs from https://lore.kernel.org/lkml/CAKHoSAvP3iQW+GwmKzWjEAOoPvzeWeoMO0Gz7Pp3_4kxt-RMoA@mail.gmail.com/ it can be seen that the use-after-free scenario unfolded as follows:

  1. The rescuer thread released the pool workqueue with put_pwq(…) at

    put_pwq(pwq);

    trusting, per the comment

    * Put the reference grabbed by send_mayday(). @pool won't
    * go away while we're still attached to it.

    that the pool associated with this workqueue would still be around at the moment of the worker_detach_from_pool(…) call at

    worker_detach_from_pool(rescuer);

  2. Simultaneously, some regular worker from the same pool released its reference as well

    Last potentially related work creation:
     kasan_save_stack+0x24/0x50 mm/kasan/common.c:47
     __kasan_record_aux_stack+0x8c/0xa0 mm/kasan/generic.c:541
     __call_rcu_common.constprop.0+0x6a/0xad0 kernel/rcu/tree.c:3086
     put_unbound_pool+0x552/0x830 kernel/workqueue.c:4965
     pwq_release_workfn+0x4c6/0x9e0 kernel/workqueue.c:5065
     kthread_worker_fn+0x2b9/0xb00 kernel/kthread.c:844
     kthread+0x2c2/0x3a0 kernel/kthread.c:389
     ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147
     ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
    

    at

    work->func(work);

    reducing the pool workqueue's ref count to 0 and scheduling it for destruction.

  3. The pool workqueue, guarded by the Read-Copy-Update (RCU) mechanism, was destroyed soon after by the idle task 0, along with its worker pool:

    Freed by task 0:
     kasan_save_stack+0x24/0x50 mm/kasan/common.c:47
     kasan_save_track+0x14/0x30 mm/kasan/common.c:68
     kasan_save_free_info+0x3a/0x60 mm/kasan/generic.c:579
     poison_slab_object mm/kasan/common.c:247 [inline]
     __kasan_slab_free+0x38/0x50 mm/kasan/common.c:264
     kasan_slab_free include/linux/kasan.h:230 [inline]
     slab_free_hook mm/slub.c:2342 [inline]
     slab_free mm/slub.c:4579 [inline]
     kfree+0x212/0x4a0 mm/slub.c:4727
     rcu_do_batch kernel/rcu/tree.c:2567 [inline]
     rcu_core+0x835/0x17f0 kernel/rcu/tree.c:2823
     handle_softirqs+0x1b1/0x7d0 kernel/softirq.c:554
     __do_softirq kernel/softirq.c:588 [inline]
     invoke_softirq kernel/softirq.c:428 [inline]
     __irq_exit_rcu kernel/softirq.c:637 [inline]
     irq_exit_rcu+0x94/0xc0 kernel/softirq.c:649
     instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1049 [inline]
     sysvec_apic_timer_interrupt+0x70/0x80 arch/x86/kernel/apic/apic.c:1049
     asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702
    
  4. The rescuer thread continued execution, hitting the worker_detach_from_pool(…) call, which attempted to remove the rescuer worker from the workers list of a pool which no longer existed

    __dump_stack lib/dump_stack.c:94 [inline]
    dump_stack_lvl+0x116/0x1b0 lib/dump_stack.c:120
    print_address_description mm/kasan/report.c:377 [inline]
    print_report+0xcb/0x620 mm/kasan/report.c:488
    kasan_report+0xbd/0xf0 mm/kasan/report.c:601
    __list_del include/linux/list.h:195 [inline]
    __list_del_entry include/linux/list.h:218 [inline]
    list_del include/linux/list.h:229 [inline]
    detach_worker+0x164/0x180 kernel/workqueue.c:2709
    worker_detach_from_pool kernel/workqueue.c:2728 [inline]
    rescuer_thread+0x69d/0xcd0 kernel/workqueue.c:3526
    kthread+0x2c2/0x3a0 kernel/kthread.c:389
    ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147
    ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
    

    See

    list_del(&worker->node);

    and the read/write operations in list_del's underlying implementation:

    static inline void __list_del(struct list_head * prev, struct list_head * next)
    {
        next->prev = prev;
        WRITE_ONCE(prev->next, next);
    }

The fix

The core of the fix is moving the put_pwq(…) call after the worker_detach_from_pool(…) call to ensure the pool's ref count remains greater than zero at the moment of detaching the rescuer from it. Before:

/*
* Put the reference grabbed by send_mayday(). @pool won't
* go away while we're still attached to it.
*/
put_pwq(pwq);
/*
* Leave this pool. Notify regular workers; otherwise, we end up
* with 0 concurrency and stalling the execution.
*/
kick_pool(pool);
raw_spin_unlock_irq(&pool->lock);
worker_detach_from_pool(rescuer);
raw_spin_lock_irq(&wq_mayday_lock);

After:

/*
* Leave this pool. Notify regular workers; otherwise, we end up
* with 0 concurrency and stalling the execution.
*/
kick_pool(pool);
raw_spin_unlock_irq(&pool->lock);
worker_detach_from_pool(rescuer);
/*
* Put the reference grabbed by send_mayday(). @pool might
* go away any time after it.
*/
put_pwq_unlocked(pwq);
raw_spin_lock_irq(&wq_mayday_lock);

Although the moved call changed to put_pwq_unlocked(…), it is the same put_pwq(…) call, only wrapped in a raw_spin_lock_irq(…) / raw_spin_unlock_irq(…) pair

raw_spin_lock_irq(&pwq->pool->lock);
put_pwq(pwq);
raw_spin_unlock_irq(&pwq->pool->lock);

This can be seen even more clearly in the original proposal of the fix given by Tejun Heo on the mailing list https://lore.kernel.org/lkml/[email protected]/:

+		/*
+		 * Put the reference grabbed by send_mayday(). This must come
+		 * after the final access of the pool.
+		 */
+		raw_spin_lock_irq(&pool->lock);
+		put_pwq(pwq);
+		raw_spin_unlock_irq(&pool->lock);

This wrapping was not necessary before because pool->lock was already held at the time of the put_pwq(pwq) call, see

raw_spin_lock_irq(&pool->lock);

Applicability: no

The affected file kernel/workqueue.c is unconditionally compiled into every kernel

signal.o sys.o umh.o workqueue.o pid.o task_work.o \

so it's part of any LTS 9.4 build regardless of the configuration used.

However, the CVE-2025-21786 bug fixed by the e769461 patch does not apply to the code found at the ciqlts9_4 revision, and applying the patch, while not harmful at the functional level, shouldn't be done. The arguments are listed below.

The "fixes" commit is missing from the LTS 9.4 history

The e769461 fix names 68f8305 as the commit introducing the bug. That commit is missing from the LTS 9.4 history of kernel/workqueue.c, nor was it backported - see workqueue-history.txt.

Commit e769461's message explicitly blames changes introduced in 68f8305:

The commit 68f8305("workqueue: Reap workers via kthread_stop() and remove detach_completion") adds code to reap the normal workers but mistakenly does not handle the rescuer and also removes the code waiting for the rescuer in put_unbound_pool(), which caused a use-after-free bug reported by Cheung Wall.

The "code waiting for the rescuer" removed in 68f8305 is present in the ciqlts9_4 revision:

if (pool->detach_completion)
wait_for_completion(pool->detach_completion);

The put_pwq(…) call is not placed randomly

Examining git history shows that the authors of the workqueue mechanism - Lai Jiangshan and Tejun Heo - took great care to place the grab/put functions in proper places. See commit 77668c8 which introduced the put_pwq(…) call

workqueue: fix a possible race condition between rescuer and pwq-release

There is a race condition between rescuer_thread() and
pwq_unbound_release_workfn().

Even after a pwq is scheduled for rescue, the associated work items
may be consumed by any worker.  If all of them are consumed before the
rescuer gets to them and the pwq's base ref was put due to attribute
change, the pwq may be released while still being linked on
@wq->maydays list making the rescuer dereference already freed pwq
later.

Make send_mayday() pin the target pwq until the rescuer is done with
it.

(In fact, this commit pre-emptively fixed the CVE-2025-21786 bug (not a CVE back then), which only re-surfaced after the 68f8305 commit - it addresses the same problem.)

Commit 13b1d62, in turn, dealt with the placement of worker_detach_from_pool(…) call and explicitly related it to the put_pwq(…) call:

workqueue: move rescuer pool detachment to the end

In 51697d393922 ("workqueue: use generic attach/detach routine for
rescuers"), The rescuer detaches itself from the pool before put_pwq()
so that the put_unbound_pool() will not destroy the rescuer-attached
pool.

It is unnecessary.  worker_detach_from_pool() can be used as the last
statement to access to the pool just like the regular workers,
put_unbound_pool() will wait for it to detach and then free the pool.

So we move the worker_detach_from_pool() down, make it coincide with
the regular workers.

It is only the "put_unbound_pool() will wait for it to detach" part that turned false after the introduction of 68f8305, which, again, is not present in LTS 9.4.

Using the patched version is not without any cost

From the short bug and fix analysis above it should be clear that applying the CVE-2025-21786 patch is just a matter of holding a reference a little longer, so it could seem harmless to apply it "just in case". However, putting aside the nonzero uncertainty about the harmlessness of this treatment, doing so requires unnecessary locking/unlocking of &pwq->pool->lock around the put_pwq(pwq) call (see the fix). In general it is better to avoid unnecessary locks: they hurt performance and can introduce deadlock scenarios not present before.

Red Hat's "Affected" classification doesn't hold much weight

The counter-argument to not backporting the patch could be Red Hat listing "Red Hat Enterprise Linux 9" as "Affected" on the CVE-2025-21786 bug's page https://access.redhat.com/security/cve/CVE-2025-21786.

However, Red Hat's "Affected" may in practice mean either "affected, confirmed" or "not investigated yet":

Unless explicitly stated as not affected, all previous versions of packages in any minor update stream of a product listed here should be assumed vulnerable, although may not have been subject to full analysis.

This stands in contrast to the "not affected" classification, which does mean only "not affected, confirmed".

@pvts-mat pvts-mat marked this pull request as draft July 10, 2025 15:27
This was referenced Jul 10, 2025
@pvts-mat
Contributor Author

The "draft" status is only to prevent accidental merge, the PR is ready for review.

@kerneltoast

The reason the pwq refcount was able to hit zero was because the initial pwq reference was put in apply_wqattrs_cleanup(). This happened because a task changed the implicated workqueue's CPU affinity mask by writing to /sys/devices/virtual/workqueue/WQ_NAME/cpumask, which triggers a pwq replacement. After the new pwqs are committed, the old ones are freed by apply_wqattrs_cleanup() putting those initial references.

So for the issue to occur, the following must happen at around the same time:

  • There is a worker running from inside a workqueue's rescuer kthread. Only workqueues with WQ_MEM_RECLAIM have a rescuer kthread, and even then the rescuer kthread is only used as a fallback to guarantee forward progress of the workqueue's workers when memory pressure is high. There aren't many workqueues with WQ_MEM_RECLAIM and even then it is rare for a worker to hit the rescuer kthread.
  • There is a task writing to /sys/devices/virtual/workqueue/WQ_NAME/cpumask for that workqueue that has a worker running in the rescuer kthread.
  • The last reference on the pwq must be put by either the worker in the rescuer kthread, or apply_wqattrs_cleanup() quickly enough to get the pwq freed before the rescuer kthread is done using it.
  • At least one RCU grace period must elapse after the last pwq reference is put so that the kfree_rcu() RCU callback can run and actually kfree the pwq. And this must occur before the rescuer kthread finishes using the pwq.

This can be triggered under high memory pressure while writing to /sys/devices/virtual/workqueue/WQ_NAME/cpumask and hammering the CPU running the rescuer kthread for WQ_NAME, I guess.

I don't think we should bother picking this, since the Fixes commit was introduced in 6.11 and wasn't backported to any stable kernels. The CVE fix itself is only present on 6.12+ kernels upstream, so I think it's safe to say we don't need to bother with this.
