Commit bf99b36

sched/rt: Fix race in push_rt_task
jira LE-3460
Rebuild_History Non-Buildable kernel-6.12.0-55.18.1.el10_0
commit-author Harshit Agarwal <[email protected]>
commit 690e47d
Empty-Commit: Cherry-Pick Conflicts during history rebuild.
Will be included in final tarball splat. Ref for failed cherry-pick at:
ciq/ciq_backports/kernel-6.12.0-55.18.1.el10_0/690e47d1.failed
1 parent 65b36c2 commit bf99b36

1 file changed: 244 additions, 0 deletions

@@ -0,0 +1,244 @@
sched/rt: Fix race in push_rt_task

jira LE-3460
Rebuild_History Non-Buildable kernel-6.12.0-55.18.1.el10_0
commit-author Harshit Agarwal <[email protected]>
commit 690e47d1403e90b7f2366f03b52ed3304194c793
Empty-Commit: Cherry-Pick Conflicts during history rebuild.
Will be included in final tarball splat. Ref for failed cherry-pick at:
ciq/ciq_backports/kernel-6.12.0-55.18.1.el10_0/690e47d1.failed

Overview
========
When a CPU chooses to call push_rt_task and picks a task to push to
another CPU's runqueue, it calls find_lock_lowest_rq(), which takes a
double lock on both CPUs' runqueues. If one of the locks isn't readily
available, it may lead to dropping the current runqueue lock and
reacquiring both locks at once. During this window it is possible that
the task has already been migrated and is running on some other CPU.
These cases are already handled. However, if the task has been migrated,
has already finished running, and another CPU is now trying to wake it
up (ttwu) such that it is queued again on a runqueue (on_rq is 1), and
if the task was last run by the same CPU, then the current checks will
pass even though the task was migrated out and is no longer in the
pushable tasks list.

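The sketch below is an editor's illustration, not the verbatim kernel
code: it reduces find_lock_lowest_rq() (kernel/sched/rt.c) to the part
relevant to this bug, and the "_sketch" name is hypothetical. The real
function retries several times and re-checks more conditions, but the
shape of the window is the same: double_lock_balance() may drop rq->lock
before both locks are held, and the pre-fix re-validation afterwards
trusts only the task's recorded CPU and its on_rq state.

/*
 * Reduced sketch of find_lock_lowest_rq(): where the race window opens.
 */
static struct rq *find_lock_lowest_rq_sketch(struct task_struct *task,
					     struct rq *rq)
{
	int cpu = find_lowest_rq(task);
	struct rq *lowest_rq;

	if (cpu == -1 || cpu == rq->cpu)
		return NULL;

	lowest_rq = cpu_rq(cpu);

	/*
	 * double_lock_balance() returns 1 when rq->lock had to be dropped
	 * in order to take both locks in a fixed order: this is the window
	 * in which the task can run, be dequeued and be woken onto another
	 * CPU.
	 */
	if (double_lock_balance(rq, lowest_rq)) {
		/* Pre-fix re-validation: both of these reads can be stale. */
		if (task_cpu(task) != rq->cpu || !task_on_rq_queued(task)) {
			double_unlock_balance(rq, lowest_rq);
			return NULL;
		}
	}

	return lowest_rq;
}
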
Crashes
=======
This bug resulted in quite a few flavors of crashes triggering kernel
panics with various crash signatures such as assert failures, page
faults, null pointer dereferences, and queue corruption errors, all
coming from the scheduler itself.

Some of the crashes:
-> kernel BUG at kernel/sched/rt.c:1616! BUG_ON(idx >= MAX_RT_PRIO)
   Call Trace:
   ? __die_body+0x1a/0x60
   ? die+0x2a/0x50
   ? do_trap+0x85/0x100
   ? pick_next_task_rt+0x6e/0x1d0
   ? do_error_trap+0x64/0xa0
   ? pick_next_task_rt+0x6e/0x1d0
   ? exc_invalid_op+0x4c/0x60
   ? pick_next_task_rt+0x6e/0x1d0
   ? asm_exc_invalid_op+0x12/0x20
   ? pick_next_task_rt+0x6e/0x1d0
   __schedule+0x5cb/0x790
   ? update_ts_time_stats+0x55/0x70
   schedule_idle+0x1e/0x40
   do_idle+0x15e/0x200
   cpu_startup_entry+0x19/0x20
   start_secondary+0x117/0x160
   secondary_startup_64_no_verify+0xb0/0xbb

-> BUG: kernel NULL pointer dereference, address: 00000000000000c0
   Call Trace:
   ? __die_body+0x1a/0x60
   ? no_context+0x183/0x350
   ? __warn+0x8a/0xe0
   ? exc_page_fault+0x3d6/0x520
   ? asm_exc_page_fault+0x1e/0x30
   ? pick_next_task_rt+0xb5/0x1d0
   ? pick_next_task_rt+0x8c/0x1d0
   __schedule+0x583/0x7e0
   ? update_ts_time_stats+0x55/0x70
   schedule_idle+0x1e/0x40
   do_idle+0x15e/0x200
   cpu_startup_entry+0x19/0x20
   start_secondary+0x117/0x160
   secondary_startup_64_no_verify+0xb0/0xbb

-> BUG: unable to handle page fault for address: ffff9464daea5900
   kernel BUG at kernel/sched/rt.c:1861! BUG_ON(rq->cpu != task_cpu(p))

-> kernel BUG at kernel/sched/rt.c:1055! BUG_ON(!rq->nr_running)
   Call Trace:
   ? __die_body+0x1a/0x60
   ? die+0x2a/0x50
   ? do_trap+0x85/0x100
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? do_error_trap+0x64/0xa0
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? exc_invalid_op+0x4c/0x60
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? asm_exc_invalid_op+0x12/0x20
   ? dequeue_top_rt_rq+0xa2/0xb0
   dequeue_rt_entity+0x1f/0x70
   dequeue_task_rt+0x2d/0x70
   __schedule+0x1a8/0x7e0
   ? blk_finish_plug+0x25/0x40
   schedule+0x3c/0xb0
   futex_wait_queue_me+0xb6/0x120
   futex_wait+0xd9/0x240
   do_futex+0x344/0xa90
   ? get_mm_exe_file+0x30/0x60
   ? audit_exe_compare+0x58/0x70
   ? audit_filter_rules.constprop.26+0x65e/0x1220
   __x64_sys_futex+0x148/0x1f0
   do_syscall_64+0x30/0x80
   entry_SYSCALL_64_after_hwframe+0x62/0xc7

-> BUG: unable to handle page fault for address: ffff8cf3608bc2c0
   Call Trace:
   ? __die_body+0x1a/0x60
   ? no_context+0x183/0x350
   ? spurious_kernel_fault+0x171/0x1c0
   ? exc_page_fault+0x3b6/0x520
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? asm_exc_page_fault+0x1e/0x30
   ? _cond_resched+0x15/0x30
   ? futex_wait_queue_me+0xc8/0x120
   ? futex_wait+0xd9/0x240
   ? try_to_wake_up+0x1b8/0x490
   ? futex_wake+0x78/0x160
   ? do_futex+0xcd/0xa90
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? plist_del+0x6a/0xd0
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? dequeue_pushable_task+0x20/0x70
   ? __schedule+0x382/0x7e0
   ? asm_sysvec_reschedule_ipi+0xa/0x20
   ? schedule+0x3c/0xb0
   ? exit_to_user_mode_prepare+0x9e/0x150
   ? irqentry_exit_to_user_mode+0x5/0x30
   ? asm_sysvec_reschedule_ipi+0x12/0x20

Above are some of the common examples of the crashes that were observed
due to this issue.

Details
=======
Let's look at the following scenario to understand this race.

1) CPU A enters push_rt_task
  a) CPU A has chosen next_task = task p.
  b) CPU A calls find_lock_lowest_rq(task p, CPU Z's rq).
  c) CPU A identifies CPU X as a destination CPU (X < Z).
  d) CPU A enters double_lock_balance(CPU Z's rq, CPU X's rq).
  e) Since X is lower than Z, CPU A unlocks CPU Z's rq. Someone else has
     locked CPU X's rq, and thus, CPU A must wait.

2) At CPU Z
  a) The previous task has completed execution, so CPU Z enters
     schedule and locks its own rq after CPU A releases it.
  b) CPU Z dequeues the previous task and begins executing task p.
  c) CPU Z unlocks its rq.
  d) Task p yields the CPU (e.g. by doing IO or waiting to acquire a
     lock), which triggers the schedule function on CPU Z.
  e) CPU Z enters schedule again, locks its own rq, and dequeues task p.
  f) As part of dequeue, it sets p.on_rq = 0 and unlocks its rq.

3) At CPU B
  a) CPU B enters try_to_wake_up with input task p.
  b) Since CPU Z dequeued task p, p.on_rq = 0, and CPU B updates
     p.state = WAKING.
  c) CPU B, via select_task_rq, determines CPU Y as the target CPU.

4) The race
  a) CPU A acquires CPU X's lock and relocks CPU Z.
  b) CPU A reads task p.cpu = Z and incorrectly concludes task p is
     still on CPU Z.
  c) CPU A fails to notice that task p was dequeued from CPU Z while
     CPU A was waiting for locks in double_lock_balance. If CPU A had
     known that task p had been dequeued, it would return NULL, forcing
     push_rt_task to give up task p's migration.
  d) CPU B updates task p.cpu = Y and calls ttwu_queue.
  e) CPU B locks CPU Y's rq. CPU B enqueues task p onto Y and sets task
     p.on_rq = 1.
  f) CPU B unlocks CPU Y's rq, triggering memory synchronization.
  g) CPU A reads task p.on_rq = 1, cementing its assumption that task p
     has not migrated.
  h) CPU A decides to migrate p to CPU X.

This leads to CPU A dequeuing p from CPU Y's queue and various crashes
down the line.
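
In terms of the sketch shown after the Overview, the failure is that the
two halves of the pre-fix re-validation are satisfied by reads that
interleave with CPU B's wakeup at different points, so each looks fine
on its own. A minimal illustration (p and rq are hypothetical locals
standing for task p and CPU Z's runqueue, which CPU A holds locked at
step 4):

/* 4b: p->cpu is read before CPU B's update at 4d becomes visible. */
bool looks_local  = (task_cpu(p) == rq->cpu);

/* 4g: p->on_rq is read after CPU B's enqueue at 4e becomes visible. */
bool looks_queued = task_on_rq_queued(p);

/*
 * Both are true, the pre-fix check passes, and CPU A goes on to push a
 * task that is actually queued on CPU Y's runqueue.
 */
bool stale_pass = looks_local && looks_queued;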

Solution
========
The solution here is fairly simple. After obtaining the lock (at 4a),
the check is enhanced to make sure that the task is still at the head
of the pushable tasks list. If it is not, then it is not suitable for
being pushed out anyway.
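
A sketch of that enhanced re-validation, assuming it reuses the
pick_next_pushable_task() helper that appears in the conflict hunk
further down (the upstream hunk itself is not reproduced in this
reference file, so the exact code may differ):

/*
 * Inside find_lock_lowest_rq(), after double_lock_balance() reports that
 * rq->lock was dropped: instead of trusting per-task fields, ask whether
 * the task is still the head of this rq's pushable_tasks list. If it is
 * not, it has been migrated or requeued and must not be pushed from here.
 */
if (task != pick_next_pushable_task(rq)) {
	double_unlock_balance(rq, lowest_rq);
	lowest_rq = NULL;
}

Because the pushable list is only manipulated under rq->lock, this
comparison cannot be satisfied by the stale cpu/on_rq reads described in
the Details section.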

Testing
=======
The fix was tested on a cluster of 3 nodes, where the panics due to this
issue were hit every couple of days. A fix similar to this one was
deployed on such a cluster and was stable for more than 30 days.

Co-developed-by: Jon Kohler <[email protected]>
Signed-off-by: Jon Kohler <[email protected]>
Co-developed-by: Gauri Patwardhan <[email protected]>
Signed-off-by: Gauri Patwardhan <[email protected]>
Co-developed-by: Rahul Chunduru <[email protected]>
Signed-off-by: Rahul Chunduru <[email protected]>
Signed-off-by: Harshit Agarwal <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: "Steven Rostedt (Google)" <[email protected]>
Reviewed-by: Phil Auld <[email protected]>
Tested-by: Will Ton <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
(cherry picked from commit 690e47d1403e90b7f2366f03b52ed3304194c793)
Signed-off-by: Jonathan Maple <[email protected]>

# Conflicts:
#	kernel/sched/rt.c
diff --cc kernel/sched/rt.c
index 172c588de542,e40422c37033..000000000000
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@@ -1956,26 -1974,6 +1975,29 @@@ static struct rq *find_lock_lowest_rq(s
	return lowest_rq;
}

++<<<<<<< HEAD
+static struct task_struct *pick_next_pushable_task(struct rq *rq)
+{
+	struct task_struct *p;
+
+	if (!has_pushable_tasks(rq))
+		return NULL;
+
+	p = plist_first_entry(&rq->rt.pushable_tasks,
+			      struct task_struct, pushable_tasks);
+
+	BUG_ON(rq->cpu != task_cpu(p));
+	BUG_ON(task_current(rq, p));
+	BUG_ON(p->nr_cpus_allowed <= 1);
+
+	BUG_ON(!task_on_rq_queued(p));
+	BUG_ON(!rt_task(p));
+
+	return p;
+}
+
++=======
++>>>>>>> 690e47d1403e (sched/rt: Fix race in push_rt_task)
/*
 * If the current CPU has more than one RT task, see if the non
 * running task can migrate over to a CPU that is running a task

Unmerged path kernel/sched/rt.c
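
For context around the unmerged hunk above: pick_next_pushable_task() is
the same helper push_rt_task() uses to choose its push candidate, which
is why "still the head of the pushable list" is the natural thing to
re-validate after the lock dance. The outline below is an editor's
simplification of that push path (retry logic, migration-disabled
handling and other details omitted; the "_sketch" name is hypothetical),
not the literal kernel function:

static int push_rt_task_sketch(struct rq *rq)
{
	struct task_struct *next_task = pick_next_pushable_task(rq);
	struct rq *lowest_rq;
	int ret = 0;

	if (!next_task)
		return 0;

	get_task_struct(next_task);

	/* May drop and re-take rq->lock: the race described above lives here. */
	lowest_rq = find_lock_lowest_rq(next_task, rq);
	if (!lowest_rq)
		goto out;

	/* Both runqueue locks held: migrate the task and kick the target CPU. */
	deactivate_task(rq, next_task, 0);
	set_task_cpu(next_task, lowest_rq->cpu);
	activate_task(lowest_rq, next_task, 0);
	resched_curr(lowest_rq);
	ret = 1;

	double_unlock_balance(rq, lowest_rq);
out:
	put_task_struct(next_task);
	return ret;
}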
