sched/rt: Fix race in push_rt_task

jira LE-3460
Rebuild_History Non-Buildable kernel-6.12.0-55.18.1.el10_0
commit-author Harshit Agarwal <[email protected]>
commit 690e47d1403e90b7f2366f03b52ed3304194c793
Empty-Commit: Cherry-Pick Conflicts during history rebuild.
Will be included in final tarball splat. Ref for failed cherry-pick at:
ciq/ciq_backports/kernel-6.12.0-55.18.1.el10_0/690e47d1.failed

Overview
========
When a CPU chooses to call push_rt_task and picks a task to push to
another CPU's runqueue, it calls the find_lock_lowest_rq method, which
takes a double lock on both CPUs' runqueues. If one of the locks isn't
readily available, it may lead to dropping the current runqueue lock
and reacquiring both locks at once. During this window it is possible
that the task has already migrated and is running on some other CPU.
These cases are already handled. However, if the task has migrated,
has already finished running, and another CPU is now trying to wake it
up (ttwu) so that it is queued again on a runqueue (on_rq is 1), and
the task was last run by the same CPU, then the current checks will
pass even though the task was migrated out and is no longer in the
pushable tasks list.

Crashes
=======
This bug resulted in quite a few flavors of crashes triggering kernel
panics with various crash signatures such as assert failures, page
faults, null pointer dereferences, and queue corruption errors, all
coming from the scheduler itself.

Some of the crashes:
-> kernel BUG at kernel/sched/rt.c:1616! BUG_ON(idx >= MAX_RT_PRIO)
   Call Trace:
   ? __die_body+0x1a/0x60
   ? die+0x2a/0x50
   ? do_trap+0x85/0x100
   ? pick_next_task_rt+0x6e/0x1d0
   ? do_error_trap+0x64/0xa0
   ? pick_next_task_rt+0x6e/0x1d0
   ? exc_invalid_op+0x4c/0x60
   ? pick_next_task_rt+0x6e/0x1d0
   ? asm_exc_invalid_op+0x12/0x20
   ? pick_next_task_rt+0x6e/0x1d0
   __schedule+0x5cb/0x790
   ? update_ts_time_stats+0x55/0x70
   schedule_idle+0x1e/0x40
   do_idle+0x15e/0x200
   cpu_startup_entry+0x19/0x20
   start_secondary+0x117/0x160
   secondary_startup_64_no_verify+0xb0/0xbb

-> BUG: kernel NULL pointer dereference, address: 00000000000000c0
   Call Trace:
   ? __die_body+0x1a/0x60
   ? no_context+0x183/0x350
   ? __warn+0x8a/0xe0
   ? exc_page_fault+0x3d6/0x520
   ? asm_exc_page_fault+0x1e/0x30
   ? pick_next_task_rt+0xb5/0x1d0
   ? pick_next_task_rt+0x8c/0x1d0
   __schedule+0x583/0x7e0
   ? update_ts_time_stats+0x55/0x70
   schedule_idle+0x1e/0x40
   do_idle+0x15e/0x200
   cpu_startup_entry+0x19/0x20
   start_secondary+0x117/0x160
   secondary_startup_64_no_verify+0xb0/0xbb

-> BUG: unable to handle page fault for address: ffff9464daea5900
   kernel BUG at kernel/sched/rt.c:1861! BUG_ON(rq->cpu != task_cpu(p))

-> kernel BUG at kernel/sched/rt.c:1055! BUG_ON(!rq->nr_running)
   Call Trace:
   ? __die_body+0x1a/0x60
   ? die+0x2a/0x50
   ? do_trap+0x85/0x100
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? do_error_trap+0x64/0xa0
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? exc_invalid_op+0x4c/0x60
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? asm_exc_invalid_op+0x12/0x20
   ? dequeue_top_rt_rq+0xa2/0xb0
   dequeue_rt_entity+0x1f/0x70
   dequeue_task_rt+0x2d/0x70
   __schedule+0x1a8/0x7e0
   ? blk_finish_plug+0x25/0x40
   schedule+0x3c/0xb0
   futex_wait_queue_me+0xb6/0x120
   futex_wait+0xd9/0x240
   do_futex+0x344/0xa90
   ? get_mm_exe_file+0x30/0x60
   ? audit_exe_compare+0x58/0x70
   ? audit_filter_rules.constprop.26+0x65e/0x1220
   __x64_sys_futex+0x148/0x1f0
   do_syscall_64+0x30/0x80
   entry_SYSCALL_64_after_hwframe+0x62/0xc7

-> BUG: unable to handle page fault for address: ffff8cf3608bc2c0
   Call Trace:
   ? __die_body+0x1a/0x60
   ? no_context+0x183/0x350
   ? spurious_kernel_fault+0x171/0x1c0
   ? exc_page_fault+0x3b6/0x520
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? asm_exc_page_fault+0x1e/0x30
   ? _cond_resched+0x15/0x30
   ? futex_wait_queue_me+0xc8/0x120
   ? futex_wait+0xd9/0x240
   ? try_to_wake_up+0x1b8/0x490
   ? futex_wake+0x78/0x160
   ? do_futex+0xcd/0xa90
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? plist_del+0x6a/0xd0
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? dequeue_pushable_task+0x20/0x70
   ? __schedule+0x382/0x7e0
   ? asm_sysvec_reschedule_ipi+0xa/0x20
   ? schedule+0x3c/0xb0
   ? exit_to_user_mode_prepare+0x9e/0x150
   ? irqentry_exit_to_user_mode+0x5/0x30
   ? asm_sysvec_reschedule_ipi+0x12/0x20

Above are some of the common examples of the crashes that were observed
due to this issue.

Details
=======
Let's look at the following scenario to understand this race.

1) CPU A enters push_rt_task
   a) CPU A has chosen next_task = task p.
   b) CPU A calls find_lock_lowest_rq(task p, CPU Z's rq).
   c) CPU A identifies CPU X as a destination CPU (X < Z).
   d) CPU A enters double_lock_balance(CPU Z's rq, CPU X's rq).
   e) Since X is lower than Z, CPU A unlocks CPU Z's rq. Someone else
      has locked CPU X's rq, and thus CPU A must wait.

2) At CPU Z
   a) The previous task has completed execution, and thus CPU Z enters
      schedule and locks its own rq after CPU A releases it.
   b) CPU Z dequeues the previous task and begins executing task p.
   c) CPU Z unlocks its rq.
   d) Task p yields the CPU (e.g. by doing I/O or waiting to acquire a
      lock), which triggers the schedule function on CPU Z.
   e) CPU Z enters schedule again, locks its own rq, and dequeues task p.
   f) As part of the dequeue, it sets p.on_rq = 0 and unlocks its rq.

3) At CPU B
   a) CPU B enters try_to_wake_up with input task p.
   b) Since CPU Z dequeued task p, p.on_rq = 0, and CPU B sets task p's
      state to WAKING.
   c) CPU B via select_task_rq determines CPU Y as the target CPU.

4) The race
   a) CPU A acquires CPU X's lock and relocks CPU Z.
   b) CPU A reads task p.cpu = Z and incorrectly concludes that task p
      is still on CPU Z.
   c) CPU A failed to notice that task p had been dequeued from CPU Z
      while CPU A was waiting for locks in double_lock_balance. If CPU A
      had known that task p was dequeued, find_lock_lowest_rq would
      return NULL, forcing push_rt_task to give up migrating task p.
   d) CPU B updates task p.cpu = Y and calls ttwu_queue.
   e) CPU B locks CPU Y's rq. CPU B enqueues task p onto Y and sets
      task p.on_rq = 1.
   f) CPU B unlocks CPU Y's rq, triggering memory synchronization.
   g) CPU A reads task p.on_rq = 1, cementing its assumption that task p
      has not migrated.
   h) CPU A decides to migrate p to CPU X.

This leads to CPU A dequeuing p from CPU Y's queue and various crashes
down the line.

Solution
========
The solution here is fairly simple. After obtaining the lock (at 4a),
the check is enhanced to make sure that the task is still at the head
of the pushable tasks list. If it is not, it is in any case not
suitable for being pushed out.
Testing
=======
The fix was tested on a cluster of 3 nodes, where panics due to this
issue were hit every couple of days. A fix similar to this one was
deployed on such a cluster and remained stable for more than 30 days.

Co-developed-by: Jon Kohler <[email protected]>
Signed-off-by: Jon Kohler <[email protected]>
Co-developed-by: Gauri Patwardhan <[email protected]>
Signed-off-by: Gauri Patwardhan <[email protected]>
Co-developed-by: Rahul Chunduru <[email protected]>
Signed-off-by: Rahul Chunduru <[email protected]>
Signed-off-by: Harshit Agarwal <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: "Steven Rostedt (Google)" <[email protected]>
Reviewed-by: Phil Auld <[email protected]>
Tested-by: Will Ton <[email protected]>

Link: https://lore.kernel.org/r/[email protected]
(cherry picked from commit 690e47d1403e90b7f2366f03b52ed3304194c793)
Signed-off-by: Jonathan Maple <[email protected]>

# Conflicts:
#	kernel/sched/rt.c
diff --cc kernel/sched/rt.c
index 172c588de542,e40422c37033..000000000000
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@@ -1956,26 -1974,6 +1975,29 @@@ static struct rq *find_lock_lowest_rq(s
  	return lowest_rq;
  }
  
++<<<<<<< HEAD
 +static struct task_struct *pick_next_pushable_task(struct rq *rq)
 +{
 +	struct task_struct *p;
 +
 +	if (!has_pushable_tasks(rq))
 +		return NULL;
 +
 +	p = plist_first_entry(&rq->rt.pushable_tasks,
 +			      struct task_struct, pushable_tasks);
 +
 +	BUG_ON(rq->cpu != task_cpu(p));
 +	BUG_ON(task_current(rq, p));
 +	BUG_ON(p->nr_cpus_allowed <= 1);
 +
 +	BUG_ON(!task_on_rq_queued(p));
 +	BUG_ON(!rt_task(p));
 +
 +	return p;
 +}
 +
++=======
++>>>>>>> 690e47d1403e (sched/rt: Fix race in push_rt_task)
  /*
   * If the current CPU has more than one RT task, see if the non
   * running task can migrate over to a CPU that is running a task
* Unmerged path kernel/sched/rt.c