fw/comm/ble: auto-recover from wedged BLE controller #1073
Open
ericmigi wants to merge 2 commits into coredevices:main from
Collaborator
Author
@gmarull this doesn't fix the underlying issue, but it should catch it right
Collaborator
Author
potential fix for the underlying issue ^
The OOM retry introduced in f970e3f ("fix H4 stream desync on transport OOM") retries with only a 1ms delay at the same buffer position indefinitely. When NimBLE can't allocate mbufs (ACL/EVT pool exhausted), the HCI task busy-loops, pegging the CPU and preventing other tasks from freeing those buffers. This is likely the root cause of the 50% battery regression between v4.9.152 and v4.9.153.

Increase the retry delay from 1ms to 10ms and add a maximum retry count of 100 (~1 second total). If buffers still aren't available, break out and accept the H4 desync.

Also add ble_transport_ll_deinit()/reinit() functions that power-cycle the LCPU and reset the IPC queue, wired into bt_driver_stop()/start() so that bt_ctl_reset_bluetooth() now performs a full hardware reset. This enables recovery from a wedged BLE controller.

Fixes FIRM-1602

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Eric Migicovsky <eric@repebble.com>
Signed-off-by: Eric Migicovsky <ericmigi@gmail.com>
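The bounded retry described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `alloc_with_bounded_retry`, the callback types, and the `fake_*` test doubles are hypothetical names, and only the 10 ms delay and the limit of 100 retries come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values from the commit message: 10 ms delay, at most 100 retries (~1 s total). */
#define OOM_RETRY_DELAY_MS 10
#define OOM_RETRY_MAX 100

/* Hypothetical callback types; the real transport would call NimBLE and the RTOS. */
typedef bool (*try_alloc_fn)(void *ctx);
typedef void (*delay_ms_fn)(unsigned ms);

/*
 * Try the allocation once, then retry up to OOM_RETRY_MAX times, sleeping
 * before each retry so other tasks get CPU time to free buffers.
 * Returns true on success, false once the retry budget is spent, at which
 * point the caller has to accept the H4 desync.
 */
static bool alloc_with_bounded_retry(try_alloc_fn try_alloc, void *ctx,
                                     delay_ms_fn delay_ms)
{
    assert(try_alloc != NULL && delay_ms != NULL);

    for (unsigned retries = 0; ; retries++) {
        if (try_alloc(ctx)) {
            return true;
        }
        if (retries >= OOM_RETRY_MAX) {
            return false;
        }
        delay_ms(OOM_RETRY_DELAY_MS);
    }
}

/* Test doubles for illustration: a pool that fails a fixed number of times. */
struct fake_pool { unsigned failures_left; };
static unsigned g_slept_ms;

static bool fake_try_alloc(void *ctx)
{
    struct fake_pool *pool = ctx;
    if (pool->failures_left == 0) {
        return true;
    }
    pool->failures_left--;
    return false;
}

static void fake_delay_ms(unsigned ms) { g_slept_ms += ms; }
```

The point of the bound is that the worst case is fixed: with a pool that never recovers, the loop sleeps 100 times for 10 ms each and then gives up instead of spinning forever.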
Add a consecutive failure counter in gap_le_advert.c. After 10 consecutive bt_driver_advert_advertising_enable() failures (10 seconds), stop the cycle timer and trigger bt_ctl_reset_bluetooth() to auto-recover via the LCPU power cycling added in the previous commit.

This serves as a safety net: the OOM retry fix in the previous commit addresses the likely root cause, but this recovery mechanism catches any other scenario where the BLE controller becomes unresponsive.

Also add test infrastructure for advertising enable failure injection.

Fixes FIRM-1602

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Eric Migicovsky <eric@repebble.com>
Signed-off-by: Eric Migicovsky <ericmigi@gmail.com>
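A consecutive failure counter of this kind might look like the sketch below. The `AdvertWatchdog` type and its functions are hypothetical and do not come from gap_le_advert.c; the `resets_requested` field stands in for the call to bt_ctl_reset_bluetooth(), and resetting the counter on success is an assumption implied by "consecutive". Only the threshold of 10 is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Threshold from the commit message: reset after 10 consecutive failures. */
#define ADVERT_MAX_CONSECUTIVE_FAILURES 10

/* Hypothetical state; the real scheduler keeps its own bookkeeping. */
typedef struct {
    unsigned consecutive_failures;
    bool cycle_timer_running;
    unsigned resets_requested; /* stands in for bt_ctl_reset_bluetooth() */
} AdvertWatchdog;

static void advert_watchdog_init(AdvertWatchdog *w)
{
    assert(w != NULL);
    w->consecutive_failures = 0;
    w->cycle_timer_running = true;
    w->resets_requested = 0;
}

/*
 * Feed in the result of one advertising enable attempt.
 * Returns true when this attempt pushed the counter to the threshold,
 * in which case the cycle timer is stopped and a reset is requested.
 */
static bool advert_watchdog_on_enable_result(AdvertWatchdog *w, bool enable_ok)
{
    assert(w != NULL);

    if (enable_ok) {
        w->consecutive_failures = 0; /* any success breaks the streak */
        return false;
    }
    if (++w->consecutive_failures < ADVERT_MAX_CONSECUTIVE_FAILURES) {
        return false;
    }
    w->cycle_timer_running = false;
    w->consecutive_failures = 0;
    w->resets_requested++;
    return true;
}
```

Because a success clears the counter, intermittent failures never trigger a reset; only an unbroken run of 10 does.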
1d4b953 to 2120b83
gmarull
requested changes
Apr 8, 2026
Member
gmarull
left a comment
We need to review the sizing of some memory pools. The workaround was merged as an attempt, but it seems it is not enough. The real fix is to resize some of the pools (or add analytics to track usage and find a good number).
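One way to gather the usage data mentioned here is a high-water mark per pool. The sketch below is purely illustrative: `PoolStats` and its functions are hypothetical and are not part of NimBLE or the firmware, and a real version would hook the pool's allocate and free paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-pool counters, e.g. one instance each for ACL and EVT. */
typedef struct {
    unsigned capacity;       /* number of buffers in the pool */
    unsigned in_use;         /* buffers currently allocated */
    unsigned high_water;     /* largest in_use value seen so far */
    unsigned alloc_failures; /* allocations refused because the pool was full */
} PoolStats;

/* Record an allocation attempt; returns false if the pool is exhausted. */
static bool pool_stats_on_alloc(PoolStats *s)
{
    assert(s != NULL);

    if (s->in_use >= s->capacity) {
        s->alloc_failures++;
        return false;
    }
    s->in_use++;
    if (s->in_use > s->high_water) {
        s->high_water = s->in_use;
    }
    return true;
}

/* Record a buffer being returned to the pool. */
static void pool_stats_on_free(PoolStats *s)
{
    assert(s != NULL && s->in_use > 0);
    s->in_use--;
}
```

Reporting `high_water` and `alloc_failures` from devices in the field would show how close each pool gets to its capacity, which is the number needed to pick a new size.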
Collaborator
Author
Liz just ran into this again
Member
workaround reverted, on .154 it will crash
Summary
- ble_transport_ll_deinit()/reinit() to power-cycle the BLE controller hardware, wired into bt_driver_stop()/start() so bt_ctl_reset_bluetooth() now does a full hardware reset

Problem

When the BLE controller (LCPU) becomes unresponsive, the advertising scheduler enters a tight retry loop that pegs the CPU at 100%, draining the battery from full to dead in ~8 hours. The existing bt_ctl_reset_bluetooth() only did a NimBLE host-level reset without power-cycling the LCPU, so the wedged controller stayed wedged.

Fix

- hci_sf32lb52.c: Add transport teardown/reinit that closes the IPC queue and power-cycles the LCPU via lcpu_power_off()/lcpu_power_on()
- init.c: Wire transport deinit/reinit into bt_driver_stop()/start()
- gap_le_advert.c: Add consecutive failure counter → after 10 failures, stop timer + call bt_ctl_reset_bluetooth()

Test plan

- test_gap_le_advert__enable_failure_triggers_reset test passes

Fixes FIRM-1602
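The reset path described under Fix can be pictured with the sketch below, which only records the order of steps. All `*_sketch` functions are hypothetical stand-ins for the real functions named above, and the exact ordering (host stop, IPC close, LCPU off, then LCPU on, IPC open, host start) is an assumption; the description only says that deinit closes the IPC queue and power-cycles the LCPU.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    STEP_HOST_STOP,
    STEP_IPC_CLOSE,
    STEP_LCPU_OFF,
    STEP_LCPU_ON,
    STEP_IPC_OPEN,
    STEP_HOST_START
} Step;

#define TRACE_MAX 16
static Step g_trace[TRACE_MAX];
static size_t g_trace_len;

/* Append one step to the trace so the sequence can be inspected. */
static void record(Step step)
{
    assert(g_trace_len < TRACE_MAX);
    g_trace[g_trace_len++] = step;
}

/* Stand-in for ble_transport_ll_deinit(): close the queue, cut LCPU power. */
static void transport_deinit_sketch(void)
{
    record(STEP_IPC_CLOSE);
    record(STEP_LCPU_OFF);
}

/* Stand-in for the reinit counterpart: restore power, reopen the queue. */
static void transport_reinit_sketch(void)
{
    record(STEP_LCPU_ON);
    record(STEP_IPC_OPEN);
}

/* Stand-ins for bt_driver_stop()/start() with the transport wired in. */
static void driver_stop_sketch(void)
{
    record(STEP_HOST_STOP);
    transport_deinit_sketch();
}

static void driver_start_sketch(void)
{
    transport_reinit_sketch();
    record(STEP_HOST_START);
}

/* Stand-in for bt_ctl_reset_bluetooth(): a stop followed by a start. */
static void reset_bluetooth_sketch(void)
{
    driver_stop_sketch();
    driver_start_sketch();
}
```

With the transport hooked into stop and start, a single reset call now takes the controller through a power cycle rather than restarting the host stack alone.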
🤖 Generated with Claude Code