fw/comm/ble: auto-recover from wedged BLE controller#1073

Open
ericmigi wants to merge 2 commits into coredevices:main from ericmigi:fix/ble-advert-cpu-peg

Conversation

ericmigi (Collaborator) commented Apr 6, 2026

Summary

  • Detect when BLE advertising repeatedly fails (10 consecutive failures, about 10 seconds) and automatically trigger a full BLE stack reset, including an LCPU power cycle
  • Add ble_transport_ll_deinit()/reinit() to power-cycle the BLE controller hardware, wired into bt_driver_stop()/start() so bt_ctl_reset_bluetooth() now does a full hardware reset
  • Add test infrastructure for advertising enable failure injection

Problem

When the BLE controller (LCPU) becomes unresponsive, the advertising scheduler enters a tight retry loop that pegs the CPU at 100%, draining the battery from full to dead in ~8 hours. The existing bt_ctl_reset_bluetooth() only did a NimBLE host-level reset without power-cycling the LCPU, so the wedged controller stayed wedged.

Fix

  1. hci_sf32lb52.c: Add transport teardown/reinit that closes the IPC queue and power-cycles the LCPU via lcpu_power_off()/lcpu_power_on()
  2. init.c: Wire transport deinit/reinit into bt_driver_stop()/start()
  3. gap_le_advert.c: Add consecutive failure counter → after 10 failures, stop timer + call bt_ctl_reset_bluetooth()
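
As an illustration of step 3, the failure-counting logic might look roughly like the sketch below. This is not the code from the PR: `AdvertRecoveryState`, `advert_recovery_record_result()`, and the constant name are hypothetical, and only the threshold of 10 comes from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

// Threshold taken from the PR description; the name is hypothetical.
#define ADVERT_MAX_CONSECUTIVE_FAILURES 10

typedef struct {
  uint8_t consecutive_failures;  // cleared on any successful enable
} AdvertRecoveryState;

// Hypothetical helper: record the outcome of one advertising-enable attempt.
// Returns true when the caller should stop retrying and reset the BLE stack.
static bool advert_recovery_record_result(AdvertRecoveryState *state,
                                          bool enable_ok) {
  if (enable_ok) {
    state->consecutive_failures = 0;
    return false;
  }
  state->consecutive_failures++;
  if (state->consecutive_failures < ADVERT_MAX_CONSECUTIVE_FAILURES) {
    return false;
  }
  // Threshold reached: start counting from scratch after the reset.
  state->consecutive_failures = 0;
  return true;
}
```

In this sketch, a caller in the advertising scheduler would stop the cycle timer and call bt_ctl_reset_bluetooth() whenever the helper returns true. A single successful enable clears the counter, so only an unbroken run of failures triggers recovery.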

Test plan

  • Verify new test_gap_le_advert__enable_failure_triggers_reset test passes
  • Verify existing advertising tests still pass
  • Test on getafix hardware: toggle airplane mode to verify BLE stop/start with LCPU power cycle works correctly
  • Soak test: verify no regressions in normal BLE operation

Fixes FIRM-1602

🤖 Generated with Claude Code

ericmigi requested a review from gmarull April 6, 2026 05:43

ericmigi (Collaborator, Author) commented Apr 6, 2026

@gmarull this doesn't fix the underlying issue, but it should catch it right

ericmigi (Collaborator, Author) commented Apr 6, 2026

potential fix for the underlying issue ^

ericmigi and others added 2 commits April 5, 2026 23:21
The OOM retry introduced in f970e3f ("fix H4 stream desync on transport
OOM") retries with only a 1ms delay at the same buffer position
indefinitely. When NimBLE can't allocate mbufs (ACL/EVT pool exhausted),
the HCI task busy-loops, pegging the CPU and preventing other tasks from
freeing those buffers. This is likely the root cause of the 50% battery
regression between v4.9.152 and v4.9.153.

Increase the retry delay from 1ms to 10ms and add a maximum retry count
of 100 (~1 second total). If buffers still aren't available, break out
and accept the H4 desync.
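
A minimal sketch of the bounded retry described above, assuming the allocation attempt and the delay are passed in as callbacks. The function, type, and constant names are hypothetical and the fake pool exists only to exercise the loop; only the 10ms delay and the limit of 100 retries come from this commit message.

```c
#include <stdbool.h>
#include <stdint.h>

// Values from the commit message; the names are hypothetical.
#define OOM_RETRY_DELAY_MS 10
#define OOM_RETRY_MAX 100

typedef bool (*TryAllocFn)(void *ctx);
typedef void (*DelayMsFn)(uint32_t ms);

// Try to allocate up to OOM_RETRY_MAX times, sleeping between attempts so
// other tasks get a chance to free buffers. Returns false if every attempt
// failed, in which case the caller gives up and accepts the desync.
static bool retry_alloc_bounded(TryAllocFn try_alloc, void *ctx,
                                DelayMsFn delay_ms) {
  for (uint32_t attempt = 0; attempt < OOM_RETRY_MAX; attempt++) {
    if (try_alloc(ctx)) {
      return true;
    }
    delay_ms(OOM_RETRY_DELAY_MS);
  }
  return false;
}

// Stand-ins used only to exercise the sketch (not firmware code).
typedef struct {
  uint32_t calls;            // number of allocation attempts seen so far
  uint32_t succeed_on_call;  // attempt number at which allocation succeeds
} FakePool;

static bool fake_try_alloc(void *ctx) {
  FakePool *pool = ctx;
  pool->calls++;
  return pool->calls >= pool->succeed_on_call;
}

static uint32_t s_total_delay_ms;

static void fake_delay_ms(uint32_t ms) {
  s_total_delay_ms += ms;
}
```

With a pool that never frees up, the loop gives up after 100 attempts and 100 × 10ms = 1000ms of accumulated delay, which matches the "~1 second total" figure above.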

Also add ble_transport_ll_deinit()/reinit() functions that power-cycle
the LCPU and reset the IPC queue, wired into bt_driver_stop()/start()
so that bt_ctl_reset_bluetooth() now performs a full hardware reset.
This enables recovery from a wedged BLE controller.

Fixes FIRM-1602

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Eric Migicovsky <eric@repebble.com>
Signed-off-by: Eric Migicovsky <ericmigi@gmail.com>

Add a consecutive failure counter in gap_le_advert.c. After 10
consecutive bt_driver_advert_advertising_enable() failures (10 seconds),
stop the cycle timer and trigger bt_ctl_reset_bluetooth() to
auto-recover via the LCPU power cycling added in the previous commit.

This serves as a safety net: the OOM retry fix in the previous commit
addresses the likely root cause, but this recovery mechanism catches any
other scenario where the BLE controller becomes unresponsive.

Also add test infrastructure for advertising enable failure injection.
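
Failure injection of this kind is often done with a fake driver that fails a configurable number of calls. The sketch below shows one possible shape; all names in it are hypothetical and it is not the test infrastructure added by this commit.

```c
#include <stdbool.h>

// Hypothetical fake for the advertising-enable driver call.
static int s_enable_failures_remaining;
static int s_enable_call_count;

// Arrange for the next `count` enable calls to fail.
static void fake_advert_inject_enable_failures(int count) {
  s_enable_failures_remaining = count;
  s_enable_call_count = 0;
}

// Fake enable: fails while injected failures remain, then succeeds.
static bool fake_advert_enable(void) {
  s_enable_call_count++;
  if (s_enable_failures_remaining > 0) {
    s_enable_failures_remaining--;
    return false;
  }
  return true;
}

// Lets a test check how many times the code under test retried.
static int fake_advert_enable_call_count(void) {
  return s_enable_call_count;
}
```

A test could inject 10 failures, drive the advertising cycle, and then assert that the reset path was taken.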

Fixes FIRM-1602

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Eric Migicovsky <eric@repebble.com>
Signed-off-by: Eric Migicovsky <ericmigi@gmail.com>
ericmigi force-pushed the fix/ble-advert-cpu-peg branch from 1d4b953 to 2120b83 on April 6, 2026 06:22

gmarull (Member) left a comment

We need to review some memory pool sizing. The workaround was merged as an attempt, but it seems it is not enough. The real fix is to resize some of the pools (or add analytics to track usage and find a good number).

ericmigi (Collaborator, Author) commented Apr 8, 2026

Liz just ran into this again

gmarull (Member) commented Apr 8, 2026

> Liz just ran into this again

workaround reverted, on .154 it will crash
