fix: prevent WS-broadcast OOM crash under connection churn #1
Open
skialpine wants to merge 1 commit into
Root-caused from a captured panic backtrace: under sustained multi-client WiFi load (WS connection churn + the 10 Hz weight broadcast), free heap collapses and AsyncWebSocket's printfAll path allocates an AsyncWebSocketMessage per client -> operator new throws std::bad_alloc -> (Arduino-ESP32 is -fno-exceptions) std::terminate() -> abort() -> reboot. That OOM-reboot is the "weight stops being collected under load" failure (not thermal -- die temp was 33 C).

Decoded stack:
  operator new -> __cxa_throw -> std::terminate -> abort
  AsyncWebSocketClient::_queueMessage (AsyncWebSocket.cpp:490)
  AsyncWebSocket::printfAll
  sendWebsocketWeightAll (websocket.h)  <- loop() 10 Hz broadcast

Fix:
- Heap-gate every broadcast-to-all helper (weight, status, button, power-off) with wsBroadcastHeapOk(): skip the frame when free heap is below WS_BROADCAST_HEAP_FLOOR (25 KB, above the 15 KB heap watchdog) instead of allocating into an exhausted heap and crashing. Dropping a frame is invisible; the next is <=500 ms away.
- Cap each client's outbound queue via -D WS_MAX_QUEUED_MESSAGES=8 (lib default 32) so a backed-up/half-open client can't hoard heap.
- Document the footgun in CLAUDE.md (notes + troubleshooting table).

Stacked on the scale-telemetry branch (PR decentespresso#57) whose reset_reason / heap telemetry made this diagnosable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stacked on `feat/scale-telemetry` (PR decentespresso#57). Base will be retargeted to `main` once decentespresso#57 merges. The telemetry in decentespresso#57 (`reset_reason`, heap) is what made this diagnosable.

## Root cause (from a captured + decoded panic backtrace)

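Under sustained multi-client WiFi load (WS connection churn + the 10 Hz weight broadcast), free heap collapses. The broadcast path then allocates an `AsyncWebSocketMessage` per client and `operator new` throws `std::bad_alloc`; Arduino-ESP32 builds with `-fno-exceptions`, so the throw goes to `std::terminate()` → `abort()` → reboot (`reset_reason=panic`). This is the "weight stops being collected under load" failure, not a thermal one (die temp was 33 °C).

Decoded stack (innermost frame first):

- `operator new` -> `__cxa_throw` -> `std::terminate` -> `abort`
- `AsyncWebSocketClient::_queueMessage` (AsyncWebSocket.cpp:490)
- `AsyncWebSocket::printfAll`
- `sendWebsocketWeightAll` (websocket.h) <- `loop()` 10 Hz broadcast
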
The existing 15 KB heap watchdog can't prevent it: it has a 2 s debounce and defers reboot up to 60 s while BLE is connected, so the 10 Hz allocation `bad_alloc`s long before it acts.

## Fix

- `wsBroadcastHeapOk()` heap-floor gate on every broadcast-to-all helper (`sendWebsocketWeightAll`, `sendWebsocketStatusAll`, button, power-off): when free heap is below `WS_BROADCAST_HEAP_FLOOR` (25 KB, above the 15 KB watchdog) the frame is skipped, not allocated. Dropping a frame is invisible (next weight frame ≤500 ms away).
- `-D WS_MAX_QUEUED_MESSAGES=8` (lib default 32): bounds each client's outbound queue so a backed-up/half-open client can't hoard heap.

## Verification (on hardware)

Re-ran the exact load that crashed the unpatched build: `conn_churn --rst` 8×8 + 10 Hz WS + mDNS, BT connected:

- `[ws] low heap 17736 < 25000 -> skip broadcast`.
- No `abort`, no reboot, weight stream uninterrupted (uptime continuous).

🤖 Generated with Claude Code
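
For readers who want to see the shape of the heap-floor gate described under Fix, here is a minimal, host-runnable sketch. It is not the code from this PR. The `FreeHeapFn` seam and the `Broadcaster` struct are hypothetical names introduced only so the logic can run off-target; on the device the heap reading would presumably come from `ESP.getFreeHeap()`, and the `sent.push_back` line stands in for the real `printfAll` call. The floor is taken as 25000 bytes because that is the value shown in the verification log line.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Assumed byte value for the 25 KB floor; the verification log compares
// free heap against 25000.
constexpr std::uint32_t WS_BROADCAST_HEAP_FLOOR = 25000;

// Hypothetical seam so the sketch runs on a host. On the ESP32 this would
// presumably wrap ESP.getFreeHeap().
using FreeHeapFn = std::function<std::uint32_t()>;

// True when free heap is at or above the floor, i.e. when it is considered
// safe to allocate per-client broadcast messages.
bool wsBroadcastHeapOk(const FreeHeapFn& freeHeap) {
    return freeHeap() >= WS_BROADCAST_HEAP_FLOOR;
}

// Hypothetical stand-in for a broadcast-to-all helper such as
// sendWebsocketWeightAll. It records frames instead of sending them.
struct Broadcaster {
    FreeHeapFn freeHeap;
    std::vector<std::string> sent;  // frames that would have been broadcast
    std::size_t skipped = 0;        // frames dropped by the gate

    // Returns true if the frame was "sent", false if the gate skipped it.
    bool sendWeightAll(const std::string& frame) {
        if (!wsBroadcastHeapOk(freeHeap)) {
            ++skipped;              // drop the frame, allocate nothing
            return false;
        }
        sent.push_back(frame);      // stands in for the real broadcast call
        return true;
    }
};
```

The point of the design is that the check runs before any per-client allocation, so a low-heap moment costs one dropped frame rather than a `std::bad_alloc` that ends in `abort()`. A caller would simply invoke `sendWeightAll` on every tick and ignore a `false` return, since the next frame follows shortly.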