diff --git a/CLAUDE.md b/CLAUDE.md
index 966f6bb..2551e0f 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -95,6 +95,7 @@ WiFi and BLE share the same 2.4 GHz radio. The Arduino-ESP32 default is `WIFI_PS
 - `LittleFS.begin()` is idempotent.
 - Multiple concurrent WS clients are supported (on-device web UI + a separate app). `cleanupClients()` caps at `DEFAULT_MAX_WS_CLIENTS` (8 on ESP32); shared session state (rate, events) resets only when the last client disconnects (`server->count() == 0`).
 - Broadcast with `websocket.printfAll(...)`, not a hand-rolled `getClients()` loop: `getClients()` doesn't take the library's client-list mutex, so iterating it on the loop task races a client disconnect on the AsyncTCP task (use-after-free). `printfAll` holds the lock and sends to each client.
+- **`printfAll` allocates per client and throws `std::bad_alloc` when the heap is exhausted.** Arduino-ESP32 builds with `-fno-exceptions`, so the throw goes straight to `std::terminate()` → `abort()` → reboot. Connection churn (many or half-open WS clients lingering under the 30 s ack timeout) can collapse the heap, so every broadcast-to-all helper (`sendWebsocketWeightAll`, `sendWebsocketStatusAll`, button/power-off) is gated by `wsBroadcastHeapOk()` (`WS_BROADCAST_HEAP_FLOOR`, ~25 KB, above the 15 KB heap watchdog): when heap is below the floor the frame is **skipped**, not allocated. Per-client queues are capped via `-D WS_MAX_QUEUED_MESSAGES=8` (lib default 32) so a backed-up client can't hoard heap. Don't add a new `printfAll` broadcast without the `wsBroadcastHeapOk()` guard.
 
 WS frame parsing: only act on complete unfragmented text frames:
 
@@ -135,6 +136,7 @@ Functions and locals are camelCase. Some legacy snake_case remains; don't churn
 | Boot logs show `LittleFS mount failed` | Run `pio run -t uploadfs` to write the filesystem image — firmware-only flashes don't touch it. |
 | `pio device monitor` hangs in a non-PTY shell | Use the pyserial snippet in Quick reference. |
 | `pio` flash takes >60s instead of ~15s | Bad firmware is choking the bootloader handshake. Symptom of a serious bug on the device (WiFi coex, OLED stuck, etc.), not a hardware fault. |
+| `reset_reason=panic` / `abort()` + reboot under sustained multi-client WiFi load | Heap-exhaustion OOM in a WS broadcast: `printfAll` → `operator new` throws `bad_alloc`. Broadcasts are heap-gated (`wsBroadcastHeapOk`); look for `[ws] low heap … skip broadcast` on serial and a falling `[health] heap=`. Driven by WS connection churn (half-open clients lingering on the 30 s ack timeout). Not thermal. |
 
 ## Keeping this file fresh
 
@@ -147,6 +149,10 @@ This document is meant to evolve with the codebase. During a session, if you (Cl
 
 If you fix a bug whose symptom is documented in the "When something is broken" table, leave the entry in place — it's still the right "first place to look" for the next person.
 
+## Fixing bugs you find along the way
+
+Pre-existing bugs get fixed too — "it was already there" is not a reason to defer. When you turn up a bug while working on something else (a review flags it, you read past it, a test surfaces it), fix it as part of the same change; a pre-existing bug is no less bad than a newly introduced one, and the person touching the code is the right person to fix it. The only exception is when the fix is genuinely a large, independent effort — then call it out explicitly and agree on a separate change, rather than silently leaving it in place.
+
 ## Don't
 
 - Don't call I²C / SPI / blocking IO from the AsyncTCP task.
diff --git a/README.md b/README.md
index c43d8d5..cef2d61 100644
--- a/README.md
+++ b/README.md
@@ -182,10 +182,37 @@ Status frame shape:
   "soft_sleep": false,
   "events_enabled": true,
   "rate_hz": 10,
-  "interval_ms": 100
+  "interval_ms": 100,
+  "soc_temp_c": 33.3,
+  "soc_temp_max_c": 41.2,
+  "weight_stalled": false,
+  "stall_count": 0,
+  "last_stall_ms": 0,
+  "last_stall_temp_c": 0.0,
+  "adc_recovery_count": 0,
+  "reset_reason": "poweron"
 }
 ```
 
+The trailing fields are diagnostic telemetry (added to investigate a
+suspected-thermal "weight stops being collected" failure under sustained load):
+
+- `soc_temp_c` / `soc_temp_max_c` — current and peak ESP32-S3 die temperature
+  (°C) since boot. `soc_temp_max_c` is `-100` until the first valid sample.
+- `weight_stalled` — `true` while the load-cell raw value has been frozen/railed
+  for >8 s (readings have stopped), cleared when they resume.
+- `stall_count` — number of stall events since boot; `last_stall_ms` is the
+  `millis()` of the most recent stall onset (`0` = none yet) and
+  `last_stall_temp_c` is the die temp at that moment (valid only when
+  `last_stall_ms != 0`).
+- `adc_recovery_count` — number of ADC power-cycle recoveries since boot. A
+  climbing value signals a perpetual-recovery loop (the case `weight_stalled`
+  is blind to). The firmware's `resetAdcRecoveryState()` also zeroes it.
+- `reset_reason` — why the SoC last reset (`poweron`, `panic`, `brownout`,
+  `task_wdt`, …), so a reboot mid-soak is explained.
+
+These reset on reboot (not persisted to NVS).
+
 For backwards compatibility, WiFi only sends weight snapshots by default. A
 client must send `events on` before periodic status, local scale button presses,
 or power-off notifications are emitted. The event stream resets to off when the
diff --git a/include/parameter.h b/include/parameter.h
index 4f12594..b713c0e 100644
--- a/include/parameter.h
+++ b/include/parameter.h
@@ -193,9 +193,46 @@ static const unsigned long ZERO_DISPLAY_MISMATCH_TIMEOUT = 1500;
 static const float ZERO_DISPLAY_MISMATCH_THRESHOLD = 0.5;
 static const uint8_t ADC_ERROR_RECOVERY_COUNT = 2;
 static bool b_adc_recovery_active = false;
-static uint8_t i_adc_recovery_count = 0;
+// volatile: written on the main loop -- incremented on each ADC power-cycle
+// recovery, reset to 0 by resetAdcRecoveryState() -- and read in the WS status
+// frame (which can be built on the AsyncTCP task). uint32_t (not uint8_t) so a
+// *perpetual* recovery loop -- the one failure mode the stall watchdog is blind
+// to -- keeps counting truthfully over a long soak instead of saturating at 255.
+static volatile uint32_t i_adc_recovery_count = 0;
 //bool b_tempDisablePowerOff = true;
 
+// Instrumentation for diagnosing the "weight stops being collected" failure
+// under sustained load (suspected thermal). These are all written on the main
+// loop and read by the WS status frame, which is built BOTH on the main loop
+// (periodic) AND on the AsyncTCP task (command responses) -- so the read crosses
+// a task boundary. volatile prevents the AsyncTCP reader caching a stale value
+// (single aligned scalars => the load/store is atomic on Xtensa, no mutex
+// needed). b_weightStalled is set by the pureScale() stall watchdog when the ADC
+// raw value is frozen/railed.
+volatile bool b_weightStalled = false;
+// volatile pointer for the same cross-task reason; written once at boot in setup().
+const char *volatile g_resetReason = "unknown";
+// Peak/last-event stats since boot (no NVS; reset on reboot, which g_resetReason
+// then explains). g_socTempMaxC = highest SoC die temp seen. The *_stall_*
+// fields capture the most recent stall so the failure is visible after the fact
+// -- consumers must treat last_stall_temp_c as valid only when g_lastStallMs != 0
+// (0.0 otherwise means "no stall yet", not a real 0 C reading).
+volatile float g_socTempC = 0.0f;          // latest SoC temperature (C)
+volatile float g_socTempMaxC = -100.0f;    // peak SoC temperature since boot (C); -100 = no valid sample yet
+volatile uint32_t g_stallCount = 0;        // number of weight-stall events since boot
+volatile unsigned long g_lastStallMs = 0;  // millis() of the last stall onset (0 = none)
+volatile float g_lastStallTempC = 0.0f;    // SoC temp when the last stall began (valid only if g_lastStallMs != 0)
+
+// Snapshot of the stopWatch state, refreshed once per main-loop iteration. The
+// WS status frame is built BOTH on the main loop AND on the AsyncTCP task
+// (command responses); stopWatch is a multi-field object (running flag + start
+// ts + accumulator) also mutated from BLE/USB, so reading it directly off the
+// AsyncTCP task can tear (CLAUDE.md). The status frame reads these single
+// aligned volatiles instead. g_timerElapsed carries stopWatch.elapsed() in its
+// configured resolution (SECONDS) -- it is the WS "timer_seconds" field.
+volatile bool g_timerRunning = false;
+volatile unsigned long g_timerElapsed = 0;
+
 bool b_negativeWeight = false;
 
 bool b_weight_quick_zero = false;           //Tare后快速显示为0优化
diff --git a/include/websocket.h b/include/websocket.h
index b1185c5..750af5d 100644
--- a/include/websocket.h
+++ b/include/websocket.h
@@ -156,8 +156,37 @@ void processWsPendingCmds() {
   }
 }
 
+// --- Heap-floor gate for periodic WS broadcasts ------------------------------
+// printfAll() allocates an AsyncWebSocketMessage (a heap buffer) for EVERY
+// connected client. Under WebSocket connection churn the heap can collapse, and
+// that allocation then throws std::bad_alloc -> std::terminate() -> abort()
+// (Arduino-ESP32 builds with -fno-exceptions, so the throw can't be caught) ->
+// reboot. That OOM-reboot is the "weight stops being collected under sustained
+// multi-client load" failure. Skipping a frame is invisible (the next weight
+// frame is <=500 ms away, status <=5 s); crashing is not. The floor sits above
+// the 15 KB heap watchdog (wifi_setup.cpp) so broadcasts back off well before a
+// reboot is even considered. Every broadcast helper below runs on the main loop,
+// so the skip counter needs no synchronization.
+static const uint32_t WS_BROADCAST_HEAP_FLOOR = 25000;
+static uint32_t g_wsBroadcastHeapSkips = 0;
+static inline bool wsBroadcastHeapOk() {
+  const uint32_t freeHeap = ESP.getFreeHeap();  // sample once so the check and the log agree
+  if (freeHeap >= WS_BROADCAST_HEAP_FLOOR) return true;
+  g_wsBroadcastHeapSkips++;
+  static unsigned long lastLog = 0;
+  unsigned long now = millis();
+  if (now - lastLog >= 2000) {  // rate-limit: broadcasts can be 10 Hz
+    lastLog = now;
+    Serial.printf("[ws] low heap %lu < %lu -> skip broadcast (total skips=%lu)\n",
+                  (unsigned long)freeHeap, (unsigned long)WS_BROADCAST_HEAP_FLOOR,
+                  (unsigned long)g_wsBroadcastHeapSkips);
+  }
+  return false;
+}
+
 void sendWebsocketButton(int buttonNumber, int buttonShortPress) {
   if (!b_wifiEnabled || !b_websocketEventsEnabled || websocket.count() == 0) return;
+  if (!wsBroadcastHeapOk()) return;
   websocket.printfAll("{\"type\":\"button\",\"button\":\"%s\",\"button_number\":%d,\"press\":\"%s\",\"press_code\":%d,\"ms\":%lu}",
                       websocketButtonName(buttonNumber),
                       buttonNumber,
@@ -168,6 +197,7 @@ void sendWebsocketButton(int buttonNumber, int buttonShortPress) {
 
 void sendWebsocketPowerOff(int i_reason) {
   if (!b_wifiEnabled || !b_websocketEventsEnabled || websocket.count() == 0) return;
+  if (!wsBroadcastHeapOk()) return;
   websocket.printfAll("{\"type\":\"power\",\"event\":\"power_off\",\"reason\":\"%s\",\"reason_code\":%d,\"ms\":%lu}",
                       websocketPowerOffReason(i_reason),
                       i_reason,
@@ -183,7 +213,7 @@ void sendWebsocketRateInfo(AsyncWebSocketClient *client, const char *status) {
 }
 
 void sendWebsocketStatus(AsyncWebSocketClient *client, const char *status) {
-  client->printf("{\"type\":\"status\",\"status\":\"%s\",\"protocol_version\":1,\"firmware_version\":\"%s\",\"grams\":%.2f,\"ms\":%lu,\"battery_percent\":%d,\"battery_voltage\":%.2f,\"charging\":%s,\"timer_running\":%s,\"timer_seconds\":%lu,\"display_on\":%s,\"low_power\":%s,\"soft_sleep\":%s,\"events_enabled\":%s,\"rate_hz\":%lu,\"interval_ms\":%lu}",
+  client->printf("{\"type\":\"status\",\"status\":\"%s\",\"protocol_version\":1,\"firmware_version\":\"%s\",\"grams\":%.2f,\"ms\":%lu,\"battery_percent\":%d,\"battery_voltage\":%.2f,\"charging\":%s,\"timer_running\":%s,\"timer_seconds\":%lu,\"display_on\":%s,\"low_power\":%s,\"soft_sleep\":%s,\"events_enabled\":%s,\"rate_hz\":%lu,\"interval_ms\":%lu,\"soc_temp_c\":%.1f,\"soc_temp_max_c\":%.1f,\"weight_stalled\":%s,\"stall_count\":%lu,\"last_stall_ms\":%lu,\"last_stall_temp_c\":%.1f,\"adc_recovery_count\":%lu,\"reset_reason\":\"%s\"}",
                  status,
                  FIRMWARE_VER,
                  f_displayedValue,
@@ -191,14 +221,22 @@ void sendWebsocketStatus(AsyncWebSocketClient *client, const char *status) {
                  websocketBatteryPercent(),
                  f_batteryVoltage,
                  websocketIsCharging() ? "true" : "false",
-                 stopWatch.isRunning() ? "true" : "false",
-                 (unsigned long)stopWatch.elapsed(),
+                 g_timerRunning ? "true" : "false",
+                 g_timerElapsed,
                  b_u8g2Sleep ? "false" : "true",
                  b_websocketLowPowerEnabled ? "true" : "false",
                  b_softSleep ? "true" : "false",
                  b_websocketEventsEnabled ? "true" : "false",
                  websocketRateForInterval(weightWebsocketNotifyInterval),
-                 weightWebsocketNotifyInterval);
+                 weightWebsocketNotifyInterval,
+                 g_socTempC,
+                 g_socTempMaxC,
+                 b_weightStalled ? "true" : "false",
+                 (unsigned long)g_stallCount,
+                 g_lastStallMs,
+                 g_lastStallTempC,
+                 (unsigned long)i_adc_recovery_count,
+                 (const char *)g_resetReason);
 }
 
 // Broadcast via printfAll(): it holds the library's client-list mutex and
@@ -211,7 +249,8 @@ void sendWebsocketStatus(AsyncWebSocketClient *client, const char *status) {
 // without blocking the others.
 void sendWebsocketStatusAll(const char *status) {
   if (!b_wifiEnabled || !b_websocketEventsEnabled || websocket.count() == 0) return;
-  websocket.printfAll("{\"type\":\"status\",\"status\":\"%s\",\"protocol_version\":1,\"firmware_version\":\"%s\",\"grams\":%.2f,\"ms\":%lu,\"battery_percent\":%d,\"battery_voltage\":%.2f,\"charging\":%s,\"timer_running\":%s,\"timer_seconds\":%lu,\"display_on\":%s,\"low_power\":%s,\"soft_sleep\":%s,\"events_enabled\":%s,\"rate_hz\":%lu,\"interval_ms\":%lu}",
+  if (!wsBroadcastHeapOk()) return;
+  websocket.printfAll("{\"type\":\"status\",\"status\":\"%s\",\"protocol_version\":1,\"firmware_version\":\"%s\",\"grams\":%.2f,\"ms\":%lu,\"battery_percent\":%d,\"battery_voltage\":%.2f,\"charging\":%s,\"timer_running\":%s,\"timer_seconds\":%lu,\"display_on\":%s,\"low_power\":%s,\"soft_sleep\":%s,\"events_enabled\":%s,\"rate_hz\":%lu,\"interval_ms\":%lu,\"soc_temp_c\":%.1f,\"soc_temp_max_c\":%.1f,\"weight_stalled\":%s,\"stall_count\":%lu,\"last_stall_ms\":%lu,\"last_stall_temp_c\":%.1f,\"adc_recovery_count\":%lu,\"reset_reason\":\"%s\"}",
                       status,
                       FIRMWARE_VER,
                       f_displayedValue,
@@ -219,18 +258,27 @@ void sendWebsocketStatusAll(const char *status) {
                       websocketBatteryPercent(),
                       f_batteryVoltage,
                       websocketIsCharging() ? "true" : "false",
-                      stopWatch.isRunning() ? "true" : "false",
-                      (unsigned long)stopWatch.elapsed(),
+                      g_timerRunning ? "true" : "false",
+                      g_timerElapsed,
                       b_u8g2Sleep ? "false" : "true",
                       b_websocketLowPowerEnabled ? "true" : "false",
                       b_softSleep ? "true" : "false",
                       b_websocketEventsEnabled ? "true" : "false",
                       websocketRateForInterval(weightWebsocketNotifyInterval),
-                      weightWebsocketNotifyInterval);
+                      weightWebsocketNotifyInterval,
+                      g_socTempC,
+                      g_socTempMaxC,
+                      b_weightStalled ? "true" : "false",
+                      (unsigned long)g_stallCount,
+                      g_lastStallMs,
+                      g_lastStallTempC,
+                      (unsigned long)i_adc_recovery_count,
+                      (const char *)g_resetReason);
 }
 
 void sendWebsocketWeightAll(float grams, unsigned long ms) {
   if (!b_wifiEnabled || websocket.count() == 0) return;
+  if (!wsBroadcastHeapOk()) return;
   websocket.printfAll("{\"grams\":%.2f,\"ms\":%lu}", grams, ms);
 }
 
diff --git a/platformio.ini b/platformio.ini
index 289a3f1..1192edc 100644
--- a/platformio.ini
+++ b/platformio.ini
@@ -27,6 +27,10 @@ build_flags =
 ;  -DESP32
   -D CONFIG_ASYNC_TCP_RUNNING_CORE=1
   -DELEGANTOTA_USE_ASYNC_WEBSERVER=1
+  ; Cap each WS client's outbound queue (lib default 32) so a backed-up or
+  ; half-open client (connection churn) can't hoard heap. Bounds aggregate heap
+  ; growth; complements the WS_BROADCAST_HEAP_FLOOR gate in include/websocket.h.
+  -D WS_MAX_QUEUED_MESSAGES=8
   !python3 git_rev_macro.py
 
 #	-D DEBUG
diff --git a/src/hds.ino b/src/hds.ino
index aac5ec8..0116613 100644
--- a/src/hds.ino
+++ b/src/hds.ino
@@ -392,10 +392,39 @@ void wifi_init() {
 
 MyUsbCallbacks usbCallbacks;
 
+// Map esp_reset_reason() to a short string for boot logging + WS telemetry, so a
+// spontaneous reset (brownout / panic / watchdog) is attributable instead of
+// looking like a clean power-on.
+const char *resetReasonStr(esp_reset_reason_t r) {
+  switch (r) {
+    case ESP_RST_POWERON:   return "poweron";
+    case ESP_RST_EXT:       return "ext";
+    case ESP_RST_SW:        return "sw";
+    case ESP_RST_PANIC:     return "panic";
+    case ESP_RST_INT_WDT:   return "int_wdt";
+    case ESP_RST_TASK_WDT:  return "task_wdt";
+    case ESP_RST_WDT:       return "wdt";
+    case ESP_RST_DEEPSLEEP: return "deepsleep";
+    case ESP_RST_BROWNOUT:  return "brownout";
+    case ESP_RST_SDIO:      return "sdio";
+    default: {
+      // Don't collapse unmapped IDF reset codes (e.g. CPU_LOCKUP, USB, JTAG on
+      // newer IDF) to a bare "unknown" -- keep the numeric code so a new/rare
+      // reason is still attributable. Written once at boot, so a static buffer
+      // is safe.
+      static char buf[16];
+      snprintf(buf, sizeof(buf), "unknown_%d", (int)r);
+      return buf;
+    }
+  }
+}
+
 void setup() {
   Serial.begin(115200);
   while (!Serial)  // Wait for the Serial port to initialize (typically used in Arduino to ensure the Serial monitor is ready)
     ;
+  g_resetReason = resetReasonStr(esp_reset_reason());
+  Serial.printf("[boot] reset_reason=%s\n", (const char *)g_resetReason);
   if (!EEPROM.begin(512)) {
     Serial.println("EEPROM init failed!");
     while (1) {
@@ -932,6 +961,56 @@ void pureScale() {
     t_lastScaleData = millis();
   }
 
+  // Stall watchdog: a live load cell's raw 24-bit value dithers on every
+  // conversion (the ADS1232/HX711 runs ~10 samples/s at the configured rate). If
+  // the raw value is byte-identical for >8 s it's frozen or railed (a rail to 0
+  // freezes rawValue at the last good value via the driver's data>0 guard, so it
+  // still reads as "unchanged") -- the "weight stops being collected" failure
+  // (suspected thermal/analog) that an in-firmware ADC power-cycle can't fix.
+  // Surface it (flag + one-shot log) instead of silently streaming a stuck value.
+  // Skipped while a deliberate ADC power-cycle recovery is in progress (raw is
+  // frozen by definition then); the window is re-seeded on the first 250 ms poll
+  // after recovery clears (via the t_rawChange==0 sentinel). Blind spot: a
+  // *perpetual* recovery loop (recovery every ~5 s) keeps re-seeding so this flag
+  // may never trip -- the climbing adc_recovery_count in the status frame is the
+  // signal for that case. Checked every 250 ms (not every loop): the ADC only
+  // produces ~10 samples/s, so polling faster just burns CPU/heat.
+  {
+    static long lastRaw = 0x7FFFFFFFL;
+    static unsigned long t_rawChange = 0;   // 0 = (re)seed window on next sample
+    static unsigned long t_stallCheck = 0;
+    if (b_adc_recovery_active) {
+      // Deliberate ADC power-cycle in progress: raw is frozen by design, not by
+      // the failure we detect. Re-seed the window so we don't false-trip when
+      // streaming resumes.
+      t_rawChange = 0;
+    } else if (millis() - t_stallCheck >= 250) {
+      unsigned long nowMs = millis();
+      if (nowMs == 0) nowMs = 1;  // 0 is the reseed sentinel for t_rawChange; never store it as a real timestamp (boot/rollover)
+      t_stallCheck = nowMs;
+      long raw = scale.getDebugInfo().rawValue;
+      if (t_rawChange == 0) {
+        lastRaw = raw;
+        t_rawChange = nowMs;
+      } else if (raw != lastRaw) {
+        lastRaw = raw;
+        t_rawChange = nowMs;
+        if (b_weightStalled) {
+          b_weightStalled = false;
+          Serial.println("[adc] weight readings resumed");
+        }
+      } else if (!b_weightStalled && nowMs - t_rawChange > 8000) {
+        b_weightStalled = true;
+        g_stallCount++;
+        g_lastStallMs = nowMs;
+        g_lastStallTempC = g_socTempC;
+        Serial.printf("[adc] WEIGHT STALLED #%lu: raw frozen at %ld for >8s soc=%.1fC heap=%lu\n",
+                      (unsigned long)g_stallCount, raw, g_lastStallTempC,
+                      (unsigned long)ESP.getFreeHeap());
+      }
+    }
+  }
+
   if (scale.update()) {
     b_newDataReady = true;
     t_lastScaleData = millis();
@@ -941,9 +1020,7 @@ void pureScale() {
              millis() - t_lastScaleRecovery > 5000) {
     Serial.println("Scale ADC timeout. Power cycling ADC.");
     b_adc_recovery_active = true;
-    if (i_adc_recovery_count < 255) {
-      i_adc_recovery_count++;
-    }
+    i_adc_recovery_count++;  // uint32_t: counts truthfully, won't wrap in any realistic runtime
     scale.powerDown();
     delay(5);
     scale.powerUp();
@@ -1268,11 +1345,53 @@ void loop() {
   // here on the loop task rather than racing peripheral drivers.
   processWsPendingCmds();
 
+  // Snapshot the multi-field stopWatch into aligned volatiles on the loop task so
+  // the WS status frame (built on the AsyncTCP task for command responses) never
+  // reads stopWatch cross-task. Done after the drain above so a just-applied
+  // timer start/stop/zero is reflected. elapsed() is in the configured
+  // resolution (SECONDS).
+  g_timerRunning = stopWatch.isRunning();
+  g_timerElapsed = (unsigned long)stopWatch.elapsed();
+
   if (b_powerOff){
     shut_down_now_nobeep();
     return;
   }
 
+  // SoC-temperature sampler + peak tracking (diagnosing the suspected thermal
+  // stall). Runs every 2 s regardless of WiFi state or power-supply mode
+  // (USB/battery); prints a summary every 10 s so a serial capture during a
+  // stress run shows the temp trend, and feeds g_socTempC/Max into the WS
+  // status frame.
+  {
+    static unsigned long t_tempSample = 0, t_tempLog = 0;
+    unsigned long nowMs = millis();
+    if (nowMs - t_tempSample >= 2000) {
+      t_tempSample = nowMs;
+      float t = temperatureRead();
+      // temperatureRead() returns NaN if the SoC sensor is unavailable. Reject
+      // any non-finite value (NaN or +/-inf): NaN serializes as invalid JSON and
+      // a non-finite compare would freeze the peak. Keep the last valid value and
+      // log once so the failure is visible rather than silent.
+      if (isfinite(t)) {
+        g_socTempC = t;
+        if (t > g_socTempMaxC) g_socTempMaxC = t;
+      } else {
+        static bool tempFailLogged = false;
+        if (!tempFailLogged) {
+          tempFailLogged = true;
+          Serial.println("[temp] temperatureRead() returned a non-finite value -- SoC sensor unavailable");
+        }
+      }
+      if (nowMs - t_tempLog >= 10000) {
+        t_tempLog = nowMs;
+        Serial.printf("[temp] soc=%.1fC max=%.1fC stalls=%lu last_stall=%lums stall_temp=%.1fC heap=%lu\n",
+                      g_socTempC, g_socTempMaxC, (unsigned long)g_stallCount,
+                      g_lastStallMs, g_lastStallTempC, (unsigned long)ESP.getFreeHeap());
+      }
+    }
+  }
+
   if (bleState == CONNECTED && b_requireHeartBeat && millis() - t_firstConnect > HEARTBEAT_TIMEOUT) {
     if (millis() - t_heartBeat > HEARTBEAT_TIMEOUT) {
       disconnectBLE();
diff --git a/tools/thermal_load_test.sh b/tools/thermal_load_test.sh
new file mode 100755
index 0000000..3a5c5df
--- /dev/null
+++ b/tools/thermal_load_test.sh
@@ -0,0 +1,165 @@
+#!/bin/bash
+# 1-hour multi-protocol thermal/stall load test.
+#   - drives:   USB 10 Hz binary + WS 10 Hz stream + HTTP/WS churn + mDNS
+#   - NOT driven here: BT (the user's app drives it concurrently)
+#   - monitors: the WS status-frame telemetry (soc_temp_c/max, weight_stalled,
+#               stall_count, last_stall_temp_c, adc_recovery_count, reset_reason)
+#               every ~60 s, watching for the "weight stops being collected"
+#               failure and the temp at which it hits.
+# Opening USB reboots the scale once (clean baseline); reconnect BT afterward.
+#
+# Usage: tools/thermal_load_test.sh [DURATION_S] [IP] [HOST]
+set -u
+cd "$(dirname "$0")/.."
+DUR="${1:-3600}"
+IP="${2:-192.168.10.242}"
+HOST="${3:-hds.local}"
+PORT="$(ls /dev/cu.*usbserial* 2>/dev/null | head -1)"
+LOG=/tmp/thermal; rm -rf "$LOG"; mkdir -p "$LOG"
+ts(){ date +%H:%M:%S; }
+echo "[thermal] START $(ts) dur=${DUR}s ip=$IP host=$HOST port=${PORT:-<none>}"
+[ -z "$PORT" ] && echo "[thermal] WARNING: no USB serial port found -- USB 10Hz load DISABLED (WiFi-only)"
+
+# 1) USB 10 Hz (opening the port pulses DTR/RTS -> one reboot -> clean baseline)
+if [ -n "$PORT" ]; then
+  python3 -u tools/usb_rate_check.py "$PORT" --seconds "$DUR" --mult 1 --boot-wait 8 > "$LOG/usb.log" 2>&1 &
+  USB_PID=$!
+  echo "[thermal] USB launched (scale rebooting) @ $(ts)"
+fi
+
+# 2) detect the reboot, then wait for full recovery
+down=0
+for i in $(seq 1 30); do
+  if ! ping -c1 -t1 "$HOST" >/dev/null 2>&1; then down=1; echo "[thermal] reboot detected @ $(ts)"; break; fi
+  sleep 1
+done
+until ping -c1 -t1 "$HOST" >/dev/null 2>&1; do sleep 1; done
+sleep 4
+echo "[thermal] WiFi back @ $(ts) (reboot_detected=$down) -- RECONNECT BT APP NOW"
+
+RDUR=$((DUR-60)); [ "$RDUR" -lt 120 ] && RDUR=120
+
+# 3) WiFi load
+python3 -u tools/ws_drop_repro.py "$IP" --rate 10 --duration "$RDUR" --print-every 120 > "$LOG/ws.log" 2>&1 &
+WS_PID=$!
+python3 -u tools/conn_churn.py "$IP" --http --ws --rate 0.5 --workers 1 --duration "$RDUR" > "$LOG/churn.log" 2>&1 &
+CHURN_PID=$!
+python3 -u tools/mdns_stress.py --host "$HOST" --rate 1 --duration "$RDUR" --resolver > "$LOG/mdns.log" 2>&1 &
+MDNS_PID=$!
+
+# 4) telemetry monitor: one WS client, events on, log status every ~60 s.
+#    Tracks peak temp / stalls / recoveries / reboots ACROSS the whole run (so a
+#    firmware reset that zeroes the since-boot counters doesn't lose the peak),
+#    and prints a SUMMARY + RESULT verdict at the end.
+python3 -u - "$HOST" "$RDUR" > "$LOG/telemetry.log" 2>&1 <<'PY' &
+import json,sys,time,websocket
+host=sys.argv[1]; dur=int(sys.argv[2]); end=time.time()+dur
+def connect():
+    w=websocket.create_connection("ws://%s/snapshot"%host,timeout=8); w.settimeout(1.0)
+    try: w.send('{"command":"events","action":"on"}')
+    except Exception: pass
+    return w
+def first_status(w, secs):
+    # read for `secs` and return the LAST status frame seen, or None (the 9 s first call covers >=1 full 5s status interval)
+    t=time.time()+secs
+    st=None
+    while time.time()<t:
+        try:
+            d=json.loads(w.recv())
+            if d.get("type")=="status": st=d
+        except Exception: pass
+    return st
+ws=connect()
+peak=-999.0; total_stalls=0; reboots=0; max_recov=0
+prev_stalls=None; prev_max=None; last_reset="?"
+no_status_streak=0; max_no_status_streak=0; total_no_status=0
+first=True
+while time.time()<end:
+    st = first_status(ws, 9 if first else 2.5); first=False
+    if st:
+        soc=st.get('soc_temp_c'); mx=st.get('soc_temp_max_c'); sc=st.get('stall_count',0) or 0
+        recov=st.get('adc_recovery_count',0) or 0; rr=st.get('reset_reason','?')
+        last_reset=rr; no_status_streak=0
+        if isinstance(mx,(int,float)) and mx>peak: peak=mx
+        if recov>max_recov: max_recov=recov
+        # reboot heuristic: since-boot counters or peak dropped vs last frame
+        if prev_stalls is not None and (sc<prev_stalls or (isinstance(mx,(int,float)) and isinstance(prev_max,(int,float)) and mx<prev_max-3)):
+            reboots+=1
+            print("[%s] *** REBOOT detected (counters reset; reset_reason=%s) ***"%(time.strftime('%H:%M:%S'),rr),flush=True)
+        total_stalls=max(total_stalls, sc)
+        prev_stalls=sc; prev_max=mx
+        flag=" *** STALL ***" if st.get('weight_stalled') else ""
+        print("[%s] soc=%5sC max=%5sC stalled=%-5s stalls=%s recov=%s last_stall_ms=%s stall_temp=%s reset=%s grams=%s chg=%s%s"%(
+            time.strftime('%H:%M:%S'), soc, mx, st.get('weight_stalled'), sc, recov,
+            st.get('last_stall_ms'), st.get('last_stall_temp_c'), rr, st.get('grams'),
+            st.get('charging'), flag), flush=True)
+    else:
+        no_status_streak+=1; total_no_status+=1
+        if no_status_streak>max_no_status_streak: max_no_status_streak=no_status_streak
+        print("[%s] NO STATUS FRAME (reconnecting; streak=%d total=%d)"%(time.strftime('%H:%M:%S'),no_status_streak,total_no_status), flush=True)
+        try: ws.close()
+        except Exception: pass
+        try: ws=connect(); first=True
+        except Exception as e: print("reconnect failed:",e,flush=True); time.sleep(5)
+    time.sleep(58)
+try: ws.close()
+except Exception: pass
+# Loss of status frames means the scale stopped answering -- exactly the "weight
+# stops" failure being hunted -- so it must FAIL, not silently PASS because the
+# stall/reboot counters simply stopped advancing. Two patterns count: a SUSTAINED
+# loss (streak >= 3 consecutive misses) AND a FLAPPING wedge that recovers on each
+# reconnect (which resets the streak but accumulates total_no_status). A healthy
+# hour-long run has ~0 misses (the 58 s sleep leaves a backlog of buffered status
+# frames), so a cumulative threshold of 5 tolerates rare network jitter while
+# still catching a scale that keeps dropping out.
+visibility_lost = (max_no_status_streak >= 3) or (total_no_status >= 5)
+# peak is -999 if soc_temp_max_c was never seen, or the firmware's -100 "no valid
+# sample yet" sentinel if the SoC sensor never read: no thermal data, so no PASS.
+no_temp_data = peak <= -100
+result = "PASS" if (total_stalls==0 and max_recov==0 and reboots==0 and not visibility_lost and not no_temp_data) else "FAIL"
+print("SUMMARY peak_temp=%.1fC total_stalls=%d adc_recoveries=%d reboots=%d max_no_status_streak=%d total_no_status=%d last_reset=%s RESULT=%s"%(
+    peak, total_stalls, max_recov, reboots, max_no_status_streak, total_no_status, last_reset, result), flush=True)
+PY
+TELE_PID=$!
+
+echo "[thermal] load + telemetry running ${RDUR}s @ $(ts)"
+# Wait for each child individually so we capture its exit status (a bare `wait`
+# discards them). These tools exit 0 on normal completion and non-zero only on a
+# startup failure (missing dep / unresolvable host) or a crash/kill -- never on
+# "detected drops" -- so a non-zero code reliably means that load generator did
+# not do its job and the scale was under-stressed.
+rc_usb=0; rc_ws=0; rc_churn=0; rc_mdns=0; rc_tele=0
+[ -n "${USB_PID:-}" ] && { wait "$USB_PID"; rc_usb=$?; }
+wait "$WS_PID";    rc_ws=$?
+wait "$CHURN_PID"; rc_churn=$?
+wait "$MDNS_PID";  rc_mdns=$?
+wait "$TELE_PID";  rc_tele=$?
+echo "[thermal] DONE $(ts)"
+echo "===== TELEMETRY (last 12 + summary) ====="; tail -12 "$LOG/telemetry.log"
+echo "----- key events -----"; grep -E "STALL|REBOOT|SUMMARY" "$LOG/telemetry.log" || echo "(none)"
+echo "===== WS (drops) ====="; tail -12 "$LOG/ws.log"
+echo "===== USB ====="; tail -6 "$LOG/usb.log" 2>/dev/null || echo "(USB disabled)"
+echo "===== churn ====="; tail -3 "$LOG/churn.log"
+echo "===== mDNS ====="; tail -3 "$LOG/mdns.log"
+
+# Verdict -> exit code. A green run requires BOTH the telemetry monitor's
+# RESULT=PASS *and* that every load generator stayed alive for the whole run: a
+# generator that never started or crashed (non-zero exit) means the scale was
+# under-stressed, so the run is invalid even if no stall was seen. A non-zero
+# monitor exit or a missing SUMMARY line means the monitor itself died. This
+# makes the script usable as a CI gate.
+fail=0
+if [ "$rc_tele" -ne 0 ] || ! grep -q "RESULT=PASS" "$LOG/telemetry.log"; then
+  echo "[thermal] FAIL: telemetry verdict not PASS (stall/reboot/recovery, lost visibility, or monitor died rc=$rc_tele)"; fail=1
+fi
+check_gen() {  # $1=name $2=exit-code
+  if [ "$2" -ne 0 ]; then
+    echo "[thermal] FAIL: load generator '$1' exited $2 -- never started or crashed, scale under-stressed (see $LOG/$1.log)"; fail=1
+  fi
+}
+[ -n "${USB_PID:-}" ] && check_gen usb "$rc_usb"
+check_gen ws "$rc_ws"
+check_gen churn "$rc_churn"
+check_gen mdns "$rc_mdns"
+[ "$fail" -eq 0 ] && echo "[thermal] RESULT=PASS" || echo "[thermal] RESULT=FAIL"
+exit "$fail"