[codex] Harden packed message handling #2

Draft
brothercorvo wants to merge 22 commits into torlando-tech:feat/t-deck from brothercorvo:codex/guard-missing-packed-message

@brothercorvo
What changed

  • harden packed-message handling across the LXMF router and propagation path
  • update message-store-related plumbing used by packed-message processing
  • adjust transport-facing interfaces needed by the new guard paths

Why

Packed message handling still had edge cases that could break message loading and follow-on processing when expected packed payload state was missing or incomplete.

Impact

This makes the embedded messaging stack more defensive when handling packed LXMF payloads and reduces the chance of state corruption or crashes in those paths.

Validation

  • Could not run a PlatformIO build in this environment because pio is not installed.

Astrrra and others added 21 commits February 11, 2026 13:28
3s is way too short for slow LoRa links, bumped it to 15s.
Static pool arrays (~20 pools) were in BSS segment consuming ~15-25KB
of internal RAM. Convert to pointers allocated via heap_caps_aligned_alloc
in PSRAM at startup, following the same pattern as
Identity::init_known_destinations_pool().

Results: boot heap increased from ~116KB to ~161KB, steady-state
max_block improved from 7.6KB to 65KB, skipped announces eliminated.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Subtract ATT_OVERHEAD (3 bytes) from MTU values so fragment sizes
  match what the BLE stack can actually transmit
- Add LONE (0x00) fragment type for single-packet messages, matching
  Columba's BLE protocol implementation
- Increase handshake timeout from 10s to 30s to match Columba
- Track consecutive keepalive failures and disconnect after 3
- Add zombie detection for connected peers idle >45s
- Add advertising refresh interval constant (60s)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
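
The MTU arithmetic in the first two bullets above can be sketched as follows. This is a minimal illustration, not the code from the commit: `kAttOverhead` mirrors the `ATT_OVERHEAD` constant named in the message, `LONE = 0x00` is taken from the message, and the helper names are hypothetical. The sketch also assumes the fragment header is not counted separately.

```cpp
#include <cstddef>
#include <cstdint>

// Mirrors ATT_OVERHEAD from the commit message: the ATT header
// (1-byte opcode + 2-byte handle) that precedes every write/notify payload.
constexpr std::size_t kAttOverhead = 3;

// LONE (0x00) is the single-packet fragment type named in the commit.
// Other fragment types are intentionally left out of this sketch.
constexpr std::uint8_t kFragmentTypeLone = 0x00;

// Hypothetical helper: bytes the BLE stack can actually carry per fragment.
constexpr std::size_t usable_fragment_size(std::size_t negotiated_mtu) {
    return negotiated_mtu > kAttOverhead ? negotiated_mtu - kAttOverhead : 0;
}

// Hypothetical helper: true when the whole message fits in one fragment
// and could therefore be sent with the LONE type.
constexpr bool fits_in_lone_fragment(std::size_t message_len,
                                     std::size_t negotiated_mtu) {
    return message_len <= usable_fragment_size(negotiated_mtu);
}
```

With the default BLE ATT MTU of 23, this leaves 20 bytes per fragment.
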
…moryMonitor

- Replace Bytes _app_data in IdentityEntry with fixed uint8_t[128] buffer
  to prevent known destinations from consuming BytesPool tiny slots
  (was exhausting pool after 3hrs on busy networks)
- Move BytesPool storage from BSS to PSRAM, reduce TINY_SLOTS to 1024
  now that destinations don't consume pool slots
- Defer MemoryMonitor logging from timer callback to main loop poll()
  to avoid FreeRTOS timer task stack overflow (3120 byte limit)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
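
The deferred-logging change in the last bullet above follows a common pattern: the timer callback only records that a report is due, and the main loop does the stack-heavy work. A minimal sketch, assuming a hypothetical `DeferredLogger` class (the real `MemoryMonitor` interface is not shown in this PR):

```cpp
#include <atomic>
#include <cstdio>

// Hypothetical stand-in for the MemoryMonitor logging path.
class DeferredLogger {
public:
    // Runs in the timer context: no formatting, no I/O, just a flag.
    void on_timer() { pending_.store(true, std::memory_order_release); }

    // Runs in the main loop, where stack space is not constrained by the
    // timer task. Returns true if a report was emitted on this call.
    bool poll() {
        if (!pending_.exchange(false, std::memory_order_acq_rel)) {
            return false;
        }
        std::printf("[mem] periodic report\n");
        ++emitted_;
        return true;
    }

    int emitted() const { return emitted_; }

private:
    std::atomic<bool> pending_{false};
    int emitted_ = 0;
};
```

Multiple timer ticks between two `poll()` calls collapse into a single report, which is usually acceptable for periodic diagnostics.
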
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
save_known_destinations() was called on every announce, writing all 670+
entries to SPIFFS and blocking the main loop for 20+ seconds. Add a dirty
flag so saves only happen during the periodic persist_data() timer (~60s).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
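
The dirty-flag approach described above can be sketched like this. The class and method names are hypothetical; only `save_known_destinations()` and `persist_data()` are named in the commit, and the actual flash write is replaced by a counter here.

```cpp
#include <cstddef>

// Hypothetical model of the known-destinations store.
class DestinationStore {
public:
    // Called for every announce: marks state as changed, no flash I/O.
    void on_announce() { dirty_ = true; }

    // Called from the periodic persist timer. Writes only when something
    // changed since the last save. Returns true if a save happened.
    bool persist_if_dirty() {
        if (!dirty_) {
            return false;
        }
        ++save_count_;  // stands in for the real SPIFFS write
        dirty_ = false;
        return true;
    }

    std::size_t save_count() const { return save_count_; }

private:
    bool dirty_ = false;
    std::size_t save_count_ = 0;
};
```

Any number of announces between two timer ticks results in at most one save.
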
Link packets were flooding all interfaces (TCP, LoRa, WiFi) because
the interface routing check was commented out. This caused massive
latency on audio streams — each packet hit LoRa SPI (55ms) even when
the link was established over WiFi/AutoInterface.

Fix: Use packet.destination_link().attached_interface() instead of
the non-existent packet.destination().attached_interface(). Also add
the missing Link::attached_interface() const getter implementation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Validate conn_handle against MAX_CONN_HANDLES before array access in
setPeerHandle and promoteToIdentityKeyed to prevent out-of-bounds writes.
Add writeCharacteristic() virtual method to IBLEPlatform for targeted
GATT characteristic writes (needed for identity handshake).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
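
A minimal sketch of the bounds check described above, assuming a fixed-size table indexed by connection handle. `kMaxConnHandles` stands in for `MAX_CONN_HANDLES` (its real value is not given in the commit), and `PeerTable` is a hypothetical type.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Assumed value for illustration; the real MAX_CONN_HANDLES may differ.
constexpr std::size_t kMaxConnHandles = 8;

class PeerTable {
public:
    PeerTable() { peers_.fill(-1); }

    // Rejects out-of-range handles instead of writing past the array.
    bool set_peer_handle(std::uint16_t conn_handle, int peer_id) {
        if (conn_handle >= kMaxConnHandles) {
            return false;
        }
        peers_[conn_handle] = peer_id;
        return true;
    }

    // Returns -1 for unknown or out-of-range handles.
    int peer_for(std::uint16_t conn_handle) const {
        return conn_handle < kMaxConnHandles ? peers_[conn_handle] : -1;
    }

private:
    std::array<int, kMaxConnHandles> peers_;
};
```
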
- Atomic save: write to /known_dst.tmp then rename to /known_dst.bin
  to prevent corruption if a crash occurs mid-write
- Fast persist: save within 5s of dirty flag (don't wait for 60s interval)
  via new should_persist_data() called from main loop
- Delete corrupt files: if magic bytes are invalid on load, remove the
  file so a fresh one can be written
- Recover from temp file: if .bin is missing but .tmp exists, rename it
  (crash happened between write and rename)
- Promote load/save logs to INFO for visibility

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
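
The write-then-rename and temp-file recovery steps above can be sketched on a host filesystem as follows. This uses `std::filesystem` rather than the SPIFFS API the firmware uses, and `atomic_save` and `recover_from_temp` are hypothetical names, so treat it as an illustration of the pattern only.

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Write the payload to <name>.tmp, then rename it over the final path.
// A crash mid-write leaves the previous final file untouched.
bool atomic_save(const fs::path& final_path, const std::string& payload) {
    fs::path tmp_path = final_path;
    tmp_path.replace_extension(".tmp");
    {
        std::ofstream out(tmp_path, std::ios::binary | std::ios::trunc);
        if (!out) {
            return false;
        }
        out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
        out.flush();
        if (!out) {
            return false;
        }
    }  // stream closed before the rename
    std::error_code ec;
    fs::rename(tmp_path, final_path, ec);
    return !ec;
}

// If the final file is missing but the temp file exists, the crash happened
// between write and rename, so promote the temp file.
bool recover_from_temp(const fs::path& final_path) {
    fs::path tmp_path = final_path;
    tmp_path.replace_extension(".tmp");
    std::error_code ec;
    if (fs::exists(final_path, ec) || !fs::exists(tmp_path, ec)) {
        return false;
    }
    fs::rename(tmp_path, final_path, ec);
    return !ec;
}
```
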
- Add _persist_yield_callback function pointer, called every 5 entries
  during save_known_destinations() to feed platform watchdog during
  slow SPIFFS flash I/O (71+ entries can take 30-50s)
- Increase should_persist_data() dirty threshold from 5s to 60s to
  reduce SPIFFS fragmentation from frequent writes. Exit handler and
  crash recovery paths still force immediate persist.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
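
A rough sketch of the yield-callback idea from the first bullet above: a plain function pointer is invoked every five entries so the platform can feed its watchdog during a long save. The function name and signature are hypothetical, and the per-entry write is elided.

```cpp
#include <cstddef>

// Plain function pointer, as in the commit's _persist_yield_callback.
using YieldCallback = void (*)();

// Hypothetical save loop: returns the number of entries written and calls
// yield_cb (if set) after every 5th entry.
std::size_t save_entries(std::size_t entry_count, YieldCallback yield_cb) {
    constexpr std::size_t kYieldInterval = 5;
    std::size_t written = 0;
    for (std::size_t i = 0; i < entry_count; ++i) {
        // ... the real code would write entry i to flash here ...
        ++written;
        if (yield_cb != nullptr && written % kYieldInterval == 0) {
            yield_cb();
        }
    }
    return written;
}
```

Saving 12 entries triggers the callback twice, after entries 5 and 10.
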
Adds a `persist` flag to KnownDestinationSlot so only destinations with
actual message exchange (contacts) are written to SPIFFS. Network announces
stay in the PSRAM pool for routing but don't survive reboots. This reduces
persistence time from 40-50s (150+ entries) to <1s (handful of contacts),
eliminating the main-loop blocking that caused device unresponsiveness.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…fication

Python generates LXMF test vectors (basic, empty, fields, large, unicode,
stamp) that C++ unpacks with signature validation. C++ generates vectors
that Python unpacks and verifies (signatures, hashes, fields, content).
Full pipeline orchestrated by run_interop.sh.

Also fixes native17 build: remove #ifdef ARDUINO guard from Bytes
std::vector constructor (needed for MsgPack bin_t interop), add
src_filter to exclude UI/BLE sources, and init Transport/Identity pools.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nsfer limit

Three bugs prevented LXMF propagation node message sync from working:

1. Link::request() BIN-wrapped pre-serialized msgpack data via Bytes::to_msgpack(),
   causing Python peers to see raw bytes instead of nested structures. Fixed by
   manually building the packed [timestamp, path_hash, data] array with raw
   embedded msgpack.

2. Resource responses from Link::request() were not routed to the request callback.
   The RESOURCE_ADV handler used a generic concluded callback that never called
   response_resource_concluded(). Fixed handle_resource_concluded() to detect
   response Resources by extracting request_id from packed data and matching
   against pending requests.

3. per_transfer_limit=0 sent as uint8 caused Python server to reject all messages
   (0 KB limit). Fixed to send msgpack nil for "no limit".

Also adds parse_response_array() for type-agnostic response parsing, refactors
sync into process_sync() state machine, adds NVS persistence for propagation
node selection and stamp costs, and cleans up Resource.cpp debug artifacts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
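
Bug 3 above comes down to the difference between the msgpack encodings of "zero" and "no value". The sketch below is a hand-rolled encoder for an optional unsigned limit, written only to show the bytes involved; `pack_transfer_limit` is a hypothetical name and this is not the serializer used in the codebase.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Encode an optional limit as msgpack: nil (0xc0) when absent, otherwise
// the smallest unsigned integer format that fits the value.
std::vector<std::uint8_t> pack_transfer_limit(std::optional<std::uint32_t> limit_kb) {
    std::vector<std::uint8_t> out;
    if (!limit_kb) {
        out.push_back(0xc0);  // nil: "no limit"
        return out;
    }
    const std::uint32_t v = *limit_kb;
    if (v <= 0x7f) {
        out.push_back(static_cast<std::uint8_t>(v));  // positive fixint
    } else if (v <= 0xff) {
        out.push_back(0xcc);  // uint 8
        out.push_back(static_cast<std::uint8_t>(v));
    } else if (v <= 0xffff) {
        out.push_back(0xcd);  // uint 16, big-endian
        out.push_back(static_cast<std::uint8_t>(v >> 8));
        out.push_back(static_cast<std::uint8_t>(v & 0xff));
    } else {
        out.push_back(0xce);  // uint 32, big-endian
        out.push_back(static_cast<std::uint8_t>(v >> 24));
        out.push_back(static_cast<std::uint8_t>((v >> 16) & 0xff));
        out.push_back(static_cast<std::uint8_t>((v >> 8) & 0xff));
        out.push_back(static_cast<std::uint8_t>(v & 0xff));
    }
    return out;
}
```

A limit of `0` always decodes as the integer zero on the receiving side, whichever integer format carries it, which is why the peer read it as a 0 KB limit. Only `std::nullopt` produces nil.
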
- Pack field keys as integers (not BIN) to match Python LXMF wire format
- Use general ENCRYPTED_PACKET_MDU (391 bytes) for explicitly OPPORTUNISTIC
  messages instead of LoRa-specific threshold (159 bytes)
- Fix unpack_from_bytes to deserialize field keys as integers
- Fix validate_signature to repack field keys as integers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bytes({key_int}) matched Bytes(size_t capacity) instead of creating a
1-byte buffer, storing empty keys that fields_get() could never match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
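
The overload pitfall described above can be reproduced with a small stand-in class. `Buffer` below is hypothetical and only mimics the two relevant constructor shapes. It is not the real `Bytes` class, which may have additional overloads.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in with a capacity constructor and a data constructor.
class Buffer {
public:
    // Reserves space but stores no bytes.
    Buffer(std::size_t capacity) { data_.reserve(capacity); }

    // Copies len bytes from the given pointer.
    Buffer(const std::uint8_t* bytes, std::size_t len)
        : data_(bytes, bytes + len) {}

    std::size_t size() const { return data_.size(); }

private:
    std::vector<std::uint8_t> data_;
};
```

With this class, `std::uint8_t key = 5; Buffer b({key});` selects `Buffer(std::size_t)`, because converting the single braced element to `std::size_t` is a standard conversion, so `b` ends up empty with room for five bytes. Passing the byte explicitly, as in `Buffer b(&key, 1);`, stores the one-byte key as intended.
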
3 participants