
Fix linux-att adapter wiring + V5X pagination & tolerant ACK waits #26

Closed
d-roman-halliday wants to merge 2 commits into Dejniel:master from d-roman-halliday:feature/linux-att-fixes-and-pagination


@d-roman-halliday
Contributor

Replacement for #24. The earlier PR predated 1424fe3 (#23, "linux bearer problem") on master and shipped a parallel linux_l2cap_client.py adapter (~1042 LoC). That parallel adapter is no longer the right shape: upstream's linux_att.py already takes the same architectural approach, so this PR keeps your code in place and instead fixes the wiring bugs that prevented it from working on at least one Linux host. The #25 host-side mitigations from #24 are preserved.

I'll close #24 after this one is in your review queue.

Commits

1. eddd640 Fix linux-att adapter wiring so it can actually connect on Linux

On Ubuntu 6.17 / BlueZ 5.83 / Python 3.12 with an MXW01, master (post 1424fe3/d28d937/b8d38a5) silently fell back to Bleak, i.e. the very BR/EDR-misroute path the linux-att workaround was added to bypass (#23). Three bugs caused this:

  1. Tuple/string mismatch. The shared _FallbackSocket iterates over (address, channel) tuples and passes the same tuple to every backend it owns. _LinuxAttSocket.connect() was typed as (address: str) and crashed on tuple.replace() inside address normalisation. Unpack the tuple before use; the RFCOMM channel is meaningless for an LE L2CAP/ATT socket so we discard it.
  2. EINPROGRESS not handled. _open_att_socket previously called sock.settimeout(timeout) before the L2CAP connect, which puts the socket in non-blocking mode. The raw libc.connect() then returns -1 / EINPROGRESS instead of blocking. The old code raised that as a fatal error. Now we either select+SO_ERROR (when the caller already set a timeout), or use a blocking connect and apply the timeout afterwards.
  3. Non-blocking L2CAP connect hangs. Even after the select+SO_ERROR fix, the AF_BLUETOOTH/L2CAP kernel path on this host never marked the socket writable, and the select() would time out after 30 s for a peer the blocking-mode equivalent connects to in <1 s. Switch to blocking mode for the connect itself, then settimeout() for subsequent read/writes.
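For illustration, the tuple/string fix in item 1 can be sketched roughly as below. This is not the code from `linux_att.py`: `address_from_target` is a hypothetical helper name, and the exact normalisation (dashes to colons, upper-casing) is an assumption; the PR only says that normalisation called `.replace()` on the value.

```python
from typing import Tuple, Union

# A backend may be handed either a bare address or an (address, channel) tuple.
Target = Union[str, Tuple[str, int]]


def address_from_target(target: Target) -> str:
    """Return a normalised address from 'AA:BB:..' or ('AA:BB:..', channel).

    Hypothetical helper: the name and the normalisation rules are assumptions.
    """
    if isinstance(target, tuple):
        # The RFCOMM channel has no meaning for an LE L2CAP/ATT socket.
        address, _channel = target
    else:
        address = target
    return address.replace("-", ":").upper()
```

Accepting both shapes at the entry point keeps the shared fallback loop unchanged, since it can go on passing the same tuple to every backend.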

Regression test for bug 1 in tests/test_bluetooth_adapter_fallback.py. Bugs 2 and 3 are exercised by the hardware Lorem-ipsum print below; unit tests for them would require faking ctypes.CDLL and are fragile relative to the underlying syscall behaviour.
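The two connect strategies from items 2 and 3 of the list above can be sketched generically as follows. This is a minimal illustration using only the standard library, not the adapter's code: both function names are hypothetical, and the `connect` callable stands in for the raw `libc.connect()` call the adapter makes via ctypes.

```python
import os
import select
import socket
from typing import Callable


def finish_nonblocking_connect(sock: socket.socket, timeout: float) -> None:
    """Complete a connect that returned EINPROGRESS (select + SO_ERROR)."""
    _, writable, _ = select.select([], [sock], [], timeout)
    if not writable:
        raise TimeoutError("connect did not complete within the timeout")
    err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    if err != 0:
        raise OSError(err, os.strerror(err))


def blocking_connect_then_timeout(
    sock: socket.socket,
    connect: Callable[[socket.socket], None],
    timeout: float,
) -> None:
    """Connect in blocking mode, then apply the timeout to later I/O only."""
    sock.setblocking(True)    # same effect as settimeout(None)
    connect(sock)             # stand-in for the raw connect call
    sock.settimeout(timeout)  # governs subsequent reads and writes
```

The second function is the shape this PR settles on for the L2CAP connect itself; the first is the fallback for callers that have already set a timeout on the socket.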

2. 59bc714 Add per-job-rows pagination and tolerant V5X command-ack waits

For #25. Two host-side mitigations that don't unlock the firmware ceiling but improve the failure mode:

  • Pagination in PrintJobBuilder. New env var TIMINI_PRINT_MAX_JOB_ROWS (default 0 = no split). When set, any rendered page raster taller than the limit is split vertically into sub-rasters and built as separate V5X sessions (A7 / A2 / A9 / bulk / AD), one per ProtocolStep.SEND so the runtime can pace them across one connection.
  • Tolerant runtime ACK waits in printing/runtime/v5x.py. _wait_for_start_ready already logged-and-continued on a missing 0xAA; _wait_for_command_ack did not. With pagination on, the second sub-job's 0xA7 ACK is preemptively consumed by a between-segments 0xA6 idle re-identification — so seg2's ACK never arrives and the wait raised an empty asyncio.TimeoutError that bubbled out as Error: BLE write failed: with no detail. Now caught and log-and-continued, mirroring the existing tolerance for 0xAA.
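As a rough sketch of the pagination idea in the first bullet above (not the `PrintJobBuilder` code itself): rows are modelled here as a plain Python sequence rather than a `RasterBuffer`, both function names are hypothetical, and treating unparsable or negative values as "no split" is an assumption about the env-var handling.

```python
import os
from typing import List, Mapping, Sequence, TypeVar

Row = TypeVar("Row")


def max_job_rows_from_env(env: Mapping[str, str] = os.environ) -> int:
    """Read TIMINI_PRINT_MAX_JOB_ROWS; 0 means 'do not split'.

    Assumption: invalid or negative values fall back to 0.
    """
    raw = env.get("TIMINI_PRINT_MAX_JOB_ROWS", "0").strip()
    try:
        value = int(raw)
    except ValueError:
        return 0
    return max(value, 0)


def split_rows(rows: Sequence[Row], max_rows: int) -> List[Sequence[Row]]:
    """Split a raster's rows vertically into chunks of at most max_rows."""
    if max_rows <= 0 or len(rows) <= max_rows:
        return [rows]
    return [rows[i:i + max_rows] for i in range(0, len(rows), max_rows)]
```

With a limit of 200, a 450-row page would yield three sub-rasters of 200, 200 and 50 rows, each of which would then be built as its own session.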

10 new pagination unit tests in tests/test_builder_pagination.py. Updated tests/test_bleak_transport_session.py::test_v5x_timeout_clears_pending_handshake_state to assert the new "send completes, state still cleared" behaviour rather than the old "raise".
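The "send completes, state still cleared" behaviour can be illustrated with a small asyncio sketch. It is not the code in printing/runtime/v5x.py: the function name, the `pending` dict of futures, and the logger name are all assumptions made for the example.

```python
import asyncio
import logging
from typing import Dict

log = logging.getLogger("v5x_sketch")  # hypothetical logger name


async def wait_for_ack_tolerant(
    pending: Dict[str, "asyncio.Future[bytes]"], key: str, timeout: float
) -> bool:
    """Wait for an ACK; on timeout, log and continue instead of raising.

    Returns True if the ACK arrived. The pending entry is always removed,
    so a missed ACK cannot leak handshake state into the next segment.
    """
    try:
        await asyncio.wait_for(pending[key], timeout)
        return True
    except asyncio.TimeoutError:
        log.warning("no ACK for %s within %.2fs; continuing", key, timeout)
        return False
    finally:
        pending.pop(key, None)
```

Returning a boolean rather than raising lets the caller decide whether a missing ACK matters, while the `finally` clause keeps the cleanup unconditional.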

Test plan

  • python -m unittest discover -s tests -p 'test_*.py' — 360/360 pass on this branch.
  • Live MXW01 (v1.9.3.1.2) on Linux + BlueZ:
    • BLE connect via linux-att succeeds in <1 s (Linux direct ATT connected: address_type=1 mtu_payload=509 services=4).
    • Short text print: clean run, 0xA9 status 0x0000.
    • Lorem-ipsum repro at TIMINI_BLE_BULK_DELAY_MS=30 (equivalent to b8d38a5's bulk pacing on MTU-512 V5X): host-side ACKs clean (0xA9 status 0x0000, no 0xAA fc 20 73-style distress bytes). Physical print still truncates at ~"ex ea commodo" (~37 %), confirming the residual constraint is in MXW01 firmware/render pipeline and not host-side. Documented in #25 ("V5X firmware variants have a per-job row ceiling; long jobs truncate mid-row silently (MXW01 v1.9.3.1.2)").
  • CI on Linux x86_64/arm64, Windows, macOS arm64/intel (will run on push).

Residual firmware constraint (for #25 reviewers)

Strongest hypothesis from the test sequence: MXW01 v1.9.3.1.2 has a per-power-cycle render budget (~200–300 rows of 384 px) rather than a per-job one. Once consumed, subsequent jobs are accepted at the protocol layer but produce no physical output. Pagination + tolerant acks make the failure mode graceful (the host doesn't abort, the printer doesn't cascade-fail subsequent jobs), but unlocking the ceiling needs firmware-side cooperation we don't have.

Note on local BlueZ state during verification

On this host the linux-att L2CAP path occasionally hangs at connect-time when BlueZ has a stale entry for the device (e.g. after a failed connect via the dbus path). Running bluetoothctl remove <addr> clears that cleanly. Not an issue caused by this PR — but worth flagging since the symptom (silent connect timeout) is easy to mistake for the original #23 bug.

d-roman-halliday and others added 2 commits May 15, 2026 09:36
59bc714 Add per-job-rows pagination and tolerant V5X command-ack waits

Two related fixes aimed at issue Dejniel#25 (V5X per-job row ceiling on
MXW01 v1.9.3.1.2 and likely similar firmwares).

1. **Pagination in PrintJobBuilder.**
   New env var `TIMINI_PRINT_MAX_JOB_ROWS` (default 0 = no split). When
   set, any rendered page raster taller than the limit is split
   vertically into sub-rasters and built as separate V5X sessions
   (`A7 / A2 / A9 / bulk / AD`). Each sub-session becomes a
   `ProtocolStep.SEND` carrying the full session bytes so the existing
   step-driven send loop paces them across one connection. Uses
   `RasterBuffer.slice_rows` which already supports the split.

2. **Tolerant runtime ACK waits in `printing/runtime/v5x.py`.**
   `_wait_for_start_ready` already logged-and-continued on missing
   `0xAA`; `after_split_command` did not. With pagination on, the
   second sub-job's `0xA7` ACK is preemptively consumed by a
   between-segments `0xA6` idle re-identification — so seg2's ACK
   never arrives and the wait raised an empty `asyncio.TimeoutError`
   that bubbled out as "BLE write failed:" with no detail. Now caught
   and logged the same way; the handshake state is still cleaned up
   so the session continues correctly.

Tests:
- `tests/test_builder_pagination.py` — 10 new tests covering the
  env-var parsing and `_split_raster_for_max_rows`.
- Updated `tests/test_bleak_transport_session.py
  test_v5x_timeout_clears_pending_handshake_state` to reflect the
  new "log + continue" behaviour rather than the old "raise". Same
  test still asserts the handshake state gets cleared.
- All 361 tests pass.

Validation: with `TIMINI_PRINT_MAX_JOB_ROWS=200` and
`TIMINI_BLE_BACKEND=l2cap` the long Lorem ipsum now drives the runtime
through both sub-jobs without the empty `TimeoutError` failure mode
that previously aborted the second segment. The printer reports clean
status (`0xAA` payload first byte `0x00` rather than `0xfc`).

Residual firmware constraint: the physical print is still truncated
at the same row count regardless of pagination, suggesting the MXW01
v1.9.3.1.2 has a *per-power-cycle* row budget rather than a per-job
one. That is hardware behaviour we cannot work around from the host;
documented in Dejniel#25 for further investigation.

eddd640 Fix linux-att adapter wiring so it can actually connect on Linux

Three bugs prevented the Linux direct-ATT/L2CAP workaround from ever
succeeding on this host (MXW01, Ubuntu 6.17 / BlueZ 5.83 / Python 3.12):
it would fail and silently fall back to Bleak, which is exactly the
BR/EDR-misroute path the workaround was added to bypass (Dejniel#23).

1. **Tuple/string mismatch at the entry point.** The shared
   `_FallbackSocket` iterates over `(address, channel)` tuples and passes
   the same tuple to every backend it owns. `_LinuxAttSocket.connect()`
   was typed as `(address: str)` and crashed on `tuple.replace()` inside
   address normalisation. Unpack the tuple form before use; the RFCOMM
   channel is meaningless for an LE L2CAP/ATT socket, so we discard it.

2. **EINPROGRESS not handled.** `_open_att_socket` called
   `sock.settimeout(timeout)` before the L2CAP connect, which puts the
   socket in non-blocking mode under the hood. The subsequent raw
   `libc.connect()` (called via ctypes for the LE-public/LE-random
   sockaddr form Python's stdlib doesn't expose) therefore returns -1
   with errno=EINPROGRESS instead of blocking. The old code raised that
   as a fatal connect failure.

3. **Non-blocking L2CAP connect hangs.** Even after handling EINPROGRESS
   via `select()`+SO_ERROR (the standard non-blocking-TCP pattern), the
   AF_BLUETOOTH/L2CAP kernel path on this host never marked the socket
   writable, and the select() would time out after 30 s for a peer that
   the blocking-mode equivalent connects to in <1 s. Switch to blocking
   mode for the connect itself, then apply the caller-requested timeout
   afterwards so it governs subsequent read/writes. (The EINPROGRESS
   handling stays in place so callers that *do* pre-set a timeout still
   work.)

Regression test for bug 1 added in tests/test_bluetooth_adapter_fallback.py.
Bugs 2 and 3 are exercised by the hardware Lorem-ipsum print included in
the PR's validation notes; unit tests for them would require faking
ctypes.CDLL and are fragile relative to the underlying syscall
behaviour.

Verified end-to-end against an MXW01 on this host: BLE connect now
succeeds in <1s and a full V5X print job (text and image) completes
with clean 0xA9 status notifications.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Dejniel
Owner

Dejniel commented May 20, 2026

@d-roman-halliday I cherry-picked the Linux ATT wiring commit from this PR onto master as 94abaeb, with you preserved as the commit author. Thanks for tracking that down.

I am not merging the whole PR as-is because the second commit mixes the Linux ATT fix with #25 pagination / tolerant V5X ACK behavior. For future PRs, please keep one behavioral layer per PR; it makes it much easier for me to merge your work directly.

Can you clarify whether the second commit changes anything observable for #25 on current master, beyond making the host continue/log instead of aborting? From your notes it looks like the physical truncation still happens, so I would rather keep that as a separate, explicitly modeled MXW01/V5X firmware issue unless it actually improves the printed output.

@d-roman-halliday
Contributor Author

d-roman-halliday commented May 20, 2026

Following up on the #25 retest above: truncation isn't reproducing on current master against the MXW01 anymore (full numbers in #25 comment). The pagination half of this PR is therefore optional.

@Dejniel
Owner

Dejniel commented May 22, 2026

Thanks again for the PR and the hardware testing.

I cherry-picked the Linux ATT wiring fix onto master as 94abaeb, preserving you as the commit author. Since #23 and #25 are now fixed on current master, and the pagination/tolerant-ACK part is optional rather than needed right now, I'm closing this PR without merging the rest.

If the stale BlueZ cache behavior keeps happening, please open a separate issue for it so we can track that independently.

@Dejniel closed this May 22, 2026