Fix linux-att adapter wiring + V5X pagination & tolerant ACK waits #26
d-roman-halliday wants to merge 2 commits into
Two related fixes aimed at issue Dejniel#25 (V5X per-job row ceiling on MXW01 v1.9.3.1.2 and likely similar firmwares).

1. **Pagination in PrintJobBuilder.** New env var `TIMINI_PRINT_MAX_JOB_ROWS` (default 0 = no split). When set, any rendered page raster taller than the limit is split vertically into sub-rasters and built as separate V5X sessions (`A7 / A2 / A9 / bulk / AD`). Each sub-session becomes a `ProtocolStep.SEND` carrying the full session bytes, so the existing step-driven send loop paces them across one connection. Uses `RasterBuffer.slice_rows`, which already supports the split.
2. **Tolerant runtime ACK waits in `printing/runtime/v5x.py`.** `_wait_for_start_ready` already logged and continued on a missing `0xAA`; `after_split_command` did not. With pagination on, the second sub-job's `0xA7` ACK is preemptively consumed by a between-segments `0xA6` idle re-identification, so seg2's ACK never arrives and the wait raised an empty `asyncio.TimeoutError` that bubbled out as "BLE write failed:" with no detail. It is now caught and logged the same way; the handshake state is still cleaned up so the session continues correctly.

Tests:
- `tests/test_builder_pagination.py`: 10 new tests covering the env-var parsing and `_split_raster_for_max_rows`.
- Updated `tests/test_bleak_transport_session.py::test_v5x_timeout_clears_pending_handshake_state` to reflect the new "log + continue" behaviour rather than the old "raise". The same test still asserts that the handshake state gets cleared.
- All 361 tests pass.

Validation: with `TIMINI_PRINT_MAX_JOB_ROWS=200` and `TIMINI_BLE_BACKEND=l2cap`, the long Lorem ipsum now drives the runtime through both sub-jobs without the empty `TimeoutError` failure mode that previously aborted the second segment. The printer reports clean status (`0xAA` payload first byte `0x00` rather than `0xfc`).
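A minimal sketch of the env-var parsing and vertical split described in item 1, assuming a raster is simply a list of per-row `bytes` (48 bytes for a 384 px row). `max_job_rows` and `split_rows` are hypothetical stand-ins for the builder's own `_split_raster_for_max_rows` / `RasterBuffer.slice_rows`, whose real signatures are not shown in this PR, and treating empty, non-numeric, or negative values as 0 is an assumption.

```python
import os
from typing import List, Mapping, Sequence

ENV_VAR = "TIMINI_PRINT_MAX_JOB_ROWS"


def max_job_rows(environ: Mapping[str, str] = os.environ) -> int:
    """Read the per-job row limit; 0 (the default) means "do not split".

    Assumption: empty, non-numeric, or negative values are treated as 0.
    """
    raw = environ.get(ENV_VAR, "").strip()
    if not raw:
        return 0
    try:
        return max(int(raw), 0)
    except ValueError:
        return 0


def split_rows(rows: Sequence[bytes], max_rows: int) -> List[Sequence[bytes]]:
    """Split a raster (one bytes object per printed row) into vertical chunks.

    Each chunk would then be built as its own A7 / A2 / A9 / bulk / AD session.
    """
    if max_rows <= 0 or len(rows) <= max_rows:
        return [rows]
    return [rows[i:i + max_rows] for i in range(0, len(rows), max_rows)]


# Usage: a 450-row page at 384 px (48 bytes per row) with a 200-row limit
page = [bytes(48)] * 450
chunks = split_rows(page, max_job_rows({ENV_VAR: "200"}))
print([len(c) for c in chunks])  # [200, 200, 50]
```

Keeping "0 = no split" as the default means existing jobs produce exactly one session, so the pagination path only activates when explicitly requested.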
Residual firmware constraint: the physical print is still truncated at the same row count regardless of pagination, suggesting the MXW01 v1.9.3.1.2 has a *per-power-cycle* row budget rather than a per-job one. That is hardware behaviour we cannot work around from the host; documented in Dejniel#25 for further investigation.
Three bugs prevented the Linux direct-ATT/L2CAP workaround from ever succeeding on this host (MXW01, Ubuntu 6.17 / BlueZ 5.83 / Python 3.12): it would fail and silently fall back to Bleak, which is exactly the BR/EDR-misroute path the workaround was added to bypass (Dejniel#23).

1. **Tuple/string mismatch at the entry point.** The shared `_FallbackSocket` iterates over `(address, channel)` tuples and passes the same tuple to every backend it owns. `_LinuxAttSocket.connect()` was typed as `(address: str)` and crashed on `tuple.replace()` inside address normalisation. Unpack the tuple form before use; the RFCOMM channel is meaningless for an LE L2CAP/ATT socket, so we discard it.
2. **EINPROGRESS not handled.** `_open_att_socket` called `sock.settimeout(timeout)` before the L2CAP connect, which puts the socket in non-blocking mode under the hood. The subsequent raw `libc.connect()` (called via ctypes for the LE-public/LE-random sockaddr form that Python's stdlib doesn't expose) therefore returns -1 with `errno=EINPROGRESS` instead of blocking. The old code raised that as a fatal connect failure.
3. **Non-blocking L2CAP connect hangs.** Even after handling EINPROGRESS via `select()` + `SO_ERROR` (the standard non-blocking TCP pattern), the AF_BLUETOOTH/L2CAP kernel path on this host never marked the socket writable, and `select()` would time out after 30 s for a peer that the blocking-mode equivalent connects to in <1 s. Switch to blocking mode for the connect itself, then apply the caller-requested timeout afterwards so it governs subsequent reads/writes. (The EINPROGRESS handling stays in place so callers that *do* pre-set a timeout still work.)

Regression test for bug 1 added in `tests/test_bluetooth_adapter_fallback.py`. Bugs 2 and 3 are exercised by the hardware Lorem-ipsum print included in the PR's validation notes; unit tests for them would require faking `ctypes.CDLL` and are fragile relative to the underlying syscall behaviour.
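For context on bugs 1 and 2, here is a rough sketch of what the ctypes side involves: accept either a bare address or the `(address, channel)` tuple that `_FallbackSocket` passes, discard the channel, and pack Linux's `struct sockaddr_l2` for an LE ATT connection (fixed CID 4, `BDADDR_LE_PUBLIC` or `BDADDR_LE_RANDOM` address type). It assumes a little-endian Linux host, and `normalise_target` / `build_att_sockaddr` are hypothetical helpers, not the project's `_LinuxAttSocket` code.

```python
import ctypes
from typing import Tuple, Union

AF_BLUETOOTH = 31        # Linux value of AF_BLUETOOTH
L2CAP_CID_ATT = 0x0004   # fixed LE channel ID used by ATT
BDADDR_LE_PUBLIC = 0x01
BDADDR_LE_RANDOM = 0x02


class SockaddrL2(ctypes.Structure):
    # Mirrors struct sockaddr_l2 from <bluetooth/l2cap.h>. l2_psm and l2_cid are
    # __le16 in the kernel header, so this layout assumes a little-endian host.
    _fields_ = [
        ("l2_family", ctypes.c_ushort),
        ("l2_psm", ctypes.c_ushort),
        ("l2_bdaddr", ctypes.c_ubyte * 6),
        ("l2_cid", ctypes.c_ushort),
        ("l2_bdaddr_type", ctypes.c_ubyte),
    ]


Target = Union[str, Tuple[str, int]]


def normalise_target(target: Target) -> str:
    """Accept "AA:BB:..." or an (address, channel) tuple; return "AA:BB:CC:DD:EE:FF"."""
    if isinstance(target, tuple):
        address, _channel = target  # RFCOMM channel is meaningless for LE ATT
    else:
        address = target
    address = address.strip().replace("-", ":").upper()
    parts = address.split(":")
    if len(parts) != 6 or any(len(p) != 2 for p in parts):
        raise ValueError(f"not a Bluetooth address: {address!r}")
    if not set("".join(parts)) <= set("0123456789ABCDEF"):
        raise ValueError(f"non-hex digits in address: {address!r}")
    return address


def build_att_sockaddr(target: Target, address_type: int = BDADDR_LE_PUBLIC) -> SockaddrL2:
    """Build the sockaddr a raw libc.connect(fd, byref(addr), sizeof(addr)) would take."""
    address = normalise_target(target)
    raw = bytes(int(p, 16) for p in address.split(":"))[::-1]  # bdaddr_t is byte-reversed
    addr = SockaddrL2()
    addr.l2_family = AF_BLUETOOTH
    addr.l2_psm = 0  # LE ATT uses a fixed CID rather than a PSM
    addr.l2_bdaddr = (ctypes.c_ubyte * 6)(*raw)
    addr.l2_cid = L2CAP_CID_ATT
    addr.l2_bdaddr_type = address_type
    return addr
```

Handling the tuple before any string method runs is the essence of the bug 1 fix: the `.replace()` that crashed is only reached once `address` is known to be a `str`.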
Verified end-to-end against an MXW01 on this host: BLE connect now succeeds in <1 s and a full V5X print job (text and image) completes with clean 0xA9 status notifications.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@d-roman-halliday I cherry-picked the Linux ATT wiring commit from this PR onto […]. I am not merging the whole PR as-is because the second commit mixes the Linux ATT fix with #25 pagination / tolerant V5X ACK behavior. For future PRs, please keep one behavioral layer per PR; it makes it much easier for me to merge your work directly. Can you clarify whether the second commit changes anything observable for #25 on current […]?
Following up on the #25 retest above: truncation isn't reproducing on current […].
Thanks again for the PR and the hardware testing. I cherry-picked the Linux ATT wiring fix onto master as 94abaeb, preserving you as the commit author. Since #23 and #25 are now fixed on current master, and the pagination/tolerant-ACK part is optional rather than needed right now, I'm closing this PR without merging the rest. If the stale BlueZ cache behavior keeps happening, please open a separate issue for it so we can track that independently.
Replacement for #24. The earlier PR predated `1424fe3` ("linux bearer problem #23") on `master` and shipped a parallel `linux_l2cap_client.py` adapter (~1042 LoC). That parallel adapter is no longer the right shape: upstream's `linux_att.py` already takes the same architectural approach, so this PR keeps your code in place and instead fixes the wiring bugs that prevented it from working on at least one Linux host. The #25 host-side mitigations from #24 are preserved.

I'll close #24 after this one is in your review queue.
Commits
1. **`eddd640` Fix linux-att adapter wiring so it can actually connect on Linux**

   On Ubuntu 6.17 / BlueZ 5.83 / Python 3.12 (MXW01), `master` (post `1424fe3`/`d28d937`/`b8d38a5`) silently fell back to Bleak, i.e. the very BR/EDR-misroute path the linux-att workaround was added to bypass (#23). Three bugs:

   1. `_FallbackSocket` iterates over `(address, channel)` tuples and passes the same tuple to every backend it owns. `_LinuxAttSocket.connect()` was typed as `(address: str)` and crashed on `tuple.replace()` inside address normalisation. Unpack the tuple before use; the RFCOMM channel is meaningless for an LE L2CAP/ATT socket, so we discard it.
   2. `_open_att_socket` previously called `sock.settimeout(timeout)` before the L2CAP connect, which puts the socket in non-blocking mode. The raw `libc.connect()` then returns -1 / `EINPROGRESS` instead of blocking, and the old code raised that as a fatal error. Now we either use `select()` + `SO_ERROR` (when the caller already set a timeout) or use a blocking connect and apply the timeout afterwards.
   3. The non-blocking L2CAP connect never completed on this host, so the connect itself now runs in blocking mode and only then applies `settimeout()` for subsequent reads/writes.

   Regression test for bug 1 in `tests/test_bluetooth_adapter_fallback.py`. Bugs 2 and 3 are exercised by the hardware Lorem-ipsum print below; unit tests for them would require faking `ctypes.CDLL` and are fragile relative to the underlying syscall behaviour.
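The two connect strategies described for bugs 2 and 3 can be illustrated with an ordinary TCP socket, because the `select()` + `SO_ERROR` completion check is the same one used for non-blocking TCP. This is a sketch, not the adapter's code: `connect_ex()` on a non-blocking socket stands in for the raw ctypes `libc.connect()` call, and both function names are hypothetical.

```python
import errno
import os
import select
import socket

# errno values meaning "connect started but has not finished yet"
IN_PROGRESS = {errno.EINPROGRESS, errno.EWOULDBLOCK}


def connect_nonblocking(sock: socket.socket, address, timeout: float) -> None:
    """Path kept for callers that pre-set a timeout: EINPROGRESS, then select() + SO_ERROR."""
    sock.setblocking(False)          # what a pre-set timeout does to the fd under the hood
    rc = sock.connect_ex(address)    # stands in for the raw libc.connect() call
    if rc in IN_PROGRESS:
        # The connect finishes (successfully or not) when the socket turns writable.
        _, writable, _ = select.select([], [sock], [], timeout)
        if not writable:
            raise TimeoutError(f"connect to {address!r} timed out after {timeout}s")
        rc = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    if rc != 0:
        raise OSError(rc, os.strerror(rc))
    sock.settimeout(timeout)         # later reads/writes use the caller's timeout


def connect_blocking_then_timeout(sock: socket.socket, address, timeout: float) -> None:
    """Path the PR switched to: block for the connect itself, then apply the timeout."""
    sock.settimeout(None)            # fully blocking; connect() cannot return EINPROGRESS
    sock.connect(address)
    sock.settimeout(timeout)         # now governs subsequent reads/writes only
```

According to the PR, only the second path completed for AF_BLUETOOTH/L2CAP on the reporter's host, because `select()` never reported the socket writable; the first path is retained for callers that pre-set a timeout.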
2. **`59bc714` Add per-job-rows pagination and tolerant V5X command-ack waits**

   For #25. Two host-side mitigations that don't unlock the firmware ceiling but improve the failure mode:

   - **Pagination in `PrintJobBuilder`.** New env var `TIMINI_PRINT_MAX_JOB_ROWS` (default `0` = no split). When set, any rendered page raster taller than the limit is split vertically into sub-rasters and built as separate V5X sessions (`A7 / A2 / A9 / bulk / AD`), one per `ProtocolStep.SEND`, so the runtime can pace them across one connection.
   - **Tolerant runtime ACK waits in `printing/runtime/v5x.py`.** `_wait_for_start_ready` already logged and continued on a missing `0xAA`; `_wait_for_command_ack` did not. With pagination on, the second sub-job's `0xA7` ACK is preemptively consumed by a between-segments `0xA6` idle re-identification, so seg2's ACK never arrives and the wait raised an empty `asyncio.TimeoutError` that bubbled out as `Error: BLE write failed:` with no detail. It is now caught and logged and the send continues, mirroring the existing tolerance for `0xAA`.

   10 new pagination unit tests in `tests/test_builder_pagination.py`. Updated `tests/test_bleak_transport_session.py::test_v5x_timeout_clears_pending_handshake_state` to assert the new "send completes, state still cleared" behaviour rather than the old "raise".

Test plan
- `python -m unittest discover -s tests -p 'test_*.py'`: 360/360 pass on this branch.
- `linux-att` connect succeeds in <1 s (`Linux direct ATT connected: address_type=1 mtu_payload=509 services=4`).
- Text and image print jobs complete with `0xA9` status `0x0000`.
- With `TIMINI_BLE_BULK_DELAY_MS=30` (equivalent to `b8d38a5`'s bulk pacing on MTU-512 V5X): host-side ACKs are clean (`0xA9` status `0x0000`, no `0xAA fc 20 73`-style distress bytes). The physical print still truncates at ~"ex ea commodo" (~37 %), confirming the residual constraint is in the MXW01 firmware/render pipeline and not host-side. Documented in #25 ("V5X firmware variants have a per-job row ceiling; long jobs truncate mid-row silently (MXW01 v1.9.3.1.2)").

Residual firmware constraint (for #25 reviewers)
Strongest hypothesis from the test sequence: MXW01 v1.9.3.1.2 has a per-power-cycle render budget (~200–300 rows of 384 px) rather than a per-job one. Once consumed, subsequent jobs are accepted at the protocol layer but produce no physical output. Pagination + tolerant acks make the failure mode graceful (the host doesn't abort, the printer doesn't cascade-fail subsequent jobs), but unlocking the ceiling needs firmware-side cooperation we don't have.
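The "host doesn't abort" part comes from the tolerant command-ack wait in `59bc714`. Below is a minimal asyncio sketch of that log-and-continue behaviour, for the case where an earlier `0xA6` exchange has already swallowed the expected `0xA7` ACK: the timeout is logged instead of raised, and the pending handshake state is cleared in a `finally` block on both paths. `AckWaiter` and its methods are hypothetical; the real runtime in `printing/runtime/v5x.py` is not reproduced here.

```python
import asyncio
import logging
from typing import Dict, Optional

log = logging.getLogger("v5x_sketch")


class AckWaiter:
    """Hypothetical stand-in for the runtime's pending-handshake bookkeeping."""

    def __init__(self) -> None:
        # opcode -> future resolved by the matching printer notification
        self.pending: Dict[int, "asyncio.Future[bytes]"] = {}

    def expect(self, opcode: int) -> "asyncio.Future[bytes]":
        """Register interest in an ACK before the command is written."""
        fut = asyncio.get_running_loop().create_future()
        self.pending[opcode] = fut
        return fut

    def on_notification(self, opcode: int, payload: bytes) -> None:
        """Called from the notification handler; resolves a matching waiter."""
        fut = self.pending.get(opcode)
        if fut is not None and not fut.done():
            fut.set_result(payload)

    async def wait_tolerant(self, opcode: int, timeout: float) -> Optional[bytes]:
        """Wait for an ACK; on timeout, log and return None instead of raising."""
        fut = self.pending.get(opcode)
        if fut is None:
            fut = self.expect(opcode)
        try:
            return await asyncio.wait_for(fut, timeout)
        except asyncio.TimeoutError:
            log.warning("no 0x%02X ack within %.2fs; continuing", opcode, timeout)
            return None
        finally:
            # Clear handshake state on both paths so the next segment starts clean.
            self.pending.pop(opcode, None)
```

Returning `None` rather than raising lets the caller decide whether a missing ACK matters, which matches the existing treatment of a missing `0xAA`.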
Note on local BlueZ state during verification
On this host the linux-att L2CAP path occasionally hangs at connect time when BlueZ has a stale entry for the device (e.g. after a failed connect via the D-Bus path). Running `bluetoothctl remove <addr>` clears that cleanly. This is not an issue caused by this PR, but it is worth flagging, since the symptom (silent connect timeout) is easy to mistake for the original #23 bug.