Create a Scaler WASM client you can run in your browser with Pyodide/JupyterLite #737
Draft
e117649 wants to merge 83 commits into finos:main
002a16e to
561194f
Compare
Signed-off-by: eric <eric117649@gmail.com>
Implements RFC 6455 WebSocket framing over TCP so YMQ can traverse WebSocket-aware proxies and firewalls without an external library.
- Add WebSocketAddress struct and ws:// / wss:// parsing to Address
- Implement WebSocketStream: HTTP/1.1 Upgrade handshake (client + server), binary frame encode/decode, SHA-1 + Base64 inline for Sec-WebSocket-Accept
- Extend Client variant to include WebSocketStream alongside TCP and IPC
- Wire WebSocket accept path into AcceptServer (binds TCP, upgrades on connect)
- Wire WebSocket connect path into ConnectClient (TCP connect then HTTP upgrade)
- Document libwebsockets in library_tool.sh for future wss:// support
- Add address parsing tests and an end-to-end ClientServerHandshake test
Closes finos#728
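For readers unfamiliar with the handshake, the Sec-WebSocket-Accept computation mentioned above can be sketched in a few lines. This is an illustration in Python using hashlib and base64, not the hand-rolled C++ code from the commit; the function name compute_websocket_accept is an assumption made for this example. The GUID and the sample key/accept pair come from RFC 6455.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
_WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def compute_websocket_accept(sec_websocket_key: str) -> str:
    """Return the Sec-WebSocket-Accept value for a client's Sec-WebSocket-Key.

    The server concatenates the key with the GUID, hashes the result with
    SHA-1, and Base64-encodes the 20-byte digest.
    """
    digest = hashlib.sha1((sec_websocket_key + _WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


# Sample key from the RFC 6455 handshake example.
example_accept = compute_websocket_accept("dGhlIHNhbXBsZSBub25jZQ==")
```

A client validates the server's response by running the same computation on the key it sent and comparing the two strings.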
Introduces WebSocketSocket, a blocking Socket implementation backed by RFC 6455 WebSocket framing over TCP, and wires it into the parameterised YMQSocketTest suite alongside the existing tcp/ipc variants. All 12 socket test cases now run under ws://, including raw-client / raw-server combinations and edge-case tests (huge header, incomplete identity, big message, slow network).
- Handle fragmented messages: FIN bit parsed, continuation frames assembled in _fragmentBuffer before delivery
- Respond to PING with PONG; echo CLOSE frame on receipt (RFC 6455 §5.5)
- Reject frames with payload > 64 MiB to prevent OOM (UV_EPROTO)
- Cap upgrade-phase read buffer at 64 KiB to prevent unbounded growth
- Validate HTTP upgrade: GET method, Connection header, Sec-WebSocket-Version: 13
- Send WebSocket CLOSE frame in shutdown() before TCP FIN
- Replace assert() in finishClientUpgrade/finishServerUpgrade with error returns
- Fix extractHeader to tolerate headers with no space after colon (RFC 7230 OWS)
- Add 10 new tests covering each fix
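The FIN-bit parsing and the 64 MiB payload cap from the list above can be illustrated with a small header parser. This is a Python sketch of the RFC 6455 frame layout, not the C++ implementation in the PR; FrameHeader, parse_frame_header, and the choice of ValueError (where the commit reports UV_EPROTO) are assumptions made for this example.

```python
import struct
from typing import NamedTuple, Optional

MAX_PAYLOAD = 64 * 1024 * 1024  # 64 MiB cap, matching the commit message


class FrameHeader(NamedTuple):
    fin: bool            # False means continuation frames follow
    opcode: int          # 0x0 continuation, 0x1 text, 0x2 binary, 0x8 close, 0x9 ping, 0xA pong
    masked: bool
    payload_length: int
    header_size: int     # bytes occupied by the header, including any mask key


def parse_frame_header(buf: bytes) -> Optional[FrameHeader]:
    """Parse one frame header; return None if more bytes are needed."""
    if len(buf) < 2:
        return None
    fin = bool(buf[0] & 0x80)
    opcode = buf[0] & 0x0F
    masked = bool(buf[1] & 0x80)
    length = buf[1] & 0x7F
    offset = 2
    if length == 126:  # 16-bit extended length, network byte order
        if len(buf) < offset + 2:
            return None
        (length,) = struct.unpack_from("!H", buf, offset)
        offset += 2
    elif length == 127:  # 64-bit extended length, network byte order
        if len(buf) < offset + 8:
            return None
        (length,) = struct.unpack_from("!Q", buf, offset)
        offset += 8
    if length > MAX_PAYLOAD:
        raise ValueError("frame payload exceeds 64 MiB cap")
    if masked:  # a 4-byte masking key follows the length field
        if len(buf) < offset + 4:
            return None
        offset += 4
    return FrameHeader(fin, opcode, masked, length, offset)
```

A receiver that sees fin=False would append the payload to a fragment buffer and deliver the message only once a frame with fin=True arrives.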
Moves sha1, base64Encode, generateWebSocketKey, computeWebSocketAccept, and extractHeader out of anonymous namespaces in websocket_stream.cpp and websocket_socket.cpp into a shared websocket_utils.h/.cpp. Adds a comment explaining why these are hand-rolled rather than pulled from a library.
The WebSocket transport is now implemented without libwebsockets; remove the library script support and compiled/downloaded artifacts.
The YMQ transport was crashing during tests (specifically in WebSocket transport) when a socket was disconnected exactly as an operation was being submitted or during shutdown. This was caused by the use of UV_EXIT_ON_ERROR for network errors that should be handled as disconnections rather than fatal system failures.
Changes:
- Added UV_ENOTCONN to the handled errors in onRead and onWriteDone. It is now treated as a graceful/aborted disconnection rather than a crash.
- Updated shutdownClient to ignore UV_ENOTCONN. If a socket is already disconnected, shutting it down is now a no-op.
- Refactored processSendOperation to remove UV_EXIT_ON_ERROR from write calls. It now handles immediate failures by notifying the caller through the provided callback.
- Introduced a shared_ptr to manage SendMessageCallback ownership during chunked writes, ensuring the callback is executed exactly once even if an intermediate chunk fails.
Safety Rationale: These changes distinguish between network volatility (expected in async IO) and actual logic bugs. Fatal system errors (like EINVAL or EBADF) still trigger UV_EXIT_ON_ERROR, while connectivity-related errors are propagated to the upper layers of the application to be handled by the socket's state machine.
This fixes the Cap'n Proto version mismatch error in macOS CI where library version (1.0.1) did not match the version of the committed generated files (1.2.0). Generated files are now correctly ignored in src/protocol/ and will be regenerated during the build process.
Browser/Pyodide environments cannot load the _ymq C extension. This adds a pure-Python implementation in scaler.io.ymq._ymq_wasm that mirrors the C extension's surface (Bytes, Message, Address, IOContext, ConnectorSocket, errors) on top of js.WebSocket and uses the same YMQ wire protocol as the native build (4-byte magic + 8-byte LE length-prefixed frames). scaler.io.ymq.__init__ dispatches to the shim when sys.platform == 'emscripten'; native imports are unchanged. BinderSocket and ConnectorSocket.bind raise NotImplementedError since browsers cannot accept inbound connections. Includes 34 unit tests covering Bytes/Message/Address surface, handshake, framing (single/split/concatenated), zero-length, recv-before-callback buffering, send-before-open queueing, shutdown, and remote close. Tests use a fake WebSocket so they run in any CPython env without a browser.
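The framing cases listed in the tests (single, split, concatenated, zero-length) can be sketched with a small incremental decoder for 8-byte little-endian length-prefixed frames. This is a minimal illustration, not the shim's code: FrameDecoder and encode_frame are hypothetical names, and the 4-byte magic handshake is deliberately left out because its value lives in the C++ configuration header.

```python
import struct
from typing import List

# 8-byte little-endian unsigned length prefix, as described in the commit message.
_LENGTH = struct.Struct("<Q")


def encode_frame(payload: bytes) -> bytes:
    """Prefix a payload with its length."""
    return _LENGTH.pack(len(payload)) + payload


class FrameDecoder:
    """Accumulates raw chunks and yields complete payloads as they become available."""

    def __init__(self) -> None:
        self._buffer = bytearray()

    def feed(self, chunk: bytes) -> List[bytes]:
        self._buffer += chunk
        frames: List[bytes] = []
        while len(self._buffer) >= _LENGTH.size:
            (length,) = _LENGTH.unpack_from(self._buffer, 0)
            end = _LENGTH.size + length
            if len(self._buffer) < end:
                break  # payload not fully received yet; wait for more data
            frames.append(bytes(self._buffer[_LENGTH.size:end]))
            del self._buffer[:end]
        return frames
```

Because WebSocket messages may arrive in chunks that do not line up with YMQ frame boundaries, the decoder keeps leftover bytes between calls rather than assuming one chunk equals one frame.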
Cross-verify the pure-Python browser shim against the native _ymq C
extension so that drift in the wire protocol, error taxonomy, or API
surface fails loudly rather than silently breaking browser clients.
Three levels of parity are checked:
1. Wire-protocol constants are parsed from the authoritative C++ header
(src/cpp/scaler/ymq/configuration.h) and compared against the shim's
_MAGIC_STRING and header-format constants.
2. Module surface: every name the shim re-exports must exist on the
native module, and exception subclasses must share the same parent.
3. Value semantics: Bytes round-trip, Address scheme classification, and
ErrorCode enum values must match between the two implementations.
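The first parity level, comparing constants parsed from a C++ header against the shim, might look roughly like the sketch below. Everything here is an assumption for illustration: the header excerpt, the constant names, the regular expression, and the helper names parse_header_constants and find_drift are invented and may not match the real configuration.h or the real test.

```python
import re
from typing import Dict, List, Union

# Matches simple assignments of the form: NAME = "text"; or NAME = 123;
_ASSIGNMENT_RE = re.compile(r'\b(\w+)\s*=\s*(?:"([^"]*)"|(\d+))\s*;')

# Hypothetical header excerpt used only for this example.
HYPOTHETICAL_HEADER = '''
// Hypothetical excerpt; the real configuration.h may look different.
constexpr std::string_view MAGIC_STRING = "ABCD";
constexpr size_t HEADER_SIZE = 8;
'''


def parse_header_constants(header_text: str) -> Dict[str, Union[str, int]]:
    """Collect string and integer constants from C++ source text."""
    constants: Dict[str, Union[str, int]] = {}
    for match in _ASSIGNMENT_RE.finditer(header_text):
        name, string_value, int_value = match.groups()
        constants[name] = string_value if string_value is not None else int(int_value)
    return constants


def find_drift(native: Dict[str, Union[str, int]], shim: Dict[str, Union[str, int]]) -> List[str]:
    """Return the names whose shim value differs from (or is missing in) the native header."""
    return sorted(name for name, value in shim.items() if native.get(name) != value)
```

A parity test would then assert that find_drift returns an empty list, so a change on either side fails loudly.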
Pure mechanical refactor: move the 'create internal address + spawn ClientAgent thread + create SyncConnector' code path out of Client.__initialize__ into a new ClientAgentBridge abstract base with an IPCAgentBridge implementation that preserves the existing behavior exactly. Client.__initialize__, shutdown, and __destroy now delegate to the bridge. No runtime or protocol change; existing tests in tests/client and tests/io continue to pass. This sets up a follow-up commit to add an InProcessAgentBridge for the browser / Pyodide environment, where threads and IPC are unavailable but the same ClientAgent logic can run on the single available asyncio loop.
Browser clients need to express their scheduler endpoint as ws://host:port since the only transport available to a WebAssembly page is the browser's WebSocket API. Native clients gain the same capability for free (the ymq C++ layer already speaks ws:// via WebSocketStream) though no native code path currently hands a ws:// AddressConfig to it. ws/wss parse and render exactly like tcp (host:port, port required). Paths are not supported at the AddressConfig level; if needed later they should be added as an explicit optional field rather than embedded in host.
Implements ClientAgentBridge without threads or real IPC sockets so the same ClientAgent code can run in the browser. The agent coroutine runs on the user's asyncio loop and exchanges BaseMessage objects with the Client via two asyncio.Queue halves. The sync half of the connector pair blocks synchronously via pyodide.ffi.run_sync (JSPI), so Client's public sync API keeps working unchanged.
- ClientAgent gains an optional internal_connector_factory parameter so the bridge can inject an in-process internal connector while the external connector still goes through the real network backend. Native path is unchanged (the parameter defaults to None).
- Client dispatches to IPCAgentBridge or InProcessAgentBridge via the new create_default_bridge() factory keyed on sys.platform.
- Bridge tests cover: in-process queue handoff (both directions), bind/connect no-ops, routine dispatch, shutdown/destroy sentinel propagation, platform dispatch, and surface parity between the two bridges (a missing method on either fails the parity test).
Known limitation addressed in a follow-up commit: object storage is still reached via SyncObjectStorageConnector which is raw-TCP-only. Browser clients will route object I/O through a WebSocket object-storage gateway added in Layer 3.
ScalerFuture is a concurrent.futures.Future subclass, so wrapping it with asyncio.wrap_future inside __await__ lets notebook code use either the blocking .result() path or the async 'await future' path with the same object. In the browser (Pyodide) the sync path depends on JSPI; await is the natural form and doesn't require any browser capability. Native CPython behavior is unchanged — wrap_future handles the cross-thread case; existing tests/client/test_future.py still passes. Tests cover: pre-set result, mid-await result, exception propagation, cancel propagation (as asyncio.CancelledError), and a regression guard that the sync .result() / exception-raising path keeps working.
…g.__str__ Mostly IDE format-on-save output matching the repo's black/isort/flake8 configuration (line-length=120, skip-magic-trailing-comma=true, isort profile=black). Also drops an accidental duplicated (unreachable) 'return repr(self)' line that had snuck into AddressConfig.__str__. No behavior change; all wasm-branch tests still pass (74/74).
Browser Client instantiation fails fast with a clear RuntimeError when JSPI (pyodide.ffi.run_sync) is unavailable, instead of deadlocking at the first sync call. Native platforms are untouched. The check lives alongside the bridge that actually needs JSPI so it can share the same sys.platform gate and import probe. It runs once at the top of Client.__initialize__ so it fires on construction rather than at first task submit.
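A fail-fast check of this kind could be sketched as below. The body is an assumption built from the description (a sys.platform gate plus an import probe for pyodide.ffi.run_sync); the parameters platform and probe are added here only so the sketch can be exercised outside a browser, and the real check may differ.

```python
import sys
from typing import Callable


def _jspi_available() -> bool:
    """Import probe: True when pyodide.ffi.run_sync can be imported."""
    try:
        from pyodide.ffi import run_sync  # noqa: F401
    except ImportError:
        return False
    return True


def check_browser_runtime(platform: str = sys.platform, probe: Callable[[], bool] = _jspi_available) -> None:
    """Raise a clear error at construction time instead of deadlocking later."""
    if platform != "emscripten":
        return  # native platforms are untouched
    if not probe():
        raise RuntimeError(
            "This browser runtime does not provide JSPI (pyodide.ffi.run_sync); "
            "the synchronous client API cannot work here."
        )
```

Running the check once when the client is constructed means a missing capability surfaces immediately rather than at the first task submit.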
CI mypy run flagged the bytes literal 'b"id"' as incompatible with create_default_bridge's identity: ClientID parameter. Switch to ClientID.generate_client_id() to get the right type.
asyncio.Queue() reads the current event loop at construction time on Python 3.8 (the deprecation only landed in 3.10). The previous test constructed the queues outside any running loop, so they bound to the process default loop while the test driver ran a separate loop, causing 'Future attached to a different loop' under 3.8 CI. Set the driver's loop as the current event loop in setUp and create the queues from inside that loop so they bind correctly on every Python version we support.
…o issue-728-websocket-transport
The merge left two competing branches in __repr__: a 'tcp/ws/wss' arm
that printed scheme://host:port (from the wasm branch) and the newer
ws/wss arm that appends the path component (from the websockets branch).
The first arm short-circuited the second, so ws:// and wss:// addresses
dropped their path on round-trip and CI failed:
ws://127.0.0.1:8765 != ws://127.0.0.1:8765/
Split the cases: tcp:// alone first, then ws/wss with path, then the
no-port schemes. Both single-branch test suites pass:
tests.config.test_config_types: 13/13
tests.client.test_bridge + test_future_await + tests.io.test_ymq_*: 68/68
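The split described above (tcp first, then ws/wss with a path, then schemes without a port) can be sketched with a small stand-in class. SketchAddress is hypothetical and much simpler than the real AddressConfig; the default path of "/" is an assumption chosen to match the failing comparison shown in the commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True, repr=False)
class SketchAddress:
    scheme: str
    host: str
    port: Optional[int] = None
    path: str = "/"

    def __repr__(self) -> str:
        # tcp on its own, so it cannot swallow the ws/wss case below
        if self.scheme == "tcp":
            return f"tcp://{self.host}:{self.port}"
        # ws/wss keep their path component on round-trip
        if self.scheme in ("ws", "wss"):
            return f"{self.scheme}://{self.host}:{self.port}{self.path}"
        # schemes without a port, such as ipc
        return f"{self.scheme}://{self.host}"
```

Had tcp, ws, and wss shared a single arm, the path would be dropped for WebSocket addresses, which is the round-trip mismatch the merge introduced.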
finishClientUpgrade and finishServerUpgrade are moved from file-static free functions to private static class methods, giving them access to the private fromUpgradedSocket without exposing it publicly. Tests are updated to go through the real HTTP upgrade handshake via a new TestWebSocketStreamPair helper, removing the need for any external access to fromUpgradedSocket.
Mirrors the public sync surface that scaler.io.ymq.ConnectorSocket exposes on native (via the sockets.py wrapper around the callback-based C extension). On Pyodide the new helpers drive the existing callback-based send/recv via pyodide.ffi.run_sync (JSPI), which suspends the wasm stack while the JS event loop continues to drive the underlying WebSocket events. The JSPI bridge (_run_sync_jspi) is module-private and importable in any Python environment; tests patch it with a plain asyncio loop driver so the helpers can be exercised under CPython. Adds 4 new wasm tests and 1 parity test that fails if either implementation drifts from the shared callable surface (send_message, recv_message, send_message_sync, recv_message_sync, shutdown).
The public scaler.io.ymq.ConnectorSocket comes from the sockets.py wrapper on native, which exposes a synchronous ConnectorSocket.connect(context, identity, address, ...) — no callback. The wasm shim previously required a callback as the first positional argument (mirrored from the bare C extension), preventing YMQSyncObjectStorageConnector from binding against the wasm shim under sys.platform == 'emscripten'. This change drops the never-used 'fired immediately with None' callback from the wasm connect path so YMQSyncObjectStorageConnector now works unchanged on emscripten, using the wasm shim's send_message_sync / recv_message_sync helpers added in 7cda10a. Adds tests/io/test_object_storage_ws_parity.py: spins up two real C++ ObjectStorageServer instances, one bound on tcp:// and one on ws://, and asserts YMQSyncObjectStorageConnector produces byte-equal results across both transports. This locks down the ws:// bind path that the websockets→wasm merge enabled (Layer 3 piece 1) and is the integration-level counterpart to the wire-protocol parity tests in test_ymq_parity.py.
The TestTCPPair was created as a stack local inside the constructor and destroyed at end of construction, but its listen callback (capturing a pointer to the TestTCPPair) could fire later during loop.run() in the test — accessing a destroyed object. Fix: store TestTCPPair as a unique_ptr member so it stays alive for the lifetime of the test. Also heap-allocate headerBuf via shared_ptr so the drain callback does not hold a dangling reference to a constructor local.
The JupyterLite kernel (jupyterlite-pyodide-kernel 0.7.1) now ships Pyodide 0.29.3 built with Emscripten 4.0.9. The wasm build pipeline needs to match that ABI for piplite to load the Scaler wheel.
- Cap'n Proto bumped 1.0.1 -> 1.1.0 (emsdk 4.0.9's libc++ rejects capnp 1.0.1 RPC headers); native install also needs to be 1.1.0 so the host capnp tool emits compatible generated code.
- New emscripten patch for capnp 1.1.0: gates capnp_tool, capnp-rpc, capnp-json, and capnp-websocket out of the wasm build (RPC libs trip a defaulted-copy-assign diagnostic; we don't need any of them for serialization).
- setup-wasm-env action: defaults bumped to emsdk 4.0.9 / pyodide-build 0.34.3 / Pyodide xbuildenv 0.29.3 / Python 3.13. Replaces apt-installed capnp with a from-source build (Ubuntu's package lags behind the wasm libcapnp ABI).
- jupyterlite-sphinx wired into docs (extension, contents glob, and Try-in-Browser banner via nbsphinx_prolog). Smoke-test notebook added under gallery/ that piplite-installs the wasm wheel from /_static/wasm/ (origin-absolute URL escapes the JupyterLite SW).
- _static/wasm/ added to .gitignore (wheels are local build outputs).
Make the scaler Python client load and run inside Pyodide (the browser)
without dragging in CPython-only dependencies, and lock the import
surface in CI so we don't regress.
C++ (capnp extension)
- src/cpp/scaler/protocol/CMakeLists.txt: emscripten-only
--whole-archive for capnp/capnpc/kj so all template instantiations end
up in the SIDE_MODULE.
- CMakeLists.txt: add -fwasm-exceptions -sSUPPORT_LONGJMP
-fvisibility=hidden -fvisibility-inlines-hidden
-fno-merge-all-constants for the emscripten target. The last flag
works around a Pyodide SIDE_MODULE relocator bug that ran adjacent
string literals together.
- scripts/library_tool.sh: build wasm capnp + libuv with -fPIC
-fwasm-exceptions -sSUPPORT_LONGJMP and without -fvisibility=hidden so
the cross-archive symbols stay visible to the wheel.
- src/cpp/scaler/protocol/pymod/{bootstrap,capnp,schema_registry}.cpp:
every string literal that flows into PyImport_ImportModule,
PyObject_GetAttrString, PyDict_SetItemString, PyModuleDef.m_name,
PyModule_AddObjectRef and registerCompiledSchema is now a
static const char[] backing array, dodging the same relocator bug.
Python (browser-safe gating)
- src/scaler/io/network_backends.py: gate pyzmq + ZMQ binder/connector
imports behind sys.platform != 'emscripten'; YMQNetworkBackend and
get_network_backend_from_env now degrade gracefully when zmq is None.
- src/scaler/client/agent/heartbeat_manager.py: psutil import is
optional on emscripten; HeartbeatManager reports cpu/rss = 0 when
psutil is unavailable.
- src/scaler/client/client.py: guard the worker Processor import
(pulls in multiprocessing) behind sys.platform != 'emscripten';
parent-task / scheduler-address resolution falls back to None when
Processor is missing.
- pyproject.toml: cloudpickle is unconditional; psutil and pyzmq are
marked sys_platform != 'emscripten'.
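The optional-dependency pattern in the list above can be sketched for the psutil case. This is an illustration, not the HeartbeatManager code: sample_resources is a hypothetical helper, and the import is guarded with try/except here (rather than a sys.platform test) so the sketch also runs where psutil is simply not installed.

```python
try:
    import psutil
except ImportError:  # e.g. under Pyodide, where psutil is not installed
    psutil = None


def sample_resources(psutil_module=psutil):
    """Return (cpu_percent, rss_bytes), or zeros when psutil is unavailable."""
    if psutil_module is None:
        # Degrade gracefully: report cpu/rss = 0, as the commit describes.
        return 0, 0
    process = psutil_module.Process()
    return process.cpu_percent(), process.memory_info().rss
```

Passing the module as a parameter keeps the fallback branch easy to test on a machine where psutil is present.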
Notebook + tests + CI
- docs/source/gallery/wasm_smoke_test.ipynb: cache-busting wheel URL,
cells exercising scaler import, ConnectorSocket sync API, and
Client + check_browser_runtime.
- tests/wasm/test_browser_client_imports.py: unittest mirror of those
notebook cells; runs on CPython (regular discover picks it up) and
on wasm under pyodide venv.
- .github/actions/build-wasm-client/action.yml: after building the
wasm wheel, install it into a pyodide venv and run
tests/wasm/test_browser_client_imports.py so a regression to the
import surface fails the wasm CI job.
# Conflicts: # src/cpp/scaler/protocol/pymod/bootstrap.cpp
- tests/wasm/test_browser_client_imports.py: collapse the test_cell3_import_connector_socket assertion onto one continuation line to match black -l 120 output (CI was failing the format check).
- src/cpp/scaler/protocol/pymod/bootstrap.cpp: drop the dname/dname_obj pair in build_module_from_descriptors; they were leftovers from the removed PySys_FormatStdout diagnostics and triggered -Wunused-variable, which fails the build under treat-warnings-as-errors profiles.
…oke notebook with parallel_sqrt example
Build fix
- src/cpp/scaler/protocol/pymod/{bootstrap,schema_registry}.cpp: rename
the static MOD_COMMON / MOD_STATUS / MOD_OBJECT_STORAGE / MOD_MESSAGE
arrays to kModCommon / kModStatus / kModObjectStorage / kModMessage.
On Ubuntu CI a system header preprocesses 'MOD_STATUS' into a numeric
constant, breaking the build with 'expected unqualified-id before
numeric constant'. Lowercase-prefixed identifiers cannot collide with
conventional macros.
- src/cpp/scaler/protocol/pymod/bootstrap.cpp: drop the unused
display / prefix_len locals in build_schema_descriptor (warnings only,
but they fail -Werror profiles).
Naming
- tests/wasm/test_browser_client_imports.py: drop 'smoke' from the
module/class docstrings and rename test_cell{2,3,4}_* to
test_import_{scaler,connector_socket,client}. The names now describe
what each test exercises rather than which notebook cell it mirrored.
- .github/actions/build-wasm-client/action.yml: rename the step to
'Run browser-client import tests under Pyodide' (no 'smoke') and
loosen the wheel glob so it matches both the original and retagged
filenames.
Gallery cleanup
- docs/source/gallery/wasm_smoke_test.ipynb: removed -- it was only a
diagnostic harness for the wasm wheel.
- docs/source/gallery/parallel_sqrt.ipynb: lightweight replacement that
serves the same 'quick connectivity check' purpose. Two cells: a
config cell with SCHEDULER_ADDRESS / OBJECT_STORAGE_ADDRESS constants
to edit, and a single Client.submit batch of math.sqrt(0..15). Run
against any Scaler scheduler started on the host CPython side.
Bug
---
On the wasm build, instantiating any capnp struct with keyword arguments
(e.g. Resource(cpu=1, rss=2)) raised
TypeError: Resource() takes no arguments
even though the existing import smoke tests passed and the same code
worked on every native build.
Root cause is in bootstrap.cpp's runtime type construction. After
creating the CapnpStruct base class via type(), we patch __init__ in
with PyObject_SetAttrString(capnp_struct_type, "__init__", descriptor).
Two things conspire on Pyodide SIDE_MODULE:
1. The inline string literal "__init__" (and the other short literals
like "to_bytes", "which") sits in a mergeable .rodata.str section.
The wasm relocator can mis-resolve offsets within those sections, so
the descriptor ends up registered under a garbled key. Subclasses
then inherit object.__init__ and reject all kwargs.
2. Even without the literal corruption, the slot-resync that
PyObject_SetAttrString("__init__", ...) is supposed to trigger via
update_one_slot does not reliably propagate tp_init to subclasses
created later under the wasm interpreter.
Fix
---
* Materialise every PyMethodDef::ml_name and every key passed to
PyObject_SetAttrString as a named static const char[] (NAME_INIT,
NAME_TO_BYTES, NAME_WHICH, NAME_GETATTR, NAME_FROM_BYTES). This is
the same hardening the file already applies to module names; extending
it to method names removes the relocator hazard.
* Belt-and-suspenders: assign tp_init directly on CapnpStruct and
CapnpUnionStruct via two new initproc adapters
(capnp_struct_init_slot / capnp_union_init_slot) so subclass slot
inheritance does not depend on update_one_slot at all. PyType_Modified
is called after each assignment to invalidate the type-version cache.
Regression test
---------------
tests/wasm/test_browser_client_imports.py now constructs
Resource(cpu=42, rss=1024) and ClientHeartbeat(resource=..., latencyUS=7),
which would have failed on every wasm wheel built from the previous code.
Tooling
-------
* pyproject.toml: drop the cmake.define overrides for CapnProto_DIR /
CMAKE_PREFIX_PATH. cmake's find_package picks both up from the
environment unaided; the entries were redundant.
* scripts/library_tool.sh: add an emsdk subcommand that follows the
same download / compile / install lifecycle as the other libraries
(download = git clone tag, compile = emsdk install <ver>, install =
emsdk activate <ver>). Pinned to EMSDK_VERSION=4.0.9 alongside the
Pyodide xbuildenv.
* .github/actions/setup-wasm-env/action.yml: use the new emsdk
subcommand and key the cache off scripts/library_tool.sh's hash so
toolchain bumps invalidate the cache automatically.
* .devcontainer/Dockerfile: build libuv from source via library_tool.sh
(matching the wheel build) instead of pulling Ubuntu's libuv1-dev,
and add the tmux package used by test_jupyterlite.sh.
* docs/source/conf.py: exclude gallery/debug_*.ipynb from Sphinx so
local-only debug notebooks (debug_jupyterlite.ipynb) do not leak
into the published docs. JupyterLite still serves them.
* docs/source/gallery/debug_jupyterlite.ipynb,
test_jupyterlite.sh: local-only manual debug harness for the
wasm/JupyterLite client (tmux session running the full cluster +
docs HTTP server, plus a notebook that exercises math/numpy/pandas
through the browser client).
Adds a pyodide subcommand to scripts/library_tool.sh that mirrors the emsdk one: download creates $THIRD_PARTY_DIR/pyodide-venv and installs pyodide-build, compile fetches the matching xbuildenv, install is a no-op (the venv is used in-place). Versions move together and live next to EMSDK_VERSION in the script, so a single bump there propagates to CI, the devcontainer, and the local test_jupyterlite.sh harness.
Devcontainer
------------
The whole external-deps tree (boost, capnp, libuv, emsdk, pyodide-build venv, wasm-target capnp + libuv) is now baked into the image. Native deps stay at /usr/local where they were; everything wasm-related is rooted at /opt/scaler so the workspace bind mount can't shadow it. A new THIRD_PARTY_DIR env var on library_tool.sh swaps the install root from the default ./thirdparties/ to whatever the caller wants; the Dockerfile sets it to /opt/scaler for the bake.
emsdk_env.sh derives EMSDK from BASH_SOURCE, so a one-line /etc/profile.d/emsdk.sh snippet exposes emcc / emcmake / the bundled node in every interactive shell. devcontainer.json adds the pyodide venv to PATH and sets CapnProto_DIR / CMAKE_PREFIX_PATH so the wasm wheel build picks up the baked install. No postCreateCommand needed.
GH Action
---------
.github/actions/setup-wasm-env now drops its pyodide-build-version / pyodide-xbuildenv-version inputs and calls library_tool.sh pyodide instead. No callers passed those inputs, so this is a clean removal.
test_jupyterlite.sh
-------------------
Replaces the /tmp/pyodide-venv-new path (left over from interactive debugging) with the new ./thirdparties/pyodide-venv location.
…rness
setup-wasm-env/action.yml had an embedded multi-line python script inside
a 'run: |' block to read the pyodide-build version from pyproject.toml.
The 'import tomllib' line started at column 1, which terminates the YAML
block scalar -- GitHub's parser rejected the file with 'While scanning a
simple key, could not find expected ':''.
Inline the version (0.34.3) directly in the action and add a comment that
it must match the pyproject.toml [dependency-groups] dev pin (and the
EMSDK_VERSION in scripts/library_tool.sh). All three move together with
the Pyodide kernel JupyterLite ships, so duplicating one string is fine.
While here, replace the 'or /opt/scaler/...' annotations in
test_jupyterlite.sh's prerequisite block with ${THIRD_PARTY_DIR:-./thirdparties}
so the same commands work in the devcontainer (THIRD_PARTY_DIR=/opt/scaler)
and outside it (defaults to ./thirdparties).
Requires websocket support and should be merged after #739