feat(compute-mesh): real local inference backend and unified streaming protocol#1257
feat(compute-mesh): real local inference backend and unified streaming protocol#1257Avi-47 wants to merge 5 commits intomofa-org:mainfrom
Conversation
|
Hi @lijingrs @yangrudan @BH3GEI , Just a small clarification regarding the local backend in this PR. workflow → routing → provider execution → token streaming → gateway SSE This allows the framework to run a real pipeline locally while preserving the same |
d8a97ee to
df0f4ce
Compare
7d2be8f to
78ea3d2
Compare
|
closed while adding a comment, opening this |
38c98b7 to
bf39727
Compare
bf39727 to
19bee21
Compare
Summary
This pull request implements a fully integrated compute mesh inference pipeline by introducing a functional local inference backend and a unified streaming protocol across the MoFA inference stack.
MoFA already contained most of the architectural components required for the Cognitive Compute Mesh, including routing policies, model pool management, kernel streaming types, and SSE support in the gateway. However, these components were not previously wired into a working inference pipeline. Local inference was stubbed and streaming responses were simulated by splitting completed outputs rather than producing real incremental tokens.
This PR connects those components into an operational pipeline capable of streaming tokens from a local provider. The implementation integrates the provider with the orchestrator and exposes streamed responses through the existing kernel streaming abstractions, along with an end-to-end example demonstrating the compute mesh workflow.
The local provider replaces the previous stub implementation with a lightweight token generator that simulates incremental decoding. This enables the full pipeline to run without heavy model dependencies or GPU hardware while preserving the same streaming interfaces used by real backends such as Candle or llama.cpp, allowing those engines to be integrated later without changes to the orchestrator or routing layers.
Closes #1254.
Problem
The existing compute mesh architecture contains most of the required infrastructure but lacks an operational pipeline. Several pieces of the system were implemented independently but were not wired together.
Local inference providers returned stub responses rather than executing an inference pipeline.
The streaming interface in the orchestrator simulated token streaming by splitting final outputs.
The compute mesh demo demonstrated routing logic but did not execute a real inference flow.
Because of this, the framework could not yet demonstrate a real end-to-end pipeline from workflow execution to token streaming.
Approach
This PR integrates the missing components and implements a unified inference flow across the system.
First, a real local inference provider was implemented in the
mofa-local-llmcrate. The previous stub implementation has been replaced with a token generation backend that produces incremental tokens and supports streaming responses. This backend simulates realistic token generation while remaining lightweight and not requiring large model weights.Second, the
ModelProvidertrait has been extended to support streaming inference through a unifiedinfer_streaminterface. This interface produces streaming tokens using the kernel streaming abstractions, allowing different providers to implement incremental generation while maintaining a consistent interface.Third, the
InferenceOrchestratorwas extended to support local providers and streaming inference. The orchestrator now routes requests using the existing routing policies and executes streaming inference when a local provider is available. If no provider is configured, the orchestrator falls back to a safe error stream rather than simulated output.Fourth, the streaming path now connects all layers of the stack. Tokens generated by the provider are delivered through the kernel streaming abstractions, passed through the orchestrator, and returned as a streaming response suitable for SSE delivery by the gateway.
Finally, the compute mesh example has been extended to demonstrate a working pipeline. The example shows how a prompt flows through the workflow runtime, inference routing layer, local inference backend, and streaming response system.
Architecture
The resulting inference pipeline now follows the intended compute mesh architecture:
This architecture enables the same interface to support local inference, cloud providers, or hybrid routing policies.
Changes
ModelProvidertraitInferenceOrchestratorextended to support local providers and streaming executionmofa-local-llmTesting
The implementation was tested locally using the compute mesh demo.
This runs a pipeline where a prompt is routed through the orchestrator, executed by the local provider, and streamed back token by token. The demo confirms that routing, provider execution, and streaming behavior all function together as expected.
Code quality checks were also run locally including:
Example Execution Output
Running the demo:
Example Output
Execution trace (simplified):
ScreenShots
Impact
This PR converts the compute mesh architecture from a simulated framework into an operational inference pipeline. It provides a reference implementation of how local inference providers integrate with the orchestrator and how token streaming flows through the system.
It also provides a working example for contributors building new inference providers or routing strategies.
Future improvements could extend this architecture with real model backends such as Candle, llama.cpp, or other inference engines while preserving the same provider interface.
This change therefore establishes the core execution path required for the Cognitive Compute Mesh architecture.