Skip to content

feat(compute-mesh): real local inference backend and unified streaming protocol#1257

Open
Avi-47 wants to merge 5 commits intomofa-org:mainfrom
Avi-47:feature/compute-mesh-real-inference
Open

feat(compute-mesh): real local inference backend and unified streaming protocol#1257
Avi-47 wants to merge 5 commits intomofa-org:mainfrom
Avi-47:feature/compute-mesh-real-inference

Conversation

@Avi-47
Copy link
Copy Markdown
Contributor

@Avi-47 Avi-47 commented Mar 15, 2026

Summary

This pull request implements a fully integrated compute mesh inference pipeline by introducing a functional local inference backend and a unified streaming protocol across the MoFA inference stack.

MoFA already contained most of the architectural components required for the Cognitive Compute Mesh, including routing policies, model pool management, kernel streaming types, and SSE support in the gateway. However, these components were not previously wired into a working inference pipeline. Local inference was stubbed and streaming responses were simulated by splitting completed outputs rather than producing real incremental tokens.

This PR connects those components into an operational pipeline capable of streaming tokens from a local provider. The implementation integrates the provider with the orchestrator and exposes streamed responses through the existing kernel streaming abstractions, along with an end-to-end example demonstrating the compute mesh workflow.

The local provider replaces the previous stub implementation with a lightweight token generator that simulates incremental decoding. This enables the full pipeline to run without heavy model dependencies or GPU hardware while preserving the same streaming interfaces used by real backends such as Candle or llama.cpp, allowing those engines to be integrated later without changes to the orchestrator or routing layers.

Closes #1254.


Problem

The existing compute mesh architecture contains most of the required infrastructure but lacks an operational pipeline. Several pieces of the system were implemented independently but were not wired together.

Local inference providers returned stub responses rather than executing an inference pipeline.

The streaming interface in the orchestrator simulated token streaming by splitting final outputs.

The compute mesh demo demonstrated routing logic but did not execute a real inference flow.

Because of this, the framework could not yet demonstrate a real end-to-end pipeline from workflow execution to token streaming.


Approach

This PR integrates the missing components and implements a unified inference flow across the system.

First, a real local inference provider was implemented in the mofa-local-llm crate. The previous stub implementation has been replaced with a token generation backend that produces incremental tokens and supports streaming responses. This backend simulates realistic token generation while remaining lightweight and not requiring large model weights.

Second, the ModelProvider trait has been extended to support streaming inference through a unified infer_stream interface. This interface produces streaming tokens using the kernel streaming abstractions, allowing different providers to implement incremental generation while maintaining a consistent interface.

Third, the InferenceOrchestrator was extended to support local providers and streaming inference. The orchestrator now routes requests using the existing routing policies and executes streaming inference when a local provider is available. If no provider is configured, the orchestrator falls back to a safe error stream rather than simulated output.

Fourth, the streaming path now connects all layers of the stack. Tokens generated by the provider are delivered through the kernel streaming abstractions, passed through the orchestrator, and returned as a streaming response suitable for SSE delivery by the gateway.

Finally, the compute mesh example has been extended to demonstrate a working pipeline. The example shows how a prompt flows through the workflow runtime, inference routing layer, local inference backend, and streaming response system.


Architecture

The resulting inference pipeline now follows the intended compute mesh architecture:

image

This architecture enables the same interface to support local inference, cloud providers, or hybrid routing policies.


Changes

  • Local inference provider implementation replacing stub backend with token generation logic
  • Streaming inference support added to the ModelProvider trait
  • InferenceOrchestrator extended to support local providers and streaming execution
  • Streaming pipeline integrated using kernel streaming abstractions
  • Compute mesh demo example demonstrating an end-to-end inference pipeline
  • Minor dependency updates and configuration improvements in mofa-local-llm

Testing

The implementation was tested locally using the compute mesh demo.

cargo run --example local_compute_mesh_demo

This runs a pipeline where a prompt is routed through the orchestrator, executed by the local provider, and streamed back token by token. The demo confirms that routing, provider execution, and streaming behavior all function together as expected.

Code quality checks were also run locally including:

  • cargo check
  • cargo fmt
  • cargo clippy
  • cargo test


Example Execution Output

Running the demo:

cargo run --example local_compute_mesh_demo

Example Output

User prompt: Explain photosynthesis

[router] policy: LocalFirstWithCloudFallback
[router] selected backend: local

[stream] Inference
[stream] result
[stream] for:
[stream] Explain
[stream] photosynthesis

[metrics]
latency_ms: 221
time_to_first_token_ms: 1
tokens_streamed: 6
tokens_per_second: 27.1

Execution trace (simplified):

workflow.start
router.policy = LocalFirstWithCloudFallback
router.backend_selection = local
inference.start
streaming.tokens
metrics.latency_ms = 221
workflow.complete

ScreenShots

image image

Impact

This PR converts the compute mesh architecture from a simulated framework into an operational inference pipeline. It provides a reference implementation of how local inference providers integrate with the orchestrator and how token streaming flows through the system.

It also provides a working example for contributors building new inference providers or routing strategies.

Future improvements could extend this architecture with real model backends such as Candle, llama.cpp, or other inference engines while preserving the same provider interface.

This change therefore establishes the core execution path required for the Cognitive Compute Mesh architecture.

@Avi-47 Avi-47 marked this pull request as ready for review March 15, 2026 10:17
@Avi-47
Copy link
Copy Markdown
Contributor Author

Avi-47 commented Mar 15, 2026

Hi @lijingrs @yangrudan @BH3GEI ,

Just a small clarification regarding the local backend in this PR.
The LinuxLocalProvider implemented here is intended as a lightweight reference provider to validate the full compute mesh pipeline rather than a heavy model runtime. It produces incremental tokens so the system can demonstrate a complete working flow without requiring GPU hardware or large model weights.
The goal of this PR is to demonstrate the end-to-end integration of the compute mesh stack:

workflow → routing → provider execution → token streaming → gateway SSE

This allows the framework to run a real pipeline locally while preserving the same ModelProvider streaming interface that real backends (Candle, llama.cpp, etc.) would use later.
The included local_compute_mesh_demo example shows the complete execution path and confirms that routing, provider execution, and token streaming work together.
If maintainers prefer splitting the integration or adjusting the provider abstraction, I’m happy to adapt the implementation.
Thanks!

@Avi-47 Avi-47 force-pushed the feature/compute-mesh-real-inference branch from d8a97ee to df0f4ce Compare March 20, 2026 05:52
@Avi-47 Avi-47 closed this Mar 20, 2026
@Avi-47 Avi-47 force-pushed the feature/compute-mesh-real-inference branch from 7d2be8f to 78ea3d2 Compare March 20, 2026 06:12
@Avi-47
Copy link
Copy Markdown
Contributor Author

Avi-47 commented Mar 20, 2026

closed while adding a comment, opening this

@Avi-47 Avi-47 reopened this Mar 20, 2026
@Avi-47 Avi-47 force-pushed the feature/compute-mesh-real-inference branch 3 times, most recently from 38c98b7 to bf39727 Compare March 20, 2026 08:41
@Avi-47 Avi-47 force-pushed the feature/compute-mesh-real-inference branch from bf39727 to 19bee21 Compare March 20, 2026 12:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Compute Mesh] Real local inference backend + unified streaming protocol (Task 13 + Task 30)

1 participant