Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions examples/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -55,6 +55,7 @@ members = [
"rag_retrieval_demo",
"rag_indexing_demo",
"adversarial_testing_demo",
"local_compute_mesh_demo",
]

[workspace.package]
Expand Down
15 changes: 15 additions & 0 deletions examples/local_compute_mesh_demo/Cargo.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
[package]
name = "local_compute_mesh_demo"
version.workspace = true
edition.workspace = true

[dependencies]
tokio.workspace = true
mofa-foundation = {path = "../../crates/mofa-foundation" }
tracing.workspace = true
tracing-subscriber.workspace = true
serde.workspace = true
serde_yaml.workspace = true

[lints]
workspace = true
133 changes: 133 additions & 0 deletions examples/local_compute_mesh_demo/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,133 @@
# Local Compute Mesh Demo

This demo showcases the MoFA compute mesh routing capabilities, demonstrating how inference requests can be routed between local and cloud backends based on configurable policies.

## Overview

The compute mesh enables intelligent routing of inference requests between:

- **Local backends**: Run inference on local hardware (CPUs/GPUs)
- **Cloud backends**: Route to cloud providers (OpenAI, Anthropic, etc.)

## Features

- **Multiple Routing Policies**: LocalFirstWithCloudFallback, LocalOnly, CloudOnly, LatencyOptimized, CostOptimized
- **Memory-Aware Scheduling**: Automatic memory management and model eviction
- **Performance Benchmarking**: Built-in metrics collection for latency and throughput

## Running the Demo

### Basic Usage

```bash
cargo run --example local_compute_mesh_demo -- "Explain photosynthesis"
```

### With Custom Prompt

```bash
cargo run --example local_compute_mesh_demo -- "What is quantum computing?"
```

## Performance Benchmark

The demo includes comprehensive performance benchmarking that measures:

### Metrics Collected

| Metric | Description |
|--------|-------------|
| `latency_ms` | Total time from request start to completion |
| `time_to_first_token_ms` | Time to receive the first token |
| `tokens_streamed` | Total number of tokens generated |
| `tokens_per_second` | Token generation throughput |
| `total_time_ms` | Total streaming duration |

### How Latency is Measured

1. **Request Start**: Timer begins when the inference request is created
2. **First Token**: Records the time when the first token is received/streamed
3. **Completion**: Timer ends when all tokens have been streamed

The latency is measured using `std::time::Instant` for high-resolution timing.

### How Tokens/Second is Calculated

```
tokens_per_second = tokens_streamed / (total_time_ms / 1000.0)
```

This gives you the actual streaming throughput during generation.

### Comparing Local vs Cloud Routing

The demo runs three scenarios to compare different routing policies:

1. **LocalFirstWithCloudFallback**: Tries local first, falls back to cloud if needed
2. **CloudOnly**: Always routes to cloud provider
3. **LocalOnly**: Always uses local backend

Example output:

```
[workflow] executing step: generate_response
[inference] sending request to orchestrator...

[router] policy: LocalFirstWithCloudFallback
[router] selected backend: local

[stream] This
[stream] is
[stream] a
...

[metrics]
backend: local
latency_ms: 820
time_to_first_token_ms: 45
tokens_streamed: 27
tokens_per_second: 32.9
total_time_ms: 910
```

## Configuration

The demo can be configured via the `workflow.yaml` file:

- **routing_policy**: Choose the routing strategy
- **memory_capacity_mb**: Set local model memory limit
- **cloud_provider**: Configure cloud provider fallback

## Architecture

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Compute Mesh Demo β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ InferenceOrchestrator β”‚ β”‚
β”‚ β”‚ β”‚ β”‚
β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚
β”‚ β”‚ β”‚ Routing β”‚ β”‚ Model β”‚ β”‚ β”‚
β”‚ β”‚ β”‚ Policy β”‚ β”‚ Pool β”‚ β”‚ β”‚
β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚
β”‚ β”‚ β–Ό β–Ό β”‚ β”‚
β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚
β”‚ β”‚ β”‚ Local β”‚ β”‚ Cloud β”‚ β”‚ β”‚
β”‚ β”‚ β”‚ Execution β”‚ β”‚ Fallback β”‚ β”‚ β”‚
β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

## Use Cases

- **Edge Deployment**: Run models locally on edge devices
- **Cost Optimization**: Minimize cloud API costs
- **Latency Sensitive**: Prioritize fast response times
- **Hybrid Mesh**: Combine local and cloud for best experience

## License

This demo is part of the MoFA project and follows the same license terms.
Loading
Loading