[Umbrella] advanced traffic load balancing algorithms #376

**What would you like to be added**:

We will load-balance traffic based on LLM-specific characteristics: prefix-cache-aware, KV-cache-aware, LoRA-aware, load-aware, and request-profile-aware (summarization or chat) routing, and so on.

These algorithms are plugins baked into Envoy Gateway.
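To make the plugin idea concrete, here is a minimal sketch of what a picker contract and the random-selection baseline could look like. The `Endpoint` type, the `Picker` interface, and `RandomPicker` are hypothetical names for illustration only; they are not the llmaz or Envoy Gateway API.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// Endpoint is a hypothetical view of one model-serving backend.
type Endpoint struct {
	Address string
}

// Picker is a hypothetical plugin contract: choose one endpoint
// from the candidates for the current request.
type Picker interface {
	Name() string
	Pick(candidates []Endpoint) (Endpoint, error)
}

// ErrNoEndpoints is returned when there is nothing to pick from.
var ErrNoEndpoints = errors.New("no candidate endpoints")

// RandomPicker picks uniformly at random. It ignores every
// LLM-specific signal, which makes it a template and a baseline.
// Note: *rand.Rand is not safe for concurrent use without a lock.
type RandomPicker struct {
	rng *rand.Rand
}

func NewRandomPicker(seed int64) *RandomPicker {
	return &RandomPicker{rng: rand.New(rand.NewSource(seed))}
}

func (p *RandomPicker) Name() string { return "random" }

func (p *RandomPicker) Pick(candidates []Endpoint) (Endpoint, error) {
	if len(candidates) == 0 {
		return Endpoint{}, ErrNoEndpoints
	}
	return candidates[p.rng.Intn(len(candidates))], nil
}

func main() {
	var picker Picker = NewRandomPicker(42)
	backends := []Endpoint{{Address: "10.0.0.1:8000"}, {Address: "10.0.0.2:8000"}}
	chosen, err := picker.Pick(backends)
	if err != nil {
		fmt.Println("pick failed:", err)
		return
	}
	fmt.Println(picker.Name(), "picked", chosen.Address)
}
```

Other algorithms would implement the same small interface, so they can be compared against the random baseline under identical traffic.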

- [ ] Random selection as a template and baseline: https://github.com/InftyAI/llmaz/issues/371
- [ ] LoRA-aware plugin
- [ ] Fair sharing
  - https://arxiv.org/abs/2401.00588
  - https://arxiv.org/abs/2501.14312
  - https://arxiv.org/abs/2407.00023
  - https://arxiv.org/abs/2312.07104
- [ ] Prefix-cache-aware plugin
  - https://arxiv.org/pdf/2503.16525
  - https://arxiv.org/pdf/2405.16444
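As a rough illustration of how the LoRA-aware, prefix-cache-aware, and load-aware signals above could be combined, the sketch below ranks endpoints by adapter affinity first, then by the number of matching prompt-prefix blocks, then by in-flight requests. This is not the method of the linked papers; the types, the `blockSize` value, and the strict priority order are assumptions chosen to keep the example short.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const blockSize = 16 // assumption: bytes of prompt per prefix block

// Endpoint is a hypothetical backend with the signals the router tracks.
type Endpoint struct {
	Address      string
	InFlight     int             // load signal; caller decrements on completion
	LoadedLoRAs  map[string]bool // adapters believed to be loaded
	CachedBlocks map[uint64]bool // prefix hashes believed to be cached
}

func NewEndpoint(addr string) *Endpoint {
	return &Endpoint{Address: addr, LoadedLoRAs: map[string]bool{}, CachedBlocks: map[uint64]bool{}}
}

// prefixHashes returns one cumulative hash per full block, so the
// i-th hash covers the whole prompt up to the i-th block boundary.
func prefixHashes(prompt string) []uint64 {
	var hashes []uint64
	h := fnv.New64a()
	for end := blockSize; end <= len(prompt); end += blockSize {
		h.Write([]byte(prompt[end-blockSize : end]))
		hashes = append(hashes, h.Sum64())
	}
	return hashes
}

// matchedBlocks counts leading prefix blocks the endpoint has seen.
func matchedBlocks(ep *Endpoint, hashes []uint64) int {
	n := 0
	for _, h := range hashes {
		if !ep.CachedBlocks[h] {
			break
		}
		n++
	}
	return n
}

type score struct {
	lora     bool
	hits     int
	inFlight int
}

// betterThan applies a strict priority: adapter, prefix hits, load.
func (a score) betterThan(b score) bool {
	if a.lora != b.lora {
		return a.lora
	}
	if a.hits != b.hits {
		return a.hits > b.hits
	}
	return a.inFlight < b.inFlight
}

// Pick chooses an endpoint and records the routing decision on it.
// It returns nil when there are no endpoints.
func Pick(eps []*Endpoint, prompt, lora string) *Endpoint {
	hashes := prefixHashes(prompt)
	var best *Endpoint
	var bestScore score
	for _, ep := range eps {
		s := score{lora: lora == "" || ep.LoadedLoRAs[lora], hits: matchedBlocks(ep, hashes), inFlight: ep.InFlight}
		if best == nil || s.betterThan(bestScore) {
			best, bestScore = ep, s
		}
	}
	if best == nil {
		return nil
	}
	best.InFlight++
	for _, h := range hashes {
		best.CachedBlocks[h] = true
	}
	if lora != "" {
		best.LoadedLoRAs[lora] = true
	}
	return best
}

func main() {
	eps := []*Endpoint{NewEndpoint("10.0.0.1:8000"), NewEndpoint("10.0.0.2:8000")}
	system := "You are a helpful assistant. "
	fmt.Println(Pick(eps, system+"Summarize document one.", "").Address)
	fmt.Println(Pick(eps, system+"Summarize document two.", "").Address)
	fmt.Println(Pick(eps, "Completely unrelated prompt text here.", "").Address)
}
```

In this sketch the two prompts that share a system prefix land on the same endpoint, while the unrelated prompt goes to the less loaded one. A real plugin would more likely weigh the signals against each other and learn cache state from the backends rather than infer it from past routing decisions.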

**Why is this needed**:

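Better serving performance: generic load balancing ignores the LLM-specific signals listed above (prefix cache, KV cache, LoRA adapters, load, request profile), so routing on them should improve on it.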

**Completion requirements**:

This enhancement requires the following artifacts:

- [x] Design doc
- [ ] API change
- [x] Docs update

The artifacts should be linked in subsequent comments.
