Open
Description
What would you like to be added:
Load-balance traffic based on LLM-specific characteristics, such as prefix-cache awareness, KV-cache awareness, LoRA awareness, load awareness, and request-profile awareness (for example, summary versus chat).
These strategies would be implemented as plugins built into the Envoy gateway:
- Random selection, as a template and baseline: Envoy gateway plugin support with random selection #371
- LoRA-aware plugin
- Fairness sharing
- Prefix-cache-aware plugin
Why is this needed:
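Better performance: routing that accounts for these LLM-specific characteristics is expected to outperform load balancing that ignores them.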
Completion requirements:
This enhancement requires the following artifacts:
- Design doc
- API change
- Docs update
The artifacts should be linked in subsequent comments.