What would you like to be added:
See the whole list: https://github.com/InftyAI/llmaz/milestone/3
We'll focus on three main things:
- xPyD serving with heterogeneous devices; this needs a new orchestration layer built on top of lws
  - disaggregated PD serving
  - aggregated PD serving
- More advanced routing policies, e.g. based on request profile and GPU type
- GPU spot instance scaling, ready for production environments
Nice to have:
- Advanced Pod scaling with dedicated scaler
Why is this needed:
Completion requirements:
This enhancement requires the following artifacts:
- Design doc
- API change
- Docs update
The artifacts should be linked in subsequent comments.