Open
Description
Create a LoRA affinity filter or scorer. (Note: first confirm that one is not already available in the upstream GIE epp packages.)
Possible filter behavior:
- Separating pods into two groups: those with target model affinity and those with available capacity
- Using a probability threshold to sometimes select from non-affinity pods to enable load balancing
- Falling back to whichever group has pods if the other group is empty
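
The filter behavior above could be sketched roughly as follows. This is only an illustration in Python, not the GIE epp implementation: the `Pod` class, its `active_loras` and `max_loras` fields, the `affinity_threshold` default, and the function name are all hypothetical. "Available capacity" is assumed here to mean that the pod holds fewer than `max_loras` adapters.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Pod:
    # Hypothetical pod view; field names are assumptions for this sketch.
    name: str
    active_loras: Set[str] = field(default_factory=set)  # running + waiting adapters
    max_loras: int = 4  # assumed per-pod adapter slot limit


def lora_affinity_filter(
    pods: List[Pod],
    target_lora: str,
    affinity_threshold: float = 0.999,  # assumed default, not taken from the issue
    rng: Callable[[], float] = random.random,
) -> List[Pod]:
    """Return the group of pods the request should be routed to."""
    # Group 1: pods that already have the target adapter active.
    affinity = [p for p in pods if target_lora in p.active_loras]
    # Group 2: pods without the adapter that still have a free adapter slot.
    available = [
        p for p in pods
        if target_lora not in p.active_loras and len(p.active_loras) < p.max_loras
    ]
    if affinity and available:
        # Mostly prefer affinity pods, but occasionally pick the other
        # group so that load can spread to pods with free capacity.
        return affinity if rng() < affinity_threshold else available
    # One group is empty: fall back to whichever group has pods.
    return affinity or available
```

If both groups are empty the sketch returns an empty list and leaves the decision to the caller; whether the real filter should instead pass through all pods is an open design question.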
Possible scorer behavior:
- Give the maximum score to pods that have the required LoRA loaded and a zero score to all other pods
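
A minimal sketch of this binary scoring, again in Python and under assumptions: the function name, the `pod_loras` mapping (pod name to its set of loaded adapters), and the `max_score` value of 1.0 are hypothetical and do not reflect any existing scorer interface.

```python
from typing import Dict, Set


def lora_affinity_scores(
    pod_loras: Dict[str, Set[str]],
    target_lora: str,
    max_score: float = 1.0,  # assumed score range of [0, 1]
) -> Dict[str, float]:
    """Score each pod: max_score if the required LoRA is loaded, else 0."""
    return {
        pod: (max_score if target_lora in loras else 0.0)
        for pod, loras in pod_loras.items()
    }
```

Unlike the filter, a scorer of this shape never removes pods, so it can be combined with other scorers (for example load-based ones) by weighting.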
Decision required on how to build the list of active LoRAs:
vLLM metrics report every combination of LoRAs for running and waiting requests, each with a timestamp.
- Option 1: use only the latest metric, which describes the most recent LoRA state. This is problematic when vLLM load is below 100%. We need to understand how vLLM works: is a LoRA offloaded once request processing has finished?
- Option 2: use more than just the most recent metric to collect running and waiting LoRAs. When the load is below 100%, go back to less recent events and collect LoRAs until max_loras is reached.
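
To make the difference between the two options concrete, here is a rough Python sketch. The `LoraSample` type and both function names are hypothetical; it assumes the vLLM metrics have already been parsed into one sample per timestamp, each listing running and waiting adapters. It does not answer the open question of when vLLM actually offloads an adapter.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LoraSample:
    # Hypothetical parsed form of one metric sample.
    timestamp: float
    running: Tuple[str, ...]
    waiting: Tuple[str, ...]


def active_loras_latest(samples: List[LoraSample]) -> List[str]:
    """Option 1: trust only the most recent sample."""
    if not samples:
        return []
    latest = max(samples, key=lambda s: s.timestamp)
    # dict.fromkeys removes duplicates while keeping order.
    return list(dict.fromkeys(latest.running + latest.waiting))


def active_loras_with_history(samples: List[LoraSample], max_loras: int) -> List[str]:
    """Option 2: start from the newest sample, then fill remaining
    slots from older samples until max_loras adapters are collected."""
    ordered = sorted(samples, key=lambda s: s.timestamp, reverse=True)
    active: List[str] = []
    for i, sample in enumerate(ordered):
        for name in sample.running + sample.waiting:
            if name in active:
                continue
            # Everything in the newest sample is kept; older samples
            # only contribute while free slots remain.
            if i > 0 and len(active) >= max_loras:
                return active
            active.append(name)
    return active
```

With samples at timestamps 1, 2 and 3 listing adapters `(a, b)`, `(c)` and `(d, e)`, Option 1 yields only `d` and `e`, while Option 2 with `max_loras=4` also recovers `c` and `a`. Option 2 is only correct if vLLM keeps idle adapters loaded until their slot is needed, which is exactly the behavior that still has to be confirmed.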