Feature: LoRA-based filtering and scoring #10

@elevran

Description

Create a LoRA affinity filter or scorer. (Note: first confirm that one is not already available in the upstream GIE epp packages.)

Possible filter behavior:

  • Separating pods into two groups: those with target model affinity and those with available capacity
  • Using a probability threshold to sometimes select from non-affinity pods to enable load balancing
  • Falling back to whichever group has pods if the other group is empty
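
The filter behavior listed above can be sketched as follows. This is a minimal illustration, not the upstream GIE implementation: the `Pod` type, its fields, the `lora_affinity_filter` name, and the default `affinity_threshold` of 0.95 are all assumptions made for the example. "Available capacity" is interpreted here as a pod having a free adapter slot.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Pod:
    # Hypothetical view of a pod; field names are illustrative assumptions.
    name: str
    active_loras: set = field(default_factory=set)  # adapters running or waiting
    max_loras: int = 4                               # adapter slots on the pod


def lora_affinity_filter(pods, target_lora, affinity_threshold=0.95, rng=random):
    """Return the candidate pods for a request that targets target_lora."""
    # Group 1: pods that already have the target adapter active.
    affinity = [p for p in pods if target_lora in p.active_loras]
    # Group 2: pods without the adapter but with a free adapter slot.
    available = [
        p for p in pods
        if target_lora not in p.active_loras
        and len(p.active_loras) < p.max_loras
    ]

    # Fall back to whichever group has pods when the other is empty.
    if not affinity:
        return available
    if not available:
        return affinity

    # Prefer affinity pods most of the time; occasionally return the
    # non-affinity group so load spreads to pods with spare capacity.
    if rng.random() < affinity_threshold:
        return affinity
    return available
```

The `rng` parameter exists only so the probabilistic branch can be made deterministic in tests; any object with a `random()` method works.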

Possible scorer behavior:

  • Provide maximum score for pods that have the required LoRA loaded and zero score for all other pods
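
A scorer with this behavior could look roughly like the sketch below. The function name, the `pod_loras` mapping (pod name to the set of loaded adapters), and the `max_score` default of 1.0 are assumptions for illustration, not part of any existing scorer interface.

```python
def lora_affinity_scores(pod_loras, target_lora, max_score=1.0):
    """Map pod name -> score: max_score if target_lora is loaded, else 0."""
    return {
        name: (max_score if target_lora in loras else 0.0)
        for name, loras in pod_loras.items()
    }
```

Unlike the filter, a binary scorer never removes pods, so it can be combined with other scorers (for example, load-based ones) through weighting.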

Decision required for building the list of active LoRAs:
vLLM metrics report every combination of LoRAs for running and waiting requests, each with a timestamp.

  • Option 1: use only the latest metric sample, which defines the most recent LoRA state. This is problematic when vLLM load is not at 100%. We need to understand how vLLM works: is a LoRA offloaded once request processing finishes?
  • Option 2: use more than the most recent metric sample to get the running and waiting LoRAs. When load is below 100%, go back to less recent samples and collect LoRAs up to max_loras.
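
To make the two options concrete, here is a rough sketch of each. The `LoraSample` type is a hypothetical, already-parsed form of one metric sample and does not reflect vLLM's actual metric schema; the function names are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass
class LoraSample:
    # Hypothetical parsed metric sample; not vLLM's actual schema.
    timestamp: float
    running: tuple   # adapter names with running requests
    waiting: tuple   # adapter names with waiting requests


def active_loras_latest(samples):
    """Option 1: adapters named by the most recent sample only."""
    if not samples:
        return []
    latest = max(samples, key=lambda s: s.timestamp)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(latest.running + latest.waiting))


def active_loras_backfill(samples, max_loras):
    """Option 2: walk samples newest-first, collecting up to max_loras."""
    if max_loras <= 0:
        return []
    collected = []
    for sample in sorted(samples, key=lambda s: s.timestamp, reverse=True):
        for name in sample.running + sample.waiting:
            if name not in collected:
                collected.append(name)
            if len(collected) >= max_loras:
                return collected
    return collected
```

Option 2 assumes that an adapter seen in an older sample is still resident as long as fewer than max_loras adapters have been seen since. Whether that holds depends on the open question above about when vLLM offloads a LoRA.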