Description
What happened:
While implementing new requestcontrol plugins for latency prediction, we noticed that our ResponseStreaming plugin ran one more time after our ResponseComplete plugin had finished. Both the streaming and complete hooks are run in HandleResponseBodyModelStreaming: the ResponseStreaming hooks always run first, and then, if streamingEndMsg was received, the ResponseComplete hooks run. The only way the plugins can be called in the observed order is if HandleResponseBodyModelStreaming is called again after streamingEndMsg has been received.
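The reasoning above can be illustrated with a small standalone model. This is only a sketch: `reqCtx`, `handleBody`, and `simulate` are hypothetical stand-ins written for this example, not the actual types or functions in the inference extension, and the real handler does much more than record hook names.

```go
package main

import "fmt"

// reqCtx is a hypothetical stand-in for the real request context.
type reqCtx struct {
	complete bool     // set once the end-of-stream marker has been seen
	calls    []string // order in which the hooks fired
}

// handleBody mimics the ordering described in the issue: the streaming hook
// always runs first, and the complete hook runs only on the end-of-stream chunk.
func handleBody(rc *reqCtx, endOfStream bool) {
	rc.calls = append(rc.calls, "ResponseStreaming")
	if endOfStream {
		rc.complete = true
		rc.calls = append(rc.calls, "ResponseComplete")
	}
}

// simulate processes one ordinary chunk, one end-of-stream chunk, and then
// callsAfterEnd extra invocations, returning the resulting hook order.
func simulate(callsAfterEnd int) []string {
	rc := &reqCtx{}
	handleBody(rc, false) // an ordinary chunk
	handleBody(rc, true)  // the chunk carrying the end-of-stream marker
	for i := 0; i < callsAfterEnd; i++ {
		handleBody(rc, false) // the unexpected extra invocation
	}
	return rc.calls
}

func main() {
	fmt.Println(simulate(0)) // order when the handler stops at end of stream
	fmt.Println(simulate(1)) // order when the handler is invoked once more
}
```

In this model, a ResponseStreaming entry can only appear after ResponseComplete when the handler is invoked again after the end-of-stream chunk, which matches the order we observed.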
What you expected to happen:
Once streamingEndMsg is received, HandleResponseBodyModelStreaming should not be called again, since the end message is assumed to be the final one received (this is also the point at which the request is marked as complete in reqCtx).
How to reproduce it (as minimally and precisely as possible):
Create a ResponseStreaming plugin and a ResponseComplete plugin that each print a log line. When you send a streamed request, you will see the ResponseStreaming plugin run one more time after the ResponseComplete plugin.
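A logging plugin along these lines makes the out-of-order call easy to spot. This is a rough sketch under stated assumptions: `orderLogger` and its simplified method signatures (taking only a request ID) are hypothetical, and the real requestcontrol plugin interfaces take different arguments, so the bodies would need to be adapted to the actual hooks.

```go
package main

import (
	"fmt"
	"sync"
)

// orderLogger is a hypothetical plugin that implements both hooks with
// simplified signatures and flags streaming calls that arrive after completion.
type orderLogger struct {
	mu        sync.Mutex
	completed map[string]bool // request IDs for which ResponseComplete already ran
	lateCalls map[string]int  // streaming calls seen after completion, per request
}

func newOrderLogger() *orderLogger {
	return &orderLogger{
		completed: make(map[string]bool),
		lateCalls: make(map[string]int),
	}
}

// ResponseStreaming logs each streaming call and marks it as unexpected if
// ResponseComplete has already run for the same request.
func (p *orderLogger) ResponseStreaming(requestID string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.completed[requestID] {
		p.lateCalls[requestID]++
		fmt.Printf("UNEXPECTED: ResponseStreaming after ResponseComplete for request %s\n", requestID)
		return
	}
	fmt.Printf("ResponseStreaming for request %s\n", requestID)
}

// ResponseComplete logs the completion and remembers the request ID.
func (p *orderLogger) ResponseComplete(requestID string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.completed[requestID] = true
	fmt.Printf("ResponseComplete for request %s\n", requestID)
}

// LateCalls returns how many streaming calls arrived after completion.
func (p *orderLogger) LateCalls(requestID string) int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.lateCalls[requestID]
}

func main() {
	p := newOrderLogger()
	p.ResponseStreaming("req-1")
	p.ResponseComplete("req-1")
	p.ResponseStreaming("req-1") // mirrors the extra call described in this issue
	fmt.Println("late streaming calls:", p.LateCalls("req-1"))
}
```

Tracking state per request ID keeps concurrent streamed requests from producing false positives in the logs.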
Environment:
- Kubernetes version (use kubectl version):
  Client Version: v1.33.5-dispatcher
  Kustomize Version: v5.6.0
  Server Version: v1.33.5-gke.1162000
- Inference extension version (use git describe --tags --dirty --always):
  e0afb897 (commit hash)
- Cloud provider or hardware configuration:
  GKE
- Install tools:
  Helm chart via getting started guide (with aforementioned custom requestcontrol plugins see SLO Aware Routing Sidecar + Plugin EPP Integration and Helm Deployment #1839)