[ENH]: fix high latency & response errors of frontend -> query service calls during rollout #5316
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving:
- Testing, Bugs, Errors, Logs, Documentation
- System Compatibility
- Quality
Graceful Shutdown and Memberlist Propagation for Query/Log Services

This PR implements coordinated, configurable graceful shutdown logic for Chroma's query and log services to address high-latency client errors and in-flight call blocking observed during Kubernetes pod rollouts. It introduces a shutdown 'grace period' to allow memberlist updates to propagate, prevents pods marked for deletion from being considered healthy by clients, and exposes grace period configuration in service config files and Helm charts for operational tuning. The change is applied to both the Rust (query/log) servers and the Go memberlist watcher, with attention to deployment compatibility and observability.

Key Changes
• Adds a

Affected Areas
• rust/worker (query service): config, main entrance, shutdown handling

This summary was automatically generated by @propel-code-bot
Description of changes
We've observed that during rollouts of query service pods, the frontend frequently returns errors to clients (originating from the query service), and in-flight calls to a query service pod that is being shut down can block for 30s+ before the frontend realizes the connection is broken.
I was able to reproduce this locally by running a script that produces concurrent query load and executing
kubectl scale --replicas=1 -n chroma statefulset query-service
(scaling from 2 replicas to 1) while the script was running.

The script output shows that several queries errored and a few took >20s. For the queries that take >20s, the pattern (as seen in staging/prod) is that the frontend tries twice to make a request to the query service. The first request takes 20-30s before erroring with a variant of a disconnect error; the second attempt succeeds. To be honest, this behavior doesn't completely make sense to me: based on the little documentation I could find, tonic/hyper is supposed to send clients a GOAWAY frame during server shutdown, which should immediately result in an error on the client. Regardless, even if clients errored immediately, there is still the possibility that a client exhausts its retry budget by only retrying against servers that have already been shut down.
Edit: chased down the client disconnect issue
This PR aims to fix the most common cause of these issues by giving the memberlist time to propagate and update on clients before terminating the query service pod. In other words, a pod that is scheduled to shut down is removed from the memberlist but stays alive for N seconds, allowing existing connections to drain and clients to refresh their local memberlist state (a rough sketch of this sequencing follows).
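For illustration only, here is a minimal sketch of that shutdown sequencing, assuming a tokio/tonic server; the function name and grace-period value are my own choices, not the actual worker code:

```rust
// Sketch (not the actual worker code): delay server shutdown after SIGTERM
// so the memberlist update can propagate to clients before the gRPC server
// stops accepting requests. Requires tokio with the "signal" and "time" features.
use std::time::Duration;
use tokio::signal::unix::{signal, SignalKind};

async fn shutdown_after_grace_period(grace_period: Duration) {
    // Kubernetes sends SIGTERM when the pod is scheduled for deletion;
    // around the same time the pod is removed from the memberlist.
    let mut sigterm = signal(SignalKind::terminate())
        .expect("failed to install SIGTERM handler");
    sigterm.recv().await;

    // Keep serving for the grace period so clients can observe the updated
    // memberlist and in-flight requests against this pod can drain.
    tokio::time::sleep(grace_period).await;

    // Returning resolves the shutdown future, which stops the server.
}

// Hypothetical usage with tonic (service registration omitted):
// Server::builder()
//     .add_service(query_service)
//     .serve_with_shutdown(addr, shutdown_after_grace_period(Duration::from_secs(30)))
//     .await?;
```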
The same script after the changes in this PR:
There are still some failure cases:
Test plan
How are these changes tested?
Script used to test querying during scale up/down
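The original script isn't reproduced here; a rough stand-in that issues concurrent queries and logs per-request status and latency could look like the following (the endpoint, payload, and concurrency level are illustrative placeholders rather than Chroma's actual query API):

```rust
// Hypothetical load generator used while scaling the query service up/down.
// Requires tokio, reqwest (with the "json" feature), and serde_json.
use std::time::Instant;

#[tokio::main]
async fn main() {
    let client = reqwest::Client::new();
    let mut handles = Vec::new();

    for i in 0..200u32 {
        let client = client.clone();
        handles.push(tokio::spawn(async move {
            let start = Instant::now();
            // Placeholder URL and body standing in for a real query request.
            let result = client
                .post("http://localhost:8000/query")
                .json(&serde_json::json!({ "request_id": i }))
                .send()
                .await;
            match result {
                Ok(resp) => println!("query {i}: {} after {:?}", resp.status(), start.elapsed()),
                Err(err) => println!("query {i}: error after {:?}: {err}", start.elapsed()),
            }
        }));
    }

    // Wait for all in-flight queries so errors and slow requests are all reported.
    for handle in handles {
        let _ = handle.await;
    }
}
```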
Migration plan
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?
Observability plan
What is the plan to instrument and monitor this change?
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?