[ENH] Start of a control interface for GC. #5218

Open: wants to merge 2 commits into base: main

rescrv (Contributor) commented Aug 6, 2025

Description of changes

This introduces an endpoint for garbage collecting collections from the
command line. The goal is to inject a collection provided on the
command line in the list of collections to clean up.

Test plan

CI

Migration plan

N/A

Observability plan

I plan to add the endpoint and then observe the world as I use it.

Documentation Changes

N/A


github-actions bot commented Aug 6, 2025

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

// This is a placeholder service. The garbage collector currently only exposes a health service.
service GarbageCollector {}
message KickoffGarbageCollectionRequest {
string collection = 1;

[CompanyBestPractice]

The variable name 'collection' in the proto field and related code doesn't specify the identifier type being used. According to our naming convention guideline, specify whether this is a collection ID, name, or other identifier type for better readability and maintainability.

Consider renaming to be more specific:

  • collection_id if using UUID/ID
  • collection_name if using string name
  • collection_identifier if the type varies

Affected files: idl/chromadb/proto/garbage_collector.proto:6, rust/garbage_collector/src/lib.rs:42
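
To illustrate why the identifier type matters, here is a minimal sketch of validating a `collection_id` string into a typed value at the service boundary. `CollectionId` and `InvalidCollectionId` are hypothetical names for this example only (the code base has its own `CollectionUuid` type, whose API is not shown in this PR), and the check assumes the hyphenated 8-4-4-4-12 UUID form.

```rust
use std::fmt;

/// Hypothetical typed wrapper for illustration; not the crate's `CollectionUuid`.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct CollectionId(String);

/// Error carrying the rejected input so callers can report it.
#[derive(Debug, PartialEq, Eq)]
pub struct InvalidCollectionId(String);

impl fmt::Display for InvalidCollectionId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "not a hyphenated UUID: {:?}", self.0)
    }
}

impl CollectionId {
    /// Accepts only the 36-character hyphenated form (8-4-4-4-12 hex digits)
    /// and stores it lowercased.
    pub fn parse(raw: &str) -> Result<Self, InvalidCollectionId> {
        let bytes = raw.as_bytes();
        let well_formed = bytes.len() == 36
            && bytes.iter().enumerate().all(|(i, b)| match i {
                8 | 13 | 18 | 23 => *b == b'-',
                _ => b.is_ascii_hexdigit(),
            });
        if well_formed {
            Ok(CollectionId(raw.to_ascii_lowercase()))
        } else {
            Err(InvalidCollectionId(raw.to_string()))
        }
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}
```

With a field named `collection_id`, a caller can tell that a collection name such as `my-collection` is not acceptable input, and a check like this would reject it before it reaches the garbage collector.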

    &self,
    req: Request<KickoffGarbageCollectionRequest>,
) -> Result<Response<KickoffGarbageCollectionResponse>, Status> {
    Err(Status::not_found("resource not found"))

[BestPractice]

The implementation returns a generic "resource not found" error, which tells callers nothing useful. Consider returning an "unimplemented" status with a message that makes clear this is a placeholder implementation.

Suggested change
- Err(Status::not_found("resource not found"))
+ Err(Status::unimplemented("garbage collection endpoint not yet implemented"))


rescrv commented Aug 6, 2025

1 Job Failed:

PR checks / Python tests / test-cluster-rust-frontend (3.9, chromadb/test/property/test_add.py)

No logs available for this step.


Summary: 1 successful workflow, 1 failed workflow

Last updated: 2025-08-08 15:55:37 UTC


propel-code-bot bot commented Aug 7, 2025

Add Manual Garbage Collection Endpoint and CLI for GC Control

This PR introduces a control interface enabling manual initiation of garbage collection for specific collections by UUID via a new gRPC endpoint and a command-line client. The implementation modifies the garbage collector service from a placeholder to a working endpoint that receives a collection_id and schedules that collection for immediate cleanup. Supporting changes include proto definition updates, Rust code for the GC controller/server, a new CLI tool, integration in the Kubernetes (Tiltfile) workflow, thread-safe manual collection tracking, and dependency additions.

Key Changes

• Introduced KickoffGarbageCollection gRPC endpoint to the service proto and Rust server implementation
• Added chroma-manual-gc Rust CLI tool for triggering manual GC for a given collection
• Modified GC service wiring to register the new endpoint alongside health checks
• Added logic to register and track manual GC requests in garbage_collector_component.rs (thread-safe using parking_lot Mutex/HashSet)
• Updated Kubernetes/Tiltfile configuration to enable GC service port-forwarding for CLI access
• Added parking_lot as a dependency and updated lockfile
• Updated integration points to ensure proper component/handle wiring and graceful shutdown

Affected Areas

• garbage collector Rust crate (src/lib.rs, src/garbage_collector_component.rs, src/bin/chroma-manual-gc.rs)
• gRPC proto definition (idl/chromadb/proto/garbage_collector.proto)
• Dependency management (Cargo.toml, Cargo.lock)
• Kubernetes/Tiltfile for deployment, port forwarding

This summary was automatically generated by @propel-code-bot

Comment on lines +6 to +25
#[tokio::main]
async fn main() {
    let args = std::env::args().skip(1).collect::<Vec<_>>();
    if args.len() != 2 {
        eprintln!("USAGE: chroma-manual-gc [HOST] [COLLECTION_UUID]");
        std::process::exit(13);
    }
    let gcservice = Channel::from_shared(args[0].clone())
        .expect("could not create channel")
        .connect()
        .await
        .expect("could not connect to gc service");
    let mut client = GarbageCollectorClient::new(gcservice);
    client
        .kickoff_garbage_collection(KickoffGarbageCollectionRequest {
            collection_id: args[1].clone(),
        })
        .await
        .expect("could not kickoff gc");
}

[BestPractice]

This new command-line tool has a few issues:

  1. It's not registered as a binary in rust/garbage_collector/Cargo.toml, so it won't be built. You'll need to add a [[bin]] section for it.
  2. The error handling relies on .expect(), which will cause the tool to panic on errors like network issues or an invalid host. This isn't very user-friendly.
  3. Channel::from_shared() expects a valid URI (e.g., http://localhost:50055). If a user provides just localhost:50055, it will panic. The tool should prepend http:// to the host argument.

Here's a suggestion that addresses the error handling and URI issues to make the tool more robust:

Suggested change
#[tokio::main]
async fn main() {
    let args: Vec<String> = std::env::args().skip(1).collect();
    if args.len() != 2 {
        eprintln!("USAGE: chroma-manual-gc [HOST:PORT] [COLLECTION_UUID]");
        std::process::exit(1);
    }
    let host = &args[0];
    let collection_id = &args[1];
    let endpoint = format!("http://{}", host);
    let channel = match Channel::from_shared(endpoint) {
        Ok(channel) => channel,
        Err(e) => {
            eprintln!("Invalid host format '{}': {}", host, e);
            std::process::exit(1);
        }
    };
    let gcservice = match channel.connect().await {
        Ok(service) => service,
        Err(e) => {
            eprintln!("Could not connect to gc service at '{}': {}", host, e);
            std::process::exit(1);
        }
    };
    let mut client = GarbageCollectorClient::new(gcservice);
    if let Err(e) = client
        .kickoff_garbage_collection(KickoffGarbageCollectionRequest {
            collection_id: collection_id.to_string(),
        })
        .await
    {
        eprintln!("Failed to kickoff gc: {}", e);
        std::process::exit(1);
    }
    println!("Successfully kicked off garbage collection for collection {}", collection_id);
}
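
One caveat with the suggestion above: it prepends `http://` unconditionally, so an argument that already carries a scheme (for example `http://localhost:50055`) would end up with a doubled prefix. A minimal sketch of a helper that adds the scheme only when it is missing follows; `normalize_endpoint` is a hypothetical name, not something present in the PR.

```rust
/// Hypothetical helper: returns the endpoint with an explicit scheme.
/// Inputs that already contain "://" are passed through unchanged;
/// anything else is assumed to be HOST:PORT and gets "http://" prepended.
pub fn normalize_endpoint(host: &str) -> String {
    let host = host.trim();
    if host.contains("://") {
        host.to_string()
    } else {
        format!("http://{}", host)
    }
}
```

The result could then be handed to `Channel::from_shared` in place of the raw `format!` call, leaving the rest of the error handling as suggested.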


Comment on lines +315 to +323
fn manual_garbage_collection_request(
    &self,
    collection_id: CollectionUuid,
) -> Result<(), GarbageCollectCollectionError> {
    tracing::event!(Level::INFO, name = "manual garbage collection", collection_id =? collection_id);
    let mut manual_collections = self.manual_collections.lock();
    manual_collections.insert(collection_id);
    Ok(())
}

[CriticalError]

While the new endpoint correctly adds collection IDs to the manual_collections set, the core logic to actually process these collections seems to be missing. The handle implementation for GarbageCollectMessage fetches collections to GC from sysdb but doesn't appear to use the manual_collections set.

To complete this feature, the GarbageCollectMessage handler needs to be updated to:

  1. Read the collection IDs from self.manual_collections.
  2. Fetch the necessary CollectionToGcInfo for these IDs from sysdb.
  3. Add these collections to the list of collections to be garbage collected in the current run.

You might need to add a new method to SysDb to fetch CollectionToGcInfo for a specific list of collection IDs, as a suitable method doesn't seem to exist yet.
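
The steps above could be sketched roughly as follows. This is a std-only illustration under several assumptions: it uses `std::sync::Mutex` where the PR uses `parking_lot`, a plain `String` stands in for `CollectionUuid`, and `ManualQueue`, `drain`, and `merge_manual` are hypothetical names. The sysdb lookup in step 2 is left out because no suitable method exists yet.

```rust
use std::collections::HashSet;
use std::mem;
use std::sync::Mutex;

/// Stand-in for the crate's `CollectionUuid`, to keep the sketch std-only.
type PendingId = String;

/// Hypothetical holder for manually requested collections.
pub struct ManualQueue {
    manual_collections: Mutex<HashSet<PendingId>>,
}

impl ManualQueue {
    pub fn new() -> Self {
        Self {
            manual_collections: Mutex::new(HashSet::new()),
        }
    }

    /// Mirrors `manual_garbage_collection_request`: record the ID for the next run.
    pub fn request(&self, id: PendingId) {
        self.manual_collections.lock().unwrap().insert(id);
    }

    /// Step 1: take every pending ID and leave an empty set behind,
    /// so each request is handled by exactly one run.
    pub fn drain(&self) -> HashSet<PendingId> {
        mem::take(&mut *self.manual_collections.lock().unwrap())
    }
}

/// Step 3: append manual IDs that are not already scheduled for this run.
/// Step 2 (fetching `CollectionToGcInfo` from sysdb) is omitted here.
pub fn merge_manual(scheduled: &mut Vec<PendingId>, manual: HashSet<PendingId>) {
    let mut extra: Vec<PendingId> = manual
        .into_iter()
        .filter(|id| !scheduled.contains(id))
        .collect();
    extra.sort(); // deterministic order, for the sketch only
    scheduled.extend(extra);
}
```

Draining with `mem::take` holds the lock only for the swap, so new requests arriving during a run would be picked up by the following one instead of being lost.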

    _: &ComponentContext<GarbageCollector>,
) {
    if let Err(err) = self.manual_garbage_collection_request(req.collection_id) {
        tracing::event!(Level::ERROR, name = "manual compaction failed", error =? err);

[BestPractice]

The log message here seems to have a copy-paste error from a compaction-related component. It should refer to "garbage collection" instead of "compaction".

Suggested change
- tracing::event!(Level::ERROR, name = "manual compaction failed", error =? err);
+ tracing::event!(Level::ERROR, name = "manual garbage collection failed", error =? err);
