[ENH]: Load HNSW index without disk intermediary #5159

Open
wants to merge 1 commit into
base: main
2 changes: 1 addition & 1 deletion Cargo.lock

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -92,7 +92,7 @@ bytemuck = "1.21.0"
rayon = "1.10.0"
validator = { version = "0.19", features = ["derive"] }
rust-embed = { version = "8.5.0", features = ["include-exclude", "debug-embed"] }
hnswlib = { version = "0.8.1", git = "https://github.com/chroma-core/hnswlib.git" }
hnswlib = { git = "https://github.com/chroma-core/hnswlib.git", rev = "736bfd15d05843e6cc6ac3b806edd21566821be" }
reqwest = { version = "0.12.9", features = ["rustls-tls-native-roots", "http2"], default-features = false }
random-port = "0.1.1"
ndarray = { version = "0.16.1", features = ["approx"] }
23 changes: 23 additions & 0 deletions rust/index/src/hnsw.rs
@@ -230,6 +230,29 @@ impl PersistentIndex<HnswIndexConfig> for HnswIndex {
dimensionality: index_config.dimensionality,
persist_path: path.into(),
ef_search,
hnsw_data: hnswlib::HnswData::default(),
})
.map_err(|e| WrappedHnswInitError::Other(e).boxed())?;

Ok(HnswIndex {
index,
id,
distance_function: index_config.distance_function.clone(),
})
}

fn load_from_hnsw_data(
hnsw_data: hnswlib::HnswData,
index_config: &IndexConfig,
ef_search: usize,
id: IndexUuid,
) -> Result<Self, Box<dyn ChromaError>> {
let index = hnswlib::HnswIndex::load_from_hnsw_data(hnswlib::HnswIndexLoadConfig {
distance_function: map_distance_function(index_config.distance_function.clone()),
dimensionality: index_config.dimensionality,
persist_path: "".into(),
ef_search,
hnsw_data: hnsw_data,
})
.map_err(|e| WrappedHnswInitError::Other(e).boxed())?;

94 changes: 71 additions & 23 deletions rust/index/src/hnsw_provider.rs
@@ -207,20 +207,27 @@ impl HnswIndexProvider {

let index_config = IndexConfig::new(dimensionality, distance_function);

let storage_path_str = match new_storage_path.to_str() {
Some(storage_path_str) => storage_path_str,
None => {
return Err(Box::new(HnswIndexProviderForkError::PathToStringError(
new_storage_path,
)));
}
};
// let storage_path_str = match new_storage_path.to_str() {
// Some(storage_path_str) => storage_path_str,
// None => {
// return Err(Box::new(HnswIndexProviderForkError::PathToStringError(
// new_storage_path,
// )));
// }
// };

// Check if the entry is in the cache, if it is, we assume
// another thread has loaded the index and we return it.
match self.get(&new_id, cache_key).await {
Some(index) => Ok(index.clone()),
None => match HnswIndex::load(storage_path_str, &index_config, ef_search, new_id) {
None => match HnswIndex::load_from_hnsw_data(
self.fetch_hnsw_segment(&new_id, prefix_path)
.await
.map_err(|e| Box::new(HnswIndexProviderForkError::FileError(*e)))?,
&index_config,
ef_search,
new_id,
) {
Comment on lines +223 to +230
Contributor

[CriticalError]

The logic for loading the index within the fork method appears to be incorrect. It attempts to fetch the segment from remote storage using new_id, but the index for new_id doesn't exist in storage yet. The files for source_id have just been copied to a local directory.

The previous implementation using HnswIndex::load(storage_path_str, ...) correctly loaded the index from this new local directory. Since fork is intended to create a mutable, file-backed copy of an index, it seems the original approach of loading from the local path should be restored.
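
To make the mismatch concrete, here is a minimal sketch under stated assumptions. `RemoteStore`, `fork_in_memory`, and the `u32` ids are hypothetical stand-ins for illustration, not Chroma or hnswlib APIs. It only models the point above: a fetch keyed by the new id fails because nothing has been stored under it, while a fetch keyed by the source id succeeds.

```rust
use std::collections::HashMap;

/// Toy stand-in for remote storage: maps an index id to its serialized bytes.
/// (Hypothetical type; the real provider talks to object storage.)
pub struct RemoteStore {
    pub segments: HashMap<u32, Vec<u8>>,
}

impl RemoteStore {
    /// Returns the bytes stored under `id`, or an error if nothing was written there.
    pub fn fetch(&self, id: u32) -> Result<Vec<u8>, String> {
        self.segments
            .get(&id)
            .cloned()
            .ok_or_else(|| format!("no segment stored for index {id}"))
    }
}

/// Forks by reading the segment stored under `source_id` and pairing the
/// copied bytes with `new_id`. Fetching by `new_id` here would fail, because
/// the fork has not been persisted yet.
pub fn fork_in_memory(
    store: &RemoteStore,
    source_id: u32,
    new_id: u32,
) -> Result<(u32, Vec<u8>), String> {
    let bytes = store.fetch(source_id)?;
    Ok((new_id, bytes))
}
```

Whether an in-memory fork like this is acceptable, or whether fork must stay file-backed as described above, is a design decision for the PR; the sketch only shows which id the fetch has to use.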

Ok(index) => {
let index = HnswIndexRef {
inner: Arc::new(RwLock::new(DistributedHnswInner {
@@ -277,10 +284,33 @@ impl HnswIndexProvider {
prefix_path: &str,
) -> Result<(), Box<HnswIndexProviderFileError>> {
// Fetch the files from storage and put them in the index storage path.
let hnsw_data = self.fetch_hnsw_segment(source_id, prefix_path).await?;
let getters = [
|hnsw_data: &hnswlib::HnswData| Arc::new(Vec::from(hnsw_data.header_buffer())),
|hnsw_data: &hnswlib::HnswData| Arc::new(Vec::from(hnsw_data.data_level0_buffer())),
|hnsw_data: &hnswlib::HnswData| Arc::new(Vec::from(hnsw_data.length_buffer())),
|hnsw_data: &hnswlib::HnswData| Arc::new(Vec::from(hnsw_data.link_list_buffer())),
];

for (file, getter) in FILES.iter().zip(getters) {
let file_path = index_storage_path.join(file);
self.copy_bytes_to_local_file(&file_path, getter(&hnsw_data))
.await?;
}
Ok(())
}
Comment on lines +287 to +301
Contributor

[PerformanceOptimization]

This function now fetches the entire HNSW segment into an in-memory HnswData object before writing the individual files to disk. The previous implementation streamed each file directly. For large indexes, this change could significantly increase memory usage during the fork operation. Was this change intentional? If fork still needs to write to disk, perhaps restoring the previous file-by-file download logic for this function would be more memory-efficient.
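
As a rough illustration of the file-by-file approach, here is a minimal synchronous sketch. The `download_each` helper and its `fetch` callback are hypothetical; the real provider is async and reads from object storage. Each buffer is written to disk and dropped before the next file is fetched, so the peak buffer held at once is the largest single file rather than the sum of all four.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Fetches each named file and writes it into `dir` before fetching the next.
/// Returns the size of the largest single buffer that was held in memory.
/// (`fetch` is a hypothetical stand-in for the storage read.)
pub fn download_each<F>(dir: &Path, names: &[&str], mut fetch: F) -> io::Result<usize>
where
    F: FnMut(&str) -> io::Result<Vec<u8>>,
{
    fs::create_dir_all(dir)?;
    let mut peak: usize = 0;
    for &name in names {
        // Only this one buffer is alive during the iteration.
        let buf = fetch(name)?;
        peak = peak.max(buf.len());
        fs::write(dir.join(name), &buf)?;
        // `buf` is dropped here, before the next fetch starts.
    }
    Ok(peak)
}
```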


async fn fetch_hnsw_segment(
&self,
source_id: &IndexUuid,
prefix_path: &str,
) -> Result<hnswlib::HnswData, Box<HnswIndexProviderFileError>> {
let mut buffers = Vec::new();
Collaborator

Rather than relying on the buffers being in a fixed order, can we expose an HnswDataBuilder with .add_<named_buffer>() methods and a .build() that returns the HnswData? That would be less bug-prone as the code changes.
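
A minimal sketch of what such a builder could look like, assuming nothing about the real hnswlib bindings: `HnswDataSketch` is a hypothetical stand-in for `hnswlib::HnswData`, and the method names are illustrative, following the four buffers named by the getters in this diff.

```rust
/// Hypothetical stand-in for `hnswlib::HnswData`; the real type's layout is
/// not shown in this PR.
#[derive(Debug, PartialEq)]
pub struct HnswDataSketch {
    pub header: Vec<u8>,
    pub data_level0: Vec<u8>,
    pub length: Vec<u8>,
    pub link_list: Vec<u8>,
}

/// Collects buffers by name so callers cannot swap them by position.
#[derive(Default)]
pub struct HnswDataBuilder {
    header: Option<Vec<u8>>,
    data_level0: Option<Vec<u8>>,
    length: Option<Vec<u8>>,
    link_list: Option<Vec<u8>>,
}

impl HnswDataBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn add_header(mut self, buf: Vec<u8>) -> Self {
        self.header = Some(buf);
        self
    }

    pub fn add_data_level0(mut self, buf: Vec<u8>) -> Self {
        self.data_level0 = Some(buf);
        self
    }

    pub fn add_length(mut self, buf: Vec<u8>) -> Self {
        self.length = Some(buf);
        self
    }

    pub fn add_link_list(mut self, buf: Vec<u8>) -> Self {
        self.link_list = Some(buf);
        self
    }

    /// Fails with a message naming the first missing buffer.
    pub fn build(self) -> Result<HnswDataSketch, String> {
        Ok(HnswDataSketch {
            header: self.header.ok_or("missing header buffer")?,
            data_level0: self.data_level0.ok_or("missing data_level0 buffer")?,
            length: self.length.ok_or("missing length buffer")?,
            link_list: self.link_list.ok_or("missing link_list buffer")?,
        })
    }
}
```

With this shape the buffers can be added in any order, and a forgotten buffer becomes an explicit error from `build()` instead of an out-of-bounds index such as `buffers[3]`.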


for file in FILES.iter() {
Collaborator

@HammadB HammadB Aug 20, 2025

Should we refactor this to fetch the files in parallel? Probably a separate PR, but it seems worth doing while we are in here.
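
For illustration only, a parallel fetch could take roughly this shape. This sketch uses scoped threads from the standard library so that it is self-contained; `fetch_all` and its `fetch` callback are hypothetical names. The provider itself is async, so the real change would more likely use a join combinator over futures instead of threads.

```rust
use std::collections::HashMap;
use std::thread;

/// Fetches every named file concurrently and returns the buffers keyed by
/// file name, so no caller depends on completion order.
/// (`fetch` is a hypothetical stand-in for the storage read.)
pub fn fetch_all<F>(names: &[&str], fetch: F) -> Result<HashMap<String, Vec<u8>>, String>
where
    F: Fn(&str) -> Result<Vec<u8>, String> + Sync,
{
    let fetch = &fetch;
    let results: Vec<Result<(String, Vec<u8>), String>> = thread::scope(|scope| {
        // Spawn one scoped thread per file; all are joined before the scope ends.
        let handles: Vec<_> = names
            .iter()
            .copied()
            .map(|name| scope.spawn(move || fetch(name).map(|buf| (name.to_string(), buf))))
            .collect();
        handles
            .into_iter()
            .map(|handle| {
                handle
                    .join()
                    .unwrap_or_else(|_| Err("fetch thread panicked".to_string()))
            })
            .collect()
    });
    // The first fetch error, if any, becomes the overall result.
    results.into_iter().collect()
}
```

Keying the results by name also fits the builder suggestion above, since each buffer can then be handed to its named setter.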

let s3_fetch_span =
tracing::trace_span!(parent: Span::current(), "Read bytes from s3", file = file);
let buf = s3_fetch_span
let _ = s3_fetch_span
.in_scope(|| async {
let key = Self::format_key(prefix_path, source_id, file);
tracing::info!("Loading hnsw index file: {} into directory", key);
@@ -304,13 +334,24 @@ impl HnswIndexProvider {
bytes_read,
key,
);
Ok(buf)
buffers.push(buf);
Ok(())
})
.await?;
let file_path = index_storage_path.join(file);
self.copy_bytes_to_local_file(&file_path, buf).await?;
}
Ok(())
match hnswlib::HnswData::new_from_buffers(
buffers[0].clone(),
buffers[1].clone(),
buffers[2].clone(),
buffers[3].clone(),
) {
Ok(hnsw_data) => Ok(hnsw_data),
Err(e) => Err(Box::new(HnswIndexProviderFileError::StorageError(
chroma_storage::StorageError::Message {
message: e.to_string(),
},
))),
}
}

pub async fn open(
@@ -356,20 +397,27 @@ impl HnswIndexProvider {

let index_config = IndexConfig::new(dimensionality, distance_function);

let index_storage_path_str = match index_storage_path.to_str() {
Some(index_storage_path_str) => index_storage_path_str,
None => {
return Err(Box::new(HnswIndexProviderOpenError::PathToStringError(
index_storage_path,
)));
}
};
// let index_storage_path_str = match index_storage_path.to_str() {
// Some(index_storage_path_str) => index_storage_path_str,
// None => {
// return Err(Box::new(HnswIndexProviderOpenError::PathToStringError(
// index_storage_path,
// )));
// }
// };

// Check if the entry is in the cache, if it is, we assume
// another thread has loaded the index and we return it.
let index = match self.get(id, cache_key).await {
Some(index) => Ok(index.clone()),
None => match HnswIndex::load(index_storage_path_str, &index_config, ef_search, *id) {
None => match HnswIndex::load_from_hnsw_data(
self.fetch_hnsw_segment(id, prefix_path)
.await
.map_err(|e| Box::new(HnswIndexProviderOpenError::FileError(*e)))?,
&index_config,
ef_search,
*id,
) {
Comment on lines +413 to +420
Contributor

[PerformanceOptimization]

This change successfully loads the index from memory, which aligns with the PR's goal. However, the open function still contains calls that write the index to a temporary directory on disk (create_dir_all, load_hnsw_segment_into_directory) before this memory-based loading occurs. These disk operations now seem redundant.

To fully load without a disk intermediary and improve efficiency, you could remove the calls to create_dir_all, load_hnsw_segment_into_directory, and purge_one_id from this function.

Ok(index) => {
let index = HnswIndexRef {
inner: Arc::new(RwLock::new(DistributedHnswInner {
9 changes: 9 additions & 0 deletions rust/index/src/types.rs
@@ -67,6 +67,15 @@ pub trait PersistentIndex<C>: Index<C> {
) -> Result<Self, Box<dyn ChromaError>>
where
Self: Sized;

fn load_from_hnsw_data(
hnsw_data: hnswlib::HnswData,
index_config: &IndexConfig,
ef_search: usize,
id: IndexUuid,
) -> Result<Self, Box<dyn ChromaError>>
where
Self: Sized;
}

/// IndexUuid is a wrapper around Uuid to provide a type for the index id.