Skip to content

Conversation

lowener
Copy link
Contributor

@lowener lowener commented Jul 16, 2025

In this PR, a new way to build IVF-PQ is added: build_from_args in Python that use a new overload of build in C++.
This new build method takes views on device memory to construct an IVF-PQ index that is extendable and searchable with as few data transfer as possible.
It is made available in C++ and Python, and will be re-used in a future preprocessing function for product quantization.

Closes #1107.

Copy link

copy-pr-bot bot commented Jul 16, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cjnolet
Copy link
Member

cjnolet commented Jul 16, 2025

Nitpick- but can we please call this "build_from_codebooks" instead of "build_from_centroids"? The codebooks are really the important piece here. Another thing we might consider doing, which would probably be more general would be to just overload the standard build function to accept the codebooks, centroids, and rotation matrix, etc... That would be more precise and more clean IMO.

@lowener lowener added the non-breaking Introduces a non-breaking change label Jul 21, 2025
@cjnolet
Copy link
Member

cjnolet commented Jul 30, 2025

Nitpick- but can we please call this "build_from_codebooks" instead of "build_from_centroids"? The codebooks are really the important piece here.

Actually, it might be more preferable to call this build_from_args. @benfred recently introduced a C API that allows building a CAGRA index from a dataset and graph and the nomenclature used was "build_from_args". I think we should be consistent.

lowener added 2 commits August 4, 2025 09:36
Signed-off-by: Mickael Ide <[email protected]>
@cjnolet cjnolet changed the base branch from branch-25.08 to branch-25.10 August 7, 2025 17:52
@cjnolet cjnolet added the improvement Improves an existing functionality label Aug 8, 2025
@lowener lowener changed the title Add build_from_centroids for IVF-PQ Add build_from_args for IVF-PQ Aug 11, 2025
@lowener lowener marked this pull request as ready for review August 11, 2025 15:35
@lowener lowener requested review from a team as code owners August 11, 2025 15:35
@@ -267,18 +267,55 @@ uint32_t cuvsIvfPqIndexGetNLists(cuvsIvfPqIndex_t index);
/** Get the dimensionality */
uint32_t cuvsIvfPqIndexGetDim(cuvsIvfPqIndex_t index);

/** Get the pq_dim */
uint32_t cuvsIvfPqIndexGetPqDim(cuvsIvfPqIndex_t index);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should decide on the API convention for trivial getters like this in the C-API.

For ivf-pq we have the getters returning the value directly, but in cagra we have the getters returning an error code with the output being passed by a pointer :

cuvsError_t cuvsCagraIndexGetDims(cuvsCagraIndex_t index, int* dim);

I think we should be consistent here across index types, and while I like this version since the corresponding method in the C++ api should be noexcept - I still think we maybe should be returning an error code here instead (just in case a different index can't guarantee noexcept etc).

@lowener @cjnolet thoughts here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a good proposition, I made the changes to support error codes for the getters

Comment on lines 2996 to 2997

void extract_codebook(raft::resources const& res,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void extract_codebook(raft::resources const& res,
/// @copydoc extract_codebook
void extract_codebook(raft::resources const& res,

raft::device_mdspan<const float, pq_centers_extents, raft::row_major> pq_centers() const noexcept;
/** Owning view of the PQ centers */
raft::device_mdspan<float, pq_centers_extents, raft::row_major> pq_centers_owning_view();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to expose both pq_centers and pq_centers_owning_view ? I'm worried this is complicating the public API, and leaking implementation details about whether the underlying memory is owned or not to the user.

CAGRA handles this by returning the graph_view_ directly if its non-owning, and setting the graph_view_ to point at the owned graph if its owning the memory

graph_view_ = graph_.view();
. This means the index.graph() getter handles both cases transparently.

Would something similar work here? (and if not - why do we need to have two different views on potentially different pq_centers inside the index?).

(also the owning_view naming convention here is a bit confusing to me - since I don't see how you can have a mdspan view that also owns the data )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is adding a bit of complexity to the API but it allows us to be able to get non-const pointers to those arrays that can be used in the building step.
If the only way to write to these owned array was by making a full copy, then operations like these that transpose the pq centers from a workspace buffer to the index would require an extra buffer of the same size to make the transpose into, then copy it into the index-owned array.

The user would not get two views of different pq-centers:

  • If IVF-PQ was built with a dataset the 2 views (const and non-const) point to the same memory location
  • If IVF-PQ was built from args, only the const view should be used, the non-const throws an exception

I think there is some value in being able to get a non-const pointer when needed.
What would be a better naming convention for those that would be less confusing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you have this as an overload with the same name? I think they should be distinguishable here at the compile time due to different const-ness of the index. This would also be more along the lines of a typical C++ code.

Copy link
Contributor Author

@lowener lowener Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be an overload with the same name, but it would be much more error-prone for the user.
index.pq_centers().size() would be ambiguous and could return the empty mdarray size (0) instead of the mdspan size.
I just tested locally the overload and for a call to raft::linalg::gemm (here) that requires a const pointer, index.rotation_matrix().data_handle() is ambiguous as well since the compiler can take non-const view and get a const pointer from it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that gemm should probably be replaced with an mdspan-based version instead of using raw pointers :) but I see your point. I guess, one could use std::enable_if_t<...> to disable the inappropriate version, or even make a single function out of the two with an if constexpr inside?..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(We discussed about it in a call, the problem is that having both under the same function name can lead to ambiguity at runtime and it would run into runtime exception)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should just keep the const or not keep it and not opt to support both. In practice, I think most users are going to be getting the centers and not modifying them, but we could probably remove the const designator altogether. Would that solve this problem?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I agree w/ the others above- the user shouldn't be burdened with this- they should just be invoking an accessor method on the object, ideally named after the member they are accessing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified that code to use kind of the same mechanism as CAGRA with update_* to only allow setting that memory through copy. The API is less confusing but users now don't have access to a non-const view. Let me know your opinions on it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified that code to use kind of the same mechanism as CAGRA with update_* to only allow setting that memory through copy.

We need to find a way to allow zero copy- both for accessing and mutating. They should be able to do the copy themselves if they so need to. The general assumption should be "direct access, copies should be explicit"

achirkin

This comment was marked as outdated.

Copy link
Contributor

@achirkin achirkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above we've been discussing the peculiarities of accessing the parts of the index depending on how it was constructed. It seems there's no perfect solution: before the PR, IVF-PQ index can be const or non-const, and that implies a user can modify/write its members, such as PQ centers. After the PR, it seems there's no way to override the pq_centers content if the user has built the index using the old api.

Also I'm not a fan of the idea to have IVF-PQ sometimes own its members and sometimes not own. We've tried that with CAGRA and I think that is a sub-optimal user experience.

What if instead we'd pass the mdarrays by rvalue references to the new build overload and to the new constructor? Let the user pass the ownership to IVF-PQ. This would keep the index fully owning all its members and save the user from the burden of managing the lifetimes of multiple dependent objects.

And a small nitpick: could you please add a proper description to the PR? Just adding "closes #issue" doesn't look very nice in the main branch commit history and doesn't reflect how the PR addresses the issue. It is also funny that the build_from_args in the title of the PR only ever appears in the changeset in a docsting :)

benfred added a commit to benfred/cuvs that referenced this pull request Aug 21, 2025
Java doesn't support unsigned types well, and this is leading
to us only supporting 2**31 items in the cagra index - instead
of 2**32.

Fix by changing the getters in the c-api to return int64_t values
for the size/dim etc, instead of uint32_t.

We were also not consistent on returning error codes in ivf_flat/ivf_pq
for these getters - with cagra returning an error code, and the ivf*
methods just returning the value directly. Change to always return an
error code to be consistent. (Note that rapidsai#1123 is also making this change
for ivf-pq - and I'm skipping ivf-pq to avoid merge conflicts).

/** Get the dimensionality */
uint32_t cuvsIvfPqIndexGetDim(cuvsIvfPqIndex_t index);
cuvsError_t cuvsIvfPqIndexGetDim(cuvsIvfPqIndex_t index, uint32_t* dim);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1272 is using int64_t for these getters for ivf-flat and CAGRA? Can we do the same?

raft::device_matrix_view<const float, uint32_t, raft::row_major> centers() const noexcept;
/** Setter for the centers [n_lists, dim_ext] or [n_lists, dim] */
void update_centers(raft::resources const& handle,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How exactly is this different from the set_centers helper that we currently have? Can we re-use set_centers in places where you are calling update_centers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new update_* functions are required because there is no possibility to obtain a non-const pointer to those data members. So any modification to those data members must go through an index member function.
So in the current implementation set_centers can't modify the centers directly anymore, it has to go through update_centers. There is still some value in keeping set_centers because it will compute the rotated centers for the users.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! Feel free to mark this one as resolved.

Signed-off-by: Mickael Ide <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cpp improvement Improves an existing functionality non-breaking Introduces a non-breaking change Python
Projects
Development

Successfully merging this pull request may close these issues.

[FEA] Allow IVFPQ index object to be constructed with coarse centroids and codebooks
5 participants