Add API to count number of deleted rows across deletion vector(s) #21963

Open
mhaseeb123 wants to merge 15 commits into rapidsai:main from mhaseeb123:fea/count-deleted-rows-from-dv

mhaseeb123 (Member) commented Mar 31, 2026

Description

Author's note: This PR is really tiny (most of it is just tests). The line count you see at the top right comes from a refactor: I split a large file into two (adding a new header file) and deduplicated some code. Please see my inline comments to skip over code that is moved as is. Thanks!

Closes #21937

This PR adds a new API to count the number of deleted rows across input deletion vectors using the specified index column information.
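
To make the idea concrete, here is a minimal host-side sketch of counting deleted rows. It is not the libcudf implementation: `deletion_vector` (a `std::set` standing in for a 64-bit roaring bitmap), `row_range`, and both function names are hypothetical, and the sketch assumes the i-th deletion vector applies to the i-th contiguous row range.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical stand-in for a 64-bit roaring bitmap: an ordered set of deleted row indices.
using deletion_vector = std::set<std::uint64_t>;

// Hypothetical description of a contiguous run of row indices: [offset, offset + num_rows).
struct row_range {
  std::uint64_t offset;
  std::uint64_t num_rows;
};

// Count the deleted row indices of one deletion vector that fall inside one row range.
std::size_t count_deleted_in_range(deletion_vector const& dv, row_range const& range)
{
  auto const first = dv.lower_bound(range.offset);
  auto const last  = dv.lower_bound(range.offset + range.num_rows);
  return static_cast<std::size_t>(std::distance(first, last));
}

// Assumption: deletion vector i applies to row range i. Sum the per-range counts.
std::size_t count_deleted_rows(std::vector<deletion_vector> const& deletion_vectors,
                               std::vector<row_range> const& ranges)
{
  std::size_t total = 0;
  auto const n      = std::min(deletion_vectors.size(), ranges.size());
  for (std::size_t i = 0; i < n; ++i) {
    total += count_deleted_in_range(deletion_vectors[i], ranges[i]);
  }
  return total;
}
```

The count is obtained without materializing a row mask, which is presumably the point of a dedicated counting API.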

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot bot commented Mar 31, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code.) label Mar 31, 2026
mhaseeb123 added the 2 - In Progress (Currently a work in progress) label Mar 31, 2026
mhaseeb123 added the feature request (New feature or request), non-breaking (Non-breaking change), cuIO (cuIO issue), and Spark (Functionality that helps Spark RAPIDS) labels Mar 31, 2026
github-actions bot added the CMake (CMake build issue) label Mar 31, 2026
mhaseeb123 changed the title from "🚧 Add API to count number of deleted rows across deletion vector(s)" to "Add API to count number of deleted rows across deletion vector(s)" Mar 31, 2026

namespace cudf::io::parquet::experimental {

// Type alias for the cuco 64-bit roaring bitmap
Member Author

All of this is moved as is to deletion_vectors_helpers.hpp/.cu to declutter this file a bit.


namespace cudf::io::parquet::experimental {

void prepend_index_column_to_table_metadata(table_metadata& metadata)
Member Author

Moved as is from the anonymous namespace of deletion_vectors.cu.

cudf::host_span<size_type const> rows_per_deletion_vector,
OutputIterator output,
rmm::cuda_stream_view stream)
{
Member Author

Refactored into a common function used from both compute_row_mask_column (old) and compute_deleted_row_count (new)
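
One way to picture this kind of refactor is a single probing helper that writes one flag per row to an `OutputIterator`, with the two callers differing only in the iterator they pass. The sketch below is a plain host C++ illustration under assumptions, not the code in this PR: `probe_deletion_vectors`, `make_deleted_mask`, `count_deleted`, and `counting_output` are hypothetical names, a `std::set` stands in for the roaring bitmap, and the flag here means "row is deleted" (the real row mask may use the opposite convention).

```cpp
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical stand-in for a 64-bit roaring bitmap.
using deletion_vector = std::set<std::uint64_t>;

// Shared helper: the next rows_per_deletion_vector[i] row indices are probed against
// deletion_vectors[i], and one flag per row (true = deleted) is written to `output`.
template <typename OutputIterator>
OutputIterator probe_deletion_vectors(std::vector<std::uint64_t> const& row_indices,
                                      std::vector<deletion_vector> const& deletion_vectors,
                                      std::vector<std::size_t> const& rows_per_deletion_vector,
                                      OutputIterator output)
{
  std::size_t row = 0;
  for (std::size_t i = 0; i < deletion_vectors.size(); ++i) {
    for (std::size_t j = 0; j < rows_per_deletion_vector[i] && row < row_indices.size();
         ++j, ++row) {
      *output++ = deletion_vectors[i].count(row_indices[row]) > 0;
    }
  }
  return output;
}

// Output iterator that tallies `true` flags instead of storing them.
struct counting_output {
  std::size_t* count;
  counting_output& operator*() { return *this; }
  counting_output& operator++() { return *this; }
  counting_output operator++(int) { return *this; }
  counting_output& operator=(bool is_deleted)
  {
    if (is_deleted) { ++(*count); }
    return *this;
  }
};

// Caller 1: materialize the per-row flags.
std::vector<bool> make_deleted_mask(std::vector<std::uint64_t> const& row_indices,
                                    std::vector<deletion_vector> const& deletion_vectors,
                                    std::vector<std::size_t> const& rows_per_deletion_vector)
{
  std::vector<bool> mask;
  probe_deletion_vectors(
    row_indices, deletion_vectors, rows_per_deletion_vector, std::back_inserter(mask));
  return mask;
}

// Caller 2: only count the deleted rows, with no mask allocation.
std::size_t count_deleted(std::vector<std::uint64_t> const& row_indices,
                          std::vector<deletion_vector> const& deletion_vectors,
                          std::vector<std::size_t> const& rows_per_deletion_vector)
{
  std::size_t count = 0;
  probe_deletion_vectors(
    row_indices, deletion_vectors, rows_per_deletion_vector, counting_output{&count});
  return count;
}
```

Keeping the probing loop in one place means the mask path and the count path cannot drift apart in how they map rows to deletion vectors.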

std::queue<chunked_parquet_reader::roaring_bitmap_impl>& deletion_vectors,
std::queue<size_type>& deletion_vector_row_counts,
rmm::cuda_stream_view stream)
{
Member Author

Similarly refactored into a common function called from compute_partial_row_mask_column (old function) and compute_partial_deleted_row_count (new function)
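
For the chunked path, the queue-based parameters suggest that deletion vectors are consumed front to back as chunks arrive. The following host-side sketch shows one plausible reading of that bookkeeping, under stated assumptions: `count_deleted_in_chunk` is a hypothetical name, a `std::set` stands in for the roaring bitmap, and each queued row count is taken to mean "rows still to be probed against the front deletion vector". The actual functions in this PR may behave differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <set>
#include <vector>

// Hypothetical stand-in for a 64-bit roaring bitmap.
using deletion_vector = std::set<std::uint64_t>;

// Count deleted rows in one chunk. A deletion vector stays at the front of the queue
// until all of its rows have been seen, so it may span several chunks.
std::size_t count_deleted_in_chunk(std::vector<std::uint64_t> const& chunk_row_indices,
                                   std::queue<deletion_vector>& deletion_vectors,
                                   std::queue<std::size_t>& deletion_vector_row_counts)
{
  std::size_t num_deleted = 0;
  std::size_t row         = 0;
  while (row < chunk_row_indices.size() && !deletion_vectors.empty()) {
    auto& remaining = deletion_vector_row_counts.front();
    auto const n    = std::min(remaining, chunk_row_indices.size() - row);
    for (std::size_t j = 0; j < n; ++j) {
      num_deleted += deletion_vectors.front().count(chunk_row_indices[row + j]);
    }
    row += n;
    remaining -= n;
    if (remaining == 0) {
      // This deletion vector is fully consumed; move on to the next one.
      deletion_vectors.pop();
      deletion_vector_row_counts.pop();
    }
  }
  return num_deleted;
}
```

Summing the return value over all chunks gives the total deleted row count for the read.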

/**
* @brief Opaque wrapper class for cuco's 64-bit roaring bitmap
*/
struct chunked_parquet_reader::roaring_bitmap_impl {
Member Author

Moved as is

mhaseeb123 added the 3 - Ready for Review (Ready for review by team) label and removed the 2 - In Progress (Currently a work in progress) label Mar 31, 2026
mhaseeb123 marked this pull request as ready for review March 31, 2026 21:06
mhaseeb123 requested review from a team as code owners March 31, 2026 21:06

/**
* @copydoc cudf::io::parquet::experimental::compute_num_deleted_rows
*/
[[nodiscard]] size_t compute_num_deleted_rows(deletion_vector_info const& deletion_vector_info,
Member Author

This is a new function that you should review.

mhaseeb123 and others added 3 commits March 31, 2026 14:21
Co-authored-by: David Wendt <45795991+davidwendt@users.noreply.github.com>
@mhaseeb123 mhaseeb123 requested a review from davidwendt March 31, 2026 21:36
Labels

  • 3 - Ready for Review: Ready for review by team
  • CMake: CMake build issue
  • cuIO: cuIO issue
  • feature request: New feature or request
  • libcudf: Affects libcudf (C++/CUDA) code.
  • non-breaking: Non-breaking change
  • Spark: Functionality that helps Spark RAPIDS

Development

Successfully merging this pull request may close these issues.

[FEA] a new API to compute the deleted row count within row ranges in the deletion vector.