@jinsolp jinsolp commented Nov 7, 2025

Closes #7377

This PR optimizes build_condensed_hierarchy in HDBSCAN.
Our previous implementation runs a top-down BFS traversal of the tree and launches a GPU kernel for every level. This is very slow because the tree is not balanced, so it can have many levels, each of which costs a separate kernel launch.
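
To make the cost concrete, here is a minimal sketch (not code from this PR) that counts how many levels a level-by-level traversal would visit. It assumes a dendrogram stored as a parent array, with leaves numbered first, every parent index larger than its children's, and the root stored last with parent -1. The names count_levels and make_chain_dendrogram are hypothetical helpers used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: number of levels in a tree given as a parent array.
// Assumes parent[i] > i for every non-root node and that the root is last.
// A top-down, level-by-level traversal needs one pass per level.
int count_levels(const std::vector<int>& parent) {
  const int n = static_cast<int>(parent.size());
  if (n == 0) return 0;
  assert(parent[n - 1] == -1);  // the root is the last node
  std::vector<int> depth(n, 0);
  int max_depth = 0;
  // Parents have larger indices, so they are processed before their children.
  for (int i = n - 2; i >= 0; --i) {
    assert(parent[i] > i && parent[i] < n);
    depth[i] = depth[parent[i]] + 1;
    max_depth = std::max(max_depth, depth[i]);
  }
  return max_depth + 1;
}

// Hypothetical helper: a fully unbalanced (chain-shaped) dendrogram in which
// each merge joins the previous cluster with one new leaf.
std::vector<int> make_chain_dendrogram(int n_leaves) {
  assert(n_leaves >= 2);
  const int n_nodes = 2 * n_leaves - 1;
  std::vector<int> parent(n_nodes, -1);
  parent[0] = n_leaves;  // leaves 0 and 1 form the first internal node
  for (int leaf = 1; leaf < n_leaves; ++leaf) parent[leaf] = n_leaves + leaf - 1;
  for (int node = n_leaves; node < n_nodes - 1; ++node) parent[node] = node + 1;
  return parent;
}
```

For a chain-shaped dendrogram over 8 leaves this gives 8 levels, while a perfectly balanced dendrogram over 8 leaves has 4. In the worst case the number of per-level launches therefore grows linearly with the number of leaves rather than logarithmically.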

This PR introduces a bottom-up approach that chases parent pointers up the tree on the CPU using OpenMP threads.
This is much faster, with no accuracy loss in the final result.
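
A rough sketch of the bottom-up idea, under stated assumptions: the tree is a parent array with the root marked by -1, and is_persistent flags the nodes this PR calls persistent (how a node qualifies is not shown in this conversation). The function name nearest_persistent_ancestor is hypothetical, and the actual implementation in this PR may differ.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch, not the PR's code. For each leaf, follow parent
// pointers upward and stop at the first persistent ancestor, or at the root
// if no persistent ancestor exists.
std::vector<int> nearest_persistent_ancestor(const std::vector<int>& parent,
                                             const std::vector<char>& is_persistent,
                                             int n_leaves) {
  assert(parent.size() == is_persistent.size());
  assert(n_leaves >= 0 && n_leaves <= static_cast<int>(parent.size()));
  std::vector<int> result(n_leaves, -1);
  // Leaves are independent of each other, so the loop can be split across
  // OpenMP threads. Without OpenMP the pragma is ignored and it runs serially.
#pragma omp parallel for schedule(static)
  for (int leaf = 0; leaf < n_leaves; ++leaf) {
    int node = leaf;
    while (parent[node] != -1) {
      node = parent[node];             // one pointer-chasing step
      if (is_persistent[node]) break;  // early stop at a persistent node
    }
    result[leaf] = node;
  }
  return result;
}
```

Each leaf writes only its own output slot, so no synchronization is needed in this sketch. The work per leaf is the length of its climb, which is why the shape of the tree matters.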

The table below shows the two main parts of our HDBSCAN implementation (build linkage and condense).
adjusted_rand_score is computed against our implementation using the brute-force graph build + the original GPU condense implementation.

BF + orig: Brute force MR graph build + original top-down GPU condense
NND + orig: nn-descent MR graph build + original top-down GPU condense
BF + new: Brute force MR graph build + new bottom-up CPU condense in this PR
NND + new: nn-descent MR graph build + new bottom-up CPU condense in this PR

Screenshot 2025-11-06 at 6 50 26 PM

@jinsolp jinsolp requested a review from a team as a code owner November 7, 2025 02:46
@jinsolp jinsolp requested a review from lowener November 7, 2025 02:46
@jinsolp jinsolp changed the title condense on cpu Optimizing condense hierarchy of HDBSCAN Nov 7, 2025
@jinsolp jinsolp added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Nov 7, 2025
@jinsolp jinsolp changed the title Optimizing condense hierarchy of HDBSCAN Improved condense hierarchy of HDBSCAN Nov 7, 2025
@divyegala divyegala self-requested a review November 7, 2025 04:28
jinsolp commented Nov 7, 2025

It seems that the number of OpenMP threads affects performance, but the time doesn't always scale linearly with the thread count. I believe it depends on the shape of the tree.

I think the number of persistent nodes might matter: if there are more persistent nodes, each thread is likely to climb fewer levels of the tree, because it climbs only until it runs into a persistent node.
The ratio (persistent nodes) / (total internal nodes) might be the relevant quantity.
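
To illustrate the reasoning, here is a small sketch that counts the total number of parent-pointer steps taken by all leaves for a given set of persistent nodes. It is an assumption-laden toy, not a measurement from this PR: the tree is a parent array with the root marked by -1, and total_climb_steps is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: total number of upward steps over all leaves when each
// leaf climbs until it reaches a persistent node or the root.
long long total_climb_steps(const std::vector<int>& parent,
                            const std::vector<char>& is_persistent,
                            int n_leaves) {
  assert(parent.size() == is_persistent.size());
  assert(n_leaves >= 0 && n_leaves <= static_cast<int>(parent.size()));
  long long total = 0;
  // Per-thread partial sums are combined by the reduction clause.
#pragma omp parallel for reduction(+ : total)
  for (int leaf = 0; leaf < n_leaves; ++leaf) {
    int node = leaf;
    while (parent[node] != -1) {
      node = parent[node];
      ++total;
      if (is_persistent[node]) break;
    }
  }
  return total;
}
```

On a chain-shaped tree with 4 leaves and 3 internal nodes (parent array {4, 4, 5, 6, 5, 6, -1}), this gives 9 steps with no persistent nodes, 6 steps when only node 5 is persistent (ratio 1/3), and 4 steps when all internal nodes are persistent (ratio 1).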
Screenshot 2025-11-07 at 12 38 22 PM

jinsolp commented Nov 12, 2025

The heuristics were chosen after confirming that the persistent/internal node ratio does affect the performance of this CPU implementation.

Screenshot 2025-11-12 at 10 32 37 AM

@jinsolp jinsolp changed the base branch from main to release/25.12 November 17, 2025 17:03
@csadorf csadorf requested review from csadorf and removed request for lowener November 17, 2025 17:20
@csadorf csadorf left a comment

Review is not yet complete, but here is a first set of comments.


/* Heuristic dispatching to CPU. A high persistent_ratio means there are more chances to stop early
as we climb up the tree, making it more efficient for bottom-up CPU approach*/
bool dispatch_to_cpu(int num_persistent, int n_leaves)
Contributor

I think there is a very good chance that after we introduce this change, all of our testing either dispatches to CPU or does not dispatch to CPU. We need to ensure that we have a variety of test conditions such that both paths are hit.

Comment on lines 158 to 167
if (persistent_ratio >= 0.001) {
return true;
} else if (persistent_ratio >= 0.0001 && num_omp_threads >= 16) {
return true;
} else if (num_omp_threads >= 64) {
return true;
} else {
return false;
}
}
Contributor

Should this heuristic really be independent of data set size?

Contributor Author

This is independent of dataset size because it branches on the ratio, not on the absolute number.

Contributor

My point is that I suspect that this heuristic should take dataset size into account. That's something we should at least evaluate.
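
For reference, the heuristic quoted in this thread can be written as a self-contained function, which makes it easy to exercise both outcomes in a test. This is a sketch under two assumptions that the quoted snippet does not confirm: the ratio is computed as num_persistent / (n_leaves - 1), since a binary dendrogram over n_leaves leaves has n_leaves - 1 internal nodes, and the thread count is passed in explicitly. The name dispatch_to_cpu_sketch is hypothetical.

```cpp
#include <cassert>

// Sketch of the quoted heuristic. The ratio formula and the explicit
// num_omp_threads parameter are assumptions made for illustration.
bool dispatch_to_cpu_sketch(int num_persistent, int n_leaves, int num_omp_threads) {
  assert(n_leaves >= 2 && num_persistent >= 0);
  const double persistent_ratio =
      static_cast<double>(num_persistent) / static_cast<double>(n_leaves - 1);
  if (persistent_ratio >= 0.001) return true;                            // many early stops
  if (persistent_ratio >= 0.0001 && num_omp_threads >= 16) return true;  // fewer stops, more threads
  return num_omp_threads >= 64;                                          // rely on threads alone
}
```

With 100001 leaves, 1000 persistent nodes and 8 threads this returns true, while 20 persistent nodes and 8 threads returns false. A test suite could use inputs like these to make sure both the CPU and GPU paths are hit.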

divyegala previously approved these changes Nov 20, 2025
@divyegala divyegala dismissed their stale review November 20, 2025 16:39

Still in evaluation stage.

jinsolp commented Nov 20, 2025

Changing the target branch to main to target 26.02.

@jinsolp jinsolp changed the base branch from release/25.12 to main November 20, 2025 17:32
@jinsolp jinsolp force-pushed the opt-condense-hierarchy branch from 78ce1b0 to d914709 Compare November 21, 2025 00:06
@jinsolp jinsolp requested review from a team as code owners November 21, 2025 00:06
@jinsolp jinsolp requested review from jcrist and msarahan November 21, 2025 00:06
copy-pr-bot bot commented Nov 21, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added conda conda issue Cython / Python Cython or Python issue ci labels Nov 21, 2025
@jinsolp jinsolp force-pushed the opt-condense-hierarchy branch from d914709 to 1ce68a4 Compare November 21, 2025 00:07
@github-actions github-actions bot removed conda conda issue Cython / Python Cython or Python issue ci labels Nov 21, 2025
@jinsolp jinsolp force-pushed the opt-condense-hierarchy branch from 1ce68a4 to 409acf0 Compare November 21, 2025 00:09
@jinsolp jinsolp removed request for a team, jcrist and msarahan November 21, 2025 00:09
jinsolp commented Nov 21, 2025

Force-pushed because of issues while rebasing onto the main branch.

Labels

CUDA/C++ improvement Improvement / enhancement to an existing function non-breaking Non-breaking change

Development

Successfully merging this pull request may close these issues.

Optimize build_condensed_hierarchy

3 participants