-
Notifications
You must be signed in to change notification settings - Fork 600
Improved condense hierarchy of HDBSCAN #7459
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
csadorf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review is not yet complete, but here is a first set of comments.
cpp/src/hdbscan/detail/condense.cuh
Outdated
|
|
||
| /* Heuristic dispatching to CPU. A high persistent_ratio means there are more chances to stop early | ||
| as we climb up the tree, making it more efficient for bottom-up CPU approach*/ | ||
| bool dispatch_to_cpu(int num_persistent, int n_leaves) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a very good chance that after we introduce this change, all of our testing either dispatches to CPU or does not dispatch to CPU. We need to ensure that we have a variety of test conditions such that both paths are hit.
cpp/src/hdbscan/detail/condense.cuh
Outdated
| if (persistent_ratio >= 0.001) { | ||
| return true; | ||
| } else if (persistent_ratio >= 0.0001 && num_omp_threads >= 16) { | ||
| return true; | ||
| } else if (num_omp_threads >= 64) { | ||
| return true; | ||
| } else { | ||
| return false; | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this heuristic really be independent of data set size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is independent of dataset size because it's branching based on the ratio, not the absolute number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point is that I suspect that this heuristic should take dataset size into account. That's something we should at least evaluate.
|
Changing target branch to main to target 26.02 |
78ce1b0 to
d914709
Compare
d914709 to
1ce68a4
Compare
1ce68a4 to
409acf0
Compare
|
force pushed because of issues while rebasing to the main branch |


Closes #7377
This PR optimizes the
build_condensed_hierarchyof HDBSCAN.Our previous implementation runs a top-down bfs tree traversal, where the GPU kernel is launched for every level of the tree. This is very slow because the tree is not balanced.
This PR introduces a bottom-up approach by pointer-chasing up to the parent on the CPU using omp threads.
This is much faster without any accuracy loss in the final result.
Table below shows two main parts of our HDBSCAN implementation (build linkage, and condense).
adjusted_rand_scoreis computed against our implementation using brute force graph build + original GPU condense implementation.BF + orig : Brute force MR graph build + original top-down GPU condense
NND + orig: nn-descent MR graph build + original top-down GPU condense
BF + new: Brute force MR graph build + new bottom-up CPU condense in this PR
NND + new: nn-descent MR graph build + new bottom-up CPU condense in this PR