GPU-Supported UMAP on Large Dataset #1209

carobs9 · 2025-07-08T11:44:19Z

Hello,

I am trying to fit GPU-supported UMAP on a dataset of 3.8 million samples and then apply HDBSCAN to the reduced embeddings, which have 15 dimensions. Finally, I reduce these 15-dimensional embeddings to two dimensions for visualization.

Unfortunately, the results are not as expected. Instead of obtaining many different clusters, it seems like I get one big cluster that does not contain any structural information from my original 384-dim embeddings. I have tried to tweak parameters like n_epochs, n_neighbors or min_dist, but I still get one big cluster. I have also tried to reduce the initial embeddings to 10 or 5 dimensions instead of 15.

Are there any tweaks that can be done to get more nuanced clusters?

Here are my specifications:

umap_params = {
    "n_epochs": 30_000 if utils.get_device() == "cuda" else None,
    "init": "random",
    "n_neighbors": 20,
    "n_components": 15,
    "min_dist": 0.5,
    "low_memory": True,
    "random_state": cfg.SEED,
    "umap_version": "cuML" if "cuml" in str(UMAP) else "umap-learn",
}
umap_15d = UMAP(
    n_epochs=umap_params["n_epochs"],
    init=umap_params["init"],
    n_neighbors=umap_params["n_neighbors"],
    n_components=umap_params["n_components"],
    min_dist=umap_params["min_dist"],
    low_memory=umap_params["low_memory"],
    random_state=umap_params["random_state"],
)
embeddings_15d = umap_15d.fit_transform(embeddings)  # these initial embeddings have 384 dimensions

umap_2d = UMAP(
    n_neighbors=20,
    n_components=2,
    min_dist=0.5,
    spread=2.5,
    metric='cosine',
    random_state=cfg.SEED,
)
embeddings_2d = umap_2d.fit_transform(embeddings)

Thanks in advance!

DanteTrb · 2025-07-21T06:37:02Z

Hi @carobs9! 🫡

Great question — UMAP + HDBSCAN on millions of samples is incredibly powerful, but also sensitive to parameter tuning and workflow design.

Here's what might be happening

You're reducing from 384 → 15 → 2 dimensions, but even in 15D you're seeing only one blob. This usually points to inadequate preservation of local structure, or to HDBSCAN seeing noise instead of real clusters.

Let’s go step-by-step to unlock nuanced clusters:

🔍 Step 1: Understand what HDBSCAN "sees"

HDBSCAN relies on:

Density contrast: it needs dense regions separated by sparser ones.
Metric: it runs on the UMAP output, where Euclidean distance (the default) is the natural choice; cosine matters more on the UMAP input side.
Preservation of high-dimensional relationships: if UMAP is too aggressive, HDBSCAN just sees mush.
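
On the metric point above: cosine and Euclidean distance are closely related once vectors are scaled to unit length. For unit vectors u and v (an assumption here; neither the raw embeddings nor the UMAP output are guaranteed to be normalized), with θ the angle between them:

```latex
\[
\lVert u - v \rVert^2
  = \lVert u \rVert^2 + \lVert v \rVert^2 - 2\, u \cdot v
  = 2\,(1 - \cos\theta)
\]
```

So on L2-normalized inputs, Euclidean distance is a monotone function of cosine distance and ranks neighbors identically.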

🛠️ Step 2: Fix the embedding pipeline

Try this:

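# UMAP is the same class as in the question (cuML or umap-learn).
# SEED and X_384 stand for your random seed and the original 384-dim embeddings.
umap_15d = UMAP(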
    n_neighbors=50,         # ← larger values preserve global structure
    min_dist=0.1,           # ← tighter clusters
    n_components=15,
    metric='cosine',        # ← better for text / embedding-like data
    init='spectral',        # ← helps with separation in high-D
    random_state=SEED
)

X_15d = umap_15d.fit_transform(X_384)

# Then:
umap_2d = UMAP(
    n_neighbors=15,
    min_dist=0.3,
    n_components=2,
    spread=1.5,
    metric='cosine',
    random_state=SEED
)

X_2d = umap_2d.fit_transform(X_15d)

Then apply HDBSCAN to X_15d, not to the 2D projection:

from hdbscan import HDBSCAN

clusterer = HDBSCAN(
    min_cluster_size=100,          # tune this based on scale
    metric='euclidean',            # ← UMAP's output space is Euclidean by default, so start here
    cluster_selection_method='eom'
)

labels = clusterer.fit_predict(X_15d)
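
One quick way to check whether HDBSCAN is finding structure, independent of how the 2D plot looks, is to summarize the labels. This is a minimal sketch: `summarize_labels` is a hypothetical helper (not part of hdbscan), and it relies only on the convention that HDBSCAN marks noise points with the label -1.

```python
from collections import Counter


def summarize_labels(labels):
    """Summarize HDBSCAN-style cluster labels, where -1 marks noise points."""
    counts = Counter(int(label) for label in labels)
    total = sum(counts.values())
    noise = counts.pop(-1, 0)                  # remove noise from the cluster counts
    largest = max(counts.values(), default=0)  # size of the biggest real cluster
    return {
        "n_clusters": len(counts),
        "noise_fraction": noise / total if total else 0.0,
        "largest_cluster_fraction": largest / total if total else 0.0,
    }


# Toy labels for illustration only; in practice pass the `labels` array from above.
print(summarize_labels([0, 0, 0, 1, 1, -1, -1, 2]))
```

A very high noise fraction, or one cluster holding nearly all points, would suggest revisiting min_cluster_size or the UMAP settings before reading anything into the visualization.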

Why not 2D? Because:
Clustering in 2D is like deciding traffic patterns from a paper map: nice to look at, but low resolution.
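
If the 2D projection is only for inspection, note that drawing millions of points in one scatter plot tends to overplot into a single mass whatever the underlying structure. A minimal sketch, assuming you plot a reproducible random subsample colored by the 15D cluster labels; `sample_indices` is a hypothetical helper and the sample size is an arbitrary choice:

```python
import random


def sample_indices(n_points, n_sample, seed=0):
    """Return a sorted, reproducible random subset of row indices."""
    n_sample = min(n_sample, n_points)  # never ask for more rows than exist
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_points), n_sample))


idx = sample_indices(3_800_000, 100_000, seed=42)
# With NumPy arrays from the steps above, X_2d[idx] and labels[idx] select
# matching rows, so the subsample can be colored by its 15D cluster label.
```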

Let me know if this solves your issue or if anything remains unclear. Happy to help refine it further.
(Also, marking it as resolved could help others in the community who run into similar UMAP+HDBSCAN challenges. 🙌)

1 reply

carobs9 Jul 26, 2025
Author

Hello @DanteTrb. Thank you for your answer! The visualization still displays one big blob, but the implementation you suggested helped improve the results, particularly in terms of topic coherence. Maybe the 2D visualization just cannot summarize all of the embedding nuances correctly.
