UMAP representation with more than three components #1212

alopgar · 2025-08-05T17:25:33Z

Hello!
I have an experiment with two different datasets (one with 800 variables and another with 1500 variables) measured on the same samples. I am trying to determine which of the two datasets clusters my samples better (i.e., yields clusters that are more compact and better separated from each other).
I ended up using UMAP for dimensionality reduction, followed by HDBSCAN on the UMAP coordinates. The pipeline consists of:

1. Run UMAP with different hyperparameter combinations, repeating each combination 100 times.
2. Run HDBSCAN on each resulting embedding (every repeat of every combination).
3. Calculate the Davies-Bouldin index for each HDBSCAN result.
4. Keep the UMAP parameter combination giving the lowest average DB index.
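
For reference, the scoring and selection steps can be sketched roughly as below. This is a minimal pure-Python illustration under stated assumptions, not the exact code I ran: the function names `davies_bouldin` and `best_combination` are hypothetical, the parameter tuples in the usage note are made up, and skipping points labelled `-1` (the label HDBSCAN gives to noise) is an assumption about how noise should be treated. In practice scikit-learn's `davies_bouldin_score` could be used instead.

```python
import math
from collections import defaultdict
from statistics import mean


def davies_bouldin(points, labels):
    """Davies-Bouldin index (lower is better) for labelled points.

    Assumption: points labelled -1 (HDBSCAN noise) are ignored, and
    cluster centroids are distinct (otherwise the ratio is undefined).
    """
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        if label != -1:
            clusters[label].append(point)
    if len(clusters) < 2:
        return math.inf  # the index is undefined for fewer than two clusters

    # Centroid of each cluster and mean distance of its members to it.
    centroids = {k: [mean(col) for col in zip(*pts)] for k, pts in clusters.items()}
    scatter = {
        k: mean(math.dist(p, centroids[k]) for p in pts)
        for k, pts in clusters.items()
    }

    # For each cluster, take its worst (largest) similarity ratio to any other.
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return mean(worst)


def best_combination(scores_by_params):
    """Return the parameter combination with the lowest mean DB index.

    scores_by_params maps a hashable parameter combination to the list of
    DB indices obtained over its repeated runs.
    """
    averages = {params: mean(scores) for params, scores in scores_by_params.items()}
    return min(averages, key=averages.get), averages
```

As a usage example, `best_combination({(15, 30): [0.5, 0.7], (30, 50): [0.3, 0.4]})` would pick `(30, 50)`, where each tuple stands for a hypothetical (n_neighbors, n_components) pair and each list holds the DB indices from its repeats.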

My problem is the number of components. After the computation, the best clustering comes from a UMAP with 50 components for one dataset and 30 components for the other. For HDBSCAN this is not a problem, since it seems reasonable that more components can give the clustering algorithm more information.

However, I am not sure how to plot the UMAP with this many components. I have been using the first two components for the representation, which does in fact separate the samples according to my HDBSCAN results, but is this correct? Am I properly reflecting the separation between samples using components 1 and 2, or are the remaining components equally important and likely to lead to different representations?

Thanks in advance!

Replies: 0 comments
