Hello!
I have an experiment with two different datasets (one with 800 variables and another with 1500 variables) measured on the same samples. I am trying to determine which of the two datasets clusters my samples better (i.e., yields clusters that are more compact and better separated from each other).
I ended up using UMAP for dimensionality reduction, followed by HDBSCAN on the UMAP coordinates. The pipeline consists of:
Run UMAP with different hyperparameter combinations, repeating each combination 100 times.
Run HDBSCAN on every iteration of every combination.
Calculate the Davies-Bouldin (DB) index for each HDBSCAN result.
Keep the UMAP parameter combination that gives the lowest average DB index.
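For reference, here is a minimal sketch of the selection loop described above. It is not my exact code, and several things in it are assumptions made only for illustration: the names `davies_bouldin` and `select_best`, the `embed` and `cluster` callables (in practice these would wrap something like `umap.UMAP(n_components=..., random_state=seed).fit_transform` and `hdbscan.HDBSCAN(...).fit_predict`), and the choice to ignore points that HDBSCAN labels as noise (`-1`). The DB index is written out in plain Python only to keep the sketch self-contained; `sklearn.metrics.davies_bouldin_score` could be used instead.

```python
import math
from collections import defaultdict
from statistics import mean


def davies_bouldin(points, labels):
    """Davies-Bouldin index; lower means more compact, better separated clusters.

    Points labelled -1 (HDBSCAN's noise label) are ignored. That is an
    assumption of this sketch, not a fixed part of the pipeline.
    """
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        if label != -1:
            clusters[label].append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters to compute the index")

    # Centroid and mean distance-to-centroid (scatter) for each cluster.
    centroids, scatters = {}, {}
    for label, members in clusters.items():
        centroid = [mean(axis) for axis in zip(*members)]
        centroids[label] = centroid
        scatters[label] = mean(math.dist(p, centroid) for p in members)

    # For each cluster keep its worst (largest) similarity ratio, then average.
    worst = [
        max(
            (scatters[i] + scatters[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return mean(worst)


def select_best(data, param_grid, embed, cluster, n_iter=100):
    """Return (params, mean DB index) for the combination with the lowest mean.

    `embed(data, seed=..., **params)` and `cluster(embedding)` are hypothetical
    callables standing in for the UMAP and HDBSCAN steps.
    """
    results = {}
    for params in param_grid:
        scores = []
        for seed in range(n_iter):
            embedding = embed(data, seed=seed, **params)
            labels = cluster(embedding)
            scores.append(davies_bouldin(embedding, labels))
        results[tuple(sorted(params.items()))] = mean(scores)
    best = min(results, key=results.get)
    return dict(best), results[best]
```

Here `param_grid` would be a list of dictionaries such as `{"n_components": 30, "n_neighbors": 15}`, one per hyperparameter combination. A run in which HDBSCAN finds fewer than two clusters raises an error in this sketch rather than being skipped.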
My problem is the number of components. After the computation, the best clustering comes from a UMAP with 50 components for one dataset and 30 components for the other. For HDBSCAN this is not a problem, because it seems plausible that more components give the clustering algorithm more information.
However, I am not sure how to plot the UMAP with this many components. I have been using the first two components for the representation, and they do separate the samples according to my HDBSCAN results, but is this correct? Am I properly reflecting the separation between samples using components 1 and 2, or are the remaining components equally important, so that other pairs would lead to different representations?
Thanks in advance!