Understanding UMAP and HDBSCAN for Embedding Drift Calculation

Jan 09, 2024 06:10 PM

dropping Sylla C.’s question in this channel instead:

Hello, congratulations and thank you very much for the exceptional work you are doing for the machine learning community. I want to understand the embedding drift calculation better, and I have a simple question:

1) To do dimension reduction with UMAP and clustering with HDBSCAN, do you use the concatenation of the two datasets, namely the reference dataset and the primary dataset?
2) Or do you run UMAP and HDBSCAN only on the reference dataset and then do a prediction on the primary dataset?

Another way to put my question is, which of these do you do:

1) ds = concat(train_ds, prod_ds), then M = HDBSCAN.fit_predict(UMAP.fit_transform(ds))
2) M = HDBSCAN.fit_predict(UMAP.fit_transform(train_ds)), then M.predict(prod_ds)

Thank you for your answers!
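For anyone following along, here is a minimal Python sketch of the two options in the question, so the difference is concrete. It does not say which one the drift calculation actually uses. The function names (`fit_on_concatenation`, `fit_on_reference_then_predict`, `make_real_models`) and the parameter values are hypothetical, chosen only for illustration. The sketch assumes the `umap-learn` and `hdbscan` packages and inputs that are arrays or lists of rows. One detail: the `hdbscan` package has no `predict` method on a fitted clusterer, so option 2 uses `hdbscan.approximate_predict`, which requires fitting with `prediction_data=True`.

```python
def fit_on_concatenation(train, prod, reducer, clusterer):
    """Option 1: fit the reducer and clusterer once on reference + primary rows."""
    combined = list(train) + list(prod)  # stack the rows of both datasets
    labels = clusterer.fit_predict(reducer.fit_transform(combined))
    n_train = len(train)
    # Split the joint labels back into the two datasets.
    return list(labels[:n_train]), list(labels[n_train:])


def fit_on_reference_then_predict(train, prod, reducer, clusterer, predict):
    """Option 2: fit on the reference rows only, then assign the primary rows."""
    train_labels = clusterer.fit_predict(reducer.fit_transform(train))
    # Project primary rows into the space learned from the reference data.
    prod_low_dim = reducer.transform(prod)
    prod_labels = predict(clusterer, prod_low_dim)
    return list(train_labels), list(prod_labels)


def make_real_models():
    """Build UMAP/HDBSCAN objects (hypothetical parameter values)."""
    import hdbscan  # assumption: the `hdbscan` package is installed
    import umap     # assumption: the `umap-learn` package is installed

    reducer = umap.UMAP(n_components=2, random_state=0)
    # prediction_data=True is needed for approximate_predict later on.
    clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True)

    def predict(fitted_clusterer, points):
        labels, _strengths = hdbscan.approximate_predict(fitted_clusterer, points)
        return labels

    return reducer, clusterer, predict
```

With the real libraries you would call `reducer, clusterer, predict = make_real_models()` and pass those into either function, using a fresh set of models for each option. In option 1 the clusters can be shaped by the primary data; in option 2 they are fixed by the reference data, and primary points that fit no reference cluster get the noise label `-1`.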