Visualizing Query-Corpus Links in UMAP with Phoenix and Arize
Hi all! 👋 I'm working on embedding analysis for a RAG system and want to visualize the retrieval relationships between my query embeddings and corpus embeddings in UMAP (similar to what's shown in the Arize tutorials with white lines connecting queries to retrieved documents).

What I have:
- Query embeddings from traced retrievals (OpenTelemetry spans)
- Corpus embeddings stored in LanceDB
- retrieved_document_ids for each query linking to specific corpus document_ids
- Both datasets loaded into Phoenix via px.launch_app(primary=queries, corpus=corpus)

What I'm trying to do: Show visual connections/links between query points and their retrieved corpus documents in the UMAP visualization, essentially mapping which queries retrieved which documents.

Current setup:
import pandas as pd
import phoenix as px

query_df = pd.DataFrame({
    'question': [...],
    'vector': [...],  # 1024-dim embeddings
    'retrieved_document_ids': [['doc_uuid_1', 'doc_uuid_2'], ...]  # Links!
})
corpus_df = pd.DataFrame({
    'document_id': [...],  # UUIDs matching retrieved_document_ids
    'text': [...],
    'vector': [...]  # 1024-dim embeddings
})
# query_schema / corpus_schema are defined elsewhere (not shown)
session = px.launch_app(
    primary=px.Inferences(query_df, query_schema, "queries"),
    corpus=px.Inferences(corpus_df, corpus_schema, "corpus")
)

Questions:
1. Does Phoenix support visualizing retrieval links in UMAP, or is this an Arize cloud-only feature?
2. If Phoenix supports it, what's the correct schema/data structure to enable the link visualization?
3. Should retrieved_document_ids be in a specific schema field, or is there a different way to specify retrieval relationships?

I'm getting an error where Phoenix tries to convert the retrieved_document_ids (UUID strings) to floats for UMAP computation, which suggests I may be structuring the data incorrectly. Any guidance would be much appreciated!

For reference, I've read:
https://phoenix.arize.com/evaluating-and-analyzing-your-rag-pipeline-with-ragas-and-phoenix/
https://arize.com/docs/ax/evaluate/evaluation-concepts/retrieval-evaluation
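
In case it helps rule out a data problem on my side: the links I'm describing only make sense if every id in retrieved_document_ids matches a document_id in the corpus. Below is a minimal sketch of that check in plain Python. It does not use Phoenix at all; the helper name find_unresolved_links and the toy ids are hypothetical and only illustrate the structure from my setup above.

```python
from typing import Dict, List, Set


def find_unresolved_links(
    retrieved_ids_per_query: List[List[str]],
    corpus_ids: List[str],
) -> Dict[int, List[str]]:
    """Map query row index -> retrieved ids that are missing from the corpus."""
    known: Set[str] = set(corpus_ids)
    unresolved: Dict[int, List[str]] = {}
    for row, retrieved in enumerate(retrieved_ids_per_query):
        # Keep only the ids that have no matching corpus document_id.
        missing = [doc_id for doc_id in retrieved if doc_id not in known]
        if missing:
            unresolved[row] = missing
    return unresolved


# Hypothetical toy data mirroring the two columns described in the post.
retrieved_document_ids = [["doc_uuid_1", "doc_uuid_2"], ["doc_uuid_3"]]
document_ids = ["doc_uuid_1", "doc_uuid_2"]

unresolved = find_unresolved_links(retrieved_document_ids, document_ids)
print(unresolved)  # {1: ['doc_uuid_3']}
```

With real data, the two arguments would be something like query_df['retrieved_document_ids'].tolist() and corpus_df['document_id'].tolist(). An empty result means every query-to-document link resolves, so any remaining problem would be in how the schema is declared rather than in the ids themselves.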
