Hi all! 👋 I'm working on embedding analysis for a RAG system and want to visualize the retrieval relationships between my query embeddings and corpus embeddings in UMAP (similar to what's shown in the Arize tutorials with white lines connecting queries to retrieved documents).

What I have:
- Query embeddings from traced retrievals (OpenTelemetry spans)
- Corpus embeddings stored in LanceDB
- retrieved_document_ids for each query linking to specific corpus document_ids
- Both datasets loaded into Phoenix via px.launch_app(primary=queries, corpus=corpus)

What I'm trying to do: Show visual connections/links between query points and their retrieved corpus documents in the UMAP visualization - essentially mapping which queries retrieved which documents.

Current setup:
import pandas as pd
import phoenix as px

query_df = pd.DataFrame({
    'question': [...],
    'vector': [...],  # 1024-dim embeddings
    'retrieved_document_ids': [['doc_uuid_1', 'doc_uuid_2'], ...],  # Links!
})
corpus_df = pd.DataFrame({
    'document_id': [...],  # UUIDs matching retrieved_document_ids
    'text': [...],
    'vector': [...],  # 1024-dim embeddings
})
session = px.launch_app(
    primary=px.Inferences(query_df, query_schema, "queries"),
    corpus=px.Inferences(corpus_df, corpus_schema, "corpus"),
)

Questions:
1. Does Phoenix support visualizing retrieval links in UMAP, or is this an Arize cloud-only feature?
2. If Phoenix supports it, what's the correct schema/data structure to enable the link visualization?
3. Should retrieved_document_ids be in a specific schema field, or is there a different way to specify retrieval relationships?

I'm getting an error where Phoenix tries to convert the retrieved_document_ids (UUID strings) to floats for UMAP computation, which suggests I may be structuring the data incorrectly. Any guidance would be much appreciated!

For reference, I've read:
- https://phoenix.arize.com/evaluating-and-analyzing-your-rag-pipeline-with-ragas-and-phoenix/
- https://arize.com/docs/ax/evaluate/evaluation-concepts/retrieval-evaluation
1. Phoenix (open source) does support visualizing retrieval links between queries and corpus documents in the UMAP embedding view; this is not an Arize-cloud-only feature. The white lines connecting queries to retrieved documents appear if your data is structured correctly and the retrieval relationships are specified as Phoenix expects. See the retrieval-and-inferences documentation and the RAG evaluation use case.

2. To enable link visualization, you must use the correct schema:
- In your query dataframe, the column containing the list of retrieved document IDs must be mapped to the context_retrieval_ids_column_name in the schema, and optionally, a column of relevance scores to context_retrieval_scores_column_name.
- Use RetrievalEmbeddingColumnNames in your schema for the primary (query) dataset, e.g.:
```
query_schema = px.Schema(
    prediction_id_column_name="id",
    prompt_column_names=px.RetrievalEmbeddingColumnNames(
        vector_column_name="vector",
        raw_data_column_name="question",
        context_retrieval_ids_column_name="retrieved_document_ids",
        # context_retrieval_scores_column_name="relevance_scores",  # optional
    ),
)
```
- The corpus dataset should have a schema mapping the document IDs and vectors, e.g.:
```
corpus_schema = px.Schema(
    id_column_name="document_id",
    document_column_names=px.EmbeddingColumnNames(
        vector_column_name="vector",
        raw_data_column_name="text",
    ),
)
```
- Make sure the `retrieved_document_ids` in the query dataframe exactly match the `document_id` values in the corpus dataframe (including type: string/UUID).

If you see Phoenix trying to convert UUIDs to floats, it likely means the schema is not set up with `context_retrieval_ids_column_name`, so Phoenix is treating the column as a feature instead of as retrieval links. See [retrieval-rag.md](https://github.com/Arize-ai/phoenix/blob/main/docs/how-to/define-your-schema/retrieval-rag.md) for a full example.

Would you like a step-by-step code example for your specific dataframes?
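As a quick sanity check that the IDs line up, something along these lines may help (a minimal sketch, assuming the query_df and corpus_df from your post):
```
# Every retrieved ID should exist in the corpus; unmatched IDs get no link in the UMAP view.
corpus_ids = set(corpus_df["document_id"])
retrieved_ids = {
    doc_id
    for ids in query_df["retrieved_document_ids"]
    for doc_id in ids
}

missing = retrieved_ids - corpus_ids
print(f"{len(missing)} retrieved IDs not found in corpus, e.g. {sorted(missing)[:5]}")
```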
Sources:
- [how-to/define-your-schema/retrieval-rag.md](https://github.com/Arize-ai/phoenix/blob/main/docs/how-to/define-your-schema/retrieval-rag.md)
- [retrieval-and-inferences documentation](https://github.com/Arize-ai/phoenix/blob/main/docs/section-learn/retrieval-and-infrences/retrieval-with-embeddings.md)
- [RAG evaluation use case](https://github.com/Arize-ai/phoenix/blob/main/docs/use-cases/rag-evaluation.md)
💡 Hint: Mention RunLLM in the thread for followups.
retrieval-with-embeddings.md had the info I needed, thanks!

Currently using two separate Phoenix instances and wondering if there's a better approach.

Current Setup:
- Docker Phoenix (port 6006): collecting OTEL traces from queries
  - Has query embeddings (from attributes.embedding.embeddings)
  - Has retrieval metadata (from attributes.retrieval.documents)
  - From what I can tell, does NOT have corpus embeddings
- Separate px.launch_app() (port 7007): UMAP visualization
  - Queries Docker Phoenix to extract query embeddings
  - Loads corpus embeddings from LanceDB
  - Uses RetrievalEmbeddingColumnNames with context_retrieval_ids_column_name to create links
  - Shows UMAP with query→corpus retrieval connections

My next questions:
1. Is there a way to use Docker Phoenix's built-in UMAP viewer instead of launching a separate instance?
   - I saw "Embeddings Tracing lets you select embedding spans and dive into the UMAP visualizer" - does this support corpus embeddings + retrieval links?
   - Can I log corpus embeddings to Docker Phoenix so everything's in one place?
2. What's the recommended development workflow for iterative testing?
   - Currently: run queries > kill UMAP viewer > restart to see new data
   - Is there an auto-refresh or live-update approach for embeddings visualisation?

To summarise: I want to know if I can use Docker Phoenix's built-in UMAP viewer instead of launching a separate instance. If two instances is the right approach, any tips for the refresh workflow, RunLLM?
It seems likely that you can use Docker Phoenix's built-in UMAP viewer for embedding analysis, including retrieval links, if both query and corpus embeddings are logged to the same Phoenix instance with the correct schema. You can log corpus embeddings to Docker Phoenix (e.g., via the Python client or OTEL traces) so that both queries and corpus are visualized together, enabling retrieval connections in the UMAP view (see retrieval-with-embeddings.md and Embeddings Analysis).

My understanding is that Phoenix does not currently support true live auto-refresh of the UMAP visualization; you typically need to relaunch or refresh the app to see new data. For iterative development, the recommended workflow is to log new data to the running Phoenix instance and then refresh the browser UI. If you use a persistent backend (e.g., SQLite or Postgres), you can avoid restarting the server between updates; a minimal configuration sketch is included after the sources below. See the Docker deployment docs and Embeddings Analysis.

Would you like more detail on logging corpus embeddings or on configuring persistent storage for smoother iteration?

Sources:
- retrieval-with-embeddings.md
- Embeddings Analysis
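One way to set up the persistent backend mentioned above (a minimal sketch; PHOENIX_WORKING_DIR and PHOENIX_SQL_DATABASE_URL are Phoenix configuration environment variables, the paths are placeholders, and for the Docker instance the same variables would instead be passed as container environment variables):
```
import os

# Assumption: give Phoenix a fixed working directory instead of the default
# temporary location, so data survives relaunches of the viewer process.
os.environ["PHOENIX_WORKING_DIR"] = os.path.expanduser("~/phoenix-umap-viewer")

# Or point Phoenix at an explicit database (SQLite file or Postgres):
# os.environ["PHOENIX_SQL_DATABASE_URL"] = "sqlite:////absolute/path/phoenix.db"

import phoenix as px  # set the env vars before importing/launching so they take effect

session = px.launch_app(port=7007)
```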
💡 Hint: Mention RunLLM in the thread for followups.
It seems likely that you can log corpus embeddings to Docker Phoenix by using the Python client and creating an Inferences object for your corpus, then sending it to the running Phoenix instance. The key is to use the correct schema (with document IDs and embedding vectors) and connect to your Docker Phoenix endpoint using the px.Client or px.launch_app methods (see retrieval-with-embeddings.md and quickstart/phoenix-inferences/README.md).

Here's a minimal example:
```
import phoenix as px
import pandas as pd

corpus_df = pd.DataFrame({
    "document_id": [...],  # your UUIDs
    "text": [...],
    "vector": [...],  # list of embedding vectors
})

corpus_schema = px.Schema(
    id_column_name="document_id",
    document_column_names=px.EmbeddingColumnNames(
        vector_column_name="vector",
        raw_data_column_name="text",
    ),
)

# Connect to your running Docker Phoenix instance
client = px.Client(endpoint="http://localhost:6006")
corpus_inferences = px.Inferences(corpus_df, corpus_schema, "corpus")
client.log_inferences(corpus_inferences)
```

This will log your corpus embeddings to the Docker Phoenix instance, making them available for UMAP visualization and retrieval link analysis.

Would you like more detail on schema setup or troubleshooting log_inferences?

Sources:
- retrieval-with-embeddings.md
- quickstart/phoenix-inferences/README.md
💡 Hint: Mention RunLLM in the thread for followups.
Thanks for the suggestions! Unfortunately the suggested approach doesn't work in Phoenix 12.6.1.

What Didn't Work

Suggested code:
client = px.Client(endpoint="http://localhost:6006")
corpus_inferences = px.Inferences(corpus_df, corpus_schema, "corpus")
client.log_inferences(corpus_inferences) # ← Method doesn't exist
Issue: client.log_inferences() doesn't exist in Phoenix 12.6.1. Available methods are:
- log_evaluations(), log_traces(), upload_dataset()
- query_spans(), get_spans_dataframe()
- No method to log Inferences to a running instance

Also tried:
- px.log_inferences() - doesn't exist
- px.launch_app(corpus=...) - corpus loads in the backend but doesn't appear in the /model UI
- px.launch_app(primary=..., corpus=...) - /model tab appears, but OTEL trace embeddings don't show in the UMAP view

Working Solution: 2-Instance Approach

Instance 1 (port 6006): Trace collector receiving OTEL spans with embeddings

Instance 2 (port 7007): UMAP viewer that:
1. Queries Instance 1 for trace spans
2. Extracts query embeddings from attributes.embedding.embeddings
3. Extracts retrieval links from attributes.retrieval.documents
4. Loads corpus from LanceDB
5. Launches a separate Phoenix instance with both as Inferences

Code (a sketch of the extraction helper is included after this block):
# Extract queries from the trace collector
import phoenix as px

client = px.Client(endpoint="http://localhost:6006")
spans = client.query_spans(project_name="my-project")

# Parse embeddings and retrieval links from spans
query_df = extract_queries_with_metadata(spans, source_page_to_id)

# Load corpus
corpus_df = load_from_lancedb()

# Define schemas with retrieval links
query_schema = px.Schema(
    prompt_column_names=px.RetrievalEmbeddingColumnNames(
        vector_column_name="vector",
        raw_data_column_name="question",
        context_retrieval_ids_column_name="retrieved_document_ids",  # Links!
    ),
)
corpus_schema = px.Schema(
    id_column_name="document_id",
    document_column_names=px.EmbeddingColumnNames(
        vector_column_name="vector",
        raw_data_column_name="text",
    ),
)

# Launch UMAP viewer
session = px.launch_app(
    port=7007,
    primary=px.Inferences(query_df, query_schema, "queries"),
    corpus=px.Inferences(corpus_df, corpus_schema, "corpus"),
)
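For reference, extract_queries_with_metadata is roughly along these lines (simplified sketch only; the payload key names such as embedding.text, embedding.vector, and document.id are assumptions about how the OpenInference span attributes get flattened, so adjust for your instrumentation):
```
import pandas as pd

def extract_queries_with_metadata(spans: pd.DataFrame, source_page_to_id: dict) -> pd.DataFrame:
    """Build a query dataframe (question, vector, retrieved_document_ids) from traced spans."""
    rows = []
    for _, span in spans.iterrows():
        embeddings = span.get("attributes.embedding.embeddings")
        documents = span.get("attributes.retrieval.documents")
        if not isinstance(embeddings, (list, tuple)) or len(embeddings) == 0:
            continue  # skip spans that carry no embedding payload
        if not isinstance(documents, (list, tuple)):
            documents = []
        rows.append({
            "question": embeddings[0].get("embedding.text"),  # assumed key name
            "vector": embeddings[0].get("embedding.vector"),  # assumed key name
            "retrieved_document_ids": [
                # map traced document references back to corpus UUIDs
                source_page_to_id.get(doc.get("document.id"), doc.get("document.id"))
                for doc in documents
            ],
        })
    return pd.DataFrame(rows)
```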
Result: http://localhost:7007/model shows UMAP with corpus + queries + retrieval links ✅

Current workflow:
# Run queries -> restart viewer to see updates
python rag_cli.py --trace query my-collection "question" --k 3
lsof -ti:7007 | xargs -r kill -9
python phoenix_umap_enhanced_metadata.py  # Re-extracts & relaunches

Core Question (open to be answered by all)

Is there a way in the Phoenix UI to visualise trace embeddings in UMAP, or is manual extraction the intended workflow? Looking at https://arize.com/docs/ax/evaluate/evaluation-concepts/retrieval-evaluation, even the official examples manually load embeddings with EmbeddingColumnNames.

What I'm hoping exists:
- A UI feature to visualize attributes.embedding.embeddings from traces in UMAP
- Auto-population: corpus from Inferences + queries from incoming OTEL traces
- Refresh browser to see updated UMAP (no script restart)

If manual extraction is the recommended pattern, that's totally fine! Just want to confirm I'm not missing a more elegant built-in approach.
