I can comment on (1): the centroids we've found are helpful for lining up the embedding spaces. It's possible they aren't needed, but if you end up with two very different clusters, one for chunks and one for questions, they are what lines them up. Mathematically, the queries share traits that make them similar to one another, such as all containing a "?" or all being one-sentence questions. UMAP picks this up, so we remove that bias from the embedding space: the centroid represents the core idea common to all the query embeddings, and subtracting it lets the questions be aligned on top of the chunks more easily.
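A minimal sketch of that centering step with NumPy (array shapes and variable names here are illustrative, not from the actual pipeline):

```python
import numpy as np

# Toy embeddings: 4 queries and 6 chunks in a 3-dim space (illustrative only)
rng = np.random.default_rng(0)
query_embeddings = rng.normal(size=(4, 3))
chunk_embeddings = rng.normal(size=(6, 3))

# The centroid captures what is common across all embeddings in each set
query_centroid = query_embeddings.mean(axis=0)
chunk_centroid = chunk_embeddings.mean(axis=0)

# Subtracting each set's centroid removes its shared bias,
# so queries and chunks can be overlaid before running UMAP
centered_queries = query_embeddings - query_centroid
centered_chunks = chunk_embeddings - chunk_centroid
```

After centering, each set has approximately zero mean, so the "question-ness" shared by all queries no longer pushes them into their own cluster.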
Thanks!
if you have any guidance on the other points, I would also really appreciate it
This is actually for a blog post I am creating for your team to publish. It involves using Arize AI with VectorFlow for ingestion
Hey David G. Have you checked out the API docs for the schema? https://docs.arize.com/phoenix/api/dataset-and-schema#phoenix.schema You just have to map your dataframe columns to the right parts of the schema. Let us know if you have any trouble!
hey Mikyo, yeah I checked the docs and I saw a discrepancy. The journal I linked above has these fields:
The columns of the dataframe are:
:id.id:: the query ID
:timestamp.iso_8601:: the time at which the query was made
:feature.text:prompt: the query text
:feature.[float].embedding:prompt: the embedding representation of the query
:prediction.text:response: the final response presented to the user
:feature.[str].retrieved_document_ids:prompt: the list of IDs of the retrieved documents
:feature.[float].retrieved_document_scores:prompt: the list of cosine similarities between the query and retrieved documents
Plus:
:tag.float:user_feedback: approval or rejection from the user (-1 means thumbs down, +1 means thumbs up)
:tag.str:openai_relevance_0: a binary classification (relevant vs. irrelevant) by GPT-4 predicting whether the first retrieved document is relevant to the query
:tag.str:openai_relevance_1: a binary classification (relevant vs. irrelevant) by GPT-4 predicting whether the second retrieved document is relevant to the query

But the docs have:
class Schema(
prediction_id_column_name: Optional[str] = None,
timestamp_column_name: Optional[str] = None,
feature_column_names: Optional[List[str]] = None,
tag_column_names: Optional[List[str]] = None,
prediction_label_column_name: Optional[str] = None,
prediction_score_column_name: Optional[str] = None,
actual_label_column_name: Optional[str] = None,
actual_score_column_name: Optional[str] = None,
prompt_column_names: Optional[EmbeddingColumnNames] = None,
response_column_names: Optional[EmbeddingColumnNames] = None,
embedding_feature_column_names: Optional[Dict[str, EmbeddingColumnNames]] = None,
excluded_column_names: Optional[List[str]] = None,
)

which appears to have some overlap but not be identical. So with my query list that I need to transform into a dataset, I just want to understand which column headings must be provided and what values they should have.
David G. Yeah, so if your data conforms to the naming convention above, you don't have to build a schema: the columns are self-describing according to a spec called OpenInference. But if you just have a plain dataframe, you can explore the vectors by defining a schema. Here's a simpler example (https://colab.research.google.com/github/Arize-ai/phoenix/blob/main/tutorials/llm_summarization_tutorial.ipynb). So if you already have a dataframe, you can map its columns to the schema like so:
schema = px.Schema(
tag_column_names=[
"your_custom_tag_column",
],
prompt_column_names=px.EmbeddingColumnNames(
vector_column_name="query_embedding", raw_data_column_name="query_text"
),
response_column_names=px.EmbeddingColumnNames(
raw_data_column_name="output_column_name"
),
)

where the strings are the columns in your dataframe. Hope that helps.
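For context, here's a toy dataframe whose column names line up with that schema (all names and values are illustrative placeholders, not required names):

```python
import pandas as pd

# Toy dataframe matching the example schema above; each row is one query.
# Column names are arbitrary as long as they match what the schema declares.
df = pd.DataFrame(
    {
        "your_custom_tag_column": [1.0, -1.0],  # e.g. user feedback
        "query_text": ["What is VectorFlow?", "How do I ingest documents?"],
        "query_embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
        "output_column_name": ["VectorFlow is ...", "You ingest by ..."],
    }
)
```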
That makes sense. I was actually overcomplicating it. It turns out I can even do something like this:
from llama_index.callbacks.open_inference_callback import as_dataframe
query_data_buffer = callback_handler.flush_query_data_buffer()
query_df = as_dataframe(query_data_buffer)

and it will conform to the schema given in the journal. It just requires some slight tweaks to calculate the centroids.
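The centroid tweak could look something like this (a sketch using a toy stand-in for query_df; it assumes the embedding column holds one vector per row, using the column name from the journal above):

```python
import numpy as np
import pandas as pd

# Toy stand-in for query_df with the journal's embedding column name;
# real embeddings would be higher-dimensional
query_df = pd.DataFrame(
    {":feature.[float].embedding:prompt": [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]}
)

# Stack the per-row vectors into a matrix and average to get the centroid
embeddings = np.stack(query_df[":feature.[float].embedding:prompt"].to_numpy())
query_centroid = embeddings.mean(axis=0)

# Centering each embedding removes the shared bias across queries
centered = embeddings - query_centroid
```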
