I can comment on (1): the centroids we've found are helpful for lining up the embedding spaces. It's possible they aren't needed, but if you end up with two very different clusters, one for chunks and one for questions, they are what lines them up. Mathematically, the queries share traits that make them similar to one another, such as all containing a "?" or all being one-sentence questions. UMAP picks this up, so we remove that bias from the embedding space: the centroid represents the core idea common to all the query embeddings, and subtracting it lets the questions be aligned on top of the chunks more easily.
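A minimal sketch of that centering step with NumPy (array shapes and variable names here are illustrative, not from the actual pipeline):

```python
import numpy as np

# Toy embeddings: 4 queries and 6 chunks in a 3-dim space (illustrative only)
rng = np.random.default_rng(0)
query_embeddings = rng.normal(size=(4, 3))
chunk_embeddings = rng.normal(size=(6, 3))

# The centroid captures what is common across all embeddings in each set
query_centroid = query_embeddings.mean(axis=0)
chunk_centroid = chunk_embeddings.mean(axis=0)

# Subtracting each set's centroid removes its shared bias,
# so queries and chunks can be overlaid before running UMAP
centered_queries = query_embeddings - query_centroid
centered_chunks = chunk_embeddings - chunk_centroid
```

After centering, each set has approximately zero mean, so the "question-ness" shared by all queries no longer pushes them into their own cluster.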
Thanks!
if you have any guidance on the other points, I would also really appreciate it
This is actually for a blog post I am creating for your team to publish. It involves using Arize AI with VectorFlow for ingestion
Hey David G. Have you checked out the API docs for the schema? https://docs.arize.com/phoenix/api/dataset-and-schema#phoenix.schema You just have to map your dataframe columns to the right parts of the schema. Let us know if you have any trouble!
hey Mikyo, yeah I checked the docs and I saw a discrepancy. The journal I linked above has these fields:
The columns of the dataframe are:
:id.id:: the query ID
:timestamp.iso_8601:: the time at which the query was made
:feature.text:prompt: the query text
:feature.[float].embedding:prompt: the embedding representation of the query
:prediction.text:response: the final response presented to the user
:feature.[str].retrieved_document_ids:prompt: the list of IDs of the retrieved documents
:feature.[float].retrieved_document_scores:prompt: the list of cosine similarities between the query and retrieved documents
Plus:
:tag.float:user_feedback: approval or rejection from the user (-1 means thumbs down, +1 means thumbs up)
:tag.str:openai_relevance_0: a binary classification (relevant vs. irrelevant) by GPT-4 predicting whether the first retrieved document is relevant to the query
:tag.str:openai_relevance_1: a binary classification (relevant vs. irrelevant) by GPT-4 predicting whether the second retrieved document is relevant to the query

But the docs have:
class Schema(
prediction_id_column_name: Optional[str] = None,
timestamp_column_name: Optional[str] = None,
feature_column_names: Optional[List[str]] = None,
tag_column_names: Optional[List[str]] = None,
prediction_label_column_name: Optional[str] = None,
prediction_score_column_name: Optional[str] = None,
actual_label_column_name: Optional[str] = None,
actual_score_column_name: Optional[str] = None,
prompt_column_names: Optional[EmbeddingColumnNames] = None,
response_column_names: Optional[EmbeddingColumnNames] = None,
embedding_feature_column_names: Optional[Dict[str, EmbeddingColumnNames]] = None,
excluded_column_names: Optional[List[str]] = None,
)

which appears to have some overlap but not be identical. So with my query list that I need to transform into a dataset, I just want to understand which column headings must be provided and what values they should have.
David G. Yeah, so if your data conforms to the naming convention above, you don't have to build a schema: the columns are self-describing according to a spec called OpenInference. But if you just have a plain dataframe, you can explore the vectors by defining a schema. Here's a simpler example (https://colab.research.google.com/github/Arize-ai/phoenix/blob/main/tutorials/llm_summarization_tutorial.ipynb). So if you already have a dataframe, you can map its columns to the schema like so:
schema = px.Schema(
tag_column_names=[
"your_custom_tag_column",
],
prompt_column_names=px.EmbeddingColumnNames(
vector_column_name="query_embedding", raw_data_column_name="query_text"
),
response_column_names=px.EmbeddingColumnNames(
raw_data_column_name="output_column_name"
),
)

where the strings are the columns in your dataframe. Hope that helps.
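For context, here's a toy dataframe whose column names line up with that schema (all names and values are illustrative placeholders, not required names):

```python
import pandas as pd

# Toy dataframe matching the example schema above; each row is one query.
# Column names are arbitrary as long as they match what the schema declares.
df = pd.DataFrame(
    {
        "your_custom_tag_column": [1.0, -1.0],  # e.g. user feedback
        "query_text": ["What is VectorFlow?", "How do I ingest documents?"],
        "query_embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
        "output_column_name": ["VectorFlow is ...", "You ingest by ..."],
    }
)
```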
That makes sense. I was actually overcomplicating it. It turns out I can even do something like this:
from llama_index.callbacks.open_inference_callback import as_dataframe
query_data_buffer = callback_handler.flush_query_data_buffer()
query_df = as_dataframe(query_data_buffer)

and it will conform to the schema given in the journal. It just requires some slight tweaks to calculate the centroids.
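The centroid tweak could look something like this (a sketch using a toy stand-in for query_df; it assumes the embedding column holds one vector per row, using the column name from the journal above):

```python
import numpy as np
import pandas as pd

# Toy stand-in for query_df with the journal's embedding column name;
# real embeddings would be higher-dimensional
query_df = pd.DataFrame(
    {":feature.[float].embedding:prompt": [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]}
)

# Stack the per-row vectors into a matrix and average to get the centroid
embeddings = np.stack(query_df[":feature.[float].embedding:prompt"].to_numpy())
query_centroid = embeddings.mean(axis=0)

# Centering each embedding removes the shared bias across queries
centered = embeddings - query_centroid
```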
