David G.

Commented on Aligning Embedding Spaces with Centroids for Bette... · Posted in Phoenix Support

But I was eventually able to get Phoenix running and can now visualize my data and the queries! Thank you so much for your help!!

David G.

That makes sense. I was actually overcomplicating it. It turns out I can even do something like this:

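from llama_index.callbacks.open_inference_callback import as_dataframe

# callback_handler is the callback handler already attached to the query engine
# Flush the buffered query records and convert them into a dataframe
query_data_buffer = callback_handler.flush_query_data_buffer()
query_df = as_dataframe(query_data_buffer)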

and it will conform to the schema given in the journal. It just requires some slight tweaks to calculate the centroids.
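For anyone reading along, here is a minimal sketch of what the centroid step could look like. It assumes the tweak is the usual centering approach: take the mean of the embedding vectors, subtract it from each vector, and rescale to unit length. The helper name `center_embeddings` and the toy dataframe are made up for illustration; only the embedding column name comes from the thread.

```python
import numpy as np
import pandas as pd

# Column name taken from the schema discussed in this thread.
EMBEDDING_COLUMN = ":feature.[float].embedding:prompt"


def center_embeddings(df: pd.DataFrame, column: str = EMBEDDING_COLUMN) -> pd.DataFrame:
    """Hypothetical helper: subtract the centroid and rescale each vector to unit length."""
    vectors = np.stack(df[column].to_numpy())   # shape: (n_rows, dim)
    centroid = vectors.mean(axis=0)             # mean vector of this dataframe
    centered = vectors - centroid
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                     # avoid dividing by zero
    out = df.copy()
    out[column] = pd.Series(list(centered / norms), index=df.index)
    return out


# Toy data standing in for real query embeddings.
toy_df = pd.DataFrame(
    {EMBEDDING_COLUMN: [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]}
)
centered_df = center_embeddings(toy_df)
```

The same helper would be applied separately to the query dataframe and the database dataframe, so that each set of embeddings is centered on its own centroid.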

David G.

Hey Mikyo, yeah, I checked the docs and saw a discrepancy. The journal I linked above has these fields:

The columns of the dataframe are:

:id.id:: the query ID
:timestamp.iso_8601:: the time at which the query was made
:feature.text:prompt: the query text
:feature.[float].embedding:prompt: the embedding representation of the query
:prediction.text:response: the final response presented to the user
:feature.[str].retrieved_document_ids:prompt: the list of IDs of the retrieved documents
:feature.[float].retrieved_document_scores:prompt: the list of cosine similarities between the query and retrieved documents

Plus:
:tag.float:user_feedback: approval or rejection from the user (-1 means thumbs down, +1 means thumbs up)
:tag.str:openai_relevance_0: a binary classification (relevant vs. irrelevant) by GPT-4 predicting whether the first retrieved document is relevant to the query
:tag.str:openai_relevance_1: a binary classification (relevant vs. irrelevant) by GPT-4 predicting whether the second retrieved document is relevant to the query

But the docs have:

class Schema(
    prediction_id_column_name: Optional[str] = None,
    timestamp_column_name: Optional[str] = None,
    feature_column_names: Optional[List[str]] = None,
    tag_column_names: Optional[List[str]] = None,
    prediction_label_column_name: Optional[str] = None,
    prediction_score_column_name: Optional[str] = None,
    actual_label_column_name: Optional[str] = None,
    actual_score_column_name: Optional[str] = None,
    prompt_column_names: Optional[EmbeddingColumnNames] = None,
    response_column_names: Optional[EmbeddingColumnNames] = None,
    embedding_feature_column_names: Optional[Dict[str, EmbeddingColumnNames]] = None,
    excluded_column_names: Optional[List[str]] = None,
)

which appears to overlap somewhat but is not identical. So for the query list that I need to transform into a dataset, I just want to understand which column headings must be provided and what values they should have.
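To make the comparison between the two listings concrete, here is a rough sketch that pairs some of the Schema arguments with the dataframe columns and then reports which of those columns a given dataframe is missing. The pairing itself is a guess for illustration, not something the docs confirm. The keys `vector_column_name` and `raw_data_column_name` are what I believe the fields of `EmbeddingColumnNames` are called, and the helper names and the one-row dataframe are made up.

```python
import pandas as pd

# Assumed pairing of Schema arguments with the dataframe columns listed above.
# This is an illustration only; the docs excerpt does not state this pairing.
schema_kwargs = {
    "prediction_id_column_name": ":id.id:",
    "timestamp_column_name": ":timestamp.iso_8601:",
    "prompt_column_names": {
        "vector_column_name": ":feature.[float].embedding:prompt",
        "raw_data_column_name": ":feature.text:prompt",
    },
    "tag_column_names": [
        ":tag.float:user_feedback",
        ":tag.str:openai_relevance_0",
        ":tag.str:openai_relevance_1",
    ],
}


def referenced_columns(kwargs):
    """Flatten the mapping into the list of dataframe columns it refers to."""
    columns = []
    for value in kwargs.values():
        if isinstance(value, str):
            columns.append(value)
        elif isinstance(value, dict):
            columns.extend(value.values())
        else:
            columns.extend(value)
    return columns


def missing_columns(df, kwargs):
    """Return the referenced columns that are absent from df."""
    return [c for c in referenced_columns(kwargs) if c not in df.columns]


# Hypothetical one-row dataframe that has only the ID and the query text.
my_query_df = pd.DataFrame({":id.id:": ["q-0"], ":feature.text:prompt": ["example query"]})
missing = missing_columns(my_query_df, schema_kwargs)
```

The response, retrieved document ID, and retrieved document score columns are left out of the mapping because nothing quoted here settles which Schema argument they belong to.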

David G.

This is actually for a blog post I am creating for your team to publish. It involves using Arize AI with VectorFlow for ingestion.

David G.

if you have any guidance on the other points, I would also really appreciate it

David G.

Thanks!

Posted in Phoenix Support

David G.

Questions About Llama Index Search and Retrieval Tutorial Steps

Hey, I am trying to follow the llama_index_search_and_retrieval_tutorial.ipynb notebook to build a way to visualize some Q&A over my dataset. I would like to do a very simple visualization of the embeddings and see where the queries are landing. My understanding is that this is possible with Phoenix. Based on this notebook, though, there are a few steps that I am wondering whether I could omit:

1. Generating the centroids in step five. Is this really necessary with a smaller dataset?
2. Running the LLM-assisted evaluations
3. Computing ranking metrics

Also, there is a chunk of code where a query set is downloaded and converted into a dataframe:

import pandas as pd

# Download the example query data and load it into a dataframe
query_df = pd.read_parquet(
    "http://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/llama-index/arize-docs/query_data_complete3.parquet",
)
query_df.head()

Since I have my own dataset, I was wondering whether the recommended schema for this dataset is listed somewhere or whether I should just copy it from the dataframe above. Thank you for your help!
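If copying the schema from the dataframe turns out to be the way to go, a small sketch like the one below would list each column with its dtype so the layout can be reproduced. The helper name `summarize_columns` and the two-column stand-in dataframe are hypothetical; with the real data, `query_df` from the snippet above would be passed in instead.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> dict:
    """Hypothetical helper: map each column name to its pandas dtype as a string."""
    return {name: str(dtype) for name, dtype in df.dtypes.items()}


# Stand-in dataframe with made-up columns, used here instead of the downloaded data.
example_df = pd.DataFrame({"query_text": ["example query"], "score": [0.5]})
summary = summarize_columns(example_df)
for name, dtype in summary.items():
    print(f"{name}: {dtype}")
```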
