Questions on Using Phoenix for RAG System Benchmarking
Hi all! After starting to use Phoenix (specifically run_experiment()) to benchmark our agentic RAG system on a golden dataset, a few questions came up:
- 1.
We are storing the input parameters associated with an experiment (e.g. model ID, chunking strategy, system prompt) in the Phoenix experiment metadata. Is this what you would recommend, or is there a better place to put them? I'm also curious whether it is possible to diff the metadata to track changes between experiments?
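To make the diff question concrete, this is the kind of thing I have in mind — just plain Python over two metadata dicts, not any Phoenix API, and the metadata keys/values here are made up for illustration:

```python
def diff_metadata(old: dict, new: dict) -> dict:
    """Return keys whose values changed between two experiment metadata dicts,
    mapped to (old_value, new_value) pairs."""
    keys = set(old) | set(new)
    return {
        k: (old.get(k), new.get(k))
        for k in keys
        if old.get(k) != new.get(k)
    }

# Hypothetical metadata for two experiment runs
exp_a = {"model_id": "gpt-4o", "chunk_size": 512, "system_prompt": "v1"}
exp_b = {"model_id": "gpt-4o", "chunk_size": 1024, "system_prompt": "v2"}

changes = diff_metadata(exp_a, exp_b)
# changes → {"chunk_size": (512, 1024), "system_prompt": ("v1", "v2")}
```

If Phoenix already surfaces something like this between two experiments in the UI or client, that would save us rolling our own.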
- 2.
Is there any functionality around labelling, i.e. being able to run an experiment over a subset of the examples in a dataset, or a subset of evaluators for certain examples? Perhaps the best way to handle this would be via the metadata, or via the dataset we upload?
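The workaround I can see is tagging examples ourselves and filtering before running — a sketch with hypothetical records (the "labels" field is our own convention, not a Phoenix feature):

```python
# Hypothetical golden-dataset records; "labels" is a field we would add
# ourselves, not something Phoenix defines.
examples = [
    {"question": "Q1", "labels": {"smoke", "multi-hop"}},
    {"question": "Q2", "labels": {"smoke"}},
    {"question": "Q3", "labels": {"long-context"}},
]

def subset(examples: list[dict], label: str) -> list[dict]:
    """Select only the examples carrying a given label."""
    return [ex for ex in examples if label in ex["labels"]]

# Run the experiment over just the "smoke" subset
smoke_set = subset(examples, "smoke")  # Q1 and Q2
```

But if labelling is supported natively (or via example metadata filters) I'd rather not maintain this.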
- 3.
What is the best way to get run_experiment results into Arize? I have been trying to piece together a dataframe with label, score, explanation, and span_id columns from the output of run_experiment (the RanExperiment class), but I currently can't find the span_id, and it is getting a bit complex, so I wondered if there is a more straightforward way. It would be nice to see what Arize offers: sorting experiments by decreasing evaluation metric score, applying other filters to results, graphing results over time, and so on, none of which I believe is possible in Phoenix. I also have the Phoenix experiment data persisted in a SQLite database, in case that simplifies getting the data into Arize.
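For reference, what I've been piecing together looks roughly like the sketch below. The run records are invented for illustration (this is not the actual RanExperiment schema), and span_id is a None placeholder because I haven't found where to source it:

```python
# Hypothetical evaluation results extracted from a RanExperiment; the field
# names here are illustrative, not the real Phoenix schema.
runs = [
    {"example_id": "ex-1", "label": "correct", "score": 1.0,
     "explanation": "Answer matches the gold answer."},
    {"example_id": "ex-2", "label": "incorrect", "score": 0.0,
     "explanation": "Missed the relevant source document."},
]

rows = [
    {
        "label": r["label"],
        "score": r["score"],
        "explanation": r["explanation"],
        "span_id": r.get("span_id"),  # unknown where to find this -> None
    }
    for r in runs
]
# rows can then be wrapped with pandas.DataFrame(rows) for upload to Arize
```

If there's a supported export path from Phoenix experiments to Arize that fills in span_id for me, that's really what I'm after.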
- 4.
For RAG evaluations, I'm displaying a JSON object with the question, answer, and source documents in the Phoenix UI as the output for each example in an experiment, since that seems necessary to feed all the information into the evaluators (given that the evaluator input and the task output displayed in the UI are the same thing). Is that best practice? It is slightly harder to digest by eye when looking over the results, but not a huge deal.
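Concretely, my task output looks like this sketch (the field names are just what I chose, nothing Phoenix mandates):

```python
import json

def task_output(question: str, answer: str, docs: list[str]) -> dict:
    """Bundle everything the evaluators need into one JSON-serializable output."""
    return {
        "question": question,
        "answer": answer,
        "source_documents": docs,
    }

out = task_output("What is RAG?", "Retrieval-augmented generation.", ["doc-1"])
rendered = json.dumps(out, indent=2)  # this blob is what shows in the Phoenix UI
```

So the evaluators get everything they need, at the cost of the UI showing the whole blob instead of just the answer.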
Thanks!
