Is running evals as part of an experiment the only way to view them in Phoenix?
No, evaluations (evals) in Phoenix can be viewed both as part of experiments and as annotations on traces, spans, or documents. If you have tracing enabled, you can log evaluations directly to traces and view them in the UI as annotations. For standalone datasets without tracing, you can use the Datasets & Experiments feature, which provides a UI for evaluation results without requiring experiments in the traditional sense. Evaluations are a type of annotation and can be attached and viewed independently of experiments as well as within them, depending on your workflow and data source type.

Would you like more detail on how to log and view evals outside of experiments?

Sources:
- https://github.com/Arize-ai/phoenix/issues/7922
- https://arize.com/docs/phoenix/evaluation
- https://arize.com/docs/phoenix/concepts/llm-evals/#evaluation
💡 Hint: Mention RunLLM in the thread for followups.
I'm running

Client().log_evaluations(
    SpanEvaluations(
        dataframe=results_df,
        eval_name="test-eval",
    ),
)

but don't see where the evals should be. Adding project_name gave me a "this is no longer a supported keyword".
I'd like to define golden datasets in code, run evals on them, and then see the results. I think this probably maps to setting up an experiment, right?
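That workflow can be sketched roughly as follows. This is a hedged sketch based on the Phoenix datasets/experiments docs, not code from this thread: the dataset name "golden-qa" is a placeholder, the Phoenix calls are shown in comments because they need a running Phoenix server, and the task and evaluator are trivial stand-ins.

```python
import pandas as pd

# A golden dataset defined in code: inputs plus expected outputs.
golden_df = pd.DataFrame(
    [
        {"question": "Where is the Eiffel Tower located?",
         "answer": "Paris, France"},
    ]
)

def task(example):
    # Stand-in for the application under test (e.g. an LLM call).
    return "Paris, France"

def exact_match(output, expected):
    # Experiment evaluators are plain callables that score one example.
    return float(output == expected["answer"])

# With a Phoenix server running, the dataset would be uploaded and the
# experiment run roughly like this (hedged; verify against the experiments guide):
#   import phoenix as px
#   from phoenix.experiments import run_experiment
#   dataset = px.Client().upload_dataset(
#       dataset_name="golden-qa",
#       dataframe=golden_df,
#       input_keys=["question"],
#       output_keys=["answer"],
#   )
#   run_experiment(dataset, task, evaluators=[exact_match],
#                  experiment_name="golden-run")
```

The experiment results (task outputs plus evaluator scores per example) then show up under the dataset in the Phoenix UI.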
🔒 [private user] yes - I'm just working through the tutorial for now. Here is my full script:
from phoenix.evals import HallucinationEvaluator, OpenAIModel, QAEvaluator, run_evals
import pandas as pd
from phoenix import Client
from phoenix.trace import SpanEvaluations
import os
from dotenv import load_dotenv
load_dotenv()
def generate_evals_df() -> pd.DataFrame:
    df = pd.DataFrame(
        [
            {
                "span_id": "123",
                "reference": "The Eiffel Tower is located in Paris, France. It was constructed in 1889 as the entrance arch to the 1889 World's Fair.",
                "query": "Where is the Eiffel Tower located?",
                "response": "The Eiffel Tower is located in Paris, France.",
            },
            # {
            #     "reference": "The Great Wall of China is over 13,000 miles long. It was built over many centuries by various Chinese dynasties to protect against nomadic invasions.",
            #     "query": "How long is the Great Wall of China?",
            #     "response": "The Great Wall of China is approximately 13,171 miles (21,196 kilometers) long.",
            # },
            # {
            #     "reference": "The Amazon rainforest is the largest tropical rainforest in the world. It covers much of northwestern Brazil and extends into Colombia, Peru and other South American countries.",
            #     "query": "What is the largest tropical rainforest?",
            #     "response": "The Amazon rainforest is the largest tropical rainforest in the world. It is home to the largest number of plant and animal species in the world.",
            # },
        ]
    )
    # We have to make some minor changes to our dataframe to use the column names
    # expected by our evaluators:
    # for `hallucination_evaluator` the input df needs columns 'output', 'input', 'context';
    # for `qa_evaluator` the input df needs columns 'output', 'input', 'reference'.
    df["context"] = df["reference"]
    df.rename(columns={"query": "input", "response": "output"}, inplace=True)
    assert all(
        column in df.columns for column in ["output", "input", "context", "reference"]
    )
    return df
def run():
    eval_model = OpenAIModel(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))
    hallucination_evaluator = HallucinationEvaluator(eval_model)
    qa_evaluator = QAEvaluator(eval_model)
    # Run the evaluators; each evaluator will return a dataframe with evaluation results.
    # We upload the evaluation results to Phoenix in the next step.
    df = generate_evals_df()
    hallucination_eval_df, qa_eval_df = run_evals(
        dataframe=df,
        evaluators=[hallucination_evaluator, qa_evaluator],
        provide_explanation=True,
    )
    results_df = df.copy()
    # results_df["hallucination_eval"] = hallucination_eval_df["label"]
    # results_df["hallucination_explanation"] = hallucination_eval_df["explanation"]
    results_df["label"] = qa_eval_df["label"]
    results_df["explanation"] = qa_eval_df["explanation"]
    Client().log_evaluations(
        SpanEvaluations(
            dataframe=results_df,
            eval_name="test-eval",
        ),
        project_name="my-llm-app",
    )

if __name__ == "__main__":
    run()

🔒 [private user] for posterity
Keyword argument `project_name` is no longer necessary and is ignored.

no worries!
Also, I don't understand how to differentiate evals from evaluators.
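One way to see the distinction this thread keeps circling: an evaluator is the thing that judges (a HallucinationEvaluator instance, or a plain function), while an eval is the judgement it produces (the label/score/explanation rows in the dataframes that run_evals returns). A minimal, Phoenix-free illustration, with hypothetical names:

```python
# Evaluator: a callable that judges one example.
def contains_answer(output: str, reference: str) -> dict:
    ok = output in reference
    # Eval: the result record the evaluator produces.
    return {"label": "factual" if ok else "hallucinated", "score": float(ok)}

eval_result = contains_answer(
    "Paris", "The Eiffel Tower is located in Paris, France."
)
# eval_result is the eval: {'label': 'factual', 'score': 1.0}
```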
Got it, I just followed the experiments guide. What is best practice for running evals? I noticed that I cannot use the evals from the evals package when listing evaluators in run_experiment (though my syntax may be wrong). The following does not work:
from phoenix.evals import HallucinationEvaluator, OpenAIModel, QAEvaluator

run_experiment(
    dataset,
    task,
    experiment_name="initial-experiment",
    evaluators=[
        jaccard_similarity,
        accuracy,
        hallucination_evaluator,
        qa_evaluator,
    ],
)

TypeError: <phoenix.evals.evaluators.HallucinationEvaluator object at 0x30c3079e0> is not a callable object
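The TypeError arises because the evaluators in phoenix.evals are dataframe-oriented objects consumed by run_evals, while run_experiment expects plain callables that score one example at a time. A hedged workaround is a thin wrapper that builds a one-row dataframe per example. The real run_evals call is sketched in a comment (it needs a configured model and API key), with a trivial substring check standing in so the shape runs on its own:

```python
import pandas as pd

def wrap_hallucination_evaluator():
    def hallucination(input, output, expected):
        # Shape a one-row dataframe with the column names the evaluator
        # expects (same renaming as in the script above).
        df = pd.DataFrame(
            [{"input": input, "output": output,
              "context": expected, "reference": expected}]
        )
        # With a configured phoenix.evals evaluator, the judgement would be
        # roughly (hedged; same run_evals signature as used in this thread):
        #   (result_df,) = run_evals(dataframe=df,
        #                            evaluators=[hallucination_evaluator],
        #                            provide_explanation=True)
        #   return result_df.iloc[0]["score"]
        # Stand-in judgement so this sketch is runnable:
        return float(df.loc[0, "output"] in df.loc[0, "reference"])
    return hallucination
```

run_experiment(dataset, task, evaluators=[wrap_hallucination_evaluator()], ...) would then receive a plain callable, avoiding the "is not a callable object" error.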
Had to edit the function to only return the score.

ValueError: Unsupported evaluation result type: <class 'dict'>

Is there a way to allow the label and explanation as well?
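The ValueError suggests a bare dict is not an accepted return type for an experiment evaluator. One hedged option is returning a structured result object instead of a float; Phoenix has an EvaluationResult type for this (shown in a comment since its exact import path and signature should be checked against the experiments guide), and a local dataclass mirrors its fields here so the sketch runs standalone:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Result:
    # Local stand-in mirroring the score/label/explanation fields.
    score: float
    label: Optional[str] = None
    explanation: Optional[str] = None

def qa_correctness(output, expected):
    ok = output.strip() == expected.strip()
    # With phoenix installed, this would plausibly be (hedged, verify first):
    #   from phoenix.experiments.types import EvaluationResult
    #   return EvaluationResult(score=float(ok), label=..., explanation=...)
    return Result(
        score=float(ok),
        label="correct" if ok else "incorrect",
        explanation="exact string match" if ok else "outputs differ",
    )
```

A structured return like this should let the label and explanation appear alongside the score in the experiment view, rather than forcing the evaluator down to a single float.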
