Is running evals as part of an experiment the only way to view them in Phoenix?
No, evaluations (evals) in Phoenix can be viewed both as part of experiments and as annotations on traces, spans, or documents. If you have tracing enabled, you can log evaluations directly to traces and view them in the UI as annotations. For standalone datasets without tracing, you can use the Datasets & Experiments feature, which provides a UI for evaluation results without requiring experiments in the traditional sense. Evaluations are a type of annotation and can be attached and viewed independently of experiments as well as within them, depending on your workflow and data source type.

Would you like more detail on how to log and view evals outside of experiments?

Sources:
- https://github.com/Arize-ai/phoenix/issues/7922
- https://arize.com/docs/phoenix/evaluation
- https://arize.com/docs/phoenix/concepts/llm-evals/#evaluation
💡 Hint: Mention RunLLM in the thread for followups.
I'm running

Client().log_evaluations(
    SpanEvaluations(
        dataframe=results_df,
        eval_name="test-eval",
    ),
)

but don't see where the evals should be. Adding project_name gave me a "this is no longer a supported keyword".
I'd like to define golden datasets in code, run evals on them, and then see the results. I think this probably maps to setting up an experiment, right?
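That workflow can be sketched roughly as follows. This is a hedged sketch based on the Phoenix datasets/experiments docs, not code from this thread: the dataset name "golden-qa" is a placeholder, the Phoenix calls are shown in comments because they need a running Phoenix server, and the task and evaluator are trivial stand-ins.

```python
import pandas as pd

# A golden dataset defined in code: inputs plus expected outputs.
golden_df = pd.DataFrame(
    [
        {"question": "Where is the Eiffel Tower located?",
         "answer": "Paris, France"},
    ]
)

def task(example):
    # Stand-in for the application under test (e.g. an LLM call).
    return "Paris, France"

def exact_match(output, expected):
    # Experiment evaluators are plain callables that score one example.
    return float(output == expected["answer"])

# With a Phoenix server running, the dataset would be uploaded and the
# experiment run roughly like this (hedged; verify against the experiments guide):
#   import phoenix as px
#   from phoenix.experiments import run_experiment
#   dataset = px.Client().upload_dataset(
#       dataset_name="golden-qa",
#       dataframe=golden_df,
#       input_keys=["question"],
#       output_keys=["answer"],
#   )
#   run_experiment(dataset, task, evaluators=[exact_match],
#                  experiment_name="golden-run")
```

The experiment results (task outputs plus evaluator scores per example) then show up under the dataset in the Phoenix UI.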
🔒 [private user] yes - I'm just working through the tutorial for now. Here is my full script:
from phoenix.evals import HallucinationEvaluator, OpenAIModel, QAEvaluator, run_evals
import pandas as pd
from phoenix import Client
from phoenix.trace import SpanEvaluations
import os
from dotenv import load_dotenv
load_dotenv()
def generate_evals_df() -> pd.DataFrame:
    df = pd.DataFrame(
        [
            {
                "span_id": "123",
                "reference": "The Eiffel Tower is located in Paris, France. It was constructed in 1889 as the entrance arch to the 1889 World's Fair.",
                "query": "Where is the Eiffel Tower located?",
                "response": "The Eiffel Tower is located in Paris, France.",
            },
            # {
            #     "reference": "The Great Wall of China is over 13,000 miles long. It was built over many centuries by various Chinese dynasties to protect against nomadic invasions.",
            #     "query": "How long is the Great Wall of China?",
            #     "response": "The Great Wall of China is approximately 13,171 miles (21,196 kilometers) long.",
            # },
            # {
            #     "reference": "The Amazon rainforest is the largest tropical rainforest in the world. It covers much of northwestern Brazil and extends into Colombia, Peru and other South American countries.",
            #     "query": "What is the largest tropical rainforest?",
            #     "response": "The Amazon rainforest is the largest tropical rainforest in the world. It is home to the largest number of plant and animal species in the world.",
            # },
        ]
    )
    # We have to make some minor changes to our dataframe to use the column names
    # expected by our evaluators:
    # for `hallucination_evaluator` the input df needs columns 'output', 'input', 'context';
    # for `qa_evaluator` the input df needs columns 'output', 'input', 'reference'.
    df["context"] = df["reference"]
    df.rename(columns={"query": "input", "response": "output"}, inplace=True)
    assert all(
        column in df.columns for column in ["output", "input", "context", "reference"]
    )
    return df
def run():
    eval_model = OpenAIModel(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))
    hallucination_evaluator = HallucinationEvaluator(eval_model)
    qa_evaluator = QAEvaluator(eval_model)
    # Run the evaluators; each evaluator will return a dataframe with evaluation results.
    # We upload the evaluation results to Phoenix in the next step.
    df = generate_evals_df()
    hallucination_eval_df, qa_eval_df = run_evals(
        dataframe=df,
        evaluators=[hallucination_evaluator, qa_evaluator],
        provide_explanation=True,
    )
    results_df = df.copy()
    # results_df["hallucination_eval"] = hallucination_eval_df["label"]
    # results_df["hallucination_explanation"] = hallucination_eval_df["explanation"]
    results_df["label"] = qa_eval_df["label"]
    results_df["explanation"] = qa_eval_df["explanation"]
    Client().log_evaluations(
        SpanEvaluations(
            dataframe=results_df,
            eval_name="test-eval",
        ),
        project_name="my-llm-app",
    )

if __name__ == "__main__":
    run()

🔒 [private user] for posterity
Keyword argument `project_name` is no longer necessary and is ignored.

no worries!
Also, I don't understand how to differentiate evals from evaluators.
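One way to see the distinction this thread keeps circling: an evaluator is the thing that judges (a HallucinationEvaluator instance, or a plain function), while an eval is the judgement it produces (the label/score/explanation rows in the dataframes that run_evals returns). A minimal, Phoenix-free illustration, with hypothetical names:

```python
# Evaluator: a callable that judges one example.
def contains_answer(output: str, reference: str) -> dict:
    ok = output in reference
    # Eval: the result record the evaluator produces.
    return {"label": "factual" if ok else "hallucinated", "score": float(ok)}

eval_result = contains_answer(
    "Paris", "The Eiffel Tower is located in Paris, France."
)
# eval_result is the eval: {'label': 'factual', 'score': 1.0}
```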
Got it, I just followed the experiments guide. What is best practice for running evals? I noticed that I cannot use the evals from the evals package when listing evaluators in run_experiment (though my syntax may be wrong). The following does not work:
from phoenix.evals import HallucinationEvaluator, OpenAIModel, QAEvaluator

run_experiment(
    dataset,
    task,
    experiment_name="initial-experiment",
    evaluators=[
        jaccard_similarity,
        accuracy,
        hallucination_evaluator,
        qa_evaluator,
    ],
)

TypeError: <phoenix.evals.evaluators.HallucinationEvaluator object at 0x30c3079e0> is not a callable object
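The TypeError arises because the evaluators in phoenix.evals are dataframe-oriented objects consumed by run_evals, while run_experiment expects plain callables that score one example at a time. A hedged workaround is a thin wrapper that builds a one-row dataframe per example. The real run_evals call is sketched in a comment (it needs a configured model and API key), with a trivial substring check standing in so the shape runs on its own:

```python
import pandas as pd

def wrap_hallucination_evaluator():
    def hallucination(input, output, expected):
        # Shape a one-row dataframe with the column names the evaluator
        # expects (same renaming as in the script above).
        df = pd.DataFrame(
            [{"input": input, "output": output,
              "context": expected, "reference": expected}]
        )
        # With a configured phoenix.evals evaluator, the judgement would be
        # roughly (hedged; same run_evals signature as used in this thread):
        #   (result_df,) = run_evals(dataframe=df,
        #                            evaluators=[hallucination_evaluator],
        #                            provide_explanation=True)
        #   return result_df.iloc[0]["score"]
        # Stand-in judgement so this sketch is runnable:
        return float(df.loc[0, "output"] in df.loc[0, "reference"])
    return hallucination
```

run_experiment(dataset, task, evaluators=[wrap_hallucination_evaluator()], ...) would then receive a plain callable, avoiding the "is not a callable object" error.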
Had to edit the function to only return the score.

ValueError: Unsupported evaluation result type: <class 'dict'>

Is there a way to allow the label and explanation as well?
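The ValueError suggests a bare dict is not an accepted return type for an experiment evaluator. One hedged option is returning a structured result object instead of a float; Phoenix has an EvaluationResult type for this (shown in a comment since its exact import path and signature should be checked against the experiments guide), and a local dataclass mirrors its fields here so the sketch runs standalone:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Result:
    # Local stand-in mirroring the score/label/explanation fields.
    score: float
    label: Optional[str] = None
    explanation: Optional[str] = None

def qa_correctness(output, expected):
    ok = output.strip() == expected.strip()
    # With phoenix installed, this would plausibly be (hedged, verify first):
    #   from phoenix.experiments.types import EvaluationResult
    #   return EvaluationResult(score=float(ok), label=..., explanation=...)
    return Result(
        score=float(ok),
        label="correct" if ok else "incorrect",
        explanation="exact string match" if ok else "outputs differ",
    )
```

A structured return like this should let the label and explanation appear alongside the score in the experiment view, rather than forcing the evaluator down to a single float.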
