Is the only way to view evals in Phoenix to run them as part of an experiment?
💡 Hint: Mention RunLLM in the thread for followups.
I'm running

Client().log_evaluations(
    SpanEvaluations(
        dataframe=results_df,
        eval_name="test-eval",
    ),
)

but I don't see where the evals show up.
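For reference, a minimal sketch (untested) of how I understand this gets wired up: SpanEvaluations joins the rows to spans by span ID, so the dataframe needs to be indexed by span IDs that were actually traced to Phoenix, not arbitrary values like "123". The placeholder label/score/explanation values below are made up for illustration.

# Sketch (untested): attach evals to spans that already exist in Phoenix.
# Assumes SpanEvaluations matches rows to spans via the dataframe's span-id index,
# so the rows are built from get_spans_dataframe() rather than hard-coded IDs.
from phoenix import Client
from phoenix.trace import SpanEvaluations

client = Client()
spans_df = client.get_spans_dataframe()   # index is the traced span IDs

results_df = spans_df[[]].copy()          # keep only the span-id index
results_df["label"] = "correct"           # placeholder values for illustration
results_df["score"] = 1.0
results_df["explanation"] = "matches the reference answer"

client.log_evaluations(SpanEvaluations(eval_name="test-eval", dataframe=results_df))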
Adding project_name gave me a "this is no longer a supported keyword" warning.
I'd like to define golden datasets in code, run evals on them, and then see the results. I think this probably maps to setting up an experiment, right?
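My rough understanding of that flow is the sketch below (untested): upload a dataframe as a named dataset, point run_experiment at it with a task and some evaluators, and the results show up under that dataset's experiments in the UI. The dataset name, the query/answer keys, and the hard-coded task are illustrative assumptions, not tutorial code.

# Sketch (untested): define a golden dataset in code and run an experiment on it.
# "golden-qa", the query/answer keys, and the canned task output are made up.
import pandas as pd
from phoenix import Client
from phoenix.experiments import run_experiment

golden_df = pd.DataFrame(
    [{"query": "Where is the Eiffel Tower located?", "answer": "Paris, France"}]
)

dataset = Client().upload_dataset(
    dataset_name="golden-qa",
    dataframe=golden_df,
    input_keys=["query"],
    output_keys=["answer"],
)

def task(input):
    # call the app under test here; a canned answer stands in for it
    return "The Eiffel Tower is located in Paris, France."

def contains_expected(output, expected) -> bool:
    # simple check against the golden answer
    return expected["answer"].lower() in output.lower()

run_experiment(dataset, task, evaluators=[contains_expected], experiment_name="golden-run")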
🔒[private user] yes - I'm just working through the tutorial for now. Here is my full script:
from phoenix.evals import HallucinationEvaluator, OpenAIModel, QAEvaluator, run_evals
import pandas as pd
from phoenix import Client
from phoenix.trace import SpanEvaluations
import os
from dotenv import load_dotenv

load_dotenv()


def generate_evals_df() -> pd.DataFrame:
    df = pd.DataFrame(
        [
            {
                "span_id": "123",
                "reference": "The Eiffel Tower is located in Paris, France. It was constructed in 1889 as the entrance arch to the 1889 World's Fair.",
                "query": "Where is the Eiffel Tower located?",
                "response": "The Eiffel Tower is located in Paris, France.",
            },
            # {
            #     "reference": "The Great Wall of China is over 13,000 miles long. It was built over many centuries by various Chinese dynasties to protect against nomadic invasions.",
            #     "query": "How long is the Great Wall of China?",
            #     "response": "The Great Wall of China is approximately 13,171 miles (21,196 kilometers) long.",
            # },
            # {
            #     "reference": "The Amazon rainforest is the largest tropical rainforest in the world. It covers much of northwestern Brazil and extends into Colombia, Peru and other South American countries.",
            #     "query": "What is the largest tropical rainforest?",
            #     "response": "The Amazon rainforest is the largest tropical rainforest in the world. It is home to the largest number of plant and animal species in the world.",
            # },
        ]
    )
    # We have to make some minor changes to our dataframe to use the column names expected by our evaluators:
    # for `hallucination_evaluator` the input df needs the columns 'output', 'input', 'context'
    # for `qa_evaluator` the input df needs the columns 'output', 'input', 'reference'
    df["context"] = df["reference"]
    df.rename(columns={"query": "input", "response": "output"}, inplace=True)
    assert all(
        column in df.columns for column in ["output", "input", "context", "reference"]
    )
    return df


def run():
    eval_model = OpenAIModel(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))
    hallucination_evaluator = HallucinationEvaluator(eval_model)
    qa_evaluator = QAEvaluator(eval_model)

    # Run the evaluators; each evaluator returns a dataframe with evaluation results.
    # We upload the evaluation results to Phoenix in the next step.
    df = generate_evals_df()
    hallucination_eval_df, qa_eval_df = run_evals(
        dataframe=df,
        evaluators=[hallucination_evaluator, qa_evaluator],
        provide_explanation=True,
    )

    results_df = df.copy()
    # results_df["hallucination_eval"] = hallucination_eval_df["label"]
    # results_df["hallucination_explanation"] = hallucination_eval_df["explanation"]
    results_df["label"] = qa_eval_df["label"]
    results_df["explanation"] = qa_eval_df["explanation"]

    Client().log_evaluations(
        SpanEvaluations(
            dataframe=results_df,
            eval_name="test-eval",
        ),
        project_name="my-llm-app",
    )


if __name__ == "__main__":
    run()

🔒[private user] for posterity
Keyword argument `project_name` is no longer necessary and is ignored.

No worries!
Also, I don't understand how to differentiate evals from evaluators.
Got it, I just followed the experiments guide. What is the best practice for running evals? I noticed that I can't use the evaluators from the evals package when listing evaluators in run_experiment (but my syntax may be wrong). The following does not work:
from phoenix.evals import HallucinationEvaluator, OpenAIModel, QAEvaluator

run_experiment(
    dataset,
    task,
    experiment_name="initial-experiment",
    evaluators=[
        jaccard_similarity,
        accuracy,
        hallucination_evaluator,
        qa_evaluator,
    ],
)

TypeError: <phoenix.evals.evaluators.HallucinationEvaluator object at 0x30c3079e0> is not a callable object
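Experiment evaluators need to be plain callables, so one workaround (untested sketch below) is to wrap the evals-package evaluator in a small function that builds a one-row dataframe and runs run_evals on it. The "query"/"answer" keys and the exact columns the template wants are assumptions; adjust them to match the dataset and the column mapping used in the script above.

# Sketch (untested): wrap an evals-package evaluator as a callable for run_experiment.
# Column names mirror the mapping used in the script above; the "query"/"answer"
# keys assume a dataset defined with those input/output keys.
import pandas as pd
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals

eval_model = OpenAIModel(model="gpt-4o")
hallucination_evaluator = HallucinationEvaluator(eval_model)

def hallucination(input, output, expected) -> float:
    row = {
        "input": input["query"],
        "output": output,
        "reference": expected["answer"],
        "context": expected["answer"],
    }
    (result_df,) = run_evals(
        dataframe=pd.DataFrame([row]),
        evaluators=[hallucination_evaluator],
    )
    # return only the numeric score; see the note below about labels/explanations
    return result_df["score"].iloc[0]

With that, evaluators=[jaccard_similarity, accuracy, hallucination] should at least be callable; whether newer Phoenix versions ship a more idiomatic adapter, I'm not sure.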
I had to edit the function to only return the score.
ValueError: Unsupported evaluation result type: <class 'dict'>

Is there a way to include the label and explanation as well?
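I believe returning a structured result object rather than a bare dict is what the experiments API expects here. A rough sketch, assuming EvaluationResult lives at phoenix.experiments.types in this Phoenix version (worth double-checking):

# Sketch (untested): return an EvaluationResult so the label and explanation are
# kept alongside the score. The import path is an assumption; verify it locally.
from phoenix.experiments.types import EvaluationResult

def exact_match(output, expected) -> EvaluationResult:
    matched = output.strip() == expected["answer"].strip()
    return EvaluationResult(
        score=float(matched),
        label="match" if matched else "no_match",
        explanation=(
            "model answer matches the golden answer"
            if matched
            else "model answer differs from the golden answer"
        ),
    )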
