Experiments - evaluation question
Hi, I see that in most examples llm_classify is used for classification eval. I tried to use examples from the experiments folder but got stuck. How can I recreate the functionality of llm_classify to output categorical labels so that Phoenix will calculate precision/recall/F1 for me? This is my current code; I would appreciate any pointers on how to convert it to a classification eval.
```
# Imports used below
from datetime import datetime
from typing import Any, Dict

import phoenix as px
from openai import OpenAI
from phoenix.experiments import run_experiment
from phoenix.experiments.evaluators import create_evaluator

# Setup OpenAI and Phoenix Client
URL = "https://XXXXXXX.com"
phoenix_client = px.Client(endpoint=URL)
openai_client = OpenAI(base_url="https://XXXXXX/v1", api_key="XXXXX")

# Upload Dataset
now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
dataset = phoenix_client.upload_dataset(
    dataframe=df,
    dataset_name=f"sentiment-analysis-{now}",
    input_keys=["query"],
    output_keys=["ground_truth"],
)

emotions_unique = ", ".join(df["ground_truth"].unique())
print(emotions_unique)

prompt_template = """
Classify the emotion present in the text below. You should only respond with the name of the emotion, no other words.
The emotion must be one of the provided values.
Input
=======
[Text]: {text}
[Provided Values]: {emotions}
"""

def make_emotion_task(prompt_template: str, model_name: str):
    def task(input: Dict[str, Any]) -> str:
        formatted_prompt = prompt_template.format(text=input["query"], emotions=emotions_unique)
        response = openai_client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": formatted_prompt}],
        )
        assert response.choices
        return response.choices[0].message.content.strip()
    return task

# Instantiate task
task = make_emotion_task(prompt_template, model_name="meta-llama/Meta-Llama-3-8B-Instruct")

# Evaluation prompt for model-as-a-judge
eval_prompt = """
Your task is to evaluate whether the predicted emotion below describes the supplied input text.
We are also including the correct emotion as a piece of data.
Begin Data:
[input text]: {input}
[correct emotion]: {expected}
[predicted emotion]: {output}
It's possible that the predicted emotion is another word for the correct emotion, and the two are
roughly equivalent. If the two emotions are equivalent, respond with the word 'correct'. If they
are not equivalent, respond with the word 'incorrect'. Do not include any other words in your
response.
"""

# Evaluator function
@create_evaluator(kind="llm")
def llm_as_a_judge_eval(input: Dict[str, Any], output: str, expected: Dict[str, Any]) -> float:
    message_content = eval_prompt.format(
        input=input["query"],
        expected=expected["ground_truth"],
        output=output,
    )
    response = openai_client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[{"role": "user", "content": message_content}],
    )
    assert response.choices
    return 1.0 if response.choices[0].message.content.strip().lower() == "correct" else 0.0

@create_evaluator(kind="llm")
def classification_eval(input, output, expected) -> str:
    predicted = output.strip().lower()
    actual = expected["ground_truth"].strip().lower()
    return "correct" if predicted == actual else "incorrect"

experiment = run_experiment(
    dataset=dataset,
    task=task,
    experiment_name="sentiment-experiment",
    experiment_description="baseline experiment",
    experiment_metadata={"vendor": "OpenAI", "model": "Meta-Llama-3-8B-Instruct"},
    evaluators=[classification_eval],
    dry_run=False,
)
```
To recreate llm_classify-style categorical evaluation and have Phoenix compute precision/recall/F1, you should use the llm_classify function directly, which is designed for this purpose. llm_classify takes your dataframe, a prompt template, a model, and a list of possible classes (rails), and outputs a dataframe with predicted labels. You can then compare these predictions to your ground truth to compute classification metrics. Your current approach with a custom evaluator returns "correct"/"incorrect" but does not produce the per-class predictions needed for precision/recall/F1 by class. Here's how you can adapt your code to use llm_classify for categorical classification:
```
from phoenix.evals import llm_classify, OpenAIModel

# Define your rails (unique emotion classes)
rails = list(df['ground_truth'].unique())

# Use llm_classify to get predictions
classification_df = llm_classify(
    dataframe=df,
    template=prompt_template,
    model=OpenAIModel(model_name="meta-llama/Meta-Llama-3-8B-Instruct"),
    rails=rails,
    verbose=True
)

# Now, classification_df["label"] contains the predicted class for each row.
```
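Once you have classification_df, one way to get precision/recall/F1 is scikit-learn's classification_report. This is a sketch of my own (not from the thread), assuming the rows of classification_df stay aligned with df and that scikit-learn is installed:
```
# Sketch: compute per-class precision/recall/F1 from llm_classify output.
from sklearn.metrics import classification_report

y_true = df["ground_truth"].str.lower()
# Rows the judge model couldn't map onto a rail may come back unlabeled,
# so fill them with a placeholder before comparing.
y_pred = classification_df["label"].fillna("unparseable").str.lower()

print(classification_report(y_true, y_pred, zero_division=0))
```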
💡 Hint: Mention RunLLM in the thread for followups.
RunLLM I modified the code to this one:
```
rails = ["correct", "incorrect"]

# Use llm_classify to get predictions
classification_df = llm_classify(
    data=df,
    template=eval_prompt,
    model=OpenAIModel(model="meta-llama/Meta-Llama-3-8B-Instruct", api_key="EMPTY"),
    rails=rails,
    verbose=True
)
```
But got an error: retries exhausted after 1 attempts: Missing template variable: 'input'
The error occurs because your dataframe must have columns matching all template variables used in your eval_prompt. Your template uses {input}, {expected}, and {output}, but your dataframe likely does not have columns named input, expected, and output. All template variable names must appear as column names in the dataframe for llm_classify to work properly. To fix this, rename or create columns in your dataframe to match the template variables, e.g., input, expected, and output. Would you like step-by-step guidance on how to adjust your dataframe?
Sources:
- Slack: Missing template variable 'input'
- GitHub Issue #4019
💡 Hint: Mention RunLLM in the thread for followups.
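As an illustration of that fix (my own sketch, not from the thread): assuming the dataframe has query and ground_truth columns plus a hypothetical prediction column holding the model's outputs, you could rename the columns so they match the template variables before calling llm_classify:
```
from phoenix.evals import llm_classify, OpenAIModel

# Rename columns so they line up with {input}, {expected}, {output} in eval_prompt.
# "prediction" is a hypothetical column name -- use whatever column holds your
# model's predicted emotion.
eval_df = df.rename(
    columns={
        "query": "input",            # -> {input}
        "ground_truth": "expected",  # -> {expected}
        "prediction": "output",      # -> {output}
    }
)

classification_df = llm_classify(
    data=eval_df,
    template=eval_prompt,
    model=OpenAIModel(model="meta-llama/Meta-Llama-3-8B-Instruct", api_key="EMPTY"),
    rails=["correct", "incorrect"],
)
```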
hey SimpleStudent - no doubt using llm_classify with experiments is a bit tough and it might be easier just to use an LLM SDK right now. I don't currently have the bandwidth to repro your use-case but maybe I can get someone to take a closer look in the next few days. cc Donny L.
Got it - thanks
Would appreciate any examples you could share with me that achieve something similar
Hey SimpleStudent - let me get John G. and Donny L. to lean in here. The most relevant example is probably here: https://arize.com/docs/phoenix/cookbook/datasets-and-experiments/summarization
Hey SimpleStudent - I'd actually recommend looking into this example: https://github.com/Arize-ai/phoenix/blob/main/tutorials/experiments/agents-cookbook.ipynb . The "Create an Experiment" section shows an example of using LLM classify in the evaluator of an experiment
The key here, and where your issue is coming from, is that your df must have column names that match the variables in your eval_prompt. Based on the error you sent, your prompt is looking for a column named input that doesn't exist in your dataframe.
As Mikyo said, llm_classify and experiments don't exactly work together super simply right now. llm_classify is designed to evaluate a dataframe, while experiments send individual rows through your evaluator functions one at a time. What you generally need to do is create a one-row dataframe inside your experiment evaluator and send that through llm_classify. The notebook I shared uses that approach.
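For what it's worth, a rough sketch of that one-row-dataframe pattern (my own illustration, not the exact code from the notebook; it reuses the eval_prompt defined earlier and assumes the same judge model) might look like:
```
import pandas as pd
from phoenix.evals import llm_classify, OpenAIModel
from phoenix.experiments.evaluators import create_evaluator

judge_model = OpenAIModel(model="meta-llama/Meta-Llama-3-8B-Instruct", api_key="EMPTY")

@create_evaluator(kind="llm")
def llm_classify_eval(input, output, expected) -> float:
    # Build a one-row dataframe whose columns match the template variables
    # ({input}, {expected}, {output}) in eval_prompt.
    row = pd.DataFrame(
        [{"input": input["query"], "expected": expected["ground_truth"], "output": output}]
    )
    result = llm_classify(
        data=row,
        template=eval_prompt,
        model=judge_model,
        rails=["correct", "incorrect"],
    )
    # llm_classify returns a dataframe with a "label" column.
    return 1.0 if result["label"].iloc[0] == "correct" else 0.0
```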
Thank you for the help - to be fair, that seems to overcomplicate the issue a bit, so I will write the logic myself and use Phoenix just for the visualization. Given a dataframe with the following structure and the classification report, is there a way to upload those results to Phoenix just for visualization? John G.
SimpleStudent - apologies for the delay here, that makes sense. In order to upload the results to Phoenix, you'd either need to start from a dataset and use an experiment, or start from traces that you pull down from Phoenix, evaluate, and push back up. We don't have a way to simply upload metrics at this time.
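A minimal sketch of that second path (pull spans down, run your own evaluation, log the results back), assuming the app is already traced in Phoenix; the eval_name and placeholder scoring here are illustrative only:
```
import phoenix as px
from phoenix.trace import SpanEvaluations

client = px.Client(endpoint=URL)

# Pull spans down from Phoenix as a dataframe indexed by span id.
spans_df = client.get_spans_dataframe()

# Run your own evaluation logic and build a dataframe (same span-id index)
# with "label" and/or "score" columns. Placeholder values shown here.
evals_df = spans_df[[]].copy()
evals_df["label"] = "correct"
evals_df["score"] = 1.0

# Push the evaluations back up so they appear alongside the traces.
client.log_evaluations(SpanEvaluations(eval_name="emotion-accuracy", dataframe=evals_df))
```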
