Certainly! Here’s an end-to-end code example for running your eval pipeline with Phoenix, covering dataset creation, the task function, an evaluator, and the experiment run. It assumes Phoenix is running locally and your data is in a pandas DataFrame; the LLM call is left as a placeholder for you to fill in.
```
import pandas as pd
from phoenix.client import Client
from phoenix.experiments import run_experiment
# Sample data
records = [
    {
        "user_query": "Write a Sensor for Windows in powershell to check TPM lockout heal time",
        "reference_gt_code": "code_implementation_a",
        "llm_response": "code_implementation_b",
        "human_review_notes": "wrong method: needs to use get-tpm",
        "scripting_language": "powershell",
        "platform": "windows",
        "gt_reference_label": "incorrect"
    }
]
df = pd.DataFrame(records)
# Connect to Phoenix
px_client = Client()
# Create dataset with multiple input keys
dataset = px_client.datasets.create_dataset(
    name="code_equivalence_eval",
    dataframe=df,
    input_keys=["user_query", "reference_gt_code", "llm_response", "platform", "scripting_language"],
    output_keys=[],
    metadata_keys=["human_review_notes", "gt_reference_label"]
)
# Define the prompt template
prompt_template = """
Platform: {platform}
Scripting Language: {scripting_language}
llm_code: {llm_response}
reference_code: {reference_gt_code}
Check if both are functionally equivalent, and return "true" or "false".
"""
# Define the task function
def task(input, metadata):
    # Format the prompt with input fields
    prompt = prompt_template.format(
        platform=input["platform"],
        scripting_language=input["scripting_language"],
        llm_response=input["llm_response"],
        reference_gt_code=input["reference_gt_code"]
    )
    # Call your LLM here (pseudo-code, replace with your LLM call)
    # response = your_llm_api(prompt)
    # For demonstration, return a dummy result
    return {"label": "false"}
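# Define a simple evaluator (replace with your logic).
# The task returns "true"/"false", while gt_reference_label holds values such as
# "incorrect" (assumed here to be "correct"/"incorrect"), so map before comparing.
def evaluator(metadata, output):
    predicted = "correct" if output["label"] == "true" else "incorrect"
    return predicted == metadata["gt_reference_label"]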
# Run the experiment
experiment = run_experiment(
    dataset=dataset,
    task=task,
    evaluators=[evaluator],
    concurrency=1
)
```
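The placeholder LLM call in `task` could be filled in along the following lines. This is only a sketch under assumptions: `parse_verdict`, `make_task`, and `fake_llm` are hypothetical names that do not come from Phoenix, and `call_llm` stands for whatever function wraps your model client and returns its reply as a string. The point is to normalize a free-form reply to `"true"` or `"false"` before the evaluator sees it.

```python
from typing import Callable


def parse_verdict(raw_reply: str) -> str:
    """Reduce a free-form model reply to "true", "false", or "unparseable"."""
    # Drop surrounding whitespace, quotes, and trailing periods, then lowercase.
    text = raw_reply.strip().strip("\"'.").lower()
    if text.startswith("true"):
        return "true"
    if text.startswith("false"):
        return "false"
    return "unparseable"


def make_task(call_llm: Callable[[str], str], template: str):
    """Build a task that formats the prompt, calls the model, and parses the reply."""
    def task(input, metadata=None):
        # str.format ignores keys the template does not use (e.g. user_query).
        prompt = template.format(**input)
        return {"label": parse_verdict(call_llm(prompt))}
    return task


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model client (assumption): always answers "False."
    return ' "False."\n'


demo_template = "llm_code: {llm_response}\nreference_code: {reference_gt_code}"
demo_task = make_task(fake_llm, demo_template)
result = demo_task(
    {"llm_response": "a", "reference_gt_code": "b", "platform": "windows"}
)
print(result)  # {'label': 'false'}
```

Keeping the model call behind a plain callable means you can swap in a real client later and still test the parsing logic offline with a stub.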
`run_experiment` runs the task and evaluator over each dataset example and logs the results to your local Phoenix instance. Note: the UI will still display the input fields as a single dictionary, but the experiment and evaluation will run as expected.
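To make the dictionary shape concrete, here is a rough plain-Python sketch of how one row is divided by the key lists passed to `create_dataset`. `split_record` is a hypothetical helper written for illustration, not a Phoenix function, and Phoenix's internal representation may differ; it only mirrors the idea that `task` receives the input columns as one dict and the evaluator can read the metadata columns as another.

```python
INPUT_KEYS = ["user_query", "reference_gt_code", "llm_response", "platform", "scripting_language"]
METADATA_KEYS = ["human_review_notes", "gt_reference_label"]


def split_record(record, input_keys=INPUT_KEYS, metadata_keys=METADATA_KEYS):
    """Hypothetical helper: split one row into (input, metadata) dicts."""
    missing = [k for k in (*input_keys, *metadata_keys) if k not in record]
    if missing:
        raise KeyError(f"record is missing columns: {missing}")
    return (
        {k: record[k] for k in input_keys},
        {k: record[k] for k in metadata_keys},
    )


record = {
    "user_query": "Write a Sensor for Windows in powershell to check TPM lockout heal time",
    "reference_gt_code": "code_implementation_a",
    "llm_response": "code_implementation_b",
    "human_review_notes": "wrong method: needs to use get-tpm",
    "scripting_language": "powershell",
    "platform": "windows",
    "gt_reference_label": "incorrect",
}

example_input, example_metadata = split_record(record)
print(sorted(example_input))                   # the five input column names
print(example_metadata["gt_reference_label"])  # incorrect
```

Running a check like this on a few rows before creating the dataset can catch a misspelled column name early.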
Would you like more detail on customizing the LLM call or viewing results in the UI?
Sources:
- https://github.com/Arize-ai/phoenix/blob/main/tutorials/ai_evals_course/hw3_phoenix/phoenix_methods_guide.md