I want to optimize my LLM-as-a-judge using Phoenix. Task: check functional equivalence between code A and code B, with additional fields like platform and scripting_language as supporting variables. Here is my dataset sample:

```python
records: List[dict] = [
    {
        "user_query": "Write a Sensor for Windows in powershell to check TPM lockout heal time",
        "reference_gt_code": "code_implementation_a",
        "llm_response": "code_implementation_b",
        "human_review_notes": "wrong method: needs to use get-tpm",
        "scripting_language": "powershell",
        "gt_reference_label": "false"
    }
]
```

Prompt Template:
Platform : {platform}
Scripting Language : {scripting_language}
llm_code : {llm_response}
reference_code : {reference_gt_code}
Check if both are functionally equivalent, and return "true" or "false".

How can I create a Phoenix dataset from this data and run eval experiments on it? When I specify multiple input keys in px_client.datasets.create_dataset, in the UI I see all of them as a single input field represented as a dictionary, and it's not so convenient to manually annotate, unlike Excel where I would still see 'n' columns plus my expected and predicted labels. Am I doing it wrong?
You are not doing anything wrong. Phoenix datasets store multiple input keys as a dictionary, so in the UI all input fields are grouped together rather than shown as separate columns. This is the expected behavior: Phoenix currently does not support displaying each input key as a separate column for manual annotation; all input fields are bundled into a single dictionary in the UI.

To create your dataset and run eval experiments, use px_client.datasets.create_dataset with your input keys (e.g., ["user_query", "reference_gt_code", "llm_response", "platform", "scripting_language"]), then run your experiment with a task function that uses your prompt template and an evaluator for functional equivalence. The UI will show all input fields as a dictionary, but this does not affect experiment execution or evaluation.

Would you like a step-by-step code example for dataset creation and running the experiment?

Sources:
- https://github.com/Arize-ai/phoenix/issues/8301
- https://arize.com/docs/phoenix/datasets-and-experiments/concepts-datasets#datasets
- https://github.com/Arize-ai/phoenix/blob/main/docs/datasets-and-experiments/how-to-datasets/creating-datasets.md
💡 Hint: Mention RunLLM in the thread for followups.
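Since the UI bundles all input keys into one dictionary, a practical workaround for spreadsheet-style annotation is to flatten the examples back into one column per key with pandas and review them in Excel or a CSV viewer. A minimal sketch, assuming examples shaped like Phoenix dataset examples (a nested `input` dict plus `metadata`); the export-to-CSV step and the sample values here are illustrative:

```python
import pandas as pd

# Example records shaped like Phoenix dataset examples, where each
# example's input is a single nested dict (as shown in the UI).
examples = [
    {
        "input": {
            "user_query": "Write a Sensor for Windows in powershell to check TPM lockout heal time",
            "platform": "windows",
            "scripting_language": "powershell",
        },
        "metadata": {"gt_reference_label": "false"},
    }
]

# json_normalize flattens the nested dicts into one column per key
# (e.g. "input.platform"), giving the Excel-like n-column view.
df = pd.json_normalize(examples)
df.to_csv("examples_for_annotation.csv", index=False)
print(list(df.columns))
```

After annotating in the spreadsheet, you can read the CSV back and fold your edits into the metadata columns before re-uploading.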
Certainly! Here's a complete code example for running your eval pipeline with Phoenix, including dataset creation, experiment setup, and evaluation. This example assumes you have Phoenix running locally and your data in a pandas DataFrame.

```python
import pandas as pd
from phoenix.client import Client
from phoenix.experiments import run_experiment

# Sample data
records = [
    {
        "user_query": "Write a Sensor for Windows in powershell to check TPM lockout heal time",
        "reference_gt_code": "code_implementation_a",
        "llm_response": "code_implementation_b",
        "human_review_notes": "wrong method: needs to use get-tpm",
        "scripting_language": "powershell",
        "platform": "windows",
        "gt_reference_label": "false"
    }
]
df = pd.DataFrame(records)

# Connect to Phoenix
px_client = Client()

# Create dataset with multiple input keys
dataset = px_client.datasets.create_dataset(
    name="code_equivalence_eval",
    dataframe=df,
    input_keys=["user_query", "reference_gt_code", "llm_response", "platform", "scripting_language"],
    output_keys=[],
    metadata_keys=["human_review_notes", "gt_reference_label"],
)

# Define the prompt template
prompt_template = """
Platform: {platform}
Scripting Language: {scripting_language}

llm_code: {llm_response}
reference_code: {reference_gt_code}

Check if both are functionally equivalent, and return "true" or "false".
"""

# Define the task function
def task(input, metadata):
    # Format the prompt with input fields
    prompt = prompt_template.format(
        platform=input["platform"],
        scripting_language=input["scripting_language"],
        llm_response=input["llm_response"],
        reference_gt_code=input["reference_gt_code"],
    )
    # Call your LLM here (pseudo-code, replace with your LLM call)
    # response = your_llm_api(prompt)
    # For demonstration, return a dummy result
    return {"label": "false"}

# Define a simple evaluator (replace with your logic)
def evaluator(metadata, output):
    return output["label"] == metadata["gt_reference_label"]

# Run the experiment
experiment = run_experiment(
    dataset=dataset,
    task=task,
    evaluators=[evaluator],
    concurrency=1,
)
```

This will run your eval pipeline and log results to your local Phoenix instance. Note: the UI will still display the input fields as a dictionary, but the experiment and evaluation will work as expected.

Would you like more detail on customizing the LLM call or viewing results in the UI?

Sources:
- https://github.com/Arize-ai/phoenix/blob/main/tutorials/ai_evals_course/hw3_phoenix/phoenix_methods_guide.md
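One sharp edge in the example above: the evaluator does a strict string comparison, but judge models often return variants like "True", "true.", or a quoted word, which would all be scored as mismatches. A small normalizer makes the comparison less brittle. This is a minimal sketch; `normalize_verdict` is a hypothetical helper, not a Phoenix API, and it assumes the ground-truth label uses the "true"/"false" convention from the dataset sample:

```python
def normalize_verdict(raw: str) -> str:
    """Map free-form judge output to 'true' or 'false'.

    Handles casing, surrounding whitespace/quotes/punctuation, and
    falls back to 'false' when no clear verdict is found.
    """
    cleaned = raw.strip().strip('"\'. \n').lower()
    if cleaned in ("true", "false"):
        return cleaned
    # Look for a verdict word inside a longer response.
    if "true" in cleaned and "false" not in cleaned:
        return "true"
    if "false" in cleaned and "true" not in cleaned:
        return "false"
    return "false"

def evaluator(metadata, output):
    # Compare the normalized judge verdict against the human label.
    return normalize_verdict(output["label"]) == metadata["gt_reference_label"].lower()
```

With this, `evaluator({"gt_reference_label": "false"}, {"label": "False"})` scores as a match instead of a spurious failure.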
Hi AJITH S., we actually have a few cookbooks on this
Build a Custom LLM Eval using a Benchmark Dataset: complete code example of refining an LLM judge prompt based on human-annotated data
Optimizing LLM as a Judge prompts: techniques for optimizing an LLM-as-a-judge prompt (a walkthrough of techniques, not a full end-to-end pipeline)
Our Cookbooks have a Human-in-the-Loop workflows section that I would recommend checking for end-to-end examples: https://arize.com/docs/phoenix/cookbook
The AI-generated response was correct about input keys for the dataset. You can't annotate dataset examples directly, but any annotations you have on traces and spans that are later put into a dataset will be preserved as metadata. You can, however, edit dataset fields. If you are looking to annotate, I would edit the metadata field with your specific annotation. Hope that is helpful!
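Following that advice, any human annotations you want to keep attached to each example can be packed into the columns you pass as metadata_keys at upload time, so they travel with the dataset. A minimal sketch with pandas; the column names match the records from this thread, while the `reviewer` column is an assumption added for illustration, not a Phoenix field:

```python
import pandas as pd

records = [{
    "user_query": "Write a Sensor for Windows in powershell to check TPM lockout heal time",
    "reference_gt_code": "code_implementation_a",
    "llm_response": "code_implementation_b",
    "human_review_notes": "wrong method: needs to use get-tpm",
    "gt_reference_label": "false",
}]
df = pd.DataFrame(records)

# Anything listed under metadata_keys is preserved on the dataset
# example, so it doubles as an annotation store. The reviewer column
# below is an illustrative assumption for tracking who wrote each note.
df["reviewer"] = "reviewer_1"

metadata_keys = ["human_review_notes", "gt_reference_label", "reviewer"]
print(df[metadata_keys].to_dict(orient="records"))
```

Editing these columns in the DataFrame and re-uploading is then the spreadsheet-style annotation loop from the original question, just with the annotations living in metadata rather than in separate UI columns.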
