I want to optimize my LLM-as-a-judge using Phoenix. Task: check functional equivalence between code A and code B, with additional fields like platform and scripting_language as supporting variables. Here is my dataset sample:

```python
records: List[dict] = [
    {
        "user_query": "Write a Sensor for Windows in powershell to check TPM lockout heal time",
        "reference_gt_code": "code_implementation_a",
        "llm_response": "code_implementation_b",
        "human_review_notes": "wrong method: needs to use get-tpm",
        "scripting_language": "powershell",
        "gt_reference_label": "false"
    }
]
```

Prompt Template:
Platform : {platform}
Scripting Language : {scripting_language}
llm_code : {llm_response}
reference_code : {reference_gt_code}
Check if both are functionally equivalent, and return "true" or "false".

How can I create a Phoenix dataset from this data and run eval experiments on it? When I specify multiple input keys in px_client.datasets.create_dataset, in the UI I see all of them as a single input field represented as a dictionary, and it's not so convenient to manually annotate, unlike Excel where I would still see 'n' columns plus my expected and predicted labels. Am I doing it wrong?
You are not doing anything wrong. Phoenix datasets store multiple input keys as a dictionary, so in the UI all input fields are grouped together rather than shown as separate columns. This is the expected behavior: Phoenix currently does not support displaying each input key as a separate column for manual annotation; all input fields are bundled into a single dictionary in the UI.

To create your dataset and run eval experiments, use px_client.datasets.create_dataset with your input keys (e.g., ["user_query", "reference_gt_code", "llm_response", "platform", "scripting_language"]), then run your experiment with a task function that uses your prompt template and an evaluator for functional equivalence. The UI will show all input fields as a dictionary, but this does not affect experiment execution or evaluation.

Would you like a step-by-step code example for dataset creation and running the experiment?

Sources:
- https://github.com/Arize-ai/phoenix/issues/8301
- https://arize.com/docs/phoenix/datasets-and-experiments/concepts-datasets#datasets
- https://github.com/Arize-ai/phoenix/blob/main/docs/datasets-and-experiments/how-to-datasets/creating-datasets.md
💡 Hint: Mention RunLLM in the thread for followups.
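Since the UI bundles all input keys into one dictionary, a practical workaround for spreadsheet-style annotation is to flatten the examples back into one column per key with pandas and review them in Excel or a CSV viewer. A minimal sketch, assuming examples shaped like Phoenix dataset examples (a nested `input` dict plus `metadata`); the export-to-CSV step and the sample values here are illustrative:

```python
import pandas as pd

# Example records shaped like Phoenix dataset examples, where each
# example's input is a single nested dict (as shown in the UI).
examples = [
    {
        "input": {
            "user_query": "Write a Sensor for Windows in powershell to check TPM lockout heal time",
            "platform": "windows",
            "scripting_language": "powershell",
        },
        "metadata": {"gt_reference_label": "false"},
    }
]

# json_normalize flattens the nested dicts into one column per key
# (e.g. "input.platform"), giving the Excel-like n-column view.
df = pd.json_normalize(examples)
df.to_csv("examples_for_annotation.csv", index=False)
print(list(df.columns))
```

After annotating in the spreadsheet, you can read the CSV back and fold your edits into the metadata columns before re-uploading.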
Certainly! Here's a complete code example for running your eval pipeline with Phoenix, including dataset creation, experiment setup, and evaluation. This example assumes you have Phoenix running locally and your data in a pandas DataFrame.

```python
import pandas as pd
from phoenix.client import Client
from phoenix.experiments import run_experiment

# Sample data
records = [
    {
        "user_query": "Write a Sensor for Windows in powershell to check TPM lockout heal time",
        "reference_gt_code": "code_implementation_a",
        "llm_response": "code_implementation_b",
        "human_review_notes": "wrong method: needs to use get-tpm",
        "scripting_language": "powershell",
        "platform": "windows",
        "gt_reference_label": "false"
    }
]
df = pd.DataFrame(records)

# Connect to Phoenix
px_client = Client()

# Create dataset with multiple input keys
dataset = px_client.datasets.create_dataset(
    name="code_equivalence_eval",
    dataframe=df,
    input_keys=["user_query", "reference_gt_code", "llm_response", "platform", "scripting_language"],
    output_keys=[],
    metadata_keys=["human_review_notes", "gt_reference_label"],
)

# Define the prompt template
prompt_template = """
Platform: {platform}
Scripting Language: {scripting_language}

llm_code: {llm_response}
reference_code: {reference_gt_code}

Check if both are functionally equivalent, and return "true" or "false".
"""

# Define the task function
def task(input, metadata):
    # Format the prompt with input fields
    prompt = prompt_template.format(
        platform=input["platform"],
        scripting_language=input["scripting_language"],
        llm_response=input["llm_response"],
        reference_gt_code=input["reference_gt_code"],
    )
    # Call your LLM here (pseudo-code, replace with your LLM call)
    # response = your_llm_api(prompt)
    # For demonstration, return a dummy result
    return {"label": "false"}

# Define a simple evaluator (replace with your logic)
def evaluator(metadata, output):
    return output["label"] == metadata["gt_reference_label"]

# Run the experiment
experiment = run_experiment(
    dataset=dataset,
    task=task,
    evaluators=[evaluator],
    concurrency=1,
)
```

This will run your eval pipeline and log results to your local Phoenix instance. Note: the UI will still display the input fields as a dictionary, but the experiment and evaluation will work as expected.

Would you like more detail on customizing the LLM call or viewing results in the UI?

Sources:
- https://github.com/Arize-ai/phoenix/blob/main/tutorials/ai_evals_course/hw3_phoenix/phoenix_methods_guide.md
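One sharp edge in the example above: the evaluator does a strict string comparison, but judge models often return variants like "True", "true.", or a quoted word, which would all be scored as mismatches. A small normalizer makes the comparison less brittle. This is a minimal sketch; `normalize_verdict` is a hypothetical helper, not a Phoenix API, and it assumes the ground-truth label uses the "true"/"false" convention from the dataset sample:

```python
def normalize_verdict(raw: str) -> str:
    """Map free-form judge output to 'true' or 'false'.

    Handles casing, surrounding whitespace/quotes/punctuation, and
    falls back to 'false' when no clear verdict is found.
    """
    cleaned = raw.strip().strip('"\'. \n').lower()
    if cleaned in ("true", "false"):
        return cleaned
    # Look for a verdict word inside a longer response.
    if "true" in cleaned and "false" not in cleaned:
        return "true"
    if "false" in cleaned and "true" not in cleaned:
        return "false"
    return "false"

def evaluator(metadata, output):
    # Compare the normalized judge verdict against the human label.
    return normalize_verdict(output["label"]) == metadata["gt_reference_label"].lower()
```

With this, `evaluator({"gt_reference_label": "false"}, {"label": "False"})` scores as a match instead of a spurious failure.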
Hi AJITH S., we actually have a few cookbooks on this
Build a Custom LLM Eval using a Benchmark Dataset: complete code example of refining an LLM judge prompt based on human-annotated data
Optimizing LLM as a Judge prompts: techniques for optimizing an LLM-as-a-judge prompt (a walkthrough of techniques, not a full end-to-end pipeline)
Our Cookbooks have a Human-in-the-Loop workflows section that I would recommend checking for end-to-end examples: https://arize.com/docs/phoenix/cookbook
The AI-generated response was correct about input keys for the dataset. You can't annotate dataset examples directly, but any annotations you have on traces and spans that are later put into a dataset will be preserved as metadata. You can, however, edit dataset fields. If you are looking to annotate, I would edit the metadata field with your specific annotation. Hope that is helpful!
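Following that advice, any human annotations you want to keep attached to each example can be packed into the columns you pass as metadata_keys at upload time, so they travel with the dataset. A minimal sketch with pandas; the column names match the records from this thread, while the `reviewer` column is an assumption added for illustration, not a Phoenix field:

```python
import pandas as pd

records = [{
    "user_query": "Write a Sensor for Windows in powershell to check TPM lockout heal time",
    "reference_gt_code": "code_implementation_a",
    "llm_response": "code_implementation_b",
    "human_review_notes": "wrong method: needs to use get-tpm",
    "gt_reference_label": "false",
}]
df = pd.DataFrame(records)

# Anything listed under metadata_keys is preserved on the dataset
# example, so it doubles as an annotation store. The reviewer column
# below is an illustrative assumption for tracking who wrote each note.
df["reviewer"] = "reviewer_1"

metadata_keys = ["human_review_notes", "gt_reference_label", "reviewer"]
print(df[metadata_keys].to_dict(orient="records"))
```

Editing these columns in the DataFrame and re-uploading is then the spreadsheet-style annotation loop from the original question, just with the annotations living in metadata rather than in separate UI columns.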
