Optimizing an LLM-as-a-Judge with Phoenix for Code Comparison and Dataset Creation
I want to optimize my LLM-as-a-judge using Phoenix. Task: check functional equivalence between code A and code B, with additional fields like `platform` and `scripting_language` as supporting variables. Here is my dataset sample:

```python
records: List[dict] = [
    {
        "user_query": "Write a Sensor for Windows in powershell to check TPM lockout heal time",
        "reference_gt_code": "code_implementation_a",
        "llm_response": "code_implementation_b",
        "human_review_notes": "wrong method: needs to use get-tpm",
        "scripting_language": "powershell",
        "gt_reference_label": "false",
    }
]
```

Prompt Template:
```
Platform : {platform}
Scripting Language : {scripting_language}
llm_code : {llm_response}
reference_code : {reference_gt_code}

Check if both are functionally equivalent, and return "true" or "false".
```

How can I create a Phoenix dataset from this data and run eval experiments on it? When I specify multiple input keys in `px_client.datasets.create_dataset`, the UI shows all of them as a single input field represented as a dictionary, which is not convenient for manual annotation. I'd like something like Excel, where I still see n columns alongside my expected and predicted labels. Am I doing it wrong?
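For reference, here is a minimal sketch of what I'm currently doing. The `create_dataset` keyword arguments (`name`, `dataframe`, `input_keys`, `output_keys`, `metadata_keys`) and the dataset name are my best reading of the Phoenix client API and may not match the exact signature:

```python
from typing import List

import pandas as pd

# Same shape as the sample record above.
records: List[dict] = [
    {
        "user_query": "Write a Sensor for Windows in powershell to check TPM lockout heal time",
        "reference_gt_code": "code_implementation_a",
        "llm_response": "code_implementation_b",
        "human_review_notes": "wrong method: needs to use get-tpm",
        "scripting_language": "powershell",
        "gt_reference_label": "false",
    }
]

# One row per record; column order follows the dict key order.
df = pd.DataFrame(records)


def upload_dataset(df: pd.DataFrame):
    """Upload the dataframe to Phoenix.

    With several input_keys, the UI renders the input as one
    dict-valued field -- the behavior my question is about.
    Requires a running Phoenix server; argument names are assumptions.
    """
    from phoenix.client import Client

    px_client = Client()
    return px_client.datasets.create_dataset(
        name="code-equivalence-eval",  # hypothetical dataset name
        dataframe=df,
        input_keys=["llm_response", "reference_gt_code", "scripting_language"],
        output_keys=["gt_reference_label"],
        metadata_keys=["user_query", "human_review_notes"],
    )
```

The upload itself needs a Phoenix server running, so the sketch only constructs the dataframe; `upload_dataset(df)` is what I call in my actual setup.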
