I want to optimize my LLM-as-a-judge evaluator using Phoenix.
Task: check functional equivalence between code A and code B, with additional fields like platform and scripting_language as supporting variables.
Here is my dataset sample:
records: List[dict] = [
    {
        "user_query": "Write a Sensor for Windows in powershell to check TPM lockout heal time",
        "reference_gt_code": "code_implementation_a",
        "llm_response": "code_implementation_b",
        "human_review_notes": "wrong method: needs to use get-tpm",
        "platform": "windows",
        "scripting_language": "powershell",
        "gt_reference_label": "false"
    }
]
Prompt Template:
Platform: {platform}
Scripting Language: {scripting_language}
llm_code: {llm_response}
reference_code: {reference_gt_code}
Check if both are functionally equivalent, and return "true" or "false".

How can I create a Phoenix dataset from this data and run eval experiments on it?
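For reference, here is roughly what I have so far. This is a sketch, not a finished setup: the dataset name is made up, and I'm using the legacy `px.Client().upload_dataset` API with `input_keys` / `output_keys` / `metadata_keys` (the upload itself obviously needs a running Phoenix server, so it's wrapped in a function here):

```python
# Sketch: build a DataFrame from the records and upload it as a Phoenix dataset.
# Assumes arize-phoenix is installed; "code-equivalence-eval" is an illustrative name.
from typing import Dict, List

import pandas as pd

records: List[Dict[str, str]] = [
    {
        "user_query": "Write a Sensor for Windows in powershell to check TPM lockout heal time",
        "reference_gt_code": "code_implementation_a",
        "llm_response": "code_implementation_b",
        "human_review_notes": "wrong method: needs to use get-tpm",
        "platform": "windows",
        "scripting_language": "powershell",
        "gt_reference_label": "false",
    }
]

df = pd.DataFrame(records)


def upload_dataset(df: pd.DataFrame):
    """Upload the DataFrame as a Phoenix dataset (requires a running Phoenix server)."""
    import phoenix as px

    client = px.Client()
    return client.upload_dataset(
        dataset_name="code-equivalence-eval",  # illustrative name
        dataframe=df,
        # Fields the judge prompt template needs:
        input_keys=["platform", "scripting_language", "llm_response", "reference_gt_code"],
        # The human label the judge's output is scored against:
        output_keys=["gt_reference_label"],
        # Context that should travel with each example but not feed the prompt:
        metadata_keys=["user_query", "human_review_notes"],
    )
```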
When I specify multiple input keys in `px_client.datasets.create_dataset`, the UI shows them all merged into a single input field rendered as one dictionary. That makes manual annotation inconvenient; I'd rather see 'n' separate columns, like in Excel, alongside my expected and predicted labels.
Am I doing it wrong?
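For completeness, here is the experiment setup I have in mind, sketched with `phoenix.experiments.run_experiment`. The judge call (`judge_llm`) is a placeholder for my actual model client, and the experiment name is illustrative; the evaluator relies on Phoenix binding the `output` and `expected` parameters for each example:

```python
# Sketch: run an equivalence-judge experiment over the uploaded dataset.
# judge_llm is a stand-in for the real LLM call; replace it with your client.

TEMPLATE = """\
Platform: {platform}
Scripting Language: {scripting_language}
llm_code: {llm_response}
reference_code: {reference_gt_code}
Check if both are functionally equivalent, and return "true" or "false"."""


def judge_llm(prompt: str) -> str:
    # Placeholder: call the judge model here and return "true" or "false".
    return "false"


def task(input: dict) -> str:
    # Phoenix passes each example's input dict to the task; format the judge prompt.
    return judge_llm(TEMPLATE.format(**input))


def matches_ground_truth(output: str, expected: dict) -> float:
    # Exact-match evaluator: 1.0 when the judge agrees with the human label.
    return float(output.strip().lower() == expected["gt_reference_label"])


def run(dataset):
    """Run the experiment (requires a running Phoenix server and a dataset handle)."""
    from phoenix.experiments import run_experiment

    return run_experiment(
        dataset,
        task=task,
        evaluators=[matches_ground_truth],
        experiment_name="judge-v1",  # illustrative name
    )
```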