How to Run Server-Side Evaluators Directly on Dataset Rows Without Requiring an LLM Task in Phoenix v13
Hi everyone, I'm here again. I was really looking forward to the server-side evaluators feature, so I'm glad to see it released in v13. One workflow I expected it to enable is rapidly iterating from traces → dataset → evaluation entirely inside Phoenix, without having to write additional code. However, unless I'm missing something, this workflow doesn't seem fully supported yet.

**What I'm trying to do**

I created a dataset from LLM spans. In my case:
The input contains the tool schema and user message.
The tool call produced by the model is the value I want to evaluate.
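To make the row shape concrete, here is a minimal Python sketch of one such row as plain dicts (the field names follow the JSON examples below; the tool schema entry is a stand-in, and nothing here is a real Phoenix API):

```python
# A dataset row built from an LLM span (plain dicts, no Phoenix API involved).
# "input" holds the tool schema and user message; "reference" holds the tool
# call the model actually produced in the trace, i.e. the value to evaluate.
row = {
    "input": {
        "tools": [{"name": "SumTool", "parameters": {"a": "int", "b": "int"}}],
        "messages": [{"role": "user", "content": "Some 7 e 93."}],
    },
    "reference": {
        "messages": [
            {
                "role": "assistant",
                "tool_calls": [
                    {"function": {"name": "SumTool",
                                  "arguments": {"a": 7, "b": 93}}}
                ],
            }
        ]
    },
}

def observed_tool_call(row):
    """Pull the tool call out of the stored reference (not out of a new output)."""
    call = row["reference"]["messages"][0]["tool_calls"][0]["function"]
    return call["name"], call["arguments"]

name, args = observed_tool_call(row)
print(name, args)  # SumTool {'a': 7, 'b': 93}
```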
However, since Phoenix datasets require `input` and `reference`, the tool call ends up being stored as the reference. This is not ground truth; it is simply the observed output that I want to run evaluators against.

Example input:

```json
{
  "tools": [...],
  "messages": [
    {
      "role": "user",
      "content": "Some 7 e 93."
    }
  ]
}
```

Example reference (the tool call produced by the model in the trace):
```json
{
  "messages": [
    {
      "role": "assistant",
      "tool_calls": [
        {
          "function": {
            "name": "SumTool",
            "arguments": {
              "a": 7,
              "b": 93
            }
          }
        }
      ]
    }
  ]
}
```

I then adapted the tool selection evaluator template so it reads from `reference` instead of `output`. Original template:
```
<output>
{{#output.messages}}
...
{{/output.messages}}
</output>
```

Modified template:
```
<output>
{{#reference.messages}}
...
{{/reference.messages}}
</output>
```

This works because the value I want to evaluate already exists in the dataset.

**Problem**

When running an experiment in the playground, Phoenix still requires an LLM task that generates an output, even though in this case:
The value I want to evaluate already exists in the dataset (stored as reference)
The evaluator does not depend on a newly generated model output
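For reference, here is roughly what I am asking the server to do, sketched client-side in plain Python: iterate over dataset rows and score the stored reference directly, with no task step at all. The row shape, the `expected_tool` field, and the evaluator logic are simplified stand-ins, not Phoenix code:

```python
# Plain-Python sketch: evaluate dataset rows directly, with no LLM task.
# Each row already carries the value to judge in "reference"; the "task"
# step is simply absent.
rows = [
    {
        "input": {"messages": [{"role": "user", "content": "Some 7 e 93."}]},
        "reference": {
            "messages": [{
                "role": "assistant",
                "tool_calls": [{"function": {"name": "SumTool",
                                             "arguments": {"a": 7, "b": 93}}}],
            }]
        },
        "expected_tool": "SumTool",  # hypothetical metadata for the check
    },
]

def tool_selection_evaluator(row):
    """Score the tool call already stored in the reference (no generation)."""
    calls = row["reference"]["messages"][0].get("tool_calls", [])
    chosen = calls[0]["function"]["name"] if calls else None
    return {"name": "tool_selection",
            "score": 1.0 if chosen == row["expected_tool"] else 0.0}

results = [tool_selection_evaluator(row) for row in rows]
print(results)  # [{'name': 'tool_selection', 'score': 1.0}]
```

A server-side equivalent would simply feed each row's `reference` into the evaluator template, with no dummy task and no per-row LLM call.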
**Current workaround**

To trigger the evaluator, I created a dummy task with a prompt like:

```
Output the content "1"
```

This technically works, but it is:
hacky
unnecessarily expensive, since it calls an LLM for every dataset row even though the result is irrelevant
**Additional issue**

While using this workaround, I frequently encounter the following error:

```
Key (experiment_run_id, name)=(18188, tool_selection) already exists in experiment_run_annotations
```

The failure point appears random: sometimes it occurs after processing 2 rows, sometimes after 3.

**Question**

Is there currently a way to run server-side evaluators directly on dataset rows (`input` + `reference`) without requiring an LLM task to generate an output? That would make the trace → dataset → evaluator workflow much smoother for cases where we want to evaluate previously observed outputs rather than generate new ones.
