Thanks! I'll keep an eye on it.
Hi everyone, I am here again. I was really looking forward to the server-side evaluators feature, so I'm glad to see it released in v13. One workflow I expected it to enable is rapidly iterating from traces → dataset → evaluation directly inside Phoenix, without having to write additional code. However, unless I'm missing something, this workflow doesn't seem fully supported yet.

What I'm trying to do

I created a dataset from LLM spans. In my case:
The input contains the tool schema and user message.
The tool call produced by the model is the value I want to evaluate.
However, since Phoenix datasets require input and reference, the tool call ends up being stored as the reference. This does not represent ground truth; it is simply the observed output that I want to run evaluators against.

Example input:
{
  "tools": [...],
  "messages": [
    {
      "role": "user",
      "content": "Sum 7 and 93."
    }
  ]
}

Example reference (the tool call produced by the model in the trace):
{
  "messages": [
    {
      "role": "assistant",
      "tool_calls": [
        {
          "function": {
            "name": "SumTool",
            "arguments": {
              "a": 7,
              "b": 93
            }
          }
        }
      ]
    }
  ]
}

I then adapted the tool selection evaluator template so it reads from reference instead of output.

Original template:
<output>
{{#output.messages}}
...
{{/output.messages}}
</output>

Modified template:
<output>
{{#reference.messages}}
...
{{/reference.messages}}
</output>

This works because the value I want to evaluate already exists in the dataset.

Problem

When running an experiment in the playground, Phoenix still requires an LLM task that generates an output, even though in this case:
The value I want to evaluate already exists in the dataset (stored as reference)
The evaluator does not depend on a newly generated model output
Current workaround

To trigger the evaluator, I created a dummy task with a prompt like:

Output the content "1"

This technically works, but it is:
hacky
unnecessarily expensive, since it calls an LLM for every dataset row even though the result is irrelevant
Additional issue

While using this workaround, I frequently encounter the following error:

Key (experiment_run_id, name)=(18188, tool_selection) already exists in experiment_run_annotations

The failure point appears to be random: sometimes it occurs after processing 2 rows, sometimes after 3, and so on.

Question

Is there currently a way to run server-side evaluators directly on dataset rows (input + reference) without requiring an LLM task to generate an output? That would make the trace → dataset → evaluator workflow much smoother for cases where we want to evaluate previously observed outputs rather than generate new ones.
Cool. Unfortunately I did not have time to implement it from scratch and open a proper PR, but I'm looking forward to an official implementation.
Hi team, I built a small feature that adds a visual prompt-version diff viewer to Phoenix. It lets users pick two prompt versions and see a side-by-side diff with added, removed, and modified lines highlighted.

This improves prompt-management usability: without a clear diff view it is hard to know exactly what changed between prompt versions, and teams want to review differences before promoting a prompt to production, much like a code review. The absence of a prompt diff viewer was a major reason our team avoided Phoenix's prompt-management feature.

You can see the implementation and usage in the repo: https://github.com/jmarcosmn/phoenix-prompt-diff

Disclaimer: the code is AI-generated. Could your team consider implementing something similar in the near future, or do you think I could adapt this code to match your contribution guidelines and open a PR? Here are some pics of the feature:
