Yes, you can implement fully custom evaluation logic (e.g., diffing nested JSON documents) and log the results to Phoenix so that they appear in the UI, are linked to traces, and are treated as native evaluations. Phoenix supports logging any evaluation results (label, score, explanation) as annotations on spans or traces, regardless of whether they came from built-in, LLM-based, or code-based evaluators. Generate a DataFrame with your custom logic, make sure it includes the correct span IDs, and upload it with px.Client().log_evaluations(SpanEvaluations(...)). The results will then be visible in the UI and filterable like any other evals (<https://github.com/Arize-ai/phoenix/blob/main/docs/evaluation/how-to-evals/evaluating-phoenix-traces.md|docs>, <https://arize-ai.slack.com/archives/C04R3GXC8HK/p1716404752.254739|slack discussion>).
You are not limited to LLM or template-based evaluators: Phoenix explicitly supports custom Python functions for evaluation, including complex logic such as JSON diffing, and these can be run offline or as part of your pipeline. The only requirement is that your results are formatted with the expected columns (label, score, explanation, and span_id) before logging. See <https://github.com/Arize-ai/phoenix/blob/main/docs/evaluation/how-to-evals/evaluating-phoenix-traces.md|this example> and <https://github.com/Arize-ai/phoenix/blob/main/docs/evaluation/how-to-evals/bring-your-own-evaluator.md|this guide> for details.
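As a rough illustration of the flow described above, here is a minimal sketch of a code-based JSON diff evaluator. The helper names (json_diff, evaluate_span, log_to_phoenix), the "match"/"mismatch" labels, and the eval name "json_diff" are hypothetical choices for this example, not part of Phoenix. The upload step assumes pandas and arize-phoenix are installed, that a Phoenix instance is reachable, and that SpanEvaluations accepts a DataFrame indexed by span_id; check the linked docs for your version.

```python
from typing import Any


def json_diff(expected: Any, actual: Any, path: str = "$") -> list[str]:
    """Recursively collect human-readable differences between two JSON-like values."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        diffs: list[str] = []
        for key in sorted(expected.keys() | actual.keys()):
            child = f"{path}.{key}"
            if key not in actual:
                diffs.append(f"{child}: missing in actual")
            elif key not in expected:
                diffs.append(f"{child}: unexpected key")
            else:
                diffs.extend(json_diff(expected[key], actual[key], child))
        return diffs
    if isinstance(expected, list) and isinstance(actual, list):
        diffs = []
        if len(expected) != len(actual):
            diffs.append(f"{path}: length {len(expected)} != {len(actual)}")
        # Compare element by element over the shared prefix.
        for i, (e, a) in enumerate(zip(expected, actual)):
            diffs.extend(json_diff(e, a, f"{path}[{i}]"))
        return diffs
    if expected != actual:
        return [f"{path}: expected {expected!r}, got {actual!r}"]
    return []


def evaluate_span(span_id: str, expected: Any, actual: Any) -> dict:
    """Build one evaluation row with the columns named above."""
    diffs = json_diff(expected, actual)
    return {
        "span_id": span_id,
        "label": "match" if not diffs else "mismatch",  # hypothetical labels
        "score": 1.0 if not diffs else 0.0,
        "explanation": "; ".join(diffs) or "no differences",
    }


def log_to_phoenix(rows: list[dict], eval_name: str = "json_diff") -> None:
    """Upload rows as span evaluations (assumes pandas, arize-phoenix, and a running Phoenix)."""
    import pandas as pd
    import phoenix as px
    from phoenix.trace import SpanEvaluations

    df = pd.DataFrame(rows).set_index("span_id")
    px.Client().log_evaluations(SpanEvaluations(eval_name=eval_name, dataframe=df))
```

To use it, build one row per span with evaluate_span(span_id, expected, actual), using span IDs taken from your exported traces, and pass the list of rows to log_to_phoenix. The diffing and row-building parts are plain Python, so you can run and test them offline before uploading anything.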
Would you like a step-by-step code example for a custom JSON diff evaluator?
Sources:
- <https://github.com/Arize-ai/phoenix/blob/main/docs/evaluation/how-to-evals/evaluating-phoenix-traces.md|Evaluating Phoenix Traces>
- <https://github.com/Arize-ai/phoenix/blob/main/docs/evaluation/how-to-evals/bring-your-own-evaluator.md|Bring Your Own Evaluator>
- <https://arize-ai.slack.com/archives/C04R3GXC8HK/p1716404752.254739|Slack discussion: custom evaluators>