Exploring Custom Evaluation Logic in Phoenix: Need Insights

·Jun 10, 2025 03:57 PM

Hey team, 👋 I am currently exploring Phoenix for tracing and evaluations. While the built-in eval templates are helpful, our use case requires more complex logic than what the current templates or the custom evaluator support (categorical/numerical) allows. Specifically, we want to define our own structured evaluation logic (e.g., diffing nested JSONs), and ideally log these results fully integrated into Phoenix, meaning visible in the UI, linked to traces, and treated just like native evaluation results. Has anyone tried bypassing the default evaluator path to implement this kind of custom logic? Would love to hear how others have approached more advanced or structured judgment tasks. Thanks in advance!

6 comments


RunLLM
APP
·
Yes, you can implement fully custom evaluation logic (e.g., diffing nested JSONs) and log the results to Phoenix so they appear in the UI, linked to traces, and treated as native evaluations. Phoenix supports logging any evaluation results (label, score, explanation) as annotations on spans or traces, regardless of whether they were generated by built-in, LLM-based, or code-based evaluators. You simply need to generate a DataFrame with your custom logic, ensure it includes the correct span IDs, and use px.Client().log_evaluations(SpanEvaluations(...)) to upload the results. These will be visible in the UI and filterable like any other evals (docs, slack discussion).

You are not limited to LLM or template-based evaluators: Phoenix explicitly supports custom Python functions for evaluation, including complex logic such as JSON diffing, and these can be run offline or as part of your pipeline. The only requirement is that your results are formatted with the expected columns (label, score, explanation, and span_id) before logging. See this example and this guide for details.

Would you like a step-by-step code example for a custom JSON diff evaluator?

Sources:
- Evaluating Phoenix Traces
- Bring Your Own Evaluator
- Slack discussion: custom evaluators
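For readers who want to see the shape of this, here is a minimal sketch of a code-based JSON diff evaluator whose results are formatted with the label, score, and explanation columns mentioned above. The helper names (json_diff, evaluate_span, log_to_phoenix) and the eval name "json_diff" are hypothetical. The sketch also assumes the DataFrame is indexed by a column named context.span_id, which may differ across Phoenix versions, so treat the upload step as an outline rather than a verified recipe.

```python
from typing import Any


def json_diff(expected: Any, actual: Any, path: str = "$") -> list[str]:
    """Return human-readable differences between two JSON-like values."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        diffs = []
        for key in sorted(set(expected) | set(actual)):
            child = f"{path}.{key}"
            if key not in actual:
                diffs.append(f"{child}: missing in actual")
            elif key not in expected:
                diffs.append(f"{child}: unexpected key")
            else:
                diffs.extend(json_diff(expected[key], actual[key], child))
        return diffs
    if isinstance(expected, list) and isinstance(actual, list):
        diffs = []
        if len(expected) != len(actual):
            diffs.append(f"{path}: length {len(expected)} != {len(actual)}")
        # Compare the overlapping prefix element by element.
        for i, (e, a) in enumerate(zip(expected, actual)):
            diffs.extend(json_diff(e, a, f"{path}[{i}]"))
        return diffs
    if expected != actual:
        return [f"{path}: expected {expected!r}, got {actual!r}"]
    return []


def evaluate_span(span_id: str, expected: Any, actual: Any) -> dict:
    """Build one evaluation row (hypothetical helper) for a single span."""
    diffs = json_diff(expected, actual)
    return {
        "context.span_id": span_id,  # assumed index name, see lead-in
        "label": "match" if not diffs else "mismatch",
        "score": 1.0 if not diffs else 0.0,
        "explanation": "; ".join(diffs) or "no differences",
    }


def log_to_phoenix(rows: list[dict], eval_name: str = "json_diff") -> None:
    """Upload rows as span evaluations; requires pandas and arize-phoenix."""
    # Imported lazily so the diff logic above runs without Phoenix installed.
    import pandas as pd
    import phoenix as px
    from phoenix.trace import SpanEvaluations

    df = pd.DataFrame(rows).set_index("context.span_id")
    px.Client().log_evaluations(SpanEvaluations(eval_name=eval_name, dataframe=df))
```

In practice you would collect one row per span (pulling the expected and actual payloads from wherever your pipeline stores them) and pass the list to log_to_phoenix while a Phoenix instance is running.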
Dustin N.
·
hi Nikhileshwar r. thanks so much for your interest in using Phoenix! At the moment this level of deep customization is possible but not explicitly supported in phoenix.evals — I can sketch out a hacky solution if you'd like but it wouldn't be the path I recommend. We have much fuller support for custom evaluation in the phoenix.experiments module, which fully supports defining things like JSON diffing. The caveat is you'll have to structure the evaluation you'd like to do in the form of a dataset (with an input, an expected output, and a task to run, essentially a lambda that operates on the input and produces an output). Evaluators can be run on the experiment, comparing each task output with each example's expected output, where you can define a JSON diff evaluation (which can be continuous, simply a label, or whatever you want)
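A rough sketch of the experiments route described above, assuming the evaluator receives the task output and the example's expected output through parameters named output and expected. The dataset name, column names, task body, and the flatten / json_match_score helpers are all hypothetical, and the upload_dataset and run_experiment calls follow my reading of the datasets and experiments quickstart, so check them against your installed version.

```python
import json
from typing import Any


def flatten(value: Any, path: str = "$") -> dict[str, Any]:
    """Flatten nested JSON into a {path: leaf_value} mapping."""
    if isinstance(value, dict):
        items = [(f"{path}.{k}", v) for k, v in value.items()]
    elif isinstance(value, list):
        items = [(f"{path}[{i}]", v) for i, v in enumerate(value)]
    else:
        return {path: value}
    if not items:
        # Keep empty containers as leaves so they still count in the score.
        return {path: value}
    flat: dict[str, Any] = {}
    for child_path, child in items:
        flat.update(flatten(child, child_path))
    return flat


def json_match_score(output: Any, expected: Any) -> float:
    """Continuous evaluator: fraction of leaf paths with identical values."""
    got, want = flatten(output), flatten(expected)
    paths = set(got) | set(want)
    same = sum(1 for p in paths if p in got and p in want and got[p] == want[p])
    return same / len(paths)


def run_in_phoenix() -> None:
    """Outline of the dataset + experiment flow; needs a running Phoenix."""
    import pandas as pd
    import phoenix as px
    from phoenix.experiments import run_experiment

    # Hypothetical one-row dataset: a raw payload and the JSON we expect back.
    df = pd.DataFrame({"payload": ['{"a": 1}'], "expected_json": [{"a": 1}]})
    dataset = px.Client().upload_dataset(
        dataset_name="json-diff-demo",
        dataframe=df,
        input_keys=["payload"],
        output_keys=["expected_json"],
    )

    def task(input):
        # Stand-in for the real system under test; mirrors the expected shape.
        return {"expected_json": json.loads(input["payload"])}

    run_experiment(dataset, task, evaluators=[json_match_score])
```

Returning a float gives a continuous score; returning a string or bool instead would make the same evaluator behave like a label, which matches the "continuous, simply a label, or whatever you want" point above.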
Nikhileshwar r.
·
Dustin N. Thank you for the reply. I would really appreciate it if you could point me towards the right documentation or GitHub repos that can help me get started with phoenix.experiments.
Dustin N.
·
Here's the datasets and experiments quickstart! https://arize.com/docs/phoenix/datasets-and-experiments/quickstart-datasets
Dustin N.
·
notably, each experiment will have a trace associated with it, and if your lambda contains instrumented code, the spans should be properly nested under each trace
Nikhileshwar r.
·
Thank you!
