Is there a method in the new phoenix client to log individual evals to an experiment? I haven't found one, but I'd like to scale evals out to multiple machines, all ending up in the same experiment run. Our process is that we take our task, run it async using queues in the cloud, and then publish results to the experiment async. Right now, this is all done through the REST API, but it would be nice to leverage the client if possible
There is currently no method in the new Phoenix client to log individual evaluations directly to an experiment from multiple machines; evaluations are added via the evaluate_experiment method, which runs evaluators over the entire experiment or dataset in batch, not as individual or incremental logs. The client supports running additional evaluations post-hoc (after the experiment is created) using evaluate_experiment, and you can retrieve a completed experiment with get_experiment, but there is no API for streaming or logging single evals from distributed workers directly into an experiment run. That is only possible via the REST API endpoints for experiment evaluations, as you are currently doing (docs, run experiments guide, REST API PR).

If you want to scale out evals across machines and aggregate them into a single experiment, you must continue using the REST API to POST individual evaluations to the appropriate experiment run endpoint; the Python/TypeScript clients do not currently expose a method for logging single evals incrementally from distributed processes (see REST endpoints). Would you like more detail on the REST API usage or possible workarounds?

Sources: docs, run experiments guide
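For example, a distributed worker can attach a single evaluation to an existing experiment run over the REST API roughly like this. This is a minimal sketch using httpx; the `/v1/experiment_evaluations` path and the payload fields shown are assumptions, so verify them against your Phoenix server's OpenAPI schema before relying on them.

```python
# Minimal sketch: POST one evaluation for one experiment run via the Phoenix REST API.
# The endpoint path and payload fields below are assumptions -- verify them against
# your Phoenix server's OpenAPI schema before using.
import httpx

PHOENIX_URL = "http://localhost:6006"  # assumed base URL for your Phoenix instance


def log_single_eval(experiment_run_id: str, name: str, score: float, explanation: str) -> None:
    payload = {
        "experiment_run_id": experiment_run_id,  # the run this eval belongs to
        "name": name,                            # evaluator name, e.g. "correctness"
        "annotator_kind": "CODE",                # assumed enum value
        "result": {
            "score": score,
            "explanation": explanation,
        },
    }
    resp = httpx.post(f"{PHOENIX_URL}/v1/experiment_evaluations", json=payload, timeout=30.0)
    resp.raise_for_status()
```

Each cloud worker can call something like log_single_eval independently, so a failure in one eval stays local to that run instead of taking down the whole experiment.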
Cool. That's what I'd thought. For the Arize team, it would be cool to have a lower-level method that I can call 🙂 Part of the desire here is resiliency: making sure that a failure in the pipeline doesn't kill the whole experiment. The other is parallelization, potentially to different regions for increased capacity. With this, we could operate at scale and across teams with a common eval pipeline for simple evaluations
Hi Ben M., I think that's a great idea, and something I've wanted in the past myself. Could you file a ticket in the Phoenix repo explaining when you would need it? I will plus-one it.
You got it!
Thanks!
Ben M., turns out the method already exists. It's called evaluate_experiment: https://github.com/Arize-ai/phoenix/blob/42061a1d3e04a4fffe3087f9fb29fab40fd9cb01/src/phoenix/experiments/functions.py#L646 This is what you were looking for, correct? Let me know if not.
Ah, but this is not in the new client. Let's keep your ticket up, because it would be great to have it in the new client, but you can use this function in the meantime!
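Roughly, it's used like this (a sketch, not the new phoenix.client API; it assumes the dataset is pulled with the legacy client and that evaluate_experiment takes the experiment object plus a list of evaluators, so double-check against the functions.py linked above; call_my_model is a placeholder for your own task):

```python
# Rough sketch of the existing helper (not the new phoenix.client API).
# Assumes `dataset` comes from the legacy client and that evaluate_experiment
# accepts (experiment, evaluators=...) -- double-check against functions.py above.
import phoenix as px
from phoenix.experiments import run_experiment, evaluate_experiment

dataset = px.Client().get_dataset(name="my-dataset")  # dataset name is a placeholder


def task(example):
    # Produce an output for one dataset example; call_my_model is a placeholder.
    return call_my_model(example.input)


def contains_answer(output, expected) -> bool:
    # Simple code evaluator; Phoenix binds arguments by parameter name.
    return str(expected) in str(output)


experiment = run_experiment(dataset, task)
# Post-hoc, add more evals in batch over the whole experiment:
evaluate_experiment(experiment, evaluators=[contains_answer])
```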
That does get me one step closer. I would like to get to a point where we can also run individual tasks in the client. Meaning, I would love to be able to pull my dataset and also scale out the actual task completions and result logging. I put together a quick mermaid diagram to show this. I'd love each cell to be its own lambda function, reading from the queue, asynchronously building up our outputs, and then, once they're generated, running all of our evals in parallel in other cloud functions. I think the function you shared lets me do that last step, but I need to figure out whether the client has something for taking an example by id and logging it to an experiment run
I may end up just using httpx to do this via the API for now. It's not that much code in the grand scheme of things
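For illustration, each queue worker in that diagram could look roughly like the sketch below. It is only a rough outline: the /v1/... routes, response shapes, and the run_task / run_evals helpers are placeholders or assumptions to check against the Phoenix OpenAPI spec, not the actual client or server API.

```python
# Rough worker skeleton: pull one example off the queue, run the task, create the
# experiment run, then log each eval. All endpoint paths, payload fields, and
# response shapes are placeholders -- verify against the Phoenix OpenAPI spec.
import httpx

PHOENIX_URL = "http://localhost:6006"  # assumed base URL


async def handle_message(client: httpx.AsyncClient, experiment_id: str, example: dict) -> None:
    # run_task is your own async task logic (placeholder).
    output = await run_task(example["input"])

    # 1. Log the run for this example (endpoint path is an assumption).
    run_resp = await client.post(
        f"{PHOENIX_URL}/v1/experiments/{experiment_id}/runs",
        json={"dataset_example_id": example["id"], "output": output},
    )
    run_resp.raise_for_status()
    run_id = run_resp.json()["data"]["id"]  # response shape is an assumption

    # 2. Log each evaluation for that run in parallel-friendly, per-eval calls
    #    (run_evals is your own eval logic, a placeholder returning {name: score}).
    for name, score in run_evals(output, example).items():
        eval_resp = await client.post(
            f"{PHOENIX_URL}/v1/experiment_evaluations",
            json={"experiment_run_id": run_id, "name": name, "result": {"score": score}},
        )
        eval_resp.raise_for_status()
```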
