Is there a method in the new phoenix client to log individual evals to an experiment? I haven't found one, but I'd like to scale evals out to multiple machines, all ending up in the same experiment run. Our process is that we take our task, run it async using queues in the cloud, and then publish results to the experiment async. Right now, this is all done through the REST API, but it would be nice to leverage the client if possible
There is currently no method in the new Phoenix client to log individual evaluations directly to an experiment from multiple machines; evaluations are added via the evaluate_experiment method, which runs evaluators over the entire experiment or dataset in batch, not as individual or incremental logs. The client supports running additional evaluations post-hoc (after the experiment is created) using evaluate_experiment, and you can retrieve a completed experiment with get_experiment, but there is no API for streaming or logging single evals from distributed workers directly into an experiment run. That is only possible via the REST API endpoints for experiment evaluations, as you are currently doing (docs, run experiments guide, REST API PR).

If you want to scale out evals across machines and aggregate them into a single experiment, you must continue using the REST API to POST individual evaluations to the appropriate experiment run endpoint; the Python/TypeScript clients do not currently expose a method for logging single evals incrementally from distributed processes (see REST endpoints). Would you like more detail on the REST API usage or possible workarounds?

Sources: docs, run experiments guide
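For example, a distributed worker can attach a single evaluation to an existing experiment run over the REST API roughly like this. This is a minimal sketch using httpx; the `/v1/experiment_evaluations` path and the payload fields shown are assumptions, so verify them against your Phoenix server's OpenAPI schema before relying on them.

```python
# Minimal sketch: POST one evaluation for one experiment run via the Phoenix REST API.
# The endpoint path and payload fields below are assumptions -- verify them against
# your Phoenix server's OpenAPI schema before using.
import httpx

PHOENIX_URL = "http://localhost:6006"  # assumed base URL for your Phoenix instance


def log_single_eval(experiment_run_id: str, name: str, score: float, explanation: str) -> None:
    payload = {
        "experiment_run_id": experiment_run_id,  # the run this eval belongs to
        "name": name,                            # evaluator name, e.g. "correctness"
        "annotator_kind": "CODE",                # assumed enum value
        "result": {
            "score": score,
            "explanation": explanation,
        },
    }
    resp = httpx.post(f"{PHOENIX_URL}/v1/experiment_evaluations", json=payload, timeout=30.0)
    resp.raise_for_status()
```

Each cloud worker can call something like log_single_eval independently, so a failure in one eval stays local to that run instead of taking down the whole experiment.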
Cool. That's what I'd thought. For the Arize team, it would be cool to have a lower-level method that I can call 🙂 Part of the desire here is resiliency: making sure that a failure in the pipeline doesn't kill the whole experiment. The other is parallelization, potentially to different regions for increased capacity. With this, we could operate at scale and across teams with a common eval pipeline for simple evaluations
Hi Ben M., I think that's a great idea, and something I've wanted in the past myself. Could you file a ticket in the Phoenix repo explaining when you would need it? I will plus-one it.
You got it!
Thanks!
Ben M., turns out the method already exists. It's called evaluate_experiment: https://github.com/Arize-ai/phoenix/blob/42061a1d3e04a4fffe3087f9fb29fab40fd9cb01/src/phoenix/experiments/functions.py#L646 This is what you were looking for, correct? Let me know if not.
Ah, but this is not in the new client. Let's keep your ticket up, because it would be great to have it in the new client, but you can use this function in the meantime!
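Roughly, it's used like this (a sketch, not the new phoenix.client API; it assumes the dataset is pulled with the legacy client and that evaluate_experiment takes the experiment object plus a list of evaluators, so double-check against the functions.py linked above; call_my_model is a placeholder for your own task):

```python
# Rough sketch of the existing helper (not the new phoenix.client API).
# Assumes `dataset` comes from the legacy client and that evaluate_experiment
# accepts (experiment, evaluators=...) -- double-check against functions.py above.
import phoenix as px
from phoenix.experiments import run_experiment, evaluate_experiment

dataset = px.Client().get_dataset(name="my-dataset")  # dataset name is a placeholder


def task(example):
    # Produce an output for one dataset example; call_my_model is a placeholder.
    return call_my_model(example.input)


def contains_answer(output, expected) -> bool:
    # Simple code evaluator; Phoenix binds arguments by parameter name.
    return str(expected) in str(output)


experiment = run_experiment(dataset, task)
# Post-hoc, add more evals in batch over the whole experiment:
evaluate_experiment(experiment, evaluators=[contains_answer])
```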
That does get me one step closer. I would like to get to a point where we can also run individual tasks in the client. Meaning, I would love to be able to pull my dataset and also scale out the actual task completions and result logging. I put together a quick mermaid diagram to show this. I'd love each cell to be its own lambda function, reading from the queue, asynchronously building up our outputs, and then, once they're generated, running all of our evals in parallel in other cloud functions. I think the function you shared lets me do that last step, but I need to figure out whether the client has something for taking an example by id and logging it to an experiment run
I may end up just using httpx to do this via the API for now. It's not that much code in the grand scheme of things
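For illustration, each queue worker in that diagram could look roughly like the sketch below. It is only a rough outline: the /v1/... routes, response shapes, and the run_task / run_evals helpers are placeholders or assumptions to check against the Phoenix OpenAPI spec, not the actual client or server API.

```python
# Rough worker skeleton: pull one example off the queue, run the task, create the
# experiment run, then log each eval. All endpoint paths, payload fields, and
# response shapes are placeholders -- verify against the Phoenix OpenAPI spec.
import httpx

PHOENIX_URL = "http://localhost:6006"  # assumed base URL


async def handle_message(client: httpx.AsyncClient, experiment_id: str, example: dict) -> None:
    # run_task is your own async task logic (placeholder).
    output = await run_task(example["input"])

    # 1. Log the run for this example (endpoint path is an assumption).
    run_resp = await client.post(
        f"{PHOENIX_URL}/v1/experiments/{experiment_id}/runs",
        json={"dataset_example_id": example["id"], "output": output},
    )
    run_resp.raise_for_status()
    run_id = run_resp.json()["data"]["id"]  # response shape is an assumption

    # 2. Log each evaluation for that run in parallel-friendly, per-eval calls
    #    (run_evals is your own eval logic, a placeholder returning {name: score}).
    for name, score in run_evals(output, example).items():
        eval_resp = await client.post(
            f"{PHOENIX_URL}/v1/experiment_evaluations",
            json={"experiment_run_id": run_id, "name": name, "result": {"score": score}},
        )
        eval_resp.raise_for_status()
```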
