Phoenix does not natively support running evaluations centrally from the UI; all evaluation logic (including LLM-as-a-judge or heuristics) must be executed via the Python SDK or scripts, with results then logged back to Phoenix for visualization. The UI is strictly a visualization and annotation layer, not an execution environment for evals or batch jobs; this is confirmed in the evals_quickstart notebook and official docs.

For centralized or team-scale workflows, the recommended pattern is to set up a shared server (or scheduled job, e.g., with cron) that periodically pulls new traces, runs evaluations, and logs results back to Phoenix; see the online evals guide and example script. This allows teams to automate and centralize evaluation runs, avoiding the need for each user to run SDK code locally.

Would you like more detail on how to set up such a centralized evaluation workflow?

Sources:
- Online Evals | Phoenix Docs
- Online Evals Example Script
- evals_quickstart notebook
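The pull-evaluate-log loop described above can be sketched as a small script you schedule with cron. This is a hedged sketch, not an official example: it assumes the `arize-phoenix` SDK (`px.Client`, `get_spans_dataframe`, `log_evaluations`, `SpanEvaluations`), a locally running Phoenix server at an assumed endpoint, and a toy keyword heuristic standing in for a real eval (LLM-as-a-judge or otherwise). Column names like `attributes.output.value` may differ in your trace schema.

```python
# Sketch of a scheduled evaluation job (e.g. run hourly via cron) that pulls
# traces from a Phoenix server, scores them, and logs results back.
# Assumed: the `arize-phoenix` SDK and a server at the endpoint below.

def keyword_eval(output: str) -> tuple[str, float]:
    """Toy heuristic eval: flag responses that look like refusals.

    Stand-in for a real evaluator (e.g. an LLM-as-a-judge call).
    Returns a (label, score) pair.
    """
    refusal_markers = ("i cannot", "i can't", "as an ai")
    text = (output or "").lower()
    if any(marker in text for marker in refusal_markers):
        return "refusal", 0.0
    return "ok", 1.0


def main() -> None:
    # SDK imports are kept local so the heuristic above stays importable
    # and testable without a Phoenix installation.
    import pandas as pd
    import phoenix as px
    from phoenix.trace import SpanEvaluations

    client = px.Client(endpoint="http://localhost:6006")  # assumed endpoint
    spans = client.get_spans_dataframe()  # pull recorded traces

    # Score each span's output; column name is an assumption about the schema.
    results = spans["attributes.output.value"].map(keyword_eval)
    evals = pd.DataFrame(
        {
            "label": results.map(lambda pair: pair[0]),
            "score": results.map(lambda pair: pair[1]),
        },
        index=spans.index,  # index must be span IDs so Phoenix can join evals
    )

    # Log results back so they appear alongside traces in the Phoenix UI.
    client.log_evaluations(
        SpanEvaluations(eval_name="refusal_check", dataframe=evals)
    )


if __name__ == "__main__":
    main()
```

Keeping the eval function pure (plain text in, label/score out) makes it easy to unit-test and to swap for an LLM-based judge later without touching the pull/log plumbing.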
It's not built into Phoenix, but having this in a separate service isn't terrible
Thanks Ben M. for the comment. In general, Phoenix doesn't currently run evals as part of a job; it relies on you to pull data and run evaluations on a regular cadence. We will add this level of online evaluation, but we want to iron out all our evals first. If you are looking for evals as a managed service, you can try out Arize Ax, which does have job infrastructure.
