Exploring Centralized Evaluations for Phoenix Forecasting System
Hey team! 👋 We’re exploring the feasibility of running evaluations for our forecasting system with Phoenix, specifically whether Phoenix can support a centralized, Arize-like interface rather than requiring local SDK-based runs on individual machines. From what we’ve gathered so far:
- The current Phoenix architecture appears to treat the UI purely as a visual layer: all evaluation logic has to run locally via the Python SDK (installed with `pip install arize-phoenix`).
- The typical flow involves exporting trace data from Phoenix, running our evaluation logic (heuristics, LLM-as-a-judge, etc.) locally, and then pushing the annotated results back to Phoenix so they appear in the UI.
- We’re using this notebook as our base setup: `evals_quickstart.ipynb`
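For concreteness, here is a sketch of that export → evaluate → log-back loop as we understand it from the quickstart. Hedged: the API names (`px.Client`, `get_spans_dataframe`, `llm_classify`, `log_evaluations`) come from arize-phoenix's published docs and may differ across versions; the endpoint and column mapping are placeholders, and running it requires a live Phoenix instance plus an OpenAI key (imports are deferred into the function so the sketch can be read without one):

```python
def run_phoenix_evals(endpoint: str = "http://localhost:6006") -> None:
    """Export spans, judge them with an LLM, and log results back to Phoenix.

    Sketch only: assumes a running Phoenix instance at `endpoint` and an
    OPENAI_API_KEY in the environment. Exact APIs may vary by version.
    """
    # Deferred imports: only needed when actually running against a deployment.
    import phoenix as px
    from phoenix.evals import (
        HALLUCINATION_PROMPT_RAILS_MAP,
        HALLUCINATION_PROMPT_TEMPLATE,
        OpenAIModel,
        llm_classify,
    )
    from phoenix.trace import SpanEvaluations

    client = px.Client(endpoint=endpoint)  # placeholder endpoint

    # 1. Export trace data from Phoenix into a dataframe.
    #    (In practice you'd likely filter/rename columns to match the
    #    template's expected inputs, e.g. "input", "reference", "output".)
    spans_df = client.get_spans_dataframe()

    # 2. Run evaluation logic locally (LLM-as-a-judge here; heuristics work too).
    eval_df = llm_classify(
        dataframe=spans_df,
        model=OpenAIModel(model="gpt-4o-mini"),  # placeholder judge model
        template=HALLUCINATION_PROMPT_TEMPLATE,
        rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    )

    # 3. Push the annotated results back so they show up in the Phoenix UI.
    client.log_evaluations(
        SpanEvaluations(eval_name="Hallucination", dataframe=eval_df)
    )
```

This is the per-machine workflow we'd like to avoid duplicating across the team.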
Our main question: 👉 Is there any way to run these evaluations centrally or on a server, rather than requiring every team member to run them manually via the SDK on their own machine? If that's not currently supported, are there any best practices or community-led patterns for scaling this workflow in a team setting? Appreciate any clarification or guidance from folks who’ve tackled this before!
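One workaround we've been considering (to be clear, this is our own hypothetical pattern, not, as far as we know, a built-in Phoenix feature): deploy the same SDK eval script to a single server or CI job and run it on an interval, so results land in Phoenix without anyone running evals from a laptop. A minimal sketch, where `job` would wrap the export → evaluate → log-back loop above and is stubbed here:

```python
import time


def run_on_interval(job, interval_s: float, rounds: int) -> list:
    """Call `job` every `interval_s` seconds for `rounds` rounds, collecting results.

    A long-running deployment would use rounds=... effectively forever (or a
    cron/CI schedule instead); a small finite count keeps the sketch testable.
    """
    results = []
    for _ in range(rounds):
        results.append(job())  # e.g. job = run_phoenix_evals wrapped with error handling
        time.sleep(interval_s)
    return results


# Stub job standing in for the real eval run:
print(run_on_interval(lambda: "evals logged", interval_s=0.0, rounds=2))
# → ['evals logged', 'evals logged']
```

Does anything like this (or a Phoenix-native scheduled/server-side eval runner) already exist?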
