Hi guys, having a lot of phun ( 😉 ) using Phoenix in our dev workflow. We've been attempting to set up a longitudinal framework that shows us the performance of evals across multiple experiments. This doesn't seem to be supported in the self-hosted Docker version. Is this a premium feature?
Hey Damith S., lots of improvements to experiments are coming down the pipeline: https://github.com/Arize-ai/phoenix/issues/3398 Time series and histograms will be part of that upcoming work. Phoenix doesn't have any feature gates, so these will be available as we ship them! In the meantime, you can download and pull experiments via the API, which can be used to accomplish this right now. Please let us know on the issue if you have any additional ideas and we can prioritize them :)
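To make the "pull experiments via the API" suggestion concrete, here's a minimal sketch of building a longitudinal view by hand. The endpoint path, base URL, and JSON field names (`created_at`, `eval_scores`) are assumptions for illustration only; check the REST API docs on your self-hosted instance for the actual schema.

```python
# Sketch: fetch experiment results from a self-hosted Phoenix instance and
# aggregate eval scores across experiments into a longitudinal series.
# NOTE: the endpoint path and the response fields used below are assumptions;
# adapt them to the schema your Phoenix version actually exposes.
import json
import urllib.request
from collections import defaultdict

PHOENIX_URL = "http://localhost:6006"  # hypothetical self-hosted address


def fetch_experiment(experiment_id: str) -> dict:
    """Fetch one experiment as JSON (endpoint path is an assumption)."""
    url = f"{PHOENIX_URL}/v1/experiments/{experiment_id}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def longitudinal_scores(experiments: list[dict]) -> dict[str, list[tuple[str, float]]]:
    """Group mean eval scores by eval name, ordered by experiment timestamp.

    Each experiment dict is assumed to carry a `created_at` timestamp and an
    `eval_scores` mapping of eval name -> list of per-example scores.
    """
    series: dict[str, list[tuple[str, float]]] = defaultdict(list)
    for exp in sorted(experiments, key=lambda e: e["created_at"]):
        for name, scores in exp["eval_scores"].items():
            mean = sum(scores) / len(scores)
            series[name].append((exp["created_at"], mean))
    return dict(series)
```

From there, `longitudinal_scores` output drops straight into whatever plotting tool you like (one line per eval name, timestamps on the x-axis) until the built-in time-series views land.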
Thanks Mikyo (sorry for the radio silence). Also, wondering if anyone has used Phoenix for voice evals, e.g. evaluating OpenAI's Realtime API.
Would be a really big plus to have
