Hi guys, been having so much fun using Phoenix. We've been trying to set up a longitudinal framework that shows us eval performance across multiple experiments. This doesn't seem to be supported in the self-hosted Docker version - is this a premium feature?
Hey Damith, great to hear it's been going well so far! You can view multiple evals across multiple experiments in Phoenix - adding some examples here. It sounds like what you might be looking for is a graphical representation of the eval scores over time? If that's the case, that is something in the enterprise Arize platform (Arize AX) today, and something that's planned for Phoenix as well
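In the meantime, a rough way to get that longitudinal view outside the UI is to export each experiment's eval scores and aggregate them yourself - a minimal sketch with illustrative data (the record fields here are assumptions for the example, not Phoenix's actual export schema):

```python
from datetime import date
from statistics import mean

# Hypothetical export: one record per example per experiment run.
# (Field names are illustrative, not Phoenix's actual schema.)
records = [
    {"experiment": "exp-1", "run_date": date(2024, 5, 1), "eval": "relevance", "score": 0.72},
    {"experiment": "exp-1", "run_date": date(2024, 5, 1), "eval": "relevance", "score": 0.80},
    {"experiment": "exp-2", "run_date": date(2024, 5, 8), "eval": "relevance", "score": 0.85},
    {"experiment": "exp-2", "run_date": date(2024, 5, 8), "eval": "relevance", "score": 0.91},
]

def scores_over_time(records, eval_name):
    """Mean score per experiment for one eval, ordered by run date."""
    by_exp = {}
    for r in records:
        if r["eval"] == eval_name:
            by_exp.setdefault((r["run_date"], r["experiment"]), []).append(r["score"])
    return [(exp, mean(scores)) for (day, exp), scores in sorted(by_exp.items())]

print(scores_over_time(records, "relevance"))
```

From there the (experiment, mean) pairs can be dropped into any plotting tool to chart the trend.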
this is an interesting feature in LangSmith also: https://docs.smith.langchain.com/evaluation/how_to_guides/evaluate_pairwise
they have a ui to compare the pairs visually
also if this is available in the paid Arize, where can I find info about it and a comprehensive feature list? thanks
I am currently evaluating LangSmith vs Phoenix and Arize Pro/Enterprise, but I can't find info on these smaller features to know what's included where
Tiago F. - which issue are you looking at there that got closed? Comparison views for experiments are still on the docket from my understanding. In terms of a doc comparing the two platforms, we have a somewhat granular comparison here that might help between Phoenix and Arize AX. In terms of the comparison to LangSmith, some key features that would be missing in their platform are:
Ability to handle scale - we have a custom-built database specifically for AI workloads that significantly outpaces theirs
Monitoring options, custom monitors, connections to PagerDuty, etc.
Arize's Copilot agent, which can help debug traces, surface insights, and run semantic search over traces
Multimodal evals
Would be happy to run through more details live with you, or fill out your own requirements doc
John G. I meant the issue you just reopened 🙂 - https://github.com/Arize-ai/phoenix/issues/3738
I meant that the pricing page does not mention, for example, that Arize AX has a pairwise evaluations UI. You mentioned Arize AX has a graphical representation of the eval scores over time, but I mean more the LangSmith example of comparing 2 evals: https://docs.smith.langchain.com/evaluation/how_to_guides/evaluate_pairwise#view-pairwise-experiments It's a UI to visually compare the ratings of 2 different responses (can be an LLM judge or human annotation, I guess). I know anything can be done in code, so I am mostly comparing UI features
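For what it's worth, the core of what that pairwise UI computes can be sketched in a few lines - pair each example's two responses, ask a judge which is better, and tally the wins. The judge below is a stand-in stub (it just prefers the longer response), not a real LLM call or any Phoenix/LangSmith API:

```python
from collections import Counter

def judge(prompt, response_a, response_b):
    """Stand-in for an LLM judge or human annotator: returns 'a', 'b', or 'tie'.

    Here we simply prefer the longer response, purely for illustration.
    """
    if len(response_a) > len(response_b):
        return "a"
    if len(response_b) > len(response_a):
        return "b"
    return "tie"

def pairwise_compare(examples, run_a, run_b):
    """Tally pairwise preferences across a dataset.

    examples: list of prompts; run_a / run_b: dicts mapping prompt -> response.
    """
    tally = Counter(judge(p, run_a[p], run_b[p]) for p in examples)
    return {"a_wins": tally["a"], "b_wins": tally["b"], "ties": tally["tie"]}

examples = ["q1", "q2", "q3"]
run_a = {"q1": "short", "q2": "a much longer answer", "q3": "same len"}
run_b = {"q1": "a longer reply", "q2": "tiny", "q3": "same len"}
print(pairwise_compare(examples, run_a, run_b))
# -> {'a_wins': 1, 'b_wins': 1, 'ties': 1}
```

The UI part is then just rendering those per-example verdicts and totals side by side - which is the bit that's handy for non-coders.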
All those you listed are for AX, not Phoenix, so I guess at the moment Phoenix is less complete than LangSmith (missing the dynamic few-shot UI and the pairwise UI at least). So AX is better for production use
Tiago F. - good callout here! We don't currently have a pairwise visualization for experiments in Arize AX. I've also filed a ticket for our team to look at that on the AX side. I think generally it's best to compare Arize AX to LangSmith in this circumstance. Phoenix has a bit fewer features than AX at the moment, though it does have the benefit of being open-source. Speaking as unbiasedly as I can, I'd say that LangSmith sits between the two platforms at the moment - it has more features than Phoenix, but isn't as comprehensive as AX. There are a ton of specific needs, views, and features that each platform has, so please let me know if there are any that you're wondering about in AX. Happy to give a deeper-dive walkthrough of the platform as well
Mostly interested in automating the eval pipeline for a non-technical QA team, for RAG apps - including LLM judge evals, summarization evals, dynamic few-shot prompting, and trace replays. Any feature that makes it easier to set up and adjust things from the UI is a plus for the QA team. A walkthrough of AX would be good, but I think they want to start simple and self-hosted and then move to a more comprehensive solution. Happy to hop on a demo. LangSmith has very extensive documentation with lots of examples, cookbooks, and premade evals, which is a plus given the lack of experienced developers, but leaning towards the Arize PX -> AX path
Awesome - coordinating on that demo time via dm!
