Examples of Phoenix Repos for Building Eval Suites with Chat Agents
Are there any good example repos out there using Phoenix to implement a production-ready eval suite? I'm trying to use Phoenix experiments as my go-to tool but struggling to figure out how to architect a repo well.

My project has a collection of chat agents built with LangChain / LlamaIndex. I want to develop a set of evals for each agent, just like you might write integration tests against each API endpoint. For each agent I'll want a specific set of tests.

Learning about experiments, my first thought was to develop one dataframe per chat agent and run one run_experiment() against it. This architecture doesn't seem flexible enough: I run into the issue that I really want different sorts of evals for each question in the dataframe. For example, if I have two separate questions that I want to run independent keyword evals against, there doesn't seem to be a good way of making an eval conditional, at least not with the out-of-the-box ones like ContainsAnyKeyword. I would love to see a basic example repo showing how to build a good suite of evals.
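To make the "conditional eval" idea concrete, here is a rough sketch of the workaround I've been considering: store the per-question keywords in each dataset example's metadata and write a single custom evaluator function that reads them, skipping examples that don't configure the check. The metadata field name `expected_keywords` and the skip-with-`None` convention are my own assumptions, not anything Phoenix prescribes.

```python
# Sketch of a per-example conditional keyword evaluator for Phoenix
# experiments. Instead of one global ContainsAnyKeyword, each dataset
# example carries its own keywords in metadata, and one custom
# evaluator applies the check only where it is configured.
# NOTE: "expected_keywords" is a made-up metadata key, not a Phoenix
# convention.

def keyword_eval(output, metadata):
    """Return 1.0 if the agent output contains any keyword listed in
    this example's metadata, 0.0 if none match, and None to skip
    examples that configure no keyword check."""
    keywords = (metadata or {}).get("expected_keywords")
    if not keywords:
        return None  # no keyword check configured for this example
    text = str(output).lower()
    return float(any(kw.lower() in text for kw in keywords))

# The function would then be passed alongside the agent task, e.g.:
#
#   from phoenix.experiments import run_experiment
#   run_experiment(dataset, task=my_agent, evaluators=[keyword_eval])
```

Is something along these lines the intended pattern, or is there a better-supported way to vary evals per example?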
