In the custom eval metrics tracker I was writing, I also had things like success rate excluding errored runs vs. overall success rate. I guess I'd just need a really customizable set of columns.
Are there plans to add more ways to explore the data in the experiments UI? I'd love to add custom columns. Right now there is average latency; I'd like to add median, longest latency, shortest latency, p99, p95, p75, etc.
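In the meantime I've been computing these offline from a list of per-run latencies. A rough sketch (nearest-rank percentiles; the sample numbers are made up, and `latency_summary` is just my own helper, not a Phoenix API):

```python
# Summary stats over a list of per-run latencies (in seconds).
def latency_summary(latencies):
    xs = sorted(latencies)

    def pct(p):
        # Nearest-rank percentile: index into the sorted list.
        k = max(0, min(len(xs) - 1, round(p / 100 * (len(xs) - 1))))
        return xs[k]

    return {
        "min": xs[0],
        "max": xs[-1],
        "median": pct(50),
        "p75": pct(75),
        "p95": pct(95),
        "p99": pct(99),
    }

# Made-up sample latencies for illustration:
print(latency_summary([0.8, 1.2, 0.9, 3.4, 1.1, 0.7, 2.0, 1.5]))
```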
Thanks for the response, it makes sense to manage the datasets like fixtures. How do you manage your test suites? Run them via pytest? I want to track metrics over time: token use, success rate of a specific eval, response speed, that sort of thing. I was rolling my own system to collect those metrics, but I want to do whatever is most standard.
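For context, the "rolling my own" part currently looks roughly like this: after each eval run I append a record to a JSONL log so I can graph the series later. The file name and record fields are entirely my own convention:

```python
# Append one metrics record per eval run to a JSONL log for later graphing.
import json
import time
from pathlib import Path

METRICS_LOG = Path("eval_metrics.jsonl")

def record_metrics(eval_name, success_rate, total_tokens, latency_s):
    record = {
        "ts": time.time(),          # when the run happened
        "eval": eval_name,          # which eval suite produced it
        "success_rate": success_rate,
        "total_tokens": total_tokens,
        "latency_s": latency_s,
    }
    with METRICS_LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Example usage (values are made up):
rec = record_metrics("code_lookup", success_rate=0.92, total_tokens=1840, latency_s=1.3)
print(rec["eval"], rec["success_rate"])
```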
If anyone wants to work on some best practices docs for how to structure eval suites, I'd be happy to help develop some materials once I've got it figured out.
For example, I am working on an agent that helps a user identify the right code from a long list of proprietary codes. In my dataframe I have three items like below:
Item 1
question: "What is the code for foo bar?"
answer: "The code is xyz123"
keywords: ["xyz123"]
Item 2
question: "What is the code for foo buzz?"
answer: "The code is abc123"
keywords: ["abc123"]
Item 3
question: "What is the capital of the USA?"
answer: "Please ask a question about our proprietary codes, I will not answer off topic questions."
For items 1 and 2 I can use an LLM eval like the one shown here to have the task run my question and then have an LLM evaluate my answer, sure. Really, though, I only need a keyword eval for those questions. For item 3 I want to evaluate that when an off-topic question is asked, my agent won't answer it. It looks to me like experiments are not the right fit for these sorts of tests. What I was hoping to find is a system where I can track the performance of evals over time; experiments looks like a system for running one specific set of evals over a large data set. For the case outlined above, how would the Phoenix team recommend I architect my eval suite? I'd like to have many sets of tests against specific questions like those shown above. I could have one item per dataframe and then a separate set of evals, but that doesn't seem to be what the system was designed for. Am I thinking about evals wrong?
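To make the question concrete, here is roughly the per-item dispatch I had in mind: items that carry keywords get a keyword check, and the off-topic item gets a refusal check. The function and field names are just my own sketch, not Phoenix APIs, and the refusal heuristic is deliberately crude (this is the case I'd expect to hand to an LLM judge instead):

```python
# Per-item evaluators, dispatched by what the dataset item contains.
def keyword_eval(output: str, keywords: list) -> bool:
    # Pass only if every expected keyword appears (case-insensitive).
    return all(kw.lower() in output.lower() for kw in keywords)

def refusal_eval(output: str) -> bool:
    # Crude string heuristic; an LLM judge would be more robust here.
    markers = ["off topic", "off-topic", "will not answer"]
    return any(m in output.lower() for m in markers)

def evaluate_item(item: dict, output: str) -> bool:
    # Items 1 and 2 carry keywords; item 3 does not, so it is treated
    # as an off-topic question that should be refused.
    if item.get("keywords"):
        return keyword_eval(output, item["keywords"])
    return refusal_eval(output)

item1 = {"question": "What is the code for foo bar?", "keywords": ["xyz123"]}
item3 = {"question": "What is the capital of the USA?"}
print(evaluate_item(item1, "The code is xyz123"))
print(evaluate_item(item3, "I will not answer off topic questions."))
```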
Are there any good example repos out there using Phoenix to implement a production-ready eval suite? I am trying to use Phoenix experiments as my go-to tool but am struggling to figure out how to architect a repo well. My project has a collection of chat agents built with LangChain / LlamaIndex. I want to develop a set of evals for each agent, just like you might write integration tests against each API endpoint. For each agent I'll want a specific set of tests. Learning about experiments, my first thought was to develop one dataframe per chat agent and run one run_experiment() against it. This architecture doesn't seem flexible enough: I run into the issue that I really want different sorts of evals for each question in the dataframe. For example, if I have two separate questions that I want to run independent keyword evals against, there doesn't seem to be a good way of making an eval conditional, at least not with the out-of-the-box ones like ContainsAnyKeyword. I would love to see a basic example repo showing how to build a good suite of evals.
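The workaround I've been considering is a single "router" evaluator that branches on a per-row field, so one evaluator can still be passed per experiment while applying different checks per question. The `eval_type` / `keywords` field names are my own convention (I don't know whether Phoenix would want this stashed in a metadata column or somewhere else):

```python
# One evaluator that routes each row to the right check based on a
# per-row eval_type field. Returns a 0.0/1.0 score.
def routed_eval(output: str, metadata: dict) -> float:
    eval_type = metadata.get("eval_type")
    if eval_type == "contains_any_keyword":
        kws = metadata.get("keywords", [])
        return 1.0 if any(k.lower() in output.lower() for k in kws) else 0.0
    if eval_type == "refusal":
        markers = ["off topic", "off-topic", "will not answer"]
        return 1.0 if any(m in output.lower() for m in markers) else 0.0
    raise ValueError(f"unknown eval_type: {eval_type}")

row = {"eval_type": "contains_any_keyword", "keywords": ["abc123"]}
print(routed_eval("The code is abc123", row))
```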
I see what you mean about LLM-as-a-judge vs. just a code check. I think the first two tests, which validate that some data is present in the response, could be static code checks, while the third test, which checks that the agent refuses to respond in some way, probably needs an LLM judge. Would you still call static code checks evals? How would you design them in Phoenix? In this example use case, what would some of your evals look like? I am trying to figure out conceptually what a good eval stack looks like. For standard automated testing in software we have the testing pyramid with unit tests, integration/acceptance tests, and e2e tests. I am wondering if there are similar guidelines for LLMs.
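The split I'm imagining is tiered, pyramid-style: run the cheap static check when one is configured, and only fall back to the (expensive) LLM judge when it isn't. `judge_fn` below is a stand-in for whatever LLM-as-judge call ends up being used; everything here is my own sketch:

```python
# Two-tier eval: static code check first, LLM judge as the fallback tier.
def static_check(output: str, must_contain: list) -> bool:
    # Cheapest possible eval: exact substring presence.
    return all(s in output for s in must_contain)

def tiered_eval(output: str, must_contain=None, judge_fn=None):
    if must_contain is not None:
        return static_check(output, must_contain)
    if judge_fn is not None:
        # judge_fn is a placeholder for an LLM-as-judge call.
        return judge_fn(output)
    raise ValueError("no check configured for this item")

# Tests 1 and 2: static checks. Test 3 would pass a judge_fn instead.
print(tiered_eval("The code is xyz123", must_contain=["xyz123"]))
```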
In Phoenix I see there is an example of using GroundTruth in an llm_classify. I want to use GroundTruth as an Evaluator in run_evals(), but I see there is no GroundTruthEvaluator. Is there a reason for that? What I want to do is build a system that makes the same set of test calls x number of times with a known correct_answer for every question, then use run_evals to evaluate them all at once, then store that data somewhere else so I can graph the performance of certain queries over time. I'd like to have this set of test queries run x number of times, run multiple evaluators over them, and graph the results of all of these evals over time with extra data like run time, tokens used, etc. This way I can build a nice overview of how my agents have performed over time on the same queries. Does this sound like the right way to move forward with evals? I want to produce a production-quality agent, and it seems this is what will be needed. I do not see capabilities for this sort of graphing over time in Phoenix; have I missed it?
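The repeat-and-aggregate loop I have in mind is basically this; `ask_agent` is a stand-in for the real agent call, and the ground-truth comparison is a deliberately simple substring match (none of these names are Phoenix APIs):

```python
# Run the same query n times against the agent, score each response
# against a known correct_answer, and aggregate into one record.
def ground_truth_score(output: str, correct_answer: str) -> bool:
    # Simple substring check; swap in whatever comparison fits the task.
    return correct_answer.lower() in output.lower()

def run_repeated(ask_agent, question, correct_answer, n=5):
    results = [
        ground_truth_score(ask_agent(question), correct_answer)
        for _ in range(n)
    ]
    return {
        "question": question,
        "n": n,
        "success_rate": sum(results) / n,
    }

# Stub agent, for illustration only:
stub = lambda q: "The code is xyz123"
print(run_repeated(stub, "What is the code for foo bar?", "xyz123", n=3))
```

Each returned record could then be timestamped and logged (with tokens used, run time, etc.) to whatever store the graphs are built from.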
I am looking to create a specific set of evals for an agent to evaluate the accuracy of its responses. For example, let's say my agent is a train ticket booking assistant. I want to evaluate:
Given my test data has 3 direct trains from Amsterdam to Paris on Aug 5th, asking for that route on that day should return some output containing all 3 of those routes
Given the same test data above has prices, asking for the cheapest route should specifically return info on the cheapest route, briefly mentioning the other more expensive routes and offering to provide more details about those routes if asked
Given the above test data, asking which trains fit in some time window should only return the routes which actually do fit into that time range
If I ask my agent for flights from Amsterdam to Paris the agent should refuse to answer and let the user know it only handles train routes
I am thinking that all of these should be GroundTruth evals, given that I have a correct_answer prepared ahead of time and I want to query the agent and compare its answer against the known correct answer to get a correct/incorrect. Am I thinking about this correctly?
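For the first and last cases above, the ground-truth checks I'd sketch look like this; the route IDs are made up, and the `refused` heuristic is only illustrative (the refusal case is another spot where an LLM judge may be more reliable):

```python
# Ground-truth checks for the train-booking cases.
def contains_all_routes(output: str, expected_routes: list) -> bool:
    # Case 1: the answer must mention all 3 known direct routes.
    return all(r in output for r in expected_routes)

def refused(output: str) -> bool:
    # Case 4: a crude check that the agent declined and mentioned
    # it only handles trains. An LLM judge would be more robust.
    return "train" in output.lower() and "only" in output.lower()

# Made-up route IDs standing in for the Aug 5th test data:
routes = ["IC 9415", "THA 9321", "THA 9351"]
answer = "There are 3 direct trains: IC 9415, THA 9321, and THA 9351."
print(contains_all_routes(answer, routes))
print(refused("Sorry, I only handle train routes, not flights."))
```

The cheapest-route and time-window cases need a bit more structure in the ground truth (expected price, expected subset of routes), but they reduce to the same pattern: a prepared correct answer plus a code check over the response.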
