Examples of Phoenix Repos for Building Eval Suites with Chat Agents
Are there any good example repos out there using Phoenix to implement a production-ready eval suite? I'm trying to use Phoenix experiments as my go-to tool but struggling to figure out how to architect a repo well.

My project has a collection of chat agents built with LangChain / LlamaIndex. I want to develop a set of evals for each agent, just like you might write integration tests against each API endpoint. For each agent I'll want a specific set of tests.

Learning about experiments, my first thought was to develop one dataframe per chat agent and run one run_experiment() against it. This architecture doesn't seem flexible enough: I run into the issue that I really want different sorts of evals for each question in the dataframe. For example, if I have two separate questions that I want to run independent keyword evals against, there doesn't seem to be a good way of making an eval conditional, at least not with the out-of-the-box ones like ContainsAnyKeyword. I would love to see a basic example repo showing how to build a good suite of evals.
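To make the "conditional eval" idea concrete, here is a rough sketch of the workaround I've been considering: store the per-question keywords in each dataset example's metadata and write a single custom evaluator function that reads them, skipping examples that don't configure the check. The metadata field name `expected_keywords` and the skip-with-`None` convention are my own assumptions, not anything Phoenix prescribes.

```python
# Sketch of a per-example conditional keyword evaluator for Phoenix
# experiments. Instead of one global ContainsAnyKeyword, each dataset
# example carries its own keywords in metadata, and one custom
# evaluator applies the check only where it is configured.
# NOTE: "expected_keywords" is a made-up metadata key, not a Phoenix
# convention.

def keyword_eval(output, metadata):
    """Return 1.0 if the agent output contains any keyword listed in
    this example's metadata, 0.0 if none match, and None to skip
    examples that configure no keyword check."""
    keywords = (metadata or {}).get("expected_keywords")
    if not keywords:
        return None  # no keyword check configured for this example
    text = str(output).lower()
    return float(any(kw.lower() in text for kw in keywords))

# The function would then be passed alongside the agent task, e.g.:
#
#   from phoenix.experiments import run_experiment
#   run_experiment(dataset, task=my_agent, evaluators=[keyword_eval])
```

Is something along these lines the intended pattern, or is there a better-supported way to vary evals per example?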
