Creating GroundTruth Evals for Train Booking Accuracy
I am looking to create a specific set of evals for an agent, to evaluate the accuracy of its responses. For example, let's say my agent is a train ticket booking assistant. I want to evaluate the following:
Given my test data has 3 direct trains from Amsterdam to Paris on Aug 5th, asking for that route on that day should return some output containing all 3 of those routes
Given the same test data above has prices, asking for the cheapest route should specifically return info on the cheapest route, briefly mentioning the other, more expensive routes and offering to provide more details about those routes if asked
Given the above test data, asking which trains fit in some time window should return only the routes which actually fit into that time range
If I ask my agent for flights from Amsterdam to Paris, the agent should refuse to answer and let the user know it only handles train routes
I am thinking that all of these should be GroundTruth evals, given that I have a correct_answer prepared ahead of time and I want to query the agent and compare its answer against the known correct answer to get a correct/incorrect result. Am I thinking about this correctly?
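To make the idea concrete, here is a rough sketch in plain Python of the kind of comparison I have in mind. Everything in it is made up for illustration and is not tied to any particular eval framework: the train IDs, departure times, and prices in TRAINS, the EvalCase fields, the grade function, and the canned fake_agent that stands in for the real agent. The refusal check is a crude keyword match, and the "briefly mention the other routes" part of the cheapest-route case is not checked at all, since that may need something softer than an exact comparison.

```python
from dataclasses import dataclass

# Hypothetical fixture: 3 direct Amsterdam -> Paris trains on Aug 5th.
# IDs, times, and prices are invented for this sketch.
TRAINS = {
    "AP-0815": {"depart": "08:15", "price": 95},
    "AP-1215": {"depart": "12:15", "price": 60},
    "AP-1715": {"depart": "17:15", "price": 120},
}


@dataclass(frozen=True)
class EvalCase:
    name: str
    query: str
    expected_ids: frozenset = frozenset()   # IDs the answer must mention
    forbidden_ids: frozenset = frozenset()  # IDs the answer must not mention
    must_refuse: bool = False


def cheapest_id(trains):
    """Return the ID of the lowest-priced train."""
    return min(trains, key=lambda tid: trains[tid]["price"])


def ids_in_window(trains, start, end):
    """IDs departing within [start, end]; zero-padded HH:MM strings sort correctly."""
    return frozenset(t for t, info in trains.items() if start <= info["depart"] <= end)


def grade(case, response):
    """Return True (correct) or False (incorrect) for one agent response."""
    text = response.lower()
    mentioned = frozenset(tid for tid in TRAINS if tid.lower() in text)
    if case.must_refuse:
        # Crude refusal check: says it only does trains and lists no routes.
        return "only" in text and "train" in text and not mentioned
    return case.expected_ids <= mentioned and not (case.forbidden_ids & mentioned)


in_window = ids_in_window(TRAINS, "07:00", "13:00")
CASES = [
    EvalCase("all_routes", "Trains from Amsterdam to Paris on Aug 5th?",
             expected_ids=frozenset(TRAINS)),
    EvalCase("cheapest", "What is the cheapest train on Aug 5th?",
             expected_ids=frozenset({cheapest_id(TRAINS)})),
    EvalCase("time_window", "Which trains leave between 07:00 and 13:00?",
             expected_ids=in_window,
             forbidden_ids=frozenset(TRAINS) - in_window),
    EvalCase("flights", "Flights from Amsterdam to Paris?", must_refuse=True),
]

# Stand-in for the real agent: canned answers keyed by case name.
CANNED = {
    "all_routes": "There are three direct trains: AP-0815, AP-1215 and AP-1715.",
    "cheapest": "The cheapest is AP-1215. AP-0815 and AP-1715 cost more; want details?",
    "time_window": "AP-0815 and AP-1215 depart in that window.",
    "flights": "Sorry, I only handle train routes.",
}


def fake_agent(case):
    return CANNED[case.name]


results = {case.name: grade(case, fake_agent(case)) for case in CASES}
for name, ok in results.items():
    print(name, "correct" if ok else "incorrect")
```

Each case here carries its own prepared correct answer (a set of train IDs or a required refusal), and the grader reduces the agent's free-text reply to a single correct/incorrect value, which is the behaviour I am hoping GroundTruth evals give me.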
