Using GroundTruth for Evaluations in LLM Classifications
In Phoenix, I see there is an example of using ground truth in `llm_classify`, but there is no `GroundTruthEvaluator` I can pass to `run_evals()`. Is there a reason for that?

What I want to build is a system that runs the same set of test queries x times, each with a known `correct_answer`, then uses `run_evals()` to evaluate all of the results at once and stores that data somewhere else so I can graph the performance of specific queries over time. I'd like to run multiple evaluators over every run and chart the eval results over time alongside extra data like run time, tokens used, etc. That way I can build a clear overview of how my agents have performed on the same queries over time.

Does this sound like the right way to move forward with evals? I want to produce a production-quality agent, and this seems like what will be needed. I don't see capabilities for this sort of graphing over time in Phoenix; have I missed it?
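For context, here is a minimal, Phoenix-agnostic sketch of the loop I have in mind. Everything here is hypothetical: `TEST_QUERIES`, `fake_agent`, and the exact-match check are stand-ins for the real agent call and for whatever evaluators (ground truth or otherwise) would score each answer, and the records would be written to a real store rather than kept in memory.

```python
from dataclasses import dataclass
import time

# Hypothetical test set: every query has a known correct_answer.
TEST_QUERIES = [
    {"query": "capital of France?", "correct_answer": "Paris"},
    {"query": "2 + 2?", "correct_answer": "4"},
]

@dataclass
class EvalRecord:
    """One scored agent call; these rows are what gets stored and graphed."""
    run_id: int
    timestamp: float
    query: str
    answer: str
    correct: bool
    latency_s: float
    tokens_used: int

def fake_agent(query: str) -> tuple[str, int]:
    """Stand-in for the real agent call; returns (answer, tokens_used)."""
    canned = {"capital of France?": "Paris", "2 + 2?": "5"}
    return canned[query], 42

def run_eval_suite(num_runs: int) -> list[EvalRecord]:
    """Run the full test set num_runs times, scoring each answer
    against its known correct_answer (here: a simple exact match)."""
    records = []
    for run_id in range(num_runs):
        for case in TEST_QUERIES:
            start = time.perf_counter()
            answer, tokens = fake_agent(case["query"])
            latency = time.perf_counter() - start
            records.append(EvalRecord(
                run_id=run_id,
                timestamp=time.time(),
                query=case["query"],
                answer=answer,
                correct=answer.strip().lower()
                        == case["correct_answer"].strip().lower(),
                latency_s=latency,
                tokens_used=tokens,
            ))
    return records

def accuracy_by_query(records: list[EvalRecord]) -> dict[str, float]:
    """Aggregate per-query accuracy across runs, the series I'd graph."""
    totals: dict[str, int] = {}
    hits: dict[str, int] = {}
    for r in records:
        totals[r.query] = totals.get(r.query, 0) + 1
        hits[r.query] = hits.get(r.query, 0) + int(r.correct)
    return {q: hits[q] / totals[q] for q in totals}

records = run_eval_suite(num_runs=3)
print(accuracy_by_query(records))
```

In a real version I would swap the exact-match check for one or more LLM evaluators and append each `EvalRecord` row to an external store (a database or a parquet file) keyed by timestamp, so per-query accuracy, latency, and token usage can be plotted over time.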
