Designing Effective Eval Stacks for LLMs in Automated Testing
I see what you mean about using an LLM as a judge versus a plain code check. I think the first two tests, which validate that certain data is present in the response, could be static code checks, while the third test, which checks that the model refuses to respond in some way, probably needs an LLM judge. Would you still call static code checks evals? How would you design them in Phoenix? In this example use case, what would some of your evals look like? I'm trying to figure out conceptually what a good eval stack looks like. For standard automated testing in software we have the testing pyramid with unit tests, integration/acceptance tests, and e2e tests. I'm wondering whether there are similar guidelines for LLMs.
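To make the question concrete, here is the kind of static code check I have in mind for the first two tests. This is a plain-Python sketch, not Phoenix-specific; the function name, the JSON shape, and the required fields are all made up for illustration:

```python
import json

def check_required_fields(response_text: str, required: list[str]) -> bool:
    """Static code check: pass if the response is valid JSON
    containing every required field. No model call needed."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        # Malformed output fails the check outright.
        return False
    return all(field in payload for field in required)

# Hypothetical responses for the first two tests in my use case:
good = '{"order_id": "123", "status": "shipped"}'
bad = '{"status": "shipped"}'
print(check_required_fields(good, ["order_id", "status"]))  # True
print(check_required_fields(bad, ["order_id", "status"]))   # False
```

The third test (did the model refuse appropriately?) doesn't reduce to a deterministic predicate like this, which is why I'd expect it to need an LLM judge instead.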
