Spent a lot of time thinking about LLM evals in compiling this just-published definitive guide! Check it out for a deep dive on LLM model evals versus LLM system evals, how to build and run LLM evals, how to benchmark evals, and a lot more. Would love your thoughts!
This is awesome, a really comprehensive guide! I'm definitely gonna share it around 🙂 I recently gave a few talks on evals, and a few recurring discussion topics and questions kept coming up that I didn't find in this post. I'd love to get your thoughts on them if there's a 2nd part to this down the line.
How to generate a “golden dataset”
Ideal size / balanced sample sets / when to combine real and synthetic test data / etc.
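On the real-vs-synthetic question, here's a rough sketch of one way people do it: sample the two pools at a fixed ratio with a fixed seed so the golden set stays reproducible. All the names, the ratio, and the dict shape here are hypothetical knobs, not anything from the guide:

```python
import random

def build_golden_dataset(real, synthetic, size=200, real_ratio=0.7, seed=42):
    """Mix real and synthetic eval examples at a fixed ratio.

    `real` and `synthetic` are lists of {"input": ..., "expected": ...}
    dicts; size and ratio are knobs to tune per use case.
    """
    rng = random.Random(seed)  # fixed seed keeps the golden set reproducible
    n_real = min(int(size * real_ratio), len(real))
    n_synth = min(size - n_real, len(synthetic))
    dataset = rng.sample(real, n_real) + rng.sample(synthetic, n_synth)
    rng.shuffle(dataset)  # avoid real-then-synthetic ordering artifacts
    return dataset

# Toy pools standing in for logged production traffic + generated cases
real = [{"input": f"real question {i}", "expected": "..."} for i in range(500)]
synthetic = [{"input": f"synthetic question {i}", "expected": "..."} for i in range(500)]
golden = build_golden_dataset(real, synthetic)
print(len(golden))  # 200
```

The fixed seed matters more than it looks: if the golden set reshuffles between runs, score diffs stop being comparable.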
How often to run evals?
Most devs building agents seem to run pretty strict regression tests on every PR, while more enterprise companies often aren't sure when to run evals, so practice is all over the place
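The per-PR pattern I keep seeing is basically a threshold gate: score the golden set, fail CI if it regresses past a tolerance. A minimal sketch, assuming a hypothetical `run_eval` that returns a 0-1 score (the baseline would normally be stored in the repo, not hardcoded):

```python
# Hypothetical CI regression gate: fail the PR if the eval score drops
# more than TOLERANCE below the baseline recorded on the main branch.
BASELINE_SCORE = 0.82  # last known-good score; would live in a file in practice
TOLERANCE = 0.02       # how much regression we tolerate before failing

def run_eval(dataset):
    # Placeholder: a real pipeline would run the LLM system on each
    # example and aggregate per-example scores.
    return 0.84

def check_regression(dataset):
    score = run_eval(dataset)
    assert score >= BASELINE_SCORE - TOLERANCE, (
        f"Eval regressed: {score:.2f} < {BASELINE_SCORE - TOLERANCE:.2f}"
    )
    return score

check_regression(dataset=[...])  # wire into pytest / CI on every PR
```

The tolerance is doing real work there: LLM outputs are noisy, so a zero-tolerance gate just flakes.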
How to incorporate non-AI-based evals, plus better explanations of the pros and cons of AI-based evals
For example, people notice that GPT-4-based evals typically prefer the longest summaries regardless of any other factors, so when should you mix in other metrics?
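One mitigation for that length bias I've seen discussed is blending the judge's score with a simple length-aware heuristic so verbosity alone can't win. Purely an illustrative sketch; the weights and target length are made-up knobs:

```python
def blended_score(judge_score, summary, target_len=100, length_weight=0.3):
    """Combine an LLM-judge score (0-1) with a length penalty.

    Summaries far from `target_len` words get penalized, so the judge's
    known preference for longer outputs can't dominate the final score.
    """
    n_words = len(summary.split())
    # Penalty grows with relative distance from target length, capped at 1
    length_penalty = min(abs(n_words - target_len) / target_len, 1.0)
    return (1 - length_weight) * judge_score + length_weight * (1 - length_penalty)

on_target = "word " * 100  # ~100 words, near the target
verbose = "word " * 300    # 3x the target length, same judge score
print(blended_score(0.9, on_target))  # 0.93
print(blended_score(0.9, verbose))    # 0.63
```

Same idea works with any deterministic metric (ROUGE, exact-match, latency) in place of the length term; the point is just not letting the LLM judge be the only signal.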
Lots of talk about regression testing and how to build a separate dataset for it
Also wanna say this is an awesome resource, it def covers like 99% of it all in one great write-up 👏. Just wanted to share the other topics I've been seeing discussed around evals
