I wanted to drop a couple of notes from the Evals trenches:
Half the challenge of designing a good eval is designing test data that confirms the Eval template actually works
Synthetic data can be useful for bootstrapping your Eval testing when you don't have great production data yet
In the tests I've seen, GPT-4 works fairly well for a wide range of analysis tasks
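
To make the synthetic-data point concrete, here is a minimal sketch of how you might sanity-check an eval before trusting it on real traffic: hand-write a few cases where you already know the right label, run the eval over them, and look at where it disagrees. Everything here is hypothetical and for illustration only: the cases, the `relevant`/`irrelevant` labels, and especially `toy_judge`, which is a keyword-overlap stand-in for whatever LLM call your Eval template actually makes.

```python
import re

# Assumption: a tiny stopword list, just enough for this toy example.
STOPWORDS = {"what", "is", "the", "of", "a", "how", "do", "i", "my", "to", "on"}


def content_words(text):
    """Lowercase the text and return its non-stopword words as a set."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def toy_judge(question, answer):
    """Hypothetical stand-in for the LLM eval call.

    Labels an answer 'relevant' if it shares any content word with the question.
    """
    overlap = content_words(question) & content_words(answer)
    return "relevant" if overlap else "irrelevant"


# Hand-written synthetic cases with known expected labels (all made up).
SYNTHETIC_CASES = [
    ("What is the capital of France?", "Paris is the capital of France.", "relevant"),
    ("How do I reset my password?", "Click 'Forgot password' on the login page.", "relevant"),
    ("What is the capital of France?", "Bananas are rich in potassium.", "irrelevant"),
    ("How do I reset my password?", "Our office is closed on Sundays.", "irrelevant"),
]


def check_eval(judge, cases):
    """Run the judge over labeled cases; return (accuracy, list of misses)."""
    misses = []
    for question, answer, expected in cases:
        got = judge(question, answer)
        if got != expected:
            misses.append((question, answer, expected, got))
    accuracy = 1 - len(misses) / len(cases)
    return accuracy, misses


accuracy, misses = check_eval(toy_judge, SYNTHETIC_CASES)
```

Swapping `toy_judge` for a function that calls your real template gives you the same accuracy number and a list of misses to read through, which is usually where the template bugs show up.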
Included are some of the pillars from our benchmarking session earlier in the week: