My latest blog with my colleague Roger Y. benchmarks OpenAI's GPT models with function calling and explanations against various performance metrics, showing how accurately the models classify responses as hallucinated or relevant. It also covers the trade-offs between speed and performance for different LLM application systems.
Check it out: https://arize.com/blog/calling-all-functions-benchmarking-openai-function-calling-and-explanations/
This is good stuff!! I am really focused on the value of explanations - putting aside the instruct model, the utility of using explanations seems pretty mixed - is that fair?
Rajiv S. We are finding incredible value in explanations right now. Almost every customer we are working with is making use of explanations for Evals during the development process, and many are using explanations in production.
I was just running through one this morning with a customer: a code-functionality Eval check for a code-generation LLM. The explanations pointed exactly to the problems in the code generation, showing us precisely where to consider fixes - in this case, fixes around what other information we should add to the context window.
Phoenix now has a flag for generating an explanation with any template, including custom templates you create for Evals.
https://docs.arize.com/phoenix/llm-evals/evals-with-explanations
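A minimal sketch of what that flag looks like in practice, using Phoenix's built-in hallucination Eval. This assumes a recent arize-phoenix release exposing `phoenix.evals` (older versions used `phoenix.experimental.evals`), an `OPENAI_API_KEY` in the environment, and the example data is made up:

```python
# Sketch: running a hallucination Eval with explanations in Phoenix.
# Assumes arize-phoenix with the `phoenix.evals` module and an OpenAI key;
# import paths and model arguments may differ across releases.
import os

# Hypothetical records to evaluate: each pairs a question and reference
# context with a model answer we want to check for hallucination.
records = [
    {
        "input": "What is Phoenix?",
        "reference": "Phoenix is an open-source LLM observability library.",
        "output": "Phoenix is a closed-source paid SaaS product.",  # likely hallucinated
    },
]

if os.environ.get("OPENAI_API_KEY"):
    import pandas as pd
    from phoenix.evals import (
        HALLUCINATION_PROMPT_RAILS_MAP,
        HALLUCINATION_PROMPT_TEMPLATE,
        OpenAIModel,
        llm_classify,
    )

    results = llm_classify(
        dataframe=pd.DataFrame(records),
        model=OpenAIModel(model="gpt-4o-mini"),
        template=HALLUCINATION_PROMPT_TEMPLATE,
        rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
        provide_explanation=True,  # the flag discussed above: adds an "explanation" column
    )
    # Each row now carries a label plus the judge model's reasoning,
    # which is what points you to where a fix is needed.
    print(results[["label", "explanation"]])
```

The same `provide_explanation=True` argument works with custom templates passed to `llm_classify`, so a code-functionality Eval like the one described above can return its reasoning alongside each label.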