We have a new demo video + blog + cookbook on how to evaluate tool-calling agents.
Phoenix includes two prebuilt LLM-as-a-judge evaluators specifically for this, plus a full evaluation workflow in the UI that lets you write prompts, run experiments, add evaluators, and compare results without writing any code.
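If you'd rather start in code, here's a minimal sketch of running a prebuilt tool-calling judge with phoenix.evals. The dataframe column names (question, tool_call, tool_definitions) and the gpt-4o judge model are assumptions; check the template string for the exact variables your version expects.

```python
# Minimal sketch (assumptions noted below): score tool calls with
# Phoenix's prebuilt tool-calling judge via llm_classify.
import pandas as pd
from phoenix.evals import (
    OpenAIModel,
    TOOL_CALLING_PROMPT_RAILS_MAP,
    TOOL_CALLING_PROMPT_TEMPLATE,
    llm_classify,
)

# Assumption: the template expects {question}, {tool_call}, and
# {tool_definitions}; column names must match the template's variables.
df = pd.DataFrame(
    {
        "question": ["Book me a flight from SFO to JFK next Friday"],
        "tool_call": ['search_flights(origin="SFO", destination="JFK", date="next Friday")'],
        "tool_definitions": ["search_flights(origin: str, destination: str, date: str): searches available flights"],
    }
)

results = llm_classify(
    dataframe=df,
    template=TOOL_CALLING_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=list(TOOL_CALLING_PROMPT_RAILS_MAP.values()),  # e.g. ["correct", "incorrect"]
    provide_explanation=True,  # keep the judge's reasoning for debugging
)
print(results[["label", "explanation"]])
```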
This new tutorial + companion notebook walks through the full workflow using a travel assistant demo: what the evaluators measure, how to validate that they align with human judgment, and how to use the results to improve both your assistant prompt and your evaluators.
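Validating alignment comes down to comparing the judge's labels against a handful of hand labels before trusting it at scale. A hypothetical sketch (labels and column names are made up for illustration):

```python
# Hypothetical sketch: measure agreement between judge output and
# human labels on a small hand-labeled sample.
import pandas as pd

judged = pd.DataFrame(
    {
        "label": ["correct", "incorrect", "correct"],      # judge output
        "human_label": ["correct", "correct", "correct"],  # your hand labels
    }
)
agreement = (judged["label"] == judged["human_label"]).mean()
print(f"judge/human agreement: {agreement:.0%}")  # -> 67%
```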
https://arize.com/blog/how-to-evaluate-tool-calling-agents/