Hi Team, we have multiple multi-agentic systems built using the Agno framework, and we want to evaluate them. How can we evaluate these multi-agentic chatbots or systems? Is there any support for this in Arize or Phoenix? For simple RAG, we evaluate based on the question, answer, and context. But how should we evaluate multi-agentic systems? What metrics are relevant here, especially for Agno agents? Please guide us.
Phoenix provides built-in support for tracing and evaluating Agno-based multi-agent systems. You can instrument your Agno agents with the OpenInference package, which streams agent interactions, tool usage, and reasoning steps to Phoenix for observability and evaluation. This lets you monitor, debug, and assess agent behavior in real time, including multi-agent conversations and workflows.

Beyond standard RAG metrics (retrieval relevance, response correctness), relevant metrics for multi-agentic systems include:

1. Router accuracy: did the agent choose the right next step or tool?
2. Parameter extraction correctness: were tool arguments extracted correctly?
3. Plan quality: is the agent's plan valid and efficient?
4. Skill/tool execution correctness: did each tool call produce the right result?
5. Path efficiency: did the agent take the optimal number of steps?
6. Reflection: can the agent self-correct or retry when needed?

Phoenix provides evaluation templates and code examples for each of these, and you can combine them for comprehensive evaluation. For Agno, these evaluations can be applied at the agent, skill, and path levels using Phoenix's evaluation templates and experiment features; all agent traces are captured automatically once tracing is enabled, and results can be visualized in the Phoenix UI or exported for further analysis. See the Agno tracing integration guide and the agent evaluation documentation for details.

Would you like a step-by-step example or code snippet for setting this up?

Sources:
- Agno Tracing Integration
- Agent Evaluation Guide
- Agent Evaluation Templates
- Agent Cookbooks
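The instrumentation step described above can be sketched as follows. This is a minimal setup sketch, assuming the `arize-phoenix-otel` and `openinference-instrumentation-agno` packages are installed and a Phoenix instance is running at the default local address; the project name is a placeholder.

```python
# Minimal sketch: enable Phoenix tracing for Agno agents via OpenInference.
from phoenix.otel import register
from openinference.instrumentation.agno import AgnoInstrumentor

# Register an OTLP tracer provider pointed at the local Phoenix collector.
tracer_provider = register(
    project_name="agno-multi-agent",            # hypothetical project name
    endpoint="http://localhost:6006/v1/traces",  # default local Phoenix endpoint
)

# Instrument Agno: agent runs, tool calls, and reasoning steps are now
# exported as spans and appear in the Phoenix UI for evaluation.
AgnoInstrumentor().instrument(tracer_provider=tracer_provider)
```

Once this runs at startup, every subsequent Agno agent or team invocation is traced without further code changes.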
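To make the path-efficiency metric concrete, here is an illustrative, hypothetical helper (not a Phoenix API): it assumes you have already extracted each run's ordered step names from its trace spans and compares them against a reference path.

```python
# Hypothetical path-efficiency scorer for an agent trace.
# Assumes `actual_steps` is the ordered list of step/tool names the agent
# took, and `optimal_steps` is a hand-labeled reference path for the task.

def path_efficiency(actual_steps: list[str], optimal_steps: list[str]) -> float:
    """Ratio of reference step count to actual step count, capped at 1.0.

    1.0 means the agent took no more steps than the reference path;
    lower values indicate repeated or wasted steps.
    """
    if not actual_steps:
        return 0.0
    return min(1.0, len(optimal_steps) / len(actual_steps))

def took_optimal_path(actual_steps: list[str], optimal_steps: list[str]) -> bool:
    """Strict check: the agent followed exactly the reference path."""
    return actual_steps == optimal_steps

# Example: the agent retried a search before answering.
actual = ["router", "web_search", "web_search", "answer"]
optimal = ["router", "web_search", "answer"]
print(path_efficiency(actual, optimal))    # 3/4 = 0.75
print(took_optimal_path(actual, optimal))  # False
```

Scores like this can be logged back to Phoenix as evaluation annotations alongside LLM-judged metrics such as router accuracy.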
