Follow-up to my random, not-really-thought-out hot take that multi-agent systems are overhyped and ngmi - Cognition just put this out and I ~70% agree. It's a decent framing of the current issues with multi-agent; wish it were a bit more forward-thinking too, but a decent quick read: https://cognition.ai/blog/dont-build-multi-agents#applying-the-principles
Nice counter article from Anthropic on how their deep research system works - https://www.anthropic.com/engineering/built-multi-agent-research-system
"We found that a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval."
"There is a downside: in practice, these architectures burn through tokens fast. In our data, agents typically use about 4× more tokens than chat interactions, and multi-agent systems use about 15× more tokens than chats"
"We’ve found that multi-agent systems excel at valuable tasks that involve heavy parallelization, information that exceeds single context windows, and interfacing with numerous complex tools."
My thoughts on their thoughts:
- (in response to the sequential task decomposition diagram) It's hard to think about this without taking the outputs into account as part of the context. In this example you may not need the subtask 1 prompt when prompting subtask 2, given that the output of subtask 1 is provided. (See the first sketch after this list.)
- “Depending on the domain, you might even consider fine-tuning a smaller model (this is in fact something we’ve done at Cognition).” (“Cognition | Don’t Build Multi-Agents”) This violates the idea that smaller is dumber and thus best suited for lower-impact jobs (in fact the team seems to support this sentiment in the "edit apply models" section).
- “So, builders had the large models output markdown explanations of code edits and then fed these markdown explanations to small models to actually rewrite the files. However, these systems would still be very faulty. Often times, for example, the small model would misinterpret the instructions of the large model and make an incorrect edit due to the most slight ambiguities in the instructions. Today, the edit decision-making and applying are more often done by a single model in one action.” (“Cognition | Don’t Build Multi-Agents”) Would love to see which other coding agents did this. (The second sketch below illustrates this two-stage flow.)
- “However, agents today are not quite able to engage in this style of long-context proactive discourse with much more reliability than you would get with a single agent. Humans are quite efficient at communicating our most important knowledge to one another, but this efficiency takes nontrivial intelligence.” (“Cognition | Don’t Build Multi-Agents”) If you operate under the aforementioned principles of context engineering, then more than likely that discourse is not proactive, since it would be nothing more than chain of thought. This rests on the speculation that what makes these discussions proactive for humans is the diversity of knowledge each party brings.
- “I don’t see anyone putting a dedicated effort to solving this difficult cross-agent context-passing problem.” (“Cognition | Don’t Build Multi-Agents”) A2A addresses context passing.
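On the first bullet, a minimal sketch of what I mean by the output standing in for the subtask 1 prompt (`call_model` is a hypothetical helper; the point is that step 2's context is step 1's output, not step 1's instructions):

```python
def call_model(prompt: str) -> str:
    """Hypothetical LLM call for illustration."""
    return f"[model output for: {prompt[:50]}...]"

# Subtask 1: its instructions matter only while producing its output.
out1 = call_model("Subtask 1: summarize the design constraints of the system.")

# Subtask 2: we pass out1 itself, not the subtask 1 instructions.
# The output *is* the context; re-including the earlier prompt is redundant.
out2 = call_model(f"Subtask 2: propose an architecture.\n\nConstraints (from earlier step):\n{out1}")
print(out2)
```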
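And the two-stage edit-apply flow from the third bullet looks roughly like this (a sketch with hypothetical `big_model`/`small_model` helpers; the hand-off is exactly where ambiguity turns into a wrong edit):

```python
def big_model(prompt: str) -> str:
    """Hypothetical large model: decides *what* to change."""
    return "Rename `fetch_data` to `load_data` and add a timeout parameter."

def small_model(prompt: str) -> str:
    """Hypothetical small model: applies the change; prone to misreading ambiguity."""
    return prompt  # placeholder

def edit_file(source: str, request: str) -> str:
    # Stage 1: large model emits a markdown explanation of the edit.
    explanation = big_model(f"Explain the code edit needed for: {request}\n\n{source}")
    # Stage 2: small model rewrites the file from that explanation alone.
    # Any ambiguity in `explanation` is silently baked into the rewrite,
    # which is why the post says this is now usually one model, one action.
    return small_model(f"Rewrite this file applying the edit:\n{explanation}\n\n{source}")

print(edit_file("def fetch_data(url): ...", "make data loading configurable"))
```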
Is the performance gain worth the complexity? That is the question.
