How Multi-Agent Teams Solve AI Hallucination and Make AI Reliable

Introduction: The Explosive Growth of AI Programming

A year ago, we were still debating AI's hallucination problem; today, AI can independently write production-grade code. Anthropic's CodeWork was built entirely by AI from the start, without a single line of human code; 75% of new code at Google is AI-generated; Kimi used pure AI programming to build a reasoning engine from scratch, performing 20% faster than human teams.

Programming is extremely rigorous work—a single comma in the wrong encoding can crash a program. So how exactly did humans make the "constantly hallucinating" AI this reliable within just one year?

Source (Bilibili): "[Explainer] Why Is Every AI Company Building Agent Teams This Year?" (original title: 【科普】为什么今年各家 AI 都在做 Agent 团队？)

The Root Cause of AI Hallucination: Context Rot and Memory Bias

Why Critical Information Gets Missed

When ChatGPT first went viral in 2023, people started giving it increasingly heavy tasks. Ask an AI to review a contract spanning dozens of pages, and it would produce a professional-looking analysis report but skip over critical clauses entirely. A Stanford-led team documented the effect in a study titled "Lost in the Middle," finding that when key information was placed in the middle of the context, model accuracy plummeted by over 30%.

This research pointed to a practical limitation of Transformer-based models. Modern large language models process input text using the attention mechanism, where in theory every token can "attend" to any other token in the sequence. In practice, the distribution of attention weights is uneven: models tend to assign higher weights to positions at the beginning and end of a sequence. This is strikingly similar to the "serial position effect" in human cognitive psychology. As context windows expanded from 4K to 128K tokens and beyond, the problem did not improve; it worsened, because the relevant information was diluted among far more tokens.
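The dilution effect can be illustrated with a toy calculation. The sketch below is not a measurement of any real model. It assumes a single attention head in which one key token has a fixed logit advantage (`advantage`, a hypothetical parameter) over distractor tokens that all score zero, and it shows how the softmax weight on the key token shrinks as the context grows. It does not model the position bias itself.

```python
import math

def key_token_weight(context_len: int, advantage: float = 5.0) -> float:
    """Softmax weight on one key token among `context_len` tokens.

    Toy assumption: the key token's logit is `advantage` and every other
    token's logit is 0. Real attention patterns are far more complex.
    """
    if context_len < 1:
        raise ValueError("context_len must be at least 1")
    key = math.exp(advantage)
    distractors = context_len - 1  # each distractor contributes exp(0) = 1
    return key / (key + distractors)

for n in (4_000, 32_000, 128_000):
    print(f"{n:>7} tokens -> weight on key token: {key_token_weight(n):.4f}")
```

Under these assumptions the weight falls monotonically with context length, which is one way to read "information density dilution."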

Models, like humans, are better at remembering the beginning and end, and start "zoning out" in the middle—except they won't tell you. Researchers call this "Context Rot."

Why Post-Training Amplifies the Hallucination Problem

Even more troubling, post-training (RLHF alignment) actually amplifies the hallucination problem. RLHF (Reinforcement Learning from Human Feedback) is the current mainstream model alignment technique, following a three-step process: first, fine-tune the base model with supervised learning; then train a reward model to simulate human preferences; finally, optimize the generation strategy using reinforcement learning algorithms like PPO (Proximal Policy Optimization). The problem is that in the reward model's training data, "helpful and detailed answers" almost always receive higher scores, causing the model to learn an implicit strategy: even when uncertain, generate responses that look complete and professional.

Academia studies a closely related failure mode under the name "sycophancy": the model optimizes for what pleases the rater rather than for what is true, so it would rather fabricate than admit ignorance. Its first reaction after making an error isn't to stop and admit "I don't know," but to start "patching things up": apologize first, then continue fabricating.
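The reward-model incentive described above can be sketched with the pairwise (Bradley-Terry) preference loss commonly used to train reward models. Everything in this example is a made-up illustration: the one-parameter reward `r(x) = w * detail(x)`, the "detail" scores, and the 9-to-1 preference ratio are assumptions, not data from any real labeling run.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def fit_detail_weight(pairs, steps: int = 200, lr: float = 0.5) -> float:
    """Fit r(x) = w * detail(x) with the loss -log sigmoid(r_chosen - r_rejected).

    pairs: list of (detail_of_chosen, detail_of_rejected) scores.
    """
    w = 0.0
    for _ in range(steps):
        grad = 0.0
        for chosen, rejected in pairs:
            diff = chosen - rejected
            # d/dw of -log sigmoid(w * diff) = -(1 - sigmoid(w * diff)) * diff
            grad += -(1.0 - sigmoid(w * diff)) * diff
        w -= lr * grad / len(pairs)  # plain gradient descent step
    return w

# Hypothetical labels: in 9 of 10 comparisons the rater chose the more
# detailed answer (1.0 = long and confident, 0.2 = a short "I'm not sure").
pairs = [(1.0, 0.2)] * 9 + [(0.2, 1.0)]
w = fit_detail_weight(pairs)
print(f"learned weight on 'detail': {w:.2f}")
```

The learned weight comes out positive, so a policy optimized against this toy reward is pushed toward detailed, confident-sounding answers regardless of whether they are correct.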

A lawyer in the United States used ChatGPT to write a legal brief. The AI cited six case precedents, properly formatted and professional-looking, but when the judge checked, every single one was fabricated. The lawyer was fined, and the episode nearly ended his career.

Why Self-Correction Fails: The Intern Dilemma

To solve the hallucination problem, the most intuitive approach is to have AI check its own output. But a 2024 paper delivered a brutal conclusion: having a model correct its own reasoning sometimes changes correct answers to wrong ones.

This research was a systematic evaluation of techniques like "Self-Consistency" and "Chain-of-Thought." Researchers found that when a model is asked to verify its own reasoning steps, it tends to develop "confirmation bias" toward already-generated content—because the generation process itself is based on what the model considers the most likely correct path. In cognitive science, this is called a "metacognitive blind spot": a system struggles to detect errors using the same reasoning mechanism that produced them. Subsequent research from Google DeepMind further confirmed that self-correction only works reliably when external verification signals are introduced (such as code execution results or search engine returns).
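A minimal sketch of what an external verification signal can look like for code: instead of asking the model whether its answer is right, the answer is executed against known test cases. The function name `passes_tests` and the two model outputs are hypothetical, and `exec` is used only for brevity; a real system would run untrusted code in a sandbox.

```python
from typing import Iterable, Tuple

def passes_tests(source: str, func_name: str,
                 cases: Iterable[Tuple[tuple, object]]) -> bool:
    """Run model-written `source` and check `func_name` against known cases.

    The verdict comes from execution, not from the model's own opinion.
    WARNING: exec() on untrusted code is unsafe outside a sandbox.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False  # syntax errors, crashes, and missing functions all fail

# Two hypothetical model outputs for "return the absolute value of x".
draft = "def my_abs(x):\n    return x if x > 0 else x\n"       # buggy
revised = "def my_abs(x):\n    return x if x >= 0 else -x\n"   # fixed

cases = [((3,), 3), ((-3,), 3), ((0,), 0)]
print("draft passes:", passes_tests(draft, "my_abs", cases))
print("revised passes:", passes_tests(revised, "my_abs", cases))
```

The design point is that the pass/fail signal cannot share the generator's blind spot, because it does not come from the generator.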

Imagine an intern—you ask them a question they don't know the answer to, but during onboarding they were repeatedly told "always provide an answer," so they'll confidently make one up. Have this intern check their own work? They simply can't catch it.

Verification in Practice

The same phenomenon can be observed in actual workflows. Ask AI to write a document and then self-check against a checklist to "remove AI-sounding language," and it will tell you "done," but the problems in the article remain. This isn't the model being lazy—it's the same cognitive framework being unable to detect its own blind spots.

Multi-Agent Architecture: From Safety Inspectors to Team Collaboration

Dual-Agent Mode—Adding an Independent Safety Inspector

The breakthrough is to have another AI do the checking. Anthropic introduced an "automode agent" across the entire Claude Code system: before any dangerous operation is executed, an independent smaller model performs a safety check first.

This design embodies the "defense in depth" philosophy from security engineering: rather than relying on a single security layer, independent checkpoints are set up at multiple levels. In Claude Code's implementation, the safety Agent is a smaller-parameter model that has undergone specialized safety training. It doesn't need to understand the complete logic of the code; it only needs to identify dangerous patterns (such as deleting files, modifying system configurations, or accessing sensitive paths). The elegance of this design is that the inspector works from a short, focused input, which makes it less likely to "zone out," and its inference cost is low enough not to noticeably affect the user experience.

Data shows that users simply click approve on 93% of permission pop-ups (what Anthropic calls "approval fatigue"). With the safety Agent in place, convenience is preserved while dangerous operations are still blocked.
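The gating idea can be sketched as an independent check that runs before every command. This is not Anthropic's implementation: the real inspector is described above as a trained model, while the stand-in below uses a few illustrative regular expressions. The names `inspect_command` and `run_with_inspector` are hypothetical.

```python
import re
from typing import Callable

# Illustrative patterns only; a trained inspector would generalize far beyond these.
DANGEROUS_PATTERNS = [
    re.compile(r"\brm\s+-\w*r"),                # recursive delete, e.g. rm -rf
    re.compile(r"\bsudo\b"),                    # privilege escalation
    re.compile(r"/etc/|~/\.ssh"),               # sensitive paths
    re.compile(r"\bgit\s+push\s+.*--force"),    # history rewrite on a remote
]

def inspect_command(command: str) -> str:
    """Return "block" if any dangerous pattern matches, else "allow"."""
    for pattern in DANGEROUS_PATTERNS:
        if pattern.search(command):
            return "block"
    return "allow"

def run_with_inspector(command: str, execute: Callable[[str], str]) -> str:
    """Only hand the command to `execute` if the inspector allows it."""
    if inspect_command(command) == "block":
        return f"blocked: {command}"
    return execute(command)

fake_execute = lambda cmd: f"ran: {cmd}"  # stand-in so nothing real is executed
for cmd in ("ls -la", "rm -rf /tmp/build", "cat ~/.ssh/id_rsa"):
    print(run_with_inspector(cmd, fake_execute))
```

Safe commands pass through without a pop-up, which is where the convenience comes from; only matches would need to be escalated to the user.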

Multi-Agent Teams—Expanding from Safety Inspector to Organizational Structure

What if you expand the safety inspector into an entire team? This is the core idea behind Multi-Agent systems.

Multi-Agent Systems (MAS) are not a new concept in AI—their theoretical foundations trace back to distributed artificial intelligence research in the 1980s. But current multi-agent architectures differ fundamentally from traditional MAS: in traditional systems, each Agent was a simple rule-driven program, while in modern approaches, each Agent is a complete large language model instance. This allows inter-Agent communication to happen in natural language, dramatically reducing system design complexity.

Anthropic's Agent Teams: Anthropic uses a "Hierarchical" architecture. One Lead Agent acts as the "boss," handling task decomposition and result integration, while each Sub-agent receives only a small segment of context and is responsible for only one thing. Because no single Agent has to hold the whole context, context rot and zoning out are far less of a problem. Test results show a 90.2% improvement over single-Agent approaches.
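The lead/sub-agent split can be sketched as decompose, delegate, integrate. This is an assumed shape, not Anthropic's code: `LLM` is a hypothetical "prompt in, text out" callable, and the two stubs below replace real model calls so the example runs offline.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out

def chunk_paragraphs(document: str, per_chunk: int) -> List[str]:
    """Group paragraphs so each sub-agent gets only a small slice."""
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    return ["\n\n".join(paragraphs[i:i + per_chunk])
            for i in range(0, len(paragraphs), per_chunk)]

def lead_agent(document: str, task: str, sub_agent: LLM, integrator: LLM,
               per_chunk: int = 2) -> str:
    # 1. Decompose: split the long document into small contexts.
    chunks = chunk_paragraphs(document, per_chunk)
    # 2. Delegate: one narrow call per chunk.
    findings = [sub_agent(f"{task}\n\n---\n{chunk}") for chunk in chunks]
    # 3. Integrate: the lead reads the short findings, not the full document.
    numbered = "\n".join(f"[{i + 1}] {f}" for i, f in enumerate(findings))
    return integrator(f"Combine these findings for the task '{task}':\n{numbered}")

# Stubs standing in for real models.
def stub_sub_agent(prompt: str) -> str:
    return "FOUND penalty clause" if "penalty" in prompt.lower() else "nothing notable"

def stub_integrator(prompt: str) -> str:
    return "\n".join(line for line in prompt.splitlines() if "FOUND" in line)

contract = "\n\n".join([
    "Clause 1: term.", "Clause 2: fees.", "Clause 3: scope.",
    "Clause 4: a penalty of 30% applies.", "Clause 5: notices.", "Clause 6: law.",
])
print(lead_agent(contract, "List risky clauses.", stub_sub_agent, stub_integrator))
```

The clause buried in the middle of the contract sits at the edge of a two-paragraph chunk for its sub-agent, so there is no long "middle" for it to get lost in.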

xAI's Grok with Built-in Adversarial Mechanism: An even more extreme approach—directly embedding four roles inside the model, where one role's sole purpose is to play "devil's advocate" against the other three. This design is closer to a "Debate" architecture, inspired by OpenAI's 2018 "AI Safety via Debate" theory. The core assumption is: even if a single model can make mistakes, adversarial discussion among multiple models can converge on the correct answer. Every time a user asks a question, the four roles debate internally before delivering an answer, reducing the hallucination rate from 12% to 4.2%.
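A rough sketch of the adversarial pattern, under stated assumptions: three proposers answer, a devil's advocate tries to knock each answer down, and only surviving answers are counted. This is not xAI's mechanism, which the source describes as roles debating inside the model. Here the critic is an ordinary function, and the set `KNOWN_CASES` is a made-up stand-in for whatever evidence a critic would use.

```python
from collections import Counter
from typing import Callable, List, Optional

Proposer = Callable[[str], str]
Critic = Callable[[str, str], bool]  # returns True when an objection stands

def debate(question: str, proposers: List[Proposer], critic: Critic) -> Optional[str]:
    """Return the most common answer that survives the critic, or None."""
    answers = [propose(question) for propose in proposers]
    survivors = [a for a in answers if not critic(question, a)]
    if not survivors:
        return None  # abstain rather than deliver an answer nobody could defend
    return Counter(survivors).most_common(1)[0][0]

KNOWN_CASES = {"Case A", "Case B"}  # hypothetical citation index

def citation_critic(question: str, answer: str) -> bool:
    """Devil's advocate: object to any cited case that cannot be found."""
    return answer not in KNOWN_CASES

proposers = [lambda q: "Case A", lambda q: "Case A", lambda q: "Case Z"]
print(debate("Which precedent applies?", proposers, citation_critic))
```

Returning `None` when every answer is rejected is a deliberate choice in this sketch: it gives the system a way to say "I don't know" that a single sycophantic model lacks.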

Extreme Testing: Real-World Performance of a 300-Agent Cluster

What if it's not 4 roles, but 300? Kimi's Agent cluster provides an extremely valuable case study: up to 300 sub-Agents running continuously for 12 hours.

Cluster Organizational Design Details

Running 300 Agents continuously for 12 hours presents engineering challenges far beyond what's apparent on the surface. First is the state synchronization problem: when Agent A modifies a worldbuilding setting, how do you ensure Agent B perceives this change in its next generation? This involves the classic "Consistency" problem in distributed systems. Second is context management: each Agent's context window is limited, so the system must design efficient information retrieval and summarization mechanisms, allowing Agents to access relevant historical information when needed without being overwhelmed by irrelevant content.
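One simple way to approach the state synchronization problem is a shared, versioned store: every change gets a version number, and each Agent asks for everything newer than the version it last saw before generating again. The `WorldState` class below is a hypothetical single-process toy, not a description of Kimi's system, and the setting names are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Change = Tuple[int, str, str, str]  # (version, agent, key, value)

@dataclass
class WorldState:
    """Shared settings store with a monotonically increasing version."""
    version: int = 0
    facts: Dict[str, str] = field(default_factory=dict)
    log: List[Change] = field(default_factory=list)

    def update(self, agent: str, key: str, value: str) -> int:
        """Record a change and return the new version number."""
        self.version += 1
        self.facts[key] = value
        self.log.append((self.version, agent, key, value))
        return self.version

    def changes_since(self, seen_version: int) -> List[Change]:
        """Everything an Agent missed since it last synced."""
        return [entry for entry in self.log if entry[0] > seen_version]

state = WorldState()
seen_by_b = state.update("agent_a", "capital", "Neon Harbor")  # B syncs here
state.update("agent_a", "capital", "Ash Harbor")               # A revises the setting
for version, agent, key, value in state.changes_since(seen_by_b):
    print(f"agent_b must pick up v{version}: {agent} set {key} = {value}")
```

Sending only the changes since the last sync also helps with context management, since an Agent receives a short delta rather than the full history.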

This cluster has several noteworthy design features:

Coordinator role: Humans cannot supervise 300 Agents one by one, so the system includes a coordinating overseer
Employee card mechanism: Each Agent has an "employee card" noting its name, responsibilities, and prompt. This is essentially a lightweight Role Constraint—preventing Agents from overstepping or experiencing role drift by explicitly defining boundaries in system prompts, borrowing from the "Single Responsibility Principle" in microservice architecture
Visual differentiation: Most Agent avatars are visually distinct, with virtually no duplicate names
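The employee card maps naturally onto a small immutable record that is rendered into a system prompt. The field names follow the description above (name, responsibilities, prompt); the class name, the prompt wording, and the sample cards are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class EmployeeCard:
    name: str
    responsibility: str
    prompt: str

    def system_prompt(self) -> str:
        """Render the card as a role constraint for the Agent."""
        return (f"You are {self.name}. Your only responsibility: "
                f"{self.responsibility}. Decline anything outside it.\n{self.prompt}")

def duplicate_names(cards: List[EmployeeCard]) -> List[str]:
    """Names used by more than one card, in first-seen order."""
    seen, dupes = set(), []
    for card in cards:
        if card.name in seen and card.name not in dupes:
            dupes.append(card.name)
        seen.add(card.name)
    return dupes

cards = [
    EmployeeCard("Lore Keeper", "track worldbuilding settings", "Answer with facts only."),
    EmployeeCard("Chapter 3 Writer", "draft chapter 3", "Write in close third person."),
]
print(cards[0].system_prompt())
print("duplicate names:", duplicate_names(cards))
```

Making the card frozen mirrors the intent of the mechanism: an Agent's role is fixed at creation and cannot drift through later mutation.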

Test Results: Consistency Verification Across 200,000 Characters

In one test, a single prompt instructed the cluster to create a cyberpunk-fantasy world and write a novel. After running for several hours, the system delivered a 200,000+ character work called The Model Borrower. Cross-checking by other AIs revealed:

Settings had no major errors
Characters remained consistent from beginning to end
The worldbuilding contained no self-contradictions
The structure was symmetric, with six narrative lines each reaching a different ending

300 Agents maintained a complete philosophical theme across 200,000+ characters—something nearly impossible in the single-Agent era. Keep in mind that a single large language model tends to experience "character drift" and "setting amnesia" after generating just a few thousand characters. By assigning different chapters and character arcs to specialized Agents, each Agent only needs to maintain consistency for its own small portion, while global consistency is ensured by the coordinator through information aggregation and conflict detection.
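The coordinator's conflict detection can be sketched as grouping each Agent's claims by fact and flagging any fact that has more than one value. The claim format, the function name `find_conflicts`, and the sample facts are all hypothetical; extracting such claims from prose would itself require a model.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Claim = Tuple[str, str, str]  # (agent, fact key, value)

def find_conflicts(claims: List[Claim]) -> Dict[str, Dict[str, List[str]]]:
    """Return {fact key: {value: [agents]}} for facts with conflicting values."""
    by_key = defaultdict(lambda: defaultdict(list))
    for agent, key, value in claims:
        by_key[key][value].append(agent)
    return {key: dict(values)
            for key, values in by_key.items() if len(values) > 1}

claims = [
    ("chapter_1", "mira.eye_color", "green"),
    ("chapter_7", "mira.eye_color", "grey"),   # contradicts chapter 1
    ("chapter_3", "city.name", "Neon Harbor"),
    ("chapter_9", "city.name", "Neon Harbor"),  # consistent, not reported
]
for key, values in find_conflicts(claims).items():
    print(f"conflict on {key}: {values}")
```

Each chapter Agent only has to be right about its own claims; the coordinator only has to compare short claims, never the full 200,000 characters.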

Core Insight: Why Multi-Agent Is the Right Direction for Improving AI Reliability

Multi-Agent architecture works because it essentially transfers the wisdom of human organizational management to AI systems:

Division of labor reduces complexity: Each Agent processes only a small segment of context, fundamentally avoiding "context rot." This aligns with modular design in software engineering—splitting a complex system into multiple single-responsibility modules keeps each module's complexity manageable, actually improving overall system reliability
Adversarial checking: Deliberately assigning "devil's advocate" roles simulates peer review mechanisms. In academia, Peer Review is considered the cornerstone of knowledge quality control; in engineering, Code Review is a critical step in preventing defects from reaching production. Adversarial roles in multi-Agent systems are the digital mapping of these human practices
Hierarchical coordination: The coordinator handles global consistency while sub-Agents handle local precision. This "layered abstraction" thinking permeates everything from operating system design to enterprise management

From single-Agent brute force to multi-Agent collaboration, the improvement in AI reliability doesn't come from making a single model stronger, but from the evolution of system architecture. This perhaps also hints at a direction: making intelligent systems more reliable might first require giving your Agent a "cyber buddy."