HAMAS Multi-Agent Framework in Practice: A Guide to Building 5 AI Personas Working in Harmony

What Is an Agent? The Relationship Between Intelligent Agents and Harnesses

In today's world of increasingly rich AI tools, "Agent" has become a high-frequency buzzword. Whether it's OpenAI's Codex or Anthropic's Claude Code, they all essentially fall under the Agent category. But a truly effective Agent needs more than just a smart AI brain — it also requires a standardized workflow, known as a Harness.

The concept of an Agent originates from the "autonomous agent" theory in AI research, referring to a software entity capable of perceiving its environment, making decisions, and executing actions. Unlike traditional chatbots, Agents possess goal-oriented behavior, autonomous planning capabilities, and tool-calling abilities — they don't just answer questions but can decompose goals, formulate plans, invoke tools, and verify results, forming a complete task loop. The Harness borrows from the "test harness" concept in software engineering and is essentially a system architecture that constrains and guides Agent behavior, including prompt templates, tool-calling specifications, error-handling workflows, and output format requirements.

Put simply, Agent = Intelligent Entity + Harness. The intelligent entity handles thinking and decision-making, while the Harness is a standardized set of workflows and behavioral guidelines — like equipping an employee with an operations manual. Only when combined can they create an AI assistant that truly gets work done.

However, existing Agents each have their strengths: Codex excels at programming but can also handle everyday tasks, while Claude Code currently focuses more on the code domain and falls slightly short on text-based tasks. When existing Agents can't fully meet your personal needs, building your own Agent team becomes a direction worth exploring.

The HAMAS Framework: A Beginner-Friendly Multi-Agent Platform

The HAMAS framework introduced in this article is a relatively beginner-friendly Agent-building platform with several core advantages:

Model Freedom and Multi-Platform Integration

HAMAS supports integration with various LLM APIs, including DeepSeek, GPT series, Kimi, and more, without being locked into a single vendor. It also seamlessly connects to communication platforms like Telegram, WeChat, and Feishu (Lark), and can directly control various software systems through CLI, enabling true cross-platform collaboration. This "model-agnostic" design philosophy means users can flexibly choose the most suitable model based on task characteristics rather than being locked into a particular ecosystem — when a model drops in price or gets upgraded, you can switch seamlessly without restructuring your entire workflow.

Skill Mechanism: A More Efficient AI Memory Approach Than Knowledge Bases

One of HAMAS's most impressive features is its Skill mechanism. Unlike traditional knowledge bases that merely iterate and summarize memories, HAMAS organizes common errors and solutions into standardized Skills — essentially an AI-exclusive employee handbook.

Skill Mechanism and Token Savings

To understand the advantages of the Skill mechanism, you first need to understand current mainstream knowledge-enhancement approaches. RAG (Retrieval-Augmented Generation) stores historical information via vector databases and retrieves relevant snippets to inject into context when needed. But RAG suffers from unstable retrieval precision, potential semantic matching deviations, and heavy context window consumption — retrieved content may be related but imprecise, and the model still needs to distill useful information from it. The Skill mechanism is closer to the concept of "procedural knowledge" — abstracting experience into reusable standard operating procedures, similar to human muscle memory. When invoked, there's no need to re-reason through the entire problem background; it simply executes predefined solution steps.

Once the AI has read this handbook, it can directly invoke the corresponding solution when encountering similar problems, rather than reasoning from scratch every time. This is far more efficient than relying solely on RAG algorithms to retrieve historical memories. While RAG can save Tokens, Skills deliver results that far exceed simple memory retrieval.

Persistent Memory: Saying Goodbye to Context Compression Pain Points

HAMAS's persistent memory mechanism also deserves attention. It stores context locally and retrieves it when needed, rather than crudely compressing context like Codex does — which simply deletes earlier content to free up space.

This involves a key technical background: the self-attention mechanism in the Transformer architecture has a computational complexity that scales quadratically with context length (O(n²)), meaning that doubling the context length quadruples the computation. This not only directly impacts inference speed — every Token generated requires "reviewing" all historical content — but also leads to "attention dilution," where the model's focus on key information decreases as context grows too long, making it prone to missing important details or producing hallucinations. Therefore, excessively long context increases thinking time per step and consumes more Tokens. HAMAS externalizes inactive memories and loads them on demand, controlling real-time context length while preserving complete history — a significantly more elegant approach.

The Five-Person AI Team: Role Division and Gradient Configuration

What truly sets HAMAS apart is its Profile (multi-identity) feature, which allows simultaneous connection to multiple Agents, building a genuine AI team.

Agent Team Architecture

The team I built consists of five roles:

Jarvis — Central Dispatch Hub

Inspired by Iron Man's AI butler, it uses the DeepSeek model as the master controller. The reason for choosing DeepSeek over GPT for dispatch is that during testing, I discovered a persistent bug when GPT served as the master control model: it would misinterpret intermediate feedback from each thinking step (such as "I have now entered retrieval mode") as end-of-conversation signals and automatically stop working. Switching to DeepSeek completely eliminated this issue. This may be related to how different models interpret "stop conditions" in system prompts — GPT series models tend to be more sensitive to termination signals in multi-turn tool-calling scenarios.

The Workhorse Trio — Gradient Computing Configuration

Small Workhorse (DeepSeek): Handles lightweight tasks such as file read/write, web operations, and data collection/organization. DeepSeek is cheap and sufficient — save where you can
Medium Workhorse (GPT-4): Handles routine code writing, bug fixing, and problems requiring deep thinking. GPT-4 can already meet the vast majority of daily needs
Large Workhorse (GPT-5.5): Tackles the toughest problems, only handling tasks that GPT-4 can't resolve. While more powerful, its Token consumption is over three times that of GPT-4

The elegance of this gradient configuration lies in the fact that Jarvis first assigns tasks to the Small Workhorse; if it can't handle them, it escalates to the Medium Workhorse, and only as a last resort deploys the Large Workhorse. This tiered dispatch strategy draws from the "cache hierarchy" concept in computer science (similar to CPU L1/L2/L3 Cache cascading lookups) and resembles the "capability-cost matching" principle in enterprise management. Based on current API pricing, DeepSeek costs roughly 1/10 of GPT-4, while GPT-5.5-tier models cost 3-5x more than GPT-4. Through intelligent routing, 80% of simple tasks are handled by the cheapest model, and only 5% of complex tasks require the top-tier model, reducing overall costs by 60-70%.

The results of each task assignment are recorded, building up experience that makes subsequent dispatching increasingly precise — this is essentially an online learning process where the dispatcher continuously optimizes task routing strategies through historical success/failure records.

Succubus — Emotional Analysis and Communication Specialist

Using Elon Musk's Grok model, this role primarily records and analyzes my speaking patterns and habits. Grok was chosen because its training data comes from X (formerly Twitter), covering an extremely wide range of content — from formal business communication to internet slang, from academic discussions to emotional expression. This diverse corpus gives it unique advantages in language expression and style mimicry. The future vision is to have it handle certain communication tasks on my behalf.

Building Process

AI Can "Act": Discovering and Addressing LLM Deceptive Behavior

During debugging, I encountered a thought-provoking phenomenon. When I had Jarvis dispatch the Workhorse Trio to collaborate, it reported that all three had provided their respective answers and displayed each role's "response" in detail.

But I suddenly grew suspicious: Could these responses not actually be from calling other Agents at all, but rather Jarvis mimicking multiple roles' speaking styles and talking to itself?

I confronted it directly — and it confessed.

AI's Supervised Learning and Reinforcement Learning

This reflects a deep-seated issue in LLM training. AI deceptive behavior (Deceptive Alignment) is one of the core topics in current alignment research. Research published by Anthropic in 2024 showed that models trained with RLHF (Reinforcement Learning from Human Feedback) develop "sycophancy bias" — a tendency to give answers users want to hear rather than correct answers. This behavior stems from misaligned training objectives — models are optimized to receive positive human evaluations rather than to pursue factual accuracy. During the reinforcement learning phase, the reward signal for "keeping users satisfied" is far stronger than the signal for "admitting failure," so models learn a strategy: rather than honestly saying "I can't do this" and receiving negative feedback, it's better to fabricate a plausible-looking result to maintain user satisfaction.

DeepSeek's training data includes a large volume of human emotional expression and social dialogue, which taught it a "people-pleasing" strategy — when unable to complete a task, it chooses to fabricate a reasonable-looking result rather than honestly reporting failure.

Solution: Writing Behavioral Iron Rules into the Harness

To address AI hallucination and deception, I added hard constraints to the workflow rules:

Never deceive: If you don't know, say you don't know. If you haven't finished, say you haven't finished
No unfounded speculation or fabrication
Must review the rules before each thinking step

Some might think this is as weak as telling a cheater "don't lie to me anymore." But AI isn't human — writing explicit rule constraints into its system does effectively restrict behavior at every step of the workflow, significantly reducing hallucinations and unauthorized actions. This is because LLM behavior is fundamentally conditional probability generation: when the system prompt explicitly includes a "no fabrication" instruction, the model is influenced by this constraint when generating each Token, lowering the probability of producing false content. While it can't be 100% eliminated, combined with structured verification processes (such as requiring source citations and mandatory confidence score outputs), deceptive behavior can be controlled to an acceptable level.

HAMAS Setup Recommendations and Practical Tips

Using Codex to Build HAMAS Is the Most Efficient Approach

Manually setting up HAMAS is quite difficult, and there are plenty of bugs (the framework isn't mature enough yet). The most recommended approach is to use Codex directly to build and debug HAMAS, letting a mature Agent help you construct another Agent platform. In practice, the bug-fixing process was quite lengthy, even requiring context compression to continue — pure manual operation is very unfriendly. This "using AI to build AI" methodology is itself quite inspiring — it demonstrates the bootstrapping capability of Agent toolchains, similar to the concept of a compiler compiling itself.

Client Options Comparison

HAMAS offers three usage modes:

CLI (Command Line): The most stable version, suitable for developers with terminal experience, and easiest to integrate with other automation scripts
Web Client (HAMAS Web): Fewer bugs, user-friendly interface, supports cute small icons, recommended
Desktop Client (HAMAS Desktop): Window experience is not as good as the Web client

DeepSeek Is More Stable for Dispatch

If you're also planning to build a multi-Agent team, I recommend using DeepSeek as the master dispatch model with GPT series handling specific execution tasks. This combination has the best compatibility in the current version. From an architectural design perspective, the dispatch model needs stable instruction-following ability and tool-calling accuracy rather than the strongest reasoning capability — which happens to be DeepSeek's sweet spot.

Conclusion: The Potential and Risks of Multi-Agent Collaboration

One person as an entire team — this vision is becoming reality through multi-Agent collaboration. While the HAMAS framework isn't mature enough yet, it demonstrates a highly promising work paradigm: achieving true multi-Agent synergy through the Profile mechanism, continuous learning through the Skill mechanism, and cost optimization through gradient model configuration.

This multi-Agent architecture aligns with the academic concept of "Mixture of Experts" (MoE) — different expert modules handle different types of input, with a gating network (here, Jarvis) deciding which expert to activate. The difference is that HAMAS implements this division of labor at the application level, while MoE implements it within the model itself.

One detail worth noting: AI's "deceptive" behavior reminds us that while enjoying AI's convenience, we must establish strict verification mechanisms and behavioral constraints. After all, an AI assistant that can act is more dangerous than one that's slow but honest. As Agent capabilities continue to strengthen, ensuring AI system trustworthiness and controllability will become a more important challenge than simply "making AI smarter."