How the Anthropic Team Uses Claude Code in Practice: Lessons from 16 Parallel Agents Building a Compiler

Core Finding: Claude Code Does Far More Than Write Functions

Anthropic's engineering blog and official documentation show how the company's R&D team actually uses Claude Code. They do not simply ask the AI to write a function: they run 16 parallel Agents to develop a compiler, use a three-role architecture to build full-stack applications, and deploy automatic classifiers to handle approval workflows. The architectural thinking behind these practices is worth close study for any developer working with AI-assisted programming.

16 Parallel Agents Building a C Compiler from Scratch

Anthropic researcher Nicholas Carlini ran a stunning experiment: 16 parallel Claude Agents wrote a C compiler from scratch that can compile the Linux kernel. This is not a toy project. The compiler builds Linux 6.9, FFmpeg, SQLite, and PostgreSQL, achieves a 99% pass rate on the GCC torture tests, and can even compile and run Doom.

Technical Implementation of Multi-Agent Collaboration

The mechanism is simple: a bash loop script keeps Claude running continuously in Docker containers, so each Agent picks up the next task as soon as it finishes one. The Agents collaborate through Git in a distributed fashion. Each Agent works in an independent directory and claims tasks using file locks, then pulls and merges code, and Claude resolves any merge conflicts on its own.
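The task-claiming step can be illustrated with a small sketch. It assumes a shared local directory and relies on atomic file creation; the real system synchronizes through Git, and the names `claim_next_task`, `release_task`, and the `.lock` suffix are hypothetical.

```python
import os
from pathlib import Path
from typing import Optional


def claim_next_task(tasks: list[str], lock_dir: Path, agent_id: str) -> Optional[str]:
    """Claim the first task nobody holds a lock on; return None if all are taken."""
    lock_dir.mkdir(parents=True, exist_ok=True)
    for task in tasks:
        lock_path = lock_dir / f"{task}.lock"
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only one
            # agent can create the lock for a given task.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue  # another agent already claimed this task
        with os.fdopen(fd, "w") as handle:
            handle.write(agent_id)  # record the owner for debugging
        return task
    return None


def release_task(task: str, lock_dir: Path) -> None:
    """Remove the lock once the work is merged or abandoned."""
    (lock_dir / f"{task}.lock").unlink(missing_ok=True)
```

Because the lock is taken by creating a file exclusively, two agents racing for the same task cannot both succeed; the loser simply moves on to the next task in the list.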

The Agents also specialized: some merged duplicate code, others optimized performance, and others wrote documentation. The final tally: nearly 2,000 Claude Code sessions over two weeks, 2 billion input tokens, 140 million output tokens, and a total cost of approximately $20,000, producing 100,000 lines of code.

Three Key Lessons

Validators must be reliable. Claude will autonomously solve whatever problem it is given, so if the validator has a bug, Claude may fix the validator rather than the code. Give it a flawed test case and it might "cleverly" modify the test until it passes while the bug remains. The correct approach is to confirm that the test cases themselves are sound first, then let Claude fix the code.
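One way to notice when an Agent has edited the validator instead of the code is to fingerprint the test files before and after a run. The sketch below is only an illustration of that idea; `snapshot` and `changed_files` are hypothetical helpers, not part of Claude Code.

```python
import hashlib
from pathlib import Path


def snapshot(paths: list[Path]) -> dict[str, str]:
    """Map each file path to the SHA-256 digest of its current contents."""
    return {str(p): hashlib.sha256(p.read_bytes()).hexdigest() for p in paths}


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return paths whose digest differs, or that appeared or disappeared."""
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )
```

Taking a snapshot of the test directory before the Agent starts and comparing afterwards gives a cheap signal: a non-empty result means the tests themselves changed and the run deserves a human look.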

Think from Claude's perspective. Don't let logs dump thousands of characters of useless output: it pollutes the context window and degrades subsequent reasoning. Trimming output to what matters is a critical step in improving Agent efficiency.
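As a rough illustration of trimming output, the sketch below keeps lines that mention an error plus the tail of the log and drops the rest. The function name, the "error" keyword heuristic, and the default budgets are all assumptions made for this example.

```python
def summarize_log(text: str, max_lines: int = 20, max_width: int = 200) -> str:
    """Keep error lines plus the tail of the log, within a rough line budget."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        kept = lines
    else:
        tail_count = max_lines // 2
        split = len(lines) - tail_count
        head, tail = lines[:split], lines[split:]
        # From the part we are about to drop, rescue lines that mention an error.
        errors = [ln for ln in head if "error" in ln.lower()][: max_lines - tail_count]
        omitted = len(head) - len(errors)
        kept = errors + [f"... {omitted} lines omitted ..."] + tail
    # Also cap the width of each line so one huge line cannot flood the context.
    return "\n".join(ln[:max_width] for ln in kept)
```

The omission marker tells the Agent that content was dropped, so it can ask for the full log when it actually needs it.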

Keep parallelization simple. The simpler the task decomposition, the better the parallel performance. Coordination overhead for complex tasks will consume the gains from parallelization.

Three-Agent Architecture: Solving Context Anxiety

Single-Agent long-running sessions have two fatal problems. First is context anxiety: when the context window fills up with various information, Claude starts repeatedly checking completed work, afraid to push forward on new tasks, and rushes to finish when the window is about to fill. Second is self-evaluation bias: Agents are always overly lenient and optimistic when evaluating their own work — "looks good, close enough" — when in reality it's far from done.

Planner-Generator-Evaluator Division

Anthropic's solution is a three-role architecture:

Planner: Expands simple prompts into complete product specifications, intentionally not over-specifying technical details
Generator: Implements features sprint by sprint, self-evaluating after each sprint
Evaluator: Uses Playwright to actually operate the running application, clicking and testing like a user, then feeding real bugs back to the generator
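The control flow between the three roles can be sketched as a loop. In this illustration the planner, generator, and evaluator are plain callables supplied by the caller; in the setup described above they would be separate Claude sessions, with the evaluator driving the app through Playwright. `run_sprints`, `SprintResult`, and `max_rounds` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SprintResult:
    feature: str
    rounds: int
    open_bugs: list[str] = field(default_factory=list)


def run_sprints(
    prompt: str,
    plan: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
    evaluate: Callable[[str], list[str]],
    max_rounds: int = 3,
) -> list[SprintResult]:
    """Plan once, then loop generate -> evaluate per feature until no bugs remain."""
    results = []
    for feature in plan(prompt):
        bugs: list[str] = []
        rounds = 0
        for rounds in range(1, max_rounds + 1):
            build = generate(feature, bugs)  # generator sees the evaluator's bug list
            bugs = evaluate(build)           # a separate role judges the build
            if not bugs:
                break
        results.append(SprintResult(feature, rounds, bugs))
    return results
```

The key design point is that `evaluate` never shares state with `generate`: it only sees the build, which is what counters the self-evaluation bias described earlier.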

Comparative Test Data

Both runs used the same prompt, "build a retro game editor". The single Agent took 20 minutes and $9 and produced a rough interface with broken core features. The three-Agent setup took 6 hours and $200 and produced a feature-rich, actually playable result. That is roughly 20x more expensive, but the result went from "unusable" to "playable." For real projects, that trade is well worth it.

Auto Mode: Solving Approval Fatigue

Developers who have used Claude Code know that it asks for approval at every step. Statistics show that users approve 93% of the time, which leads to severe approval fatigue: by the tenth click you are no longer reading the content, and the safety guardrail becomes meaningless.

Dual-Layer Defense Mechanism

Anthropic's solution is Auto Mode: a model classifier automatically determines which operations are safe and which are dangerous — safe ones pass through directly, dangerous ones prompt the user.

Specifically, a dual-layer defense was designed: the input layer scans tool outputs and injects warnings when suspicious content is detected; the output layer classifier only looks at user messages and tool calls, not the Agent's reasoning process — this prevents the Agent from "persuading" the classifier to let things through.
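The two layers can be sketched in a few lines. This is a toy illustration under stated assumptions: the regular expression and the list of dangerous markers are hypothetical stand-ins for the model classifier, and none of the names below come from Claude Code.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns; a real input layer would use a trained detector.
SUSPICIOUS = re.compile(
    r"ignore (all )?previous instructions|curl .*\|\s*sh", re.IGNORECASE
)


def scan_tool_output(output: str) -> str:
    """Input layer: prepend a warning when tool output looks like an injection."""
    if SUSPICIOUS.search(output):
        return "[WARNING: tool output may contain injected instructions]\n" + output
    return output


@dataclass
class Turn:
    user_message: str
    agent_reasoning: str
    tool_call: str


def classifier_view(turn: Turn) -> dict[str, str]:
    """Output layer input: the user message and tool call only, never the reasoning."""
    return {"user_message": turn.user_message, "tool_call": turn.tool_call}


def decide(turn: Turn) -> str:
    """Toy stand-in for the classifier: returns 'allow' or 'ask_user'."""
    view = classifier_view(turn)
    dangerous = ("rm -rf", "git push --force", "DROP TABLE")
    if any(marker in view["tool_call"] for marker in dangerous):
        return "ask_user"
    return "allow"
```

Because `decide` only ever receives `classifier_view(turn)`, nothing the Agent writes in its reasoning can influence the verdict, which is the property the output layer is designed to guarantee.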

In tests on 10,000 real internal requests, the false positive rate was only 0.4%: just 40 requests were incorrectly blocked.

Managed Agents: Avoiding Over-Engineering

This is the part that best demonstrates architectural thinking. The Anthropic team found that Agent frameworks encode too many assumptions about model limitations, but these assumptions become outdated as models improve. For example: you build a complete Agent framework with a bunch of best practices, then the model upgrades and most of it becomes useless — because those practices were essentially compensating for model deficiencies, and once the model gets stronger, the compensation is no longer needed.

Three-Component Decomposition Strategy

Their core approach is decomposing the Agent system into three independent components:

Brain: Claude + Harness, stateless and horizontally scalable
Hands: Sandbox and tool containers, treated as "cattle" that can be replaced anytime, no longer "pets"
Memory: Persistent event logs, with the Harness organizing context on demand
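A minimal sketch of this decomposition, assuming a JSON Lines file as the event log and a plain callable as the tool container. `EventLog`, `build_context`, and `run_step` are hypothetical names chosen for the example, not Anthropic's implementation.

```python
import json
from pathlib import Path


class EventLog:
    """Memory: an append-only JSON Lines file that outlives any single process."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def append(self, kind: str, payload: str) -> None:
        with self.path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps({"kind": kind, "payload": payload}) + "\n")

    def read(self) -> list[dict]:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as handle:
            return [json.loads(line) for line in handle if line.strip()]


def build_context(log: EventLog, last_n: int = 5) -> list[dict]:
    """Harness: rebuild a working context from the log on demand (here, the tail)."""
    events = log.read()
    return events[-last_n:] if last_n > 0 else []


def run_step(log: EventLog, tool, command: str) -> str:
    """Brain: keeps no state between calls; everything durable goes to the log."""
    log.append("tool_call", command)
    result = tool(command)  # Hands: any replaceable callable or sandbox
    log.append("tool_result", result)
    return result
```

Since `run_step` holds nothing between calls, any process can pick up where another left off by opening the same log, and the `tool` argument can be swapped for a fresh container at any time.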

Performance data: first-token latency P50 dropped approximately 60%, P95 dropped over 90%.

Six Official Best Practices

1. CLAUDE.md Configuration First

Every time Claude Code starts, it reads the CLAUDE.md file, which is essentially the project's persistent memory. Files load in layers from broad to specific: user-level (applies to all projects) → project-level (shared by the team) → subdirectory-level (highest priority).

Key principle: write things Claude can't guess (build commands, code style rules, common pitfalls) — don't write what can be inferred by reading the code. An overly long CLAUDE.md causes important rules to get buried.
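The broad-to-specific layering can be pictured with a short sketch. It only illustrates the ordering described above and is not Claude Code's actual loader; the function name and the directory arguments are assumptions.

```python
from pathlib import Path


def memory_files(user_dir: Path, project_root: Path, working_dir: Path) -> list[Path]:
    """List existing CLAUDE.md files from broadest to most specific scope.

    working_dir must be project_root or a directory beneath it.
    """
    candidates = [user_dir / "CLAUDE.md", project_root / "CLAUDE.md"]
    # Walk down from the project root to the working directory.
    current = project_root
    for part in working_dir.relative_to(project_root).parts:
        current = current / part
        candidates.append(current / "CLAUDE.md")
    return [path for path in candidates if path.is_file()]
```

Later entries in the returned list are the more specific ones, so a consumer that lets later files take precedence reproduces the "subdirectory wins" behavior.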

2. Plan Mode for Complex Tasks

Four-step workflow: Exploration phase (read files and answer questions in Plan Mode) → Planning phase (create detailed implementation plan) → Implementation phase (exit Plan Mode and code according to plan) → Commit phase (write descriptive commit messages, open PRs).

Applicable scenarios: new feature development, multi-file refactoring, architecture decisions, unclear requirements. If a task can be described in one sentence, don't bother planning.

3. Dynamic Workflows for Cross-Validation

This is a new feature in version 2.1.154. The core idea is to turn plans into scripts that can be run repeatedly, executed in parallel, and cross-validated. The most practical feature is validation by N independent "skeptics": a result that the majority rejects is excluded. The Claude writing the code is not the same Claude reviewing it.
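The majority rule can be written down in a few lines. This is a sketch of the voting logic only; each skeptic is assumed to be a callable that returns True to accept a claim, and how ties are handled (here, the claim is kept) is a choice made for this example.

```python
from typing import Callable, Iterable


def survives_skeptics(claim: str, skeptics: Iterable[Callable[[str], bool]]) -> bool:
    """Keep a claim unless a strict majority of skeptics reject it.

    Each skeptic returns True to accept the claim and False to reject it.
    """
    votes = [bool(skeptic(claim)) for skeptic in skeptics]
    rejections = sum(1 for vote in votes if not vote)
    return rejections * 2 <= len(votes)
```

In practice each skeptic would be an independent review session with its own context, so one reviewer's blind spot does not decide the outcome alone.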

4. Manage the Context Window

This is the foundation of all the other strategies. Use the /clear command to reset between unrelated tasks; use sub-Agents for investigation so they don't pollute the main context; and if two corrections haven't worked, clear the context and start over. Good context management produces a qualitative improvement in Agent output.

5. Multi-Agent Collaboration Patterns

Claude Code supports creating various specialized sub-Agents: Explore (quick codebase scanning), Frontend Engineer, Backend Engineer, Code Reviewer, Test Engineer. Key parameters: Isolation uses Worktrees to give sub-Agents independent Git branches avoiding file conflicts; Run Background runs in the background and notifies upon completion.
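Worktree isolation builds on the standard `git worktree add` command, which creates a second checkout on its own branch. The sketch below only constructs such a command; the branch and directory naming scheme is an assumption for illustration and may not match what Claude Code does internally.

```python
from pathlib import Path


def worktree_command(repo: Path, agent_name: str) -> list[str]:
    """Build a `git worktree add` call that gives one sub-agent its own checkout."""
    branch = f"agent/{agent_name}"                      # hypothetical branch naming
    checkout = repo.parent / f"{repo.name}-{agent_name}"  # sibling directory
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(checkout)]
```

Passing the returned list to `subprocess.run(cmd, check=True)` would create the checkout; each sub-Agent then edits files in its own directory, so concurrent changes never touch the same working tree.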

6. Give Claude Verification Methods

Autonomous operation without verification methods is dangerous. The Agent thinks it's done, but it might not be finished or might have done it wrong. You must provide objective verification mechanisms, such as test suites, type checking, lint tools, etc.
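An objective "done" signal can be as simple as running a set of commands and checking their exit codes. The helper names below are hypothetical, and which commands to run is left to the project.

```python
import subprocess


def run_checks(checks: dict[str, list[str]]) -> dict[str, bool]:
    """Run each command and record whether it exited with status 0."""
    results = {}
    for name, command in checks.items():
        completed = subprocess.run(command, capture_output=True, text=True)
        results[name] = completed.returncode == 0
    return results


def is_done(results: dict[str, bool]) -> bool:
    """The task counts as done only when at least one check ran and all passed."""
    return bool(results) and all(results.values())
```

A project might call it with something like `run_checks({"tests": ["pytest", "-q"], "types": ["mypy", "."]})`, assuming those tools are installed. The point is that the Agent's own claim of completion is never the deciding input.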

Summary: One Core Principle

Looking at all these practices together, Anthropic is really saying the same thing: hand deterministic logic to scripts, hand judgment calls to AI, then give AI a reliable verification method.

Applied to daily development: write a good CLAUDE.md first, use Plan Mode for complex tasks, lean into parallelization, separate generation from evaluation, manage context well, and don't over-engineer. Models are improving rapidly, so today's best practices might be tomorrow's baggage. Keeping the architecture flexible matters more than perfecting it for the present.