Deep Dive into Pi Coding Agent: Why One Developer Abandoned Mainstream AI Tools to Build from Scratch

A Veteran Programmer's "Rebellious" Path

Mario Zechner is a seasoned open-source developer with a game development background and 17 years of open-source project management experience. He's best known for libGDX — a widely-used cross-platform Java game development framework. In April 2025, when a friend told him "coding Agents actually work now," his first reaction was "yeah, right." But a month later, he and Armin Ronacher (creator of the Flask framework) among others spent an all-nighter trying out various coding Agents, and he "hasn't slept well since."

After months of intensive use of mainstream AI coding Agents like Cloud Code, OpenCode, and Cursor, Mario made what seemed like a crazy decision — to build a coding Agent framework from scratch. This is the origin story of Pi.

The Fatal Flaws of Mainstream Coding Agents

Cloud Code: From Simplicity to "Spaceship" Feature Bloat

Mario's assessment of Cloud Code went from "love" to "disappointment." He acknowledges that Cloud Code was a category creator — using reinforcement learning to train models to autonomously explore codebases with file tools and Bash tools, rather than building AST indexes like Cursor.

It's worth understanding the significance of Reinforcement Learning (RL) in coding Agents, which represents a major breakthrough in AI programming tools in recent years. Traditional approaches rely on static code indexing (such as AST — Abstract Syntax Tree — parsing), which requires pre-analyzing code structure and building symbol tables. Companies like Anthropic instead use RL to train models to autonomously use tools — the model repeatedly tries and fails across numerous programming tasks, gradually learning when to read files, when to execute commands, and when to search the codebase. The advantage is that the model can adapt to all kinds of unseen codebase structures rather than relying on preset indexing rules. Cloud Code pioneered this paradigm, and its effectiveness was stunning, dramatically boosting developer productivity.

But problems followed. Cloud Code fell into the "feature bloat" trap: constantly adding new features until it became a Homer Simpson-style "spaceship" — a reference to the episode of The Simpsons where Homer designs a car that becomes unusable because he crammed in every conceivable feature. Mario estimates that users actually use no more than 5% of the features, are aware of no more than 10%, and the remaining 90% is "AI dark matter."

Cloud Code's feature bloat problem

Even more unacceptable to Mario was Cloud Code's behind-the-scenes context manipulation. He built interception tools and discovered that Cloud Code injects additional text into the context without the user's knowledge, and these injections change almost daily. This means a carefully tuned workflow could completely break after a silent update — for professional developers who depend on deterministic behavior, this kind of opacity is unacceptable.

Additionally, Cloud Code's terminal UI flickering exposed technical shortcomings. When official DevRel Tarek claimed "our terminal UI is now a game engine," Mario — with his game development background — responded bluntly: "That's not a game engine. That's you using React in the terminal, requiring 12 milliseconds per frame just for layout." A real game engine aims to complete all logic and rendering within 16.67 milliseconds per frame (60 FPS), yet Cloud Code's layout calculation alone consumes 12 milliseconds. Ghostty terminal author Mitchell also chimed in: terminals can render at hundreds of FPS — the problem lies in Cloud Code's own code.

OpenCode: Architectural Decisions That Planted Hidden Risks

As an open-source solution, OpenCode's team is pragmatic and avoids hype, which Mario appreciated. But after deeper use, he discovered several serious issues.

Overly aggressive context management: OpenCode calls session_compaction.prune every turn, trimming all tool results before the last 40,000 tokens. This directly breaks Prompt Cache, leading to tensions with Anthropic.

To understand the severity of this issue, you need to understand how Prompt Cache works. Prompt Cache is a critical cost optimization mechanism offered by LLM API providers like Anthropic: when consecutive requests share the same prefix content, the API caches the processed tokens, and subsequent requests only need to pay for the new portion (cached token prices are typically reduced by 90%). OpenCode's aggressive per-turn pruning of historical messages changes the request prefix, invalidating the cache. Every request must reprocess the entire context, increasing both latency and API costs significantly. For Anthropic, this means massive computational loads that can't be optimized through caching — the direct cause of the tension between the two parties.

The conflict between Anthropic and OpenCode over context management

LSP integration backfires: When an Agent needs to edit 10 files consecutively, the code is almost guaranteed not to compile after the first edit. The LSP server immediately reports errors, and the model — receiving feedback that "what you just did was wrong" — becomes confused or even gives up, because it hasn't finished editing yet.

LSP (Language Server Protocol) is a standard protocol proposed by Microsoft in 2016, originally designed for VS Code and now the industry standard for editor intelligence features. It provides editors with code completion, error diagnostics, go-to-definition, and more. The original intent of integrating LSP into AI coding Agents was to give them real-time code quality feedback. But the problem is that human developers edit code incrementally — intermediate states inevitably contain syntax or type errors. LSP's real-time error reporting is harmless to humans (who know they haven't finished writing), but it sends misleading negative feedback signals to AI models, potentially causing premature rollbacks or infinite loops of fixing intermediate states. Mario argues that type checking and linting should only trigger when the Agent believes it has finished.

Underlying architectural concerns: Each message is stored as an individual JSON file, suggesting a lack of thoughtful architectural design — this approach creates file system pressure and read performance issues as message volume grows. More seriously, OpenCode's server architecture, enabled by default, was found to have a remote code execution vulnerability that went unpatched for an extended period.

Lessons from TerminalBench: The Triumph of Minimalism

While researching benchmarks, Mario discovered TerminalBench — an Agent evaluation framework containing roughly 82 computer-use and programming tasks. One of the top-performing Agents was called Terminus, with an extremely simple interface: the model can only send keystrokes to a TMUX session and read VT escape sequences.

The technical background of this design is worth exploring. VT escape sequences are the low-level protocol for terminal control, originating from the DEC VT100 terminal standard of the 1970s. They use special character sequences (like \033[2J to clear the screen, \033[1;1H to move the cursor) to control terminal display. TMUX is a terminal multiplexer that allows managing multiple independent sessions within a single terminal window, supporting session detachment and reattachment. The brilliance of the Terminus Agent's design lies in this: it uses no high-level abstraction APIs, instead completing complex programming tasks through the most primitive terminal interaction methods (sending keystrokes, reading screen output). This proves that frontier models already possess sufficient reasoning capability to compensate for a bare-bones interface.

Terminus's minimalist approach in TerminalBench

No file tools, no sub-Agents, no web search — just this minimal interface, yet it ranked among the top on the leaderboard. This discovery profoundly influenced Mario's design philosophy: the vast majority of features in existing coding Agents may contribute negligibly to model performance.

Pi's Design Philosophy: The Art of Subtraction

Based on his deep analysis of mainstream tools, Mario put forward two core arguments:

We're still in the "exploration phase" — nobody knows what the perfect coding Agent should look like
Coding Agents need to be highly customizable, allowing users to rapidly experiment with different workflows

Pi's core principle: make the coding Agent adapt to your needs, not the other way around.

Minimalist Core Architecture

Pi consists of four packages:

AI package: A multi-provider abstraction layer for easily switching between AI models (supporting Anthropic, OpenAI, Google, and other major providers)
Agent core: A generic Agent loop with tool calling, validation, and streaming output
Interface: A terminal UI in just 600 lines of code (for comparison, Cloud Code's UI codebase is estimated to be in the tens of thousands of lines)
Coding Agent: Can run headlessly as an SDK or as a full TUI

The system prompt is extremely short, because frontier models already "know" through reinforcement learning that they are coding Agents — no need to repeatedly tell them. This is an important design insight: early AI tools required verbose system prompts to "teach" the model how to behave, but the latest RL-trained models have internalized coding Agent behavior patterns. Excessive instructions may actually interfere with the model's optimal decision-making path.

Only Four Core Tools

Pi ships with just four built-in tools: read file, write file, edit file, and Bash. No MCP, no sub-Agents, no plan mode, no background Bash.

Regarding MCP (Model Context Protocol) — this is an open protocol launched by Anthropic in late 2024, designed to standardize how AI models connect with external tools and data sources. It's similar to a USB port for the AI world, allowing any tool developer to provide capabilities to AI models in a unified format. While the MCP ecosystem has grown rapidly with hundreds of community-contributed server implementations, Mario believes that for coding Agents, directly calling CLI tools combined with custom Skills is sufficient. MCP adds unnecessary complexity, abstraction layers, and potential security attack surfaces.

Pi's extensibility design approach

But this doesn't mean missing functionality — Pi lets users build what they need through a powerful extension system:

Replacing MCP: Use CLI tools + Skills or build custom extensions
Replacing sub-Agents: Launch new Pi instances via tmux, maintaining full observability (each sub-Agent's complete conversation history is auditable, unlike black-box sub-Agents that hide intermediate processes)
Replacing plan mode: Write a plan.md file, reusable across sessions
Replacing background Bash: Manage directly with tmux

Deep Extensibility: Giving Control Back to Developers

Pi allows users to customize nearly every component:

Custom tools: Written in TypeScript, auto-loaded
Custom compaction strategies: Mario considers this the most worthwhile area for experimentation — different task types (e.g., refactoring vs. new feature development) may require entirely different context management strategies
Custom permission controls: Implementable in just 50 lines of code
Override built-in tools: Modify the behavior of read, write, edit, and Bash
Full TUI access: Write completely custom interfaces

All extensions support hot reloading — develop extensions within your project, and changes take effect immediately after the Agent modifies them. Hot reloading is especially significant in developer tools: developers can modify extension logic while the Agent is running, or even have the Agent modify its own extension code with immediate effect. This creates a unique metaprogramming experience — the Agent can optimize its own toolset while executing tasks, forming a rapid iteration feedback loop without needing to restart the session and lose context.

The community has already built impressive projects with this extension system: Cloud Code's sub-Agent functionality was replicated in 5 minutes with richer features; Pi Messenger enables multi-Agent chat rooms (multiple Agent instances can converse and collaborate with each other); Pi Annotate allows direct annotation on web pages with feedback sent to the Agent.

Real-World Performance and Open-Source Governance Strategy

On the TerminalBench leaderboard, Pi with Claude Opus 4.5 ranks just behind Terminus 2 — and this was achieved before Pi even implemented compaction. This means that as compaction strategies are added (allowing longer tasks to be handled without exceeding the context window), Pi's performance has significant room for improvement.

On the open-source governance front, Mario faces the flood of AI-generated low-quality PRs — a widespread challenge for the open-source community in 2024-2025, where users have AI Agents automatically generate Pull Requests for open-source projects. These PRs often lack understanding of project context and create enormous review burdens for maintainers. Mario creatively invented the "OSSification" strategy: periodically closing Issue and PR channels, requiring contributors to first write a brief Issue in their "human voice" to introduce themselves, and only after verification can they submit PRs. The name cleverly plays on "ossification" (rigidity) and "OSS" (Open Source Software), suggesting that open-source communities need moderate "barriers" to resist the erosion of AI-generated spam. Ghostty author Mitchell developed the Vouch project based on this concept, enabling more open-source projects to adopt similar human verification mechanisms.

Conclusion: A Minimalist Core Is the Future of Coding Agents

Pi's story isn't just about the birth of a tool — it's a profound reflection on the current AI coding Agent ecosystem. While everyone else is adding features, Mario chose subtraction — a minimalist core paired with powerful extensibility, giving control back to developers.

As TerminalBench revealed, the model's own capabilities may matter far more than the features we pile on top. In this "exploration phase," Pi offers not a definitive answer, but a platform where every developer can find their own optimal solution. This design philosophy is deeply aligned with Unix's core principle — do one thing well, and achieve complexity through composition. In an era of exponentially growing AI capabilities, perhaps the wisest tool design is to constrain the model as little as possible and let it perform freely.