Fed Up with Every AI Coding Assistant, a Veteran Game Developer Built His Own Coding Agent

When a Veteran Developer Says "No" to Every AI Coding Agent

Mario, a veteran game developer from Austria, was introduced to AI coding agents by friends (including Armin Ronacher, creator of Flask and Sentry) in April 2025. He spent 24 hours immersed in tools like Claude Code. From that point on, he was hooked — but not because he fell in love with them. Rather, he was dissatisfied with every coding agent on the market and ultimately decided to build his own.

The project is called Pi — a minimalist, extensible coding agent framework built around developer control. In a tech talk in London, Mario shared his in-depth critique of existing tools and the design philosophy behind Pi.

A Brief History of AI Coding Tools: From Copy-Paste to the Agentic Era

Looking back at the evolution of AI coding tools, we can identify several stages:

2023: Copy-pasting code from ChatGPT, mostly limited to handling individual functions
The GitHub Copilot era: Integrated into VS Code with tab-to-complete, sometimes useful but often not, and occasionally regurgitating GPL code (like John Carmack's fast inverse square root algorithm)
The Aider era: An early command-line coding assistant
Late 2024 to early 2025: Claude Code burst onto the scene, truly creating the "agentic coding" category

Claude Code's core innovation was this: instead of deeply parsing ASTs (Abstract Syntax Trees) to index codebases like traditional tools, it used reinforcement learning to train models to explore codebases on the fly using file tools and Bash tools, finding the locations they need to understand and modify. Traditional code indexing tools (like Sourcegraph, ctags, etc.) parse ASTs to build structured indexes of codebases — an approach that's precise but expensive to maintain and struggles to capture semantic relationships in code. Claude Code took a fundamentally different approach: using reinforcement learning (RLHF/RLAIF) to train models to dynamically explore codebases with basic tools like file reading, grep searching, and Bash commands. This essentially simulates how a human developer understands an unfamiliar codebase — first looking at the directory structure, then grepping for keywords, and gradually diving into relevant files. The advantage is that no pre-built index is needed, it adapts to any language and project structure, and it works surprisingly well.

Claude Code's Fatal Flaws: Why a Veteran Developer Walked Away

Mario acknowledges that Claude Code is the category leader with an excellent team. But as he dug deeper, he discovered several serious problems.

Feature Bloat: Spaceship Syndrome

Because Claude Code can write massive amounts of code, the team kept cramming in features. The end product became a "spaceship" — users might only use 5% of its features, understand 10% of its content, and the remaining 90% is like "dark matter in AI" that nobody knows what it's doing.

In terms of observability and actually managing context

Black-Box Context Management

In the summer of 2025, Mario built a series of tools to intercept the requests Claude Code sends to its backend, discovering that it was secretly injecting large amounts of extra text into the context behind the user's back. Worse still, these injections changed almost daily — new versions would alter the timing and content of injections, directly disrupting workflows users had already established.

"You're working with this new tool, trying to create predictable workflows, and then the tool vendor changes some tiny thing under the hood that makes the LLM go crazy on your existing workflows. This is simply not sustainable."

Terminal UI Performance Issues

The Claude Code team once proudly claimed their terminal UI "became a game engine." As someone with a game development background, Mario was unimpressed — re-laying out the entire UI with React in the terminal takes 12 milliseconds, while Ghostty author Mitchell pointed out directly: terminals can render hundreds of frames per second at under one millisecond per frame. "Your code is just bad."

Other AI Coding Tools Are Equally Problematic

OpenCode: Architecture and Security Concerns

Mario also tried OpenCode, an open-source coding framework, but found several troubling issues:

And also OpenCode

Crude context management: Every interaction calls SessionCompaction.prune, trimming all results before the last four token batches, which directly invalidates the Prompt Cache. Prompt Cache is a critical optimization technique offered by LLM API providers — when consecutive requests share the same prefix content, the API can reuse previously computed KV Cache, dramatically reducing latency and cost. Anthropic's Prompt Caching can save up to 90% in fees. OpenCode's aggressive context pruning on every interaction means the prefix changes with each request, causing cache hit rates to plummet. For coding agents that require frequent multi-turn interactions, this means not only higher API costs but also waiting for full inference computation on every response, severely degrading the development experience.

Poorly considered architecture: Every message becomes a separate JSON file on disk, suggesting the overall architecture lacks careful thought.

Security vulnerabilities: The default built-in server-client architecture has remote code execution vulnerabilities, and the issue remained open and unpatched for a long time.

The LSP Integration Trap

One particularly insightful observation concerns LSP (Language Server Protocol) integration. LSP is an open protocol designed by Microsoft in 2016 for VS Code, intended to decouple programming language intelligence features (code completion, go-to-definition, error diagnostics, etc.) from editors. LSP servers continuously monitor file changes and report diagnostic information in real time (such as compilation errors, type mismatches, etc.). This is extremely useful in human development scenarios, but creates a serious timing problem for AI agents.

When an agent needs to edit multiple files, the code is almost guaranteed not to compile after the first edit. For example, when renaming an interface, the first file is changed but the files referencing it haven't been updated yet — compilation errors are inevitable at that point. If LSP reports errors after every edit and feeds them back to the model, the model gets confused: "I haven't finished editing and you're already telling me it's wrong." Doing this frequently causes the model to abandon its correct refactoring plan and attempt erroneous "fixes" instead. The correct approach is to only perform code checks at natural synchronization points (when the agent believes it's finished), giving it the chance to complete full cross-file modifications before evaluating results.

The Truth Revealed by TerminalBench: Do Models Really Need That Many Tools?

A key turning point came from TerminalBench — an agent evaluation framework containing roughly 82 diverse tasks.

From fix my Windows config to write me a Monte Carlo simulation and stuff like that

One of the top-performing frameworks on the leaderboard is Terminus, which gives the model an extremely simple interface: just a single TMUX session, where the model can only send keystrokes and read back VT code sequences from the terminal. TMUX (Terminal Multiplexer) is a terminal multiplexer that allows users to manage multiple sessions within a single terminal window. The Terminus framework provides the model with only two TMUX operations — send-keys and capture-pane — the former simulates keyboard input, and the latter reads VT100/VT220 escape sequences (the raw character stream displayed on the terminal screen). No file tools, no sub-operations, no web search — nothing. This means the model must "see" terminal output and "type" on the keyboard just like a human. Yet its performance ranks among the best.

This finding is profoundly significant: all those fancy coding tool features may not be necessary for agents to achieve good results. It fundamentally challenges the industry assumption that "more tools is better," suggesting that frontier models' general reasoning capabilities are already powerful enough to complete complex tasks with the most basic interfaces.

Pi's Design Philosophy: Minimal, Controllable, Extensible

Based on all these observations, Pi's core principle is: make the coding agent adapt to your needs, not the other way around.

Minimal Core: Only Four Built-in Tools

Pi has only four built-in tools: read file, write file, edit file, and run Bash. No MCP, no sub-agents, no plan mode, no background Bash, no built-in to-do list.

The system prompt is also extremely concise — because frontier models have been extensively trained with reinforcement learning and already know what a coding agent is, there's no need to repeatedly tell them.

We have TMUX

Replacing Complex Features with Simple Solutions

Instead of sub-agents: Restart the agent using TMUX, giving you full control over input and output
Instead of plan mode: Write a plan.md file — it's persistent and reusable across sessions
Instead of background Bash: Use TMUX
Instead of built-in to-do: Write a todo.md file

Highly Extensible Plugin System

Pi's killer feature is its extension system. Users can extend tools, customize the UI, write skills, and modify prompt templates and themes through simple TypeScript files. All extensions support hot reloading — changes made by the agent to extensions take effect immediately.

Real-world examples include: custom compaction strategies, permission gating (implemented in 50 lines of code), tool versions for working on remote machines via SSH (implemented in 5 minutes), multi-agent chat rooms, and even extensions for playing mini-games while the agent runs.

The Case for Default YOLO Mode

Most coding agents either let the agent do whatever it wants or ask the user for confirmation at every step. Mario argues the latter leads to "approval fatigue" — similar to "alert fatigue" in the security field, where people either turn off confirmations entirely and enter YOLO mode, or sit there pressing Enter repeatedly without looking at anything. YOLO mode (You Only Live Once) means the agent can autonomously execute all operations — including file modifications and shell commands — without step-by-step human approval. Pi defaults to YOLO mode, and if security is a concern, it recommends using containerized isolation rather than conversational approval — using Docker or similar technology to restrict the agent's execution environment to a sandbox, so even if the agent performs destructive operations, the host system remains unaffected. This is a "guardrails" mindset rather than an "approval gates" mindset, achieving a better balance between security and efficiency.

A New Challenge in Open Source Governance: Dealing with AI-Generated Low-Quality PRs

Mario also shared an interesting open source governance experience: a flood of AI-generated low-quality PRs pouring into the repository. As AI coding tools become widespread, open source project maintainers face an increasingly severe challenge — these PRs typically feature code that looks reasonable but lacks understanding of project context, templated commit messages, inability to respond to maintainers' code review comments, and even fixes for non-existent problems. This not only drains maintainers' review energy but also dilutes genuinely valuable community contributions.

He coined the concept of "OSS holidays" (periodically not processing issues and PRs) and implemented an access control scheme — before submitting a PR, contributors must first submit a brief issue in a "human voice" introducing themselves, and only after their account name is added to a whitelist can they submit PRs. This is essentially a social engineering solution that filters out pure bot submissions by requiring contributors to first describe themselves and their intentions in natural language. Ghostty author Mitchell later developed this idea into a project called Vouch, establishing a contributor verification system based on community trust networks.

Less Is More: Pi Proves Minimalist Coding Agents Can Be Equally Powerful

Pi's story reflects a deep tension in the AI coding tools space: we're still in the exploration phase, and nobody knows what the perfect coding agent should look like. Between minimalism and full-featured spaceships, between full autonomy and human control, industry standards have yet to form. Pi has chosen a unique path — providing a minimal, extensible core that lets each developer explore answers in their own way. On TerminalBench, Pi paired with Claude Opus has already achieved near-top-tier results, proving that the "less is more" philosophy is indeed viable.