Fed Up with Every AI Coding Assistant, a Veteran Game Developer Built His Own Coding Agent

A game dev veteran built his own minimalist coding agent Pi after finding all existing AI tools inadequate.
Veteran game developer Mario tried every AI coding agent from Claude Code to OpenCode and found critical flaws in all of them — feature bloat, black-box context management, poor architecture, and security issues. Inspired by TerminalBench results showing models can excel with minimal tools, he built Pi: a coding agent with just four built-in tools, default YOLO mode, and a powerful extension system that puts developers in control.
When a Veteran Developer Says "No" to Every AI Coding Agent
Mario, a veteran game developer from Austria, was introduced to AI coding agents by friends (including Armin Ronacher, creator of Flask and Sentry) in April 2025. He spent 24 hours immersed in tools like Claude Code. From that point on, he was hooked — but not because he fell in love with them. Rather, he was dissatisfied with every coding agent on the market and ultimately decided to build his own.
The project is called Pi — a minimalist, extensible coding agent framework built around developer control. In a tech talk in London, Mario shared his in-depth critique of existing tools and the design philosophy behind Pi.
A Brief History of AI Coding Tools: From Copy-Paste to the Agentic Era
Looking back at the evolution of AI coding tools, we can identify several stages:
- 2023: Copy-pasting code from ChatGPT, mostly limited to handling individual functions
- The GitHub Copilot era: Integrated into VS Code with tab-to-complete, sometimes useful but often not, and occasionally regurgitating GPL code (like John Carmack's fast inverse square root algorithm)
- The Aider era: An early command-line coding assistant
- Late 2024 to early 2025: Claude Code burst onto the scene, truly creating the "agentic coding" category
Claude Code's core innovation was this: instead of deeply parsing ASTs (Abstract Syntax Trees) to index codebases like traditional tools, it used reinforcement learning to train models to explore codebases on the fly using file tools and Bash tools, finding the locations they need to understand and modify. Traditional code indexing tools (like Sourcegraph, ctags, etc.) parse ASTs to build structured indexes of codebases — an approach that's precise but expensive to maintain and struggles to capture semantic relationships in code. Claude Code took a fundamentally different approach: using reinforcement learning (RLHF/RLAIF) to train models to dynamically explore codebases with basic tools like file reading, grep searching, and Bash commands. This essentially simulates how a human developer understands an unfamiliar codebase — first looking at the directory structure, then grepping for keywords, and gradually diving into relevant files. The advantage is that no pre-built index is needed, it adapts to any language and project structure, and it works surprisingly well.
Claude Code's Fatal Flaws: Why a Veteran Developer Walked Away
Mario acknowledges that Claude Code is the category leader with an excellent team. But as he dug deeper, he discovered several serious problems.
Feature Bloat: Spaceship Syndrome
Because Claude Code can write massive amounts of code, the team kept cramming in features. The end product became a "spaceship" — users might only use 5% of its features, understand 10% of its content, and the remaining 90% is like "dark matter in AI" that nobody knows what it's doing.

Black-Box Context Management
In the summer of 2025, Mario built a series of tools to intercept the requests Claude Code sends to its backend, discovering that it was secretly injecting large amounts of extra text into the context behind the user's back. Worse still, these injections changed almost daily — new versions would alter the timing and content of injections, directly disrupting workflows users had already established.
"You're working with this new tool, trying to create predictable workflows, and then the tool vendor changes some tiny thing under the hood that makes the LLM go crazy on your existing workflows. This is simply not sustainable."
Terminal UI Performance Issues
The Claude Code team once proudly claimed their terminal UI "became a game engine." As someone with a game development background, Mario was unimpressed — re-laying out the entire UI with React in the terminal takes 12 milliseconds, while Ghostty author Mitchell pointed out directly: terminals can render hundreds of frames per second at under one millisecond per frame. "Your code is just bad."
Other AI Coding Tools Are Equally Problematic
OpenCode: Architecture and Security Concerns
Mario also tried OpenCode, an open-source coding framework, but found several troubling issues:

Crude context management: Every interaction calls SessionCompaction.prune, trimming all results before the last four token batches, which directly invalidates the Prompt Cache. Prompt Cache is a critical optimization technique offered by LLM API providers — when consecutive requests share the same prefix content, the API can reuse previously computed KV Cache, dramatically reducing latency and cost. Anthropic's Prompt Caching can save up to 90% in fees. OpenCode's aggressive context pruning on every interaction means the prefix changes with each request, causing cache hit rates to plummet. For coding agents that require frequent multi-turn interactions, this means not only higher API costs but also waiting for full inference computation on every response, severely degrading the development experience.
Poorly considered architecture: Every message becomes a separate JSON file on disk, suggesting the overall architecture lacks careful thought.
Security vulnerabilities: The default built-in server-client architecture has remote code execution vulnerabilities, and the issue remained open and unpatched for a long time.
The LSP Integration Trap
One particularly insightful observation concerns LSP (Language Server Protocol) integration. LSP is an open protocol designed by Microsoft in 2016 for VS Code, intended to decouple programming language intelligence features (code completion, go-to-definition, error diagnostics, etc.) from editors. LSP servers continuously monitor file changes and report diagnostic information in real time (such as compilation errors, type mismatches, etc.). This is extremely useful in human development scenarios, but creates a serious timing problem for AI agents.
When an agent needs to edit multiple files, the code is almost guaranteed not to compile after the first edit. For example, when renaming an interface, the first file is changed but the files referencing it haven't been updated yet — compilation errors are inevitable at that point. If LSP reports errors after every edit and feeds them back to the model, the model gets confused: "I haven't finished editing and you're already telling me it's wrong." Doing this frequently causes the model to abandon its correct refactoring plan and attempt erroneous "fixes" instead. The correct approach is to only perform code checks at natural synchronization points (when the agent believes it's finished), giving it the chance to complete full cross-file modifications before evaluating results.
The Truth Revealed by TerminalBench: Do Models Really Need That Many Tools?
A key turning point came from TerminalBench — an agent evaluation framework containing roughly 82 diverse tasks.

One of the top-performing frameworks on the leaderboard is Terminus, which gives the model an extremely simple interface: just a single TMUX session, where the model can only send keystrokes and read back VT code sequences from the terminal. TMUX (Terminal Multiplexer) is a terminal multiplexer that allows users to manage multiple sessions within a single terminal window. The Terminus framework provides the model with only two TMUX operations — send-keys and capture-pane — the former simulates keyboard input, and the latter reads VT100/VT220 escape sequences (the raw character stream displayed on the terminal screen). No file tools, no sub-operations, no web search — nothing. This means the model must "see" terminal output and "type" on the keyboard just like a human. Yet its performance ranks among the best.
This finding is profoundly significant: all those fancy coding tool features may not be necessary for agents to achieve good results. It fundamentally challenges the industry assumption that "more tools is better," suggesting that frontier models' general reasoning capabilities are already powerful enough to complete complex tasks with the most basic interfaces.
Pi's Design Philosophy: Minimal, Controllable, Extensible
Based on all these observations, Pi's core principle is: make the coding agent adapt to your needs, not the other way around.
Minimal Core: Only Four Built-in Tools
Pi has only four built-in tools: read file, write file, edit file, and run Bash. No MCP, no sub-agents, no plan mode, no background Bash, no built-in to-do list.
The system prompt is also extremely concise — because frontier models have been extensively trained with reinforcement learning and already know what a coding agent is, there's no need to repeatedly tell them.

Replacing Complex Features with Simple Solutions
- Instead of sub-agents: Restart the agent using TMUX, giving you full control over input and output
- Instead of plan mode: Write a plan.md file — it's persistent and reusable across sessions
- Instead of background Bash: Use TMUX
- Instead of built-in to-do: Write a todo.md file
Highly Extensible Plugin System
Pi's killer feature is its extension system. Users can extend tools, customize the UI, write skills, and modify prompt templates and themes through simple TypeScript files. All extensions support hot reloading — changes made by the agent to extensions take effect immediately.
Real-world examples include: custom compaction strategies, permission gating (implemented in 50 lines of code), tool versions for working on remote machines via SSH (implemented in 5 minutes), multi-agent chat rooms, and even extensions for playing mini-games while the agent runs.
The Case for Default YOLO Mode
Most coding agents either let the agent do whatever it wants or ask the user for confirmation at every step. Mario argues the latter leads to "approval fatigue" — similar to "alert fatigue" in the security field, where people either turn off confirmations entirely and enter YOLO mode, or sit there pressing Enter repeatedly without looking at anything. YOLO mode (You Only Live Once) means the agent can autonomously execute all operations — including file modifications and shell commands — without step-by-step human approval. Pi defaults to YOLO mode, and if security is a concern, it recommends using containerized isolation rather than conversational approval — using Docker or similar technology to restrict the agent's execution environment to a sandbox, so even if the agent performs destructive operations, the host system remains unaffected. This is a "guardrails" mindset rather than an "approval gates" mindset, achieving a better balance between security and efficiency.
A New Challenge in Open Source Governance: Dealing with AI-Generated Low-Quality PRs
Mario also shared an interesting open source governance experience: a flood of AI-generated low-quality PRs pouring into the repository. As AI coding tools become widespread, open source project maintainers face an increasingly severe challenge — these PRs typically feature code that looks reasonable but lacks understanding of project context, templated commit messages, inability to respond to maintainers' code review comments, and even fixes for non-existent problems. This not only drains maintainers' review energy but also dilutes genuinely valuable community contributions.
He coined the concept of "OSS holidays" (periodically not processing issues and PRs) and implemented an access control scheme — before submitting a PR, contributors must first submit a brief issue in a "human voice" introducing themselves, and only after their account name is added to a whitelist can they submit PRs. This is essentially a social engineering solution that filters out pure bot submissions by requiring contributors to first describe themselves and their intentions in natural language. Ghostty author Mitchell later developed this idea into a project called Vouch, establishing a contributor verification system based on community trust networks.
Less Is More: Pi Proves Minimalist Coding Agents Can Be Equally Powerful
Pi's story reflects a deep tension in the AI coding tools space: we're still in the exploration phase, and nobody knows what the perfect coding agent should look like. Between minimalism and full-featured spaceships, between full autonomy and human control, industry standards have yet to form. Pi has chosen a unique path — providing a minimal, extensible core that lets each developer explore answers in their own way. On TerminalBench, Pi paired with Claude Opus has already achieved near-top-tier results, proving that the "less is more" philosophy is indeed viable.
Key Takeaways
Related articles

DeepMind TacticAI Lands at Palmeiras: AI Predicts Match Dynamics 8 Seconds in Advance
Google DeepMind partners with Palmeiras to deploy TacticAI, the first AI tactical system in professional football, predicting open-play dynamics 8 seconds ahead.

Local Deployment of Claude Code: Principles and Practical Guide for Three Approaches
A detailed guide to locally deploying Claude Code with three approaches (LM Studio, Ollama, vLLM), covering architecture, protocol translation, hardware selection, and model recommendations.

Kuaima Client Tutorial: A Detailed Guide to Pay-Per-Use Pricing for Cursor's Premium Models
A detailed guide to downloading, installing, configuring, and using the Kuaima client for pay-per-use access to Cursor's premium AI models, with troubleshooting tips and security advice.