OpenAI Engineer's Practice: Agent-Driven Development Methodology in the Era of Free Code

OpenAI engineer Ryan Lopopolo recently shared his deep practical experience from nine months of building software purely with AI agents in a technical talk. He calls himself a "Token Billionaire," consuming over one billion output tokens per day, worth more than a thousand dollars. Tokens are the basic unit of measurement for how large language models process text — one token corresponds to roughly 3/4 of an English word or 1-2 Chinese characters. Output tokens (content generated by the model) are typically more expensive than input tokens (prompts provided by the user). Taking GPT-4o as an example, output tokens cost approximately $15 per million. Ryan's daily consumption of over one billion output tokens means his team has AI generating code and text equivalent to millions of lines of code every day — a scale of AI usage that remains an extreme case in the industry. He put forward a bold vision: code is free, and the human role should shift from writing code to system design, task orchestration, and quality control.

Core Philosophy: Code Is Free, Humans Steer

Ryan's central thesis upends traditional software engineering thinking — implementation capability is no longer a scarce resource. He even prohibits team members from using editors directly, forcing everyone to complete coding work through AI models.

In his view, there are only three truly scarce resources in today's world: human time, human and model attention, and the model's context window. The context window is the maximum number of tokens a large language model can process in a single inference pass. Even the most advanced models (such as Claude's 200K or GPT-4o's 128K context window) still fall short when facing large codebases — a medium-sized project might contain hundreds of thousands of lines of code, far exceeding a single context capacity. This is why Ryan lists the context window as one of the three scarce resources: how you inject the most critical information into a limited context directly determines the quality of the agent's output.

Under this new paradigm, every engineer effectively has the productive capacity of 5, 50, or even 5,000 engineers, working 24/7 without interruption. The only constraints are GPU compute and token budget.

This means that P3-level tasks that could never make it onto the schedule due to insufficient headcount can now be kicked off immediately — even trying multiple approaches in parallel and selecting the optimal solution. Ryan gave a vivid example: when code is free, all internal tools can have comprehensive localization support from day one, providing native-language interfaces for colleagues in London, Dublin, Paris, and elsewhere, without having to trade off between features and quality.

Our job is to find every possible way

Methodology for High-Quality Agent Delivery

Define Non-Functional Requirements Explicitly

Getting a patch right might require making 500 small decisions, all revolving around non-functional requirements that haven't been explicitly specified. Non-Functional Requirements (NFRs) are constraints related to the quality of system behavior in software engineering, including performance, security, maintainability, observability, error handling strategies, and more. Unlike functional requirements (what the system should do), NFRs define how the system should do it. In traditional development, these decisions often rely on the tacit experience of senior engineers and are rarely fully documented.

Ryan emphasizes that these models have already seen trillions of lines of code during training. The key is to write down the non-functional requirements so the agent can see what constitutes acceptable work. His insight is that AI agents lack the tacit experience of human engineers, so NFRs must be made explicit and documented to enable agents to produce production-grade code.

He proposed a concise and powerful principle: "Don't shovel bad code, and don't accept bad code." Achieving this requires sacrificing short-term speed, deeply analyzing where agents struggle, setting up guardrails, and then investing time in higher-leverage activities.

Multi-Layered Prompt Injection Strategy

Ryan revealed a profound insight: everything he's talking about is essentially prompt engineering, with absolutely no need to modify model weights. Prompt engineering itself has evolved from simple conversational techniques into a systematic engineering discipline, encompassing context management, instruction layering, constraint injection, and more. Ryan's methodology elevates prompt engineering from "the art of talking to AI" to "part of software engineering infrastructure." Methods of prompt injection include:

agents.md files: Define behavioral specifications for agents. This is a practice pattern of storing AI behavioral specifications as Markdown files in the code repository, similar to how .editorconfig or .eslintrc configure editors and linting tools. The deeper significance of this approach is bringing prompt engineering into version control systems, making it traceable, collaborative, and iterable.
Custom ESLint rules: Guide agents to fix code through lint error messages. ESLint is the most popular static code analysis tool in the JavaScript/TypeScript ecosystem, detecting potential issues before code runs through predefined or custom rules. Ryan's innovative usage is designing ESLint error messages to be AI-friendly prompts — when agent-generated code triggers a lint error, the error message itself contains fix guidance, forming an automated feedback-correction loop without human intervention.
Review agents: Add comments on PRs and require them to be addressed before merging
Test constraints: Write tests that enforce a maximum of 350 lines per file
Better error messages: Instead of just saying "Lint check failed," tell the model exactly how to fix it

With enough tool tokens and context

He gave a classic example: network code missing timeouts and retry mechanisms causing production outages — a pitfall many engineers have encountered. In distributed systems, network calls without timeout settings can cause threads or connections to be held indefinitely, eventually exhausting system resources and triggering cascading failures; missing retry mechanisms mean any transient network jitter directly results in request failures. Rather than relying on human reviewers to remember this rule, it's better to write a static analysis rule that ensures every fetch call includes timeout and retry mechanisms, solving the problem at its root.

Minimalist Philosophy of Skill Design

Ryan's team maintains only 5 to 10 core skills, rather than spreading across dozens or hundreds. The reason is practical: the infrastructure and local development tools in the repository change extremely frequently, making maintenance costs too high. They hide all complexity behind a handful of carefully designed skills and let the agents figure everything out on their own.

An interesting detail: when the team switched from directly using the Chrome DevTools Protocol to a daemon-based approach, Ryan didn't learn about the change until three weeks later. The Chrome DevTools Protocol (CDP) is a remote debugging protocol exposed by the Chrome browser that allows external programs to control browser behavior, commonly used for automated testing and web scraping. A daemon is a service process that runs continuously in the background. Switching from direct CDP connections to a daemon-based approach means the team encapsulated browser automation as a stable intermediate service layer, reducing the complexity of direct protocol interaction. The fact that this change could be self-adapted by AI agents demonstrates that good documentation and abstraction layer design can give AI the ability to autonomously learn new toolchains — Codex was able to adapt using existing documentation with absolutely no human intervention.

You don't directly participate in Cloud Code work

The Automation Revolution in Code Review

From Human Bottleneck to Agent Closed Loop

In high-velocity development mode, each engineer submits 3 to 5 PRs (Pull Requests — the core mechanism for code review and collaboration in modern software development) per day. Even with a team of just three people, merge conflicts become a headache. Merge conflicts occur when multiple people modify the same region of the same file simultaneously, and the Git version control system cannot automatically decide which version to keep, requiring manual intervention. When AI dramatically increases code output velocity, this problem is amplified exponentially.

Ryan's solution is a two-pronged approach: optimize code structure to reduce conflicts while shortening the time PRs remain pending.

He established "Recycle Day" (every Friday), where the team collectively reviews all issues discovered during the week that made PRs difficult to merge, eliminating them at the root to prevent recurrence. This forms a closed loop: human feedback on PRs → reflects the agent's context understanding failures → written into repository documentation → agents automatically reference documentation to self-correct. The essence of this loop is a flywheel effect of continuous improvement: every human correction not only solves the current problem but permanently enhances the agent's ability to handle similar issues in the future.

Development speed is so fast now

Role-Based Review Agents

Team members categorize review feedback according to their areas of expertise (frontend architect, reliability engineer, scalability expert, etc.). Dedicated review agents are deployed for each role, triggered on every code push. These agents identify all merge-blocking issues based on defined code standards documentation.

The elegance of this approach lies in the fact that every team member's best practices are codified. When an engineer with product thinking writes a good QA plan template, it means every agent's execution trace produces a high-quality QA plan — do it once, generate continuous leverage. This is actually a paradigm shift in knowledge management: in traditional organizations, expert knowledge is locked inside individual minds and spreads slowly through mentorship; in Ryan's system, expert knowledge is encoded as agent behavioral specifications, achieving instant, lossless, unlimited knowledge replication.

Code as a Compilation Artifact: A Mental Model

Ryan proposed a highly illuminating analogy: think of the large language model as a fuzzy compiler. All the context invested in the codebase (documentation, rules, constraints) serves as the compiler's optimization parameters, and code is merely the compilation artifact of these specifications.

Traditional compilers (such as GCC, LLVM, Cranelift) deterministically transform high-level languages into machine code — the same input always produces the same output. LLVM is the most widely used compiler infrastructure in the industry, underpinning code generation for multiple languages including Clang (C/C++), Rust, and Swift; Cranelift is an emerging code generation backend in the Rust ecosystem, known for its compilation speed and used in WebAssembly runtimes like Wasmtime. Ryan's analogy of LLMs as "fuzzy compilers" is remarkably apt: the LLM's input is natural language specifications and context, its output is code, but the process is probabilistic rather than deterministic.

Swapping out a model is like switching the code generation backend in the Rust compiler from LLVM to Cranelift — all the rules about what constitutes acceptable code should produce valid output, even if the generation process differs. The practical implication of this analogy is that — just as you wouldn't manually modify a compiler's intermediate artifacts — you shouldn't manually modify AI-generated code. Instead, you should modify the "source code" (i.e., specifications and constraints) and then "recompile." This means code is indeed a disposable build artifact, and the real value lies in the specifications and constraints that define "what good code looks like."

Engineering Vision for the Future

The future Ryan envisions is this: given a token budget and a quarterly/annual workload, use human input to prioritize (success metrics, reliability metrics), then hand it over to machines to run continuously, driving the product forward, with absolutely no need for humans to write code themselves.

He candidly acknowledges that beyond writing code, there's an entire universe of software engineering that needs attention: user feedback triage, alert handling, production log auditing, runbook writing, and more. These activities are traditionally categorized as "Operations" or "Site Reliability Engineering" (SRE) in software engineering and have long been viewed as less "prestigious" than coding. But after AI takes over coding, these activities — which require deep understanding of business context and user needs — actually become the most irreplaceable value that human engineers provide. Since engineers no longer need to write code themselves, their energy can shift toward these higher-level or softer activities.

Finally, Ryan left a thought-provoking statement: "If you need to interact with the agent to keep it working, that means the framework hasn't provided enough context to define what 'continue until done' means." This statement precisely captures the ultimate goal of agent engineering — completely liberating humans from the role of synchronous driver. The distinction between "synchronous" and "asynchronous" is key here: synchronous means humans must be present in real-time, guiding step by step; asynchronous means humans only need to define goals and constraints, and the agent executes autonomously to completion. What Ryan pursues is exactly this leap from synchronous mode to fully asynchronous mode — the hallmark transformation of AI agents evolving from "tools" to "autonomous agents."