Harness Engineering: A New Software Development Paradigm Where Humans Steer and AI Agents Execute

Introduction: A Token Billionaire's Practice

OpenAI engineer Ryan Lopopolo shared a radical practice at the AI Engineer conference: over the past nine months, he banned his team members from touching their editors — all code had to be written through AI agents. He calls himself a "Token Billionaire" — consuming over 1 billion output tokens per day, amounting to more than $1,000.

Tokens are the basic units that large language models use to process text. In English, each word corresponds to roughly 1–1.5 tokens; in Chinese, each character is about 1.5–2 tokens. Consuming over 1 billion output tokens per day means the model generates text equivalent to millions of pages of documents daily. At OpenAI's current API pricing, this level of consumption far exceeds the scale of a typical developer's usage, reflecting an orders-of-magnitude gap between industrial-grade AI-assisted programming and individual use.

This isn't a vision of the future — it's a field report from the front lines. Ryan introduced the concept of "Harness Engineering" — a new software engineering methodology whose core idea is: code is free, and the human role is to build systems, structures, and constraints, letting AI agents handle all implementation work.

bilibili source: Harness Engineering: How to Build Software When Humans Steer and Agents Execute — Ryan Lopopolo, OpenAI - AI Engineer

The Core Philosophy of Harness Engineering: Code Is Now Free

Implementation Is No Longer a Scarce Resource

Ryan argues that three critical things have happened at this stage that have fundamentally changed the landscape of software engineering. The most important is that large models have become isomorphic with humans in code production capability — able to perform the full range of a software engineer's work.

"Isomorphic with humans" means that large models are no longer merely good at code completion or snippet generation. They can understand requirements, design architectures, write complete modules, handle edge cases, and write tests — covering the entire chain of a software engineer's daily work. This leap in capability stems from training Transformer architectures on massive code corpora (billions of lines of open-source code on GitHub), combined with the maturation of post-training techniques like RLHF (Reinforcement Learning from Human Feedback) and code execution feedback.

What does this mean? Every engineer now has the productive capacity of 5, 50, or even 5,000 engineers, working year-round without breaks. Code is free to produce, free to refactor, free to delete. Those tasks that were perpetually stuck at P3 priority can now be kicked off immediately — or even launched in four parallel approaches, with the best one merged in.

The Shift in Scarce Resources

In this new world, the truly scarce resources have become three things:

Human time — which needs to be invested in higher-leverage activities
Human and model attention — which needs to be carefully managed
The model's context window — which needs to be efficiently utilized

The engineer's role shifts from "the person who writes code" to a "Staff Engineer" — responsible for systems thinking, system design, and task delegation, managing a team composed of AI agents. In Silicon Valley tech company leveling systems, Staff Engineer is a senior technical level above Senior Engineer, where responsibilities shift from individual code output to setting technical direction, cross-team coordination, and system-level decision-making. Staff Engineers at Google, Meta, and similar companies often no longer write code day-to-day, instead focusing on technical specification, architecture reviews, and technical debt management. Ryan's analogy of elevating all engineers to the Staff Engineer role essentially means that AI agents take on the work of the "hands," while humans need to level up to become the "brain."

What Good Harness Engineering Looks Like

The Key to Helping AI Agents Do Good Work

Ryan points out that producing a good patch may require 500 small decisions about non-functional requirements. Non-Functional Requirements (NFRs) are constraints in software engineering that are unrelated to functional logic but critical to system quality, including performance metrics, security standards, maintainability, code style, error handling strategies, logging conventions, and more. An experienced engineer implicitly makes hundreds of such decisions while writing code — such as choosing synchronous vs. asynchronous, whether a retry mechanism is needed, or what log level to set. These decisions typically live in an engineer's "muscle memory" and have never been explicitly documented. The core challenge of harness engineering is precisely to externalize this tacit knowledge into agent-readable instructions.

Models have seen trillions of lines of code during training, covering every possible choice. The engineer's job is to write down these non-functional requirements so the agent can see what constitutes "acceptably good work."

The core method is: everything is a prompt. Ryan listed multiple ways to inject instructions into agents:

agents.md files
Rules files
Skills (skill files)
Lint error messages
Comments injected by Review Agents on PRs
Agent SDKs embedded in code

agents.md is a conventional repository-level configuration file, similar to README.md, but written specifically for AI agents, containing project architecture descriptions, coding standards, prohibited practices, and more. Rules files and Skills files are structured instruction formats supported by AI coding tools like Codex. The essence of these mechanisms is transforming traditional "team coding standards documents" into machine-executable prompts, so AI agents automatically load project context at the start of each task rather than relying on humans to repeatedly explain things in conversation.

The Actual Workflow of AI Agents

Ryan's team uses Codex as an entry point rather than building an environment around it. They provide Codex with Skills, teaching it how to start the application, spin up the local observability stack, connect to Chrome DevTools, and more. The entire repository and local development tooling are designed with Codex as the primary user.

The team focuses on 5–10 core skills, prioritizing depth over breadth. One interesting example: when the underlying architecture migrated from Chrome DevTools Protocol to a daemon architecture, Ryan didn't discover the change until three weeks later — because Codex had figured it out on its own using the documentation.

Codebase Structuring Strategies

Optimizing Code Organization for AI Agents

As the project grew, Ryan's team adopted architecture at the level of a "10,000-person engineering organization": 750 packages in a PNPM workspace, isolated by business logic domain or system layer.

PNPM is a high-performance Node.js package manager whose Workspace feature allows managing hundreds of independent packages within a single monorepo, each with its own dependency declarations and build configuration. A scale of 750 packages in human teams typically only appears at organizations like Google or Microsoft with thousands of engineers. Ryan splitting a three-person team's codebase to this granularity isn't for human organizational management — it's for AI agent context management. Each package is small enough to fit entirely within a model's context window, so agents can complete local tasks without needing to understand the entire codebase.

The reasons for this approach:

Agents can scope their work to a directory subtree
Code in the filesystem itself serves as a prompt for the agent
The more uniform the code, the easier and more consistent the model's token predictions become

Specific practices include: only one concurrency helper, one ORM, one programming language, one way to write CI scripts. Ryan recommends being a "dictator" within your team — determine the single way to do things, write it down, then let agents migrate the entire codebase.

Source-Code-Level Test Constraints

Ryan shared a clever technique: writing tests about the source code itself. For example, enforcing that files don't exceed 350 lines — this is designed to fit the model's context window, optimizing context efficiency from an engineering perspective. A large language model's context window refers to the maximum number of tokens the model can process in a single inference pass; current mainstream models range from 128K to 200K tokens. The strategy of limiting files to no more than 350 lines (roughly 5,000–7,000 tokens) ensures that a single file, along with task instructions and relevant dependency information, can comfortably fit within the context window, avoiding information loss from context truncation. This represents a paradigm shift of "optimizing code structure for the reader (AI) rather than the writer (human)."

Similarly, error messages must provide actionable fix steps. Instead of simply reporting "there's an unknown here," you tell the model: "You shouldn't have an unknown here because we parse rather than validate at the boundary — you should have a type derived from Zod here."

The "Parse, Don't Validate" pattern mentioned here is a type-safe design approach popularized by the functional programming community, proposed by Alexis King in 2019. Its core idea is: at system boundaries, parse unstructured data (such as API requests or user input) into strongly-typed data structures, then only pass validated types within the system's interior, eliminating illegal states at compile time rather than runtime. Zod is the most popular runtime type validation library in the TypeScript ecosystem, capable of automatically deriving TypeScript types from schema definitions, unifying runtime validation with compile-time type checking. Embedding this design philosophy into error messages means injecting architecture-level design guidance into the agent at the exact moment it makes a mistake.

Revolutionizing AI Code Review

From Human Bottleneck to Automated Guardianship

The biggest challenge that comes with high velocity is code review. With three team members each producing 3–5 PRs per day, merge conflicts become extremely painful. The solution comes in two steps:

Tree-structure the codebase to reduce conflict probability
Shorten PR open time, which requires automated review

The team established "Garbage Collection Day" (every Friday), dedicated to systematically eliminating code issues observed throughout the week. They categorized review feedback by role (frontend architect, reliability engineer, etc.) and created a Review Agent for each role, automatically triggered on every push.

A Continuous Improvement Feedback Loop

This creates a closed loop: humans provide feedback on PRs → identify it as a context failure for the agent → write it into repository documentation → automatically prompt-inject into the agent → the agent self-corrects. As documentation continuously accumulates, code quality steadily improves, and the frequency of human intervention decreases over time.

This feedback loop mechanism essentially achieves continuous codification of organizational knowledge. In traditional software teams, senior engineers' experience often exists in the form of oral mentoring or code review comments, and new members need months to internalize this tacit knowledge. In harness engineering, every piece of human feedback is transformed into machine-readable permanent documentation — the team's "institutional memory" grows at an exponential rate and never forgets.

Future Vision: Code as a Disposable Build Artifact

LLM as a Fuzzy Compiler

Ryan stated explicitly: code is a disposable build artifact. All the constraints and optimizations in harness engineering are essentially analogous to static analysis and optimization passes in the LLVM compiler. Swapping models is like swapping code generation backends — as long as the constraint structure is correct, different models can produce acceptable code.

LLVM is the de facto standard for modern compiler infrastructure, with a core design that separates the compilation process into three stages: frontend (language parsing), intermediate representation (IR) optimization, and backend (target code generation). Different programming languages share the same set of optimization passes and code generation backends. Ryan's analogy of harness engineering to LLVM implies that the constraints, specifications, and architecture documents written by humans serve as the "intermediate representation," while different large models (GPT-4o, Claude, Gemini) serve as different "code generation backends." This analogy hints at a profound trend: the core artifact of software engineering will shift from code to constraint specifications, with code itself becoming a renewable compilation artifact.

The Ultimate Goal

The future Ryan describes is: given a token budget and six months to a year of work, combined with human input on priorities and success metrics, let machines continuously advance the product while humans never need to manually intervene.

He emphasizes that software engineering is far more than writing code — it also includes user feedback triage, alert handling, PII leak checking, runbook writing, and more. When code production no longer requires human involvement, human attention can shift to these higher-level or more "fuzzy" activities — and agents are equally capable of handling this work.

Practical Advice for Practitioners

For engineers just starting to use coding agents, Ryan offers two paths:

Start by having agents write tests — this builds confidence in existing code while enhancing the agent's ability to navigate the codebase for subsequent tasks. Test code is the easiest entry point for AI agents because tests have clear inputs and outputs, quality is easy to verify, and the tests themselves become a "map" for the agent to understand codebase behavior, laying the foundation for more complex tasks.
Examine where your time goes — waiting for tests? Waiting for reviews? CI too slow? Use agents to gradually automate these time-consuming steps.

One final golden rule worth remembering: "Every time you need to type 'continue' for the agent, it's a failure of the harness to provide sufficient context." This is the ultimate standard of harness engineering — make the system self-driving, and let humans truly become the ones steering rather than rowing.