OpenAI Researcher: Spec Documents Are the Real Code

Core Thesis: Code Only Accounts for 10%-20% of Value

OpenAI alignment researcher Sean Grove gave a talk at the AI Engineer conference that challenges a common assumption. His claim, which many programmers may find uncomfortable, is that code accounts for only about 10% to 20% of the value you contribute; the other 80% to 90% comes from structured communication.

Source (bilibili): [Chinese dub] "The New Code", Sean Grove, OpenAI, AI Engineer, 25-07-11

"Structured communication" here includes: talking with users to understand challenges, synthesizing requirements, thinking through solutions, making plans, sharing results, and testing and validation. Sean argues that the real bottleneck has never been writing code itself, but rather knowing what to build, why to build it, and how to confirm whether you've achieved the goal.

This view echoes a long-standing idea in software engineering. In his 1986 essay No Silver Bullet, Fred Brooks argued that the hard part of software development is not the accidental difficulty of expressing a design in code, but the essential difficulty of understanding and specifying the problem itself. AI programming tools sharply reduce the accidental part, which leaves the essential part, accurately expressing intent and requirements, as the main remaining bottleneck.

Lessons from Vibe Coding: We're Throwing Away the Most Valuable Thing

Prompts Are the "Source Code"

Sean used the currently popular Vibe Coding as an example to highlight an interesting paradox: when we program by sending instructions to a model, we tell it our intent and values, ultimately get code as output—and then we throw those prompts away.

Vibe Coding is a concept proposed in early 2025 by former Tesla AI Director Andrej Karpathy, referring to a programming approach where developers describe requirements in natural language and let AI models generate complete code directly. Developers no longer write code line by line but instead "guide" AI through conversational interaction to complete development tasks. This approach dramatically lowers the barrier to programming and has spawned numerous applications built by non-professional programmers, but it has also triggered deeper discussions about code quality, maintainability, and knowledge preservation. Sean's observation addresses precisely this overlooked issue in the practice.

He made an elegant analogy: if you've written TypeScript or Rust and compiled it into a binary, nobody would be satisfied with just the binary—the source code is what's valuable. But in AI programming, we're doing the exact opposite: keeping the generated code (the binary) and deleting the prompts (the source code).

This is like "shredding the source code and then very carefully version-controlling the generated binary": the priorities are exactly backwards.

The Compilation Power of Spec Documents

A sufficiently detailed spec document, like source code, can be "compiled" for multiple "architectures":

Generate TypeScript code
Generate Rust programs
Generate server-side scripts or client-side code
Generate documentation, tutorials
Even generate podcast content

This "write once, compile everywhere" philosophy essentially elevates the level of abstraction in software engineering by one tier. In traditional programming, high-level languages are abstractions over machine instructions; now, spec documents become abstractions over high-level languages. Just as C freed programmers from worrying about specific CPU instruction sets, spec documents free developers from worrying about specific programming language choices—AI models serve as the new generation of "compilers."
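As a rough illustration of "one spec, many targets", the sketch below pairs a single spec with several output targets and builds one prompt per target. The target names, the instruction strings, and the `compile_spec` helper are assumptions made for this example, not anything shown in the talk, and no model is actually called.

```python
from dataclasses import dataclass

# Hypothetical instructions per target; the talk names the targets, not these strings.
TARGETS = {
    "typescript": "Implement this specification as a TypeScript module.",
    "rust": "Implement this specification as a Rust crate.",
    "docs": "Write user-facing documentation for this specification.",
    "podcast": "Write a podcast script explaining this specification to users.",
}


@dataclass(frozen=True)
class CompileJob:
    target: str
    prompt: str


def compile_spec(spec_text: str, targets: list[str]) -> list[CompileJob]:
    """Pair one spec with each requested target, producing one prompt per target."""
    jobs = []
    for target in targets:
        if target not in TARGETS:
            raise ValueError(f"unknown target: {target}")
        # The spec is the single source; only the instruction line changes.
        prompt = f"{TARGETS[target]}\n\n--- SPEC ---\n{spec_text}"
        jobs.append(CompileJob(target=target, prompt=prompt))
    return jobs


spec = "The service must greet the user by name and never store the name."
jobs = compile_spec(spec, ["typescript", "rust", "podcast"])
for job in jobs:
    print(job.target, "->", len(job.prompt), "chars")
```

Each prompt would then be sent to whatever model acts as the "compiler"; the point of the sketch is only that the spec text is shared and the target varies.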

Sean posed a pointed question to the developer tool companies in the audience: if you fed your entire codebase into a podcast generator, could it produce content interesting enough to tell users how to succeed? The answer is most likely no—because the truly valuable information isn't in the code.

OpenAI Model Spec: A Practical Example of Spec Documents

Structure and Form

OpenAI first published its Model Spec in 2024. It is a living document intended to state clearly the philosophy and values OpenAI wants its models to embody, and the updated version has been open-sourced on GitHub.

Surprisingly, its implementation is remarkably simple—just a set of Markdown files. Markdown is a lightweight markup language whose advantage in spec documents lies not only in its clean, readable format but also in its natural compatibility with version control systems like Git—every modification has a complete change history (diff), can undergo code review, and supports branching and merging. This allows spec documents to undergo collaborative development and quality management just like software code. More importantly, because they're written in natural language, not only technical staff but also product managers, legal experts, security specialists, researchers, and policymakers can all participate in maintaining and contributing.
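To make the version-control point concrete, here is a small sketch that produces a reviewable diff between two versions of a spec file using Python's standard `difflib`. The two spec snippets and the file name `spec.md` are invented for illustration and are not quotations from the Model Spec.

```python
import difflib

# Two hypothetical revisions of the same Markdown spec file.
old_spec = """# Honesty
The assistant should be helpful.
The assistant should avoid flattery.
"""

new_spec = """# Honesty
The assistant should be helpful.
The assistant must not be sycophantic, even when flattery would please the user.
"""


def spec_diff(old: str, new: str, name: str = "spec.md") -> list[str]:
    """Return a unified diff of two spec versions, one string per diff line."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile=f"a/{name}",
            tofile=f"b/{name}",
            lineterm="",
        )
    )


diff = spec_diff(old_spec, new_spec)
print("\n".join(diff))
```

This is the same kind of output a reviewer would see in a Git pull request, which is what lets non-engineers comment on a single changed sentence.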

Lessons from the Sycophancy Incident

Sean used GPT-4o's sycophancy problem as a case study to demonstrate the practical value of spec documents. When the model exhibited extreme sycophantic behavior, people questioned: was this intentional? Why didn't anyone catch it?

Sycophancy is a known failure mode of large language models: the model agrees excessively with the user, fails to correct users even when they are clearly wrong, and may even praise absurd ideas enthusiastically. In April 2025, a GPT-4o update made this problem much worse. Users found that the model affirmed almost any viewpoint and lost its ability to correct errors and give objective advice. The incident drew widespread criticism on social media and became an important case study in the AI safety community's discussions of model behavior specifications and quality control processes.

In fact, the Model Spec had long explicitly stated "don't be sycophantic," explaining that while flattery feels good in the short term, it is harmful to everyone in the long run. So when the model's behavior contradicted the spec, it could be identified clearly as a bug rather than as a disagreement over design.

During the fix, the spec document served as a trust anchor—helping people clearly understand which behaviors were expected and which were not. This is similar to "Design by Contract" in software development: the spec document defines preconditions and postconditions for model behavior, and any violation of the contract can be clearly identified as a defect.
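The Design by Contract comparison can be sketched as a clause with a precondition (when does the clause apply?) and a postcondition (does the response satisfy it?). Everything here is an assumption for illustration: the `Clause` type, the example prompt, and especially the keyword heuristic, which is a crude stand-in for the model-based grading a real system would need.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Clause:
    clause_id: str
    text: str
    applies: Callable[[str], bool]     # precondition on the prompt
    holds: Callable[[str, str], bool]  # postcondition on (prompt, response)


def words(text: str) -> set:
    """Lowercased word set, so checks match whole words rather than substrings."""
    return set(re.findall(r"[a-z']+", text.lower()))


# Toy heuristic: a non-sycophantic answer to a request for an opinion
# should contain at least one caveat word. A real grader would be a model.
CAVEAT_WORDS = {"but", "however", "risk", "concern"}

no_sycophancy = Clause(
    clause_id="no-sycophancy",
    text="Don't be sycophantic.",
    applies=lambda prompt: "what do you think" in prompt.lower(),
    holds=lambda prompt, response: bool(words(response) & CAVEAT_WORDS),
)


def violations(clauses, prompt, response):
    """Return ids of clauses that apply to the prompt but are not satisfied."""
    return [
        c.clause_id
        for c in clauses
        if c.applies(prompt) and not c.holds(prompt, response)
    ]


prompt = "I plan to put my savings into lottery tickets. What do you think?"
flattering = "What a brilliant idea! You are going to be so successful."
balanced = "I see the appeal, but the expected return is negative, so it is a real risk."

print(violations([no_sycophancy], prompt, flattering))
print(violations([no_sycophancy], prompt, balanced))
```

A violation reported here plays the role of a failed contract: the behavior is a defect relative to the written clause, not a matter of taste.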

Deliberative Alignment: Making Specs Executable

Unifying Training and Evaluation

OpenAI has also published a paper, Deliberative Alignment, on how a spec can be used to align models automatically.

Model alignment is a core topic in AI safety, referring to ensuring that AI system behavior remains consistent with human intent and values. Traditional alignment methods include RLHF (Reinforcement Learning from Human Feedback, which trains reward models through human annotators' preference rankings of model outputs) and Constitutional AI (a method proposed by Anthropic that guides model self-correction through a set of principles). The core innovation of Deliberative Alignment is that the model actively references and "thinks about" provisions in the spec document during its reasoning process, rather than relying solely on implicit learning during training—the model doesn't just "remember" rules but "consults" and "reasons" about the applicability of rules with each response.

The specific process is:

Take the requirements specification and challenging input prompts
Get sample outputs from the model
Feed the responses, original prompts, and relevant policies together into a scoring model
Score the responses according to the specification
Adjust model weights based on scores

This means spec documents can be used both as training material and for evaluation. With this technique, reasoning that would otherwise have to happen at inference time is pushed into the model's weights during training, so the model genuinely "internalizes" the intent of the policy. This is similar to a person going from "needing to consult a manual to operate" to "forming muscle memory": the spec changes from an external reference into the model's own behavioral tendency.
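The five steps listed above can be mimicked with a deliberately tiny toy loop. This is not the method from the paper: the "model" is just two weighted response styles, the grader is a hard-coded stand-in for a scoring model, and the multiplicative update is an assumption chosen to keep the example short.

```python
import random

SPEC = "Do not be sycophantic: correct clear factual errors politely."
PROMPTS = ["The Earth is flat, right?", "2 + 2 = 5, agree?"]  # challenging inputs
STYLES = ["agree", "correct"]  # the only two behaviors this toy policy has


def sample(weights: dict, rng: random.Random) -> str:
    """Step 2: sample a response style in proportion to the current weights."""
    return rng.choices(STYLES, weights=[weights[s] for s in STYLES])[0]


def grade(spec: str, prompt: str, response: str) -> float:
    """Steps 3-4: stand-in grader. A real one would be a model that reads
    the spec, the prompt, and the response; this one hard-codes the verdict."""
    return 1.0 if response == "correct" else 0.0


def update(weights: dict, response: str, score: float, lr: float = 0.5) -> None:
    """Step 5: toy 'weight update' that reinforces or dampens the sampled style."""
    weights[response] *= (1 + lr) if score > 0.5 else (1 - lr)


def train(steps: int = 50, seed: int = 0) -> dict:
    rng = random.Random(seed)
    weights = {"agree": 1.0, "correct": 1.0}
    for _ in range(steps):
        prompt = rng.choice(PROMPTS)           # step 1: spec + challenging prompt
        response = sample(weights, rng)        # step 2
        score = grade(SPEC, prompt, response)  # steps 3-4
        update(weights, response, score)       # step 5
    return weights


weights = train()
print(weights)
```

After training, the weight on "correct" exceeds the weight on "agree", which is the toy analogue of the spec's preference ending up in the model rather than in the prompt.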

The Toolchain of Spec-as-Code

Sean pointed out that spec documents share many properties with code:

Executable: Can be understood and followed by models
Testable: Each provision corresponds to unit tests (challenging prompts)
Has interfaces: Can interact with the real world
Modular: Can be independently shipped and composed
Requires consistency checking: Similar to type checkers, ensuring no contradictions between modules
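Two of these properties, testability and consistency checking, can be sketched with a small data model. The `Provision` and `SpecModule` types, the behavior tags, and the example modules are all hypothetical; a real consistency checker for natural-language specs would need a model to detect contradictions, whereas this sketch assumes the behaviors have already been tagged by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provision:
    provision_id: str
    requires: frozenset = frozenset()  # behavior tags the provision demands
    forbids: frozenset = frozenset()   # behavior tags the provision rules out
    test_prompts: tuple = ()           # challenging prompts acting as unit tests


@dataclass(frozen=True)
class SpecModule:
    name: str
    provisions: tuple


def find_conflicts(modules):
    """Return (behavior, requiring provisions, forbidding provisions) per clash."""
    required, forbidden = {}, {}
    for module in modules:
        for p in module.provisions:
            ref = f"{module.name}.{p.provision_id}"
            for behavior in p.requires:
                required.setdefault(behavior, []).append(ref)
            for behavior in p.forbids:
                forbidden.setdefault(behavior, []).append(ref)
    return [(b, required[b], forbidden[b]) for b in sorted(required) if b in forbidden]


def untested(modules):
    """List provisions that have no test prompt attached."""
    return [
        f"{m.name}.{p.provision_id}"
        for m in modules
        for p in m.provisions
        if not p.test_prompts
    ]


honesty = SpecModule("honesty", (
    Provision(
        "no-sycophancy",
        requires=frozenset({"correct_errors"}),
        forbids=frozenset({"praise_every_idea"}),
        test_prompts=("My plan cannot fail, right?",),
    ),
))
tone = SpecModule("tone", (
    Provision("be-encouraging", requires=frozenset({"praise_every_idea"})),
))

conflicts = find_conflicts([honesty, tone])
missing_tests = untested([honesty, tone])
print(conflicts)
print(missing_tests)
```

Here the checker reports that one module requires a behavior another module forbids, and that one provision ships without any test prompt, which is roughly what a type checker and a coverage report do for code.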

This framework implies the possibility of an entirely new tool ecosystem: just as traditional software development has compilers, debuggers, testing frameworks, and CI/CD pipelines, spec documents also need a corresponding toolchain—spec editors (with ambiguity detection), spec testers (auto-generating edge cases), spec version management (with impact analysis), and spec deployment systems (applying updated specs to running models).

Legislators as Programmers: A Universal Principle

Sean made a bold analogy: the U.S. Constitution is essentially a national-level spec document. It has versioned upgrades (amendments), judicial review mechanisms (evaluating policy compliance), a precedent system (equivalent to unit tests that eliminate ambiguity), and continuous enforcement that creates a "training process."

This analogy is not merely rhetorical. The "legal formalism" school in legal theory has long argued that law should be as precise and predictable as code, while the "legal realism" school emphasizes that legal interpretation depends on context and judgment. This corresponds closely to the tension between "rule following" and "intent understanding" in AI alignment. Sean's framework implies that good spec documents need to balance precision and flexibility: specific enough to avoid ambiguity, yet abstract enough to accommodate unforeseen situations.

He summarized a universal principle:

Programmers use code specifications to make hardware components work together
Product managers use product specs to coordinate teams
Legislators use legal provisions to regulate human behavior
AI engineers use specs to make AI models follow the same intent and values

Action Items and Future Outlook

Sean offered concrete practical advice:

Whenever you start developing a new AI feature, first create a detailed specification
Clearly define what success criteria look like
Debate whether those criteria have actually been written down clearly
Feed the specification into the model and test against it

He also raised a thought-provoking question: what will the IDE of the future look like? He envisions it as an "intelligent thought-organizing tool" that automatically identifies ambiguities while writing technical specifications, helping people express intent more effectively. Such a tool might combine formal verification, natural language processing, and interactive dialogue capabilities—when you write "the system should respond quickly," it would ask "does quickly mean 100 milliseconds or 1 second? Under what load conditions?"—forcing vague intent into precise specifications.
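A very small version of the ambiguity check described above might look like the following sketch. The list of vague terms and the follow-up questions are assumptions chosen for illustration; a real tool would presumably use a language model rather than a fixed word list.

```python
import re

# Hypothetical vague terms mapped to the clarifying question a tool might ask.
VAGUE_TERMS = {
    "quickly": "How fast, in milliseconds, and under what load?",
    "fast": "How fast, in milliseconds, and under what load?",
    "user-friendly": "Which user task should succeed, and how is that measured?",
    "secure": "Against which threats, and to what standard?",
    "scalable": "Up to how many users or requests?",
}


def lint_spec(text: str) -> list:
    """Return (line number, vague term, clarifying question) for each finding."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for term, question in VAGUE_TERMS.items():
            # Whole-word, case-insensitive match so "fast" does not hit "breakfast".
            if re.search(rf"\b{re.escape(term)}\b", line, flags=re.IGNORECASE):
                findings.append((line_no, term, question))
    return findings


draft = """The system should respond quickly.
Responses must arrive within 200 ms at 100 requests per second.
The UI should be user-friendly."""

findings = lint_spec(draft)
for line_no, term, question in findings:
    print(f"line {line_no}: '{term}' is vague. {question}")
```

The second line of the draft passes because it already states a number and a load condition, which is the kind of rewrite the questions are meant to prompt.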

Finally, Sean closed with a quote about the challenge of deploying AI agents at scale: "You'll realize you've never clearly told yourself what you actually want." This is a call for concrete specifications: in the AI era, the scarcest skill will be writing spec documents that fully capture design intent and core values.

This conclusion has profound implications for the entire software industry: when code generation becomes nearly free, competitive advantage will shift to "knowing what to build" and "being able to precisely express intent." The most valuable engineers of the future may not be those who write code fastest, but those who can transform vague business requirements into precise, testable, executable specifications—they are the new era's "compiler frontends," translating human intent into forms that machines can understand and execute.