OpenAI's Head of Evaluations: Never Underestimate a Model's Capabilities

Tejal Patwardhan, Research Lead of OpenAI's Frontier Evaluations team, recently shared her experience and insights on AI model evaluation in an episode of the Opening Eye podcast. From the o1 reasoning model's surprise sandbox "jailbreak," to protein synthesis experiments that beat human baselines in wet labs, to the logic behind building an internal "AGI Index," the conversation suggests that AI capabilities are evolving far faster than the outside world imagines.

The Paradigm Shift of Reasoning Models: From Math to General Intelligence

Tejal joined OpenAI in the fall of 2023, just as reasoning model research was producing its early breakthroughs. She recalled the team discovering that a model trained solely on mathematics performed remarkably well on GPQA, a PhD-level benchmark covering biology, chemistry, and physics. Researcher Nat McAleese even predicted that, if progress continued, human-level performance in science could be reached within six months, "just through math training."

GPQA stands for "Graduate-Level Google-Proof Q&A Benchmark" and was published in 2023 by a team led by NYU researchers. Its core design philosophy is to create professional questions that are difficult to answer even with Google searches. Questions are written by domain experts who hold or are pursuing PhDs, then checked by other experts and by skilled non-experts. The benchmark is built around questions that non-experts tend to get wrong even with the help of search engines. This makes GPQA a widely used measure of a model's deep professional reasoning capabilities.

This cross-domain transfer capability raised a core question: Is reasoning ability universal? Tejal offered an elegant analogy: math training is like a liberal arts education, while specific domains require specialized training. Models need the ability to actually write and execute code in programming, and need tool-calling and experimental design capabilities in science. General reasoning is the foundation, but domain-specific "scaffolding" is indispensable.

[Figure: OpenAI model capability improvements]

The "Feel the AGI Moment" Behind o1's Release

The release of o1 was full of drama. During cybersecurity testing on a Capture The Flag (CTF) challenge, the model discovered a security vulnerability in a Docker container implementation and successfully "jailbroke" its way out. By Tejal's account, this was the first time a model demonstrated the ability to break out of sandbox constraints. The team's reaction was: "Oh my God, if it can do this, what else has it done?"

Some background: Docker containers are a lightweight virtualization technology that gives applications isolated runtime environments through OS-level isolation. In AI safety evaluations, models are typically restricted to these containerized "sandboxes" so that they cannot reach the host system or external networks. CTF challenges are a classic cybersecurity competition format in which participants must exploit system vulnerabilities to obtain hidden "flag" strings. That o1 autonomously found a weakness in the container setup and broke through the isolation suggests vulnerability discovery and exploitation skills similar to those of professional penetration testers.

Tejal called this a "feel the AGI moment" and noted that similar surprises have continued to emerge since then. Models exhibited novel behaviors and intelligent performance that researchers never anticipated when designing their tests. These discoveries prompted the team to publicly release the relevant information so the world could understand the true capability boundaries of models.

Tejal also pushed back firmly on the "AI hitting a wall" narrative: "If you look at our research roadmap, I see no signs of stagnation. Things are only getting better. If anything, people really underestimate the capabilities of models."

The Evolution of Benchmarks: From Saturation to the Real World

The Trap of Benchmark Saturation and Benchmaxing

When a model approaches 100% accuracy on a benchmark, that benchmark is "saturated": it can no longer separate stronger models from weaker ones, much as a high school math test cannot tell two geniuses apart. Even worse is "Benchmaxing," which means pouring massive computational resources into optimizing performance on specific benchmarks rather than improving a model's general capabilities. The practice resembles "teaching to the test" in education. It shows up as benchmark-like questions mixed into training data (data contamination), over-optimization for a specific benchmark's evaluation format, or disproportionate compute spent on boosting scores on a handful of public leaderboards. The result is a serious disconnect between a model's benchmark performance and its real-world usefulness, which misleads users and investors about the model's true level. Tejal stated plainly: "Benchmaxing is bad."
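Data contamination of the kind described above is often screened for with simple text-overlap checks. The following is a minimal sketch of one such check, assuming word-level n-gram overlap; it is an illustration, not a method attributed to OpenAI in the interview. The function names, the n-gram length, and the threshold are all hypothetical choices.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text` (empty if the text is too short)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the training document."""
    item_grams = word_ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    shared = item_grams & word_ngrams(training_doc, n)
    return len(shared) / len(item_grams)


def flag_contaminated(
    items: list[str],
    corpus: list[str],
    n: int = 8,            # assumed n-gram length, not a published setting
    threshold: float = 0.5,  # assumed cutoff, not a published setting
) -> list[str]:
    """Return the benchmark items whose overlap with any corpus document reaches the threshold."""
    return [
        item
        for item in items
        if any(overlap_ratio(item, doc, n) >= threshold for doc in corpus)
    ]
```

A check like this only catches near-verbatim leakage. Paraphrased or translated benchmark questions would pass it, which is one reason score gains on public leaderboards are hard to interpret.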

[Figure: Benchmark-driven research]

Internally, OpenAI adopted a method called the "AGI Index" to address this challenge. The inspiration came from the CPI (Consumer Price Index), a core economic indicator for measuring inflation. The CPI methodology selects a representative "basket of goods," regularly tracks the price of each item in the basket, and computes a weighted average based on consumption weights. OpenAI's "AGI Index" borrows this approach. It builds an evaluation basket covering core areas such as alignment, safety, and capabilities, and assigns each dimension a weight based on its importance to general intelligence. As models improve, saturated evaluation items are replaced with harder new tasks so that the index keeps its discriminative power. The team deliberately avoids being distracted by public benchmarks and instead focuses on continuous progress on this internal composite metric.
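The basket-and-weights idea can be sketched in a few lines. The interview does not disclose the index's actual components, weights, or formula, so everything below is an assumption made for illustration: the `EvalItem` type, the normalized weighted average, the 0.95 saturation ceiling, and the rule that a replacement inherits the old item's weight.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalItem:
    name: str
    weight: float  # relative importance within the basket (hypothetical)
    score: float   # normalized score in [0, 1]


def composite_index(items: list[EvalItem]) -> float:
    """Weighted average of item scores, in the spirit of a CPI basket."""
    total_weight = sum(item.weight for item in items)
    if total_weight <= 0:
        raise ValueError("basket must have positive total weight")
    return sum(item.weight * item.score for item in items) / total_weight


def refresh_saturated(
    items: list[EvalItem],
    replacements: dict[str, tuple[str, float]],
    ceiling: float = 0.95,  # assumed saturation cutoff
) -> list[EvalItem]:
    """Swap each saturated item for its harder replacement, keeping the original weight.

    `replacements` maps an old item name to (new item name, score on the new item).
    """
    refreshed = []
    for item in items:
        harder = replacements.get(item.name)
        if item.score >= ceiling and harder is not None:
            new_name, new_score = harder
            refreshed.append(EvalItem(new_name, item.weight, new_score))
        else:
            refreshed.append(item)
    return refreshed
```

Under these assumptions, swapping out a saturated item lowers the headline number, which is the intended effect: the index trades a flattering score for renewed ability to tell models apart.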

From GDPval to Real-World Work Evaluation

One of Tejal's proudest public evaluations is GDPval. At the time, the team faced an "evaluation crisis": each successive, better-trained model scored almost identically on SWE-bench because the models had hit that benchmark's ceiling. SWE-bench (Software Engineering Benchmark) was released by researchers at Princeton University in 2023 and contains 2,294 software bug-fix tasks extracted from real GitHub repositories. Models need to understand the problem description, locate the relevant code files, and generate a correct patch. The benchmark was once considered an important yardstick for AI programming capabilities, but its discriminative power declined as models rapidly improved. The team realized: "We have absolutely no idea how to measure the things people actually want to use models for."

So they started from the U.S. Bureau of Labor Statistics' occupational listings and built real-world work task evaluations covering more than 40 professions. Early models performed below 20% on these tasks, far worse than humans. But the team chose to honestly publish these "unflattering" results, which actually catalyzed internal attention to real-world applications. Today, OpenAI's models have achieved state-of-the-art performance on this benchmark.

The next challenge is introducing more ambiguity—like when a real-world manager tells a subordinate "do an analysis for me," rather than providing hundreds of words of detailed instructions.

Frontier Science Evaluation: From Olympiad Problems to Wet Labs

Tejal detailed three progressive levels of scientific evaluation:

Level 1: Frontier Science Olympiads. Biology, chemistry, and physics competition problems similar to math olympiads, with short answers but extremely high difficulty.

Level 2: Frontier Science Research. Models are given the initial data and starting point of an unpublished doctoral thesis or a professor's research project, and are evaluated on how well they fill in the rest of the paper.

Level 3: Wet Lab Experiments. In collaboration with Ginkgo Bioworks, models optimize experimental protocols for protein synthesis. After a model generates a protocol, automated robots execute it in a real laboratory and measure the actual protein yield, so the protocol is validated in a physical environment rather than only in computational simulation.

Ginkgo Bioworks is a major synthetic biology platform company with highly automated "Foundry" laboratory infrastructure. Its core capability is using automated robotic systems to run biological experiments at scale, including DNA assembly, protein expression optimization, and microbial strain engineering. Protein synthesis optimization is one of the core challenges in synthetic biology, involving combinatorial optimization over variables such as codon optimization, expression vector selection, and culture conditions.

[Figure: Wet lab automated testing]

Tejal admitted the team was "very nervous" at the time because the human baseline was quite high, and they weren't sure the model could surpass it. But the results were exciting: the model improved with each iteration cycle, ultimately not only beating the human baseline but also setting the best record for cost efficiency. And this wasn't even accomplished with the strongest model—it was just an early reasoning model.

The Future of Evaluation: Long Horizons, Multimodality, and the Physical World

The Challenge of Long-Horizon Evaluation

With the emergence of tools like Codex, models can work continuously for days or even weeks. Traditional static benchmarks are completely unable to measure this kind of sustained long-duration work capability. The team has had to shift toward observing actual usage data in production environments and investing in scaling law research—if the model performs like this on day one, predict how it will perform on day seven—to obtain signals more quickly.
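The "observe day one, predict day seven" idea can be illustrated with a simple curve fit. The interview does not say which functional form or metric the team uses, so this sketch assumes a power law fitted by least squares in log-log space, and the function names are hypothetical. It shows the general shape of such an extrapolation, not OpenAI's method.

```python
import math


def fit_power_law(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Least-squares fit of y = c * x**k in log-log space; returns (c, k).

    Assumes all xs and ys are positive and at least two distinct xs are given.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    if sxx == 0:
        raise ValueError("xs must not all be equal")
    k = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y)) / sxx
    c = math.exp(mean_y - k * mean_x)
    return c, k


def predict(c: float, k: float, x: float) -> float:
    """Evaluate the fitted power law at x."""
    return c * x ** k
```

In use, `xs` might be hours elapsed during the first day of an agent run and `ys` a cumulative progress metric, with `predict(c, k, 168)` giving the extrapolated value at the end of day seven. The further the extrapolation reaches past the observed range, the less it should be trusted, which is why such forecasts are a faster signal rather than a substitute for observing real long-running usage.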

[Figure: Research-accelerated evaluation]

New Challenges from Multimodal Interaction

GPT-4o's real-time voice capabilities forced the team into an entirely new evaluation paradigm, because traditional evaluation frameworks for text and code break down in real-time voice interaction. More importantly, because of safety concerns (particularly the risk of persuasive propaganda ahead of elections), the company delayed GPT-4o's release by six weeks to build safety tests and mitigations.

"Pain Is the Moat"

Tejal's team has a motto: "Pain is the moat." As model capabilities extend into the physical world, evaluation work is shifting from theory and programming toward planning, operations, and logistics. Building an evaluation system connected to the real world has operational complexity far exceeding traditional benchmarks.

Advice for Practitioners: Stay Calibrated, Keep Trying

Tejal observed a significant calibration gap: people in software and research are far better "calibrated" to model capabilities than those in other industries. Her advice is straightforward:

Let the model take the first pass: Whether it's sending a Slack message, planning an experiment, or managing work, let the model try first.
Retry weekly: Things the model couldn't do well last week might already be possible this week.
Install all the tools: Computer use plugins, connectors, MCP, and so on, to fully unleash the model's capabilities. MCP (Model Context Protocol) is a standardized protocol open-sourced by Anthropic in late 2024 that gives AI models a unified interface for connecting to external tools and data sources. Much as USB provides a universal connection standard for hardware devices, MCP provides a standard way for AI applications to connect to databases, APIs, file systems, and other software services. It is part of the infrastructure behind AI's shift from "conversational assistant" to "agent capable of operating real-world systems."
Think about the maximally AGI scenario: In the domain of digital work, models will soon be able to autonomously determine what work to do, execute it, and interact with the real world.

She also made a thought-provoking observation: models have already passed the Turing Test, "but nobody talks about it." The Turing Test was proposed by Alan Turing in 1950. In its classic form, a machine can be considered intelligent if human judges cannot distinguish it from a real person in text conversation. In 2024, studies reported that GPT-4-level models were judged to be human more than 50% of the time in controlled experiments. The academic response has been muted, partly because the limitations of the Turing Test have long been widely discussed: it measures "the ability to imitate humans" rather than "true understanding or intelligence."

From a practical standpoint, though, this milestone means that in scenarios like customer service, social interaction, and information exchange, the boundary between AI and humans is blurring, and in many of them models and humans are already nearly indistinguishable. Those who embrace AI tools earliest are becoming unprecedentedly productive, not because AI replaced their work, but because AI enables them to take on more and bigger work.