VendingBench: A Practical Methodology for AI Evaluation from Haiku to Mythos

Introduction: Why AI Evaluation Has Become a Core Industry Challenge

As large models iterate at breakneck speed, assessing their true capabilities scientifically and fairly has become a shared concern for developers and enterprise users alike. From early academic benchmarks such as MMLU and HumanEval to today's comprehensive evaluation systems built around real-world scenarios, evaluation methodology is itself undergoing a profound transformation. Lukas Petersson and Axel Backlund of Andon Labs recently sat down for a podcast interview to share their experience as the creators of VendingBench, covering both their evaluation of every generation of Anthropic's Claude models (Haiku through Mythos) and how they built a long-lasting frontier evaluation benchmark from scratch.

What Is VendingBench? A Real-World AI Evaluation Benchmark

VendingBench is an AI model evaluation benchmark developed by the Andon Labs team. AI evaluation benchmarks are standardized test suites that systematically measure model capabilities, with a history stretching back to classic datasets like ImageNet and GLUE. As large language models have risen to prominence, benchmarks have evolved from single-task assessments into multi-dimensional evaluations, yet most still saturate quickly and lose their discriminative power. Unlike many existing academic benchmarks, VendingBench measures a model's overall performance in real-world scenarios rather than a single dimension of capability. That focus is how it tries to move past the limitations of traditional evaluation.

Core Philosophy: Reality as the Final Eval

Traditional AI evaluations tend to focus on specific tasks: mathematical reasoning, code generation, or knowledge-based Q&A. VendingBench's design philosophy, summed up in the phrase "Reality: The Final Eval," treats real-world performance as the ultimate test. Evaluations should mirror real usage as closely as possible and measure how a model actually performs on complex, multi-step tasks, not just how it scores on standardized tests. Behind this philosophy lies a critique of the current evaluation ecosystem: when a model's MMLU score rises from 70% to 90%, the improvement users experience in practice is often far smaller than the score gap suggests, which points to a systematic gap between traditional benchmarks and real-world capability.

From Haiku to Mythos: Evaluation Results Across the Full Claude Model Lineup

Anthropic is an AI safety company co-founded in 2021 by Dario Amodei, formerly VP of Research at OpenAI, and Daniela Amodei. Its core technical approach features Constitutional AI and RLHF (Reinforcement Learning from Human Feedback). The Claude model family follows a tiered product strategy: Haiku is a lightweight, low-latency model suited to high-throughput scenarios; Sonnet is a mid-sized, balanced model; Opus is the flagship large model built for maximum reasoning capability; and Mythos, as a newer member, represents Anthropic's latest explorations in model architecture and training methods. The VendingBench team ran systematic evaluations across this entire spectrum, providing valuable data on the capability boundaries of models at different scales.

The Non-Linear Relationship Between Model Scale and Capability

The evaluation results revealed an important finding: capability improvements do not scale linearly with parameter count. This forms an interesting contrast with the widely discussed scaling laws in AI research. In 2020, Kaplan et al. at OpenAI gave the first systematic description of the power-law relationship between model performance and parameter count, data volume, and compute, but VendingBench's evaluations revealed a more complex reality. On certain task dimensions, smaller models can perform comparably to larger ones; factual retrieval tasks, for example, may already be near saturation on smaller models. In scenarios requiring deep reasoning, however, the gap between models widens dramatically and follows a staircase-like "emergence" pattern, in which a qualitative leap appears once model scale crosses a certain threshold. This non-linearity has direct practical implications for enterprises choosing a deployment strategy: "bigger is better" is not a sufficient rule, and model choice should be matched to the specific capability demands of each application scenario.
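For reference, the parameter-count form of the power law described by Kaplan et al. is usually written as below. This is a sketch of the general shape only: L is test loss, N is the number of model parameters, and N_c and \alpha_N are empirically fitted constants whose values are not given in the interview.

```latex
L(N) = \left( \frac{N_c}{N} \right)^{\alpha_N}
```

Analogous forms exist for dataset size and compute. Note that this relationship concerns loss rather than task-level scores, which is one reason a smooth curve in loss can coexist with uneven, threshold-like behavior on individual tasks.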

The Special Significance of Mythos Evaluation Results

As a newer member of the Claude model family, Mythos's evaluation performance has attracted widespread attention. VendingBench's evaluation framework was able to capture capability improvements that traditional benchmarks struggle to reflect, which also validates the discriminative power and forward-looking nature of the evaluation system. When scores across mainstream benchmarks increasingly converge, an evaluation system that can effectively distinguish subtle but critical differences between models becomes ever more valuable.

How to Build a Durable AI Evaluation Benchmark from Scratch

During the interview, Lukas and Axel shared their complete methodology for building high-quality evaluation benchmarks. Here are several key aspects.

Addressing Evaluation Data Contamination

One of the biggest challenges facing AI evaluation today is data contamination — when evaluation data gets incorporated into training sets, high scores no longer reflect true capability. The core mechanism behind this problem is that large language model training data typically comes from large-scale web crawling, and many publicly available benchmark datasets also exist on the internet. When test questions and their answers are included in training corpora, models are effectively "recalling" answers rather than "reasoning" through them, leading to inflated evaluation scores. Since 2023, multiple studies have revealed the severity of this issue — the performance gap between contaminated and uncontaminated benchmarks can reach 10-20 percentage points for some models.
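One common way to look for this kind of leakage is to measure n-gram overlap between a benchmark item and a sample of training text. The following is a minimal sketch under that assumption; it is not a description of how the VendingBench team checks for contamination, and the function names, the toy corpus, and the choice of n are all hypothetical.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(item: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear in any corpus document."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0  # item is shorter than n words, nothing to compare
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Toy data (hypothetical): one document that quotes a test item, one that does not.
corpus = [
    "forum post: what is the capital of france answer paris see the quiz key",
    "an unrelated page about vending machine maintenance schedules",
]
leaked = "What is the capital of France answer Paris"
fresh = "Which supplier offers the lowest unit price this week"

print(overlap_ratio(leaked, corpus, n=4))  # high ratio suggests possible leakage
print(overlap_ratio(fresh, corpus, n=4))   # low ratio suggests the item is unseen
```

A high ratio is only a signal, not proof: paraphrased or translated copies of a test item would not be caught by exact n-gram matching.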

The VendingBench team adopted multiple strategies to address this problem:

Dynamically updating evaluation sets to reduce data leakage risk
Designing task structures that are difficult to simply memorize
Introducing diverse evaluation dimensions
Using parameterized generation so that the specific task instances differ on every evaluation run, which makes memorizing the test set far less useful
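The parameterized-generation idea in the list above can be sketched as follows. This is an illustration only: the `RestockTask` type, the `make_task` function, and the restocking scenario are hypothetical and are not taken from the VendingBench codebase. Each seed yields a different concrete instance, while the same seed always reproduces the same instance.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RestockTask:
    """Hypothetical task instance: how many packs to order to reach a target stock."""
    prompt: str
    answer: int


def make_task(seed: int) -> RestockTask:
    """Generate one task instance whose numbers depend only on `seed`."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    product = rng.choice(["soda", "chips", "gum", "water"])
    in_stock = rng.randint(0, 40)
    target = rng.randint(50, 120)
    pack_size = rng.choice([6, 12, 24])
    needed = target - in_stock          # always positive given the ranges above
    packs = -(-needed // pack_size)     # ceiling division
    prompt = (
        f"The machine holds {in_stock} units of {product}. "
        f"The target stock is {target} units and {product} ships in packs of {pack_size}. "
        "What is the minimum number of packs to order?"
    )
    return RestockTask(prompt, packs)


# Fresh seeds give fresh instances; a recorded seed reproduces an old one.
for seed in (1, 2, 3):
    task = make_task(seed)
    print(task.prompt, "->", task.answer)
```

Because the answer is computed from the sampled parameters rather than stored, a model that has memorized one instance gains nothing on the next.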

Designing for Evaluation "Shelf Life"

An excellent evaluation benchmark must not only be effective today but also keep its discriminative power as models rapidly iterate. Benchmark saturation is a long-standing challenge in this field: well-known benchmarks such as MNIST (handwritten digit recognition, accuracy now exceeding 99.8%) and SQuAD 1.1 (reading comprehension, where models have already surpassed human-level performance) have gone through the complete lifecycle from effective to saturated. The team considered evaluation longevity from the outset, using tiered difficulty design (multiple difficulty gradients from basic to extremely challenging) and an extensible task framework (task complexity that can be raised as model capabilities improve) so that the benchmark is not "maxed out" and rendered meaningless in the short term. Open-ended task design, meaning task types without a single correct answer that require holistic judgment, is another important way to maintain long-term discriminative power.
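One simple way to monitor the headroom described above is to track pass rates per difficulty tier and flag tiers that have effectively saturated. The sketch below assumes this approach; the tier names, the 0.95 threshold, and the result data are all hypothetical.

```python
from typing import Dict, List


def pass_rates(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Per-tier fraction of passed tasks (tiers with no results are skipped)."""
    return {tier: sum(r) / len(r) for tier, r in results.items() if r}


def saturated_tiers(results: Dict[str, List[bool]], threshold: float = 0.95) -> List[str]:
    """Tiers whose pass rate is at or above `threshold`, i.e. little headroom left."""
    return [tier for tier, rate in pass_rates(results).items() if rate >= threshold]


# Hypothetical results for one model: True means the task was passed.
results = {
    "basic": [True] * 20,
    "intermediate": [True] * 17 + [False] * 3,
    "hard": [True] * 6 + [False] * 14,
    "extreme": [True] * 1 + [False] * 19,
}

print(pass_rates(results))
print(saturated_tiers(results))
```

When the lower tiers saturate, the upper tiers still separate models, and a saturated top tier is the cue to extend the task framework with harder instances.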

Four Core Principles for Building Frontier Evaluations

The following key principles can be distilled from the interview:

Reality First: Evaluation tasks should reflect real-world usage scenarios as closely as possible, bridging the gap between benchmark scores and actual user experience
Multi-Dimensional Coverage: Avoid one-sided assessments from single metrics by comprehensively considering multiple capability dimensions including reasoning, creativity, instruction following, and robustness
Saturation-Resistant Design: Evaluation difficulty should have sufficient headroom, ensuring long-term effectiveness through combinatorially explosive task spaces and tiered difficulty architectures
Reproducibility: Evaluation processes and standards must be transparent and repeatable — this is a fundamental requirement for scientific evaluation and the foundation for building industry trust
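For the reproducibility principle, one lightweight practice is to record a fingerprint of the full evaluation configuration alongside every result, so that two runs can be compared only when their settings match. The following is a minimal sketch under that assumption; the config fields and the `config_fingerprint` helper are hypothetical and not part of any published VendingBench tooling.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """SHA-256 hex digest of the canonical (key-sorted) JSON form of a config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical run configurations.
run_a = {"model": "model-x", "seed": 1234, "task_set": "v1", "temperature": 0.0}
run_b = {"temperature": 0.0, "task_set": "v1", "seed": 1234, "model": "model-x"}
run_c = dict(run_a, seed=1235)  # same settings, different seed

print(config_fingerprint(run_a) == config_fingerprint(run_b))  # key order does not matter
print(config_fingerprint(run_a) == config_fingerprint(run_c))  # any changed setting does
```

Storing the fingerprint with each score makes it easy to tell whether a difference between two results reflects the model or merely a changed setting.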

Implications for AI Developers and Enterprise Users

As AI model capabilities continue to advance, the importance of evaluation systems only grows. The VendingBench team's work is a reminder that good evaluation is not just a tool for measuring models; it also steers where model development goes. Goodhart's Law warns that "when a measure becomes a target, it ceases to be a good measure," so evaluation designers must keep innovating to ensure that benchmarks guide models toward genuinely valuable directions.

As more and more models approach saturation on traditional benchmarks, evaluation benchmarks like VendingBench that emphasize real-world scenarios and possess lasting discriminative power will become critical references for the industry in judging true model capabilities. For AI developers and enterprise users, understanding the design philosophy behind evaluations is more important than simply watching leaderboard scores. Specifically, when selecting models, enterprises should consider whether the evaluation covers the core capability requirements of their business scenarios, whether the evaluation data carries contamination risks, and whether evaluation results are consistent and comparable across different time points.

Conclusion

The Andon Labs team has demonstrated frontier practices in AI evaluation through VendingBench. From systematic evaluation of the full Claude model lineup to their methodology for building durable benchmarks, their experience offers the industry a valuable reference framework. At a time when AI capability boundaries are constantly expanding, a scientifically rigorous evaluation system is a cornerstone of healthy technological development. As model capabilities keep advancing and application scenarios keep broadening, evaluation methodology will continue to evolve, and the design philosophy VendingBench represents (real-world scenarios first, saturation-resistant, contamination-resistant) may well become the standard paradigm for the next generation of AI evaluation benchmarks.