2 related articles

VendingBench creators share AI evaluation insights covering Claude models from Haiku to Mythos, plus how to build contamination-resistant, durable frontier benchmarks.

ViBench is the first end-to-end app creation benchmark based on real-world tasks. Results show Claude Opus 4.8 leads in performance and cost-effectiveness, revealing gaps between SWE-bench scores and actual development capability.