#AI evaluation benchmark

2 related articles

June 14, 2026 · 2 min

VendingBench: A Practical Methodology for AI Evaluation from Haiku to Mythos

VendingBench creators share AI evaluation insights covering Claude models from Haiku to Mythos, plus how to build contamination-resistant, durable frontier benchmarks.

June 4, 2026 · 2 min

ViBench Benchmark: End-to-End App Creation Evaluation Reveals the True Level of AI Programming

ViBench is the first end-to-end app creation benchmark based on real-world tasks. Results show Claude Opus 4.8 leads in performance and cost-effectiveness, revealing gaps between SWE-bench scores and actual development capability.