2 related articles

A deep dive into ViBench, a benchmark that addresses SWE-bench's gaps in evaluating AI application building by measuring end-to-end generation, visual quality, and functional completeness.

ViBench is the first end-to-end app-creation benchmark based on real-world tasks. Results show that Claude Opus 4.8 leads in both performance and cost-effectiveness, and they reveal gaps between SWE-bench scores and actual development capability.