7 related articles

Deep dive into ViBench, a benchmark addressing SWE-bench's gaps in evaluating AI application building through end-to-end generation, visual quality, and functional completeness.

ViBench is the first end-to-end app creation benchmark based on real-world tasks. Results show Claude Opus 4.8 leads in performance and cost-effectiveness, revealing gaps between SWE-bench scores and actual development capability.
Product ReviewsHands-on testing of Claude Haiku 4.5's coding ability, comparing it with Sonnet 4.5 and Opus 4.1 across weather cards, physics simulation, and 3D rendering tasks.
Product ReviewsGPT-5.5 vs DeepSeek-V4 in four comprehensive rounds covering world knowledge, context memory, logical reasoning, and coding — a detailed comparison of real performance differences.
Tech FrontiersWeekly AI roundup: Kimi K2.6 tops open-source rankings, Anthropic launches Opus 4.7 and Claude Design, Alibaba rolls out Qwen 3.6 series, Google releases emotion-controllable TTS model.
Tech FrontiersDeepSeek releases OCR2 replacing CLIP with an LLM as visual encoder; Moonshot AI launches Kimi K2.5 with 100+ sub-agent cluster mode; Microsoft deploys 3nm Maia 200 chip; Alibaba releases Qwen3 Max Thinking.