#SWE Bench

7 related articles

2026年6月4日·2 min

ViBench: A Benchmark Designed Specifically for Evaluating AI Application Building Capabilities

Deep dive into ViBench, a benchmark addressing SWE-bench's gaps in evaluating AI application building through end-to-end generation, visual quality, and functional completeness.

2026年6月4日·2 min

ViBench Benchmark: End-to-End App Creation Evaluation Reveals the True Level of AI Programming

ViBench is the first end-to-end app creation benchmark based on real-world tasks. Results show Claude Opus 4.8 leads in performance and cost-effectiveness, revealing gaps between SWE-bench scores and actual development capability.

Claude Haiku 4.5 Hands-On: Coding Ability Rivals Sonnet 4 at One-Third the Cost

Product Reviews

2026年6月3日·1 min

Claude Haiku 4.5 Hands-On: Coding Ability Rivals Sonnet 4 at One-Third the Cost

Hands-on testing of Claude Haiku 4.5's coding ability, comparing it with Sonnet 4.5 and Opus 4.1 across weather cards, physics simulation, and 3D rendering tasks.

GPT-5.5 vs DeepSeek-V4: Who Wins in a Four-Round Head-to-Head Test?

Product Reviews

2026年6月3日·2 min

GPT-5.5 vs DeepSeek-V4: Who Wins in a Four-Round Head-to-Head Test?

GPT-5.5 vs DeepSeek-V4 in four comprehensive rounds covering world knowledge, context memory, logical reasoning, and coding — a detailed comparison of real performance differences.

AI Weekly: Kimi K2.6 Tops Open-Source Rankings, Qwen 3.6 and Google TTS Launch Together

Tech Frontiers

2026年5月27日·2 min

AI Weekly: Kimi K2.6 Tops Open-Source Rankings, Qwen 3.6 and Google TTS Launch Together

Weekly AI roundup: Kimi K2.6 tops open-source rankings, Anthropic launches Opus 4.7 and Claude Design, Alibaba rolls out Qwen 3.6 series, Google releases emotion-controllable TTS model.

DeepSeek OCR2, Kimi K2.5, and Microsoft Maia 200 All Launched on the Same Day

Tech Frontiers

2026年5月27日·2 min

DeepSeek OCR2, Kimi K2.5, and Microsoft Maia 200 All Launched on the Same Day

DeepSeek releases OCR2 replacing CLIP with an LLM as visual encoder; Moonshot AI launches Kimi K2.5 with 100+ sub-agent cluster mode; Microsoft deploys 3nm Maia 200 chip; Alibaba releases Qwen3 Max Thinking.

Product Reviews

Running Qwen3.6-27B Locally on Mac: 4 …

2026年5月27日·3 min

Running Qwen3.6-27B Locally on Mac: 4 Solutions Benchmarked

Benchmarking 4 solutions for running Qwen3.6-27B locally on Mac: GGUF, MLX Diflash, and MTP-LX. MTP-LX 4bit leads at 43.6 tok/s with solid coding, writing, and reasoning quality.