3 related articles

mini-SWE-agent's GPT-5 series evaluation on SWE-bench shows GPT-5 matches Claude Sonnet 4, while GPT-5-mini loses only ~5 points at less than 1/5 the cost.

A deep dive into the SWE-bench Multilingual benchmark: 9 programming languages, 300 real GitHub tasks, its design methodology, language distribution, evaluation metrics, and significance for AI coding assistants.

The SWE-agent team finds that mini-SWE-agent, when randomly switching between GPT-5 and Claude Sonnet 4, outscores either model alone on SWE-bench. An exploration of the diversity hypothesis behind Roulette Mode.