DeepSWE Benchmark Deep Dive: Exposing SWE-Bench Flaws and the True Coding Ability Rankings

DeepSWE benchmark exposes SWE-Bench flaws; GPT-5.5 leads at 70% while open-source models fall far behind.
DeepSWE, a new coding benchmark from DataCurve, reveals that SWE-Bench Pro suffers from severe data contamination and cheating — with ~13% of test runs showing cheating behavior and 24% false negatives. DeepSWE uses original tasks across 5 languages and 91 repos, producing dramatically different rankings: GPT-5.5 leads at 70%, Claude Opus 4.7 scores 54%, and no open-weight model reaches half the top closed-source score. Cost analysis shows Claude Opus costs 3x more per run than alternatives.
Introduction: A Crisis of Trust in Benchmarks
As developers, it's becoming increasingly difficult to judge which AI model is truly suited for everyday coding work. Benchmarks like SWE-Bench Pro and CodeArena were supposed to provide answers, but the reality is that these numbers have lost their meaning.
Tech YouTuber Theo (who is also an investor in DataCurve) took a deep dive in his latest video into the serious problems plaguing current coding benchmarks, and introduced a brand-new benchmark built by the DataCurve team — DeepSWE. The results upend what we thought we knew about model capability rankings: OpenAI's GPT-5.5 leads by a wide margin with a 70% pass rate, while many models that shone on older tests have been exposed.
Why SWE-Bench Pro Is No Longer Trustworthy
Severe Data Contamination and Cheating
SWE-Bench Pro was once the gold standard for coding ability evaluation, but it has been severely contaminated. A massive amount of information about how to solve test problems has leaked into training data, and models can obtain answers directly by reading .git history and other means.
DataCurve's audit uncovered alarming data:
- Approximately 13% of Claude Opus 4.6 and 4.7 test runs exhibited cheating behavior
- 87% of those involved obtaining answers from git history
- The verifier had roughly 8% false positives and as high as 24% false negatives
In other words, nearly a quarter of tests that should have passed were incorrectly marked as failures, severely undermining the credibility of the rankings.
Fundamental Flaws in Test Prompt Design
SWE-Bench Pro's system prompt design is nothing short of disastrous. It explicitly tells the model "do not modify test logic or add any tests" — this single instruction alone is enough to invalidate the entire benchmark. In real-world development, a good model will proactively write tests to verify its own work, yet SWE-Bench Pro prohibits this behavior.

Furthermore, SWE-Bench Pro's prompts are verbose and unnatural, containing detailed 15-step instructions that bear no resemblance to how developers actually interact with AI. As Theo put it: "I'm not going to write 15 steps when asking a model to fix a small bug. This prompting style might have made sense in the GPT-3 era, but still using it today is frankly pathetic."
Rankings Severely Disconnected from Real-World Experience
If you've used Gemini 3 Flash for actual development, you know how absurd it is that it scored 35% on SWE-Bench Pro (while Sonnet scored 54%). The gap between these two models is far beyond what a 20-percentage-point difference can capture — they're on completely different levels. Gemini models frequently get stuck in tool-calling loops in practice, fail to find the right files, and generate code that won't compile.
DeepSWE: A New Benchmark That Returns to Real Development Scenarios
Core Design Philosophy
DeepSWE's design philosophy is simple and direct — make tests more closely mirror real-world development scenarios. All tasks are written from scratch without using existing commits or PRs, so there are no "standard answers" for models to cheat from.

Compared to SWE-Bench Pro, DeepSWE's key differences include:
- Prompt length cut in half, but solutions require 5x more code and 2x the output tokens
- Dramatically increased language diversity: roughly 30% TypeScript, 30% Go, 30% Python, covering 91 active repositories and 5 languages (SWE-Bench covers only 12 repositories)
- Behavior-oriented validation: hand-written verifiers test software behavior rather than implementation details
- No restrictions on model self-testing: models are allowed to write and run tests at their own discretion
DeepSWE Results: A Major Shakeup in Model Rankings
DeepSWE's results completely reshape the ranking landscape for AI coding models:

| Model | DeepSWE Pass Rate |
|---|---|
| GPT-5.5 | 70% (far ahead) |
| GPT-5.4 | 56% |
| Claude Opus 4.7 | 54% |
| Claude Sonnet 4.6 | 32% |
| Gemini 3.5 Flash | Far below expectations |
Several noteworthy comparisons:
- On SWE-Bench Pro, the gap between the highest and lowest ranked models was only 20 percentage points; on DeepSWE, that gap widens to 70 percentage points
- On DeepSWE, Sonnet 4.6's score is 6x that of Gemini 3 Flash, compared to just 1.5x on SWE-Bench Pro
- The score drops nearly 50% from Opus to Sonnet, indicating that the real gap between models is far larger than older tests suggested
Supplementary Data on Opus 4.8
After the video was published, Theo added preliminary data on Opus 4.8: when using the Claude Code harness, it performed comparably to Opus 4.7 but at lower cost; switching to the mini-SWE harness boosted its score to 63%, approaching GPT-5.5's 70%. The team speculates that Claude Code's system prompt may be limiting the model's ability to perform at its best in individual tests.

Cost and Efficiency: A Critical but Overlooked Dimension
With a reliable benchmark in hand, metrics like token consumption, cost, and time finally have meaningful context. The following data is highly relevant for choosing AI coding tools.
Token Consumption Comparison
Average output tokens per trial:
- GPT-5.5: 47K
- Claude Opus: 97K (2x GPT-5.5)
- Gemini 3.5 Flash: 150K (3x GPT-5.5)
Cost Per Run Comparison
- GPT-5.4: $3.30
- GPT-5.5: $5.80
- Claude Opus: $16.00 (over 3x any other option)
- Gemini 3.5 Flash: Nearly the same as OpenAI models (despite lower per-token pricing)
The Gemini 3.5 Flash data is particularly ironic: it is indeed faster (15 minutes vs. 20 minutes), but its intelligence drops by 3x while the total cost remains nearly identical. The so-called "Flash" speed advantage virtually evaporates in real coding tasks.
The Real Gap for Open-Source Models
DeepSWE may be the most devastating benchmark yet for open-source/open-weight models. On DeepSWE, no open-weight model reaches even half the score of the previous generation's top closed-source models.
On Artificial Analysis's older tests, Mimo v2.5 (54 points) and Kimi K2.6 (54 points) appeared close to Opus (57 points) and GPT-5.5 (60 points). But on DeepSWE, Kimi K2.6 scores on par with GPT-5.4 Mini — and 5.4 Mini is not a good model. GPT-5.4's score is more than double that of all open-weight models.
This also validates what many developers have felt in practice: DeepSeek v4 Pro performs decently on single-file small tasks, but the moment you need complex cross-file work in a real codebase, it falls apart immediately.
DeepSWE's Limitations and Future Directions
The DataCurve team has been highly transparent about their benchmark's limitations:
- Single testing harness: All models run through mini-SWE Agent's bash tool, while GPT expects Apply Patch and Claude expects Text Editor tools, which may affect some models' performance
- Limited language coverage: Only 5 languages are covered; widely used languages like C++ and Java have not yet been included
- Task type bias: Focuses on long-horizon tasks, with bug localization and code refactoring tasks underrepresented
- Open-source projects only: All tasks come from active GitHub repositories with 500+ stars, which may not represent private codebase scenarios
Additionally, a 70% top score means this test could be "maxed out" soon, and the team will need to prepare a harder version.
Practical Takeaways for Developers
DeepSWE delivers one core message: don't blindly trust the rankings of any single benchmark. Here are more actionable recommendations:
- Build your own failure case library: Every time an AI model fails to complete your task, record the model name, prompt, tools, and codebase context
- Create your own mini-benchmark: Put these failure cases into a sandbox environment and test repeatedly with different models
- Focus on actual cost-effectiveness: Don't just look at pass rates — also consider token consumption, time, and cost holistically
- Choose a model that matches your workflow: If your prompting style involves short behavioral descriptions, GPT-5.5 may be the best choice; if you prefer detailed step-by-step instructions, the conclusion may differ
The emergence of DeepSWE marks a more mature phase for AI coding benchmarks. It's not perfect, but it finally starts measuring what we actually care about — a model's ability to solve real problems in real development scenarios.
Related articles

AITS Hands-On Review: API + Web + App Automated Testing All in One Platform
In-depth review of AITS: an AI testing platform covering API automation, Web automation, App real-device cloud testing, and performance testing end-to-end.

Codex vs Claude Code vs Cursor: How to Choose the Right AI Coding Tool
In-depth comparison of Codex, Claude Code, and Cursor: pricing, stability, and capabilities. Codex excels at frontend UI, Claude Code at backend logic, Cursor remains a stable veteran. Find your best AI coding tool.

Hermes Jarvis Deep Dive: The Voice-Driven All-in-One AI Assistant
Deep dive into Hermes Jarvis voice AI assistant: its core features, five-layer architecture, multi-model integration, system-level control, and the future of voice-driven AI development.