GPT-5.5 After 3 Weeks of Real-World Testing: Does It Really Crush Opus 4.7 at Coding?

Overview: How GPT-5.5 Actually Performs

OpenAI has officially released GPT-5.5, and the EVERY team has been testing it internally for about three weeks. From coding to writing to knowledge work, they've put it through comprehensive evaluation. The verdict: this model delivers a genuine leap forward in many capabilities, but it also has clear weaknesses.

Bilibili source: GPT-5.5实测3周：真的能打败Claude Opus 4.7了吗？| 中文配音 ("GPT-5.5 Tested for 3 Weeks: Can It Really Beat Claude Opus 4.7?", Chinese dub)

This article draws on the EVERY team's detailed "vibe evaluation" to provide an in-depth analysis of GPT-5.5's real-world performance across three dimensions — coding, writing, and knowledge work — along with how it truly stacks up against Claude Opus 4.7.

Coding: A Breakthrough on the SABench Senior Engineer Benchmark

SABench Benchmark Explained

The EVERY team created a benchmark called SABench (Senior Engineer Benchmark). The methodology: give the model a poorly written codebase and ask it to perform a clean, conceptually clear rewrite from scratch — exactly what a real senior engineer would do. The gold standard is set by two human senior engineers who each rewrote the code independently, consistently scoring 80–90 points.

Key Score Comparison:

GPT-5.5 (using an Opus 4.7 plan): 62.5 points
GPT-5.5 (writing its own plan): 50–55 points
GPT-5.5 (no plan): Low 40s
Claude Opus 4.7: ~33 points
Human senior engineers: 80–90 points

This means GPT-5.5 outscored Opus 4.7 by roughly 30 points (62.5 versus about 33), but with one critical caveat: GPT-5.5's best performance was achieved using a plan written by Opus 4.7.
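To make the arithmetic behind that comparison explicit, here is a small sketch that computes the gaps from the scores listed above. Taking the midpoint of each reported range is an assumption made for illustration, not part of the SABench methodology, and the "no plan" result is left out because it was only reported as "low 40s".

```python
# Scores as reported in the article; ranges are stored as (low, high).
SCORES = {
    "GPT-5.5 + Opus 4.7 plan": (62.5, 62.5),
    "GPT-5.5 + own plan": (50.0, 55.0),
    "Claude Opus 4.7": (33.0, 33.0),  # reported as "~33"
    "Human senior engineers": (80.0, 90.0),
}


def midpoint(score_range):
    """Midpoint of a (low, high) range; an illustrative simplification."""
    low, high = score_range
    return (low + high) / 2


def gap(a, b):
    """Difference between the midpoints of two reported scores."""
    return midpoint(SCORES[a]) - midpoint(SCORES[b])


lead_over_opus = gap("GPT-5.5 + Opus 4.7 plan", "Claude Opus 4.7")
shortfall_vs_humans = gap("Human senior engineers", "GPT-5.5 + Opus 4.7 plan")
plan_bonus = gap("GPT-5.5 + Opus 4.7 plan", "GPT-5.5 + own plan")

print(lead_over_opus)       # 29.5
print(shortfall_vs_humans)  # 22.5
print(plan_bonus)           # 10.0
```

Read this way, the Opus-written plan is worth about 10 points to GPT-5.5 compared with its own plan, and the best configuration still sits about 22 points below the human midpoint.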

Why Does an Opus 4.7 Plan Make GPT-5.5 Stronger?

This finding is fascinating and reveals the distinct "personalities" of each model:

GPT-5.5's Core Strengths:

Can identify core principles and invariants in a codebase
Doesn't get led astray by existing code or fall into "patch mode"
Has the boldness to delete large numbers of files and start fresh
Can follow through on an idea from start to finish over hours of work
Delivers real execution power in ultra-high reasoning mode

Opus 4.7's Core Strengths:

Superior plan-writing ability with conceptual clarity
Plans read like "contracts" — precise, with acceptance criteria
Specifies concrete details, such as "this large file should only be 100 lines"
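One way to picture a "contract"-style plan is as a list of steps, each carrying acceptance criteria that can be checked mechanically. The sketch below is only an illustration: `AcceptanceCriterion`, `PlanStep`, and the `orders.py` example are hypothetical names, and real Opus 4.7 plans are prose, not Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class AcceptanceCriterion:
    """A checkable condition, e.g. "this file should be at most 100 lines"."""
    path: str
    max_lines: int

    def is_met(self, files: dict) -> bool:
        source = files.get(self.path)
        if source is None:  # a missing file cannot satisfy the criterion
            return False
        return len(source.splitlines()) <= self.max_lines


@dataclass
class PlanStep:
    """One step of a plan plus the criteria that define "done"."""
    description: str
    criteria: list = field(default_factory=list)

    def unmet(self, files: dict) -> list:
        """Return the criteria that the given files do not yet satisfy."""
        return [c for c in self.criteria if not c.is_met(files)]


# Hypothetical step: shrink a large module to at most 100 lines.
step = PlanStep(
    description="Rewrite the order module around a single Order type",
    criteria=[AcceptanceCriterion(path="orders.py", max_lines=100)],
)

before = {"orders.py": "\n".join(["pass"] * 250)}  # 250-line file
after = {"orders.py": "\n".join(["pass"] * 80)}    # 80-line rewrite

print(len(step.unmet(before)))  # 1
print(len(step.unmet(after)))   # 0
```

The point of the design is that a criterion such as "only 100 lines" leaves no room for a partial patch to count as finished.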

Opus 4.7's Clear Weakness:

When given its own beautifully written plan, it tends to say "that's too much work"
Prefers to pick a small section and patch things up
Reluctant to perform the full rewrite as requested

This leads to an interesting best practice: Use Opus 4.7 for planning, use GPT-5.5 for execution.
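The plan-then-execute split can be sketched as a two-stage pipeline. This is a minimal sketch under stated assumptions: `plan_then_execute`, `fake_planner`, and `fake_executor` are hypothetical names, the prompts are invented for illustration, and the model calls are passed in as plain functions instead of using any vendor's SDK.

```python
from typing import Callable, Dict

# A model call is assumed to be "prompt in, text out".
ModelCall = Callable[[str], str]


def plan_then_execute(task: str, planner: ModelCall, executor: ModelCall) -> Dict[str, str]:
    """Ask one model for a plan, then hand that plan to another model to execute."""
    plan = planner(
        "Write a precise rewrite plan with acceptance criteria for this task:\n" + task
    )
    result = executor(
        "Carry out the full rewrite described in the plan. Do not patch; "
        "follow every step.\n\nPLAN:\n" + plan + "\n\nTASK:\n" + task
    )
    return {"plan": plan, "result": result}


# Stand-in callables so the sketch runs without any network access.
def fake_planner(prompt: str) -> str:
    return "1. Define core invariants\n2. Rewrite modules\n3. Delete dead files"


def fake_executor(prompt: str) -> str:
    # Count the numbered plan steps found in the prompt.
    steps = [line for line in prompt.splitlines() if line[:1].isdigit()]
    return f"Executed {len(steps)} plan steps"


outcome = plan_then_execute("Rewrite the legacy billing code", fake_planner, fake_executor)
print(outcome["result"])  # Executed 3 plan steps
```

In practice the two callables would wrap whichever planning and execution models you use; keeping them as parameters makes it easy to swap either side and compare results.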

Performance Across Different Coding Scenarios

GPT-5.5 doesn't dominate in every coding scenario. Multiple leads on the EVERY team tested from different angles:

Product-Side Engineering Tasks (LSD Benchmark): Opus 4.7 has a higher ceiling, especially on design-oriented tasks where its aesthetic sense outshines GPT-5.5. For feature development involving heavy frontend design and product thinking, Opus still leads.

Vibe Coding (Building New Apps from Scratch): When the plan is unclear, GPT-5.5 is less capable than Opus 4.7 at carrying a task through to completion. Opus has stronger autonomous planning ability under ambiguous requirements.

Programming Language Preferences: GPT-5.5 excels particularly at TypeScript and Swift, but performs poorly with Ruby. If you're working on a Rails project, you may be disappointed with the quality of generated Ruby code.

Real Project Validation: EVERY's GM Navin used GPT-5.5 to build a native iOS/Mac to-do app called Dayline and was impressed by its ability to batch-process features according to plan. He called it his "favorite all-around model" and said he couldn't have met the product launch deadline without it.

Writing: A New Option for Business Writing

In terms of writing, GPT-5.5 has less "personality" than Opus, especially compared to older versions like Opus 4–6. But it excels in business writing:

Investor update emails are essentially ready to send on the first draft
Excellent voice cloning — captures style accurately without overdoing it
More restrained and nuanced tone, well-suited for business contexts

EVERY's staff writer Katy Perrott has been writing with Claude models for nearly two years, and this is the first GPT model in a long time that she's started using for writing tasks. That endorsement carries significant weight.

Knowledge Work and AI Agent Experience

A Clear Speed Advantage

Across all tests, the team was consistently impressed by GPT-5.5's speed. It feels noticeably faster than Opus 4.7, and OpenAI's hardware advantage is clearly felt. This matters especially in agent scenarios that require frequent interaction.

The Codex Desktop App + GPT-5.5 Combo

OpenAI is iterating rapidly in the knowledge work space. The Codex desktop app paired with GPT-5.5 was rated as "the best agent experience on desktop":

Extremely fast and powerful
Can use any application on your computer
Excellent at web browsing
Great at building dashboards and performing complex data analysis

The Trade-Off in Insight

Interestingly, some of the training trade-offs that make the model's output easier to digest came at the cost of its insight into details. If your work demands sharp analytical insight, the reviewers recommend sticking with Opus 4.7. Even when it came to evaluating the models' approaches on the senior engineer benchmark, Opus 4.7's judgment proved more trustworthy.

Practical Advice: How to Maximize GPT-5.5's Value

Based on three weeks of intensive testing, here are the key recommendations:

Plan First: Whether you're vibe coding or tackling senior engineering tasks, writing a more explicit plan is how you unlock this model's full potential
Use Both Models Together: Plan with Opus 4.7, execute with GPT-5.5 — this is the optimal workflow right now
Choose Your Language: Prefer TypeScript or Swift; avoid Ruby
Match the Scenario: Use GPT-5.5 when you need execution power; use Opus 4.7 when you need insight and aesthetic judgment
Agent Scenarios: For agent-based work on your computer, Codex + GPT-5.5 is currently the best option
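The recommendations above amount to a simple routing rule. The sketch below encodes them as a lookup; the task kinds ("planning", "design", "analysis", "execution", "desktop_agent") and the `Task` and `recommend` names are assumptions chosen for illustration, not categories the EVERY team defined.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Task:
    kind: str                       # hypothetical category label
    language: Optional[str] = None  # programming language, if relevant


# Task kinds where the article favors Opus 4.7.
OPUS_KINDS = {"planning", "design", "analysis"}


def recommend(task: Task) -> Tuple[str, str]:
    """Return (model, reason) following the article's advice."""
    if task.kind in OPUS_KINDS:
        return "Opus 4.7", "insight, planning, and aesthetic judgment"
    if task.kind == "desktop_agent":
        return "Codex + GPT-5.5", "agent work on your computer"
    if task.kind == "execution":
        if (task.language or "").lower() == "ruby":
            # The article reports weak Ruby output, so flag it for review.
            return "GPT-5.5", "execution, but review the Ruby output closely"
        return "GPT-5.5", "execution against an explicit plan"
    raise ValueError(f"unknown task kind: {task.kind!r}")


print(recommend(Task("planning"))[0])             # Opus 4.7
print(recommend(Task("execution", "Swift"))[0])   # GPT-5.5
print(recommend(Task("desktop_agent"))[0])        # Codex + GPT-5.5
```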

Conclusion: How to Choose Between GPT-5.5 and Opus 4.7

GPT-5.5 has indeed achieved a significant edge over Opus 4.7 in coding execution, but this isn't total domination. Each model has its strengths: GPT-5.5 is an outstanding executor, while Opus 4.7 is a better planner and aesthetic judge. The smartest approach isn't choosing one over the other; it's understanding each model's "personality" and using the right tool for the right scenario.