Deep Dive Review of AI Coding Assistants: Copilot at the Bottom

Introduction

With AI coding assistants multiplying, developers face a practical question: which tool should they actually choose? A senior developer ran a systematic evaluation of the mainstream AI coding assistants on the market, and the results were surprising: GitHub Copilot, the former industry benchmark, finished near the bottom, while some tools you may never have heard of performed remarkably well.


This article provides a detailed breakdown of the evaluation methodology, how each tool performed across different models, and the final comprehensive rankings.

Evaluation Methodology: A Three-Dimensional Scoring System

Model Selection

This evaluation used three mainstream large language models as the underlying engines:

Claude 4.0
Claude 3.7
Gemini Pro 2.5 0506

Claude 4.0 (also known as Claude Sonnet 4) is Anthropic's latest model, released in mid-2025, with significant improvements in code generation, reasoning, and instruction following. Claude 3.7 Sonnet is its predecessor, known for its "Extended Thinking" capability, which enables deeper reasoning chains before answering. Gemini Pro 2.5 (officially named Gemini 2.5 Pro) is Google DeepMind's multimodal model; while excellent at general tasks, it trailed the Claude series in the pure code generation scenarios tested here. At the time of the evaluation, these three models represented the top tier of AI coding, which makes them representative benchmarks.

The reviewer noted that Gemini Pro 2.5 0506 may be dropped from the evaluation next month, as its overall performance has clearly fallen behind the Claude series.

Three-Dimensional Scoring Criteria

The evaluation uses three core dimensions:

Instruction Following: When you tell the AI to do something, does it accurately execute the instruction?
Unit Testing: Testing the actual functionality of generated code
LLM as Judge: Using Claude 3.7 Thinking as a code quality judge to assess overall code quality

"LLM as Judge" is a methodology that has emerged in AI evaluation in recent years. Traditional code evaluation relies on manual review or fixed automated tests; the former is prohibitively expensive at scale and the latter is limited in coverage. Using a large language model as a judge allows assessment across multiple dimensions, including code readability, architectural design, error handling, and edge case coverage. Claude 3.7 in Thinking mode suits this role because its extended thinking capability allows multi-step reasoning before it assigns scores, which tends to produce more consistent and explainable judgments. Judgments are generally considered most reliable when the judging model is clearly more capable than the code generation model being evaluated.

The reviewer specifically noted that using Claude 3.7 Thinking as a judge provides extremely high consistency — running the same evaluation multiple times produces minimal variance in results.
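
One way to check a consistency claim like this is to run the judge several times on the same submission and measure the spread of its scores. The sketch below is illustrative only: the score list is made-up sample data, and `judge_spread` is a hypothetical helper, not part of the reviewer's harness.

```python
from statistics import mean, pstdev

def judge_spread(scores):
    """Summarize repeated judge runs on one code sample.

    Returns (mean, population standard deviation, coefficient of variation).
    """
    if not scores:
        raise ValueError("need at least one score")
    avg = mean(scores)
    sd = pstdev(scores)
    cv = sd / avg if avg else 0.0
    return avg, sd, cv

# Hypothetical scores from five judge runs on the same code sample.
runs = [82, 84, 83, 82, 84]
avg, sd, cv = judge_spread(runs)
print(f"mean={avg}, stdev={sd:.2f}, cv={cv:.2%}")
```

A small coefficient of variation across reruns is what "minimal variance" would look like in practice; a large one would mean a single judge run should not be trusted on its own.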

Performance Under Gemini Pro 2.5 0506

Under the Gemini Pro 2.5 model, overall performance was disappointing:

Rank	Tool	Score
1	Cline	Slightly above 6240
2	Zed	6240
3	RooCode	5980
4	Trae	Slightly below 5980
5	GitHub Copilot	Significant drop
6	Cursor	Significant drop
7	Windsurf	Last place

The reviewer pointed out that, by current standards, a score below 6000 does not qualify as high-quality output. Cline had an extremely low tool-call failure rate with acceptable code quality; Zed showed a clear improvement over RooCode, with fewer tool failures and more unit tests passed.

Performance Under Claude 3.7: A Major Shakeup

Switching to Claude 3.7, the rankings shifted dramatically:

Rank	Tool	Score
1	VOID Editor	7280
2	RooCode	7180
3	Zed	6780
4	GitHub Copilot	Slightly below Zed
5	Cursor	Slightly below Copilot
6	Cline	Lower-middle
7	Trae	Significant drop
8	Windsurf	Last place

The biggest surprise was VOID Editor, an open-source Cursor alternative that took first place with 7280 points. VOID Editor is a fully open-source AI code editor built on VS Code's open-source core, with its own integrated AI agent system. Unlike commercial products such as Cursor, VOID lets users choose their underlying model provider freely rather than locking them into a specific API service. Because it is open source, the community can audit its system prompts and tool-call logic. This may help explain its very high scores under certain models: a transparent design can mean less middleware overhead and more precise instruction delivery.

The reviewer stated they had never heard of this tool before. Any score above 7000 represents "excellent quality."
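
The reviewer's two cut-offs (below 6000 is not high quality, above 7000 is excellent) can be written as a small classifier. This is a sketch: the function name and band labels are illustrative, and the reviewer gives no name for the range between the two thresholds.

```python
def quality_band(score: int) -> str:
    """Classify a score using the reviewer's two stated cut-offs."""
    if score > 7000:
        return "excellent"
    if score < 6000:
        return "below the high-quality bar"
    # The reviewer does not label scores between 6000 and 7000.
    return "unlabeled middle band"

# Numeric scores reported for the Claude 3.7 run.
claude_37 = {"VOID Editor": 7280, "RooCode": 7180, "Zed": 6780}
bands = {tool: quality_band(score) for tool, score in claude_37.items()}
print(bands)
```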

Another notable change is Cline dropping from first place under Gemini to lower-middle under Claude 3.7. Trae also fell, from fourth to seventh, while Windsurf remained in last place.

Performance Under Claude 4.0: The Ultimate Showdown

Under the latest Claude 4.0 model:

Rank	Tool	Score
1	Claude Code (Ultra Think)	7170
2	Trae	7120
3	Windsurf	7080
4	Zed	~6780
5	Cursor	~6780
6	RooCode	~6780
7	Augment Code	Lower
8	GitHub Copilot	Last place

A key finding: Trae and Windsurf were at the bottom under Claude 3.7 but jumped into the top three under Claude 4.0. This suggests they may have been tuned specifically for Claude 4.0, possibly at the expense of their Claude 3.7 performance.

The large performance differences across models likely have technical causes. Each coding assistant has its own system prompts, tool-use protocols, and context assembly strategies. When a tool is optimized for a specific model's API format, token limits, and response characteristics, switching to another model can lead that model to misinterpret instructions. For example, Trae and Windsurf may use Claude 4.0-specific format markers or function-calling conventions in their system prompts that cause confusion on Claude 3.7. This may also explain why highly configurable tools (like RooCode) tend to maintain better cross-model consistency: users can adjust prompt strategies for different models.
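
The idea of adjusting prompt strategy per model can be sketched as a lookup of model-specific profiles. Everything here is hypothetical: the `PromptProfile` fields, the model keys, the prompt text, and the token figures are placeholders for illustration and do not reflect any real tool's configuration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptProfile:
    system_prompt: str
    tool_call_format: str    # hypothetical: how tool calls are encoded
    max_context_tokens: int  # placeholder budget, not a real model limit

# Hypothetical per-model profiles.
PROFILES = {
    "claude-4.0": PromptProfile("Plan first, then edit files.", "json", 100_000),
    "claude-3.7": PromptProfile("Edit one file per step.", "xml", 80_000),
}

# Conservative fallback for models without a tuned profile.
DEFAULT_PROFILE = PromptProfile("Follow the user's instructions exactly.", "json", 32_000)

def profile_for(model: str) -> PromptProfile:
    """Return the tuned profile for a model, or the fallback if none exists."""
    return PROFILES.get(model, DEFAULT_PROFILE)

print(profile_for("claude-3.7").tool_call_format)
print(profile_for("some-other-model") is DEFAULT_PROFILE)
```

A tool that ships only one profile behaves like a tool with only the fallback entry: it works best on the model that profile was written for.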

The Massive Impact of Ultra Think

The reviewer found that using the "Ultra Think" prompting technique with Claude Code produced enormous differences — it forces the AI to create its own checklist and iterate through execution. The same technique had minimal impact on other tools.

Ultra Think is a prompting technique for Claude Code. As the reviewer describes it, the prompt pushes the AI to generate a detailed execution plan and checklist before performing a task, then to complete each item with self-verification. In effect, it combines "Chain of Thought" and "Self-Reflection" techniques in a coding setting. As a CLI tool, Claude Code manages conversational context differently from IDE-embedded agents; its longer usable context and more flexible execution loop may be why iterative self-improvement techniques like Ultra Think are most effective on this platform.
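
The plan, execute, and self-verify pattern described above can be sketched as a simple loop. This is not how Claude Code is implemented; `run_with_checklist` and the stub functions are hypothetical stand-ins for the model's planning, execution, and self-check steps.

```python
def run_with_checklist(plan, execute, verify, max_attempts=2):
    """Work through a checklist, retrying any step whose verification fails."""
    results = {}
    for step in plan:
        for attempt in range(1, max_attempts + 1):
            output = execute(step)
            if verify(step, output):
                results[step] = ("done", attempt)
                break
        else:
            # No attempt passed verification.
            results[step] = ("failed", max_attempts)
    return results

# Stubs standing in for the model's work and its self-check.
attempts_seen = {}

def fake_execute(step):
    attempts_seen[step] = attempts_seen.get(step, 0) + 1
    return f"{step}: attempt {attempts_seen[step]}"

def fake_verify(step, output):
    # Pretend "add tests" only passes on its second attempt.
    return step != "add tests" or output.endswith("attempt 2")

report = run_with_checklist(["write parser", "add tests"], fake_execute, fake_verify)
print(report)
```

The point of the pattern is that each item is checked before the next one starts, so an early mistake is retried rather than carried forward.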

Comprehensive Consistency Rankings

Cross-model consistency rankings (the more stable a tool's scores across models, the higher it ranks):

Zed: Most stable performance across models
RooCode: Very close to Zed
Trae: An impressive newcomer
Cline: Slightly below Trae
Cursor / Windsurf: Essentially tied
GitHub Copilot: Last in the consistency ranking
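
The reviewer does not state how consistency was computed. As a rough check, one can compare the spread of the scores reported in the tables above for the two tools that have a numeric score under all three models. The metric here (standard deviation and range) is an assumption, and the Claude 4.0 values were reported only as approximately 6780.

```python
from statistics import mean, pstdev

# Reported scores in the order: Gemini Pro 2.5, Claude 3.7, Claude 4.0.
# The Claude 4.0 values are approximate ("~6780" in the source table).
scores = {
    "Zed": [6240, 6780, 6780],
    "RooCode": [5980, 7180, 6780],
}

def stability(values):
    """Summarize one tool's scores; lower stdev and range mean more stable."""
    return {
        "mean": mean(values),
        "range": max(values) - min(values),
        "stdev": pstdev(values),
    }

summary = {tool: stability(values) for tool, values in scores.items()}
most_stable = min(summary, key=lambda tool: summary[tool]["stdev"])

for tool, stats in summary.items():
    print(f"{tool}: mean={stats['mean']:.0f}, range={stats['range']}, stdev={stats['stdev']:.0f}")
print("most stable:", most_stable)
```

On these numbers RooCode has the slightly higher average, but Zed's scores vary far less, which matches Zed's first place in the consistency ranking.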

Subjective User Experience Rankings

The reviewer provided personal subjective rankings based on daily use:

First Place: Claude Code

It has become the reviewer's primary daily work tool. Advantages include:

Extremely efficient CLI workflow
Amazing performance once you learn proper prompting
Max plan provides nearly unlimited usage
Runs on WSL, Linux, and macOS

Second Place: Augment Code

While it performed poorly in the "creating new code from scratch" evaluation, its context engine is arguably the strongest on the market:

After loading an open-source project, it can precisely locate code that needs modification within minutes
Irreplaceable capability for working with existing codebases
Frequent agent updates, though stability needs improvement

Augment Code's core technical advantage lies in its proprietary code indexing and context retrieval system. Where many AI coding assistants rely on simple file reading or embedding-based RAG (Retrieval-Augmented Generation), Augment Code is described as having a semantic indexing engine that understands code structure. It can parse a project's dependency graph, function call chains, type system, and module boundaries, which lets it locate the relevant code precisely when users request modifications. This capability matters most in large codebases (hundreds of thousands of lines or more), where simple full-text search or vector similarity matching often introduces significant noise.
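
To illustrate what structure-aware indexing means, the sketch below builds a tiny call graph with Python's standard `ast` module and uses it to find which functions would be affected by changing another. It is a toy illustration of the general idea only; Augment Code's engine is proprietary, and nothing here reflects its actual implementation.

```python
import ast
from collections import defaultdict

def build_call_graph(source: str):
    """Map each top-level function to the plain function names it calls."""
    tree = ast.parse(source)
    graph = defaultdict(set)
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            for inner in ast.walk(node):
                # Only direct calls like foo(...); method calls are skipped.
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    graph[node.name].add(inner.func.id)
    return dict(graph)

def callers_of(graph, target):
    """Return the functions that call `target`, sorted by name."""
    return sorted(fn for fn, callees in graph.items() if target in callees)

sample = '''
def load(path):
    return open(path).read()

def parse(path):
    return load(path).split()

def main():
    print(parse("data.txt"))
'''

graph = build_call_graph(sample)
print(callers_of(graph, "load"))
```

A text search for "load" would also match comments, strings, and unrelated identifiers; walking the syntax tree returns only real call sites, which is the kind of noise reduction the paragraph above describes.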

Third Place: RooCode

Its core advantage is extreme configurability:

Supports extensive custom modes
Boomerang mode significantly improves usability
New features, such as reading multiple files at once, are added continuously
Can be deeply customized to match personal workflows

Conclusions and Insights

This evaluation reveals several important trends:

There is no absolute king: Rankings shift dramatically across models, showing that tool-model compatibility is crucial
GitHub Copilot has fallen behind: The former industry pioneer never placed higher than fourth and finished last under Claude 4.0
Open-source tools are rising: Open-source/highly configurable tools like VOID Editor and RooCode delivered impressive results
Use case determines choice: Creating new code and maintaining existing code are completely different scenarios requiring different tools
Prompt engineering still matters: Techniques like Ultra Think can significantly change tool performance

The recommendation for developers: don't rely on a single ranking. Instead, choose the best tool combination based on your primary use case (new projects vs. code maintenance), preferred underlying model, and workflow habits.
