110K PRs Tested: Which of 5 AI Coding Agents Is Most Reliable?

Research Background: AI Coding Agents Have Penetrated Real Development Workflows

From auto-completion to independently submitting PRs, autonomous coding agents are now embedded in the core software development workflow. Tools like GitHub Copilot, Devin, and Claude Code contribute a growing share of code in open-source communities, but two questions remain open: how good are these agents' contributions, and how do they differ from those of human developers?

Autonomous coding agents have evolved through three distinct phases. The first was the "code completion" mode represented by early versions of GitHub Copilot, which provided line-level or function-level completions based on the Codex model. The second was "conversational programming," where developers described requirements in natural language and models generated complete code snippets. The third is the current "autonomous agent" mode, where systems can independently plan tasks, invoke tools (such as terminals, browsers, and code search), execute multi-step operations, and submit complete PRs. Devin is a representative example of this third phase: its core architecture comprises a task planner, a tool-calling module, and a self-reflection mechanism, and it can complete the entire workflow from requirement understanding to code submission in a sandboxed environment. This leap in capability means evaluation frameworks must also move from "code snippet quality" to "contribution quality across the full software engineering lifecycle."

A recent study from MSR 2026 provides the most systematic answer to date. The team led by Rezvan Mihai Popescu collected data on 110,000 open-source PRs covering 5 mainstream coding agents. It compares contributions across agents and humans at a single point in time (the horizontal comparison) and also tracks how the merged code evolves over the long term (the longitudinal analysis). The conclusions challenge many common assumptions.

Research Design: A Rigorous Six-Layer Progressive Framework

The study's experimental design is rigorous, with the framework progressing through six layers:

Data Collection Layer: Crawling PR submission records, comments, reviews, Issues, and file change data from major open-source platforms
Agent Identification Layer: Identifying AI vs. human submissions through multiple dimensions including commit messages, account characteristics, and content features
Horizontal Comparison Layer: Conducting statistical tests comparing various metrics across different agents and human developers
Longitudinal Analysis Layer: Tracking code survival and churn rates at 1 month, 3 months, 6 months, and 1 year
Statistical Validation Layer: Controlling for confounding variables such as project activity level and developer experience
Conclusion Output Layer: Drawing conclusions based on results after strict variable control
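
The Agent Identification Layer can be pictured with a small sketch. The article does not give the study's actual rules, so everything below is an assumption for illustration: the `PullRequest` type, the hint list, and the message pattern are hypothetical. The only outside fact relied on is that GitHub App accounts have logins ending in `[bot]`.

```python
import re
from dataclasses import dataclass

# Hypothetical markers for illustration; the study's real rules are not given.
AGENT_LOGIN_HINTS = ("copilot", "devin", "claude", "codex", "gemini")
AGENT_MESSAGE_PATTERN = re.compile(
    r"(generated (with|by)|co-authored-by:).*(copilot|devin|claude|codex|gemini)",
    re.IGNORECASE,
)


@dataclass
class PullRequest:
    author_login: str
    commit_message: str


def looks_agent_authored(pr: PullRequest) -> bool:
    """Combine an account signal and a commit-message signal."""
    login = pr.author_login.lower()
    # Account characteristic: a bot login that names a known agent.
    if login.endswith("[bot]") and any(hint in login for hint in AGENT_LOGIN_HINTS):
        return True
    # Commit-message characteristic: a generation note or co-author trailer.
    return bool(AGENT_MESSAGE_PATTERN.search(pr.commit_message))
```

A heuristic like this will misclassify some PRs in both directions, which is presumably why the study combines several dimensions rather than relying on a single signal.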

The most commendable aspect is how well confounding variables were controlled. The control group consisted of senior developers with years of open-source contribution experience, not beginners, avoiding the common pitfall of "comparing AI against novices."

5 AI Coding Agents and Evaluation Dimensions

The 5 coding agents compared include: OpenAI Codex/CL, Claude Code, GitHub Copilot, Google Gemini, and Devin. The evaluation dimensions cover six core metrics: PR merge rate, change size, file type preference, developer interaction signals, short-term churn rate (3 months), and long-term survival rate (1 year).

Horizontal Comparison: Significant Differences in Merge Rates and Behavioral Patterns

PR Merge Rate: Copilot Leads, Devin Trails

Pull Request (PR) merge rate is the core proxy metric for measuring code contribution quality in open-source communities. A PR must pass through multiple gates from submission to merge—code review, CI/CD automated testing, and manual maintainer review—so merge rate naturally integrates signals from multiple dimensions including code correctness, style consistency, and requirement alignment. This study mitigated bias from differing merge standards across projects by controlling for project activity level.

Merge rates differ significantly across agents. GitHub Copilot has the highest merge rate, approximately 60%, essentially on par with senior human developers, while Devin's merge rate is below 40%. This relates to their design positioning. Copilot is fundamentally an assistive tool: humans write the main code and it supplements portions, so developers have already filtered and modified Copilot-generated code before the PR is created, hence the higher merge rate. Devin is designed to complete entire tasks autonomously, which results in more complex PRs and lower merge rates.
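
To make "differ significantly" concrete, here is a minimal sketch of a merge-rate comparison using a two-proportion z-test. The article does not say which statistical test the study used, so the choice of test is an assumption, and the function names are hypothetical. The sketch also ignores the confounder controls described earlier.

```python
import math


def merge_rate(merged: int, total: int) -> float:
    """Share of submitted PRs that were merged."""
    return merged / total


def two_proportion_z(merged_a: int, total_a: int, merged_b: int, total_b: int) -> float:
    """z statistic for the difference between two merge rates (pooled variance)."""
    p_a = merged_a / total_a
    p_b = merged_b / total_b
    pooled = (merged_a + merged_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (p_a - p_b) / std_err


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))
```

With made-up counts of 600 merged out of 1,000 for one agent and 400 out of 1,000 for another, `two_proportion_z(600, 1000, 400, 1000)` gives a z statistic of roughly 8.9, far beyond conventional significance thresholds. Real data would also need the project-level controls the study describes.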

Change Size: AI Prefers Small Modifications

The average change size of agent-submitted PRs is notably smaller than human submissions, mostly consisting of small changes of a few dozen lines. Human developer PRs frequently contain feature submissions of hundreds or even thousands of lines.

File Type Preference: AI Excels at "Peripheral" Work

Agents generally tend to modify documentation, configuration files, test cases, and other non-core code files, while human developers more frequently modify core business logic and architecture-related code.

Review Interaction: AI Code Requires More Scrutiny

Agent-submitted PRs receive approximately 30% more comments and review feedback on average than human PRs, indicating that human reviewers are more cautious when examining AI code. One reason is that agent-written code tends to have lower consistency with the project's existing style.

Another noteworthy finding: agent PRs have significantly lower Issue resolution rates than human PRs. Even when these PRs pass review, they often do not fully resolve the corresponding requirement, so humans have to follow up with supplementary modifications.

Longitudinal Tracking: Long-Term Maintainability of AI Code Is Concerning

If horizontal comparison reveals current differences, the longitudinal tracking results are even more alarming.

Short-Term Churn Rate: AI Code Is 40% Higher

Within 3 months after PR merge, the churn rate (the proportion of code modified, refactored, or deleted) of agent-generated code is on average 40% higher, in relative terms, than that of human code. Specifically, human code has a 3-month churn rate of approximately 25%, while agents average about 35% (a gap of 10 percentage points), and some agents exceed 40% in absolute terms.
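
Because a "40% higher" figure and a "40%" churn rate appear side by side, it helps to separate the absolute gap from the relative one. Using the approximate averages quoted above:

```latex
\[
\text{absolute gap} = 35\% - 25\% = 10 \text{ percentage points},
\qquad
\text{relative gap} = \frac{35\% - 25\%}{25\%} = 0.40 = 40\%.
\]
```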

Long-Term Survival Rate: Half of AI Code Doesn't Survive a Year

Code survival rate is a classic metric in software evolution research, typically measured through git blame analysis: by comparing code attribution at different points in time, it calculates the proportion of code from a given commit that remains in subsequent versions. Low survival rates mean code is frequently refactored or deleted, which in engineering practice corresponds to accumulating technical debt. The concept of technical debt was introduced by Ward Cunningham in 1992 and is commonly summarized as the implicit cost incurred by sacrificing code quality for short-term delivery speed.
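
A blame-based survival measurement can be sketched as follows. This is a simplified illustration, not the study's pipeline: it assumes you already have the text output of `git blame --line-porcelain` for a file at a later revision, plus the number of lines the original commit added. The function names are hypothetical.

```python
import re
from collections import Counter

# In --line-porcelain output, each blamed line starts with a header of the form
# "<40-hex sha> <orig-line> <final-line> [<group-size>]".
_HEADER = re.compile(r"^([0-9a-f]{40}) \d+ \d+")


def blame_line_counts(porcelain: str) -> Counter:
    """Count how many current lines are attributed to each commit."""
    counts: Counter = Counter()
    for line in porcelain.splitlines():
        if line.startswith("\t"):
            continue  # the file's own content, prefixed with a tab
        match = _HEADER.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def survival_rate(porcelain: str, commit_sha: str, lines_added: int) -> float:
    """Fraction of a commit's added lines still attributed to it."""
    if lines_added <= 0:
        raise ValueError("lines_added must be positive")
    return blame_line_counts(porcelain)[commit_sha] / lines_added
```

If a commit added 4 lines to a file and the later blame output attributes 2 lines to it, the survival rate for that file is 0.5. A fuller analysis would aggregate across all files a PR touched and decide how to treat moved or reformatted lines, which plain blame attribution can count as churn.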

One year after submission, approximately 70% of code written by senior human developers remains in the project, while agent-generated code has an average survival rate of only around 50%, with the lowest falling below 40%. Once the hidden costs of subsequent maintenance and refactoring are factored into ROI calculations, the actual benefit of AI coding tools may be considerably smaller than it first appears, which has direct financial implications for enterprises formulating AI tool procurement and usage strategies.

The more critical finding: The more central the module, the lower the agent's survival rate. In core business logic code, human one-year survival rates reach 80%, while agents manage only around 30%. Conversely, for documentation, configuration files, and other non-core components, agent survival rates are comparable to humans.

This means: using AI to write code may save a few days initially, but the subsequent time spent on refactoring and bug fixes can be even greater—the total cost may not be worthwhile.

Root Causes: Why Does AI Code "Not Last"?

The research team identified six core reasons:

Limited Context Understanding: While mainstream LLM context windows have expanded from the early 4K tokens to 128K or even 200K tokens, large production-grade codebases often far exceed these limits. The deeper challenge is that code understanding involves not just text comprehension but also structural knowledge like Call Graphs, Data Flow Graphs, and module dependency relationships—information that cannot simply be "stuffed into" a context window. Core modules are often deeply coupled with the entire system's historical evolution, and this "tacit knowledge" cannot be fully conveyed to models through current context mechanisms.
Training Objective Misalignment: Training objectives emphasize "writing correct code for the present" rather than long-term maintainability and extensibility
Hidden Errors Are Hard to Detect: AI code sometimes contains subtle bugs that current tests can't catch, only manifesting after running in production for some time
Lack of Domain Knowledge: Industry-specific business logic and unwritten rules lack sufficient training data for AI
Insufficient Creativity: In scenarios requiring innovative design (such as new architectural approaches), AI falls short of human developers
Communication Gaps: Unlike humans, AI cannot discuss requirement details with product managers and colleagues before coding—it can only rely on written documentation

Practical Implications: How to Use AI Coding Tools Scientifically

Based on this research, development teams can optimize their AI coding tool usage strategies across three dimensions:

Establish Differentiated Code Review Standards

Don't review AI code using exactly the same standards as human code. For AI-written core module code, conduct more rigorous checks focusing on logical correctness, architectural compatibility, and long-term maintainability.

Allocate Tasks Based on Each Tool's Strengths

Assign tasks that AI excels at and where churn has minimal impact—such as documentation updates, writing test cases, and modifying configuration files—to AI. Core business logic and architectural design should remain with human developers for maximum efficiency and reliability. While current technologies like RAG (Retrieval-Augmented Generation) and Code Graph Embedding extend AI's code comprehension capabilities to some degree, they still have significant shortcomings in capturing cross-file implicit dependencies and historical design decisions. Therefore, human leadership on core modules remains the optimal strategy at this stage.
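
One way a team might put this allocation rule into practice is a simple path-based router. This is a sketch under stated assumptions: the suffix list, the directory names, and the labels "agent" and "human" are all hypothetical and would need tuning per repository.

```python
from pathlib import PurePosixPath
from typing import Iterable

# Hypothetical definitions of "peripheral" files; adjust for your repository.
PERIPHERAL_SUFFIXES = {".md", ".rst", ".txt", ".yml", ".yaml", ".toml", ".json", ".ini"}
TEST_DIR_NAMES = {"test", "tests", "__tests__"}


def is_peripheral(path: str) -> bool:
    """True for documentation, configuration, and test files."""
    p = PurePosixPath(path)
    if p.suffix.lower() in PERIPHERAL_SUFFIXES:
        return True
    if any(part in TEST_DIR_NAMES for part in p.parts[:-1]):
        return True
    return p.name.startswith("test_")


def suggested_owner(changed_paths: Iterable[str]) -> str:
    """Suggest 'agent' only when every touched file is peripheral."""
    paths = list(changed_paths)
    if not paths:
        return "human"  # nothing to judge, so default to human ownership
    return "agent" if all(is_peripheral(p) for p in paths) else "human"
```

For example, a task touching only `README.md` and `config/app.yaml` would be suggested for an agent, while one that also touches `src/core/scheduler.py` would stay with a human. The rule is deliberately conservative: a single core file is enough to route the whole task to a person.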

Establish Long-Term Monitoring Mechanisms for AI Code

Don't just check whether there are issues at submission time—continuously track the churn rate and bug rate of AI-generated code. If you discover that maintenance costs actually increase after using AI for certain types of tasks, adjust your strategy promptly to avoid accumulating technical debt for short-term efficiency gains.
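
A monitoring mechanism of this kind can start very small. The sketch below assumes a team already labels each merged change by origin and records how many of its lines were later modified or deleted within a chosen window. The `MergedChange` type and the 25% threshold are hypothetical choices, not values from the study.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass
class MergedChange:
    origin: str         # "agent" or "human"
    lines_added: int
    lines_churned: int  # added lines later modified or deleted in the window


def churn_by_origin(changes: Iterable[MergedChange]) -> dict[str, float]:
    """Line-weighted churn rate per origin label."""
    added: dict[str, int] = defaultdict(int)
    churned: dict[str, int] = defaultdict(int)
    for change in changes:
        added[change.origin] += change.lines_added
        churned[change.origin] += change.lines_churned
    return {origin: churned[origin] / added[origin] for origin in added if added[origin] > 0}


def needs_strategy_review(rates: dict[str, float], max_relative_gap: float = 0.25) -> bool:
    """Flag when agent churn exceeds human churn by more than the threshold."""
    human = rates.get("human", 0.0)
    if "agent" not in rates or human == 0:
        return False
    return (rates["agent"] - human) / human > max_relative_gap
```

Fed with churn figures like those reported in the study (about 35% for agents against 25% for humans), the relative gap is 40%, so the check would flag the task category for review. Tracking this per task type, rather than globally, is what lets a team notice where agents cost more than they save.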

Conclusion

The greatest value of this research lies in not stopping at the surface-level question of "can AI write code," but extending the perspective to the complete software engineering lifecycle. Empirical data from 110,000 PRs tells us: AI is an excellent assistive tool, but long-term quality of core code still requires human oversight. As AI coding tools become increasingly prevalent, rationally understanding their strengths and limitations is the only way to truly maximize their benefits.

Key Takeaways

The study covers 110K PR data points and 5 mainstream coding agents, making it one of the largest empirical studies in this area to date
GitHub Copilot has the highest merge rate (~60%), on par with senior human developers; Devin's merge rate is below 40%
AI-generated code has a 3-month churn rate about 40% higher (in relative terms) than human code, a one-year survival rate of only ~50%, and a survival rate of only around 30% in core modules
Agents excel at modifying documentation, configuration files, and other non-core code; core business logic still requires human developer leadership
Teams should establish differentiated review standards for AI code, allocate tasks based on strengths, and build long-term code quality monitoring mechanisms