Sonar Tests Code Quality Across 53 LLMs: Claude Has the Most Security Vulnerabilities, GPT-5 Code Volume Surges 5x

The Trust Crisis in AI-Generated Code

According to survey data from Pragmatic Engineer, 55% of code is now generated by AI agents, and 75% of developers already use AI programming tools in their daily work. From editors and IDEs such as VS Code, JetBrains, Cursor, and Windsurf to agentic coding platforms like Codex, Claude Code, and Gemini CLI, software development is undergoing a paradigm shift.

But a core question faces everyone: Do you trust AI-generated code? Is it secure? Maintainable? Readable?

Code quality management company Sonar systematically evaluated 53+ large language models using 4,444 Java programming tasks, and the results are thought-provoking — models that shine on benchmarks exposed serious problems under enterprise-grade code quality standards.

Evaluation Framework: Beyond Functional Correctness

The benchmarks that LLM vendors love to showcase — HumanEval, MBPP, SWE-bench — essentially measure functional correctness: whether code can pass test cases. But enterprise development cares about far more than that.

These benchmarks each have their focus: HumanEval is a code generation benchmark released by OpenAI in 2021, containing 164 Python programming problems that measure the probability of a model generating correct code via the pass@k metric; MBPP (Mostly Basic Python Problems) was proposed by Google and includes approximately 1,000 entry-level programming tasks; SWE-bench extracts issues and corresponding pull requests from real GitHub repositories, requiring models to solve actual software engineering problems within a full codebase context — significantly harder than the other two. The common limitation of these benchmarks is that they only verify functional correctness without evaluating engineering quality attributes like security, readability, and maintainability.
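For readers unfamiliar with pass@k, the sketch below shows the unbiased estimator commonly used with HumanEval: generate n samples per problem, count the c samples that pass the tests, and estimate the chance that at least one of k randomly drawn samples passes. The function name and the example numbers are illustrative assumptions, not values from Sonar's evaluation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k as 1 - C(n - c, k) / C(n, k).

    n: samples generated for a problem
    c: samples that passed all test cases
    k: number of samples the user is allowed to draw
    """
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 10 samples, 3 of which passed.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

Note that the metric only asks whether some sample passes the tests; it says nothing about how secure or maintainable the passing sample is.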

What enterprise development truly cares about includes:

Security: Whether vulnerabilities and security risks exist
Maintainability: Technical debt and code complexity
Engineering standards: Architectural soundness and coding discipline
Reliability: Edge case handling and exception management

Sonar's evaluation framework uses SonarQube Enterprise to perform static analysis on the code generated for 4,444 independent Java programming tasks, covering bug detection, security vulnerability scanning, code complexity calculation, and more. Static analysis finds potential issues without executing the code: it detects bugs, security vulnerabilities, and code smells by examining the source code's Abstract Syntax Tree (AST), control flow graph, and data flow. SonarQube is a widely used tool in this field, with thousands of built-in rules covering dozens of programming languages, and it can expose potential risks before code ever runs.
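To make the idea of AST-based static analysis concrete, here is a minimal sketch using Python's standard ast module. It is a toy with two made-up rules (a call to eval and a bare except clause) and is not how SonarQube implements its rules; the evaluated tasks were in Java, and Python is used here only for brevity.

```python
import ast

# A small snippet to analyze. It is parsed, never executed.
SOURCE = '''
def load(path):
    try:
        return eval(open(path).read())
    except:
        pass
'''


def find_issues(source: str) -> list[tuple[int, str]]:
    """Return (line number, message) pairs for two toy rules."""
    issues = []
    for node in ast.walk(ast.parse(source)):
        # Toy rule 1: direct call to the built-in eval().
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "eval"):
            issues.append((node.lineno, "call to eval()"))
        # Toy rule 2: an except clause that catches everything.
        elif isinstance(node, ast.ExceptHandler) and node.type is None:
            issues.append((node.lineno, "bare except clause"))
    return sorted(issues)


for line, message in find_issues(SOURCE):
    print(f"line {line}: {message}")
```

Both findings are reported without running the snippet, which is the property that lets this kind of check run before code is committed.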

Head-to-Head Comparison of Five Major Models

Gemini 3.1 Pro High: The Accuracy King

SWE-bench pass rate: 84.17% (ranked #1)
Code volume: 307,000 lines
Cyclomatic complexity: 234
Bug density: 614 per million lines of code
Security issues: 210 per million lines of code

Gemini showed the most balanced performance with good code conciseness, making it currently the best overall quality choice.

A key metric needs explanation here: cyclomatic complexity, proposed by Thomas McCabe in 1976, quantifies program complexity by counting the number of linearly independent execution paths through the code; each if, for, while, and case branch increases the value. As a common rule of thumb, functions with cyclomatic complexity above 10 are candidates for refactoring, and those above 20 are considered high-risk and hard to test thoroughly. Gemini's reported figure of 234 is presumably a normalized score rather than a raw sum, since a sum of 234 across 4,444 tasks would average well below 1 per task and every function has a complexity of at least 1. Taken at face value, it suggests that complexity is well controlled.
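The counting rule behind cyclomatic complexity can be sketched in a few lines. The version below is a simplified approximation for Python source (one plus the number of decision points); it is an assumption-laden illustration, not SonarQube's exact definition, and it ignores constructs such as comprehensions and match statements.

```python
import ast

# Node types treated as decision points in this simplified count.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra short-circuit paths.
            complexity += len(node.values) - 1
    return complexity


EXAMPLE = '''
def classify(n):
    if n < 0:
        return "negative"
    for d in range(2, n):
        if n % d == 0 and d > 1:
            return "composite"
    return "other"
'''

# Base path 1, plus two ifs, one for loop, and one "and" operator.
print(cyclomatic_complexity(EXAMPLE))  # prints 5
```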

Claude Sonnet 4.6: Highest Security Risk

Security issues: 300 per million lines of code (highest among all models)
Code volume: 627,000 lines (highly redundant)

Claude's performance on the security dimension is concerning, with severe code bloat — generating more than twice the code volume of Gemini.

GPT-5.4 / GPT-5.4 Pro High: Code Volume Explosion

Code volume: 1.2 million lines (staggering code bloat from just 4,444 tasks)
Compared to the GPT-4.0 era's 250,000 lines, this represents nearly a 5x increase

This reveals a counterintuitive trend: newer and larger models actually generate more bloated code.

For reference, bug density (also called defect density) is one of the core metrics for software quality measurement, typically expressed as defects per thousand lines of code (KLOC) or per million lines of code. Commonly cited industry figures put delivered commercial software at roughly 1-25 defects per KLOC, while mission-critical systems such as NASA's target less than 0.1 per KLOC. Gemini's 614 per million lines (i.e., 0.614 per KLOC) falls below that typical commercial range, but the figures are not directly comparable: Sonar's number counts only issues flagged by static analysis in raw auto-generated code before human review, and because these bugs may include security vulnerabilities and logic errors, their impact in production environments could be significantly amplified.
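Because density and code volume interact, it can help to convert the reported rates into rough absolute counts. The arithmetic below uses only the figures quoted in this article and assumes that each density applies to the full code volume reported for the same model; the helper names are made up for illustration.

```python
def per_kloc(per_million_lines: float) -> float:
    """Convert a rate per million lines into a rate per thousand lines."""
    return per_million_lines / 1000.0


def estimated_count(per_million_lines: float, total_lines: int) -> float:
    """Rough absolute count implied by a density and a code volume."""
    return per_million_lines * total_lines / 1_000_000


# Figures quoted in the article.
gemini_bugs = estimated_count(614, 307_000)      # about 188 bugs
gemini_security = estimated_count(210, 307_000)  # about 64 security issues
claude_security = estimated_count(300, 627_000)  # about 188 security issues

print(f"Gemini bug density: {per_kloc(614):.3f} per KLOC")
print(f"Gemini bugs: {gemini_bugs:.0f}, security issues: {gemini_security:.0f}")
print(f"Claude security issues: {claude_security:.0f}")
print(f"Claude vs. Gemini security issues: {claude_security / gemini_security:.1f}x")
```

Under that assumption, Claude's security density is about 1.4 times Gemini's, but the absolute number of security issues is close to 3 times as large, because the code volume is roughly double.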

Four Root Cause Analyses

1. Mixed Quality of Training Data

Training sets contain large amounts of open-source code, which mixes high-quality examples with substantial amounts of non-compliant or even defective code. A significant proportion of publicly available code on GitHub is believed to contain known security vulnerabilities or to deviate from best practices, and during pre-training models cannot tell good code from bad, so they learn from all of it indiscriminately.

2. Built-in Security Defects

Known insecure coding patterns exist in training data, and models learn these vulnerable patterns while learning functional implementations. For example, classic vulnerability patterns from the OWASP Top 10 — such as SQL injection, path traversal, and insecure deserialization — are widespread in open-source code, making it easy for models to reproduce these anti-patterns when generating code.
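As an illustration of the kind of anti-pattern meant here, the sketch below contrasts a SQL query built by string concatenation with a parameterized one, using Python's built-in sqlite3 module and an in-memory database. The table, data, and function names are invented for the example.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with two users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: user input becomes part of the SQL text itself.
    query = "SELECT name, secret FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # Safe: the driver binds the value, so it is never parsed as SQL.
    query = "SELECT name, secret FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


conn = make_db()
payload = "x' OR '1'='1"
print(len(find_user_unsafe(conn, payload)))  # every row leaks
print(len(find_user_safe(conn, payload)))    # no row matches
```

Both functions return the same result for ordinary input such as "alice", which is why the vulnerable version can pass functional tests unnoticed.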

3. Hidden Logic Errors

Subtle logic errors exist in the training pool, causing models to produce code that appears correct but is actually problematic in certain scenarios — these types of bugs are extremely difficult for human reviewers to catch. For example, off-by-one errors, race conditions, and resource leaks look perfectly reasonable on the surface and only trigger under specific inputs or concurrent scenarios.
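A small, invented example shows how such a bug can hide. Both functions below split a list into fixed-size chunks and agree whenever the length is an exact multiple of the chunk size; the first silently drops the final partial chunk otherwise.

```python
def chunks_buggy(items: list, size: int) -> list[list]:
    # Looks reasonable, but integer division ignores the remainder,
    # so a trailing partial chunk is silently dropped.
    return [items[i * size:(i + 1) * size] for i in range(len(items) // size)]


def chunks_fixed(items: list, size: int) -> list[list]:
    # Stepping through start offsets covers the partial chunk as well.
    return [items[i:i + size] for i in range(0, len(items), size)]


even = list(range(6))
odd = list(range(7))

print(chunks_buggy(even, 3) == chunks_fixed(even, 3))  # True: bug is invisible
print(chunks_buggy(odd, 3))   # the element 6 is missing
print(chunks_fixed(odd, 3))   # the element 6 is in a final chunk
```

A test suite that only uses inputs of convenient sizes would pass both versions, which is the sense in which these errors only trigger under specific inputs.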

4. Fundamental Limitations of LLMs

Probabilistic nature: The same prompt generates different code today versus tomorrow
Limited context: Cannot understand enterprise architecture and internal codebases
Lack of explainability: Difficult to diagnose and improve when errors occur

These limitations stem from the fundamental working principles of large language models — they are essentially statistical probability-based next-token predictors, not reasoning systems that truly understand code semantics and execution logic. Even the most advanced models cannot simulate code execution and track state changes in their "minds" the way human developers can.

Trend Insight: Newer Models Bring More Subtle Risks

A noteworthy finding: as models iterate, the total number of security vulnerabilities is decreasing, indicating that vendors' reinforcement learning is indeed fixing known issues. However, bugs and vulnerabilities produced by newer models are becoming more subtle and hidden, representing a different type of risk.

In other words, large models are evolving from "obviously bad code" to "code that looks great but hides subtle traps" — this raises the bar for human review. This evolution is similar to the offense-defense escalation in cybersecurity: when simple attack methods are defended against, attackers shift to more covert Advanced Persistent Threats (APT). In the code quality domain, this means traditional code review approaches may no longer be sufficient, requiring more specialized automated tools to help discover these "advanced" defects.

Sonar's ACDC Solution

Sonar proposes the Agent-Centric Development Cycle (ACDC) framework, consisting of three phases:

Guide Phase: Governing Training Data

Sonar Sweep: Cleaning and governing data sources used for training
Context Augmentation: Injecting complete codebase context into LLMs

The concept of Technical Debt was introduced by Ward Cunningham in 1992, using a financial metaphor to describe the long-term cost accumulated by sacrificing code quality for short-term delivery speed. SonarQube quantifies technical debt as "remediation time" — the estimated effort needed to restore code to a compliant state. In the era of AI-generated code at scale, technical debt may accumulate far faster than during manual coding, because AI tends to generate redundant code and lacks understanding of overall project architecture, making data governance in the Guide phase particularly critical.

Verify Phase: Real-time Code Detection

SonarQube Agentic Analysis: Integrated into Claude Code, Codex, and Gemini CLI via MCP
Completes analysis within 1-5 seconds before commit (compared to 1-5 minutes for CI)
Automatically pushes issues back to the Agent for fixing before submission

MCP (Model Context Protocol) is a protocol standard open-sourced by Anthropic in late 2024, designed to provide AI models with a unified interface for interacting with external tools and data sources. It uses a client-server architecture that allows AI Agents to invoke external capabilities like code analysis, database queries, and file operations through standardized methods. In Sonar's use case, MCP enables SonarQube's analysis capabilities to be directly called by Agent coding tools like Claude Code, OpenAI Codex, and Gemini CLI, achieving a "generate-detect-fix" closed loop without developers manually switching tools or waiting for CI pipelines. This integration approach shifts quality detection from traditional "post-hoc review" to "generation time," dramatically shortening the feedback loop.
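At the wire level, MCP messages are JSON-RPC 2.0, and a client invokes a server-side tool with the tools/call method. The sketch below builds such a request in Python. The tool name analyze_code_snippet and its arguments are hypothetical placeholders chosen for illustration; they are not taken from Sonar's MCP server, whose actual tool names should be looked up in its documentation.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request for MCP's tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool name and arguments, for illustration only.
request = build_tool_call(
    1,
    "analyze_code_snippet",
    {"language": "java", "code": "class A { void f() {} }"},
)

# This serialized string is what the agent would send to the server.
wire = json.dumps(request)
print(wire)
```

The agent never needs to know how the analysis is implemented; it only needs the tool's name and argument schema, which the server advertises to the client.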

Solve Phase: Automatically Fixing Technical Debt

Remediation Agent: Automatically fixes issues in PRs or batch-processes historical technical debt
Built-in verification loop: Re-analyzes and compiles after fixes to ensure no regressions are introduced
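The fix-then-reverify idea can be sketched as a small loop. This is only an illustration under stated assumptions: the analyze and fix callables are hypothetical stand-ins, not the Remediation Agent's API, and the toy rules use substring matching rather than real static analysis or compilation.

```python
from typing import Callable


def remediate(
    code: str,
    analyze: Callable[[str], list[str]],
    fix: Callable[[str, str], str],
    max_rounds: int = 5,
) -> tuple[str, list[str]]:
    """Fix one issue per round; keep a fix only if re-analysis improves."""
    for _ in range(max_rounds):
        issues = analyze(code)
        if not issues:
            break
        candidate = fix(code, issues[0])
        # Verification step: re-analyze and reject fixes that do not
        # reduce the issue count (a crude regression guard).
        if len(analyze(candidate)) < len(issues):
            code = candidate
        else:
            break
    return code, analyze(code)


def toy_analyze(code: str) -> list[str]:
    """Hypothetical analyzer with two substring-based rules."""
    issues = []
    if "print(" in code:
        issues.append("debug-print")
    if "except:" in code:
        issues.append("bare-except")
    return issues


def toy_fix(code: str, issue: str) -> str:
    """Hypothetical fixer that rewrites the offending pattern."""
    if issue == "debug-print":
        return code.replace("print(", "log.info(")
    if issue == "bare-except":
        return code.replace("except:", "except Exception:")
    return code


fixed, remaining = remediate(
    "try:\n    print(x)\nexcept:\n    pass\n", toy_analyze, toy_fix
)
print(fixed)
print(remaining)
```

Stopping when a candidate fix fails verification keeps the loop from accepting changes that trade one issue for another.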

Implications for Developers

Don't blindly trust benchmark scores: An 84% pass rate doesn't mean code is production-ready
Code review cannot be skipped: Hidden bugs in AI-generated code require more specialized tool assistance
Evaluate models holistically: Security, code volume, and complexity are all critical metrics
Establish quality gates: Embed automated detection steps in AI programming workflows
Follow the Sonar Leaderboard (sonar.com/leaderboard): Continuous evaluation data for 53+ models is publicly available
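A quality gate in this sense is simply a set of thresholds that analysis results must meet before code is accepted. The sketch below is a minimal, hypothetical version; the field names and threshold values are assumptions chosen for illustration and are not SonarQube's default gate conditions.

```python
from dataclasses import dataclass


@dataclass
class AnalysisSummary:
    lines_of_code: int
    bugs: int
    security_issues: int
    max_function_complexity: int


def gate_failures(
    summary: AnalysisSummary,
    max_bugs_per_kloc: float = 1.0,   # illustrative threshold
    max_security_issues: int = 0,     # illustrative threshold
    max_complexity: int = 10,         # illustrative threshold
) -> list[str]:
    """Return the list of failed conditions; an empty list means pass."""
    failures = []
    kloc = max(summary.lines_of_code, 1) / 1000
    bugs_per_kloc = summary.bugs / kloc
    if bugs_per_kloc > max_bugs_per_kloc:
        failures.append(f"bug density {bugs_per_kloc:.2f} per KLOC is too high")
    if summary.security_issues > max_security_issues:
        failures.append(f"{summary.security_issues} security issue(s) found")
    if summary.max_function_complexity > max_complexity:
        failures.append(
            f"function complexity {summary.max_function_complexity} is too high"
        )
    return failures


clean = AnalysisSummary(2000, 1, 0, 8)
risky = AnalysisSummary(1000, 3, 1, 14)
print(gate_failures(clean))
print(gate_failures(risky))
```

In an AI-assisted workflow, a non-empty result would block the commit or be handed back to the agent as feedback.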

Key Takeaways

AI code generation is reshaping software development, but there's a massive gap between "it runs" and "it's production-ready." Functional correctness is just the tip of the iceberg for code quality — security, maintainability, and engineering standards are the key factors determining whether code can run reliably in production long-term. Developers need to build new quality awareness: AI is a powerful coding assistant, but it is by no means a silver bullet that exempts you from engineering discipline. In this new era of AI programming, automated quality gates and continuous code health monitoring will become essential infrastructure for every engineering team.