Claude Mythos Real-World Test Falls Flat: Only 1 Low-Risk Vulnerability Found in 170K Lines of Code

Curl founder Daniel Stenberg tested Anthropic's supposedly "too dangerous to release" flagship model Mythos, which found only 1 low-risk CVE in 170,000 lines of code—with the other 4 reports being false positives or ordinary bugs. Testing shows Mythos offers no qualitative leap in vulnerability discovery, with other mainstream AI tools achieving similar results. Anthropic's "dangerous weapon" marketing appears to be largely hype.
Background: Anthropic's "Most Dangerous Model" Faces a Real-World Test
Anthropic has long claimed that its unreleased flagship model Mythos can discover and exploit vulnerabilities better than nearly all human experts. The company even stated that releasing the model prematurely could have serious consequences for the economy, public safety, and national security. As a result, access has been granted only to a select few institutions and open-source organizations.
Anthropic uses an internal framework called the Responsible Scaling Policy to assess model risk. This framework categorizes AI capabilities into multiple AI Safety Levels (ASL), ranging from ASL-1 (no significant risk) to ASL-4+ (potentially catastrophic risk). When a model demonstrates capabilities surpassing human experts in areas like cybersecurity or bioweapon design, it triggers higher-level safety controls, including restricted access, increased monitoring, and delayed release. Mythos was classified as high-risk based on its vulnerability discovery capabilities demonstrated during internal red team testing. However, there's a significant gap between internal testing environments and real-world projects—internal tests may have used codebases with known vulnerabilities, while mature projects like Curl, which have undergone years of continuous auditing, have long since had their low-hanging fruit picked.
Yet when this AI model—marketed as a "watershed moment for security"—encountered a real-world large-scale open-source project, the results were eye-opening.
Curl Founder's Test: Results Severely Contradict Anthropic's Claims
Test Subject: An Internet Infrastructure-Level Project
Curl founder Daniel Stenberg recently published a detailed test report on his personal blog. Curl is one of the most widely used network transfer libraries on the internet, with over 20 billion installations, running on virtually every smartphone, tablet, car, TV, gaming console, and server. The project contains approximately 170,000 lines of code, and the importance of its security is self-evident.
Curl is not just a command-line tool—it's a core component of internet infrastructure. Its underlying library, libcurl, is embedded in nearly every operating system, programming language runtime, and IoT device, handling data transfer across dozens of network protocols including HTTP, FTP, and SMTP. Given its ubiquitous deployment, any security vulnerability in Curl could produce cascading effects—similar to the impact of the 2021 Log4Shell vulnerability on the Java ecosystem. Daniel Stenberg has maintained the project since 1998, accumulating extensive security audit experience, with approximately 150 CVEs fixed throughout the project's history. This makes Curl an ideal target for testing AI security scanning capabilities: sufficiently complex, with a comprehensive historical vulnerability database for comparison.
Before Mythos scanned Curl, Daniel launched a poll on social media. The vast majority of respondents, going by Anthropic's marketing claims, expected the model to find at least 10 CVEs.
The Difference Between CVEs and Regular Bugs
An important distinction needs to be clarified here: A bug is a very broad concept, referring to any program behavior that doesn't meet expectations (such as garbled output or incorrect font sizes). A CVE (Common Vulnerabilities and Exposures), on the other hand, specifically refers to vulnerabilities that can be exploited and threaten system confidentiality, integrity, and availability. Simply put, all CVEs are bugs, but not all bugs are CVEs.
CVE is a globally used vulnerability identification system maintained by the MITRE Corporation, in which each confirmed security vulnerability receives a unique CVE identifier (e.g., CVE-2024-XXXXX). Severity is commonly quantified with a CVSS (Common Vulnerability Scoring System) score from 0 to 10. Under CVSS v3.x, 0.1-3.9 is low, 4.0-6.9 is medium, 7.0-8.9 is high, and 9.0-10.0 is critical, while 0.0 means no severity. The base score takes into account the attack vector (network, adjacent, local, or physical), attack complexity, required privileges, user interaction, and the impact on confidentiality, integrity, and availability. The only confirmed CVE found by Mythos was classified as low severity, meaning that any CVSS score assigned to it would likely fall in the low band, with a very low probability of actual exploitation.
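As a rough illustration, the CVSS v3.x qualitative bands can be written as a small lookup. This is only a sketch: the function name `cvss_severity` is hypothetical, the thresholds are the published v3.x rating scale (where a score of exactly 0.0 is rated "None"), and nothing here reflects how the Curl team itself rates issues.

```python
def cvss_severity(score: float) -> str:
    """Map a CVSS v3.x base score to its qualitative severity rating.

    Hypothetical helper for illustration only.
    """
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS scores range from 0.0 to 10.0, got {score}")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"       # 0.1 - 3.9
    if score < 7.0:
        return "Medium"    # 4.0 - 6.9
    if score < 9.0:
        return "High"      # 7.0 - 8.9
    return "Critical"      # 9.0 - 10.0


if __name__ == "__main__":
    for example in (2.5, 5.3, 7.8, 9.8):
        print(example, cvss_severity(example))
```

A score anywhere between 0.1 and 3.9 lands in the same "Low" band, which is why a low-severity finding says little about its exact score.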
Disappointing Scan Results
Mythos's final report listed only 5 suspected security vulnerabilities. For a project with 170,000 lines of code, 5 is already a small number. More embarrassing still, after the Curl team's in-depth investigation:
- 3 were false positives: They posed no security threat whatsoever
- 1 was merely a minor bug: Not qualifying as an "exploitable vulnerability"
- The only confirmed vulnerability: Was classified as an extremely low-risk CVE that wouldn't cause serious consequences
In other words, across this massive open-source library, Mythos actually found only one low-risk vulnerability, which has already been scheduled for a fix in the next release. This severely contradicts Anthropic's claims of the model "posing a significant threat to security."
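The arithmetic behind this breakdown can be sketched in a few lines. The counts come from the triage results above; the label names and the `triage_summary` helper are hypothetical, chosen only for this illustration.

```python
from collections import Counter

# Triage outcomes for the 5 security reports, as described in the article.
# The label strings are illustrative, not taken from the Curl team's report.
outcomes = ["false_positive"] * 3 + ["ordinary_bug"] + ["confirmed_cve"]


def triage_summary(results):
    """Summarize how many reported issues survived human triage."""
    counts = Counter(results)
    total = len(results)
    confirmed = counts["confirmed_cve"]
    return {
        "total_reports": total,
        # Share of reports that were real vulnerabilities (1 of 5).
        "precision": confirmed / total,
        # Share of reports that were outright false positives (3 of 5).
        "false_positive_share": counts["false_positive"] / total,
        # Share of reports that were not security vulnerabilities (4 of 5).
        "not_a_vulnerability_share": (total - confirmed) / total,
    }


summary = triage_summary(outcomes)

if __name__ == "__main__":
    for name, value in summary.items():
        print(f"{name}: {value}")
```

On these numbers, one report in five was a genuine vulnerability, so every report still needed a human to check it.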
Deeper Analysis: Mythos's Vulnerability Discovery Capability Is Not a Unique Advantage
The Ability to Find Bugs Isn't Unique
Daniel also noted that beyond CVEs, Mythos did find approximately 20 regular bugs, with clear descriptions, solid explanations, and almost no false positives. However, the awkward truth is that he explicitly stated these bugs could have been found using any other AI tool.
The Curl team had previously used other AI tools for code security scanning, and those tools found even more bugs. Daniel fairly pointed out, however, that as historical bugs are continuously fixed, finding new ones becomes increasingly difficult, so a pure quantity comparison isn't entirely fair.
Technical Principles and Limitations of AI Code Auditing
The core mechanism by which modern large language models perform code security scanning is pattern recognition: through training on massive amounts of code and vulnerability data, models learn to identify common vulnerability patterns such as buffer overflows, SQL injection, and race conditions. Compared to traditional static analysis tools (like Coverity and CodeQL), AI's advantage lies in understanding the semantic context of code, which reduces false positives caused by overly rigid rules. However, this approach also has clear limitations: a model that only reads source code does not execute it, cannot verify whether a vulnerability can be triggered in practice (i.e., it lacks dynamic verification), and struggles to follow complex cross-module data flows. This may explain why most of Mythos's security-level reports turned out to be false positives: it may have identified code patterns that look dangerous without being able to confirm that they are actually reachable at runtime.
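The gap between "looks dangerous" and "is reachable" can be shown with a deliberately naive toy scanner. This is not how Mythos or any LLM works internally, and it is not taken from Stenberg's report. The C snippet, the regexes, and the helper names are all made up for illustration.

```python
import re

# Calls that a purely pattern-based scanner might treat as suspicious.
RISKY_CALL = re.compile(r"\b(strcpy|sprintf|gets)\s*\(")
# Code between "#if 0" and "#endif" is never compiled, so it is unreachable.
DISABLED_BLOCK = re.compile(r"#if 0\b.*?#endif", re.DOTALL)

# Made-up C source: one live risky call, one inside a disabled block.
SAMPLE_C = """\
void copy_name(char *dst, const char *src) {
    strcpy(dst, src);
}
#if 0
void legacy(char *buf) {
    gets(buf);
}
#endif
"""


def flag_risky_calls(source: str) -> list[str]:
    """Return the name of every risky-looking call, reachable or not."""
    return [match.group(1) for match in RISKY_CALL.finditer(source)]


def strip_disabled_code(source: str) -> str:
    """Crude stand-in for a reachability check: drop '#if 0' blocks."""
    return DISABLED_BLOCK.sub("", source)


pattern_only = flag_risky_calls(SAMPLE_C)
after_filter = flag_risky_calls(strip_disabled_code(SAMPLE_C))

if __name__ == "__main__":
    print("pattern match only:", pattern_only)
    print("after reachability filter:", after_filter)
```

The pattern-only pass flags both calls, while the filtered pass keeps only the one that is actually compiled. Real reachability analysis is far harder than stripping `#if 0` blocks, which is why findings of this kind still need human or dynamic verification.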
The Real Value Scenario
Stenberg's post made an important additional point: using AI to discover vulnerabilities and errors in source code is indeed more effective than any traditional tool before it. But this is a shared capability of modern large language models, not a unique advantage of Mythos.
The truly valuable use case lies in code repositories that have never been scanned by AI. Such projects will naturally expose numerous defects, vulnerabilities, and potential security risks that any AI could discover. Put differently, if your project has never undergone AI code auditing, you're essentially leaving a wealth of exploitable entry points for attackers.
Key Conclusions: Marketing Hype Exceeds Actual Breakthrough
Based on this real-world project test, several key conclusions can be drawn:
First, Mythos is not as dangerous as Anthropic claims. Judging from the Curl project alone, the hype surrounding this model is primarily marketing spin. Compared to previous AI models, no qualitative leap in vulnerability discovery capability was observed.
Second, the false positive rate is too high. Of the 5 reported vulnerabilities, 4 were either false positives or non-threatening, meaning human intervention is still required for verification even when results are produced. This represents a significant efficiency loss for practical security workflows.
Third, AI code security scanning itself is effective. This is a shared capability of modern large language models, not exclusive to Mythos. Any mainstream AI model can play a significant role in code auditing.
A Rational View of AI Security Capabilities and Their Boundaries
Mythos is undoubtedly an excellent model, but Anthropic's marketing strategy has clearly been overly aggressive. Packaging the model as a "dangerous weapon too risky to release" can be read as a responsible safety posture, but it inevitably invites suspicion of being a scarcity-manufacturing marketing tactic.
Based on this real-world test, AI capabilities in code security are indeed improving, but they remain a considerable distance from being a "disruptive threat." For developers, the real focus shouldn't be on how dangerous any particular model is, but whether AI tools have been incorporated into their security audit workflows—because if you don't use them, attackers certainly will.
AI application in cybersecurity is essentially an arms race between offense and defense. Defenders use AI for code auditing, anomaly detection, and threat intelligence analysis; attackers can equally leverage AI for automated vulnerability mining, phishing content generation, and security detection bypass. DARPA's 2024 AIxCC (AI Cyber Challenge) has already demonstrated that multiple AI systems can autonomously discover and fix vulnerabilities in real open-source software. This means the proliferation of AI security scanning capabilities is an irreversible trend—any strategy attempting to maintain security by restricting a single model is destined to fail. True security improvement comes from defenders comprehensively adopting AI tools, not from hoping attackers can't access similar capabilities.