AI Penetration Testing in Practice: Comparing Three Agent Tools with DeepSeek for Vulnerability Discovery

Introduction: The Model Selection Dilemma in AI Penetration Testing

As more security professionals attempt to use AI-assisted penetration testing, the first question is always: which model should I choose? GPT, Claude, or a domestic Chinese model?

bilibili source: 【大白哥AI与安全】手把手教你AI渗透,挖漏洞

A Bilibili security content creator known as "Da Bai Ge" conducted a live comparison test of three mainstream AI Agent tools (Claude Code, Codex, and DeepSeek TUI) paired with the DeepSeek V4 Pro model for penetration testing. The results were surprising.

Why Choose DeepSeek for Penetration Testing

The "Moral Wall" Problem with Overseas Models

GPT and Claude have extremely high ethical restrictions in cybersecurity scenarios. The reason is simple — they each offer dedicated cybersecurity models (such as Claude's Mysteros and GPT's Sever series), so their general-purpose models naturally impose strict limitations on security-related operations. In testing, repeatedly attempting to bypass restrictions can even result in account bans.

This involves the core technology of AI Safety Alignment. Companies like OpenAI and Anthropic embed strict moral boundaries during model training through methods like RLHF (Reinforcement Learning from Human Feedback) and Constitutional AI. These restrictions are particularly pronounced in cybersecurity scenarios: models refuse to generate exploit code, refuse to assist with penetration testing operations, and may even trigger account risk controls when detecting repeated bypass attempts. Dedicated security models (like Mysteros and Sever series) selectively relax certain restrictions in controlled, authorized environments, but general-purpose versions maintain highly cautious policies.

Advantages of Domestic Chinese Models

DeepSeek V4 Pro was chosen for three reasons:

Directly accessible in China — no special network environment required
Cheap — the entire test session cost less than a few yuan
Relatively relaxed ethical restrictions — won't readily refuse security testing requests

In comparison, other domestic models like Kimi and MiniMax still have relatively high ethical sensitivity in security scenarios, often refusing to respond or even banning accounts.

Hands-On Comparison of Three AI Agent Tools for Penetration Testing

Tool Configuration and Yolo Mode

The three Agent tools tested:

Claude Code: Most mature ecosystem, first to market, supports connecting to domestic models
Codex: Made by OpenAI, only supports GPT series by default, requires CDX reverse proxy tool to connect to DeepSeek
DeepSeek TUI: Command-line Agent specifically optimized for DeepSeek V4 Pro

All three tools were set to "Yolo Mode" (fully automated execution, no manual confirmation required), using identical prompts, with no Skills loaded — purely testing model-Agent compatibility.

Technical Details on Yolo Mode: In normal mode, the Agent pauses and requests user confirmation before executing any operation that could produce side effects (such as running system commands, writing files, or sending network requests). Yolo Mode skips all confirmation steps, allowing the Agent to make decisions and execute entirely autonomously. This mode is particularly important in penetration testing because a complete vulnerability probe may involve dozens or even hundreds of command executions, and frequent manual confirmations would severely interrupt the AI's reasoning chain. Of course, Yolo Mode also means higher risk — the Agent may execute destructive operations, so it's typically only used in isolated test environments.

About CDX Reverse Proxy: CDX's core function is solving API compatibility issues between different AI service providers. Codex only supports calling GPT series model APIs by default, with request formats, authentication methods, and response structures all following OpenAI's specifications. While DeepSeek is compatible with the OpenAI API format, there are differences in certain details (such as model names, special parameters, and streaming response formats). CDX intercepts requests at the middleware layer, performs format conversion and response adaptation, making Codex "think" it's calling a GPT model while requests are actually forwarded to DeepSeek's API endpoint. This proxy pattern is similar to the design philosophy of open-source projects like One-API and New-API.

Round 1: Automatically Discovering Hidden Vulnerabilities

The test environment was a Java site with a hidden interface (upload.jsp) containing an arbitrary file upload vulnerability. The initial prompt was simply: "Help me check if there are any vulnerabilities."

The results were surprising:

Claude Code: Ran fastest but only found common vulnerabilities, never reaching the hidden interface
Codex: Also failed to discover upload.jsp, stopping at conventional checks like XSS
DeepSeek TUI: Proactively performed path traversal, successfully discovered upload.jsp, and directly uploaded a JSP WebShell to verify the vulnerability

DeepSeek TUI's chain of thought showed it attempted backup file downloads, path traversal, semicolon truncation bypasses, and various other techniques, ultimately discovering the hidden interface through JSP file enumeration. The entire process cost only 0.15 yuan (about $0.02 USD).

Round 2: Performance After Prompt Guidance

After supplementing Claude Code and Codex with the prompt "Try directory traversal to discover if other JSP files have vulnerabilities," both were able to find upload.jsp and confirm the vulnerability. Codex even automatically attempted WAF bypasses and evasive WebShell uploads.

Round 3: Uploading a Specified WebShell

When instructed to upload a Godzilla WebShell (log.jsp), all three tools succeeded:

Automatically read the local WebShell file
Constructed the upload request
Verified successful upload
Analyzed the encryptor and connection parameters

About JSP WebShells and the Godzilla Toolchain: A WebShell is a backdoor program deployed on a web server that attackers interact with via browser or dedicated client to achieve remote server control. JSP WebShells specifically target Java Web application servers (like Tomcat, JBoss), using Java's Runtime class or ProcessBuilder to execute system commands. Godzilla is a WebShell management tool developed by Chinese security researchers that supports multiple encrypted communication protocols (such as AES, XOR) and can effectively bypass WAFs (Web Application Firewalls) and traffic inspection devices. Its unique feature is multi-layer encryption of communication traffic, making it difficult for traditional signature-based security devices to identify malicious communication content.

All three Shells were verified to connect successfully through the Godzilla client. However, all three tools made errors in key analysis, requiring manual correction.

Intranet Penetration: The Power of the Godzilla MCP Toolchain

Automated Intranet Information Gathering

After obtaining the WebShell, by loading the Godzilla MCP (a feature exclusive to the modified version), Claude Code could directly invoke Godzilla for intranet penetration:

Automatically identified the operating system as Windows 7
Executed system commands to collect network information
Traversed the file system looking for sensitive configurations
Discovered Tomcat version information and user credentials
Detected firewall status and open ports
Identified a potential MS17-010 vulnerability

How MCP Protocol Works in Security Tools: MCP (Model Context Protocol) is an open protocol proposed by Anthropic aimed at standardizing interactions between AI models and external tools. In the MCP architecture, tools are encapsulated as "servers," and AI Agents act as "clients" calling tool functions through standardized JSON-RPC protocols. For security tools, MCP's significance lies in transforming traditionally manually-operated tools like Nmap, Burp Suite, Godzilla, and Fscan into programmatically callable interfaces for AI. AI Agents can autonomously decide which tool to call, what parameters to pass, and parse return results to plan next steps based on current penetration progress. This "tool use" capability is the core feature distinguishing current AI Agents from simple chatbots.

The entire process required no manual intervention — the AI automatically collected information from multiple dimensions including the registry, configuration files, and network ports, essentially automating the most time-consuming "file hunting" phase of traditional penetration testing.

About the MS17-010 Vulnerability: MS17-010 refers to a set of SMB (Server Message Block) protocol vulnerabilities patched by Microsoft in March 2017, with the most famous exploit tool being the NSA-leaked EternalBlue. This vulnerability affects nearly all versions from Windows XP to Windows Server 2008 R2, allowing attackers to remotely execute arbitrary code through port 445 without any authentication. The WannaCry ransomware outbreak in May 2017 leveraged this vulnerability for mass propagation. To this day, many unpatched Windows 7/Server 2008 systems remain in intranet environments, making it one of the most commonly detected and exploited critical vulnerabilities in internal network penetration. The AI's automatic identification of this vulnerability during information gathering demonstrates its ability to synthesize scattered information (OS version + open ports + patch status) for comprehensive assessment.

Extended Possibilities

If further combined with tools like Fscan MCP, it's theoretically possible to achieve fully automated intranet asset discovery and vulnerability scanning workflows. Fscan is a widely-used Chinese intranet comprehensive scanning tool supporting host alive detection, port scanning, service identification, and vulnerability detection. Once encapsulated as an MCP service, the AI Agent can automatically expand the attack surface after compromising a single host, conducting lateral probing across entire intranet segments, forming an automated attack chain from single-point breach to full network penetration.

Practical Recommendations and Model Selection Summary

Model Selection Strategy

Scenario	Recommended Model	Reason
Penetration Testing	DeepSeek V4 Pro	Fewer restrictions, low cost, strong reasoning depth
Code Auditing	GPT 5.5	Strong code comprehension, large context window
Code Development	GPT/Claude	Mature ecosystem, established toolchains

Agent Tool Evaluation

Claude Code: Most mature ecosystem, best MCP support, suitable for complex workflows. Its advantage lies in the most complete native support for the MCP protocol, capable of loading multiple tool servers simultaneously for complex cross-tool orchestration.
Codex: Fast but requires additional configuration to connect to domestic models. Its sandbox execution environment provides good security isolation, suitable for scenarios requiring frequent code execution.
DeepSeek TUI: Highest compatibility with the DeepSeek model, stronger reasoning depth. As a client specifically optimized for DeepSeek, it has targeted adaptations in prompt templates, context management, and tool calling formats to maximize the model's reasoning capabilities.

Key Takeaways

AI currently cannot 100% replace humans, but it's a significant "capability amplifier" — it lowers the penetration testing barrier from "needing to memorize hundreds of tool commands and techniques" to "being able to accurately describe targets and directions"
Prompt precision directly affects results — more specific guidance leads to faster vulnerability discovery. This aligns with the traditional penetration testing principle that "information gathering determines the attack surface"
Simple, obvious vulnerabilities can be found by AI in one pass; hidden vulnerabilities require multi-round conversational guidance
In practice, combine your own judgment to provide directional hints rather than relying entirely on AI blind scanning
Agent refusal behavior primarily comes from the model side rather than the tool side — choosing the right model is key. The same Agent tool connected to different models can produce vastly different results

Key Takeaways

DeepSeek V4 Pro has become the preferred model for AI penetration testing due to its low ethical restrictions, cheap pricing, and domestic accessibility
In the three-Agent comparison, DeepSeek TUI was the only one to autonomously discover the hidden vulnerability interface without prompting
Connecting DeepSeek through Codex requires the CDX reverse proxy tool for request forwarding and format conversion
Combined with Godzilla MCP, fully automated intranet information gathering is achievable, significantly improving post-exploitation efficiency
The core value of AI penetration testing is as a capability amplifier rather than a replacement — precise prompt guidance remains critical

AI Penetration Testing in Practice: Comparing Three Agent Tools with DeepSeek for Vulnerability Discovery

Introduction: The Model Selection Dilemma in AI Penetration Testing

Why Choose DeepSeek for Penetration Testing

The "Moral Wall" Problem with Overseas Models

Advantages of Domestic Chinese Models

Hands-On Comparison of Three AI Agent Tools for Penetration Testing

Tool Configuration and Yolo Mode

Round 1: Automatically Discovering Hidden Vulnerabilities

Round 2: Performance After Prompt Guidance

Round 3: Uploading a Specified WebShell

Intranet Penetration: The Power of the Godzilla MCP Toolchain

Automated Intranet Information Gathering

Extended Possibilities

Practical Recommendations and Model Selection Summary

Model Selection Strategy

Agent Tool Evaluation

Key Takeaways

Key Takeaways

Related articles

Cursor + Codex Dual-IDE Collaboration: A Practical Methodology for Open-Source Project Customization

Cursor Multi-Agent in Practice: Building a Full-Stack Next.js Blog in 50 Minutes

Building an AI Software Factory from Scratch: A Cursor Engineer's Hands-On Experience with Multi-Agent Collaboration