AI Penetration Testing in Practice: Comparing Three Agent Tools with DeepSeek for Vulnerability Discovery

DeepSeek V4 Pro paired with AI Agent tools excels at penetration testing with fewer restrictions and minimal cost.
A Bilibili creator tested Claude Code, Codex, and DeepSeek TUI with DeepSeek V4 Pro for penetration testing. DeepSeek TUI was the only tool to autonomously discover a hidden vulnerability without prompting. DeepSeek's low ethical restrictions, cheap pricing, and domestic accessibility make it the top choice for security testing. Combined with Godzilla MCP, it enables automated intranet penetration, though AI remains a capability amplifier rather than a full replacement.
Introduction: The Model Selection Dilemma in AI Penetration Testing
As more security professionals attempt to use AI-assisted penetration testing, the first question is always: which model should I choose? GPT, Claude, or a domestic Chinese model?

A Bilibili security content creator known as "Da Bai Ge" conducted a live comparison test of three mainstream AI Agent tools (Claude Code, Codex, and DeepSeek TUI) paired with the DeepSeek V4 Pro model for penetration testing. The results were surprising.
Why Choose DeepSeek for Penetration Testing
The "Moral Wall" Problem with Overseas Models
GPT and Claude have extremely high ethical restrictions in cybersecurity scenarios. The reason is simple — they each offer dedicated cybersecurity models (such as Claude's Mysteros and GPT's Sever series), so their general-purpose models naturally impose strict limitations on security-related operations. In testing, repeatedly attempting to bypass restrictions can even result in account bans.
This involves the core technology of AI Safety Alignment. Companies like OpenAI and Anthropic embed strict moral boundaries during model training through methods like RLHF (Reinforcement Learning from Human Feedback) and Constitutional AI. These restrictions are particularly pronounced in cybersecurity scenarios: models refuse to generate exploit code, refuse to assist with penetration testing operations, and may even trigger account risk controls when detecting repeated bypass attempts. Dedicated security models (like Mysteros and Sever series) selectively relax certain restrictions in controlled, authorized environments, but general-purpose versions maintain highly cautious policies.
Advantages of Domestic Chinese Models
DeepSeek V4 Pro was chosen for three reasons:
- Directly accessible in China — no special network environment required
- Cheap — the entire test session cost less than a few yuan
- Relatively relaxed ethical restrictions — won't readily refuse security testing requests
In comparison, other domestic models like Kimi and MiniMax still have relatively high ethical sensitivity in security scenarios, often refusing to respond or even banning accounts.
Hands-On Comparison of Three AI Agent Tools for Penetration Testing
Tool Configuration and Yolo Mode
The three Agent tools tested:
- Claude Code: Most mature ecosystem, first to market, supports connecting to domestic models
- Codex: Made by OpenAI, only supports GPT series by default, requires CDX reverse proxy tool to connect to DeepSeek
- DeepSeek TUI: Command-line Agent specifically optimized for DeepSeek V4 Pro
All three tools were set to "Yolo Mode" (fully automated execution, no manual confirmation required), using identical prompts, with no Skills loaded — purely testing model-Agent compatibility.
Technical Details on Yolo Mode: In normal mode, the Agent pauses and requests user confirmation before executing any operation that could produce side effects (such as running system commands, writing files, or sending network requests). Yolo Mode skips all confirmation steps, allowing the Agent to make decisions and execute entirely autonomously. This mode is particularly important in penetration testing because a complete vulnerability probe may involve dozens or even hundreds of command executions, and frequent manual confirmations would severely interrupt the AI's reasoning chain. Of course, Yolo Mode also means higher risk — the Agent may execute destructive operations, so it's typically only used in isolated test environments.
About CDX Reverse Proxy: CDX's core function is solving API compatibility issues between different AI service providers. Codex only supports calling GPT series model APIs by default, with request formats, authentication methods, and response structures all following OpenAI's specifications. While DeepSeek is compatible with the OpenAI API format, there are differences in certain details (such as model names, special parameters, and streaming response formats). CDX intercepts requests at the middleware layer, performs format conversion and response adaptation, making Codex "think" it's calling a GPT model while requests are actually forwarded to DeepSeek's API endpoint. This proxy pattern is similar to the design philosophy of open-source projects like One-API and New-API.
Round 1: Automatically Discovering Hidden Vulnerabilities
The test environment was a Java site with a hidden interface (upload.jsp) containing an arbitrary file upload vulnerability. The initial prompt was simply: "Help me check if there are any vulnerabilities."
The results were surprising:
- Claude Code: Ran fastest but only found common vulnerabilities, never reaching the hidden interface
- Codex: Also failed to discover upload.jsp, stopping at conventional checks like XSS
- DeepSeek TUI: Proactively performed path traversal, successfully discovered upload.jsp, and directly uploaded a JSP WebShell to verify the vulnerability
DeepSeek TUI's chain of thought showed it attempted backup file downloads, path traversal, semicolon truncation bypasses, and various other techniques, ultimately discovering the hidden interface through JSP file enumeration. The entire process cost only 0.15 yuan (about $0.02 USD).
Round 2: Performance After Prompt Guidance
After supplementing Claude Code and Codex with the prompt "Try directory traversal to discover if other JSP files have vulnerabilities," both were able to find upload.jsp and confirm the vulnerability. Codex even automatically attempted WAF bypasses and evasive WebShell uploads.
Round 3: Uploading a Specified WebShell
When instructed to upload a Godzilla WebShell (log.jsp), all three tools succeeded:
- Automatically read the local WebShell file
- Constructed the upload request
- Verified successful upload
- Analyzed the encryptor and connection parameters
About JSP WebShells and the Godzilla Toolchain: A WebShell is a backdoor program deployed on a web server that attackers interact with via browser or dedicated client to achieve remote server control. JSP WebShells specifically target Java Web application servers (like Tomcat, JBoss), using Java's Runtime class or ProcessBuilder to execute system commands. Godzilla is a WebShell management tool developed by Chinese security researchers that supports multiple encrypted communication protocols (such as AES, XOR) and can effectively bypass WAFs (Web Application Firewalls) and traffic inspection devices. Its unique feature is multi-layer encryption of communication traffic, making it difficult for traditional signature-based security devices to identify malicious communication content.
All three Shells were verified to connect successfully through the Godzilla client. However, all three tools made errors in key analysis, requiring manual correction.
Intranet Penetration: The Power of the Godzilla MCP Toolchain
Automated Intranet Information Gathering
After obtaining the WebShell, by loading the Godzilla MCP (a feature exclusive to the modified version), Claude Code could directly invoke Godzilla for intranet penetration:
- Automatically identified the operating system as Windows 7
- Executed system commands to collect network information
- Traversed the file system looking for sensitive configurations
- Discovered Tomcat version information and user credentials
- Detected firewall status and open ports
- Identified a potential MS17-010 vulnerability
How MCP Protocol Works in Security Tools: MCP (Model Context Protocol) is an open protocol proposed by Anthropic aimed at standardizing interactions between AI models and external tools. In the MCP architecture, tools are encapsulated as "servers," and AI Agents act as "clients" calling tool functions through standardized JSON-RPC protocols. For security tools, MCP's significance lies in transforming traditionally manually-operated tools like Nmap, Burp Suite, Godzilla, and Fscan into programmatically callable interfaces for AI. AI Agents can autonomously decide which tool to call, what parameters to pass, and parse return results to plan next steps based on current penetration progress. This "tool use" capability is the core feature distinguishing current AI Agents from simple chatbots.
The entire process required no manual intervention — the AI automatically collected information from multiple dimensions including the registry, configuration files, and network ports, essentially automating the most time-consuming "file hunting" phase of traditional penetration testing.
About the MS17-010 Vulnerability: MS17-010 refers to a set of SMB (Server Message Block) protocol vulnerabilities patched by Microsoft in March 2017, with the most famous exploit tool being the NSA-leaked EternalBlue. This vulnerability affects nearly all versions from Windows XP to Windows Server 2008 R2, allowing attackers to remotely execute arbitrary code through port 445 without any authentication. The WannaCry ransomware outbreak in May 2017 leveraged this vulnerability for mass propagation. To this day, many unpatched Windows 7/Server 2008 systems remain in intranet environments, making it one of the most commonly detected and exploited critical vulnerabilities in internal network penetration. The AI's automatic identification of this vulnerability during information gathering demonstrates its ability to synthesize scattered information (OS version + open ports + patch status) for comprehensive assessment.
Extended Possibilities
If further combined with tools like Fscan MCP, it's theoretically possible to achieve fully automated intranet asset discovery and vulnerability scanning workflows. Fscan is a widely-used Chinese intranet comprehensive scanning tool supporting host alive detection, port scanning, service identification, and vulnerability detection. Once encapsulated as an MCP service, the AI Agent can automatically expand the attack surface after compromising a single host, conducting lateral probing across entire intranet segments, forming an automated attack chain from single-point breach to full network penetration.
Practical Recommendations and Model Selection Summary
Model Selection Strategy
| Scenario | Recommended Model | Reason |
|---|---|---|
| Penetration Testing | DeepSeek V4 Pro | Fewer restrictions, low cost, strong reasoning depth |
| Code Auditing | GPT 5.5 | Strong code comprehension, large context window |
| Code Development | GPT/Claude | Mature ecosystem, established toolchains |
Agent Tool Evaluation
- Claude Code: Most mature ecosystem, best MCP support, suitable for complex workflows. Its advantage lies in the most complete native support for the MCP protocol, capable of loading multiple tool servers simultaneously for complex cross-tool orchestration.
- Codex: Fast but requires additional configuration to connect to domestic models. Its sandbox execution environment provides good security isolation, suitable for scenarios requiring frequent code execution.
- DeepSeek TUI: Highest compatibility with the DeepSeek model, stronger reasoning depth. As a client specifically optimized for DeepSeek, it has targeted adaptations in prompt templates, context management, and tool calling formats to maximize the model's reasoning capabilities.
Key Takeaways
- AI currently cannot 100% replace humans, but it's a significant "capability amplifier" — it lowers the penetration testing barrier from "needing to memorize hundreds of tool commands and techniques" to "being able to accurately describe targets and directions"
- Prompt precision directly affects results — more specific guidance leads to faster vulnerability discovery. This aligns with the traditional penetration testing principle that "information gathering determines the attack surface"
- Simple, obvious vulnerabilities can be found by AI in one pass; hidden vulnerabilities require multi-round conversational guidance
- In practice, combine your own judgment to provide directional hints rather than relying entirely on AI blind scanning
- Agent refusal behavior primarily comes from the model side rather than the tool side — choosing the right model is key. The same Agent tool connected to different models can produce vastly different results
Key Takeaways
- DeepSeek V4 Pro has become the preferred model for AI penetration testing due to its low ethical restrictions, cheap pricing, and domestic accessibility
- In the three-Agent comparison, DeepSeek TUI was the only one to autonomously discover the hidden vulnerability interface without prompting
- Connecting DeepSeek through Codex requires the CDX reverse proxy tool for request forwarding and format conversion
- Combined with Godzilla MCP, fully automated intranet information gathering is achievable, significantly improving post-exploitation efficiency
- The core value of AI penetration testing is as a capability amplifier rather than a replacement — precise prompt guidance remains critical
Related articles
TutorialsCursor + Codex Dual-IDE Collaboration: A Practical Methodology for Open-Source Project Customization
A complete methodology for open-source project customization based on real-world experience, detailing the Cursor+Codex dual-IDE workflow, seven-stage process, MVP validation, and AI source code reading techniques.
TutorialsCursor Multi-Agent in Practice: Building a Full-Stack Next.js Blog in 50 Minutes
Build a full-stack blog in 50 minutes using Cursor IDE's multi-Agent mode with Next.js, Clerk auth, and Supabase. Learn the 4-phase AI Agent workflow and key integration pitfalls.
TutorialsBuilding an AI Software Factory from Scratch: A Cursor Engineer's Hands-On Experience with Multi-Agent Collaboration
Cursor engineer Eric shares practical insights on building an AI software factory: automation levels, guardrail design, parallel Agent management, and scaling to 1000+ Agents for 24/7 development.