AI Coding Real-World Test: GPT-5, Gemini 2.5 Pro, Kimi K2, and Grok 4 All Fail at Web Scraping

Test Background

In real-world tests of AI coding capability, large language models often fall well short of their marketing claims. In this third episode of the AI Coding Arena series, the tester used the same IDE (the international version of Cursor) to compare GPT-5, Gemini 2.5 Pro, Kimi K2-0905, and Grok 4 on a static web scraping task.

About the Testing Tool: Cursor is an AI-native IDE built as a heavily modified fork of VS Code. Its core feature is the direct integration of large language models into the code editing and terminal execution environment, with support for switching between models. In Cursor's Agent mode, a model can not only generate code but also run terminal commands, read and write files, and install dependencies on its own, forming a complete "perceive-plan-execute" loop. This setup brings the test closer to real development and exposes weaknesses in engineering execution more clearly, because mistakes are executed immediately and have real consequences instead of staying on the page as generated code.

About the Test Task: Static web scraping refers to data extraction from web pages where the server directly returns complete HTML content. Compared to dynamic scraping (which requires JavaScript rendering), it's theoretically less difficult. The core workflow includes: sending HTTP requests to fetch HTML documents, parsing the DOM tree using CSS selectors or XPath, extracting target content, and recursively following links. Common toolchains include Python's requests+BeautifulSoup or the Scrapy framework. This type of task serves as an excellent benchmark for AI coding assistants because it tests both code generation ability and the capacity to understand target website structure and debug effectively.
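The workflow described above (fetch, parse, extract, follow links) can be sketched in a few dozen lines. This is a minimal illustration, not the scraper any of the tested models produced. It uses only the Python standard library instead of requests+BeautifulSoup so that it runs without extra installs, and it follows links with an iterative queue rather than literal recursion. The names LinkAndTextParser, fetch, crawl, and the max_pages default are assumptions made for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkAndTextParser(HTMLParser):
    """Collects href values from <a> tags and visible text, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def fetch(url):
    """Default fetcher: plain HTTP GET, decoded as UTF-8 (an assumption)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl of same-host pages; returns {url: page_text}."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        parser = LinkAndTextParser()
        parser.feed(fetch_page(url))
        pages[url] = " ".join(parser.text_parts)
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduplicating.
            target = urldefrag(urljoin(url, href)).url
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages
```

Passing fetch_page as a parameter lets the crawl logic be exercised against an in-memory dictionary of pages before it ever touches a live site, which is a cheap way to catch link-resolution mistakes early.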

In previous tests, Claude successfully scraped 126 pages, setting the current benchmark. Could these four models surpass or at least match that result? The outcome was shocking: all four failed.

All four models begin executing the task simultaneously

Execution Process Comparison of the Four Models

Notable Differences in Startup Speed

After receiving the same scraping requirements, the four models exhibited distinctly different execution strategies:

Grok 4: Fastest to start; it ran pip install immediately without first writing a requirements file
Kimi K2: Impressively fast, quickly showing a Complete status
GPT-5: Noticeably slow to start, running virtual environment configuration multiple times
Gemini 2.5 Pro: Adopted a conservative strategy, insisting on waiting for the virtual environment to be fully installed before writing any code

Engineering Best Practices Context: Python virtual environments (venv/conda) are the standard way to isolate project dependencies and prevent package version conflicts between projects. A common workflow is to declare dependencies in requirements.txt and then install them in one batch inside a virtual environment. Grok 4 skipped the requirements file and ran pip install directly, which makes the dependency set hard to reproduce. Gemini 2.5 Pro waited for the environment installation to finish before writing any code, which is not wrong in itself but wastes time that could have gone to writing the scraper in parallel. Both choices suggest that these models' grasp of software engineering conventions remains superficial.
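As a rough illustration of that workflow, the sketch below declares dependencies in requirements.txt, creates a virtual environment with the standard venv module, and installs from the file. The helper names (venv_python, install_command, bootstrap), the .venv directory name, and the two unpinned packages are assumptions chosen for the example; the packages are simply the requests+BeautifulSoup toolchain mentioned earlier.

```python
import subprocess
import sys
import venv
from pathlib import Path


def venv_python(env_dir: Path) -> Path:
    """Path of the interpreter inside the virtual environment."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def install_command(env_dir: Path, requirements: Path) -> list[str]:
    """pip invocation as an argument list, so no shell quoting is involved."""
    return [str(venv_python(env_dir)), "-m", "pip", "install", "-r", str(requirements)]


def bootstrap(project_dir: Path) -> None:
    """Write requirements.txt if missing, create .venv, then install from the file."""
    requirements = project_dir / "requirements.txt"
    if not requirements.exists():
        # Hypothetical dependency list; pin versions for a reproducible build.
        requirements.write_text("requests\nbeautifulsoup4\n", encoding="utf-8")
    env_dir = project_dir / ".venv"
    venv.create(env_dir, with_pip=True)
    subprocess.run(install_command(env_dir, requirements), check=True)
```

Calling bootstrap(Path.cwd()) would set up the current directory. Because the dependency list lives in a file, the same environment can be rebuilt later or on another machine.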

Environment configuration during model execution

Specific Problems Each Model Encountered

Gemini 2.5 Pro ran into trouble during the environment installation phase: it misread a file path because the path contained a space. For a top-tier large language model, a basic error like this is hard to excuse. Its sequencing was also inefficient, since it insisted on waiting for the virtual environment to finish installing before writing any code instead of handling the two tasks in parallel.
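The test does not show exactly how the path was mishandled, but one common way a space breaks a command is unquoted string concatenation. The sketch below uses a hypothetical path and the standard shlex module to show the difference between an unquoted path, a quoted one, and an argument list.

```python
import shlex

script = "/home/dev/AI Arena/scraper.py"  # hypothetical path containing a space

# Naive string concatenation: shell-style splitting sees two separate arguments.
broken = shlex.split("python " + script)

# Quoting the path keeps it together as a single argument.
quoted = shlex.split("python " + shlex.quote(script))

# Passing a list (for example to subprocess.run) skips shell parsing entirely.
as_list = ["python", script]

print(broken)   # ['python', '/home/dev/AI', 'Arena/scraper.py']
print(quoted)   # ['python', '/home/dev/AI Arena/scraper.py']
```

Building commands as lists, or quoting every path that is interpolated into a command string, avoids this whole class of failure.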

GPT-5 got the scraper running but capped the crawl at 10 pages, and every page it saved had an empty body: the text was never actually extracted.

Grok 4 performed even more absurdly. It installed a pile of dependencies that this kind of task never needs and, for no apparent reason, built a GUI. The final result showed "No Links Found": not a single link was discovered.

Kimi K2 completed fastest, but upon opening the result file, all scraped body content was empty, making it effectively a failure as well.

Why Does "Empty Content" Occur? The "empty body text" problem seen across multiple models typically means the CSS selectors or XPath expressions don't match the target website's actual DOM structure. Modern websites often use semantic class names (like article__body, post-content) or deeply nested div structures. If a model never inspects the target page's real HTML, it tends to guess a selector, the guess matches nothing, and the scraper saves an empty string without raising any error. Some websites also use dynamically generated class names or lazy-load their content areas, so even static pages can contain such pitfalls. The empty content from GPT-5 and Kimi K2 was most likely caused by selectors that missed their target.
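The failure mode is easy to reproduce. The sketch below is a standard-library stand-in for a CSS class selector: ClassTextExtractor, extract_body, and extract_or_fail are hypothetical names, and the parser assumes reasonably well-formed nesting. It returns an empty string when the guessed class name is not on the page, and the extract_or_fail wrapper turns that silent result into an explicit error.

```python
from html.parser import HTMLParser

# Elements without end tags; they must not affect the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collects text inside the first element whose class list contains `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.depth = 0      # > 0 while inside the matched element
        self.done = False   # True once the matched element has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif not self.done and self.wanted in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth and data.strip():
            self.parts.append(data.strip())


def extract_body(html, class_name):
    """Return the text of the first element with `class_name`, or '' if none matches."""
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    return " ".join(parser.parts)


def extract_or_fail(html, class_name):
    """Raise instead of silently saving an empty body."""
    body = extract_body(html, class_name)
    if not body:
        raise ValueError(f"no text found for class {class_name!r}; check the selector")
    return body
```

On a page whose article lives in a div with class post-content, extract_body(page, "article-body") returns an empty string with no error. A check like extract_or_fail on the first fetched page would surface the mismatch before the crawler saves a set of empty files.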

Grok 4 execution results

Web Scraping Test Results Summary

Model	Execution Speed	Pages Scraped	Content Quality	Final Result
Claude (previous test)	Normal	126 pages	Correct	✅ Success
Kimi K2-0905	Fastest	13 pages	Body empty	❌ Failed
GPT-5	Slower	10 pages (limited)	Body empty	❌ Failed
Grok 4	Fast	0 pages	No content	❌ Failed
Gemini 2.5 Pro	Slowest	0 pages	No content	❌ Failed

Deep Analysis: Why Did They All Fail?

This test exposed several critical issues:

Insufficient Understanding of Web Page Structure

Static web scraping may seem simple, but it requires models to accurately understand the target website's DOM structure, link patterns, and content distribution. These four models clearly lack sufficient practical execution capability in this area.

Lack of Engineering Thinking

Gemini couldn't even handle a space in a file path, and Grok installed irrelevant dependencies while creating an unnecessary GUI. This demonstrates that these models still have very weak engineering judgment in real programming scenarios.

"Fast" Doesn't Mean "Good"

Kimi K2 finished fastest, but the results were all empty data. Speed advantages are meaningless in the face of quality failures.

Parameter Scale Doesn't Equal Engineering Execution Capability

There's a long-standing misconception in the industry that larger parameter counts and more training data automatically translate into stronger coding ability. In reality, coding execution (especially multi-step tasks in Agent mode) also depends heavily on instruction following, self-correction, and long-context memory. GPT-5, Grok 4, and similar models score well on benchmarks such as HumanEval and SWE-bench, but those scores come from fixed, curated tasks: HumanEval asks for a single function in one shot, and while SWE-bench works on real repository issues, it still runs inside a standardized harness. Neither fully reflects open-ended engineering work that requires repeated tool calls, environment setup, and dynamic debugging against an unfamiliar target. This is why "arena-style" real-world task testing can be more informative than benchmark scores.

Implications for Developers

This test result reminds us:

Don't blindly trust a model's brand halo; actual task performance is the hard metric
In this series' tests so far, Claude holds a clear lead on code execution tasks
Even the latest versions of top-tier models may underperform expectations on specific engineering tasks
When choosing an AI coding assistant, you should test against your actual use cases
There's a significant gap between benchmark scores and real engineering task performance, so developers should prioritize real-world test data that closely mirrors their own business scenarios
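One lightweight way to test against your own use case is to define an acceptance check before running any model. The sketch below is only an illustration: evaluate_scrape and its thresholds are hypothetical, and the input is assumed to be a mapping from URL to extracted body text.

```python
def evaluate_scrape(pages, min_pages=1, min_nonempty_ratio=0.9):
    """Score a scrape result; `pages` maps URL -> extracted body text.

    The thresholds are illustrative defaults, not values from the test.
    """
    total = len(pages)
    nonempty = sum(1 for body in pages.values() if body and body.strip())
    ratio = nonempty / total if total else 0.0
    return {
        "pages": total,
        "nonempty": nonempty,
        "ratio": ratio,
        "passed": total >= min_pages and ratio >= min_nonempty_ratio,
    }
```

Under this check, a run that saves 13 pages with empty bodies fails just as a run that finds no links does, which matches how the results in this test were judged.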

Conclusion

The results of this arena test are quite ironic: four highly anticipated top-tier large language models all failed at a relatively basic static web scraping task, while Claude had scraped 126 pages on the same task in previous testing. This shows that the competition in AI coding capability is far from settled, and that a model's parameter scale and brand recognition don't directly translate into reliable coding execution. For developers, the answer to "which model should I use" always needs to be verified through actual testing.