US vs. China AI Computer Control Divergence: Why Programming Tools Still Haven't Integrated GUI Agents
US and China diverge on AI computer control approaches while programming tools face three integration bottlenecks.
AI computer control (GUI Agent) capabilities have surpassed humans, with the US and China taking divergent paths: the US (Anthropic, OpenAI, Google) pursues product packaging for rapid market capture, while China (ByteDance UI-TARS, Alibaba Qwen-VL) pursues open-source research for ecosystem building. Yet programming tools like Cursor and Copilot still face three bottlenecks for GUI Agent integration: permission risks, sandbox architecture constraints, and inference latency. Claude Code offers a pragmatic alternative via command-line, and the endgame points toward multi-agent collaborative Autonomous Software Engineers.
Reported benchmark success rates for AI computer control have surpassed the human baseline, yet the Cursor and GitHub Copilot you use every day still can't open a browser, test code, or read error messages on their own.
This isn't because the technology isn't ready. It's because two fundamentally different approaches in the US and China are competing, while three real-world bottlenecks are blocking the path to integration.
The US Approach: Product Packaging, Speed First
The American playbook is crystal clear—package capabilities into products, rush them to market, keep technical details proprietary, and focus on whether the capability can be called directly and monetized immediately.
Three giants each own a piece of the territory:
- Anthropic targets desktop-level control. Its Computer Use lets AI control real desktops (opening browsers, clicking buttons, operating software), exposed as an API for developers to call. This was the first time any company turned "AI operating a computer" into an official product interface.
  Technically, Computer Use employs a "screenshot-reason-execute" loop. At each step, the AI captures the current screen and feeds the image into the Claude model for visual understanding; the model outputs the next action (click coordinates, keyboard input, scroll commands); and the execution layer calls system APIs to complete the operation before taking another screenshot to begin the next cycle. The key breakthrough is unifying visual understanding and action planning within a single model, rather than a stitched-together architecture of separate perception and planning modules.
- OpenAI targets web agents. The original Operator was specifically designed to let AI book restaurants, check flights, and fill out forms in the browser on behalf of users. In July 2025, these capabilities were merged into ChatGPT's Agent features, and the Operator site officially shut down on August 31.
- Google DeepMind targets browser automation. Project Mariner's direction is enabling AI to automatically complete complex tasks on the web, with related capabilities being integrated into Gemini.
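The "screenshot-reason-execute" loop described for Computer Use can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Anthropic's implementation: the `Action` fields, the `run_loop` function, and the `fake_capture` / `fake_decide` stand-ins are all hypothetical names chosen so the sketch runs without a real screen or model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    """One step proposed by the model. All field names are hypothetical."""
    kind: str                              # "click", "type", "scroll", or "done"
    xy: Optional[Tuple[int, int]] = None   # target coordinates for a click
    text: Optional[str] = None             # text for keyboard input


def run_loop(
    capture: Callable[[], bytes],
    decide: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    goal: str,
    max_steps: int = 20,
) -> List[Action]:
    """Repeat screenshot -> reason -> execute until the model reports 'done'."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()                      # 1. capture the current screen
        action = decide(goal, screenshot, history)  # 2. model picks the next action
        history.append(action)
        if action.kind == "done":
            break
        execute(action)                             # 3. act, then loop and re-capture
    return history


# Stand-ins so the sketch runs without a real screen or model.
def fake_capture() -> bytes:
    return b"placeholder-image-bytes"


def fake_decide(goal: str, screenshot: bytes, history: List[Action]) -> Action:
    script = [Action("click", xy=(640, 360)), Action("type", text=goal), Action("done")]
    return script[len(history)]


executed: List[Action] = []
steps = run_loop(fake_capture, fake_decide, executed.append, goal="hello")
print([a.kind for a in steps])
```

The `max_steps` cap is a design choice of this sketch: because every iteration depends on a model decision, a bounded loop keeps a confused agent from running indefinitely.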

The three take different paths, but share one thing in common: technical details remain closed, and capabilities are packaged as services pushed directly to users. The American logic is to capture the market first and explain the principles later: whoever gets the users first wins.
The China Approach: Open-Source Research, Ecosystem First
China is taking a completely opposite path—open-source research first, trading technical transparency for global influence. Nothing is hidden or locked away; instead, methodologies are laid out on the table for developers worldwide to build upon.
Two core engines drive this approach:
Engine One: ByteDance's UI-TARS
In early 2025, ByteDance open-sourced the complete methodology for Native GUI Agents. To understand the significance, you need to know the technical evolution of GUI Agents.
GUI Agents (Graphical User Interface Agents) are AI systems that can perceive screen visual content and simulate human operations. Early systems relied on OCR (Optical Character Recognition) to extract text and Accessibility Trees to parse interface structure—essentially "translating" graphical interfaces into structured data before processing. The new generation directly takes screenshots as input, using multimodal large models to understand interface semantics end-to-end and output operation coordinates, eliminating intermediate assembly modules. UI-TARS takes exactly this latter path—purely vision-driven, end-to-end. What ByteDance accomplished was turning this from a paper into open-source engineering.
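The difference between the two generations can be illustrated with a small sketch. The node shape (`role`, `name`, `bounds`, `children`), the helper names, and the model output format below are assumptions made for illustration; they are not UI-TARS's actual data structures. The structured path must search a parsed tree to find its target, while the vision path simply trusts coordinates emitted by the model.

```python
from typing import Any, Dict, Optional, Tuple

Node = Dict[str, Any]  # hypothetical accessibility-tree node


def find_by_name(node: Node, name: str) -> Optional[Node]:
    """Depth-first search of an accessibility-tree-like structure."""
    if node.get("name") == name:
        return node
    for child in node.get("children", []):
        hit = find_by_name(child, name)
        if hit is not None:
            return hit
    return None


def center(bounds: Tuple[int, int, int, int]) -> Tuple[int, int]:
    """Center point of an (x, y, width, height) rectangle."""
    x, y, w, h = bounds
    return (x + w // 2, y + h // 2)


def vision_click(model_output: Dict[str, int]) -> Tuple[int, int]:
    """Vision path: the model already emits pixel coordinates (assumed format)."""
    return (model_output["x"], model_output["y"])


tree: Node = {
    "role": "window", "name": "Login", "bounds": (0, 0, 800, 600),
    "children": [
        {"role": "textbox", "name": "Email", "bounds": (100, 100, 300, 40), "children": []},
        {"role": "button", "name": "Submit", "bounds": (100, 200, 120, 40), "children": []},
    ],
}

# Structured path: parse the tree, locate the element, derive the click point.
submit = find_by_name(tree, "Submit")
structured_target = center(submit["bounds"])

# Vision path: no tree at all, only what the model returned for a screenshot.
vision_target = vision_click({"x": 160, "y": 220})

print(structured_target, vision_target)  # (160, 220) (160, 220)
```

The structured path breaks as soon as an application exposes no usable tree, which is one reason the end-to-end approach drops that intermediate layer.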
Engine Two: Alibaba's Qwen-VL Series
Qwen-VL is currently one of the most downloaded open-source multimodal models globally and has been specifically enhanced for GUI understanding: it can comprehend the semantics of buttons, menus, and forms, and can work with Agent frameworks to complete actual operations. It's one of the most widely deployed foundations for GUI Agent engineering in China.

At the academic level, Chinese research teams publish heavily at top conferences such as CVPR and ICLR. They gain a voice in the field through academic influence and build ecosystems through open source, so that tools worldwide grow on their foundations.
Four-Dimensional Comparison: Two Competing Logics
| Dimension | United States | China |
|---|---|---|
| Core Strategy | Product packaging, fast market capture | Open-source research, ecosystem building |
| Representative Forms | Computer Use, ChatGPT Agent, Mariner | UI-TARS open-source, Qwen-VL series |
| Openness Level | Low, technical details nearly black-box | High, methodologies directly open to the public |
| Competitive Focus | Capturing users | Capturing ecosystems |
It's not about who's right or wrong—these are two completely different competitive logics. But both paths point to the same endgame: making AI truly capable of executing tasks, not just answering questions.
Why Programming Tools Haven't Integrated GUI Agents: Three Real-World Bottlenecks
AI computer control capabilities are already mature, and programming tools should logically integrate them. But three real-world gates stand in the way—it's not that no one thought of it, it's that the time hasn't come.

Bottleneck One: The Permission Abyss
A GUI Agent means the AI can click on anything on your computer. In programming scenarios, it might accidentally delete a database, commit code by mistake, or touch files you don't want it touching. The technology can do it; product teams simply don't dare open this permission. If something goes wrong, who takes responsibility?
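One common way to narrow this kind of risk is a policy gate that reviews each proposed action before it runs. The sketch below is a simplified illustration using shell commands as the action type; the `Verdict` enum, the policy tables, and the `review` function are hypothetical, and a real product would need far richer rules than keyword matching.

```python
import shlex
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"   # run without asking
    ASK = "ask"       # pause for human confirmation
    DENY = "deny"     # never run


# Hypothetical policy tables, chosen only to mirror the risks named in the text.
SAFE_PROGRAMS = {"ls", "cat", "pytest"}
DESTRUCTIVE_WORDS = {"rm", "drop", "truncate"}
CONFIRM_PREFIXES = {("git", "commit"), ("git", "push")}


def review(command: str) -> Verdict:
    """Classify a proposed shell command before the agent may execute it."""
    tokens = shlex.split(command)
    if not tokens:
        return Verdict.DENY
    # Look inside quoted arguments too, e.g. SQL passed to a database client.
    words = {w.lower() for t in tokens for w in t.split()}
    if words & DESTRUCTIVE_WORDS:
        return Verdict.DENY
    if tuple(tokens[:2]) in CONFIRM_PREFIXES:
        return Verdict.ASK
    if tokens[0] in SAFE_PROGRAMS:
        return Verdict.ALLOW
    return Verdict.ASK  # unknown commands default to asking a human


for cmd in ["ls -la", "git commit -m 'wip'", 'psql -c "DROP TABLE users"']:
    print(cmd, "->", review(cmd).value)
```

Defaulting unknown commands to `ASK` rather than `ALLOW` reflects the accountability question above: when the policy has no opinion, a human stays in the loop.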
Bottleneck Two: Sandbox Constraints
Copilot runs as a VS Code extension, and Cursor is a deeply customized fork of VS Code that keeps the same extension architecture. The VS Code Extension Host is a separate runtime in which extensions interact with the editor through the officially exposed VS Code APIs, and those APIs offer no OS-level mouse, keyboard, or screenshot interfaces. This design was originally intended for isolation, so that a misbehaving extension cannot destabilize the editor, but it also becomes an architectural barrier to GUI Agent integration.
Breaking through this limitation requires moving the agent out of the extension layer into a standalone desktop application or system-level daemon process. This isn't adding a feature; it's rebuilding the foundation.
Bottleneck Three: Compute and Latency
This one is the most easily overlooked but the most fatal. GUI Agent inference latency comes from three stacked stages: image compression and upload (a 1080p screenshot still requires hundreds of KB of network transfer after compression), large model visual reasoning (multimodal model inference computation far exceeds text-only models, with single inferences typically taking 1-3 seconds in the cloud), and waiting for state confirmation after action execution.
These three stages run sequentially, resulting in end-to-end latency of 2-5 seconds per operation step, with inference time accounting for 70%-90% of total task time. A GUI Agent takes several seconds per step and several minutes to run a complete test. But programmers' tolerance for IDE responsiveness is at the millisecond level: the psychological tolerance threshold for code completion is approximately 100-200 milliseconds. That puts the two roughly 10x to 50x apart, a gap that engineering optimization cannot fully close under current hardware and network conditions. No one can tolerate a completion tool that freezes for minutes. Until latency comes down and costs drop, this is the real problem blocking integration.
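The size of the gap follows directly from the figures quoted above. The short calculation below uses only those figures, except for `ASSUMED_STEPS`, which is a hypothetical step count for one test run chosen purely for illustration.

```python
from typing import Tuple

STEP_MS = (2000, 5000)       # end-to-end latency per GUI step, from the text
COMPLETION_MS = (100, 200)   # tolerance for code completion, from the text


def gap_range(step_ms: Tuple[int, int], budget_ms: Tuple[int, int]) -> Tuple[float, float]:
    """Smallest and largest ratio between step latency and the latency budget."""
    smallest = min(step_ms) / max(budget_ms)
    largest = max(step_ms) / min(budget_ms)
    return smallest, largest


def task_minutes(steps: int, step_ms: int) -> float:
    """Total wall-clock minutes for a task of sequential steps."""
    return steps * step_ms / 60_000


low, high = gap_range(STEP_MS, COMPLETION_MS)
print(f"gap: {low:.0f}x to {high:.0f}x")                    # gap: 10x to 50x

ASSUMED_STEPS = 40  # hypothetical number of GUI steps in one test run
print(f"{task_minutes(ASSUMED_STEPS, max(STEP_MS)):.1f} min")  # 3.3 min
```

Because the steps are sequential, total time scales linearly with step count, which is why a per-step delay of a few seconds turns into minutes for a full test.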
The Claude Code Insight: An Alternative Path That Bypasses GUI
When discussing AI programming tools, Claude Code deserves special mention. Some people lump it together with GUI Agents, but that's inaccurate.
Claude Code is a command-line tool. It doesn't operate graphical interfaces; it operates terminals and file systems. It doesn't take screenshots or click a mouse. Instead, it calls underlying interfaces directly: reading files, modifying code, running commands, testing, and deploying. In effect it goes straight through the "glass" layer of the GUI to reach the system layer. This path naturally sidesteps sandbox constraints and screenshot inference latency, making it the most pragmatic breakthrough path under existing engineering constraints.
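The practical advantage of the command-line path is that results come back as text rather than pixels. The sketch below illustrates the general idea with Python's standard `subprocess` module; it is not Claude Code's implementation, and `run_and_read` is a hypothetical helper name.

```python
import subprocess
import sys
from typing import List, Tuple


def run_and_read(cmd: List[str], timeout_s: float = 30.0) -> Tuple[int, str, str]:
    """Run a command and return (exit code, stdout, stderr) as plain text."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    return proc.returncode, proc.stdout, proc.stderr


# A passing command: the result is read directly, with no screenshot involved.
code, out, err = run_and_read([sys.executable, "-c", "print('ok')"])
print(code, out.strip())

# A failing command: the error message arrives as text the agent can parse.
fail_code, fail_out, fail_err = run_and_read(
    [sys.executable, "-c", "raise ValueError('boom')"]
)
print(fail_code, fail_err.strip().splitlines()[-1])
```

An exit code and a traceback string are far cheaper to interpret than an image of a terminal window, which is the sense in which this route avoids screenshot inference latency.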

But it belongs to the same wave as GUI Agents: AI moving from answering questions to autonomously executing tasks.
The Endgame: Autonomous Software Engineers
Multi-agent collaboration is replacing the single-Agent paradigm—one agent plans, one executes, one verifies. The connective tissue between them is MCP (Model Context Protocol).
MCP is a standardized protocol proposed and open-sourced by Anthropic in late 2024, designed to solve interoperability between AI models and external tools and data sources. In multi-agent systems, different Agents are often powered by different models running in different environments. MCP provides a unified "socket" standard that lets planning Agents, execution Agents, and verification Agents pass context, tool invocation requests, and execution results in a structured manner, without a custom interface for each pair of Agents. The protocol's significance is analogous to what USB did for hardware: it reduces integration costs and lets ecosystems grow naturally.
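To show what "structured" means here, the sketch below builds the kind of message MCP uses for a tool invocation. MCP messages are JSON-RPC 2.0 objects, and `tools/call` is the method for invoking a tool; the tool name `run_tests`, its arguments, and the two helper functions are hypothetical and exist only for this illustration.

```python
import json
from itertools import count
from typing import Any, Dict

_ids = count(1)  # simple request id generator for this sketch


def tool_call_request(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request that asks a server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def tool_call_result(request_id: int, text: str, is_error: bool = False) -> Dict[str, Any]:
    """Build the matching response, carrying text content back to the caller."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "result": {"content": [{"type": "text", "text": text}], "isError": is_error},
    }


# "run_tests" and its arguments are made-up examples, not a real tool.
request = tool_call_request("run_tests", {"path": "tests/"})
response = tool_call_result(request["id"], "2 passed, 1 failed", is_error=False)

print(json.dumps(request, indent=2))
print(json.dumps(response, indent=2))
```

Because request and response share one `id` and one fixed envelope, any component that speaks this format can call any tool that exposes it, which is the interoperability the USB analogy points at.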
Mapping this architecture to AI programming, the future looks like this: you open your programming tool, and it doesn't just modify code—it opens a browser to verify pages on its own, reads error messages on its own, locates bugs on its own, submits fixes on its own, and then tells you "done."
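The plan, execute, verify division of labor can be sketched as a retry loop. This is a toy illustration under stated assumptions: `solve` and the three stub agents are hypothetical, and in a real system each role would be backed by a model and real tools rather than string checks.

```python
from typing import Callable, List, Tuple


def solve(
    task: str,
    plan: Callable[[str, str], List[str]],
    execute: Callable[[str], str],
    verify: Callable[[str, List[str]], Tuple[bool, str]],
    max_rounds: int = 3,
) -> bool:
    """Loop plan -> execute -> verify, feeding verifier feedback into replanning."""
    feedback = ""
    for _ in range(max_rounds):
        steps = plan(task, feedback)             # planner proposes steps
        results = [execute(s) for s in steps]    # executor carries them out
        ok, feedback = verify(task, results)     # verifier checks the outcome
        if ok:
            return True
    return False


# Stub agents: the first patch fails verification, the second one passes.
rounds: List[str] = []


def stub_plan(task: str, feedback: str) -> List[str]:
    rounds.append(feedback)
    return ["apply patch v2"] if feedback else ["apply patch v1"]


def stub_execute(step: str) -> str:
    return f"{step}: done"


def stub_verify(task: str, results: List[str]) -> Tuple[bool, str]:
    passed = any("v2" in r for r in results)
    return passed, "" if passed else "page still shows an error"


print(solve("fix login bug", stub_plan, stub_execute, stub_verify), len(rounds))
```

The key design point is that the verifier's feedback flows back into planning, so a failed attempt informs the next one instead of being repeated.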
This form has a name—Autonomous Software Engineer. It's not here to replace programmers, but to let programmers hand off mechanical work and save their time for things that truly require judgment.
Two years ago, GUI Agent success rates were only 12%; today they've surpassed humans. The US is wrapping this capability in products; China is spreading it through open source. Both paths point to the same thing: making AI go from only being able to think to truly being able to act.
The true mark of a technology's success isn't when everyone talks about it, but when no one mentions it anymore—because it's become a taken-for-granted underlying capability. The endgame is established; all that remains is time.
Key Takeaways
- The US takes a product packaging approach (Anthropic Computer Use, OpenAI ChatGPT Agent, Google Mariner): black-box technology but rapid market capture
- China takes an open-source research approach (ByteDance UI-TARS, Alibaba Qwen-VL)—trading technical transparency for global ecosystem influence
- Programming tool integration of GUI Agents faces three major bottlenecks: permission risks, sandbox architecture limitations, and inference latency and cost
- Claude Code takes a command-line approach that bypasses GUI to reach the system layer directly, belonging to the same wave of AI autonomous task execution as GUI Agents
- The endgame is multi-agent collaborative Autonomous Software Engineers, achieving planning, execution, and verification coordination through the MCP protocol