Claude Sonnet 4 Invents Its Own Browser Automation: The Wild Debugging Journey of a CSS Bug

Claude Sonnet 4 autonomously invents browser automation tools to debug a CSS bug, raising cost and security concerns.
Simon Willison shares a striking case where Claude Sonnet 4 (Fable), given only a screenshot and a one-line prompt, autonomously invented a PyObjC-based browser screenshot solution, built a CORS proxy server, penetrated Shadow DOM, and completed a full CSS debugging workflow — all without any pre-configured browser tools. The $12 debugging session highlights both the remarkable tool-making capabilities of frontier AI coding agents and the serious security implications of running autonomous agents with system-level permissions.
A CSS Bug Sparks an AI's Autonomous Exploration
Simon Willison (renowned developer and creator of Datasette) recently shared a jaw-dropping use case involving Claude Sonnet 4 (codename Fable). Simon is one of the core figures in the Python/Django community and a co-creator of the Django framework. His project Datasette is a widely popular open-source tool that instantly transforms SQLite databases into interactive web interfaces and APIs. As an early advocate of LLM tooling practices, his blog is one of the most influential sources of information in the AI development space.
This time, he simply gave the AI a screenshot and a one-line prompt — "look at the dependencies and figure out why there's a horizontal scrollbar here" — and walked away from his computer. When he came back a few minutes later, he found that Claude had autonomously opened browser windows and was performing a series of complex debugging operations.

This case perfectly illustrates Simon's assessment of Claude Sonnet 4: relentlessly proactive. It has a vast repertoire of tricks and will spare no effort deploying whatever means necessary to achieve its goal.
Claude Sonnet 4 Invents Its Own Browser Screenshot Solution
What's most astonishing is that Claude Code itself wasn't configured with any browser automation tools, yet Fable found its own way. To understand this, you need to grasp what Claude Code actually is: Anthropic's command-line coding agent, which lets Claude operate directly within a developer's terminal environment, including reading and writing files, running shell commands, and managing git repositories. Unlike traditional code completion tools (like GitHub Copilot), coding agents can autonomously plan and execute multi-step tasks. In effect, Claude Code is an autonomous AI agent with system-level permissions.
The PyObjC + screencapture Combo
When Claude discovered that macOS's osascript couldn't be used due to permission restrictions, it didn't give up. Instead, it invented a workaround:
- Used uv run --with pyobjc-framework-Quartz python to temporarily install and run the PyObjC framework
- Enumerated all windows via the Quartz API, filtering by window name to locate the target Safari window
- Retrieved the window number (e.g., 153551)
- Used screencapture -x -o -l 153551 to precisely capture that specific window
This solution involves a clever combination of multiple low-level technologies. PyObjC is a bridging framework between Python and Objective-C that allows Python code to directly call macOS's native Cocoa and Quartz APIs. Quartz is the core component of macOS's graphics subsystem, responsible for window compositing, screen rendering, and other low-level operations. Through Quartz APIs like CGWindowListCopyWindowInfo, you can enumerate metadata for all windows in the system (including window ID, title, position, owning application, etc.) — a fairly low-level system programming operation. screencapture is macOS's built-in screenshot command-line tool, and its -l parameter allows precise capture by specifying a window ID. Claude's ability to combine these disparate system tools into a complete solution demonstrates system-level problem-solving capabilities that go far beyond simple code generation.
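The enumerate-then-capture steps above can be sketched roughly as follows. This is an illustration and not the script Claude actually wrote: the helper names, the "Datasette" title fragment, and the /tmp/window.png output path are assumptions. Only CGWindowListCopyWindowInfo, the window dictionary keys, and the screencapture flags come from macOS itself.

```python
import subprocess


def list_windows():
    """Return on-screen window metadata via Quartz (macOS only, needs PyObjC)."""
    import Quartz  # provided by the pyobjc-framework-Quartz package

    return Quartz.CGWindowListCopyWindowInfo(
        Quartz.kCGWindowListOptionOnScreenOnly, Quartz.kCGNullWindowID
    )


def find_window_id(windows, owner, title_part):
    """Return the ID of the first window owned by `owner` whose title contains `title_part`."""
    for w in windows:
        title = w.get("kCGWindowName") or ""  # the title may be missing
        if w.get("kCGWindowOwnerName") == owner and title_part in title:
            return int(w["kCGWindowNumber"])
    return None


def screencapture_argv(window_id, out_path):
    # -x: no sound, -o: no window shadow, -l: capture the window with this ID
    return ["screencapture", "-x", "-o", "-l", str(window_id), out_path]


if __name__ == "__main__":
    # Hypothetical title fragment and output path, for illustration only.
    wid = find_window_id(list_windows(), "Safari", "Datasette")
    if wid is not None:
        subprocess.run(screencapture_argv(wid, "/tmp/window.png"), check=True)
```

Saved as a script, this could be run with uv run --with pyobjc-framework-Quartz python followed by the script name. On recent macOS versions window titles may be unavailable without the Screen Recording permission, so the title filter is best treated as a heuristic.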
It's worth noting that Claude chose to use uv rather than the traditional pip to install dependencies. uv is a Python package manager written in Rust and developed by Astral (the same company behind the Ruff linter); Astral's own benchmarks put it at 10-100x faster than pip. Its uv run --with command installs the named dependencies into a temporary environment and immediately runs a command or script there, with no prior environment setup. This ephemeral approach suits one-off system probing tasks: use it and discard it without polluting the project's dependency environment.
This entire solution was improvised by Fable on the spot. No one ever taught it this trick.
Building a CORS Server to Extract Page Data
To retrieve CSS computed values and other diagnostic information from the browser, Claude wrote a minimal Python HTTP server:
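from http.server import HTTPServer, BaseHTTPRequestHandler

class H(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly Content-Length bytes of the POSTed diagnostics
        n = int(self.headers.get("Content-Length", 0))
        with open("/tmp/diag.json", "w") as f:
            f.write(self.rfile.read(n).decode())
        self.send_response(200)
        # Let a page served from any origin read the response
        self.send_header("Access-Control-Allow-Origin", "*")
        self.end_headers()

# Assumption: the excerpt omits the startup line; this port is illustrative
HTTPServer(("127.0.0.1", 8765), H).serve_forever()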
The Access-Control-Allow-Origin: * response header in this code relates to a core browser security mechanism, CORS (Cross-Origin Resource Sharing). Under the same-origin policy, a script running on one origin (protocol + domain + port) cannot by default read responses from a different origin, and requests that are not "simple" are stopped at a preflight check. A page on the local dev server (e.g., localhost:8000) that POSTs to the second local server Claude spun up is making a cross-origin request, because the port differs. Without the CORS header, the browser would report the request to the page as failed. By setting the wildcard *, Claude tells the browser that pages from any origin may read responses from this data collection server.
It then injected JavaScript into the page template to POST critical data like the textarea's scrollWidth and clientWidth to this server, and read the results from the file system. This is a complete data exfiltration pipeline, entirely designed autonomously by the AI.
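To make the "POST, then read from the file system" loop concrete, here is a minimal round-trip sketch. It reuses the shape of the handler above but is not the original session's code: the temporary output path, the round_trip helper, and the sample scrollWidth/clientWidth numbers are all assumptions, and a plain Python HTTP client stands in for the browser.

```python
import http.client
import json
import os
import tempfile
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical output path; the original session wrote to /tmp/diag.json.
DIAG_PATH = os.path.join(tempfile.mkdtemp(), "diag.json")


class DiagHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        n = int(self.headers.get("Content-Length", 0))
        with open(DIAG_PATH, "w") as f:
            f.write(self.rfile.read(n).decode())
        self.send_response(200)
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def round_trip(payload):
    """POST payload to a throwaway server, then read it back from disk."""
    server = HTTPServer(("127.0.0.1", 0), DiagHandler)  # port 0: OS picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request("POST", "/", body=json.dumps(payload))
        resp = conn.getresponse()
        cors = resp.getheader("Access-Control-Allow-Origin")
        resp.read()
        conn.close()
    finally:
        server.shutdown()
        server.server_close()
    with open(DIAG_PATH) as f:
        return cors, json.load(f)


if __name__ == "__main__":
    # Made-up measurements standing in for what the injected script would send.
    print(round_trip({"scrollWidth": 412, "clientWidth": 400}))
```

The agent never needs the HTTP response itself. The file on disk is the channel it reads from, which is why such a small handler is enough.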
Full CSS Debugging Workflow Recap
Based on a single screenshot and a one-line prompt, Claude Sonnet 4 completed all of the following steps:
- Environment setup: Configured fake environment variables on its own and started a local dev server
- Multi-browser testing: First tested with Playwright across Chrome/Firefox/WebKit, but couldn't reproduce the bug
- Identifying the key clue: Determined that the user's default browser was Safari
- Creating test pages: Wrote standalone HTML test files to verify different CSS configurations
- Bypassing permission restrictions: After osascript was denied, invented the PyObjC solution
- Simulating user interaction: Injected JavaScript into the template to automatically trigger the / keyboard shortcut after 1.2 seconds to open a modal
- Diving into Shadow DOM: Penetrated a Web Component's shadow root to access the textarea element
- Verifying the fix: First hacked a fix into the template to confirm it worked, then reported the proper fix
The "diving into Shadow DOM" step deserves special attention. Shadow DOM is one of the core technologies in the Web Components standard: it attaches an encapsulated DOM tree (called a shadow tree) to a host element. Styles and structure inside the shadow tree are largely isolated from the outer document, and external CSS selectors and JavaScript methods like querySelector do not cross the shadow boundary by default. This encapsulation mechanism is widely used in UI component libraries and design systems. To reach elements inside a shadow tree from outside, you first obtain the host element's shadowRoot property (available when the shadow root was attached in open mode) and then query within it. Claude's ability to recognize that the target textarea was inside a Shadow DOM and correctly go through the shadow root to access it demonstrates a deep understanding of modern web architecture.
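The kind of probe that gets injected into a template for this step might look like the sketch below, written as a small Python helper that renders the JavaScript. Everything specific here is hypothetical: the search-modal element name, the endpoint URL, and the default delay are placeholders, and the article does not show the script Claude actually injected.

```python
import json

# JavaScript template; the %(...)s placeholders are filled in by build_probe().
PROBE_TEMPLATE = """\
setTimeout(() => {
  const host = document.querySelector(%(host)s);
  // Only open shadow roots are reachable this way; closed ones return null.
  const textarea = host.shadowRoot.querySelector("textarea");
  fetch(%(endpoint)s, {
    method: "POST",
    body: JSON.stringify({
      scrollWidth: textarea.scrollWidth,
      clientWidth: textarea.clientWidth,
    }),
  });
}, %(delay_ms)d);
"""


def build_probe(host_selector, endpoint, delay_ms=1200):
    """Render the probe, quoting the selector and URL as JavaScript string literals."""
    return PROBE_TEMPLATE % {
        "host": json.dumps(host_selector),
        "endpoint": json.dumps(endpoint),
        "delay_ms": delay_ms,
    }


if __name__ == "__main__":
    # Hypothetical custom element name and collector URL.
    print(build_probe("search-modal", "http://127.0.0.1:8765/"))
```

A scrollWidth larger than clientWidth on the textarea is the usual signal that its content overflows horizontally, so these two values are enough to confirm where a scrollbar comes from.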
In the end, the problem that consumed so much creativity actually required just a two-line CSS fix.
A Dual Warning: Cost and Security
$12 to Fix a CSS Bug
Simon used AgentsView to track the token consumption of this session. At full API pricing, this debugging session cost approximately $12.11. During the process, Fable hit some invisible limit and automatically downgraded to Opus to finish the task.
Simon is currently on the $100/month Claude Max plan. Anthropic has promised ample Fable usage through June 22, after which it will be billed at API prices.
Security Risks of AI Coding Agents
The flip side of this case is a profound security warning. As Simon pointed out:
Coding agents can do everything you can do via terminal commands — and frontier models know all the tricks, and apparently some new ones that have never been documented.
If Fable were to receive malicious instructions — such as prompt injection attacks hidden in code, malicious content in issue threads, or harmful content accidentally pasted by a user — its "relentless proactivity" would become an enormous security threat.
Prompt Injection is a core class of security attacks targeting large language model applications. Attackers hide malicious instructions within seemingly normal input content — such as code comments, GitHub Issue descriptions, Markdown documents, or even image alt text — attempting to hijack the AI agent's behavior. When the AI agent processes this content, it may execute the hidden malicious instructions as if they were the user's intent. For coding agents with system-level permissions, this type of attack is especially dangerous: an attacker could use a seemingly harmless code repository to trick the AI agent into performing data theft, backdoor implantation, or other malicious operations. The industry currently has no perfect defense, and this is considered one of the most urgent unsolved problems in AI security.
Specifically, a hijacked Fable could:
- Open arbitrary browser windows
- Read the contents of other windows on screen
- Start local servers
- Modify project source code
- Send data over the network
Simon compared this risk to the "Challenger disaster" — an analogy with deep implications. The root cause of the 1986 Challenger space shuttle disaster was that known risks were systematically underestimated and ignored — engineers had long warned that the O-rings could fail in cold temperatures, but management chose to proceed with the launch under schedule pressure. Simon is implying the same structural risk: the industry has long known that running AI agents in non-sandboxed environments poses serious security risks, but under pressure to deliver efficiency and features, this risk is widely ignored. As model capabilities surge dramatically (as demonstrated by Fable's autonomous tool-creation abilities), this "known but ignored risk" is rapidly approaching a critical point.
Implications for Developers
What Claude Sonnet 4 demonstrates isn't just stronger coding ability — it's an entirely new problem-solving paradigm: when standard tools aren't available, the AI will autonomously invent alternatives. This "tool-making" capability is a fundamental distinction from previous models.
But this also means developers need to rethink several things:
- Permission boundaries: System permissions granted to AI coding agents need much finer-grained control
- Cost monitoring: Unconstrained exploration can generate unexpectedly high costs
- Sandboxed execution: Using coding agents in production environments must involve isolation mechanisms
- Audit trails: All AI operations need to be fully logged for post-hoc review
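As one illustration of the audit-trail point, a wrapper like the following could record every command an agent runs as one JSON line. This is a sketch under assumptions, not a feature of Claude Code: run_logged and the log format are invented here, and a real setup would more likely hook into the agent's own logging or an OS-level sandbox.

```python
import json
import os
import subprocess
import sys
import tempfile
import time


def run_logged(argv, log_path):
    """Run argv, append one JSON line describing the call, and return the exit code."""
    started = time.time()
    proc = subprocess.run(argv, capture_output=True, text=True)
    record = {
        "argv": argv,
        "returncode": proc.returncode,
        "seconds": round(time.time() - started, 3),
        "stdout_chars": len(proc.stdout),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return proc.returncode


if __name__ == "__main__":
    # Hypothetical log location, for illustration only.
    log = os.path.join(tempfile.mkdtemp(), "audit.jsonl")
    run_logged([sys.executable, "-c", "print('hello')"], log)
    with open(log) as f:
        print(f.read())
```

An append-only JSON Lines file keeps each record independent, so a partial log is still readable if a session is interrupted.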
As Simon put it, Fable's intelligence is a double-edged sword — it may be better at recognizing malicious instructions, but once compromised, the damage it can cause will far exceed that of previous models.