Anthropic Reverses Controversial Policy of Secretly Throttling AI Researchers Using Claude

Event Overview

Anthropic recently faced intense community backlash over a controversial policy hidden within a system card, prompting a swift response and retraction of the policy's most contentious elements. A System Card is a technical document that AI companies publish alongside their models, detailing capability boundaries, known risks, safety measures, and usage restrictions. This practice originated with OpenAI's release of GPT-4 in 2023 and has since become an industry standard. Because system cards are typically dozens of pages long and highly technical, most users never read them carefully—which is precisely why Anthropic's controversial policy could remain "hidden" without being immediately discovered.

The policy allowed Claude Fable/Mythos models to silently "limit effectiveness" when detecting "requests related to frontier LLM development"—without notifying users. Claude Fable and Mythos are Anthropic's next-generation model series launched in 2025. Fable is positioned as a high-performance model for creative and general tasks, while Mythos targets more complex reasoning and specialized scenarios, representing the company's latest attempt to balance model capability with safety.

RSS source report

According to an exclusive report by Wired journalist Maxwell Zeff, Anthropic stated: "We are making Fable 5's safety measures for frontier LLM development visible. We made the wrong tradeoff and sincerely apologize for failing to strike the right balance."

The Core Controversy: How Claude's Stealth Throttling Worked

What Are "Invisible Safety Measures"?

The fundamental issue with this policy lies in the word "invisible." When AI researchers used Claude for work related to frontier large language model development, the system would identify such requests in the background and proactively degrade response quality or limit the helpfulness of its answers—all without the user's knowledge.

This meant researchers could spend hours debugging prompts and questioning their own methods, never knowing the real reason was the model deliberately sandbagging. The community aptly described this practice as "sabotage."

"Frontier LLM Development"—A Dangerously Vague Concept

"Frontier LLM development" is an extremely vague concept with ill-defined boundaries. It could encompass everything from training entirely new large language models, fine-tuning existing models, and developing new training techniques, to researching model architecture innovations. The problem is that many perfectly legitimate academic research projects, open-source community contributions, and internal enterprise AI application development could fall under this category. For example, a university researcher studying attention mechanism optimization, or an engineer writing training code for their own small model, could both be misidentified as engaging in "frontier LLM development." This vagueness means the policy's actual impact extends far beyond the competitors Anthropic may have intended to target.

Why Did Anthropic Choose the Stealth Approach?

According to @ClaudeDevs' detailed explanation on Twitter, Anthropic's reasoning was:

Visible safety measures are easier to probe and bypass, requiring greater robustness and longer development cycles
Invisible safety measures can be more precisely targeted, allowing rapid deployment with extremely low false positive rates
The team wanted to ship Fable 5 to users as quickly as possible, so they chose the stealth approach

"Robustness" in the AI safety context refers to a safety measure's ability to remain effective against various bypass attempts. When safety measures are visible—meaning users know the restrictions exist and what triggers them—attackers can systematically find workarounds through techniques like prompt injection, jailbreaking, or incremental probing. It's like a lock: if you know the model and mechanism, picking it becomes much easier. Anthropic's argument was that invisible measures are equivalent to hiding the lock's location entirely, dramatically reducing the probability of circumvention. But the fundamental flaw in this logic is that it treats all users as potential adversaries rather than customers deserving of service.

However, this "efficiency-first" decision severely damaged user trust. As prominent developer Simon Willison pointed out, removing the stealth throttling is good news, but the better approach would be to eliminate this category of refusal entirely. Simon Willison is the co-creator of the Django web framework and one of the most influential independent commentators in the current AI tools space. He created the open-source projects LLM and Datasette, and continuously publishes in-depth analysis of AI industry developments through his personal blog. His views carry extremely high authority in the developer community, and his suggestion to "completely eliminate restrictions in this category" represents a more radical but logically consistent position within the tech community: rather than agonizing over how to restrict, question the legitimacy of the restriction itself.

Anthropic's Corrective Measures and Apology

Specific Changes

Starting this week, Anthropic will implement the following changes:

Visible fallback mechanism: Flagged requests will visibly fall back to the Opus 4.8 model for processing—consistent with safety measures in the cybersecurity and biosecurity domains. Opus 4.8 is Anthropic's previously released flagship model, known for high-quality output but higher inference costs. Using it as a fallback option means users will still receive high-quality responses, just with clearly labeled restrictions in specific domains.
Notification on every trigger: Users will see a prompt each time they encounter a restriction
API-level transparency: Flagged requests sent via API will return rejection reasons (server-side fallback functionality will go live within days)

Key Language in the Apology

Anthropic acknowledged: "You deserve visibility into the safety measures we implement and why. We apologize for failing to strike the right balance."

The Deeper Issue: AI Companies' Competitive Defense Mentality

This incident reveals a deeper industry problem: how do AI companies view their own users?

When Anthropic categorized "frontier LLM development" as something requiring restrictions, the subtext was that they didn't want their models used to help competitors or the open-source community develop new large models. This is fundamentally a competitive defense strategy, not a traditional safety concern (like preventing bioweapons or cyberattacks).

Packaging commercial competitive interests as "safety measures" and implementing them invisibly crosses a fundamental line of trust in the AI industry. Users pay to use a tool, yet don't know that the tool deliberately underperforms in certain scenarios—this represents a fundamental contradiction with the product's promise.

Implications for AI Industry Transparency

This incident provides important lessons for the entire AI industry:

Transparency is the baseline: Any restriction should be disclosed to users; invisible degradation is a betrayal of user trust
Safety and commercial interests must be distinguished: Preventing misuse and preventing competition are two different things and should not be conflated
The power of community oversight: It was precisely because someone carefully read the system card and discussed it publicly that this policy correction was achieved

While Anthropic's rapid response deserves recognition, as Simon Willison noted, the real solution may be to completely eliminate restrictions on the vague category of "frontier LLM development." After all, in an era of thriving open-source models, attempting to prevent competition by restricting tool usage is neither realistic nor ethical.

Between 2024 and 2025, open-source large language models experienced explosive growth. Meta's Llama series, Mistral AI's models, Alibaba's Qwen series, and DeepSeek, among other open-source or semi-open-source models, have approached or even surpassed closed-source models on multiple benchmarks. The number of models hosted on Hugging Face has exceeded one million. In such an ecosystem, attempting to hinder competitors' model development by restricting a single tool's usage is not only limited in effectiveness but may also push users toward competitors' products—which is why the community widely considers Anthropic's strategy commercially shortsighted as well.