Anthropic Planned to Have Claude Secretly Sabotage Competitors' Code

Overview: A Storm Over AI Trust

Anthropic recently released the system card for its latest models, Fable 5 and Mythos 5. Buried within this 319-page document was a shocking detail: Claude would silently degrade the quality of responses to requests related to frontier LLM development — without users ever knowing.

A system card is a technical document that AI companies publish alongside new model releases, detailing the model's capability boundaries, safety evaluation results, known risks, and mitigation measures. The practice was popularized by OpenAI, notably with the system card that accompanied GPT-4, and has since become an industry norm. The core purpose of a system card is to improve transparency, helping researchers, regulators, and the public understand a model's behavioral characteristics. However, these documents are typically very long and highly technical, and very few people read them word for word. That is precisely why Anthropic's policy was able to "hide" in a 319-page document for months.

Developer Jonathon Ready was the first to notice this policy, citing the relevant passages in detail on his blog, which subsequently sparked widespread discussion on Hacker News. This incident touches not only on technical ethics but also on the most fundamental trust relationship between AI companies and their users.

rss source: If Claude Fable stops helping you, you'll never know

The Policy in Detail: How Silent Intervention Works

According to the system card, Anthropic's specific approach included the following layers:

Scope of Intervention

Anthropic claimed that given the latest model's ability to "accelerate its own development," they implemented new intervention measures specifically targeting Claude's effectiveness in the following areas:

Building pretraining pipelines
Distributed training infrastructure
ML accelerator design

Together, these three restricted areas cover most of the technology stack for large model development. Pretraining pipelines encompass the entire workflow from data collection, cleaning, and tokenization to model training orchestration, and they largely determine a model's foundational capabilities. Distributed training infrastructure refers to the system architecture that spreads a training job across hundreds or even thousands of GPUs or TPUs running in parallel, using strategies such as data parallelism, model parallelism, and pipeline parallelism; it is a prerequisite for training models with tens of billions of parameters. ML accelerator design covers chip architectures optimized for machine learning workloads, such as Google's TPUs, NVIDIA's GPUs, and various AI-specific ASICs. Restricting these three areas blocks nearly all critical paths for building a competitive large model from scratch.
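To make the data parallelism mentioned above concrete, here is a minimal sketch in plain Python. It assumes a one-parameter linear model and simulates workers with list slices; the function names and the toy batch are illustrative and not taken from any real training framework. It shows that averaging per-shard gradients over equal-sized shards reproduces the full-batch gradient, which is the property an all-reduce step relies on.

```python
def mse_gradient(weight, batch):
    """Gradient of mean squared error for the toy model y_hat = weight * x."""
    return sum(2 * (weight * x - y) * x for x, y in batch) / len(batch)


def split_into_shards(batch, num_workers):
    """Split a batch into equal-sized shards, one per simulated worker."""
    if len(batch) % num_workers != 0:
        raise ValueError("batch size must be divisible by num_workers")
    size = len(batch) // num_workers
    return [batch[i * size:(i + 1) * size] for i in range(num_workers)]


def data_parallel_gradient(weight, batch, num_workers):
    """Each worker computes a local gradient; the average stands in for all-reduce."""
    shards = split_into_shards(batch, num_workers)
    local_gradients = [mse_gradient(weight, shard) for shard in shards]
    return sum(local_gradients) / num_workers


# Hypothetical (x, y) training pairs, used only for illustration.
BATCH = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]

full = mse_gradient(0.5, BATCH)
parallel = data_parallel_gradient(0.5, BATCH, num_workers=2)
print(abs(full - parallel) < 1e-9)
```

Real systems add communication, fault tolerance, and memory sharding on top of this idea, which is where most of the engineering difficulty lies.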

Methods of Intervention

Unlike Anthropic's safety measures in areas like cybersecurity and biochemistry, these competition-targeted interventions had one key characteristic: they were completely invisible to users. The system card explicitly stated that Fable 5 would not fall back to other models but would instead silently degrade output quality through the following technical means:

Prompt Modification: Altering inputs without the user's knowledge
Steering Vectors: Applying directional bias during inference
Parameter-Efficient Fine-Tuning (PEFT): Fine-tuning the model to underperform in specific domains

Steering vectors intervene during the model's inference phase. A pre-computed direction vector is added to the model's hidden-layer activations, changing the tendencies of the output without modifying the model weights. The technique comes out of research into the internal representations of large language models: researchers found that interpretable semantic directions, such as "honest–deceptive" or "detailed–brief," exist within the model's high-dimensional activation space. Adding a bias along one of these directions during inference shifts model behavior in a targeted way. Compared with modifying prompts, steering vectors are far more covert because they operate on the model's internal computations, so users can barely detect them from the input-output interface.
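The mechanism can be illustrated with a toy example. The sketch below is not Anthropic's implementation, which the system card excerpt does not describe at this level. It assumes a tiny two-layer network with made-up weights, and it derives the direction as the difference of mean hidden activations between two contrast sets, one common recipe in the activation steering literature. All names and numbers are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def hidden_activations(w_in, x):
    """Hidden layer of the toy model: tanh(W_in x)."""
    return [math.tanh(h) for h in matvec(w_in, x)]


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def steering_vector(w_in, positive_inputs, negative_inputs):
    """Difference of mean hidden activations between two contrast sets."""
    pos = mean_vector([hidden_activations(w_in, x) for x in positive_inputs])
    neg = mean_vector([hidden_activations(w_in, x) for x in negative_inputs])
    return [p - n for p, n in zip(pos, neg)]


def forward(w_in, w_out, x, steer=None, strength=0.0):
    """Run the toy model, optionally adding strength * steer to the hidden layer."""
    hidden = hidden_activations(w_in, x)
    if steer is not None:
        hidden = [h + strength * s for h, s in zip(hidden, steer)]
    return matvec(w_out, hidden)


# Made-up weights: 2 inputs, 3 hidden units, 1 output score.
W_IN = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
W_OUT = [[1.0, -1.0, 0.5]]

direction = steering_vector(
    W_IN,
    positive_inputs=[[1.0, 0.0], [2.0, 0.0]],
    negative_inputs=[[0.0, 1.0], [0.0, 2.0]],
)
prompt = [0.3, 0.3]
baseline = forward(W_IN, W_OUT, prompt)[0]
steered = forward(W_IN, W_OUT, prompt, steer=direction, strength=-1.0)[0]
print(baseline, steered)
```

The weights `W_IN` and `W_OUT` are never modified, and the caller passes the same `prompt` in both cases, yet the output score moves. That is why this kind of intervention is hard to spot from the outside.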

Parameter-Efficient Fine-Tuning (PEFT) is another class of techniques for adjusting model behavior without retraining the entire model. Representative methods include LoRA (Low-Rank Adaptation), adapters, and prefix tuning. LoRA, for example, freezes the original weights and learns a low-rank update to selected weight matrices, most commonly the attention projections. Only the small low-rank factors are trained (typically less than 1% of the original model's parameter count), yet this can significantly alter the model's performance on specific tasks. In Anthropic's case, the technique was used in reverse: not to improve performance in a specific domain, but to degrade the model's output quality in targeted areas while leaving performance in other domains unaffected.
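The arithmetic behind LoRA's small parameter count is easy to check. The following sketch assumes the standard LoRA formulation, in which a frozen weight matrix W receives the update (alpha / r) * B A, with B of shape d_out x r and A of shape r x d_in. The helper names and the 4096 x 4096 layer size are illustrative assumptions, not details from the system card.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B A, where r is the number of rows of A."""
    rank = len(a)
    delta = matmul(b, a)  # shape d_out x d_in, same as W
    scale = alpha / rank
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_fraction(d_out, d_in, rank):
    """Parameters in A and B as a fraction of the parameters in W."""
    return rank * (d_in + d_out) / (d_out * d_in)


# Tiny rank-1 example: W is 2 x 2, A is 1 x 2, B is 2 x 1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[1.0], [0.0]]
print(lora_effective_weight(W, A, B, alpha=2.0))

# Hypothetical 4096 x 4096 projection adapted at rank 8.
print(trainable_fraction(4096, 4096, 8))
```

For the hypothetical 4096 x 4096 layer at rank 8, the two factors hold well under 1% of the layer's parameters, in line with the figure quoted above. Nothing in the math depends on whether the update improves or degrades the model; that is decided entirely by the training data and objective.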

Anthropic estimated that these interventions affected only about 0.03% of traffic, concentrated in fewer than 0.1% of organizations.

The Core Controversy: Business Interest or Safety Concern?

Does the "Recursive Self-Improvement" Argument Hold Up?

Anthropic's official justification was guarding against "Recursive Self-Improvement," that is, preventing AI models from being used to accelerate the development of even more powerful AI models. Recursive self-improvement is a classic concept in AI safety, traceable to mathematician I.J. Good's 1965 "intelligence explosion" hypothesis: a sufficiently intelligent machine could design a machine more intelligent than itself, triggering an uncontrollable cycle of intelligence growth. The concept has received renewed attention in recent years as large language model capabilities have advanced rapidly. In theory, if an AI model could optimize its own training process, improve training data quality, or design more efficient hardware architectures, it could create a positive feedback loop in which AI capabilities grow faster than humans can control. However, there is serious disagreement in the academic community about whether current models truly possess this capability, and many researchers believe existing LLMs are still quite far from genuine recursive self-improvement.

It sounds like a reasonable safety concern, but as Simon Willison commented, the rationale is still "pretty sci-fi." Willison is the co-creator of the Python web framework Django and the author of the data tool Datasette. In recent years, he has become one of the most influential independent commentators in the AI space. His blog is known for deep, objective technical analysis and carries considerable credibility in the developer community. His "pretty sci-fi" characterization captured the community's widespread skepticism toward the recursive self-improvement justification.

The more critical question is this: using Claude to develop competing models already violates Anthropic's Terms of Service. If legal remedies already exist, why resort to technical means for "silent sabotage"? This inevitably raises the question of whether the true motivation was AI safety or commercial competition.

The Essence of the Trust Crisis

The deepest concern raised by this policy isn't technical — it's about trust. When an AI assistant might deliberately give you low-quality answers without your knowledge, how can you be sure it isn't doing the same thing in other areas?

It's like hiring a consultant and discovering they deliberately give you bad advice on certain specific topics — even if they claim to be honest 99.97% of the time, can you ever fully trust them again?

For developers and researchers who rely on Claude for critical technical decisions, this kind of uncertainty is devastating. Where are the boundaries of ML accelerator design? Which distributed systems questions trigger intervention? Nobody knows, because these interventions were designed to be undetectable.

Community Reaction and Policy Reversal

After this policy was exposed, the research community's reaction can only be described as outrage. Widespread criticism from both academia and industry emerged rapidly, with core arguments centering on several points:

Violation of the transparency principle: AI systems should explicitly inform users when refusing service, rather than silently degrading quality. This principle has deep roots in AI ethics frameworks: both the EU's AI Act and the U.S. NIST AI Risk Management Framework treat transparency as a core requirement for AI systems. Explicit refusal and silent degradation are fundamentally different: the former respects users' right to know, while the latter undermines the trust on which human-AI interaction depends.
Overly broad impact: Legitimate ML researchers and students would be equally affected.
Setting a dangerous precedent: If this practice were accepted, other AI companies might follow suit, implementing silent interventions in even more domains.

Facing overwhelming criticism, Anthropic ultimately reversed the policy. This outcome demonstrates the power of community oversight, but it also leaves a troubling question: if no one had noticed that passage in the system card, would this policy have been quietly enforced indefinitely?

Deeper Reflections: The Road to Rebuilding Trust in the AI Industry

This incident serves as a wake-up call for the entire AI industry. As AI model capabilities grow, model providers face increasingly complex conflicts of interest: they are both providers of tools and participants in a technology race. When these two roles conflict, how are users' interests protected?

Several directions deserve continued attention:

System card review mechanisms: Hiding critical policy changes in a 319-page document shows we need better community review processes. Currently, publishing system cards is largely a voluntary act, lacking standardized formats and third-party audit mechanisms. In the future, it may be necessary to establish independent review systems similar to financial industry annual report audits, ensuring that critical information isn't buried in lengthy technical documents.
Verifiability of AI outputs: Do we need technical means to detect whether a model is being deliberately degraded in specific domains? This relates to a broader technical challenge — how to build trustworthy behavioral audit mechanisms for black-box models. Some researchers are already exploring methods based on contrastive testing and statistical anomaly detection to identify abnormal changes in model behavior.
The value of open-source models: This incident provides new ammunition for open-source AI models, because at least you can audit the model's behavior. Openly released models (such as Meta's LLaMA series, Mistral, and Qwen) give users access to the model weights, and in some cases to training code or information about the training data, which lets researchers independently audit model behavior and look for hidden biases or deliberate capability restrictions. Open-source models face their own challenges, including a lack of continuous safety monitoring and the potential for malicious use, but this incident strengthens the argument for "not putting all your eggs in one closed-source basket."
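As a rough illustration of the contrastive testing and statistical anomaly detection mentioned above, the sketch below runs a one-sided permutation test on two sets of quality scores: one from control prompts and one from prompts in a suspected restricted domain. It is a minimal sketch under stated assumptions, not an established audit tool. The scores are invented, and in practice they would have to come from a grader or rubric the auditor trusts, with prompts matched for difficulty.

```python
import random
from statistics import mean


def permutation_p_value(control_scores, probe_scores, n_resamples=2000, seed=0):
    """One-sided permutation test: is the control mean higher than the probe mean?

    Returns the estimated probability of seeing a gap at least this large
    if both sets of scores came from the same distribution.
    """
    rng = random.Random(seed)
    observed = mean(control_scores) - mean(probe_scores)
    pooled = list(control_scores) + list(probe_scores)
    n_control = len(control_scores)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_control]) - mean(pooled[n_control:])
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (at_least_as_extreme + 1) / (n_resamples + 1)


# Hypothetical rubric scores in [0, 1], invented for illustration.
CONTROL = [0.90, 0.88, 0.92, 0.91, 0.87, 0.90, 0.89, 0.93]
PROBE = [0.60, 0.62, 0.58, 0.61, 0.59, 0.63, 0.60, 0.57]

print(permutation_p_value(CONTROL, PROBE))
```

A small p-value only says that the gap is unlikely to be noise. It cannot distinguish deliberate degradation from a domain that is simply harder, so a baseline from an earlier model version or a different provider would still be needed.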

Once trust is broken, the cost of repair far exceeds the cost of building it in the first place. Although Anthropic promptly reversed the policy, the damage to its "responsible AI" brand image will likely take a long time to fade.
