Claude Opus 4.8 Identifies Itself as DeepSeek: Data Contamination or Distillation? A Technical Analysis

Incident Recap: Claude Opus 4.8 Fails Within Two Hours of Launch

On May 29, Anthropic released its new model, Claude Opus 4.8, and simultaneously announced $65 billion in funding at a valuation of $965.1 billion. Less than two hours after launch, however, numerous users reported an embarrassing issue.

When developers queried Opus 4.8 through the API in Chinese, asking "Who are you?", the model responded with "I am Tongyi Qianwen" or "I am DeepSeek." This was not an isolated incident: it reproduced consistently across repeated tests. Users in the Linux community, on Weibo, and on X all independently verified the problem. Even more absurdly, some users chatting with Claude in Chinese would suddenly receive responses in English.

To put it in perspective, this is like Apple releasing a new iPhone that displays "I am Samsung" on the boot screen.

Distillation Attack or Data Contamination? Two Competing Technical Explanations

The Distillation Hypothesis

Many users speculated that Claude had used distillation to acquire capabilities from other large models. Knowledge distillation was first proposed by Geoffrey Hinton et al. in 2015; its core idea is to transfer knowledge from a large "teacher model" to a smaller "student model." The soft labels output by the teacher encode how similar the categories are to one another, so they carry more information than hard labels. In the era of large language models, the meaning of distillation has expanded: it is no longer limited to model compression but also covers collecting high-quality training data by interacting with a stronger model and then training one's own model on that data. While technically effective, this practice is controversial commercially and legally, as it may circumvent the original model's terms of use.
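
To make the soft-label idea concrete, here is a minimal sketch in plain Python of temperature-softened teacher outputs and the matching distillation loss. The three-class logits and the temperature of 2.0 are made-up illustration values, and nothing here is meant to describe how any particular lab trains its models.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2

# Hypothetical teacher logits for three classes: cat, dog, truck.
teacher_logits = [5.0, 3.0, -2.0]
hard_label = [1.0, 0.0, 0.0]                           # only says "cat"
soft_label = softmax(teacher_logits, temperature=2.0)  # also says "dog" is closer than "truck"

if __name__ == "__main__":
    print("hard:", hard_label)
    print("soft:", [round(p, 3) for p in soft_label])
    print("loss for an untrained student:", distillation_loss([0.0, 0.0, 0.0], teacher_logits))
```

The hard label only names the winning class, while the soft label also ranks the losing classes, which is the extra signal the student learns from.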

A notable detail: on February 23 this year, Anthropic issued an official statement publicly accusing DeepSeek, Zhipu, and MiniMax of using approximately 24,000 fake accounts to interact with Claude over 16 million times, calling it "industrial-scale distillation attacks." The statement was strongly worded and framed distillation as a security risk, and Anthropic lobbied the U.S. Department of Defense to push legislation that would define distillation as "hostile action."

Netizens joked: now the "candidate labels" from distillation have come back to haunt them.

The Data Contamination Hypothesis (The More Likely Explanation)

Compared to the distillation hypothesis, data contamination is the more reasonable explanation. Data Contamination refers to training data being mixed with content that doesn't match expectations, causing the model to produce incorrect outputs in specific scenarios. The pre-training phase of large language models typically requires trillions of tokens of text data, most of which is crawled from the internet. On the Chinese internet, conversational data containing "I am Qianwen" or "I am DeepSeek" far outnumbers "I am Claude"—because Tongyi Qianwen and DeepSeek have significantly higher usage and discussion volume among Chinese-speaking users than Claude. When such conversations are included in training data without sufficient cleaning, the model statistically "learns" incorrect self-identification.

When asked in Chinese, Claude, having learned from the latest corpus, judges "I am DeepSeek" or "I am Qianwen" to be the most likely tokens to output. The problem does not occur with English queries because "I am Claude" appears frequently enough in the English corpus and English alignment work is more thorough.

This demonstrates that training data quality control is particularly critical in multilingual scenarios, especially when a model must handle languages that are not its primary supported ones. Data cleaning and deduplication need filtering rules designed for each language, rather than a reuse of the English data processing pipeline.

In-Depth Technical Analysis

The Missing Chinese Alignment Problem

Anthropic has nominally abandoned the Chinese market, and therefore did not perform large-scale alignment work for Chinese during the alignment phase. Alignment refers to the technical process of making AI model behavior consistent with human intentions and values. A complete alignment pipeline typically includes: Supervised Fine-Tuning (SFT) after pre-training, which uses high-quality human-annotated conversation data to teach the model how to respond correctly; Reinforcement Learning from Human Feedback (RLHF), which trains a reward model through human preference rankings and then optimizes the policy using algorithms like PPO; and self-alignment methods such as Constitutional AI. The difficulty of multilingual alignment lies in the fact that each language requires sufficient high-quality annotated data and evaluation benchmarks, and non-English languages often receive insufficient investment.
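
The reward-model step of RLHF is commonly trained with a pairwise preference loss. The sketch below shows that widely used formulation in plain Python; it is a generic illustration with made-up reward values, not a description of Anthropic's pipeline.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = reward_chosen - reward_rejected
    # softplus(-margin) written to avoid overflow for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

if __name__ == "__main__":
    # Hypothetical reward scores for a preferred and a rejected reply.
    print(preference_loss(2.0, 0.0))  # ranking agrees with the annotator: small loss
    print(preference_loss(0.0, 2.0))  # ranking disagrees: larger loss
```

Minimizing this loss pushes the reward model to score human-preferred replies higher, which only helps in a given language if preference pairs in that language are part of the training set.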

When users ask questions in Chinese, the model likely doesn't go through a complete thinking or reasoning process, but instead directly retrieves the "best answer" learned during the SFT phase. SFT is a critical step after pre-training in the LLM training pipeline—during this phase, the model is trained using carefully crafted human-written Q&A pairs, learning how to respond to users in conversational form. This phase determines the model's basic behavioral patterns, including self-identification ("Who am I"), response style, and safety boundaries. If SFT data is contaminated with self-introduction corpus from other models, or if the volume of Chinese SFT data is insufficient, the model may exhibit identity confusion in Chinese contexts.

This exposes a core issue: in the rush to ship models, alignment work at the reasoning level is incomplete, especially for non-English languages. For AI application developers requiring multilingual support, this is a warning signal worth heeding.

Technical Reasons for Bilingual Mixed Responses

As for users writing in Chinese but receiving English replies, the likely cause is that the context Claude retains carries bilingual information about the user, so its responses sometimes switch languages. From a technical perspective, this involves the model's language identification and maintenance mechanism: ideally, the model should identify the user's input language and consistently respond in it, but when alignment training is insufficient, the model may fall back to the language with the highest proportion in its training data (i.e., English). This further suggests that while Opus 4.8 improved in coding and agentic capabilities at launch, its alignment work was sloppy.
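
On the application side, a language switch of this kind can be detected with a simple heuristic. The sketch below guesses the language from the share of CJK characters and flags replies that do not match the user's language; the 0.5 threshold and the two-language assumption are simplifications made for illustration.

```python
def cjk_ratio(text):
    """Fraction of alphabetic characters in the CJK Unified Ideographs block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cjk = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(letters)

def guess_language(text, threshold=0.5):
    """Very rough two-way guess: 'zh' if mostly CJK characters, else 'en'."""
    return "zh" if cjk_ratio(text) >= threshold else "en"

def language_mismatch(user_text, reply_text):
    """True when the reply's guessed language differs from the user's."""
    return guess_language(user_text) != guess_language(reply_text)

if __name__ == "__main__":
    print(language_mismatch("你是谁？", "I am an AI assistant."))  # switched to English
    print(language_mismatch("你是谁？", "我是一个人工智能助手。"))      # stayed in Chinese
```

A flagged reply could then be retried with an explicit instruction to answer in the user's language.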

Understanding AI Distillation Correctly: It's Not a Boogeyman

Distillation Is Standard Practice in the AI Industry

Distillation is a very common technical approach in the AI industry. In MIT's Robotics courses, many vision model capabilities are achieved through distillation. The core logic is to draw on the capabilities of multiple small, specialized models across different scenarios and train a large model through multi-teacher augmented distillation. In this process, each teacher model's specialized capability has to be distilled, which is somewhat similar to the logic of MoE (Mixture of Experts) models.

Mixture of Experts (MoE) is a sparsely activated neural network architecture whose core idea is to divide the model into multiple "expert" sub-networks and activate only a subset of them for each input token. A router decides which experts process each token. The advantage is that the model's total parameter count can be very large (providing stronger representational capacity) while the computational cost per inference remains relatively small. GPT-4 is widely believed to use an MoE architecture, and open-source models like Mixtral and DeepSeek-V2 have explicitly adopted this design. The multi-teacher strategy in distillation shares a similar philosophy with MoE: different teacher models excel in different domains, and combining their knowledge can train a more comprehensive student model.
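
The routing step can be sketched in a few lines. The example below implements one common variant, top-k routing with a softmax over the selected experts' scores, using scalar toy functions in place of real expert networks; the router logits are made-up values rather than the output of a trained router.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, k=2):
    """Pick the top-k experts for one token and renormalize their gate weights."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; other experts are never called."""
    return sum(w * experts[i](x) for i, w in route(router_logits, k))

# Toy scalar "experts" standing in for feed-forward sub-networks.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
router_logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token

if __name__ == "__main__":
    print(route(router_logits))
    print(moe_forward(3.0, experts, router_logits))
```

Only two of the four experts run for this token, which is why total parameter count can grow without a matching growth in per-token compute.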

Smart Distillation vs. Brute-Force Distillation

Distillation involves many techniques and nuances, primarily across the following dimensions:

Choice of distillation target: Should you distill tokens, logits, or hidden states? Logits are the raw output values before the final softmax layer of a neural network, containing the model's confidence distribution across all possible outputs. Compared to distilling only the final output tokens, distilling logits preserves richer information—such as what the model considers the second and third most likely answers, and the relative probability relationships between options. Hidden state distillation goes even further, attempting to align the student model's intermediate layer representations with the teacher model, which requires compatible architectures between the two models. The choice of distillation level directly impacts knowledge transfer efficiency and final model performance.
Positioning of distillation: How should distillation be incorporated as a module in post-training or even continued pre-training? Distillation can occur at different training stages: distillation during pre-training focuses on transferring basic capabilities, while distillation during post-training is more concerned with optimizing performance on specific tasks.
Supporting mechanisms: Should distillation be combined with a Universal Verifier, reinforcement learning (such as RLHF), and multi-teacher logits-level alignment? A universal verifier can perform quality control on generated content during distillation, ensuring the student model doesn't learn errors or biases from the teacher model.
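
The three distillation targets named in the list above can be compared side by side with toy loss functions. The sketch below is illustrative only: the logits are made up, and the hidden-state loss assumes the student and teacher vectors already have the same dimension, which in practice usually requires a learned projection.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_loss(student_logits, teacher_logits):
    """Token-level: cross-entropy against only the teacher's top-1 token."""
    target = max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])
    return -math.log(softmax(student_logits)[target])

def logit_loss(student_logits, teacher_logits):
    """Logit-level: KL(teacher || student) over the full distribution."""
    p, q = softmax(teacher_logits), softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def hidden_loss(student_hidden, teacher_hidden):
    """Hidden-state-level: mean squared error, assuming equal dimensions."""
    diffs = [(s - t) ** 2 for s, t in zip(student_hidden, teacher_hidden)]
    return sum(diffs) / len(teacher_hidden)

teacher = [2.0, 1.0, -1.0]
student_a = [2.0, 1.0, -1.0]  # matches the teacher's full ranking
student_b = [2.0, -1.0, 1.0]  # same top-1 token, but 2nd and 3rd choices swapped

if __name__ == "__main__":
    print(token_loss(student_a, teacher), token_loss(student_b, teacher))
    print(logit_loss(student_a, teacher), logit_loss(student_b, teacher))
```

The token-level loss cannot tell the two students apart, while the logit-level loss penalizes the one that ranks the runner-up options incorrectly.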

There's enormous depth here—simple "hard distillation" alone cannot effectively solve these challenges, nor can it quickly boost model capabilities through brute force. So-called "hard distillation" refers to simply having the teacher model generate large volumes of responses, then directly using those responses as training data to fine-tune the student model. While simple, this approach has limited effectiveness and easily introduces systematic biases from the teacher model.
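
A hard-distillation data pipeline, together with the kind of verifier that could guard it, can be sketched as follows. `toy_teacher` and `no_identity_leak` are hypothetical stand-ins written for this example; a real teacher would be a model endpoint and a real verifier would check far more than one string.

```python
def build_hard_distillation_set(prompts, teacher, verifier=None):
    """Collect (prompt, response) pairs from a teacher, optionally filtered by a verifier."""
    dataset = []
    for prompt in prompts:
        response = teacher(prompt)
        if verifier is None or verifier(prompt, response):
            dataset.append({"prompt": prompt, "response": response})
    return dataset

def toy_teacher(prompt):
    # Stand-in teacher that leaks its own identity into one answer.
    if prompt == "Who are you?":
        return "I am TeacherBot."
    return "Answer to: " + prompt

def no_identity_leak(prompt, response):
    # Minimal verifier: reject responses that mention the teacher's name.
    return "TeacherBot" not in response

prompts = ["Who are you?", "What is 2 + 2?"]

if __name__ == "__main__":
    print(build_hard_distillation_set(prompts, toy_teacher))
    print(build_hard_distillation_set(prompts, toy_teacher, no_identity_leak))
```

Without the verifier, the teacher's self-introduction goes straight into the student's fine-tuning data, which is one concrete way a teacher's systematic bias gets inherited.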

The Value of the Open-Source Ecosystem Should Not Be Denied

Condemning all distillation wholesale is unreasonable. Across the open-source ecosystem, multiple models pushing industry boundaries through mutual, distillation-like processes is a healthier and more beneficial state for AI development. Historically, knowledge sharing in open-source communities has always been an important driver of technological progress: the Linux operating system, the Apache web server, and deep learning frameworks like TensorFlow and PyTorch are all successful examples of open-source collaboration. In the large model space, the release of open-source models like Meta's LLaMA series, Mistral, and DeepSeek has dramatically lowered the barriers to AI research and application, driving rapid development across the entire industry.

Anthropic should not weaponize distillation. The open-source ecosystem—from the model level to the hardware level—has enormous room for further mutual benefit and sharing.

The Claude Opus 4.8 incident is more than a technical bug. It prompts deeper reflection on multilingual alignment, training data quality control, and attitudes toward open-source collaboration in an industry racing through rapid iterations. For AI practitioners, keeping foundational alignment work solid while pursuing model capabilities remains a topic requiring ongoing attention.