DeepSeek OCR2, Kimi K2.5, and Microsoft Maia 200 All Launched on the Same Day

On January 28, the AI field saw several major announcements: DeepSeek released its new visual understanding model OCR2, Moonshot AI officially launched Kimi K2.5, and Microsoft's custom AI chip Maia 200 began deployment. Two days earlier, on the evening of January 26, Alibaba's Tongyi Qianwen had released its most powerful thinking model. This article breaks down each of these developments.

DeepSeek OCR2: Rebuilding the Visual Encoder with a Large Language Model

DeepSeek released its new OCR2 model, whose main technical innovation is an encoder architecture called DeepEncoder V2. Whereas multimodal models commonly use CLIP as the visual encoder, OCR2 replaces CLIP with a large language model (a small-parameter version of Qwen 2.5), giving the model stronger visual perception capabilities.

To understand the significance of this change, some background on CLIP is needed. CLIP (Contrastive Language-Image Pre-training) is a vision-language pre-training model proposed by OpenAI in 2021 that maps images and text into a shared semantic space through contrastive learning. Thanks to its strong zero-shot transfer capability, CLIP quickly became the standard visual encoder for multimodal large models and was widely adopted by mainstream models such as LLaVA and InstructBLIP.

However, CLIP is fundamentally a discriminative model with inherent limitations in fine-grained visual understanding, such as text recognition and chart parsing. There is also a modality gap between its feature representations and the internal representations of language models, so additional projection layers are required to bridge the two.

By replacing CLIP with a language model, DeepSeek OCR2 essentially brings visual perception under the language model's autoregressive framework. The intent is to close the modality gap so that visual features align naturally with the language model's representation space.
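
The contrastive matching that CLIP relies on can be illustrated with a toy example. The sketch below is not CLIP itself: the three-dimensional embeddings are hand-written placeholders standing in for the outputs of real image and text encoders, and the temperature of 0.07 is only an illustrative value. It shows how an image embedding is compared with several caption embeddings by cosine similarity and how a softmax turns those similarities into a zero-shot prediction.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores, temperature=0.07):
    """Temperature-scaled softmax; the temperature is an illustrative value."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_embedding, text_embeddings):
    """Score one image embedding against every candidate caption embedding."""
    labels = list(text_embeddings)
    sims = [cosine(image_embedding, text_embeddings[label]) for label in labels]
    return dict(zip(labels, softmax(sims)))

# Hypothetical embeddings; a real system would get these from trained encoders.
image_vec = [0.9, 0.1, 0.2]
text_vecs = {
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a photo of a dog": [0.1, 0.9, 0.3],
    "a scanned invoice": [0.2, 0.1, 0.9],
}

probs = zero_shot_classify(image_vec, text_vecs)
best = max(probs, key=probs.get)
print(best)
```

Because the model only learns which caption is closest to an image as a whole, this kind of matching says little about fine-grained content such as the individual characters in a scanned page, which is the limitation the paragraph above describes.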

This architectural change delivers substantial performance improvements. On the OmniDocBench V1.5 benchmark, OCR2 scored higher than Gemini 3 Pro while using the same number of tokens. In other words, the model's comprehension ability improved significantly without sacrificing image compression ratio or decoding efficiency.

Even more noteworthy, Section 6.2 of the paper mentions that DeepEncoder V2 has the potential to evolve into a universal multimodal encoder, and DeepSeek stated that it will continue exploring techniques for integrating more modalities. From this, we can speculate that one of the next two generations of DeepSeek models is likely to be natively multimodal, covering text, images, audio, and even video.

Kimi K2.5 Officially Released: Cluster Agent Mode Is the Biggest Highlight

Moonshot AI officially released the Kimi K2.5 model. In fact, the new model had already been quietly pushed to some users the night before, with the official announcement coming on January 28.

Core Capability: Over 100 Sub-Agents Working in Coordination

K2.5's core highlight is the Cluster Agent mode, which can autonomously command over 100 sub-agents and intelligently select and invoke from among 1,500 tools. This "one master brain dispatching hundreds of assistants" architecture gives Kimi unprecedented parallel processing capability for complex tasks.

Multi-agent systems (MAS) are a classic research direction in artificial intelligence that has been revitalized and brought into production in the era of large language models. A typical architecture has an orchestrator agent responsible for task decomposition and scheduling, and sub-agents responsible for specific execution, with agents collaborating through message passing or shared memory.

Kimi K2.5's Cluster Agent mode pushes this architecture to a new scale. With 100+ sub-agents operating in parallel, tasks can be highly parallelized, which in theory compresses complex engineering tasks that would take hours of sequential processing down to minutes. A pool of 1,500 tools can span most common software engineering scenarios, including code execution, web search, file operations, and API calls. The design philosophy is the same as that of open-source multi-agent frameworks such as AutoGen and CrewAI, but applied at a much larger scale.
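
The orchestrator and sub-agent pattern described above can be sketched in a few lines. This is a minimal illustration and not Moonshot AI's implementation: the `TOOLS` registry, the `sub_agent` and `orchestrate` functions, and the subtask format are all hypothetical, and plain Python functions stand in for LLM-driven agents. The `ideal_wall_clock` helper shows the best-case arithmetic behind the "hours down to minutes" claim, assuming independent subtasks of equal duration and no coordination overhead.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tool registry; a production system would expose far more tools.
TOOLS = {
    "word_count": lambda text: len(text.split()),
    "upper": lambda text: text.upper(),
    "reverse": lambda text: text[::-1],
}

def sub_agent(task):
    """Execute one subtask by invoking the tool named in the task."""
    tool = TOOLS[task["tool"]]
    return {"id": task["id"], "result": tool(task["input"])}

def orchestrate(subtasks, max_workers=8):
    """Fan subtasks out to parallel sub-agents and gather results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(sub_agent, subtasks))

def ideal_wall_clock(num_tasks, minutes_per_task, workers):
    """Best-case wall-clock minutes for independent, equal-length subtasks."""
    return math.ceil(num_tasks / workers) * minutes_per_task

subtasks = [
    {"id": 0, "tool": "word_count", "input": "refactor the parser module"},
    {"id": 1, "tool": "upper", "input": "run tests"},
    {"id": 2, "tool": "reverse", "input": "abc"},
]
results = orchestrate(subtasks)
print(results)

# 100 three-minute subtasks: one worker versus 100 workers.
print(ideal_wall_clock(100, 3, 1), ideal_wall_clock(100, 3, 100))
```

In practice the speedup is smaller than this best case, since real subtasks depend on one another and the orchestrator spends time decomposing work and merging results.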

On SWE (software engineering) benchmarks, K2.5 reaches the level of Gemini 3 Pro, and it holds a leading position across several other benchmarks and agent evaluations.

Four Usage Modes

K2.5 currently offers four usage modes: Fast mode, Thinking mode, Agent mode, and Cluster Agent mode. The Cluster Agent mode is only available to paid users. Both the Kimi web version and app are now officially available, and the older K2 model is still retained for users to choose from.

Microsoft Maia 200 Custom AI Chip Officially Deployed

Microsoft officially launched its self-developed AI accelerator chip Maia 200, manufactured using TSMC's 3-nanometer process technology. According to Microsoft, the chip achieves industry-leading performance on several key metrics and has already begun deployment in Microsoft's data centers.

This marks an important step for Microsoft toward AI infrastructure autonomy. The wave of self-developed AI chips stems from strategic anxiety over the NVIDIA GPU supply chain: NVIDIA's H100/H200 series GPUs are estimated to hold over 80% market share in AI training and inference, and their high prices and tight supply put enormous cost pressure on cloud computing giants. Google launched TPUs (Tensor Processing Units) as early as 2016 and has iterated through multiple generations since; Amazon AWS introduced the Trainium and Inferentia series; Meta developed the MTIA inference chip.

Microsoft's Maia series continues this trend. Maia 200 uses TSMC's 3-nanometer process, one of the most advanced chip manufacturing technologies in commercial production today. The core advantages of self-developed chips include deep customization and optimization for proprietary AI workloads (such as attention computation and KV cache management in Transformer inference), long-term procurement cost reduction through vertical integration, and supply chain resilience amid escalating geopolitical risks.

Alibaba Tongyi Qianwen Releases Qwen3 Max Thinking Official Version

On the evening of January 26, Alibaba Tongyi published the official release of Qwen3 Max Thinking. As Alibaba Tongyi's most powerful large language model to date, it achieved excellent results across multiple benchmarks.

Notably, Qwen3 Max Thinking is a closed-source model. It is currently available on the Qianwen web client, desktop client, and QwenChat, with API access opened at the same time. This forms a layered strategy alongside Alibaba's earlier strong push for open-source models: Alibaba Tongyi was previously known for its open-source strategy, with the Qwen series accumulating a large user base and ecosystem on Hugging Face. Open-source models serve ecosystem building and developer mindshare, while the closed-source flagship serves as the core asset for commercial monetization and as a technical moat, converging with OpenAI's business logic (closed-source GPT-4 plus an open API).

A "thinking model" here refers to a model with explicit Chain-of-Thought reasoning capabilities, one that improves accuracy on complex reasoning tasks by performing internal "thinking" steps before generating the final answer. OpenAI's o1/o3 series pioneered this paradigm. The release of Qwen3 Max Thinking indicates that slow-thinking reasoning capability has become a standard competitive dimension for top-tier models.

AI Leaderboard Updates

In usage rankings, Sonnet 4.5 quickly surged to first place on Monday, continuing a "Monday surge to the top" pattern that has persisted for some time. In image generation, Gemini holds a dominant position. The only Chinese model on that leaderboard is Qwen3 VL, ranked ninth. The AI model performance leaderboard has seen no significant changes recently.

Summary

Today's AI developments reveal several noteworthy trends: first, visual understanding models are evolving from CLIP dependency toward deeper language model-driven approaches; second, agents are developing from single-agent to multi-agent cluster collaboration; third, tech giants are accelerating self-developed AI chips to build differentiated competitive advantages. These changes collectively point in one direction—AI is moving from single-point capability breakthroughs toward systematic ecosystem construction.

Key Takeaways

DeepSeek OCR2 replaces CLIP with a large language model as the visual encoder, surpassing Gemini 3 Pro with the same token count on OmniDocBench, with potential to evolve into a universal multimodal encoder
Kimi K2.5 officially launches with Cluster Agent mode capable of dispatching 100+ sub-agents and 1,500 tools, reaching Gemini 3 Pro level on SWE benchmarks
Microsoft launches its self-developed Maia 200 AI chip on TSMC's 3nm process, now deploying in data centers—an important step for cloud giants to reduce NVIDIA dependency and build supply chain resilience
Alibaba Tongyi releases the closed-source Qwen3 Max Thinking official version as its most powerful language model, signaling that slow-thinking reasoning has become standard for top-tier models
The AI industry is moving from single-point breakthroughs toward systematic ecosystem construction encompassing multimodal fusion, multi-agent collaboration, and self-developed chips