DeepSeek OCR2, Kimi K2.5, and Microsoft Maia 200 All Launched on the Same Day

January 28 brought a cluster of AI releases: DeepSeek OCR2 replaces CLIP with an LLM as its visual encoder and outscores Gemini 3 Pro on OmniDocBench V1.5 at the same token count; Kimi K2.5 introduces a Cluster Agent mode that dispatches 100+ sub-agents across 1,500 tools; Microsoft's 3nm Maia 200 chip begins deployment to reduce NVIDIA dependency; and Alibaba releases its closed-source flagship, Qwen3 Max Thinking. AI is shifting from single-point breakthroughs toward systematic ecosystem construction.
On January 28, the AI field saw multiple major announcements: DeepSeek released its new visual understanding model OCR2, Moonshot AI officially launched Kimi K2.5, Microsoft's custom AI chip Maia 200 began deployment, and Alibaba's Tongyi Qianwen released its most powerful thinking model. This article breaks down each of these developments.
DeepSeek OCR2: Rebuilding the Visual Encoder with a Large Language Model
DeepSeek released its new OCR2 model, whose main technical innovation is an encoder architecture called DeepEncoder V2. Unlike traditional multimodal models, which commonly use CLIP as the visual encoder, OCR2 replaces CLIP with a large language model (a small-parameter version of Qwen 2.5), giving the model stronger visual perception capabilities.
To understand the significance of this change, some background on CLIP is needed. CLIP (Contrastive Language-Image Pre-training) is a vision-language pre-training model proposed by OpenAI in 2021 that maps images and text into a shared semantic space through contrastive learning. Thanks to its strong zero-shot transfer capability, CLIP quickly became the default visual encoder for multimodal large models and was adopted by mainstream models such as LLaVA and InstructBLIP.
However, CLIP is fundamentally a discriminative model, with inherent limitations in fine-grained visual understanding such as text recognition and chart parsing. There is also a modality gap between its feature representations and the internal representations of language models, so additional projection layers are needed to bridge the two. By replacing CLIP with a language model, DeepSeek OCR2 brings visual perception tasks under the language model's autoregressive framework, with the aim of closing that modality gap and aligning visual features naturally with the language model's representation space.
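The contrastive objective behind CLIP-style encoders can be sketched in a few lines. The following is a minimal illustration in plain Python, not OpenAI's or DeepSeek's implementation: the function names, the toy two-dimensional embeddings, and the temperature value of 0.07 are all assumptions made for the example. It scores every image against every caption and penalizes the batch when a matched pair is not the most similar one in its row and column.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(image_embs, text_embs, temperature=0.07):
    """Cosine similarity of every image with every text, divided by a temperature."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    return [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[k]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: image-to-text plus text-to-image, averaged."""
    logits = similarity_logits(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


if __name__ == "__main__":
    # Toy embeddings (hypothetical): image i is meant to pair with caption i.
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched_captions = [[0.9, 0.1], [0.1, 0.9]]
    swapped_captions = [[0.1, 0.9], [0.9, 0.1]]
    print("matched:", clip_style_loss(images, matched_captions))
    print("swapped:", clip_style_loss(images, swapped_captions))
```

The loss is lower when each image sits closest to its own caption, which is the discriminative signal the encoder is trained on. Nothing in this objective asks the encoder to read text inside an image or to produce features in a language model's representation space, which is the limitation the paragraph above describes.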

This architectural change delivers substantial performance improvements. On the OmniDocBench V1.5 benchmark, OCR2 scored higher than Gemini 3 Pro while using the same number of tokens. In other words, the model's comprehension ability improved significantly without sacrificing image compression ratio or decoding efficiency.
Even more noteworthy, Section 6.2 of the paper mentions that DeepEncoder V2 has the potential to evolve into a universal multimodal encoder. DeepSeek stated that it will continue exploring integration technologies for more modalities. From this, we can speculate that one of the next two generations of DeepSeek models is likely to be natively multimodal, covering text, images, audio, and even video.
Kimi K2.5 Officially Released: Cluster Agent Mode Is the Biggest Highlight
Moonshot AI officially released the Kimi K2.5 model. In fact, the new model had already been quietly pushed to some users the night before, with the official announcement coming on January 28.
Core Capability: Over 100 Sub-Agents Working in Coordination
K2.5's core highlight is the Cluster Agent mode, which can autonomously command over 100 sub-agents and intelligently select and invoke from among 1,500 tools. This "one master brain dispatching hundreds of assistants" architecture gives Kimi unprecedented parallel processing capability for complex tasks.
Multi-agent systems (MAS) are a classic research direction in artificial intelligence that has been revived and engineered for production in the era of large language models. A typical architecture has an orchestrator agent responsible for task decomposition and scheduling and sub-agents responsible for specific execution, with agents collaborating through message passing or shared memory. Kimi K2.5's Cluster Agent mode pushes this architecture to a new scale: with 100+ sub-agents operating in parallel, complex engineering tasks that would take hours of sequential processing could in theory be compressed to minutes. The 1,500 tools span common software engineering scenarios such as code execution, web search, file operations, and API calls. The design philosophy is the same as that of open-source multi-agent frameworks like AutoGen and CrewAI, but at a considerably larger scale and level of engineering maturity.
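The orchestrator and sub-agent pattern described above can be sketched with Python's standard `asyncio` library. This is a toy illustration under stated assumptions, not Kimi's implementation: the tool registry, the `decompose`, `sub_agent`, and `orchestrate` functions, and the task format are all hypothetical, and real sub-agents would call a model rather than a lambda.

```python
import asyncio

# Hypothetical tool registry; a production system would hold far more entries.
TOOLS = {
    "search": lambda query: f"results for {query!r}",
    "run_code": lambda source: f"executed {len(source)} chars",
}


def decompose(task):
    """Toy decomposition: turn each listed step into one independent subtask."""
    return [
        {"name": f"{task['goal']}#{i}", "tool": step["tool"], "input": step["input"]}
        for i, step in enumerate(task["steps"])
    ]


async def sub_agent(agent_id, subtask):
    """Run one subtask with the tool it names and report the result."""
    tool = TOOLS[subtask["tool"]]
    await asyncio.sleep(0)  # stand-in for model or network latency
    return {"agent": agent_id, "task": subtask["name"], "output": tool(subtask["input"])}


async def orchestrate(task, max_parallel=100):
    """Decompose the task, then run sub-agents concurrently up to a fixed limit."""
    subtasks = decompose(task)
    limit = asyncio.Semaphore(max_parallel)

    async def run(agent_id, subtask):
        async with limit:
            return await sub_agent(agent_id, subtask)

    # gather returns results in the same order the subtasks were submitted
    return await asyncio.gather(*(run(i, st) for i, st in enumerate(subtasks)))


if __name__ == "__main__":
    example_task = {
        "goal": "summarize-chip-news",
        "steps": [
            {"tool": "search", "input": "Maia 200"},
            {"tool": "run_code", "input": "print(1)"},
        ],
    }
    for result in asyncio.run(orchestrate(example_task)):
        print(result)
```

The semaphore caps how many sub-agents run at once, which is one simple way to keep a large fan-out from overwhelming shared resources. Message passing and shared memory between agents are left out of the sketch for brevity.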

On software engineering (SWE) benchmarks, K2.5 reaches the level of Gemini 3 Pro, and it holds a leading position across several other benchmarks and agent evaluations.
Four Usage Modes
K2.5 currently offers four usage modes: Fast mode, Thinking mode, Agent mode, and Cluster Agent mode. The Cluster Agent mode is only available to paid users. Both the Kimi web version and app are now officially available, and the older K2 model is still retained for users to choose from.

Microsoft Maia 200 Custom AI Chip Officially Deployed
Microsoft officially launched its self-developed AI accelerator chip Maia 200, manufactured using TSMC's 3-nanometer process technology. According to Microsoft, the chip achieves industry-leading performance on several key metrics and has already begun deployment in Microsoft's data centers.
This marks an important step for Microsoft toward AI infrastructure autonomy. The wave of self-developed AI chips stems from strategic anxiety over the NVIDIA GPU supply chain: NVIDIA GPUs such as the H100 and H200 are commonly estimated to hold over 80% of the market for AI training and inference, and their high prices and tight supply put enormous cost pressure on cloud computing giants. Google introduced its TPUs (Tensor Processing Units) as early as 2016 and has since iterated through several generations; Amazon AWS introduced the Trainium and Inferentia series; Meta developed the MTIA inference chip. Microsoft's Maia series continues this trend, and Maia 200's use of TSMC's 3-nanometer process puts it among the most advanced commercial manufacturing nodes available today.
The core advantages of self-developed chips include deep customization for proprietary AI workloads (such as attention computation and KV Cache management in Transformer inference), long-term procurement cost reduction through vertical integration, and greater supply chain resilience amid escalating geopolitical risks.
Alibaba Tongyi Qianwen Releases Qwen3 Max Thinking Official Version
On the evening of January 26, Alibaba Tongyi released the official version of Qwen3 Max Thinking. As Alibaba Tongyi's most powerful large language model to date, it posted strong results across multiple benchmarks.

One detail is easy to miss: Qwen3 Max Thinking is a closed-source model. It is currently available on the Qianwen web client, desktop client, and QwenChat, and the API opened at the same time. This forms a layered strategy alongside Alibaba's earlier push for open-source models. Alibaba Tongyi was previously known for its open-source approach, with the Qwen series building a large user base and ecosystem on Hugging Face. Open-source models serve ecosystem building and developer mindshare, while the closed-source flagship serves as the core asset for commercial monetization and technical moat, converging with OpenAI's business logic (closed-source GPT-4 plus an open API).
A "thinking model" here means a model with explicit Chain-of-Thought reasoning capabilities, one that improves accuracy on complex reasoning tasks by performing internal "thinking" steps before generating the final answer. OpenAI's o1/o3 series pioneered this paradigm. The release of Qwen3 Max Thinking indicates that slow-thinking reasoning capability has become a standard competitive dimension for top-tier models.
AI Leaderboard Updates
In the usage rankings, Sonnet 4.5 quickly rose to first place on Monday, a "Monday surge to the top" pattern that has persisted for some time. In image generation, Gemini holds a dominant position, and the only Chinese model on that leaderboard is Qwen3 VL, ranked ninth. The AI model performance leaderboard has seen no significant changes recently.
Summary
Today's AI developments reveal several noteworthy trends: first, visual understanding models are evolving from CLIP dependency toward deeper language model-driven approaches; second, agents are developing from single-agent to multi-agent cluster collaboration; third, tech giants are accelerating self-developed AI chips to build differentiated competitive advantages. These changes collectively point in one direction—AI is moving from single-point capability breakthroughs toward systematic ecosystem construction.
Key Takeaways
- DeepSeek OCR2 replaces CLIP with a large language model as the visual encoder, surpassing Gemini 3 Pro with the same token count on OmniDocBench, with potential to evolve into a universal multimodal encoder
- Kimi K2.5 officially launches with Cluster Agent mode capable of dispatching 100+ sub-agents and 1,500 tools, reaching Gemini 3 Pro level on SWE benchmarks
- Microsoft launches its self-developed Maia 200 AI chip on TSMC's 3nm process, now deploying in data centers—an important step for cloud giants to reduce NVIDIA dependency and build supply chain resilience
- Alibaba Tongyi releases the closed-source Qwen3 Max Thinking official version as its most powerful language model, signaling that slow-thinking reasoning has become standard for top-tier models
- The AI industry is moving from single-point breakthroughs toward systematic ecosystem construction encompassing multimodal fusion, multi-agent collaboration, and self-developed chips