StepFun STEP3.7 Flash Tops AA Benchmark — Multimodal Reasoning Speed Takes Off
STEP3.7 Flash tops AA benchmark while AI safety, embodied intelligence, and infrastructure see major advances.
StepFun's STEP3.7 Flash claims #1 on Artificial Analysis in speed, cost-efficiency, and end-to-end multimodal reasoning. Meanwhile, 67 AI leaders including Altman, Amodei, and Hassabis jointly call for mandatory synthetic DNA screening legislation. Kairos Homeworld launches 300K real-home digital twins for robot training, and Huawei Cloud introduces Agentic Infra with 2001 PFLOPS and sub-10ms token latency.
Core Event: STEP3.7 Flash Tops the AA Benchmark
StepFun's newly released STEP3.7 Flash large model has topped the AA Benchmark (Artificial Analysis), claiming first place across three dimensions: speed, cost-effectiveness, and end-to-end multimodal performance. The model's popularity ranking on OpenRouter has risen to second globally, a sign of the strong momentum of Chinese-developed large models in the open-source community.
Artificial Analysis (AA Benchmark) is an independent AI model benchmarking platform that evaluates large language models from a practical usage perspective along three core dimensions: inference speed (tokens per second), quality (a composite score based on multiple standard test sets), and price (cost per million tokens). Unlike academically oriented benchmarks such as MMLU and HumanEval, the AA Benchmark is more closely aligned with developers' real-world usage scenarios, which gives it high reference value in the engineering community. OpenRouter is a unified AI model API routing platform where developers can access hundreds of different large models through a single interface; its popularity rankings reflect global developers' actual usage preferences and each model's market competitiveness.
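As a rough illustration of how the speed and price dimensions translate into something a developer feels, the sketch below turns tokens per second and cost per million tokens into a generation time and a bill for one response. The `ModelProfile` class, the `estimate` function, and every number in it are hypothetical; they are not measurements of STEP3.7 Flash or figures published by Artificial Analysis.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Hypothetical per-model figures, in the units the benchmark reports."""
    name: str
    tokens_per_second: float       # output speed
    usd_per_million_tokens: float  # output price


def estimate(profile: ModelProfile, output_tokens: int) -> tuple[float, float]:
    """Return (seconds to generate, cost in USD) for a given output length."""
    seconds = output_tokens / profile.tokens_per_second
    cost = output_tokens / 1_000_000 * profile.usd_per_million_tokens
    return seconds, cost


# Illustrative numbers only; they do not describe any real model.
example = ModelProfile("example-model", tokens_per_second=200.0,
                       usd_per_million_tokens=0.50)
seconds, cost = estimate(example, output_tokens=2_000)
print(f"{seconds:.1f} s, ${cost:.4f}")
```

Quality is left out on purpose: it is a composite score and cannot be derived from the other two numbers.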
Most impressive is its real-time multimodal interaction capability: in actual testing, STEP3.7 Flash can simultaneously observe a flight simulator's instrument panel and joystick display and give real-time guidance to a user flying the aircraft. Its end-to-end multimodal reasoning is, quite literally, "fast enough to take flight."
End-to-end multimodal refers to a model that processes text, images, audio, video, and other input modalities within a unified architecture and directly outputs responses, rather than chaining multiple independent modules together. Traditional approaches typically require a vision model to first recognize image content and then pass the recognition results as text to a language model; this pipeline architecture introduces additional latency and information loss. An end-to-end architecture integrates perception and reasoning, which substantially reduces response latency and makes real-time interaction possible.
The model is now open-source, and developers can download and try it directly.
AI Safety: Three Industry Giants Issue Rare Joint Call for Legislation
Sam Altman (OpenAI), Dario Amodei (Anthropic), Demis Hassabis (Google DeepMind), and 67 other leaders from the tech and safety communities — normally fierce competitors — have co-signed an open letter jointly calling on the U.S. Congress to legislate mandatory screening of all synthetic DNA orders, to prevent AI from being used to create lethal biological weapons.
This call comes against the backdrop of an increasingly serious security situation in synthetic biology. Synthetic biology allows researchers to customize arbitrary gene sequences through DNA synthesis companies. Currently, there are hundreds of DNA synthesis service providers worldwide — some have voluntarily joined the International Gene Synthesis Consortium (IGSC) screening system, but many suppliers still do not perform rigorous customer identity verification or sequence hazard screening. As AI large models rapidly improve their capabilities in biology, theoretically anyone could use AI to design gene sequences of dangerous pathogens and then obtain physical DNA through unregulated synthesis channels. The core demand of this open letter is to upgrade screening from industry self-regulation to legally mandated requirements, closing regulatory loopholes.
The open letter has sparked widespread controversy. Nearly 75% of online comments are negative, with many viewing it as a lobbying effort by major companies to build industry barriers. However, the fact that these "sworn enemies" can sit together and reach consensus indicates that the biosecurity threats posed by AI have already crossed an industry-recognized red line. Regardless of motivation, safety regulation in synthetic biology is indeed an urgent real-world problem that needs to be addressed.
Embodied Intelligence: 300,000 Residential Units Become Robot Training Grounds
Da Xiao Robotics, in collaboration with MMLab at The Chinese University of Hong Kong, has released Kairos Homeworld — the world's first embodied AI simulation environment that digitally reconstructs 300,000 real Chinese residential floor plans at a 1:1 scale.
Embodied AI refers to AI systems that learn and execute tasks through physical bodies interacting with the real world. Unlike purely language-based or image-based AI, it needs to understand physical laws, spatial relationships, and dynamic interactions. The core bottleneck in training embodied AI is data acquisition: having real robots learn by repeated trial and error in real environments is extremely costly and inefficient. Training in high-fidelity simulation and then transferring to hardware (Sim-to-Real) has become the mainstream solution: policies are trained at scale in virtual worlds first, then transferred to real robots. Kairos Homeworld's breakthrough is that its scenes are precise digital twins of real residences rather than simplified, hand-designed scenarios, which substantially narrows the "Domain Gap" between simulation and reality and makes trained policies easier to generalize to real home environments.
The level of detail in this training ground is remarkable: from 30-square-meter studio apartments to large open-plan units, the material, density, and friction coefficient of every object in the scene have been physically modeled. A single natural language instruction is enough to generate the corresponding household task training scenario. This is essentially building robots a massive-scale "move-in ready" model home, poised to significantly accelerate the training iteration speed of household robots.
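To make the per-object physical modeling concrete, here is a minimal sketch of what one object record might look like and how its density and friction coefficient feed into quantities a simulator needs. The `SceneObject` schema and the sample values are assumptions for illustration; the article does not describe Kairos Homeworld's actual data format.

```python
from dataclasses import dataclass

G = 9.81  # standard gravity, m/s^2


@dataclass
class SceneObject:
    """Hypothetical record for one simulated household object."""
    name: str
    material: str
    density_kg_m3: float
    volume_m3: float
    friction_coefficient: float  # static friction against the floor

    @property
    def mass_kg(self) -> float:
        # Mass follows from density and volume.
        return self.density_kg_m3 * self.volume_m3

    def min_push_force_n(self) -> float:
        """Horizontal force needed to start sliding on a level floor (mu * m * g)."""
        return self.friction_coefficient * self.mass_kg * G


# Illustrative values, not taken from any real scene.
chair = SceneObject("chair", "oak", density_kg_m3=700.0, volume_m3=0.01,
                    friction_coefficient=0.4)
print(f"{chair.mass_kg:.1f} kg, {chair.min_push_force_n():.1f} N to start sliding")
```

A policy trained against wrong values for these properties would learn to push too hard or too softly, which is one way the simulation-to-reality gap shows up in practice.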
Infrastructure: Huawei Cloud Releases New Agentic Infra Paradigm
At the Huawei Cloud Inspire Creator Conference, Huawei Cloud officially introduced the Agentic Infra paradigm, releasing a series of products including unified training-inference infrastructure and ASS Zero-Zone computing clusters.
Agentic Infra is Huawei Cloud's infrastructure paradigm designed for the AI Agent era. The key difference between AI Agents and traditional large model calls is that Agents need to perform multi-turn reasoning, tool invocation, and environment interaction — a single task may generate tens or even hundreds of thousands of tokens in intermediate reasoning processes, placing demands on infrastructure throughput and latency far exceeding traditional Q&A scenarios. Unified training-inference refers to flexibly scheduling training and inference workloads within the same cluster, avoiding idle waste of computing resources — in traditional setups, training clusters and inference clusters are often deployed independently, resulting in one being underutilized when the other is busy.
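The idea behind unified training-inference can be sketched as a toy allocation rule: inference takes the GPUs it currently needs from a shared pool, and everything left over goes to training. The `split_gpus` function, its parameters, and the 1,000-GPU pool are hypothetical and deliberately simplistic; Huawei Cloud's actual scheduler is not described in the announcement.

```python
def split_gpus(total_gpus: int, inference_demand: int,
               min_training: int = 0) -> dict[str, int]:
    """Give inference what it currently needs; hand every remaining GPU to training.

    inference_demand: GPUs needed to serve current traffic (assumed known).
    min_training: GPUs reserved so training is never starved completely.
    """
    if min_training > total_gpus:
        raise ValueError("min_training exceeds cluster size")
    # Inference is capped so the training reservation is always honored.
    inference = min(max(inference_demand, 0), total_gpus - min_training)
    training = total_gpus - inference
    return {"inference": inference, "training": training}


# Peak traffic: inference takes most of a hypothetical 1,000-GPU pool.
peak = split_gpus(1000, inference_demand=800, min_training=100)
# Quiet hours: unused inference capacity flows back to training.
quiet = split_gpus(1000, inference_demand=150, min_training=100)
print(peak, quiet)
```

With a fixed 500/500 split between separate clusters, the quiet-hours case would leave 350 inference GPUs idle; in the shared pool they are assigned to training instead.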
Key technical specifications include:
- Support for 100,000-GPU scale computing power
- Token generation latency compressed to under 10 milliseconds
- Computing power reaching 2001 PFLOPS (2001 quadrillion floating-point operations per second)
This means Huawei Cloud is laying the infrastructure foundation for the Agent era at the level of core underlying technology. A computing scale of 2001 PFLOPS combined with token generation latency under 10 milliseconds means the cluster can support large numbers of Agents concurrently executing complex task chains. For developers who need large-scale token generation, latency at this level amounts to a true "Token Factory" experience.
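A quick back-of-envelope calculation shows what a 10-millisecond figure implies for a single Agent. The sketch assumes the latency is the time per generated token on one sequential stream, which the announcement does not spell out, and it uses a 100,000-token task as an illustrative size.

```python
def tokens_per_second(latency_ms_per_token: float) -> float:
    """Per-stream decoding rate implied by a fixed per-token latency."""
    return 1000.0 / latency_ms_per_token


def generation_seconds(total_tokens: int, latency_ms_per_token: float) -> float:
    """Wall-clock time for one sequential stream to emit total_tokens."""
    return total_tokens * latency_ms_per_token / 1000.0


rate = tokens_per_second(10.0)                   # tokens per second at 10 ms/token
task_time = generation_seconds(100_000, 10.0)    # seconds for a 100,000-token task
print(f"{rate:.0f} tokens/s, {task_time / 60:.1f} minutes")
```

Under these assumptions one stream produces 100 tokens per second, so a 100,000-token Agent task still takes roughly 17 minutes of pure generation. Aggregate throughput therefore depends on how many such streams the cluster can run concurrently, not on per-token latency alone.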
Application Ecosystem: WPS Notes and Tencent Enterprise AI Both Making Moves
WPS AI-Native Notes
Kingsoft Office has officially launched WPS Notes, an AI-native multimodal note-taking product that supports voice, image, webpage, and other multimodal content input. Its core differentiation lies in deeply embedding AI throughout the entire pipeline of understanding, organizing, searching, and reusing — snap a photo of a discussion whiteboard or record a meeting audio clip, and the system automatically summarizes, extracts, and organizes it into structured notes.
Notably, AI-Native products are fundamentally different from traditional products with "added AI features": the former is designed from the ground up with AI as the core engine, with all interaction flows built around AI capabilities; the latter simply layers AI assistance onto existing products. WPS Notes belongs to the former — it's not a traditional note-taking app with an AI summary button added, but rather the entire information input, organization, and retrieval flow is AI-driven, with every user input being understood and structured by AI in real time.
Tencent WorkBody Enterprise Edition
Tencent Cloud has released the WorkBody Enterprise Edition and the Agent Suite office intelligence toolkit, providing 24/7 online "expert digital employees" that integrate with Tencent Docs and cloud storage, supporting human-AI collaborative team modes. This marks AI tools' official entry into the "deep waters" of enterprise-level AI collaboration, moving beyond just "helping individuals be more productive."
Enterprise-level AI collaboration is called "deep waters" because it requires solving systemic problems far more complex than personal tools: multi-level permission management to ensure data security, deep integration with existing enterprise IT systems, consistency and traceability of AI outputs in multi-person collaboration scenarios, and the construction and maintenance of enterprise private knowledge bases. These challenges make the deployment cycle and technical threshold of enterprise-level AI products far higher than consumer-facing AI tools.
Quick Updates
- Kuaishou Kling AI Second Anniversary: Global users surpass 100 million, with nearly 50,000 enterprise clients, firmly positioned as a leader in the video generation track
- Bilibili AI Creation Open Competition: Launching under the banner of "China's Build in Public," with no age or technical background restrictions — non-developers account for 60% of registrations, with prize selections determined by user coin votes
- Tsinghua × BAAI Published in Science: The Brain multimodal foundation model for neuroscience successfully reveals the neural mechanisms by which memory reactivation during sleep regulates sleep dynamics. By integrating data from multiple brain imaging modalities including fMRI (functional magnetic resonance imaging) and EEG (electroencephalography), the model constructs a unified brain activity representation space, validating the long-standing neuroscience hypothesis that the brain "replays" waking experiences during sleep to consolidate memories, and further discovering that this reactivation process actively regulates transitions between sleep stages. This work not only advances understanding of human cognitive mechanisms but also provides inspiration for developing AI systems that more closely resemble how the human brain learns.
Summary
From STEP3.7 Flash's performance breakthrough to Huawei Cloud's infrastructure upgrade, from embodied intelligence training ground innovation to the deployment of enterprise-level AI collaboration tools, the AI industry is showing a clear trend: speed and scale are becoming the new competitive dimensions. Whether it's millisecond-level model inference responses, the digital reconstruction of 300,000 residential units, or the computing power stacking of 100,000-GPU clusters, "fast" and "large" are redefining the boundaries of what's possible with AI applications.