AI Daily: Claude Autonomous Tasks Exceed 16 Hours, GPT 5.5 Proves Mathematical Theorems

Multiple AI breakthroughs on May 10, 2025 reshape autonomy, math reasoning, and workforce dynamics.
Overview
On May 10, 2025, the AI industry saw a flurry of major announcements: Anthropic's Claude model broke the 16-hour barrier on autonomous tasks, GPT 5.5 Pro helped a Fields Medal winner complete a mathematical proof, DeepSeek accelerated its multimodal roadmap, Baidu officially released ERNIE 5.1, and Cloudflare laid off 20% of its workforce due to AI-driven efficiency gains. This article walks through each of these key developments with in-depth analysis.

Claude Mythos Preview: Autonomous Task Capability Enters the "Overnight" Tier
Anthropic employee Alex Albert revealed that the evaluation organization METR recently conducted a risk assessment on an early version of Claude Mythos Preview. Across a suite of 228 tasks, METR estimated the model's 50% time horizon at over 16 hours, more than double that of the previous best model. The 50% time horizon is the task length, measured by how long the task takes a skilled human, at which the model succeeds about half the time.
The significance of this figure is that frontier models are leaping from "multi-hour" tasks to "overnight" tasks. When a model can complete, unsupervised, work that would occupy a human for more than half a day, unattended autonomous deployment starts to look practical. METR did note that the suite contains relatively few tasks exceeding 16 hours, which introduces statistical uncertainty in this range, but the trend itself is hard to ignore.
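To make the "50% success rate at a given task duration" figure concrete, here is a minimal sketch of how such a number can be estimated: fit a logistic curve of success against the logarithm of task length, then solve for the length at which predicted success is 50%. This follows the general shape of time-horizon estimates as publicly described, but it is not METR's code; the function names and the toy results below are invented for illustration.

```python
import math

def fit_logistic(xs, ys, lr=0.5, steps=5000):
    """Fit p(success) = sigmoid(a + b * x) by gradient ascent on the log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def time_horizon_50(task_hours, successes):
    """Return the task length (hours) at which the fitted success rate is 50%."""
    logs = [math.log2(h) for h in task_hours]
    mean = sum(logs) / len(logs)
    # Centering the inputs keeps the simple gradient ascent well conditioned.
    a, b = fit_logistic([x - mean for x in logs], successes)
    if b >= 0:
        raise ValueError("success rate does not fall as tasks get longer")
    # sigmoid(a + b * x) equals 0.5 exactly where a + b * x = 0.
    return 2 ** (mean - a / b)

# Invented toy results (NOT METR data): (task length in hours, 1 = success, 0 = failure).
TOY_RESULTS = [
    (2, 1), (2, 1), (4, 1), (4, 1), (8, 1), (8, 0), (16, 1), (16, 0),
    (32, 1), (32, 0), (64, 0), (64, 0), (128, 0), (128, 0),
]

hours = [h for h, _ in TOY_RESULTS]
outcomes = [s for _, s in TOY_RESULTS]
horizon = time_horizon_50(hours, outcomes)
print(f"Estimated 50% time horizon: {horizon:.1f} hours")
```

The toy data is symmetric around 16 hours on a log scale, so the fitted curve crosses 50% at about 16 hours. It also shows why sparse data matters: with only a handful of tasks near or beyond the horizon, flipping one or two outcomes can move the estimate noticeably.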
For enterprise users, this means future AI agents could genuinely handle workflows that require extended autonomous operation, such as code review, data pipeline monitoring, document generation, and complex multi-step research tasks.
GPT 5.5 Pro: Completing in One Hour What Took Mathematicians Weeks
University of Cambridge professor and Fields Medal laureate Timothy Gowers revealed on his personal blog that he used the yet-to-be-publicly-released GPT 5.5 Pro to complete, within one hour, a mathematical proof on upper bound estimates for sumset diameters. Previously, only an exponential upper bound had been proven for this problem; over multiple interactive iterations, the model improved the bound to a polynomial one.

An MIT student who reviewed the proof described it as "logically rigorous and cleverly conceived" and judged it to be work that would take a human mathematician weeks of intensive effort. However, two key caveats should be noted: first, the model has not yet been publicly released; second, the proof process involved interactive human guidance, so independent end-to-end reproduction is not yet possible.
Nevertheless, this case marks a shift in what large language models contribute to mathematical reasoning, from "computational assistance" to "collaborative creation." If GPT 5.5 Pro can consistently reproduce similar performance after its official release, it will have profound implications for how mathematical research is done.
DeepSeek: Multimodal Beta Opens Up, V4.1 Aims for Full-Modality Coverage
DeepSeek has recently expanded beta access for its image mode significantly, and most accounts can now upload images in the chat interface for semantic understanding and cross-media interaction. According to multiple media reports, DeepSeek plans to launch the V4.1 model in June, adding image and audio processing capabilities to achieve full-modality coverage. The new version is also expected to support the Model Context Protocol (MCP), enabling integration with enterprise and toolchain ecosystems.
The image mode is still in beta, and the exact capability boundaries and stability remain to be seen after the official release. But from a strategic perspective, DeepSeek is accelerating its transformation from a "text-only powerhouse" to a "full-modality platform" — converging with the roadmaps of international players like OpenAI and Google.
Baidu ERNIE 5.1: Two-Thirds Parameter Reduction, Cost Down to 6%
Baidu officially launched ERNIE 5.1, now available on the Qianfan Model Plaza and ERNIE Bot. The new version employs multi-dimensional elastic pre-training technology, compressing total parameters to roughly one-third of version 5.0, while reducing pre-training costs to just 6% of comparable industry models at the same scale.

On the LMSYS Arena search leaderboard, ERNIE 5.1 ranks first among Chinese models and fourth globally. It's worth noting, however, that search capability benchmarks differ from real-world search experiences, and the stability and concurrent throughput of its API service still require further developer testing and feedback.
The core highlight of ERNIE 5.1 isn't about being "bigger" — it's about being "more efficient." As large model competition enters its second half, parameter efficiency and inference cost are becoming the key competitive dimensions. Baidu's move signals that China's leading AI companies have shifted from "stacking parameters" to a pragmatic approach of "cutting costs and boosting efficiency."
Cloudflare Lays Off 20%: The Chain Reaction of AI Replacing Human Labor Has Begun
Cloudflare announced it would cut approximately 20% of its workforce, totaling around 1,100 employees. CEO Matthew Prince stated explicitly that the layoffs were not due to declining performance, but rather because AI technology has dramatically improved operational efficiency, enabling the company to operate with a leaner structure while revenue hits all-time highs.

Data shows that internal AI usage at Cloudflare grew by 600% over the past three months. This is a signal worth paying close attention to: when leading tech companies formally incorporate "AI replacing human labor" into their financial cost-accounting logic, it could trigger a chain reaction across the enterprise services sector. Other company executives will start asking — if Cloudflare can use AI to reduce headcount by 20%, can we do the same?
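The percentages above are easy to misread, so here is a small arithmetic sketch of what they imply. It assumes the reported figures are rounded and that "grew by 600%" means an increase of 600% over the starting level; the helper names are illustrative only.

```python
def implied_total(affected: int, share_percent: float) -> float:
    """Total headcount implied when `affected` people are `share_percent` of staff."""
    return affected * 100 / share_percent

def growth_multiplier(growth_percent: float) -> float:
    """Convert 'grew by X%' into a multiple of the starting level."""
    return 1 + growth_percent / 100

total_before = implied_total(1100, 20)   # about 1,100 people reported as about 20%
usage_multiple = growth_multiplier(600)  # growth of 600% over three months

print(f"Implied headcount before the cut: about {total_before:.0f}")
print(f"AI usage relative to three months ago: {usage_multiple:.0f}x")
```

Under these assumptions, the layoffs imply a pre-cut workforce of roughly 5,500, and internal AI usage now stands at about seven times its level three months earlier, not six.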
Other Noteworthy Developments
StepFun's Voice Model Ranks Third Globally
On the voice arena blind-test leaderboard by third-party evaluation firm Artificial Analysis, StepFun's Step Audio 2.5 TTS model ranked third globally, making it the highest-ranked Chinese speech synthesis model on the list. The leaderboard is based on real-user double-blind preference voting, reflecting community recognition of the model's naturalness and emotional expressiveness.
Ant Group's Trillion-Parameter Model Appears on Hugging Face
According to posts in the Reddit community, Ant Group Research Institute has published a trillion-parameter model called Ling2.6-1T on Hugging Face. The model is reportedly optimized for inference efficiency and token costs in real-world complex scenarios, and is particularly suited for coding and everyday agent workflows. However, there has been no official confirmation yet, and actual performance awaits weight verification and independent testing.

Open-Source Tool: GPT Image 2 PPT Skills
Community developer Junyao open-sourced the GPT Image 2 PPT Skills project on GitHub, leveraging the GPT Image 2 model to convert text or templates directly into visually striking slides. The project currently relies heavily on the specific model's image understanding capabilities, and text rendering accuracy and layout formatting still require manual verification.
Oncology Decision AI System
The Onco Agent research team published a technical preprint on Hugging Face, showcasing an oncology decision system combining a dual-model architecture, eight-node topology, and hierarchical retrieval-augmented generation. Since it's based on synthetic data and has not undergone large-scale clinical blind review, it currently serves primarily as an engineering reference for multi-agent safety architectures and cannot be directly applied to actual medical diagnosis.
Summary and Outlook
Today's AI industry developments reveal several clear trends: the time scale of model autonomy is extending (Claude 16+ hours), mathematical reasoning capabilities are approaching expert level (GPT 5.5 Pro), full-modality coverage is becoming standard (DeepSeek V4.1), parameter and cost efficiency are overtaking raw scale as the competitive focus (ERNIE 5.1), and AI's impact on the job market has shifted from prediction to reality (Cloudflare's 20% layoff).
For practitioners, tracking the expansion of model capability boundaries is certainly important — but the more worthwhile question is: when AI can work autonomously for 16 hours and assist in completing top-tier mathematical proofs, how do we need to adjust our own work methods and value positioning?