#Model Evaluation

7 related articles

June 3, 2026 · 1 min

Claude Haiku 4.5 Real-World Test: Failed All 5 Programming Tasks

Testing Claude Haiku 4.5 on 5 visual programming tasks including 3D modeling and physics simulation reveals systematic failures in reasoning, instruction following, and code quality.

Tech Frontiers

June 3, 2026 · 1 min

Gemini 3.5 Flash Achieves a Massive Leap on the GDPval Benchmark

Google Gemini 3.5 Flash surpasses Gemini 3.1 Pro on the GDPval benchmark. The lightweight Flash model leverages post-training techniques to approach frontier-level performance, redefining the balance between quality and cost.

Tech Frontiers

June 3, 2026 · 1 min

Gemini 3.5 Flash Tops the Vending Bench Cost-Efficiency Frontier

Google Gemini 3.5 Flash achieves cost-intelligence Pareto optimality on Vending Bench. Analysis of the benchmark methodology, Pareto Frontier implications, and practical significance for AI developers.

Product Reviews

June 3, 2026 · 2 min

GPT-5.5 vs DeepSeek-V4: Who Wins in a Four-Round Head-to-Head Test?

GPT-5.5 vs DeepSeek-V4 in four comprehensive rounds covering world knowledge, context memory, logical reasoning, and coding — a detailed comparison of real performance differences.

Tech Frontiers

June 3, 2026 · 3 min

Gemini 3.2 Pro Leaked Tests Disappoint, GPT-5.6 Already in Internal Testing

Gemini 3.2 Pro leaked tests show mediocre results with minor SVG improvements but weak UI. GPT-5.6 enters internal testing while Claude's new preview achieves breakthrough cybersecurity performance.

Product Reviews

June 2, 2026 · 2 min

GPT-5.5 After 3 Weeks of Real-World Testing: Does It Really Crush Opus 4.7 at Coding?

EVERY team tested GPT-5.5 for 3 weeks using SABench. GPT-5.5 scored 62.5 vs Opus 4.7's 33 in coding execution, but the best workflow combines Opus for planning with GPT-5.5 for execution.

Product Reviews

May 27, 2026 · 3 min

Claude 4.5 vs Gemini 3 Pro: A Comprehensive Coding Showdown

In-depth comparison of Claude 4.5 vs Gemini 3 Pro across five benchmarks including ARC-AGI-V2, SWE-Bench, and Terminal Bench 2.0, revealing their real coding and reasoning strengths.