Gemini 3.5 Flash Falls Flat: Great Benchmarks, Terrible Real-World Performance, and a Buggy CLI

A Reality Check After Google I/O: Gemini 3.5 Flash's True Performance

Google I/O just wrapped up, with Gemini 3.5 Flash unveiled as a flagship release boasting impressive benchmark numbers. After getting early access, however, well-known tech creator Theo delivered a starkly different verdict: the model performed terribly in real-world coding tasks, its companion CLI tool was riddled with bugs, and Google Cloud managed to ban a major customer on the same day. Even more disheartening, in his view, Google's talented open-source teams were sidelined in favor of a closed-source product whose demo couldn't even hide its plagiarism.

Gemini 3.5 Flash: Great Benchmarks, but a 20x Price Hike

On paper, Gemini 3.5 Flash looks impressive. It outperforms Gemini 3.1 Pro on nearly every benchmark, scores just behind GPT-5.5 on Terminal Bench, and achieves state-of-the-art results across multiple dimensions including Toolathlon, financial agents, and reasoning. Artificial Analysis's intelligence index shows it leading the pack in speed-to-performance ratio, with a generation speed approaching 300 tokens/second.

AI benchmarks have systemic limitations worth keeping in mind. Mainstream evaluations like MMLU, HumanEval, and Terminal Bench largely measure a model's "test-taking ability" on specific datasets rather than real-world engineering capability. The industry calls the resulting failure mode "benchmark overfitting": models score high because their training was tuned toward the test sets, without a corresponding improvement in generalization. Third-party evaluators like Artificial Analysis add dimensions such as speed and cost, but they still struggle to capture a critical ability: whether a model can self-correct during complex, multi-step real-world tasks. This is a plausible explanation for how Gemini 3.5 Flash can surpass Gemini 3.1 Pro on paper yet fail in practical coding tests. Practical coding demands execution loops, error awareness, and iterative repair, all of which current benchmarks measure poorly.

Figure: Gemini 3.5 Flash achieves state-of-the-art results across multiple benchmarks.

But Google deliberately hid a key detail on the release page: pricing. Not a single dollar sign anywhere. The likely reason is simple: they tripled the price. 3.5 Flash costs $1.50 per million input tokens and $9 per million output tokens. Compared to the previous 3 Flash at $0.50 input and $3 output, that's a 3x increase. Compared to Theo's favorite, 2.0 Flash ($0.10 input, $0.40 output), the increase is 15x on input and 22.5x on output.
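The price multiples can be checked with a few lines of arithmetic. The sketch below uses only the per-million-token prices quoted above; the dictionary layout and the function name `price_multiple` are illustrative choices, not anything from Google's pricing tools.

```python
# Prices in USD per million tokens, as quoted in the article.
PRICES = {
    "3.5 Flash": {"input": 1.50, "output": 9.00},
    "3 Flash": {"input": 0.50, "output": 3.00},
    "2.0 Flash": {"input": 0.10, "output": 0.40},
}


def price_multiple(new: str, old: str) -> dict:
    """Return how many times `old`'s price the `new` model costs, per token type."""
    return {
        kind: round(PRICES[new][kind] / PRICES[old][kind], 2)
        for kind in ("input", "output")
    }


vs_3_flash = price_multiple("3.5 Flash", "3 Flash")
vs_2_flash = price_multiple("3.5 Flash", "2.0 Flash")
print("vs 3 Flash:  ", vs_3_flash)
print("vs 2.0 Flash:", vs_2_flash)
```

Against 3 Flash the multiple is 3x on both input and output. Against 2.0 Flash it is 15x on input and 22.5x on output, so the "20x" figure holds only for output-heavy workloads.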

Even worse is the token efficiency problem. Token efficiency is a critical metric for the actual cost of reasoning models, yet most marketing materials downplay it. Unlike standard language models, reasoning models generate extensive chain-of-thought content before delivering a final answer, and every one of those intermediate reasoning tokens counts toward consumption. As a reasoning model, 3.5 Flash consumed approximately 72 million tokens in Artificial Analysis's benchmarks, while OpenAI's GPT-5.5 Medium used only 22 million, less than a third as many. With a gap of roughly 3.3x, a model can have lower unit pricing and still produce a higher total bill. In practice, 3.5 Flash has become the fourth most expensive model in these benchmarks, costing nearly twice as much as 3.1 Pro.

This reveals a core trade-off in reasoning model design: longer chains of thought typically yield more accurate answers, but also higher latency and cost. A well-designed reasoning model needs to learn "thinking in moderation", converging quickly on simple tasks while going deep on complex ones.
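To see how token volume can outweigh unit price, here is a rough sketch using the two token counts quoted above. It assumes a single blended price per token for each model, which is a simplification: real bills price input and output tokens separately, and the article does not give that split. The function names are hypothetical.

```python
# Total tokens consumed in the benchmark run, as quoted in the article.
TOKENS_GEMINI_35_FLASH = 72_000_000
TOKENS_GPT_55_MEDIUM = 22_000_000


def token_multiple(verbose: int, concise: int) -> float:
    """How many times more tokens the verbose model used."""
    return verbose / concise


def break_even_price_fraction(verbose: int, concise: int) -> float:
    """Fraction of the concise model's blended per-token price below which
    the verbose model's total bill is lower (assumes one blended price)."""
    return concise / verbose


multiple = token_multiple(TOKENS_GEMINI_35_FLASH, TOKENS_GPT_55_MEDIUM)
fraction = break_even_price_fraction(TOKENS_GEMINI_35_FLASH, TOKENS_GPT_55_MEDIUM)
print(f"{multiple:.2f}x the tokens; cheaper overall only below "
      f"{fraction:.1%} of the other model's unit price")
```

Under that assumption, the more verbose model needs a blended unit price below roughly 31% of the other model's before its total bill comes out lower.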

What good is speed? If a model is 2x faster but generates 4x the tokens, the task takes twice as long to complete. Poor token efficiency isn't just a cost issue; it signals a fundamental inability to know when to stop thinking.
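The speed argument is simple division: completion time is tokens generated divided by tokens per second. In the sketch below the baseline figures (10,000 tokens at 150 tokens/second) are hypothetical numbers chosen only for illustration; the 2x and 4x multipliers come from the paragraph above.

```python
def completion_time(tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to generate `tokens` at a constant generation speed."""
    return tokens / tokens_per_second


# Hypothetical baseline model, for illustration only.
BASELINE_TOKENS = 10_000
BASELINE_SPEED = 150.0  # tokens per second

baseline = completion_time(BASELINE_TOKENS, BASELINE_SPEED)
# A model that is 2x faster but emits 4x the tokens for the same task.
fast_but_verbose = completion_time(4 * BASELINE_TOKENS, 2 * BASELINE_SPEED)
slowdown = fast_but_verbose / baseline

print(f"baseline: {baseline:.1f}s, fast but verbose: {fast_but_verbose:.1f}s, "
      f"slowdown: {slowdown:.1f}x")
```

The slowdown depends only on the ratio of the two multipliers (4 / 2), so it stays at 2x whatever baseline is chosen.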

Real-World AI Coding Test: The Only Model That Failed

Theo ran a practical test using his in-development game Fish Slop: he gave each model the original source code and asked it to rewrite the codebase to be more stable and cleaner. The result was shocking — Gemini 3.5 Flash was the only model among all those tested that couldn't get the game running.

The code it produced was broken out of the box, and it never self-checked or ran validation. When Theo asked it to fix the issues, the "fixed" version was even worse: ugly halo effects appeared on screen, fish were too large to interact with, the feeding mechanism didn't work, the aging mechanism didn't work, newly generated images were low quality, and some didn't even have transparency set correctly.

By contrast, GPT-5.5 completed the same task flawlessly. Theo then asked it to convert the game to 3D, and it delivered that too. For a supposedly state-of-the-art model to turn in work like Gemini 3.5 Flash's, Theo called it "a genuine embarrassment."

The core issue, in Theo's assessment, appears to be that Google still hasn't cracked RL (reinforcement learning): the model doesn't know how to check its own work, doesn't know how to self-correct, and just burns tokens aimlessly. This points to a critical technical divide in today's AI coding assistants. Supervised fine-tuning (SFT) teaches a model what to output for a given input, but not how to verify whether that output is correct. Reinforcement learning that rewards outcomes, including execution feedback from actually running code (sometimes called RLEF, Reinforcement Learning from Execution Feedback), can teach a meta-ability: run the code, observe the result, and adjust strategy based on error signals. OpenAI has said its o-series models were trained with large-scale reinforcement learning, and training of this kind is the likely source of that meta-ability. This "think-execute-verify" loop, often called an agent loop and closely related to the ReAct pattern of interleaving reasoning with actions, is plausibly the key to GPT-5.5's ability to complete a complex game rewrite. Gemini 3.5 Flash "just burning tokens aimlessly" is the symptom one would expect from a model with too little execution-feedback training: it hasn't internalized the need to verify that its generated code actually runs.
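A minimal sketch of a think-execute-verify loop looks like the following. It shows the loop at inference time only; it is not the reinforcement-learning training discussed above, and it is not how any particular vendor's agent is implemented. The names `run_candidate`, `verify_loop`, and `stub_model` are hypothetical, and `stub_model` is a stand-in for a real model call.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple


def run_candidate(source: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Execute `source` in a fresh Python process; return (succeeded, stderr)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(source)
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False, "timed out"
        return proc.returncode == 0, proc.stderr


def verify_loop(
    generate: Callable[[str, Optional[str]], str],
    task: str,
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Generate code, run it, and feed any error output back to the generator.

    Returns (working_source, attempts_used), or (None, max_attempts) on failure.
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        source = generate(task, feedback)      # think
        ok, stderr = run_candidate(source)     # execute
        if ok:                                 # verify
            return source, attempt
        feedback = stderr                      # carry the error into the next try
    return None, max_attempts


def stub_model(task: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for a model: the first draft has a bug, and the
    fix only appears once error output has been fed back."""
    if feedback is None:
        return "print(undefined_name)"
    return "print('ok')"


source, attempts = verify_loop(stub_model, "print ok")
print(f"working code after {attempts} attempt(s): {source}")
```

A model that skips the execute and verify steps would return the first draft unchanged, which matches the behavior described above: code that is broken out of the box and never checked.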
