Two Years of AI Video Generation Evolution: From Blurry Otters to Cinema-Grade Complex Narratives

A viral tweet contrasting today's AI video capabilities with the state of the art from two years ago shows how quickly AI video generation has advanced. What was once cutting-edge, a simple scene of an otter using WiFi on a plane, has given way to complex multi-character narratives with fine physical detail. This article examines how diffusion models combined with Transformer architectures drove these breakthroughs, and how user expectations evolved in lockstep with the technology.
A Single Tweet Reveals the Stunning Progress of AI Video Generation
Recently, a tongue-in-cheek tweet sparked heated discussion in the AI community. The poster used an absurd description — "the ketchup blood from the sword wound isn't viscous enough when flying Shakespeare stabs the pizza robot while otters discuss Spirit Airlines" — to respond to critics nitpicking the quality of current AI-generated video.
The tweet may seem nonsensical, but it highlights a crucial fact: the pace of progress in AI video generation far exceeds most people's intuitive understanding.

The poster's core argument rests on a simple baseline for comparison: "an otter using WiFi on an airplane." This seemingly trivial prompt represented the state of the art just two years ago. In AI research, state of the art (SOTA) refers to the best known performance on a specific task at a given time, and the speed at which it changes is itself an indicator of how active a field is.
In video generation, SOTA has moved quickly. In early 2023, generating a roughly 4-second clip was considered cutting-edge. When OpenAI released the Sora demo in early 2024, high-quality video of up to 60 seconds became the new benchmark. By 2025, models from multiple companies could generate long clips with complex narrative structures, multi-character interactions, and physical details such as sword wounds and liquid textures.
Why "Otter Using WiFi" Is a Good Benchmark for AI Video Capabilities
Choosing "an otter using WiFi on an airplane" as the comparison benchmark is a clever move.
It Tests the Comprehensive Capabilities of Video Generation Models
This prompt isn't complex, but generating it correctly requires the model to handle multiple dimensions simultaneously:
- Subject realism: Whether the otter's fur, movements, and expressions look natural
- Scene plausibility: The airplane interior's seats, windows, and lighting
- Conceptual anthropomorphism: Visualizing the abstract behavior of an animal "using WiFi"
- Overall coherence: Whether the image remains stable over time without jittering or warping
In early video generation models, even such a "simple" scene often exhibited typical defects like subject distortion, inter-frame flickering, and objects appearing or disappearing out of nowhere. The root cause is a lack of temporal consistency, one of the most critical technical challenges in video generation. Unlike static image generation, video must keep content across consecutive frames smoothly connected in spatial position, color, and form. The inter-frame flickering common in early models largely came from generating each frame more or less independently, with no global modeling of the temporal dimension.
Technical approaches to this problem include temporal attention layers that let the model relate each frame to the frames before and after it, optical flow estimation to constrain motion coherence, and 3D convolutions or joint spatiotemporal modeling in place of purely 2D generation. Being able to stably generate a scene like "otter using WiFi" is itself a marker of these techniques maturing.
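To make the idea of a temporal attention layer more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the layer used by any particular model: the input layout `(T, N, D)` (frames, spatial locations, feature channels) and the function name `temporal_self_attention` are hypothetical, and the learned query/key/value projections of a real layer are replaced by the identity to keep the example short.

```python
import numpy as np

def temporal_self_attention(frames):
    """Attend across time separately for every spatial location.

    frames: array of shape (T, N, D) = (frames, spatial locations, channels).
    Returns (output, weights) with shapes (T, N, D) and (N, T, T).
    Simplification: queries, keys, and values are the raw features
    (no learned projection matrices), which is an assumption of this sketch.
    """
    T, N, D = frames.shape
    # Group features so each spatial location has its own sequence of T frames.
    x = frames.transpose(1, 0, 2)                    # (N, T, D)
    # Scaled dot-product similarity between every pair of frames.
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(D)   # (N, T, T)
    # Numerically stable softmax over the "attended-to" frame axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output frame is a weighted mix of all frames at that location.
    out = weights @ x                                # (N, T, D)
    return out.transpose(1, 0, 2), weights

rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 6, 8))   # 4 frames, 6 locations, 8 channels
out, weights = temporal_self_attention(frames)
print(out.shape, weights.shape)
```

The point of the sketch is the axis being attended over: every output frame is computed from all frames at the same location, so information about neighboring frames is available when each frame is produced, in contrast to purely per-frame generation.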
From "Watchable" to "Picky": Rising User Expectations for AI Video
The complex scenes mocked in the tweet — flying Shakespeare, pizza robots, ketchup blood from sword wounds — perfectly illustrate how user expectations have risen with the tide. When technology evolves from "barely able to generate an otter" to "capable of generating complex multi-character narratives," the focus of criticism shifts from "whether the subject is coherent" to extremely granular concerns like "whether the physical texture of the ketchup is realistic enough."
Physics simulation capability in video generation refers to whether a model can render behavior consistent with real-world physical laws, such as gravity, fluid dynamics, collision responses, and material deformation. The criticism that "the ketchup blood isn't viscous enough" is essentially a demand for accurate fluid viscosity. Current state-of-the-art video generation models have implicitly learned many physical regularities by training on massive amounts of real video data, but precise rendering of complex fluids, cloth wrinkles, light refraction, and similar phenomena remains an active research direction. That this level of criticism can be raised at all shows that baseline generation quality has crossed an important threshold.
This kind of criticism is itself indirect evidence of technological progress.
What Happened in AI Video Generation Over Two Years
Exponential Improvement in Generation Quality
From 2023 to 2025, the video generation field underwent several important technical iterations. The combination of Diffusion Models and Transformer architecture enabled breakthroughs in temporal consistency, physics simulation, and detail rendering.
Diffusion models are a class of deep generative models that learn to generate content by reversing a gradual noising process. The core idea is borrowed from diffusion processes in thermodynamics: the forward process gradually degrades a clear image into pure noise, while the learned reverse process gradually recovers a meaningful image from noise. DDPM (Denoising Diffusion Probabilistic Models), proposed in 2020, laid the foundation for the paradigm in its current form.
The Transformer architecture was introduced by Google researchers in the 2017 paper "Attention Is All You Need," and its self-attention mechanism can capture dependencies between any positions in a sequence. When the two are combined, as in the DiT (Diffusion Transformer) architecture adopted by OpenAI's Sora, the model can draw on both the high-quality generation of the diffusion process and the Transformer's ability to model relationships across video frames, improving temporal consistency and spatial detail at the same time.
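The forward (noising) half of the diffusion process can be sketched in a few lines. The example below assumes the closed-form DDPM formulation, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise, with a linear noise schedule from 1e-4 to 0.02 over 1000 steps. The function names and the 8x8 toy "image" are hypothetical, and the learned reverse (denoising) network, which is where a Transformer would sit in a DiT-style model, is deliberately left out.

```python
import numpy as np

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances; the linear schedule is an assumption of this sketch."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t directly from x_0 using the closed-form forward process."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

betas = linear_beta_schedule()
# alpha_bar[t] is the cumulative fraction of signal variance left at step t.
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))  # toy stand-in for a clean image

for t in (0, 250, 500, 999):
    x_t, _ = forward_diffuse(x0, t, alpha_bar, rng)
    signal = np.sqrt(alpha_bar[t])
    noise_w = np.sqrt(1.0 - alpha_bar[t])
    print(f"t={t:4d}  signal weight={signal:.3f}  noise weight={noise_w:.3f}")
```

Running it prints how the weight on the original image shrinks toward zero as `t` grows while the noise weight approaches one. Training a diffusion model amounts to teaching a network to predict the added `noise` from `x_t` and `t`, so that the process can be run in reverse at generation time.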
A notable detail: this progress isn't linear but exhibits accelerating characteristics. Early models required months to achieve improvement on a single dimension, while recent model iteration cycles have shortened dramatically, with each update potentially bringing qualitative leaps.
Synchronized Evolution of User Expectations
Accompanying technological progress is the rapid adjustment of user expectations. This is a typical phenomenon in the adoption of new technologies:
- Initially, any AI video that could "move" was astonishing
- In the middle phase, people began focusing on subject realism and coherence
- Currently, criticism focuses on extreme dimensions like physical details, lighting logic, and micro-expressions
This "expectation inflation" is closely related to technology adoption lifecycle theory and the Gartner Hype Cycle. In the hype cycle, early working demonstrations of a new technology drive it to the "Peak of Inflated Expectations"; when the technology fails to meet those expectations, it falls into the "Trough of Disillusionment"; and it finally climbs the "Slope of Enlightenment" to the "Plateau of Productivity" through continuous improvement. AI video generation currently occupies an unusual position: the technology advances so quickly that users' expectation peaks are constantly being reset, forming a pattern of "perpetual ascent." This differs from the rise-then-fall pattern of earlier technology cycles and reflects the unusually fast iteration speed of the AI field.
While this "expectation inflation" puts pressure on developers, it's also a sign of a healthy ecosystem — it drives technology to continuously evolve toward higher standards.
How to Objectively Evaluate the Current Level of AI Video Generation
The original tweet's attitude actually offers a perspective worth considering: when evaluating emerging technologies, the temporal dimension is crucial.
Judging today's output by today's standards alone causes people to overlook the true trajectory of technological evolution. A more constructive approach is to compare current achievements against benchmarks from two years ago, or even six months ago, to truly understand how far this technology has come.
Of course, this doesn't mean criticism has no value. Quite the opposite — it's precisely those "the ketchup blood isn't viscous enough" style nitpicks that constitute the demand signals for continued technological evolution. Criticism and progress form a positive feedback loop here.
Conclusion: Where Is AI Video Headed Next?
This humorous tweet actually encapsulates a profound trend in AI video generation: the speed of technological progress often exceeds our perceptual capacity, and our expectations are being reshaped at an equally rapid pace.
The next time we feel dissatisfied with flaws in an AI-generated video, it might be worth looking back: not long ago, even a scene as simple as "an otter using WiFi on an airplane" sat at the very edge of what the technology could do. In an era where iterations are measured in months, today's "not good enough" is very likely tomorrow's "taken for granted."