ChatGPT vs Gemini vs Claude: Testing Three AI Models Recreating Classic Roblox Games from Scratch

Three AI models recreate classic Roblox games: ChatGPT wins visuals, Claude wins features, Gemini stages a comeback.
A creator spent 6+ hours testing ChatGPT Codex, Gemini 2.5 Flash, and Claude Code by having each recreate a classic Roblox game from scratch. ChatGPT delivered the highest visual fidelity with Murder Mystery 2, Claude produced the most feature-complete MeepCity recreation, and Gemini staged a dramatic comeback after its Canvas mode failed catastrophically. The test highlights that prompt engineering, model/mode selection, and iterative human correction remain essential in AI-assisted game development.
When ChatGPT, Gemini, and Claude are each tasked with recreating a classic Roblox game from scratch, what happens? One creator put them to the test and delivered the answers. This isn't just a trip down memory lane — it's a hardcore head-to-head comparison of today's leading AI coding capabilities.
Test Rules: Three AI Models, Three Classic Roblox Games
The rules for this test were straightforward: each AI model was responsible for recreating one classic Roblox game — not simple text generation, but actually writing runnable game code. Here's the breakdown:
- ChatGPT (Codex): Recreate Murder Mystery 2
- Gemini (Canvas / 2.5 Flash): Recreate Natural Disaster Survival
- Claude (Code mode): Recreate MeepCity
For those unfamiliar with Roblox, some context is helpful. Roblox isn't just a gaming platform; it's a massive user-generated content (UGC) ecosystem with over 70 million daily active users. Its scripting language is Luau, a gradually typed language derived from Lua, and developers create 3D game experiences through Roblox Studio. A native Roblox recreation would therefore have to handle 3D scene construction, physics engine interactions, multiplayer network synchronization, and more, which goes far beyond a simple 2D web game. In this test, however, the creator had each AI generate web-based recreations (using JavaScript/Three.js, etc.), making it a cross-platform technical translation exercise in itself.
The three chosen games each represent distinct challenges: Murder Mystery 2 (launched 2014) is one of Roblox's most popular social deduction games, centered on role assignment and tense chase mechanics; Natural Disaster Survival (launched 2008) is one of Roblox's earliest classics, known for its physics-simulation-driven disaster systems; MeepCity (launched 2016) was the first Roblox game to surpass 1 billion visits, featuring rich social functionality and virtual life simulation. These three games respectively emphasize visual scenes, physics systems, and feature complexity — forming a comprehensive AI capability benchmark.
The creator prepared detailed prompts for each AI, used Gemini to generate nostalgic Roblox-style thematic prompts, and fed in extensive reference screenshots and Wiki materials as support. The entire process took over 6 hours, involving repeated adjustments, bug fixes, and supplementary image references — a substantial workload.
ChatGPT Recreates Murder Mystery 2: Highest Visual Fidelity

ChatGPT used Codex mode for development. Codex is OpenAI's coding agent environment: it can run code in a cloud sandbox, inspect the results, and iterate on corrections autonomously, rather than merely generating static code text. This "write code → run → inspect → modify" closed loop is an advantage in game development tasks that depend on feedback, since the agent can check its output against expectations instead of guessing. The creator provided extensive original Murder Mystery 2 screenshots, including the main menu, lobby, factory map, and other scenes. After repeated iterations, the final result was quite impressive.
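The "write code → run → inspect → modify" loop can be sketched in a few lines. This is only an illustration of the control flow, not how Codex is implemented: `iterate_until_passing`, `toy_generate`, and `toy_inspect` are hypothetical names, and the two toy functions stand in for a model call and a sandbox run.

```python
from typing import Callable, List, Tuple


def iterate_until_passing(generate: Callable[[List[str]], str],
                          run_and_inspect: Callable[[str], List[str]],
                          max_rounds: int = 5) -> Tuple[str, int]:
    """Generate code, run it, and feed observed problems back in until none remain."""
    feedback: List[str] = []
    code = ""
    for round_number in range(1, max_rounds + 1):
        code = generate(feedback)            # write (or rewrite) the code
        feedback = run_and_inspect(code)     # run it and collect problems
        if not feedback:                     # nothing left to fix
            return code, round_number
    return code, max_rounds                  # gave up; last attempt is returned


def toy_generate(feedback: List[str]) -> str:
    # Stand-in for a model call: flips the sign once it is told about the bug.
    return "camera_sign=+1" if "camera inverted" in feedback else "camera_sign=-1"


def toy_inspect(code: str) -> List[str]:
    # Stand-in for running the build and checking the result.
    return ["camera inverted"] if "camera_sign=-1" in code else []


final_code, rounds = iterate_until_passing(toy_generate, toy_inspect)
```

In this toy run the first attempt has the inverted camera, the inspection step reports it, and the second attempt passes. The human corrections described in this article play the same role as `run_and_inspect` whenever the agent cannot detect a problem on its own.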
Highlights: The main menu recreation was remarkably faithful, with visual style nearly identical to the original. Classic elements like the lobby fountain, voting room, and secret doors were all successfully reproduced. Characters even had rare weapons on their backs, and the factory map layout closely matched the real thing. The game also implemented radio functionality, character selection (murderer/sheriff/innocent), and other core mechanics.
Shortcomings were also apparent: the initial version had completely inverted camera controls and character movement, requiring extensive manual correction. The fountain statue recreation fell short, and some character models exhibited the comical bug of "heads and bodies going separate ways." These 3D character rigging and transformation matrix issues are common AI weaknesses when handling spatial geometry — models excel at understanding code logic but still lack intuition for coordinate systems, rotation orders, and parent-child node hierarchies in three-dimensional space.
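The "heads and bodies going separate ways" symptom is typical of a child part positioned without composing its parent's transform. The following is a minimal 2D sketch of that idea, written in Python for brevity (the creator's builds were web-based); the `Node` class is hypothetical and not taken from the generated games.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    """A scene-graph node whose position and rotation are relative to its parent."""
    x: float
    y: float
    angle: float = 0.0                 # radians, counter-clockwise
    parent: Optional["Node"] = None

    def world_transform(self) -> Tuple[float, float, float]:
        """Return (x, y, angle) in world space by composing with all ancestors."""
        if self.parent is None:
            return self.x, self.y, self.angle
        px, py, pa = self.parent.world_transform()
        # Rotate the local offset by the parent's world angle, then translate.
        wx = px + self.x * math.cos(pa) - self.y * math.sin(pa)
        wy = py + self.x * math.sin(pa) + self.y * math.cos(pa)
        return wx, wy, pa + self.angle


body = Node(x=10.0, y=0.0, angle=math.pi / 2)   # body turned 90 degrees
head = Node(x=0.0, y=2.0, parent=body)          # head sits 2 units "above" the body

correct = head.world_transform()                 # head follows the body's rotation
naive = (body.x + head.x, body.y + head.y)       # ignores the parent's rotation
```

Here the composed position places the head at roughly (8, 0), while the naive sum places it at (10, 2). As soon as the body rotates, the two disagree and the head visibly detaches.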

Overall, ChatGPT excelled in visual recreation and scene detail, with high UI accuracy. The creator gave it a decidedly positive evaluation.
Gemini Recreates Natural Disaster Survival: From Disaster to Comeback
Gemini's recreation journey was nothing short of a roller coaster. The creator first tried Gemini's Canvas feature, and the results were catastrophic — abnormal character movement, falling through the ground during earthquakes, and mountains of similar errors piling up. The creator couldn't help but quip: "This is probably the worst game I've ever seen."
Canvas is Google's interactive code generation interface within Gemini, designed to let users preview and edit AI-generated code in real time during a conversation. In this test, however, Canvas showed severe limitations on a complex 3D game project: it seemed unable to maintain a consistent understanding of the overall project architecture across multiple dialogue turns. Fixing one bug often introduced new ones, creating a vicious cycle.

The creator then switched to Gemini 2.5 Flash in high-performance mode, adopting a smarter strategy:
- Introducing a pre-made third-person player control JavaScript script
- Feeding in extensive nostalgic screenshots and Wiki materials
- Iterating adjustments until 4:28 AM
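A third-person control script of the kind mentioned in the first bullet mostly comes down to turning input axes into movement relative to the camera. The sketch below shows that mapping under an assumed axis convention. It is not the creator's script, and `camera_relative_move` is a hypothetical name. Python is used for brevity, although the actual script was JavaScript.

```python
import math
from typing import Tuple


def camera_relative_move(forward_input: float, right_input: float,
                         camera_yaw: float, speed: float = 1.0) -> Tuple[float, float]:
    """Map input axes (-1..1) to a world-space (x, z) velocity.

    Assumed convention: yaw 0 looks along +z, and positive yaw turns toward +x.
    """
    fwd = (math.sin(camera_yaw), math.cos(camera_yaw))
    right = (math.cos(camera_yaw), -math.sin(camera_yaw))
    vx = forward_input * fwd[0] + right_input * right[0]
    vz = forward_input * fwd[1] + right_input * right[1]
    length = math.hypot(vx, vz)
    if length > 1.0:                   # keep diagonal movement from being faster
        vx, vz = vx / length, vz / length
    return vx * speed, vz * speed


straight = camera_relative_move(1.0, 0.0, camera_yaw=0.0)          # moves along +z
turned = camera_relative_move(1.0, 0.0, camera_yaw=math.pi / 2)    # moves along +x
diagonal = camera_relative_move(1.0, 1.0, camera_yaw=0.0)          # normalized
```

Flipping a single sign in `fwd` or `right` produces inverted controls, which is one reason a known-good script is a useful anchor to hand to the model.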
Gemini 2.5 Flash is Google's efficiency-oriented reasoning model, with a context window of up to 1 million tokens and solid long-range reasoning for code generation tasks. Canvas, by contrast, is an interface rather than a model, so the switch changed both the model configuration and the workflow. "High-performance mode" presumably raises the model's computational budget, allowing deeper "thinking" before it generates code. This likely explains why a mode switch within the same brand can produce such a large quality difference.
The comeback results were genuinely impressive. The game opens in the classic 2008 Roblox house, with the spiral staircase, green balloon, and other iconic elements faithfully recreated. Multiple disaster types (flood, earthquake, meteor shower, tsunami) all function properly, and there's even a working chat feature. Classic scenes like the lighthouse map and double-building map were also successfully reproduced. Notably, implementing the disaster system requires handling physics simulation — rising water fluid effects, structural building damage during earthquakes, meteor parabolic trajectories, and collision detection — all direct tests of AI's physics engine comprehension.
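As a rough illustration of the physics involved, a meteor's parabolic path and its ground collision can be simulated with a fixed time step. This is a minimal sketch, not the generated game's code: the gravity constant, the flat ground, and the function name `simulate_meteor` are all assumptions, and Python is used for brevity.

```python
from typing import List, Tuple

GRAVITY = -9.8  # assumed value, in units per second squared


def simulate_meteor(pos: Tuple[float, float], vel: Tuple[float, float],
                    ground_y: float = 0.0, dt: float = 1 / 60,
                    max_steps: int = 10_000) -> List[Tuple[float, float]]:
    """Step a meteor with semi-implicit Euler until it reaches the ground."""
    x, y = pos
    vx, vy = vel
    path = [(x, y)]
    for _ in range(max_steps):
        vy += GRAVITY * dt             # update velocity first
        x += vx * dt
        y += vy * dt
        if y <= ground_y:              # collision check against flat ground
            y = ground_y               # clamp so nothing ends up below the floor
            path.append((x, y))
            break
        path.append((x, y))
    return path


path = simulate_meteor(pos=(0.0, 50.0), vel=(12.0, 0.0))
impact_x, impact_y = path[-1]
```

The clamp in the collision branch is the same kind of guard that stops a character from falling through the ground, the failure seen in the Canvas attempt.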
The creator was particularly fond of that "absolutely classic green balloon," specifically having the AI add a purchasable green balloon feature in the shop — a detail that perfectly captures the power of nostalgia.
Claude Recreates MeepCity: Most Complete Feature Set
Claude was responsible for MeepCity, arguably the most feature-complex of the three games, yet its performance was the most eye-opening.

The creator used Claude's Code mode for development. Claude Code is Anthropic's terminal-based AI coding agent. Unlike ChatGPT Codex's cloud sandbox, it runs directly in the user's local development environment, where it can read the project's file structure, follow inter-file dependencies, and modify source code directly. This "whole-project awareness" gives Claude a structural advantage on projects like MeepCity that involve multiple interconnected subsystems: it can take in large volumes of files and image references and coordinate module development from a global perspective.
Claude successfully recreated an impressive breadth of MeepCity features:
- Main plaza and commercial district: Pet shop, furniture store, home decoration store — all present
- Public facilities: School (with ABC and 1+1=2 teaching content), hospital (where you can lie in bed to rest)
- Fishing system: Complete cast, wait, and hook sequence
- Pet system: Purchasable Meep pets that follow the player (available in red, green, and blue)
- Party room: Disco ball in the center, plus a DJ Meep
- Home decoration system: Arrows in the community point to the player's house, with an openable decoration menu
- NPC interaction and chat: Other players walking around the map and chatting with each other
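A feature like the fishing system's cast, wait, and hook sequence is naturally modeled as a small state machine. The sketch below is one possible shape for it, written in Python for brevity. The state names, the event names, and the `step` function are illustrative assumptions, not the code Claude produced.

```python
from enum import Enum, auto


class FishingState(Enum):
    IDLE = auto()
    CAST = auto()
    WAITING = auto()
    HOOKED = auto()


# Allowed transitions keyed by (current state, event name); names are illustrative.
TRANSITIONS = {
    (FishingState.IDLE, "cast"): FishingState.CAST,
    (FishingState.CAST, "line_landed"): FishingState.WAITING,
    (FishingState.WAITING, "bite"): FishingState.HOOKED,
    (FishingState.WAITING, "cancel"): FishingState.IDLE,
    (FishingState.HOOKED, "reel_in"): FishingState.IDLE,
}


def step(state: FishingState, event: str) -> FishingState:
    """Return the next state, ignoring events that are invalid in this state."""
    return TRANSITIONS.get((state, event), state)


state = FishingState.IDLE
for event in ["cast", "line_landed", "bite", "reel_in"]:
    state = step(state, event)
```

Keeping the transitions in a single table means an out-of-order event, such as a bite before the line is cast, is simply ignored rather than corrupting the sequence.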
From a technical standpoint, MeepCity is hard to recreate because it is essentially a small-scale virtual world simulator. It involves state management (player inventory, currency, pet ownership), NPC behavior logic (patrol and dialogue), UI systems (shop interfaces, decoration menus), and scene management (multiple switchable functional areas), which together span several software engineering sub-domains. Claude's ability to coordinate these systems within a single project suggests real strength in complex architecture design.
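To make the state-management part concrete, here is a minimal sketch of a player record and a shop purchase that keeps currency and pet ownership consistent. The class, the catalogue, and the prices are hypothetical and only illustrate the pattern. Python is used for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PlayerState:
    coins: int = 0
    inventory: List[str] = field(default_factory=list)
    pets: List[str] = field(default_factory=list)


# Hypothetical shop catalogue; item names and prices are made up for illustration.
PET_PRICES: Dict[str, int] = {"red_pet": 100, "green_pet": 100, "blue_pet": 100}


def buy_pet(player: PlayerState, pet_id: str) -> bool:
    """Deduct coins and grant the pet only if every check passes."""
    price = PET_PRICES.get(pet_id)
    if price is None or pet_id in player.pets or player.coins < price:
        return False
    player.coins -= price
    player.pets.append(pet_id)
    return True


player = PlayerState(coins=150)
first = buy_pet(player, "red_pet")     # succeeds, 50 coins left
second = buy_pet(player, "blue_pet")   # fails, not enough coins
```

Routing every purchase through one function that validates before it mutates is what keeps the shop UI, the inventory, and the pet-following logic from drifting out of sync.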
Of course, there were some entertaining bugs: shopkeepers stuck inside tables with distorted proportions, all school chairs placed backwards, and home decoration that actually made things messier. But the overall visual style recreation was quite faithful, and paired with tranquil background music, the nostalgia factor was maxed out.
Comprehensive Comparison of Three AI Game Development Capabilities
| Dimension | ChatGPT (Codex) | Gemini (2.5 Flash) | Claude (Code) |
|---|---|---|---|
| Recreated Game | Murder Mystery 2 | Natural Disaster Survival | MeepCity |
| Visual Fidelity | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Feature Completeness | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Initial Quality | Medium (needs heavy correction) | Terrible → Comeback | Fairly good |
| Debugging Difficulty | Medium | High | Medium |
Based on the test results, ChatGPT excels most in visual recreation, with main menu and scene details that are nearly indistinguishable from the original; Gemini experienced the widest swings, with Canvas mode being virtually unusable but 2.5 Flash delivering a quantum leap in quality; Claude stands out most in feature completeness, capable of handling complex multi-system interactive games.
It's worth noting a methodological limitation of this comparison: the three AIs each recreated different games, and those games inherently differ in recreation difficulty. Murder Mystery 2 emphasizes visual scenes, which plays to ChatGPT Codex's visual feedback strengths; MeepCity emphasizes feature systems, which aligns with Claude Code's whole-project management capabilities. Therefore, this test is better viewed as "a showcase of each AI's performance in its area of strength" rather than a strictly controlled variable experiment.
The Current State of AI Game Development and Practical Tips
This test revealed several important findings:
Prompt engineering is crucial. The creator didn't simply say "make me a game" — they prepared detailed thematic prompts, extensive reference screenshots, Wiki materials, and even pre-made control scripts. This carefully curated "feeding" strategy directly determined output quality.
From a technical perspective, this involves a core mechanism of current large language models: In-Context Learning. When you provide reference screenshots to an AI, multimodal models encode images as visual tokens that combine with text instructions to form the input context. The richer and more specific the reference images, the more accurately the model can understand target style and layout. Wiki materials provide structured game mechanic descriptions, helping the model build a clear functional requirements map. Pre-made control scripts directly reduce task complexity — rather than having the AI reinvent the wheel, you let it creatively assemble on existing foundations. This "decompose complexity, provide anchors" prompting strategy is one of the most effective practices in current AI-assisted development.
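The "decompose complexity, provide anchors" idea can be pictured as assembling a structured prompt from separate pieces. The sketch below is only an illustration of that structure: `PromptBundle`, its field names, and the section labels are assumptions, and no real model API is called.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PromptBundle:
    goal: str
    mechanics_notes: List[str] = field(default_factory=list)    # e.g. wiki excerpts
    screenshot_paths: List[str] = field(default_factory=list)   # attached separately
    starter_code: str = ""                                      # pre-made script

    def render(self) -> str:
        """Build the text portion of the prompt; images are only listed by name."""
        parts = [f"GOAL:\n{self.goal}"]
        if self.mechanics_notes:
            notes = "\n".join(f"- {n}" for n in self.mechanics_notes)
            parts.append(f"GAME MECHANICS (from reference material):\n{notes}")
        if self.screenshot_paths:
            shots = "\n".join(f"- {p}" for p in self.screenshot_paths)
            parts.append(f"REFERENCE SCREENSHOTS (attached):\n{shots}")
        if self.starter_code:
            parts.append(f"BUILD ON THIS EXISTING SCRIPT:\n{self.starter_code}")
        return "\n\n".join(parts)


bundle = PromptBundle(
    goal="Recreate a classic disaster-survival round in the browser.",
    mechanics_notes=["A random disaster starts each round.", "Survivors win the round."],
    starter_code="// third-person controller goes here",
)
prompt_text = bundle.render()
```

Separating the goal, the mechanics, the visual references, and the starter code makes it easy to see which anchor is missing when the output drifts.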
Model selection and mode switching matter enormously. Gemini Canvas's catastrophic failure versus 2.5 Flash's comeback demonstrates that capability differences between modes within the same brand can be night and day. Choosing the right tool matters more than blind usage.
Iterative correction is unavoidable. Even the best-performing AI required substantial manual intervention to adjust initial output. The creator spent over 6 hours on repeated modifications, indicating that AI is still far from "one-click" complete game generation — but as an assisted development tool, it has already demonstrated remarkable potential.
This finding aligns closely with current game industry AI adoption trends. According to 2024-2025 industry surveys, over 50% of game development studios have incorporated AI tools in some capacity, yet virtually none have achieved full-pipeline AI automation. AI's role in game development is shifting from "proof-of-concept toy" to "productivity tool for accelerating prototype development." Traditionally, an independent developer might need weeks to produce a prototype of MeepCity's complexity; with AI assistance, all three recreations in this test took a little over 6 hours in total. Even if the output still needs polish, this efficiency gain already carries substantial commercial value.
Five months ago, the same test would have yielded far inferior results from these AIs. At this rate of progress, the future of AI-assisted game development is worth watching. Particularly as model context windows continue expanding (from 32K to 128K to 1 million tokens), multimodal understanding strengthens, and the Agentic Coding paradigm matures, AI's capability leap from "writing code snippets" to "managing complete projects" is accelerating.