MiniMax M3 Code Test: How Does It Stack Up Against DeepSeek?

Introduction

The battle for programming prowess among Chinese AI models is heating up. After DeepSeek made its mark in code generation, MiniMax recently released its M3 model, also claiming exceptional programming capabilities. So who truly represents the ceiling of domestic coding models? This article puts MiniMax M3 through two real-world tests—HTML game development and backend management system scaffolding—and compares the results head-to-head with DeepSeek.

Test Environment Setup: Cloud Code + MiniMax M3

For this test, we used the Cloud Code plugin in VS Code paired with the MiniMax M3 model for code generation. Cloud Code is an AI programming assistant plugin in the VS Code ecosystem that supports multiple model integrations, allowing developers to call different vendors' LLM APIs through a unified interface for code generation, completion, refactoring, and more. Similar tools include GitHub Copilot, Cursor, Cline, and others that together form the mainstream AI-assisted programming toolchain. Cloud Code's core advantage lies in its open provider architecture—developers aren't locked into a single model and can flexibly switch between different AI backends based on task requirements, which is precisely what makes the side-by-side comparison between M3 and DeepSeek in this article possible.

The configuration process is relatively straightforward:

Log into the MiniMax dashboard to obtain an API Key
Add a new MiniMax provider in Cloud Code, selecting the Hailuo type
Enter the API Key, then select the latest M3 model from the model list
Set both the default model and fallback model to M3

Cloud Code Configuration Interface

You might not have noticed, but MiniMax M3 is a multimodal model—it supports not only pure text generation but also video, images, and other modalities. A multimodal model refers to an AI model capable of simultaneously processing and generating multiple data types (such as text, images, audio, and video). Unlike traditional text-only large language models, multimodal models integrate data representations from different modalities during training, giving them cross-modal understanding and generation capabilities. International models like GPT-4o and Gemini pioneered multimodal capabilities, and MiniMax M3 has followed this technical approach domestically. For programming scenarios, multimodal capability means the model can not only understand requirements described in text but can also potentially use screenshots, design mockups, and other visual inputs to assist code generation—particularly valuable in frontend UI development.

Regarding pricing, M3 offers two modes: monthly subscription (with usage refreshing on a timed window) and pay-per-use (purchasing a fixed Token quota that's consumed until depleted). Developers can choose flexibly based on their usage frequency.

Test One: Generating a Super Mario Game from Scratch

Test Method

The first test is quite hardcore—having M3 generate a Super Mario-style game in pure HTML from just a single prompt. No configuration files, no Cloud Code .md documentation—relying entirely on the model's understanding of the classic "Super Mario" game to recreate the game logic.

Generating a complete game in pure HTML (typically including inline CSS and JavaScript) places extremely high demands on an AI model's capabilities. The model needs to handle game rendering (Canvas API or DOM manipulation), physics engine (gravity, collision detection), game loop (requestAnimationFrame), user input handling (keyboard event listeners), and sound effects (Web Audio API) all within a single file. As a classic platform jumping game, Super Mario's core mechanics include precise jump physics models, terrain collision detection, character state machines (standing, running, jumping, dying), and more—all requiring the model to have deep "understanding" of game development. This test essentially evaluates the model's long code generation ability and complex logic organization capability.

The prompt was extremely concise: "Make an advanced Mario game, a Super Mario game, generate the game file in the root directory."

Generated Results

After a period of analysis and code writing, M3 generated a complete HTML game file in the root directory. The actual running results were impressive:

Visual presentation: Complete game interface with character and scene elements all present
Sound support: Built-in sound effects (synthesized via Web Audio API)
Controls feel: Supports hold-to-charge jumping—the longer you hold, the higher you jump
Core gameplay: Character movement, jumping, and other core mechanics run smoothly

Super Mario Game Result

Of course, the generated game has some shortcomings: it lacks enemy elements like poison mushrooms, has no multi-level system, and the character can't proceed to the next level after dying. But considering this was generated from a single sentence with zero configuration, the overall completeness is quite impressive. According to tester feedback, community members also tried the same task with other models and DeepSeek, but M3's performance was the best.

Test Two: Backend Management System Framework Page

Test Background

The second test is more aligned with real development scenarios—using the exact same prompt to have M3 build a backend management system framework page, then directly comparing it with DeepSeek's previously generated results.

The test project is an npm-based frontend project, starting with an empty project directory running on localhost:3003. The prompt was identical to the one used when testing DeepSeek previously—no modifications whatsoever—ensuring a fair comparison.

DeepSeek's Initial Generated Version

M3 vs DeepSeek: First-Generation Quality Comparison

After receiving the prompt, M3 first performed task analysis and build planning, then progressively executed code generation. The final result outperformed DeepSeek's first-generation version across multiple dimensions:

1. More Accurate Layout Understanding

The prompt explicitly required the left navigation bar to use a card format, and M3 accurately understood this requirement and implemented it correctly. In AI programming, "Prompt Comprehension" is one of the key metrics for measuring a model's practical utility. It refers not only to whether the model can identify keywords in the prompt, but more importantly whether it can accurately grasp the user's implicit intent and domain conventions. For example, "card format" has a clear visual meaning in frontend UI design—an independent visual container with rounded corners, shadows, and padding—rather than simple list items. This ability to map from natural language to precise technical implementation relies on the model's learning of extensive frontend code and UI design patterns during training. DeepSeek, on the other hand, failed to understand this requirement on its first generation, producing a left navigation that was simply a list with icons, requiring subsequent prompt optimization to correct.

M3's Generated Backend Framework Page

2. Proactive Feature Additions

M3 not only completed the basic framework scaffolding but also proactively read the project's system documentation files, adding additional functional modules based on contextual information—demonstrating stronger context comprehension. This proactive use of project context reflects the model's capability in "Agentic Coding"—not just passively executing instructions, but actively exploring the project environment and understanding project conventions to generate code that better fits actual requirements.

3. Detail Comparison

Both models had minor imperfections, such as unnecessary underlines and bullet points in navigation items. But overall, M3 achieved with zero optimization what DeepSeek required multiple rounds of prompt refinement to accomplish. Higher first-generation quality means fewer iteration cycles needed by developers, which directly translates to improved development efficiency.

Key Conclusion

M3's core advantage lies in its "get it right the first time" comprehension ability. With the same prompt, M3's first-generation result closely matches expectations, while DeepSeek requires multiple rounds of iterative optimization.

Token Consumption and Cost-Effectiveness Analysis

Throughout the testing process, M3's Token consumption was surprisingly low. After completing both the Super Mario game and the backend framework page tasks, usage statistics showed almost no significant change. In contrast, many developers have reported that using DeepSeek with Cloud Code is "very Token-hungry."

It's worth explaining the concept of Tokens and their impact on costs. A Token is the basic unit by which large language models process text—one Token corresponds to approximately 3-4 characters in English or 1-2 characters in Chinese. In API calls, Token consumption is divided into input Tokens (the prompts and context sent by the user) and output Tokens (the content generated by the model), with typically different pricing for each. In AI-assisted programming scenarios, since code generation tasks often require passing in large amounts of project context (such as file structures, existing code, configuration files, etc.), Token consumption is far higher than in ordinary conversation scenarios. Therefore, Token efficiency directly impacts developers' usage costs, which is why Token consumption is included as an important evaluation dimension.

Of course, there's a possibility of usage refresh delays, but from the actual experience, M3 does perform well in Token efficiency. For pay-per-use users, this means lower usage costs.

Summary and Recommendations

Through two rounds of real-world testing, MiniMax M3 demonstrated impressive code generation capabilities:

Comparison Dimension	MiniMax M3	DeepSeek
First-generation quality	⭐⭐⭐⭐⭐	⭐⭐⭐
Prompt comprehension	More precise	Requires multiple rounds
Token consumption	Lower	Higher
Multimodal support	Video/image support	Primarily text
Game generation	Best results	Slightly inferior

M3's biggest highlight is its deep prompt comprehension ability, delivering high-quality code on the first generation and significantly reducing the time developers spend repeatedly adjusting prompts. As a model that also supports multimodal inputs, M3 has indeed demonstrated strong competitiveness among domestic programming models.

It's worth noting that evaluating AI programming models is a multi-dimensional problem. This article's tests focus on frontend code generation and prompt comprehension. In other dimensions such as backend logic, algorithm implementation, and code debugging, different models may perform differently. When choosing tools, developers should make comprehensive considerations based on their actual tech stack and use cases.

Recommendations for developers: If you're pursuing a "get it right the first time" code generation experience, M3 is worth trying. If you're already comfortable with DeepSeek's workflow, you can also keep M3 as an alternative, switching flexibly between different scenarios. The landscape of AI-assisted programming tools is evolving rapidly—maintaining an open attitude toward new tools is key to staying ahead in the efficiency race.