DeepSeek V4 Flash MTP Speculative Decoding Real-World Test: A Guide to 20% Faster Local Inference
DeepSeek V4 Flash MTP Speculative Deco…
MTP speculative decoding brings ~20% inference speedup to DeepSeek V4 Flash, with best results in coding scenarios.
MTP (Multi-Token Prediction) is a speculative decoding technique that uses an embedded draft model to predict multiple candidate tokens, which the main model then verifies in parallel batches, transforming inefficient serial decoding into efficient parallel operations. Real-world testing shows approximately 20% speed improvement in code generation (31→37 Token/s), with minimal gains for creative text. Additional memory overhead is about 4GB, and Q4 quantization is recommended. Note that extra floating-point operations may cause subtle differences in reasoning paths.
What is MTP Speculative Decoding
MTP (Multi-Token Prediction) is a speculative decoding strategy. To understand its value, you first need to understand the core bottleneck of large model inference: the speed limitation isn't insufficient compute power, but memory bandwidth — generating each token requires loading hundreds of gigabytes of model weights from memory to the compute units, and this serial process is extremely time-consuming. Speculative Decoding was born to break through this bottleneck, a paradigm proposed almost simultaneously by Google Brain and DeepMind teams in 2023.
The core idea behind MTP is intuitive: a smaller "draft model" speculatively predicts the next several tokens, then the main model verifies whether these tokens are correct in batch (Prefill) mode. Correct predictions are kept; incorrect ones trigger a rollback for regeneration.
This approach works because it transforms inefficient serial decoding into efficient parallel verification. Large model inference has two phases: the Prefill phase processes the input prompt where all tokens can be computed in parallel with very high hardware utilization; the Decode phase generates output tokens one by one, and since each step depends on the previous result, it can only execute serially with hardware utilization typically below 10%. MTP uses the draft model to serially generate 5-10 candidate tokens, then has the main model verify them in parallel batches — essentially merging multiple inefficient serial memory loads into a single efficient parallel operation.
Unlike traditional speculative decoding that requires maintaining a completely separate draft model, MTP's innovation lies in embedding the drafting capability directly into the main model's training process. During pre-training, DeepSeek has the model simultaneously learn to predict both the next token and multiple future tokens. These additional prediction heads are the MTP layers. Since MTP layers share extensive underlying representations with the main model, they inherently have higher prediction accuracy — this is also why MTP layers can be extracted from the main model and used independently — they are fundamentally part of the main model.
The MTP layer is just an additional layer added on top of the main model, theoretically maintaining 100% accuracy — all outputs are ultimately verified by the main model.
DeepSeek introduced MTP technology as early as V3. With V4 Flash, the community has successfully extracted these MTP layers so they can be loaded and used independently, bringing tangible inference acceleration to local deployment users.
Performance Testing: Coding Scenarios Show the Most Significant Improvement
Flappy Bird Code Generation Test
In the Flappy Bird HTML game generation test, speed without MTP was 31.15 tokens per second, and with MTP enabled it reached 37.3 Token/s — an improvement of approximately 20%.
However, it's worth noting that speed fluctuates significantly with MTP enabled — sometimes surging above 40 Token/s, sometimes dropping to 25. This is because the draft model's prediction accuracy isn't stable: when predictions are correct, speed is excellent; when wrong, the rollback and re-run actually makes it slower than without MTP.

3D Tetris Complete Code Test
When generating nearly 6,000 tokens of 3D Tetris code, the three configurations performed as follows:
- MTP Disabled: 30.7 Token/s
- MTP Enabled (Q4 quantization): 36.2 Token/s
- MTP Enabled (Q3 quantization): 35.9 Token/s
An interesting detail: the Q3 quantized version was actually slower than Q4. Q4 and Q3 quantization refer to compressing model weights to 4-bit and 3-bit integer representations respectively (original floating point numbers are 16-bit or 32-bit). Quantization introduces rounding errors, and lower precision means larger errors. In speculative decoding scenarios, this error directly affects the draft model's token prediction distribution — the Q3 quantized MTP layer's predicted probability distribution deviates more from what the main model expects, causing more predictions to be rejected and triggering more frequent rollback operations. Each rollback not only wastes the draft model's computation but also requires re-running the main model to generate the correct token, creating a double penalty. Therefore, using the Q4 version of the MTP layer is recommended.
Text Generation Performance is Mediocre
When writing stories, MTP's improvement is negligible — from 32.9 to only 33 Token/s. The reason isn't hard to understand: in creative writing, the selectable token space is much larger, making it difficult for the draft model to accurately guess which word the main model will ultimately choose. In contrast, code has more deterministic syntactic structure and a narrower range of options, naturally giving MTP a higher hit rate.
Memory Overhead and Accuracy Analysis
Memory Usage
The MTP layer itself is approximately 3.6GB in size. After loading MTP, total memory usage increases from 149GB to 153.5GB, consuming about 4GB extra.
It's worth mentioning that MTP technology being practical on Mac is largely thanks to Apple Silicon's Unified Memory Architecture (UMA). In traditional PC architectures, CPU memory and GPU VRAM are independent, requiring data transfer through the PCIe bus at approximately 64GB/s bandwidth; whereas M-series chips have CPU, GPU, and Neural Engine sharing the same memory pool with bandwidth exceeding 400GB/s (M3 Ultra). This makes it possible to run ultra-large models like DeepSeek V4 Flash that require 148GB+ of memory, and the MTP layer's additional 4GB overhead incurs virtually no extra data transfer cost under unified memory architecture. For users who already need 148GB+ of memory to run DeepSeek V4 Flash, this additional overhead is entirely acceptable.
Subtle Accuracy Differences
Although MTP theoretically maintains 100% accuracy (all tokens are verified by the main model), an interesting phenomenon was discovered in actual testing: after loading the MTP layer, the additional floating-point operations cause extremely subtle numerical changes that affect the model's path selection in the search tree.

The root cause of this phenomenon lies in the numerical sensitivity of floating-point operations. When a large language model generates each token, it's actually sampling from a probability distribution over the entire vocabulary (typically tens of thousands of words). The additional matrix operations introduced by the MTP layer alter the precise numerical values of intermediate activations, and these tiny changes are amplified layer by layer through the model's deep network, potentially causing two tokens with similar probabilities to swap rankings. Even with temperature set to 0 (greedy decoding), the non-associativity of floating-point arithmetic (i.e., (a+b)+c ≠ a+(b+c)) can produce different results due to changes in computation order.
In the "car wash problem" test, this difference was particularly evident — with MTP disabled, the model correctly answered "you should drive to the car wash," but with MTP enabled, it answered "you should walk." Even more dramatically, simply adding an extra space in the prompt could lead to a completely different answer — any change in the prompt affects the numerical path of the entire attention computation. This indirectly reflects the fragility of current large models in logical reasoning.

Local Deployment Practical Guide
Running via the Inference App
Here are the specific steps:
- Download the DeepSeek V4 Flash MLX 9-Bit quantized version
- Download the corresponding MTP speculative decoder model
- After selecting the main model in Inference, the speculative decoder section will automatically display available MTP options
- Check the speculative decoder to enable it
Running via OpenAI-Compatible API
The app has built-in server functionality supporting OpenAI-compatible API. In development tools like Open Code, simply paste the model ID with the MTP tag into the configuration file to make calls. The first run will be slower due to system prompt caching; after enabling "persistent prompt cache," subsequent usage speed will improve noticeably.

Multi-Machine Distributed Computing
If you have multiple Macs, you can build a cluster by linking multiple nodes together to share the workload and further improve inference capability.
MTP Effect Comparison with Other Models
Smaller models like Qwen and Gemma (27 to 30 billion parameters) can achieve up to 2x speed improvement with MTP, which is very impressive. DeepSeek V4 Flash has over 100 billion parameters, and while the 20% improvement isn't as dramatic as with smaller models, considering the model scale, this gain is already quite substantial.
Additionally, other speculative decoding approaches like Eagle Free are worth noting. Eagle (Extrapolation Algorithm for Greater Language-model Efficiency) is another speculative decoding framework proposed by Stanford University. Its core innovation is having the draft model predict directly in the main model's feature space rather than token space, achieving higher prediction accuracy. Compared to MTP, Eagle's advantage is that it can train a dedicated draft model for any existing model without relying on multi-token prediction heads from original training; its disadvantage is that it requires additional training steps, and current optimization for Apple Silicon's MLX framework is insufficient, performing poorly on Mac and still needing further optimization or training of dedicated versions. With continued investment from the open-source community, such approaches are expected to see significant improvements on the Mac platform in the future.
Summary and Usage Recommendations
MTP brings a stable 20% performance improvement to DeepSeek V4 Flash, with the best results in code generation scenarios. Trading approximately 4GB of additional memory for significant speed improvement makes it a worthwhile optimization option for users deploying large models locally.
However, be aware that MTP may cause subtle differences in the model's reasoning path. In scenarios requiring extremely high output determinism, careful evaluation is recommended before deciding whether to enable it.
Key Takeaways
- MTP speculative decoding predicts tokens via a draft model and has the main model verify them in batches, transforming serial Decode operations into parallel Prefill operations, bringing approximately 20% inference speed improvement to DeepSeek V4 Flash
- Code generation scenarios show the most significant improvement (31→37 Token/s), while creative text generation shows minimal improvement because code has a more deterministic token selection space
- The MTP layer uses approximately 4GB of additional memory (3.6GB model size), with Q4 quantization performing better than Q3 — lower precision leads to more rejected predictions, creating a double performance penalty
- While theoretically maintaining 100% accuracy, the numerical non-associativity of additional floating-point operations may cause the model to take different reasoning paths, producing different results
- Apple Silicon's Unified Memory Architecture is the key hardware foundation for running such ultra-large models locally on Mac
- Supports usage through the Inference app or OpenAI-compatible API, compatible with development tools like Open Code
Related articles
TutorialsCursor + Codex Dual-IDE Collaboration: A Practical Methodology for Open-Source Project Customization
A complete methodology for open-source project customization based on real-world experience, detailing the Cursor+Codex dual-IDE workflow, seven-stage process, MVP validation, and AI source code reading techniques.
TutorialsCursor Multi-Agent in Practice: Building a Full-Stack Next.js Blog in 50 Minutes
Build a full-stack blog in 50 minutes using Cursor IDE's multi-Agent mode with Next.js, Clerk auth, and Supabase. Learn the 4-phase AI Agent workflow and key integration pitfalls.
TutorialsBuilding an AI Software Factory from Scratch: A Cursor Engineer's Hands-On Experience with Multi-Agent Collaboration
Cursor engineer Eric shares practical insights on building an AI software factory: automation levels, guardrail design, parallel Agent management, and scaling to 1000+ Agents for 24/7 development.