DeepSeek V4 Flash MTP Speculative Decoding Real-World Test: A Guide to 20% Faster Local Inference

What is MTP Speculative Decoding

MTP (Multi-Token Prediction) is a speculative decoding strategy. To understand its value, you first need to understand the core bottleneck of large model inference: the speed limitation isn't insufficient compute power, but memory bandwidth — generating each token requires loading hundreds of gigabytes of model weights from memory to the compute units, and this serial process is extremely time-consuming. Speculative Decoding was born to break through this bottleneck, a paradigm proposed almost simultaneously by Google Brain and DeepMind teams in 2023.

The core idea behind MTP is intuitive: a smaller "draft model" speculatively predicts the next several tokens, then the main model verifies whether these tokens are correct in batch (Prefill) mode. Correct predictions are kept; incorrect ones trigger a rollback for regeneration.

This approach works because it transforms inefficient serial decoding into efficient parallel verification. Large model inference has two phases: the Prefill phase processes the input prompt where all tokens can be computed in parallel with very high hardware utilization; the Decode phase generates output tokens one by one, and since each step depends on the previous result, it can only execute serially with hardware utilization typically below 10%. MTP uses the draft model to serially generate 5-10 candidate tokens, then has the main model verify them in parallel batches — essentially merging multiple inefficient serial memory loads into a single efficient parallel operation.

Unlike traditional speculative decoding that requires maintaining a completely separate draft model, MTP's innovation lies in embedding the drafting capability directly into the main model's training process. During pre-training, DeepSeek has the model simultaneously learn to predict both the next token and multiple future tokens. These additional prediction heads are the MTP layers. Since MTP layers share extensive underlying representations with the main model, they inherently have higher prediction accuracy — this is also why MTP layers can be extracted from the main model and used independently — they are fundamentally part of the main model.

The MTP layer is just an additional layer added on top of the main model, theoretically maintaining 100% accuracy — all outputs are ultimately verified by the main model.

DeepSeek introduced MTP technology as early as V3. With V4 Flash, the community has successfully extracted these MTP layers so they can be loaded and used independently, bringing tangible inference acceleration to local deployment users.

Performance Testing: Coding Scenarios Show the Most Significant Improvement

Flappy Bird Code Generation Test

In the Flappy Bird HTML game generation test, speed without MTP was 31.15 tokens per second, and with MTP enabled it reached 37.3 Token/s — an improvement of approximately 20%.

However, it's worth noting that speed fluctuates significantly with MTP enabled — sometimes surging above 40 Token/s, sometimes dropping to 25. This is because the draft model's prediction accuracy isn't stable: when predictions are correct, speed is excellent; when wrong, the rollback and re-run actually makes it slower than without MTP.

MTP Performance Comparison

3D Tetris Complete Code Test

When generating nearly 6,000 tokens of 3D Tetris code, the three configurations performed as follows:

MTP Disabled: 30.7 Token/s
MTP Enabled (Q4 quantization): 36.2 Token/s
MTP Enabled (Q3 quantization): 35.9 Token/s

An interesting detail: the Q3 quantized version was actually slower than Q4. Q4 and Q3 quantization refer to compressing model weights to 4-bit and 3-bit integer representations respectively (original floating point numbers are 16-bit or 32-bit). Quantization introduces rounding errors, and lower precision means larger errors. In speculative decoding scenarios, this error directly affects the draft model's token prediction distribution — the Q3 quantized MTP layer's predicted probability distribution deviates more from what the main model expects, causing more predictions to be rejected and triggering more frequent rollback operations. Each rollback not only wastes the draft model's computation but also requires re-running the main model to generate the correct token, creating a double penalty. Therefore, using the Q4 version of the MTP layer is recommended.

Text Generation Performance is Mediocre

When writing stories, MTP's improvement is negligible — from 32.9 to only 33 Token/s. The reason isn't hard to understand: in creative writing, the selectable token space is much larger, making it difficult for the draft model to accurately guess which word the main model will ultimately choose. In contrast, code has more deterministic syntactic structure and a narrower range of options, naturally giving MTP a higher hit rate.

Memory Overhead and Accuracy Analysis

Memory Usage

The MTP layer itself is approximately 3.6GB in size. After loading MTP, total memory usage increases from 149GB to 153.5GB, consuming about 4GB extra.

It's worth mentioning that MTP technology being practical on Mac is largely thanks to Apple Silicon's Unified Memory Architecture (UMA). In traditional PC architectures, CPU memory and GPU VRAM are independent, requiring data transfer through the PCIe bus at approximately 64GB/s bandwidth; whereas M-series chips have CPU, GPU, and Neural Engine sharing the same memory pool with bandwidth exceeding 400GB/s (M3 Ultra). This makes it possible to run ultra-large models like DeepSeek V4 Flash that require 148GB+ of memory, and the MTP layer's additional 4GB overhead incurs virtually no extra data transfer cost under unified memory architecture. For users who already need 148GB+ of memory to run DeepSeek V4 Flash, this additional overhead is entirely acceptable.

Subtle Accuracy Differences

Although MTP theoretically maintains 100% accuracy (all tokens are verified by the main model), an interesting phenomenon was discovered in actual testing: after loading the MTP layer, the additional floating-point operations cause extremely subtle numerical changes that affect the model's path selection in the search tree.

Speculative Decoder Settings Interface

The root cause of this phenomenon lies in the numerical sensitivity of floating-point operations. When a large language model generates each token, it's actually sampling from a probability distribution over the entire vocabulary (typically tens of thousands of words). The additional matrix operations introduced by the MTP layer alter the precise numerical values of intermediate activations, and these tiny changes are amplified layer by layer through the model's deep network, potentially causing two tokens with similar probabilities to swap rankings. Even with temperature set to 0 (greedy decoding), the non-associativity of floating-point arithmetic (i.e., (a+b)+c ≠ a+(b+c)) can produce different results due to changes in computation order.

In the "car wash problem" test, this difference was particularly evident — with MTP disabled, the model correctly answered "you should drive to the car wash," but with MTP enabled, it answered "you should walk." Even more dramatically, simply adding an extra space in the prompt could lead to a completely different answer — any change in the prompt affects the numerical path of the entire attention computation. This indirectly reflects the fragility of current large models in logical reasoning.

Car Wash Problem Test Comparison

Local Deployment Practical Guide

Running via the Inference App

Here are the specific steps:

Download the DeepSeek V4 Flash MLX 9-Bit quantized version
Download the corresponding MTP speculative decoder model
After selecting the main model in Inference, the speculative decoder section will automatically display available MTP options
Check the speculative decoder to enable it

Running via OpenAI-Compatible API

The app has built-in server functionality supporting OpenAI-compatible API. In development tools like Open Code, simply paste the model ID with the MTP tag into the configuration file to make calls. The first run will be slower due to system prompt caching; after enabling "persistent prompt cache," subsequent usage speed will improve noticeably.

Math Problem MTP Speed Improvement

Multi-Machine Distributed Computing

If you have multiple Macs, you can build a cluster by linking multiple nodes together to share the workload and further improve inference capability.

MTP Effect Comparison with Other Models

Smaller models like Qwen and Gemma (27 to 30 billion parameters) can achieve up to 2x speed improvement with MTP, which is very impressive. DeepSeek V4 Flash has over 100 billion parameters, and while the 20% improvement isn't as dramatic as with smaller models, considering the model scale, this gain is already quite substantial.

Additionally, other speculative decoding approaches like Eagle Free are worth noting. Eagle (Extrapolation Algorithm for Greater Language-model Efficiency) is another speculative decoding framework proposed by Stanford University. Its core innovation is having the draft model predict directly in the main model's feature space rather than token space, achieving higher prediction accuracy. Compared to MTP, Eagle's advantage is that it can train a dedicated draft model for any existing model without relying on multi-token prediction heads from original training; its disadvantage is that it requires additional training steps, and current optimization for Apple Silicon's MLX framework is insufficient, performing poorly on Mac and still needing further optimization or training of dedicated versions. With continued investment from the open-source community, such approaches are expected to see significant improvements on the Mac platform in the future.

Summary and Usage Recommendations

MTP brings a stable 20% performance improvement to DeepSeek V4 Flash, with the best results in code generation scenarios. Trading approximately 4GB of additional memory for significant speed improvement makes it a worthwhile optimization option for users deploying large models locally.

However, be aware that MTP may cause subtle differences in the model's reasoning path. In scenarios requiring extremely high output determinism, careful evaluation is recommended before deciding whether to enable it.

Key Takeaways

MTP speculative decoding predicts tokens via a draft model and has the main model verify them in batches, transforming serial Decode operations into parallel Prefill operations, bringing approximately 20% inference speed improvement to DeepSeek V4 Flash
Code generation scenarios show the most significant improvement (31→37 Token/s), while creative text generation shows minimal improvement because code has a more deterministic token selection space
The MTP layer uses approximately 4GB of additional memory (3.6GB model size), with Q4 quantization performing better than Q3 — lower precision leads to more rejected predictions, creating a double performance penalty
While theoretically maintaining 100% accuracy, the numerical non-associativity of additional floating-point operations may cause the model to take different reasoning paths, producing different results
Apple Silicon's Unified Memory Architecture is the key hardware foundation for running such ultra-large models locally on Mac
Supports usage through the Inference app or OpenAI-compatible API, compatible with development tools like Open Code