Hands-On Testing of DS4 Engine by Redis Creator: How Does DeepSeek V4 Perform Locally on a 128GB Mac?

Introduction

Antirez, the creator of Redis, recently launched a project called DS4 — an inference engine specifically designed for local deployment of the DeepSeek V4 Flash model on Apple computers. Through a unique quantization strategy, the project compresses the model from its original 284GB memory requirement down to 80-85GB, enabling it to run on a 128GB MacBook. A Bilibili content creator conducted an in-depth hands-on test of this project, covering everything from frontend mini-games to STM32 embedded development, thoroughly examining how the locally deployed quantized model performs in real-world programming scenarios.

我们先来验收一下这个结果啊

5minutes later

这里我们就不干预了

Core Technical Highlights of the DS4 Engine

Asymmetric Structure-Aware Quantization: Not a Simple "One-Size-Fits-All" Approach

The standout technical feature of the DS4 engine is that its quantization approach isn't a brute-force uniform compression. Model quantization is a technique that compresses floating-point weights in neural networks from high precision (e.g., FP16, 16 bits per parameter) to lower precision (e.g., Q8 at 8 bits, Q4 at 4 bits, Q2 at 2 bits). Lower precision means smaller memory footprint and faster inference, but also reduced model capability. Traditional quantization methods typically apply uniform precision across all layers, whereas DS4's innovation lies in assigning different precision levels based on the importance of different components within the MOE architecture — an approach known in academia as Mixed-Precision Quantization.

Unlike typical quantized models, DS4 applies asymmetric structure-aware quantization specifically tailored to DeepSeek V4 Flash's MOE (Mixture of Experts) architecture. The MOE architecture is one of the mainstream design paradigms for large-scale language models today. Its core idea is to split the model's feed-forward network layers into multiple "expert" sub-networks, activating only a small subset of experts during each inference pass to process the input. DeepSeek V4 Flash's MOE architecture contains hundreds of expert modules, but each token is routed to only a few experts for computation. This means that while the total parameter count is enormous (600B+), the actual computational load during inference is far less than that of a dense model of equivalent size.

The specific quantization strategy is as follows:

Shared experts, routing networks, projection matrices, and attention layers are all maintained at Q8 or FP16 high precision, ensuring that the model's decision-making capability and tool-calling reliability don't degrade significantly. The Router is responsible for deciding which experts each token should be assigned to — this decision process is critical to output quality, so maintaining high-precision quantization for it is well-justified technically
Layered processing based on MOE usage frequency, collinearity, and divergence, applying different quantization precision to different expert modules
KV cache disk persistence preserves the 1M ultra-long context capability. KV cache is a core mechanism in Transformer model inference — during autoregressive generation, the model needs to store attention key-value pairs for all previous tokens to avoid redundant computation. For models supporting 1 million token ultra-long contexts, KV cache memory usage can reach tens of gigabytes. DS4 persists the KV cache to disk, leveraging Apple silicon's high-speed NVMe storage advantage, offloading portions of the cache to disk when memory is insufficient, thereby maintaining ultra-long context capability within the 128GB memory constraint

This fine-grained quantization strategy compresses the model from 284GB to 80-85GB, running perfectly on a 128GB Mac. Apple's M-series chips use a Unified Memory Architecture where CPU and GPU share the same physical memory, eliminating the need for data copying between system memory and VRAM as in traditional PCs. A 128GB Mac can dedicate all its memory to model loading and inference. While Mac GPUs aren't as powerful as dedicated graphics cards, the memory bandwidth (M4 Max can reach 546GB/s) is sufficient to support large model inference — this is precisely why DS4 is optimized specifically for the Mac platform. The project's official documentation also explicitly states that its primary focus is on Coding performance.

Deployment Process and Usage Methods

In the hands-on test, the content creator used the Q2 quantized version for deployment. The entire model is approximately 60GB and took over two hours to download. Q2 means each parameter is represented with only 2 bits, resulting in significant information loss — this explains why this version performs noticeably worse than the full-precision version on complex tasks. One thing you might not notice: terminal downloads require manually configuring the proxy port (default 7890), because proxy tools' smart mode and global mode don't cover the terminal environment.

DS4 offers three usage modes:

CLI interactive mode — direct command-line conversation
HTTP server mode — can be connected to third-party tools
Built-in Code Agent — programming assistant supporting code writing and debugging

The content creator chose HTTP server mode, using the Mac as an inference server and calling it from a Windows PC through Trae (ByteDance's AI IDE) for demonstration.

Test 1: Snake Game Development

For a fair comparison, the content creator simultaneously created two projects: one using the locally deployed DS4 quantized model, and another using DeepSeek's official full-precision V4 Flash, executing the same Snake game development task.

Local Quantized Version Performance

Output a total of 446 lines of code
Unexpectedly proactively invoked Skills and produced two documents (PRD and architecture design)
The game ran directly without obvious bugs
However, token conversion time per conversation was quite long

Online Full-Precision Version Performance

Output 161 lines of code, more concise
Smoother gaming experience, faster speed
Had one minor bug: wouldn't auto-start, requiring a page refresh

Interestingly, while the local quantized version was slower, it excelled in Skill invocation and SOP adherence, even proactively generating technical documentation. This indicates that DS4 did a good job preserving Agent capabilities through quantization — directly related to its strategy of maintaining high precision for routing networks and attention layers. The model's "decision-making" ability (choosing which tools to call, which processes to follow) was well protected.

Test 2: STM32 Embedded Development

This was a more challenging test scenario. STM32 is a microcontroller series from STMicroelectronics based on ARM Cortex-M cores, widely used in industrial control, IoT devices, and consumer electronics. Unlike frontend development, embedded development involves hardware register configuration, cross-compilation toolchains (like ARM GCC), Makefile build systems, flash programming and debugging (via tools like ST-Link), and peripheral drivers (such as I2C/SPI protocol drivers for OLED screens). Each step can present issues related to specific hardware versions and pin configurations, requiring the AI model not only to generate code but also to understand hardware abstraction layers and low-level communication protocols — demanding far greater knowledge depth and reasoning capability than typical web development.

The content creator asked the locally deployed model to complete an STM32 microcontroller project: displaying a "Hello World" scrolling marquee effect on an OLED screen.

Problems Encountered

The entire process exposed multiple shortcomings of the quantized model in complex programming scenarios:

Makefile issues during compilation — The model fell into prolonged thinking. Although the solution was already documented in the Skills, the model took about half an hour to identify the cause and resolve it
Frequent tool-calling errors — Three tool-calling errors occurred during the entire test, directly interrupting the workflow. This is likely related to Q2 quantization's damage to the precision of model output formatting — tool calling requires the model to generate structured output strictly conforming to JSON Schema, and extremely low-precision quantization affects the model's ability to adhere to format constraints
Insufficient debugging capability — After flashing, the screen wouldn't light up and serial output was absent. The model attempted to debug on its own but with poor results, erroring out again after a 15-minute wait

Ultimately, the content creator had to switch to the official full-precision V4 Flash to troubleshoot, and eventually used DeepSeek V4 Pro to fully implement the scrolling marquee effect. The entire local model testing process took approximately 75 minutes — a rather painful experience.

Speed Bottleneck Analysis

In testing, the average output speed of the local deployment was approximately 23 tokens/second (consistent with official test data), with memory usage spiking to 110GB during inference. In large model inference, token generation speed directly determines user experience — generally, 30+ tokens/second is considered sufficient for smooth conversational experience (approaching human reading speed), while programming scenarios demand even higher speeds due to the need to generate large amounts of code. At 23 tokens/second, generating 100 lines of code (approximately 500-800 tokens) takes 20-35 seconds, which is barely acceptable for simple tasks.

But the more critical bottleneck lies in the "thinking phase" — DeepSeek V4 Flash employs a Chain-of-Thought-like reasoning mechanism where the model performs internal reasoning before outputting answers. This phase also consumes tokens but is invisible to the user, causing actual wait times to far exceed expectations. The peak of 83.8 tokens/second appeared during pure code output phases when the model didn't need complex reasoning, approaching the hardware's theoretical throughput ceiling. While code generation speed is "barely usable," the excessive wait times during model thinking and context switching phases severely impacted development efficiency.

Summary and Recommendations

Three Core Issues with DS4 Currently

After comprehensive testing, the DS4 project currently has three main issues:

Slow output speed — Overall approximately 23 tokens/second, with particularly long thinking wait times during complex tasks
Unstable tool calling — 3 calling errors occurred during testing, likely a "side effect" of model quantization where extremely low-precision quantization damages the model's ability to generate precise structured output
Degraded complex coding capability — Simple frontend development (like Snake) can pass in one shot, but embedded development coding and debugging is riddled with issues

What Use Cases Is DS4 Suitable For?

Despite its shortcomings, DS4 still has unique value:

Suitable as a local Agent — Performs well in Skill invocation and SOP adherence, suitable for building local knowledge base management and privacy-sensitive content processing
Suitable for assisted programming — If you handle core coding yourself and need a free, reasonably capable AI assistant, local deployment is a good choice
Not suitable as a primary programming tool — At the current stage, all local large models can only serve as efficiency tools and supplementary aids

Future Outlook

The DS4 project has significant potential. Salvatore Sanfilippo (known online as Antirez) is the creator of Redis, the open-source in-memory database that is one of the world's most widely used key-value storage systems, adopted at scale by companies like Netflix, Twitter, and GitHub. He is renowned for his deep understanding of low-level system optimization and his minimalist engineering philosophy, with over 20 years of experience in C language systems programming. After retiring from the Redis project in 2020, he turned to exploring AI, and the DS4 project reflects his signature style: solving practical problems with elegant engineering.

The fine-grained quantization approach targeting MOE architectures represents an important direction for local deployment. As Apple silicon performance continues to improve (the next-generation M5 series is expected to further increase memory bandwidth and GPU compute) and quantization techniques continue to advance (with more sophisticated algorithms like GPTQ and AWQ evolving), the experience of running 600B+ parameter models locally on a 128GB Mac is expected to improve significantly. But for now, if you're pursuing productivity, the full-precision API remains the more pragmatic choice.