oMLX + MTP + Qwen3.6: Local AI Coding Speed Breaks New Records

Introduction: A New Era for Local LLM-Powered Coding

While we're still debating subscription costs for cloud-based AI coding assistants, local large language model inference speeds have quietly crossed a major milestone. A developer used the combination of oMLX + Pi Coding Agent + Qwen3.6 35B to build a complete full-stack reminder app in under 5 minutes — from backend API to frontend UI, all generated in one shot by a locally running LLM, with zero manual code editing.

The highlight of this demo isn't just code quality — it's inference speed. Thanks to Multi-Token Prediction (MTP) technology, the model achieved a generation speed of 86.7 tokens/s, with prompt processing reaching an impressive 1,735 tokens/s. For a 35B-parameter model, those numbers are nothing short of remarkable.

Tech Stack Breakdown: Three Core Components

oMLX: A Blazing-Fast Inference Engine on Apple Silicon

oMLX is a local LLM inference tool built on Apple's MLX framework and optimized for Apple Silicon (M-series chips). MLX is a machine learning framework that Apple open-sourced in late 2023, designed around Apple Silicon's Unified Memory Architecture (UMA). On a discrete GPU, model weights must be copied from system memory into GPU VRAM, and a model that does not fit in VRAM has to be split or offloaded, which becomes a severe bottleneck with large models. Apple Silicon's UMA lets the CPU and GPU share the same physical memory, so data does not need to be copied between them. This means a Mac with 128GB of unified memory can load an entire large model directly into GPU-accessible memory, without being constrained by the 24GB or 48GB VRAM limits typical of NVIDIA GPUs. MLX's API design is heavily influenced by NumPy, PyTorch, and JAX, and it uses lazy evaluation with dynamically built computation graphs to schedule work across the CPU and GPU at runtime.
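To make the memory argument concrete, here is a back-of-the-envelope sketch in Python. It estimates the weight footprint of a 35B-parameter model at several bit widths and checks it against a memory budget. The function names and the 25% reserve for the OS, KV cache, and activations are illustrative assumptions, and the estimate ignores quantization metadata such as scales.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


def fits_in_memory(n_params: float, bits_per_weight: float,
                   memory_gb: float, reserve_fraction: float = 0.25) -> bool:
    """True if the weights fit after reserving part of memory for everything else.

    reserve_fraction is an assumed allowance, not a measured value.
    """
    budget_gb = memory_gb * (1.0 - reserve_fraction)
    return weight_footprint_gb(n_params, bits_per_weight) <= budget_gb


N_PARAMS = 35e9  # total parameter count stated for the model in this article

for bits in (16, 8, 4):
    size = weight_footprint_gb(N_PARAMS, bits)
    print(f"{bits:>2}-bit: {size:5.1f} GB | "
          f"24 GB VRAM: {fits_in_memory(N_PARAMS, bits, 24)} | "
          f"128 GB unified: {fits_in_memory(N_PARAMS, bits, 128)}")
```

Under these assumptions, only the 4-bit variant squeezes into a 24GB card, while all three bit widths fit in 128GB of unified memory.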

oMLX builds on this framework and uses the unified memory architecture to run large-parameter models locally at high speed. It is positioned as faster on the Mac than general-purpose solutions such as llama.cpp, although no head-to-head benchmark is reported here.

The most critical upgrade in this demo was enabling Native MTP (Multi-Token Prediction). MTP was first systematically proposed by Meta in the research paper Better & Faster Large Language Models via Multi-token Prediction, and later adopted by Google in Gemma 4. At inference time it works as a form of self-speculative decoding. A traditional autoregressive language model generates only one token per forward pass, so generating N tokens requires N full forward passes, each of which is bottlenecked by memory bandwidth rather than compute. MTP adds one or more lightweight "drafter heads" to the model architecture. These small networks reuse the main model's hidden states and predict tokens at several future positions at once. The main model then verifies the drafted tokens in a single forward pass, accepting the longest prefix that matches its own predictions and discarding the rest. Because decoding is memory-bound, verifying several tokens in one pass costs roughly the same as generating a single token, so each step can yield more than one token. The actual speedup depends on how often the drafts are accepted; the figure cited for this setup is approximately 2x. Qwen3.6 natively supports MTP, and oMLX activates this acceleration directly, with no separate draft model to configure.
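The draft-and-verify idea can be illustrated with a toy simulation. This is a minimal sketch, not oMLX's or Qwen's implementation: `target_next` and `draft_next` are made-up stand-ins for the main model and a drafter head, and the loop inside the verification step stands in for what a real model would compute in one batched forward pass.

```python
from typing import List, Sequence, Tuple

VOCAB = 50  # toy vocabulary size


def target_next(context: Sequence[int]) -> int:
    """Toy 'main model': the next token is a deterministic function of the context."""
    return (sum(context) * 31 + len(context)) % VOCAB


def draft_next(context: Sequence[int]) -> int:
    """Toy 'drafter head': agrees with the main model except at every 4th position."""
    guess = target_next(context)
    return (guess + 1) % VOCAB if len(context) % 4 == 0 else guess


def generate_plain(prompt: Sequence[int], n_new: int) -> Tuple[List[int], int]:
    """Standard decoding: one main-model pass per generated token."""
    tokens, passes = list(prompt), 0
    for _ in range(n_new):
        tokens.append(target_next(tokens))
        passes += 1
    return tokens, passes


def generate_speculative(prompt: Sequence[int], n_new: int,
                         k: int = 3) -> Tuple[List[int], int]:
    """Draft k tokens, then verify them in one (simulated) main-model pass."""
    tokens, passes = list(prompt), 0
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. Cheap drafting of k candidate tokens.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. One verification pass: the main model scores every draft position.
        passes += 1
        accepted: List[int] = []
        for i, guess in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            accepted.append(expected)  # equals the draft token when it matches
            if guess != expected:
                break  # first mismatch: keep the correction, drop the rest
        else:
            # All drafts matched, so the same pass also yields one extra token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[:goal], passes


plain, plain_passes = generate_plain([1, 2, 3], 20)
spec, spec_passes = generate_speculative([1, 2, 3], 20)
print("same output:", plain == spec)
print("main-model passes:", plain_passes, "vs", spec_passes)
```

The output is identical to plain greedy decoding because every kept token is one the main model would have produced anyway; only the number of main-model passes changes.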

Qwen3.6 35B MoE: Balancing Performance and Efficiency

The model used in this demo is Qwen3.6 35B, a Mixture of Experts (MoE) model. The MoE idea goes back to the 1991 paper Adaptive Mixtures of Local Experts by Jacobs, Jordan, Nowlan, and Hinton, and its sparsely activated form has experienced a renaissance in the large language model era. In a standard (dense) Transformer, every Feed-Forward Network (FFN) layer applies all of its parameters to every input token. MoE replaces each FFN layer with multiple parallel "expert" sub-networks and adds a gating network (router) that decides which experts each token is sent to.

The advantage of MoE is clear: while the total parameter count is 35B, only a subset of the experts is activated for each token (perhaps only 6-8B active parameters), so the compute and memory traffic per token are far lower than for a Dense model of equivalent size. All 35B parameters must still be resident in memory, because any expert may be selected for the next token. Meanwhile, different experts can "specialize" in different kinds of input, and the model's total knowledge capacity is determined by the parameters of all experts combined. Google's Switch Transformer and Mistral's Mixtral 8x7B are landmark applications of MoE to LLMs, demonstrating that MoE can substantially improve model capability while keeping inference efficient.
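The routing step can be sketched in a few lines of plain Python. This is an illustrative top-k router under simplifying assumptions: the four "experts" just scale their input, the router weights are invented, and nothing here reflects Qwen3.6's actual expert count or gating details.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_ffn(x: Vector, router: List[Vector],
            experts: List[Callable[[Vector], Vector]], k: int) -> Tuple[Vector, List[int]]:
    """Run only the selected experts and mix their outputs by router weight."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in router]
    output = [0.0] * len(x)
    used: List[int] = []
    for index, weight in route_top_k(logits, k):
        expert_out = experts[index](x)  # the other experts are never evaluated
        output = [o + weight * e for o, e in zip(output, expert_out)]
        used.append(index)
    return output, used


# Four toy experts (hypothetical): each simply scales the input vector.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
# One router row per expert (hypothetical weights).
router = [[0.1, 0.0], [2.0, 0.0], [-1.0, 0.0], [1.5, 0.0]]

output, used = moe_ffn([1.0, 0.5], router, experts, k=2)
print("experts used:", used)
print("output:", output)
```

With k=2 out of four experts, half of the expert parameters are touched for this token, which is the source of the compute savings described above.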

The developer configured a 131,072-token context window (128 × 1,024, roughly 131K tokens), which leaves ample room for complex full-stack projects: the model can take in the complete project requirements and generate all code files within a single conversation.

Pi Coding Agent: The Execution Layer for AI Coding

Pi Coding Agent serves as the coding agent layer, responsible for translating the LLM's output into actual file operations. It represents a paradigm shift from simple code completion to autonomous programming. Traditional AI coding assistants (like early GitHub Copilot) could only provide line-level or function-level code suggestions at the cursor position — developers still had to manually organize project structure, manage files, and execute commands. Coding Agents introduce a "plan-execute-verify" loop: they first analyze the requirements document and formulate an implementation plan; then use tool-calling capabilities to perform file system operations, run terminal commands, and read execution results; finally, they self-correct based on feedback. This architecture typically uses ReAct (Reasoning + Acting) or similar prompting frameworks, allowing the model to alternate between "thinking" and "acting."

Specifically, Pi Coding Agent can:

Automatically create project directory structures
Generate and write multiple code files
Execute dependency installation commands like npm install
Start frontend and backend services

This Agent paradigm lets developers simply provide a detailed requirements document, with the AI handling everything else automatically. Pi Coding Agent, Cursor Agent, Claude Code, and similar tools all fall into this category — they're redefining how developers interact with code.
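The plan-execute-verify loop described above can be sketched as follows. This is a minimal illustration and not Pi Coding Agent's implementation: `scripted_model` is a hard-coded stand-in for the LLM, and the tool names (`write_file`, `read_file`, `finish`) are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def scripted_model(history: List[dict]) -> dict:
    """Stand-in for the LLM: picks the next action from a fixed plan."""
    plan = [
        {"tool": "write_file", "args": {"path": "app/hello.txt", "content": "hello"}},
        {"tool": "read_file", "args": {"path": "app/hello.txt"}},
        {"tool": "finish", "args": {"summary": "created app/hello.txt"}},
    ]
    return plan[len(history)]


def make_tools(root: Path) -> Dict[str, Callable[..., str]]:
    """File tools confined to a working directory."""
    def write_file(path: str, content: str) -> str:
        target = root / path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)
        return f"wrote {len(content)} chars to {path}"

    def read_file(path: str) -> str:
        return (root / path).read_text()

    return {"write_file": write_file, "read_file": read_file}


def run_agent(model: Callable[[List[dict]], dict],
              tools: Dict[str, Callable[..., str]],
              max_steps: int = 10) -> Tuple[str, List[dict]]:
    """Alternate between asking the model for an action and executing it."""
    history: List[dict] = []
    for _ in range(max_steps):
        action = model(history)                      # "think": choose next step
        if action["tool"] == "finish":
            return action["args"]["summary"], history
        try:
            observation = tools[action["tool"]](**action["args"])  # "act"
        except Exception as exc:
            # Errors are fed back as observations so the model can self-correct.
            observation = f"error: {exc}"
        history.append({"action": action, "observation": observation})
    return "step limit reached", history


with tempfile.TemporaryDirectory() as tmp:
    summary, history = run_agent(scripted_model, make_tools(Path(tmp)))

print(summary)
for step in history:
    print(step["action"]["tool"], "->", step["observation"])
```

A real agent would replace `scripted_model` with a call to the locally served model and add tools for running commands such as npm install; the step limit guards against a model that never decides to finish.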

Live Demo: Building an Apple-Style Reminder App in 5 Minutes

Requirements Design

The developer prepared a structured Markdown requirements document containing:

Project Overview: Build a full-stack reminder web app inspired by Apple Reminders, with a dark UI, category-based lists and tag system, local-first with no cloud sync
Tech Stack: Explicit frontend and backend technology choices
Project Structure: Expected file directory layout
Database Schema: Data table design
REST API Endpoints: Complete interface definitions
Frontend Requirements: Layout, sidebar, list selection, main content area, add/edit modals
Acceptance Criteria: Functional completeness requirements

Execution Process

After pasting the requirements document into Pi Coding Agent, the entire build process was fully automated:

The model first generated configuration files like package.json
Created source code files one by one following the predefined project structure
Generated the complete backend API and frontend components
Finally provided access URLs for both frontend and backend

The entire process took less than 5 minutes. The resulting app closely resembled Apple Reminders with full CRUD functionality.

Performance Data Analysis

Several key performance metrics from this demo deserve attention:

Metric	Value
Model Parameters	35B (MoE)
Token Generation Speed	86.7 tokens/s
Prompt Processing Speed	1,735 tokens/s
Context Window	131K tokens
Build Time	<5 minutes

Understanding these numbers requires knowledge of the two fundamentally different phases of LLM inference: Prefill and Decode. The Prefill phase processes all prompt tokens from the user input. These tokens can be computed in parallel, so this phase is primarily compute-bound and can run very fast, reaching 1,735 tokens/s in this case. The Decode phase generates output tokens one at a time. Each new token depends on the tokens before it, so the steps cannot run in parallel, and every step must re-read the model's active weights (plus the growing KV cache) from memory. This makes decoding primarily memory-bandwidth-bound.
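A rough ceiling for decode speed follows from the bandwidth argument: if every generated token requires reading the active weights once, tokens per second cannot exceed memory bandwidth divided by the bytes read per token. The sketch below applies that to the 6-8B active-parameter range mentioned earlier. The 500 GB/s bandwidth and 4-bit weights are illustrative assumptions, not measured specifications of the demo machine, and the bound ignores KV-cache reads and other overheads.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s: float,
                                active_params: float,
                                bits_per_weight: float) -> float:
    """Upper bound on decode speed when each token reads all active weights once."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


ASSUMED_BANDWIDTH_GB_S = 500.0  # assumption for illustration only
ASSUMED_BITS = 4                # assumption: 4-bit quantized weights

for active in (6e9, 8e9):
    ceiling = decode_ceiling_tokens_per_s(ASSUMED_BANDWIDTH_GB_S, active, ASSUMED_BITS)
    print(f"{active / 1e9:.0f}B active parameters: at most {ceiling:.0f} tokens/s")
```

Because the bound scales with active rather than total parameters, it also shows why an MoE model can decode much faster than a Dense model of the same total size on the same hardware.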

This is why prompt processing (prefill) runs at 1,735 tokens/s while token generation (decode) runs at only 86.7 tokens/s. In coding scenarios, requirements documents are typically long and the generated code is even longer, so performance in both phases matters: prefill speed determines how quickly the model "understands" long documents, while decode speed determines the code "typing speed" the user perceives.
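The two measured speeds can be combined into a simple time estimate. In the sketch below, the default rates are the figures reported for this demo, while the 3,000-token prompt and 15,000 generated tokens are hypothetical values chosen for illustration; a real agent session spans several turns, so this is only a single-turn approximation.

```python
from typing import Tuple


def estimate_seconds(prompt_tokens: int, output_tokens: int,
                     prefill_tps: float = 1735.0,
                     decode_tps: float = 86.7) -> Tuple[float, float, float]:
    """Return (prefill_seconds, decode_seconds, total_seconds) for one turn."""
    prefill = prompt_tokens / prefill_tps
    decode = output_tokens / decode_tps
    return prefill, decode, prefill + decode


# Hypothetical workload: a 3,000-token requirements document and
# 15,000 tokens of generated code.
prefill_s, decode_s, total_s = estimate_seconds(3_000, 15_000)
print(f"prefill: {prefill_s:.1f} s, decode: {decode_s:.1f} s, "
      f"total: {total_s / 60:.1f} min")
```

For this workload the estimate is dominated by decoding (about 173 seconds against under 2 seconds of prefill), which is why a decode-side speedup such as MTP has the largest effect on total build time.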

For a 35B-parameter model, a generation speed of 86.7 tokens/s is outstanding. For comparison, an equivalent Dense model without MTP is cited as typically achieving only 40-50 tokens/s, though the exact figure depends on hardware and quantization.

Practical Advice and Limitations

Hardware Requirements

The demo ran on an Apple M5 Max chip, so reproducing it requires a Mac with a high-end M-series chip. Specifically:

M4 Pro (48GB) can run it, but the context window will need to be reduced
M4 Max / M5 Max (64GB+) is the ideal choice
More unified memory means a larger available context window
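The link between memory and context length comes from the KV cache, which grows linearly with the number of tokens. The sketch below uses the standard formula for a Transformer with full attention in every layer. The layer count, KV-head count, and head dimension are hypothetical placeholders, not Qwen3.6's published configuration, and models that use hybrid or linear attention layers need less than this.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache added per token (factor 2 = one key + one value vector)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def kv_cache_gb(tokens: int, n_layers: int, n_kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """Total KV cache size in decimal gigabytes for a given context length."""
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value)
    return tokens * per_token / 1e9


def max_context_tokens(budget_gb: float, n_layers: int, n_kv_heads: int,
                       head_dim: int, bytes_per_value: int = 2) -> int:
    """Largest context that fits in a given KV-cache memory budget."""
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value)
    return int(budget_gb * 1e9 // per_token)


# Hypothetical architecture, for illustration only.
CONFIG = dict(n_layers=48, n_kv_heads=4, head_dim=128)

for tokens in (32_768, 131_072):
    print(f"{tokens:>7} tokens: {kv_cache_gb(tokens, **CONFIG):.2f} GB of KV cache")
print("context that fits in 8 GB:", max_context_tokens(8, **CONFIG))
```

Under these placeholder values a full 131,072-token context needs roughly 13GB on top of the weights, which is the kind of headroom a 48GB machine may not have once the model and the OS are loaded.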

MoE vs. Dense: Trade-offs

The developer acknowledged in the video that this demo used an MoE model rather than a Dense model. While MoE models offer faster inference, they may not match the accuracy of Dense models with the same total parameter count on certain complex reasoning tasks. Judging by the results of this demo, however, MoE quality is more than sufficient for code generation tasks of this kind.

Ideal Use Cases

This setup is particularly well-suited for:

Rapid prototyping
Privacy-sensitive enterprise internal projects
Personal projects requiring frequent iteration
Development environments with poor network conditions

Local Inference vs. Cloud Services: How to Choose

The competition between local AI inference and cloud API services is reshaping the developer tools market. Cloud services (such as the OpenAI API and the Anthropic Claude API) offer zero hardware investment, continuously updated models, and access to the largest flagship models. Their drawbacks are equally clear: ongoing subscription or usage fees (which can reach hundreds of dollars per month for heavy users), data privacy risks (code must be uploaded to third-party servers), and dependence on network latency and availability.

Local inference addresses these concerns: after a one-time hardware investment, the marginal cost of inference is little more than electricity, and code never leaves your machine. As open-source model quality improves rapidly (Qwen, Llama, DeepSeek, and others) and consumer-grade hardware like Apple Silicon continues to gain inference capability, the quality gap between local and cloud solutions is shrinking fast. For many practical coding tasks, a 35B-class local model can already deliver code generation quality that approaches cloud-based flagship models.
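The cost side of this choice reduces to a break-even calculation. The sketch below is purely illustrative: the hardware price, monthly cloud spend, and electricity cost are hypothetical inputs to be replaced with your own numbers.

```python
from typing import Optional


def break_even_months(hardware_cost: float,
                      monthly_cloud_cost: float,
                      monthly_power_cost: float = 0.0) -> Optional[float]:
    """Months until a one-time hardware purchase pays for itself.

    Returns None when local running costs are not lower than the cloud bill.
    """
    monthly_saving = monthly_cloud_cost - monthly_power_cost
    if monthly_saving <= 0:
        return None
    return hardware_cost / monthly_saving


# Hypothetical inputs: a 4,000 USD machine, 200 USD/month of cloud usage,
# and 10 USD/month of extra electricity.
months = break_even_months(4000, 200, 10)
print(f"break-even after about {months:.0f} months")
```

The calculation leaves out factors that do not have a price tag here, such as privacy requirements or the need for the largest flagship models, which may outweigh the cost difference in either direction.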

Conclusion

Local AI coding is evolving from "usable" to "excellent." The oMLX + MTP + Qwen3.6 combination proves that with the right hardware, local LLMs can already handle complex coding tasks at quality and speed levels approaching cloud services. As Apple Silicon continues to iterate and open-source models evolve rapidly, the local AI coding experience will only get better. For developers who own a Mac, now is the perfect time to explore a local AI coding workflow.