oMLX + MTP + Qwen3.6: Local AI Coding Speed Breaks New Records

oMLX + Qwen3.6 enables full-stack app development in 5 minutes with local LLM inference on Mac.
A developer used the oMLX inference engine, Qwen3.6 35B MoE model, and Pi Coding Agent to build a complete full-stack reminder app on an Apple Silicon Mac in under 5 minutes. Powered by Multi-Token Prediction (MTP), the model achieved 86.7 tokens/s generation speed, proving that local AI coding has evolved from merely functional to genuinely competitive with cloud services in privacy, cost, and speed.
Introduction: A New Era for Local LLM-Powered Coding
While we're still debating subscription costs for cloud-based AI coding assistants, local large language model inference speeds have quietly crossed a major milestone. A developer used the combination of oMLX + Pi Coding Agent + Qwen3.6 35B to build a complete full-stack reminder app in under 5 minutes — from backend API to frontend UI, all generated in one shot by a locally running LLM, with zero manual code editing.

The highlight of this demo isn't just code quality — it's inference speed. Thanks to Multi-Token Prediction (MTP) technology, the model achieved a generation speed of 86.7 tokens/s, with prompt processing reaching an impressive 1,735 tokens/s. For a 35B-parameter model, those numbers are nothing short of remarkable.
Tech Stack Breakdown: Three Core Components
oMLX: A Blazing-Fast Inference Engine on Apple Silicon
oMLX is a local LLM inference tool built on Apple's MLX framework and optimized for Apple Silicon (M-series chips). MLX is a machine learning framework that Apple open-sourced in late 2023, designed around Apple Silicon's Unified Memory Architecture (UMA). On a conventional discrete-GPU setup, model weights must be copied from system memory into GPU VRAM, and whatever runs on the GPU has to fit within that VRAM, which becomes the limiting factor for large models. Apple Silicon's UMA lets the CPU and GPU share the same physical memory, so no such copy is needed. This means a Mac with 128GB of unified memory can load an entire large model directly into GPU-accessible memory, without being constrained by the 24GB or 48GB of VRAM typical of consumer and workstation NVIDIA GPUs. MLX's API design is heavily influenced by PyTorch and JAX, and it uses lazy evaluation and dynamically built computation graphs to schedule work across the CPU and GPU at runtime.
oMLX builds on this framework, taking full advantage of the unified memory architecture to run large-parameter models locally at high speed. According to the article, it outperforms general-purpose solutions like llama.cpp on the Mac platform, though no head-to-head numbers are given.
The most critical upgrade in this demo was enabling Native MTP (Multi-Token Prediction). MTP was first systematically proposed by Meta in the research paper Better & Faster Large Language Models via Multi-token Prediction, and later adopted by Google in Gemma 4. At inference time it works like a built-in form of Speculative Decoding. A traditional autoregressive language model generates only one token per forward pass, so generating N tokens requires N full forward passes, each bottlenecked by memory bandwidth. MTP adds one or more lightweight "drafter heads" to the model architecture. These small networks reuse the main model's hidden states and cheaply predict tokens at several future positions. The main model then verifies these predictions in a single forward pass, keeping the drafted tokens up to the first mismatch and discarding the rest. Because decoding is limited by memory bandwidth rather than compute, verifying several tokens costs roughly the same as generating one, so each step can yield multiple tokens. The article puts the resulting speedup at approximately 2x; the actual gain depends on how many drafted tokens are accepted. Qwen3.6 natively supports MTP, and oMLX activates this acceleration directly, with no separate draft model to configure.
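The draft-and-verify idea can be sketched in a few lines of Python. This is a toy illustration, not oMLX's or Qwen3.6's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the main model and the drafter head, and the verification loop simulates what a real engine would do in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(context: List[Token], target: NextToken,
                     draft: NextToken, k: int) -> List[Token]:
    """One draft-and-verify step; returns the tokens produced by this step."""
    # 1. The cheap drafter proposes k tokens.
    drafted: List[Token] = []
    for _ in range(k):
        drafted.append(draft(context + drafted))

    # 2. The main model checks every drafted position. A real engine scores
    #    all positions in ONE forward pass; calling target() per position
    #    here only simulates that.
    accepted: List[Token] = []
    for tok in drafted:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(tok)

    # 3. Every draft matched, so the same pass also yields one extra token.
    accepted.append(target(context + accepted))
    return accepted


def generate(prompt: List[Token], target: NextToken, draft: NextToken,
             k: int, n_tokens: int) -> Tuple[List[Token], int]:
    """Generate n_tokens; also return the number of verification steps used."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(out, target, draft, k))
        steps += 1
    return out[len(prompt):len(prompt) + n_tokens], steps


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical main model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical drafter: agrees with the target except right after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, steps = generate([0], toy_target, toy_draft, k=3, n_tokens=12)
    print(tokens, steps)  # 12 tokens produced in 4 verification steps
```

The output is identical to plain one-token-at-a-time decoding with `toy_target`; only the number of main-model steps drops (4 instead of 12 here), which is where the speedup comes from.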
Qwen3.6 35B MoE: Balancing Performance and Efficiency
The model used in this demo is Qwen3.6 35B, a Mixture of Experts (MoE) architecture model. MoE is a neural network architecture whose origins go back to the 1991 paper Adaptive Mixtures of Local Experts by Jacobs, Jordan, Nowlan, and Hinton, and its sparsely activated form has experienced a renaissance in the large language model era. In a standard Transformer architecture, every Feed-Forward Network (FFN) layer uses all of its parameters for every input token. MoE replaces each FFN layer with multiple parallel "expert" sub-networks and introduces a gating network (Router/Gate) that decides which experts each token should be routed to.
The advantage of MoE is clear: while the total parameter count is 35B, only a subset of the experts is activated for each token (the article estimates perhaps 6-8B active parameters), so the actual computational cost is far less than that of a Dense model of equivalent size. Meanwhile, different experts can naturally "specialize" in different types of knowledge or tasks, with the model's total knowledge capacity determined by all experts' parameters combined. Google's Switch Transformer and Mistral's Mixtral 8x7B are landmark works applying MoE to LLMs, demonstrating that MoE can dramatically improve model capability while maintaining inference efficiency.
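A minimal sketch of top-k routing for a single token is shown below, in plain Python. It is illustrative only: the router weights, the `scaling_expert` helper, and the choice of `top_k` are hypothetical and do not reflect Qwen3.6's actual configuration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x: Vector, router_weights: List[Vector],
                experts: List[Expert], top_k: int) -> Tuple[Vector, List[int]]:
    """Route one token through its top_k experts and mix their outputs."""
    # Gating network: one logit per expert (a dot product with the token).
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_weights]

    # Keep only the top_k highest-scoring experts; the rest are never run.
    chosen = sorted(range(len(experts)), key=lambda i: logits[i],
                    reverse=True)[:top_k]

    # Renormalize the gate weights over the chosen experts.
    gates = softmax([logits[i] for i in chosen])

    out = [0.0] * len(x)
    for gate, idx in zip(gates, chosen):
        y = experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen


def scaling_expert(factor: float) -> Expert:
    """Hypothetical stand-in for an expert FFN: scales its input."""
    return lambda v: [factor * t for t in v]


if __name__ == "__main__":
    experts = [scaling_expert(f) for f in (1.0, 2.0, 3.0, 4.0)]
    router = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
    output, used = moe_forward([1.0, 0.0], router, experts, top_k=2)
    print(used)  # experts 3 and 1 are selected; experts 0 and 2 do no work
```

With four experts and `top_k=2`, only half of the expert parameters are touched for this token, which is the source of the compute savings described above.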
The developer configured a 131,072-token context window (approximately 131K), providing ample context space for complex full-stack projects and enabling the model to understand complete project requirements and generate all code files within a single conversation.
Pi Coding Agent: The Execution Layer for AI Coding
Pi Coding Agent serves as the coding agent layer, responsible for translating the LLM's output into actual file operations. It represents a paradigm shift from simple code completion to autonomous programming. Traditional AI coding assistants (like early GitHub Copilot) could only provide line-level or function-level code suggestions at the cursor position — developers still had to manually organize project structure, manage files, and execute commands. Coding Agents introduce a "plan-execute-verify" loop: they first analyze the requirements document and formulate an implementation plan; then use tool-calling capabilities to perform file system operations, run terminal commands, and read execution results; finally, they self-correct based on feedback. This architecture typically uses ReAct (Reasoning + Acting) or similar prompting frameworks, allowing the model to alternate between "thinking" and "acting."
Specifically, Pi Coding Agent can:
- Automatically create project directory structures
- Generate and write multiple code files
- Execute dependency installation commands like `npm install`
- Start frontend and backend services
This Agent paradigm lets developers simply provide a detailed requirements document, with the AI handling everything else automatically. Pi Coding Agent, Cursor Agent, Claude Code, and similar tools all fall into this category — they're redefining how developers interact with code.
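The "think, act, observe" loop behind such agents can be sketched as follows. This is not Pi Coding Agent's implementation: `scripted_model` is a hypothetical stand-in for the LLM that replays a fixed list of actions, and the tool names (`write_file`, `read_file`, `finish`) are invented for illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Tuple

Action = Dict[str, str]
History = List[Tuple[Action, str]]


def scripted_model(history: History) -> Action:
    """Hypothetical stand-in for the LLM: picks the next action from a script."""
    script = [
        {"tool": "write_file", "path": "package.json",
         "content": '{"name": "reminders"}'},
        {"tool": "read_file", "path": "package.json"},
        {"tool": "finish", "summary": "project scaffolded"},
    ]
    return script[len(history)]


def run_agent(model: Callable[[History], Action], workdir: Path,
              max_steps: int = 10) -> Tuple[str, History]:
    """Alternate between model decisions and tool execution until 'finish'."""
    history: History = []
    for _ in range(max_steps):
        action = model(history)  # "think": the model chooses the next action
        if action["tool"] == "finish":
            return action["summary"], history

        # "act": execute the requested tool inside the working directory
        if action["tool"] == "write_file":
            target = workdir / action["path"]
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(action["content"])
            observation = f"wrote {action['path']}"
        elif action["tool"] == "read_file":
            observation = (workdir / action["path"]).read_text()
        else:
            observation = f"unknown tool: {action['tool']}"

        # "observe": the result is fed back so the model can self-correct
        history.append((action, observation))
    raise RuntimeError("step budget exhausted before the model finished")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        summary, steps = run_agent(scripted_model, Path(tmp))
        print(summary, len(steps))
```

A real agent would add a tool for running shell commands (for example dependency installation) and would sandbox file paths; both are left out here to keep the sketch short and safe to run.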
Live Demo: Building an Apple-Style Reminder App in 5 Minutes
Requirements Design
The developer prepared a structured Markdown requirements document containing:
- Project Overview: Build a full-stack reminder web app inspired by Apple Reminders, with a dark UI, category-based lists and tag system, local-first with no cloud sync
- Tech Stack: Explicit frontend and backend technology choices
- Project Structure: Expected file directory layout
- Database Schema: Data table design
- REST API Endpoints: Complete interface definitions
- Frontend Requirements: Layout, sidebar, list selection, main content area, add/edit modals
- Acceptance Criteria: Functional completeness requirements
Execution Process
After pasting the requirements document into Pi Coding Agent, the entire build process was fully automated:
- The model first generated configuration files like `package.json`
- Created source code files one by one following the predefined project structure
- Generated the complete backend API and frontend components
- Finally provided access URLs for both frontend and backend
The entire process took less than 5 minutes. The resulting app closely resembled Apple Reminders with full CRUD functionality.
Performance Data Analysis
Several key performance metrics from this demo deserve attention:
| Metric | Value |
|---|---|
| Model Parameters | 35B (MoE) |
| Token Generation Speed | 86.7 tokens/s |
| Prompt Processing Speed | 1,735 tokens/s |
| Context Window | 131K tokens |
| Build Time | <5 minutes |
Understanding these numbers requires knowledge of the two fundamentally different phases of LLM inference: Prefill and Decode. The Prefill phase processes all prompt tokens from the user input. These tokens can be computed in parallel, which makes this phase primarily compute-bound and explains why it can be very fast, reaching 1,735 tokens/s in this case. The Decode phase generates output tokens one at a time: each new token depends on the tokens before it, so the steps cannot run in parallel. Every step has to re-read the model weights it uses from memory (for an MoE model, the shared layers plus the experts selected for that token), which makes this phase primarily memory-bandwidth-bound.
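A rough way to see the memory-bandwidth limit is to divide memory bandwidth by the bytes read per generated token. The sketch below is a back-of-the-envelope bound, not a benchmark: the bandwidth figure and the 4-bit quantization are hypothetical, the 7B active-parameter value is simply the midpoint of the article's 6-8B estimate, and KV-cache reads and other overheads are ignored.

```python
def decode_upper_bound_tps(bandwidth_gb_s: float,
                           params_read_billion: float,
                           bits_per_weight: float) -> float:
    """Upper bound on decode tokens/s if every token must read these weights."""
    bytes_per_token = params_read_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    bandwidth = 500.0  # GB/s, hypothetical value for illustration only
    moe = decode_upper_bound_tps(bandwidth, params_read_billion=7.0,
                                 bits_per_weight=4)
    dense = decode_upper_bound_tps(bandwidth, params_read_billion=35.0,
                                   bits_per_weight=4)
    # Whatever the bandwidth, the ratio depends only on the weights read.
    print(round(moe / dense, 1))  # 5.0
```

Under these assumptions the MoE bound is five times the Dense bound for any bandwidth value, because the ratio is just 35B / 7B.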
This is the fundamental reason why 1,735 tokens/s is the prompt processing (prefill) speed, while the actual token generation speed is 86.7 tokens/s. In coding scenarios, since requirements documents are typically long while the generated code volume is even larger, performance in both phases matters: prefill speed determines how quickly the model "understands" long documents, while decode speed determines the perceived code "typing speed" for the user.
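To get a feel for how the two speeds combine, here is a small estimate using the measured rates from the table. The token counts are hypothetical (the article does not report them), and the estimate ignores tool execution time and any re-processing of context between agent turns.

```python
PREFILL_TPS = 1735.0  # prompt processing speed reported in the demo
DECODE_TPS = 86.7     # token generation speed reported in the demo


def generation_time_s(prompt_tokens: int, output_tokens: int,
                      prefill_tps: float = PREFILL_TPS,
                      decode_tps: float = DECODE_TPS) -> float:
    """Seconds spent reading the prompt plus seconds spent generating output."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


if __name__ == "__main__":
    # Hypothetical workload: a 3,000-token spec and 20,000 tokens of code.
    seconds = generation_time_s(prompt_tokens=3_000, output_tokens=20_000)
    print(f"{seconds:.1f} s = {seconds / 60:.1f} min")  # 232.4 s = 3.9 min
```

In this example prefill accounts for under two seconds, so almost all of the wall-clock time is decode, which is why the MTP speedup matters so much for the end-to-end build time.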
For a 35B-parameter model, 86.7 tokens/s generation speed is outstanding. For comparison, the article states that an equivalent Dense model without MTP typically achieves only 40-50 tokens/s.
Practical Advice and Limitations
Hardware Requirements
The demo ran on an Apple M5 Max chip, so in practice you'll need a Mac with a high-end M-series chip. Specifically:
- M4 Pro (48GB) can run it, but the context window will need to be reduced
- M4 Max / M5 Max (64GB+) is the ideal choice
- More unified memory means a larger available context window
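The link between unified memory and usable context can be estimated with simple arithmetic: the weights take a fixed amount of memory, and the KV cache grows with context length. The sketch below uses the standard formula for a conventional attention KV cache. The layer count, KV-head count, and head dimension are hypothetical placeholders, not Qwen3.6's real configuration, and the quantization level is likewise an assumption.

```python
def weights_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights in GB (1e9 bytes)."""
    return total_params_billion * bits_per_weight / 8


def kv_cache_gb(context_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """Memory for keys and values (hence the factor 2) across all layers."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens / 1e9


if __name__ == "__main__":
    # 35B total parameters at an assumed 4-bit quantization.
    print(weights_gb(35, 4))  # 17.5

    # Hypothetical attention shape, full 131,072-token context, 16-bit cache.
    cache = kv_cache_gb(131_072, n_layers=40, n_kv_heads=4, head_dim=128)
    print(round(cache, 1))  # 10.7
```

Under these assumptions a shorter context window frees memory linearly, which is the trade-off the hardware list above describes for machines with less unified memory.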
MoE vs. Dense: Trade-offs
The developer acknowledged in the video that this demo used an MoE model rather than a Dense model. While MoE models offer faster inference, they may not match the precision of same-parameter-count Dense models on certain complex reasoning tasks. However, based on actual results, MoE model performance is more than sufficient for code generation tasks.
Ideal Use Cases
This setup is particularly well-suited for:
- Rapid prototyping
- Privacy-sensitive enterprise internal projects
- Personal projects requiring frequent iteration
- Development environments with poor network conditions
Local Inference vs. Cloud Services: How to Choose
The competition between local AI inference and cloud API services is reshaping the developer tools market. Cloud services (such as the OpenAI API and the Anthropic Claude API) offer advantages like zero hardware investment, continuously updated models, and the ability to run the largest flagship models. But their drawbacks are equally clear: ongoing subscription or usage fees (which can reach hundreds of dollars per month for heavy users), data privacy risks (code must be uploaded to third-party servers), and dependence on network latency and availability. Local inference avoids these drawbacks: after a one-time hardware investment, the marginal cost of inference is close to zero (essentially electricity), and code never leaves your machine. As open-source model quality improves rapidly (Qwen, Llama, DeepSeek, and others), and consumer-grade hardware like Apple Silicon continues to gain inference capability, the quality gap between local and cloud solutions is shrinking fast. For many practical coding tasks, a 35B-class local model can already deliver code generation quality that approaches cloud-based flagship models.
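The cost side of this trade-off reduces to a break-even calculation. The figures below are hypothetical placeholders chosen only to show the arithmetic; substitute your own hardware price, cloud spend, and electricity cost.

```python
import math


def break_even_months(hardware_cost: float, monthly_cloud_cost: float,
                      monthly_power_cost: float = 0.0) -> float:
    """Months until a one-time hardware purchase beats recurring cloud fees."""
    monthly_saving = monthly_cloud_cost - monthly_power_cost
    if monthly_saving <= 0:
        return math.inf  # local never pays for itself at these rates
    return hardware_cost / monthly_saving


if __name__ == "__main__":
    # Hypothetical: $4,000 Mac, $200/month cloud usage, $10/month electricity.
    print(round(break_even_months(4000, 200, 10), 1))  # 21.1
```

The result is sensitive to usage: a light user paying a small monthly fee may never reach break-even, while a heavy user at several hundred dollars per month gets there much sooner.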
Conclusion
Local AI coding is evolving from "usable" to "excellent." The oMLX + MTP + Qwen3.6 combination proves that with the right hardware, local LLMs can already handle complex coding tasks at quality and speed levels approaching cloud services. As Apple Silicon continues to iterate and open-source models evolve rapidly, the local AI coding experience will only get better. For developers who own a Mac, now is the perfect time to explore a local AI coding workflow.