#inference speed

55 related articles

June 3, 2026 · 2 min

RAG System End-to-End Breakdown: From Vector Indexing to Production Optimization

A deep dive into the complete RAG pipeline — covering vector embeddings, document chunking, retrieval and reranking, plus three production optimization techniques for building accurate enterprise AI knowledge base applications.

Product Reviews

June 3, 2026 · 3 min

WhichLLM: One Command to Find the Best Local LLM for Your Hardware

WhichLLM is an open-source tool that auto-detects your hardware and recommends the best local LLM using real benchmark data. Simulate GPUs, filter fake benchmarks, and start chatting in one command.

Tutorials

June 2, 2026 · 4 min

Coze Workflow Tutorial: One-Click Short Video Generation Complete Guide

Step-by-step guide to building an automated short video generation workflow on Coze, covering script writing, voiceover, AI images, video synthesis, and CapCut packaging.

Tutorials

June 2, 2026 · 3 min

llama.cpp MTP Acceleration Deployment Guide: Configuration Steps & Real-World Benchmarks

Guide to enabling MTP (multi-token prediction) acceleration in llama.cpp, covering CUDA setup, desktop configuration, model selection, and benchmarks showing ~60 tokens/s with Qwen3 27B.

Tutorials

June 2, 2026 · 4 min

Practical Guide to Building a Local AI Knowledge Base with Qwen3.5 + RAGFlow + Ollama

Step-by-step guide to building a local RAG knowledge base using RAGFlow, Ollama, and LM Studio with Docker, covering embedding model deployment and network troubleshooting for private AI Q&A.

Tutorials

June 2, 2026 · 2 min

Tutorial: Building a Low-Cost AI Code Editor with DeepSeek-V3 + VSCode

Step-by-step tutorial: build a low-cost AI programming assistant using the DeepSeek-V3 API with VSCode's Continue plugin. Covers setup, API key configuration, a code completion demo, and local deployment with Ollama.

Product Reviews

June 2, 2026 · 3 min

Cursor 2.0 Deep Dive: Hands-On Testing of Five Major Features Including Custom Models and Multi-Agent Parallel Development

Deep dive into Cursor 2.0's five major updates: custom Composer model, Git Worktrees multi-agent parallel development, Agent View mode, built-in browser, and more—with hands-on evaluation.

Deep Dives

June 2, 2026 · 4 min

Core Principles of the Transformer Architecture: A Deep Dive into Self-Attention Mechanisms and Engineering Optimizations

Deep dive into Transformer architecture covering self-attention QKV mechanics, Encoder-Decoder structure, Flash Attention memory optimization, RoPE positional encoding, and GQA inference acceleration.

Product Reviews

June 2, 2026 · 3 min

Hertzman: A Free, No-Install Local LLM Deployment Tool Review

Detailed review of Hertzman local inference engine covering one-click deployment, smart hardware recommendations, OpenAI-compatible API, and performance comparison with LM Studio.

Tutorials

June 2, 2026 · 2 min

Complete Guide to Configuring Local DeepSeek Model in PyCharm for AI-Assisted Programming

Learn how to configure a local DeepSeek model in PyCharm via Ollama for free, privacy-safe AI-assisted programming. Includes installation steps, plugin setup, usage tips, and hardware recommendations.

Product Reviews

June 2, 2026 · 3 min

Claude Code 2.1 Deep Dive: Hooks Automation, MCP Ecosystem & Multi-Agent Collaboration Fully Explained

Full breakdown of Claude Code 2.1: Opus 4.6 model upgrade, Hooks deterministic automation, Skills multi-agent collaboration, MCP tool chain integration, plus IDE shortcuts and practical commands.

Tutorials

June 1, 2026 · 3 min

oMLX + MTP + Qwen3.6: Local AI Coding Speed Breaks New Records

Using oMLX with MTP and Qwen3.6 35B on Apple Silicon Mac to achieve 86.7 tokens/s local coding speed, building a full-stack app in under 5 minutes.

Industry Insights

June 1, 2026 · 4 min

Qoder's Context Engineering in Practice: Four-Layer Retrieval Engine and Memory System Architecture

Deep analysis of Qoder's (Tongyi Lingma international edition) context engineering architecture, including its four-layer retrieval engine, memory engine, context caching, and core product design.

Product Reviews

May 31, 2026 · 3 min

Claude Code vs Codex Deep Dive: A Practical Guide to Choosing the Right AI Coding Tool

A comprehensive comparison of Claude Code and OpenAI Codex covering architecture, use cases, and benchmarks to help you choose the right AI coding tool.

Tech Frontiers

May 30, 2026 · 1 min

Windsurf Integrates Claude Opus 4.7 Fast Mode with 2.5x Speed Boost

Windsurf integrates Claude Opus 4.7 fast mode with 2.5x speed boost while retaining full intelligence. Analysis of its impact on developer productivity and AI coding tool competition.

Tech Frontiers

May 30, 2026 · 2 min

Step 3.7 Flash: Deep Dive into the 198B Sparse MoE Multimodal Model

Deep dive into StepFun AI's Step 3.7 Flash, a 198B sparse MoE vision-language model with 256K context and 3-level reasoning, excelling in multimodal understanding, AI coding, and Agent tool orchestration.

Tech Frontiers

May 30, 2026 · 1 min

SGLang Hosts Agent Loops Office Hour, Focusing on Agentic Loop Architecture Optimization

The SGLang team hosts an Agent Loops Office Hour exploring inference optimization for agentic loops, covering KV cache reuse, low-latency multi-turn dialogue, and tool calling techniques.

Product Reviews

May 29, 2026 · 3 min

Claude Code with MiniMax M2: Testing a Low-Cost AI Coding Solution Across Three Real Projects

Real-world testing of MiniMax M2 as Claude Code's backend model across three projects: framework migration, iOS development, and full-stack MVP — at just 8% of Claude's price.

Tutorials

May 29, 2026 · 3 min

DeepSeek V4 Flash MTP Speculative Decoding Real-World Test: A Guide to 20% Faster Local Inference

Real-world testing of DeepSeek V4 Flash with MTP speculative decoding: ~20% speedup for code generation, minimal gains for text. Covers memory overhead, accuracy differences, Q4 vs Q3 quantization, and full deployment tutorial.

Tutorials

May 29, 2026 · 3 min

Building a SaaS Website with AI and Zero Code: A Complete Bolt + Cursor Walkthrough

Learn how to build a SaaS website with AI image generation, multimodal chat, and webpage replication using only Bolt and Cursor — no code required. Covers prompt design, architecture, and iteration techniques.