#constrained decoding

3 related articles

June 6, 2026 · 3 min

vLLM Deep Dive: How PagedAttention Enables High-Throughput LLM Inference

Deep dive into vLLM's core technologies for high-throughput LLM inference, including PagedAttention memory management, continuous batching, distributed deployment, and comparisons with TensorRT-LLM.

Tutorials

June 2, 2026 · 3 min

Practical Guide to Building an Intelligent Coding Assistant with the OpenAI API

A detailed guide to building an intelligent coding assistant with the OpenAI API, covering the Chat Completions, Responses, and Assistants APIs, GPT-4.5 vs. Codex models, and tools such as Function Calling and Code Interpreter.

Tech Frontiers

May 30, 2026 · 1 min

SGLang Hosts Agent Loops Office Hour, Focusing on Agentic Loop Architecture Optimization

The SGLang team hosts an Agent Loops Office Hour on inference optimization for agentic loops, covering KV cache reuse, low-latency multi-turn dialogue, and tool-calling techniques.