Tutorial: Deploying a PD-Disaggregated SGLang Multi-Node Inference Cluster on AMD GPUs

dstack.ai releases a PD-disaggregated SGLang multi-node inference deployment solution for AMD GPUs.
The dstack.ai team shared a tutorial on deploying a PD-disaggregated SGLang inference framework on AMD GPUs. The solution separates the compute-intensive Prefill stage from the memory bandwidth-intensive Decode stage onto different GPU nodes, enabling multi-node cluster deployment via a single configuration file. It significantly improves inference throughput and latency while validating the maturity of AMD's ROCm ecosystem for AI inference.
Overview
Recently, the dstack.ai team shared a detailed tutorial on deploying a PD (Prefill-Decode) disaggregated SGLang inference framework on AMD GPUs. The solution supports multi-node cluster deployment through a single configuration file, offering a new practical approach to optimizing large model inference performance.
dstack.ai is an open-source infrastructure orchestration platform designed specifically for AI/ML workloads. Its design philosophy resembles a "Kubernetes for AI": a general-purpose orchestrator, but one optimized for GPU clusters and large model training/inference scenarios. It abstracts away underlying cloud resource differences through declarative YAML configuration files, supports unified management across AWS, GCP, Azure, and on-premises data centers, and reduces complex distributed system deployments to configuration declarations, which lowers the operational burden on MLOps teams.

What Is PD-Disaggregated Architecture?
Decoupling the Prefill and Decode Stages
During large language model inference, there are two critical stages: Prefill and Decode. The Prefill stage processes all tokens of the input prompt and is a compute-intensive task. The Decode stage generates output tokens one at a time and is a memory bandwidth-intensive task.
The computational differences between these two stages stem from fundamentally different underlying operation patterns. The Prefill stage performs parallel matrix multiplication across the entire input sequence, so every weight read from memory is reused for many tokens. This results in extremely high utilization of GPU compute units (CUDA Cores/Stream Processors) but relatively low demand on memory bandwidth per unit of computation. The Decode stage is the opposite: it generates only one token at a time with minimal computation, but must repeatedly read the model weights and the massive KV Cache (Key-Value Cache) from GPU memory, making memory bandwidth the bottleneck. This disparity means that mixed deployment often leads to resource waste: either "compute waiting on bandwidth" or "bandwidth waiting on compute."
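The contrast can be made concrete with a rough arithmetic-intensity estimate (FLOPs performed per byte read from memory). The sketch below is a deliberate simplification, not a profiler result: it considers a single weight matrix of a hypothetical size (8192 x 8192 in 16-bit precision), counts two FLOPs per multiply-add, and assumes the weights are read from memory once per forward pass, ignoring activation and KV Cache traffic.

```python
def arithmetic_intensity(tokens: int, d_in: int, d_out: int,
                         bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weight traffic for one linear layer over `tokens` tokens."""
    flops = 2 * tokens * d_in * d_out               # one multiply-add = 2 FLOPs
    weight_bytes = d_in * d_out * bytes_per_param   # weights read once per pass
    return flops / weight_bytes


# Hypothetical shapes: a 2048-token prompt versus a single decode step.
prefill = arithmetic_intensity(tokens=2048, d_in=8192, d_out=8192)
decode = arithmetic_intensity(tokens=1, d_in=8192, d_out=8192)

print(f"prefill: {prefill:.0f} FLOPs/byte")  # prefill: 2048 FLOPs/byte
print(f"decode:  {decode:.0f} FLOPs/byte")   # decode:  1 FLOPs/byte
```

Under these assumptions the intensity scales with the number of tokens processed per pass, which is why a long prompt keeps the compute units busy while a single decode step for one request leaves them waiting on memory.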
Traditional deployment approaches run both stages on the same set of GPUs, leading to suboptimal resource utilization. PD-disaggregated architecture assigns these two stages to different GPU nodes, allowing each set of hardware to be optimized for its specific workload characteristics, thereby significantly improving overall throughput and latency.
Cross-Node KV Cache Migration
A key technical challenge in PD-disaggregated architecture is cross-node KV Cache migration. After the Prefill node finishes processing the input, the generated KV Cache must be transferred to the Decode node over a high-speed network (typically InfiniBand or RoCE). SGLang reduces migration latency through optimized tensor transfer protocols and zero-copy techniques. Because decoding cannot begin until the cache has arrived, the efficiency of this process directly impacts Time To First Token (TTFT). Inter-node network bandwidth is therefore a critical infrastructure metric for PD-disaggregated architecture, and interconnects of 100 Gbps or more are typically required.
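A back-of-the-envelope calculation shows why the interconnect matters. The sketch below assumes a hypothetical model shape (80 layers, 8 KV heads, head dimension 128, 16-bit cache entries) and an ideal link with no protocol overhead; none of these numbers come from the tutorial itself, so treat the results as order-of-magnitude estimates only.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """Size of the KV Cache for one request: a K and a V tensor per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


def transfer_seconds(num_bytes: int, link_gbps: float) -> float:
    """Ideal transfer time over a link of `link_gbps` gigabits per second."""
    return num_bytes * 8 / (link_gbps * 1e9)


# Hypothetical request: an 8192-token prompt on the assumed model shape.
size = kv_cache_bytes(tokens=8192, layers=80, kv_heads=8, head_dim=128)

print(f"KV Cache size: {size / 2**30:.2f} GiB")                  # 2.50 GiB
print(f"at 100 Gbps:   {transfer_seconds(size, 100):.3f} s")     # 0.215 s
print(f"at  10 Gbps:   {transfer_seconds(size, 10):.3f} s")      # 2.147 s
```

Even in this idealized case, a 10 Gbps link would add roughly two seconds before decoding could start, while a 100 Gbps link keeps the added delay to a few hundred milliseconds.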
SGLang's PD-Disaggregated Implementation
SGLang is a high-performance framework for large model inference and serving. Its PD disaggregation feature allows users to deploy Prefill and Decode nodes independently. This architecture is particularly well-suited for large-scale production environments, enabling dynamic adjustment of the Prefill-to-Decode node ratio based on actual workload for more flexible resource scheduling.
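To illustrate how the Prefill-to-Decode ratio might follow the workload, here is a minimal capacity-planning sketch. It is not SGLang's scheduling logic: the function `size_pools` and the per-node throughput figures are hypothetical and would have to be replaced with measurements from the actual model and hardware.

```python
import math


def size_pools(req_per_s: int, avg_input_tokens: int, avg_output_tokens: int,
               prefill_tok_per_s: int, decode_tok_per_s: int) -> tuple[int, int]:
    """Return (prefill_nodes, decode_nodes) needed to keep up with the load."""
    prefill_nodes = math.ceil(req_per_s * avg_input_tokens / prefill_tok_per_s)
    decode_nodes = math.ceil(req_per_s * avg_output_tokens / decode_tok_per_s)
    return max(1, prefill_nodes), max(1, decode_nodes)


# Assumed per-node throughput (hypothetical): 20,000 prefill tokens/s
# and 2,000 decode tokens/s.
chat = size_pools(10, avg_input_tokens=4000, avg_output_tokens=300,
                  prefill_tok_per_s=20_000, decode_tok_per_s=2_000)
rag = size_pools(10, avg_input_tokens=16_000, avg_output_tokens=300,
                 prefill_tok_per_s=20_000, decode_tok_per_s=2_000)

print(chat)  # (2, 2)
print(rag)   # (8, 2)
```

With the same request rate, the long-context workload needs four times as many Prefill nodes while the Decode pool stays unchanged, which is the kind of independent adjustment that a co-located deployment cannot express.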
AMD GPU Multi-Node Deployment: A Detailed Look
Cluster Management via a Single Configuration File
The standout feature of the dstack.ai solution is that a single configuration file is all it takes to deploy a multi-node cluster. This dramatically reduces operational complexity — users don't need to write separate deployment scripts for each node. A unified configuration declaration defines:
- Number of Prefill nodes and GPU allocation
- Number of Decode nodes and GPU allocation
- Inter-node communication settings
- Model loading and serving parameters
This "Infrastructure as Code" approach — leveraging service discovery, health checks, and automated network configuration — compresses what would otherwise require dozens of coordinated scripts into a single version-controlled, reproducible configuration file, greatly improving deployment maintainability and portability.
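The idea of one declaration expanding into a per-node plan can be sketched in a few lines. The example below is purely illustrative and does not reproduce dstack's actual configuration schema: `ClusterSpec`, its fields, the port number, and the model name are all hypothetical stand-ins for the four items listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical single-file declaration of a PD-disaggregated cluster."""
    model: str
    prefill_nodes: int
    decode_nodes: int
    gpus_per_node: int
    transfer_port: int  # port assumed for inter-node KV Cache transfer


def expand(spec: ClusterSpec) -> list[dict]:
    """Derive one launch record per node from the single declaration."""
    roles = ["prefill"] * spec.prefill_nodes + ["decode"] * spec.decode_nodes
    return [
        {"rank": rank, "role": role, "gpus": spec.gpus_per_node,
         "model": spec.model, "transfer_port": spec.transfer_port}
        for rank, role in enumerate(roles)
    ]


spec = ClusterSpec(model="example-org/example-model", prefill_nodes=2,
                   decode_nodes=1, gpus_per_node=8, transfer_port=9000)
plan = expand(spec)

for node in plan:
    print(node["rank"], node["role"], node["gpus"])
# 0 prefill 8
# 1 prefill 8
# 2 decode 8
```

Because every node's settings are derived from the same object, changing the node counts or the model in one place updates the whole plan, which is the property that makes a single version-controlled file reproducible.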
AMD GPU Ecosystem Advances in AI Inference
The successful deployment of this solution on AMD GPUs further demonstrates AMD's ecosystem maturity in the AI inference space. ROCm (originally short for Radeon Open Compute) is AMD's open-source software platform for high-performance computing and AI workloads, essentially AMD's counterpart to NVIDIA's CUDA. ROCm includes the HIP (Heterogeneous-compute Interface for Portability) programming interface, which allows CUDA code to be ported to AMD GPUs at relatively low cost. In recent years, ROCm has been steadily closing the gap in stability and performance. AMD data center GPUs such as the MI300X, with 192 GB of HBM3 memory, offer unique advantages in large model inference scenarios, particularly for use cases that require running very large models on a single card or a small number of cards. As the ROCm software stack continues to mature, an increasing number of mainstream inference frameworks (including SGLang, vLLM, and others) now support AMD GPUs, providing users with a cost-effective alternative to NVIDIA.
Practical Value of PD-Disaggregated Architecture
Performance and Cost Advantages
The core benefits of PD-disaggregated architecture combined with multi-node deployment include:
- Higher inference throughput: Prefill and Decode nodes each handle their own tasks, eliminating resource contention
- Lower inference latency: Targeted optimization of execution efficiency for each stage
- Elastic scalability: Prefill and Decode node counts can be scaled independently
- GPU cost optimization: Different GPU instance types can be selected for different stages
Among these, Time To First Token (TTFT) and overall throughput are the two core metrics for LLM inference services, and they often exist in natural tension. In traditional mixed deployments, Prefill and Decode compete for the same GPUs: a long Prefill stalls token generation for requests already in the Decode stage, and new requests in turn wait for their Prefill to be scheduled alongside running Decode batches, causing dramatic TTFT fluctuations. PD-disaggregated architecture removes this interference through physical isolation: Prefill nodes can continuously process new request inputs while Decode nodes focus exclusively on token generation for existing requests. This can significantly improve P99 TTFT (the 99th-percentile latency), which is critical for user-facing real-time conversational applications.
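For readers less familiar with tail metrics, the following sketch shows how P99 TTFT exposes fluctuations that an average hides. The latency samples are invented for illustration, and the nearest-rank method used here is only one of several common percentile definitions.

```python
def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of data at or below it."""
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100   # integer ceil(p * n / 100), 1-based
    return ordered[max(rank, 1) - 1]


# Invented TTFT samples in milliseconds: mostly fast, with two slow requests
# that stand in for requests delayed by interference.
ttft_ms = [120] * 98 + [900, 2500]

mean = sum(ttft_ms) / len(ttft_ms)
p50 = percentile(ttft_ms, 50)
p99 = percentile(ttft_ms, 99)

print(f"mean={mean:.1f} ms  p50={p50} ms  p99={p99} ms")
# mean=151.6 ms  p50=120 ms  p99=900 ms
```

The mean and median suggest a healthy service, while the P99 value reveals that some users wait far longer, which is the fluctuation that isolating Prefill from Decode is meant to reduce.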
Applicable Scenarios
This deployment approach is particularly well-suited for:
- High-concurrency online LLM inference services
- Real-time applications with strict TTFT requirements
- RAG or document analysis scenarios involving long-context inputs
- Enterprise-grade multi-tenant shared inference clusters
Conclusion
The solution provided by dstack.ai points to a practical direction for modern LLM inference deployment: combining architecture-level optimization (PD disaggregation) with infrastructure automation (single-file deployment) to run efficient multi-node inference clusters on AMD GPU hardware. Its value lies not only in the technical implementation itself, but in how it brings together three elements (the differing computational characteristics of Prefill and Decode, KV Cache migration optimization, and the maturing ROCm ecosystem) into a complete, production-oriented engineering solution. For teams evaluating inference infrastructure options and looking to run large model inference services on AMD GPUs, it is a useful reference.