Tutorial: Deploying a PD-Disaggregated SGLang Multi-Node Inference Cluster on AMD GPUs

Overview

Recently, the dstack.ai team shared a detailed tutorial on deploying SGLang with PD (Prefill-Decode) disaggregation on AMD GPUs. The solution deploys a multi-node cluster from a single configuration file, offering a practical new approach to optimizing large model inference performance.

dstack.ai is an open-source infrastructure orchestration platform designed for AI/ML workloads. Its role resembles that of Kubernetes, but it is built around GPU clusters and large model training and inference. Declarative YAML configuration files abstract away differences between underlying cloud resources, so the same workflow manages AWS, GCP, Azure, and on-premises data centers. Complex distributed deployments are reduced to configuration declarations, which lowers the operational burden on MLOps teams.

PD-disaggregated SGLang Deployment

What Is PD-Disaggregated Architecture?

Decoupling the Prefill and Decode Stages

Large language model inference has two critical stages: Prefill and Decode. The Prefill stage processes all tokens of the input prompt and is compute-intensive. The Decode stage generates output tokens one at a time and is memory-bandwidth-intensive.

The difference between the two stages comes from their underlying operation patterns. The Prefill stage runs matrix multiplications over the entire input sequence in parallel, which drives GPU compute units (CUDA Cores/Stream Processors) to very high utilization while placing relatively little pressure on memory bandwidth. The Decode stage is the opposite: each step generates a single token per request and performs little computation, yet it must re-read the model weights and the growing KV Cache (Key-Value Cache) from GPU memory, so memory bandwidth becomes the bottleneck. When both stages share the same GPUs, one resource tends to sit idle: either compute waits on bandwidth, or bandwidth waits on compute.
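The contrast can be made concrete with a rough arithmetic-intensity estimate. The sketch below is a simplified model, not a profiler: it considers a single dense layer, counts only weight traffic (activation and KV Cache reads are ignored), and uses round, illustrative hardware numbers rather than vendor specifications. The function names are hypothetical.

```python
def arithmetic_intensity(tokens: int, d_in: int, d_out: int,
                         bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weight traffic for one dense layer over `tokens` tokens."""
    flops = 2 * tokens * d_in * d_out              # one multiply-add per weight per token
    weight_bytes = bytes_per_param * d_in * d_out  # weights are read once per pass
    return flops / weight_bytes


def bound_by(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a workload against the hardware's roofline ridge point."""
    ridge = peak_flops / mem_bandwidth             # FLOPs per byte where the limits meet
    return "compute" if intensity >= ridge else "memory bandwidth"


# Illustrative assumptions: 1e15 FLOP/s peak and 5e12 B/s bandwidth (ridge = 200).
PEAK_FLOPS = 1.0e15
MEM_BANDWIDTH = 5.0e12

prefill = arithmetic_intensity(tokens=2048, d_in=8192, d_out=8192)
decode = arithmetic_intensity(tokens=1, d_in=8192, d_out=8192)

print(f"prefill: {prefill:.0f} FLOP/byte, bound by {bound_by(prefill, PEAK_FLOPS, MEM_BANDWIDTH)}")
print(f"decode:  {decode:.0f} FLOP/byte, bound by {bound_by(decode, PEAK_FLOPS, MEM_BANDWIDTH)}")
```

Under these assumptions a 2048-token prefill lands far above the ridge point, while a single-token decode step lands far below it. Batching many decode requests raises the intensity roughly in proportion to batch size, which is why real decode nodes batch aggressively.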

Traditional deployments run both stages on the same set of GPUs, which leads to suboptimal resource utilization. PD-disaggregated architecture assigns the two stages to different GPU nodes so that each set of hardware can be tuned for its own workload, which can raise overall throughput and reduce latency.

Cross-Node KV Cache Migration

A key technical challenge in PD-disaggregated architecture is cross-node KV Cache migration. After the Prefill node finishes processing the input, the KV Cache it produced must be transferred to the Decode node over a high-speed network (typically InfiniBand or RoCE). SGLang reduces migration latency through optimized tensor transfer protocols and zero-copy techniques. Because the Decode node cannot begin generating until the transfer completes, the efficiency of this step directly affects Time To First Token (TTFT). Inter-node network bandwidth is therefore a critical infrastructure metric for PD-disaggregated deployments, which typically call for interconnects of 100 Gbps or more.
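To get a feel for why the interconnect matters, the following back-of-the-envelope sketch estimates the size of one request's KV Cache and the time needed to move it. The model shape (80 layers, 8 KV heads, head dimension 128, 16-bit values) is an illustrative assumption resembling a 70B-class model with grouped-query attention, and the calculation assumes the link runs at full line rate with no protocol overhead, which real networks do not achieve.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV Cache for one sequence: a K and a V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(num_bytes: int, link_gbps: float,
                     efficiency: float = 1.0) -> float:
    """Ideal transfer time over a link rated in gigabits per second."""
    bytes_per_second = link_gbps * 1e9 / 8 * efficiency
    return num_bytes / bytes_per_second


# Assumed model shape, for illustration only.
size = kv_cache_bytes(seq_len=8192, num_layers=80, num_kv_heads=8, head_dim=128)

print(f"KV Cache for an 8192-token prompt: {size / 2**30:.2f} GiB")
for gbps in (25, 100, 400):
    print(f"{gbps:>3} Gbps link: {transfer_seconds(size, gbps) * 1000:.0f} ms")
```

With these numbers an 8192-token prompt carries about 2.5 GiB of KV Cache, which takes roughly 215 ms at 100 Gbps even in the ideal case. The `efficiency` parameter can be lowered to model protocol overhead or link sharing.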

SGLang's PD-Disaggregated Implementation

SGLang is a high-performance framework for large model inference and serving. Its PD disaggregation feature allows users to deploy Prefill and Decode nodes independently. This architecture is particularly well-suited for large-scale production environments, enabling dynamic adjustment of the Prefill-to-Decode node ratio based on actual workload for more flexible resource scheduling.
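As a rough illustration of how the Prefill-to-Decode ratio might follow the workload, the sketch below sizes each pool from the token load it has to absorb. It is a capacity-planning heuristic, not SGLang's or dstack's scheduling logic. The function name and the per-node throughput figures are hypothetical and would need to be measured on real hardware.

```python
import math


def size_pools(req_per_s: float, avg_input_tokens: float, avg_output_tokens: float,
               prefill_tok_per_s: float, decode_tok_per_s: float,
               headroom: float = 0.2) -> tuple[int, int]:
    """Return (prefill_nodes, decode_nodes) needed to absorb the token load."""
    prefill_load = req_per_s * avg_input_tokens    # prompt tokens per second
    decode_load = req_per_s * avg_output_tokens    # generated tokens per second
    scale = 1 + headroom                           # spare capacity for bursts
    prefill_nodes = max(1, math.ceil(prefill_load * scale / prefill_tok_per_s))
    decode_nodes = max(1, math.ceil(decode_load * scale / decode_tok_per_s))
    return prefill_nodes, decode_nodes


# Assumed per-node throughput: 20,000 prompt tok/s and 2,000 generated tok/s.
rag_like = size_pools(10, avg_input_tokens=4000, avg_output_tokens=300,
                      prefill_tok_per_s=20_000, decode_tok_per_s=2_000)
chat_like = size_pools(10, avg_input_tokens=200, avg_output_tokens=800,
                       prefill_tok_per_s=20_000, decode_tok_per_s=2_000)

print("long-prompt workload (prefill, decode):", rag_like)
print("chat workload        (prefill, decode):", chat_like)
```

At the same request rate, the long-prompt workload needs more Prefill nodes than the chat workload, and the chat workload needs more Decode nodes. This is the kind of shift that independent scaling of the two pools is meant to accommodate.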

AMD GPU Multi-Node Deployment: A Detailed Look

Cluster Management via a Single Configuration File

The standout feature of the dstack.ai solution is that a single configuration file is all it takes to deploy a multi-node cluster. This reduces operational complexity considerably, because users don't need to write separate deployment scripts for each node. A unified configuration declaration defines:

Number of Prefill nodes and GPU allocation
Number of Decode nodes and GPU allocation
Inter-node communication settings
Model loading and serving parameters
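The idea of expanding one declaration into per-node launch commands can be sketched as follows. This is not dstack's configuration schema, which is YAML and is not reproduced here. `Pool`, `ClusterSpec`, and `launch_commands` are hypothetical names, the model path is a placeholder, and the SGLang flags shown (`--model-path`, `--disaggregation-mode`, `--tp-size`, `--host`, `--port`) are assumptions that should be checked against the installed SGLang version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    """One group of identical nodes serving a single stage."""
    role: str            # "prefill" or "decode"
    nodes: int           # number of nodes in the pool
    gpus_per_node: int   # GPUs used for tensor parallelism on each node


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical single-file description of a PD-disaggregated cluster."""
    model_path: str
    prefill: Pool
    decode: Pool
    port: int = 30000


def launch_commands(spec: ClusterSpec) -> list[list[str]]:
    """Expand the spec into one server launch command per node."""
    commands = []
    for pool in (spec.prefill, spec.decode):
        for _ in range(pool.nodes):
            commands.append([
                "python", "-m", "sglang.launch_server",
                "--model-path", spec.model_path,
                "--disaggregation-mode", pool.role,    # assumed flag, verify locally
                "--tp-size", str(pool.gpus_per_node),
                "--host", "0.0.0.0",
                "--port", str(spec.port),
            ])
    return commands


spec = ClusterSpec(
    model_path="example-org/example-model",  # placeholder, not a real model ID
    prefill=Pool(role="prefill", nodes=1, gpus_per_node=8),
    decode=Pool(role="decode", nodes=2, gpus_per_node=8),
)
for cmd in launch_commands(spec):
    print(" ".join(cmd))
```

The sketch covers only node counts, GPU allocation, and model parameters. A real deployment also needs the inter-node pieces, such as a router in front of the two pools and the KV Cache transfer settings, which the orchestrator is expected to wire up.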

This "Infrastructure as Code" approach relies on service discovery, health checks, and automated network configuration to replace what would otherwise be a collection of coordinated per-node scripts with a single version-controlled, reproducible configuration file, which improves deployment maintainability and portability.

AMD GPU Ecosystem Advances in AI Inference

The successful deployment of this solution on AMD GPUs further demonstrates the maturity of AMD's ecosystem in the AI inference space. ROCm (originally short for Radeon Open Compute) is AMD's open-source software platform for high-performance computing and AI workloads, essentially AMD's counterpart to NVIDIA's CUDA. ROCm includes the HIP (Heterogeneous-compute Interface for Portability) programming interface, which allows CUDA code to be ported to AMD GPUs at relatively low cost. In recent years, ROCm has been steadily closing the gap in stability and performance.

AMD data-center GPUs such as the MI300X, with 192 GB of HBM memory per card, offer a clear advantage in large model inference, particularly for use cases that require running very large models on a single card or a small number of cards. As the ROCm software stack continues to mature, a growing number of mainstream inference frameworks (including SGLang, vLLM, and others) support AMD GPUs natively, giving users a cost-effective alternative to NVIDIA.

Practical Value of PD-Disaggregated Architecture

Performance and Cost Advantages

The core benefits of PD-disaggregated architecture combined with multi-node deployment include:

Higher inference throughput: Prefill and Decode nodes each handle their own tasks, eliminating resource contention
Lower inference latency: Targeted optimization of execution efficiency for each stage
Elastic scalability: Prefill and Decode node counts can be scaled independently
GPU cost optimization: Different GPU instance types can be selected for different stages

Time To First Token (TTFT) and overall throughput are the two core metrics for LLM inference services, and they are often in tension. In traditional mixed deployments, Prefill and Decode work compete for the same GPUs: heavy Prefill computation stalls token generation for in-flight requests, and new prompts queue behind running batches, so TTFT fluctuates widely. PD-disaggregated architecture removes this interference through physical isolation: Prefill nodes continuously process new request inputs while Decode nodes focus exclusively on token generation for existing requests. This improves P99 TTFT (the 99th-percentile value), which is critical for user-facing real-time conversational applications.
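For readers unfamiliar with tail metrics, the sketch below shows one common way to compute P99 TTFT from a set of measurements, using the nearest-rank method. The sample values are made up for illustration and are not benchmark results. The `percentile` helper is a hypothetical name.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


# Made-up TTFT measurements in seconds: mostly fast, with two slow outliers.
ttft_samples = [0.2] * 98 + [1.5, 3.0]

print("P50 TTFT:", percentile(ttft_samples, 50), "s")
print("P99 TTFT:", percentile(ttft_samples, 99), "s")
```

In this made-up sample the median is 0.2 s while P99 is 1.5 s. A handful of requests delayed by interference barely moves the median but dominates the tail, which is why P99 is the figure to watch when comparing mixed and disaggregated deployments.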

Applicable Scenarios

This deployment approach is particularly well-suited for:

High-concurrency online LLM inference services
Real-time applications with strict TTFT requirements
RAG or document analysis scenarios involving long-context inputs
Enterprise-grade multi-tenant shared inference clusters

Conclusion

The solution provided by dstack.ai points to a sound direction for modern LLM inference deployment: combining architecture-level optimization (PD disaggregation) with infrastructure automation (single-file deployment) to run efficient multi-node inference clusters on AMD GPU hardware. Its value lies not only in the technical implementation itself, but in how it brings together three elements (the differing computational characteristics of Prefill and Decode, KV Cache migration optimization, and ROCm ecosystem maturity) into a complete, production-ready engineering solution. For teams evaluating inference infrastructure options and looking to run large model inference services on AMD GPUs, it is a useful reference.