NVIDIA Dynamo 架构与设计思路分析

2026-05-15 · 来源：dynamo-architecture-analysis.md

architectureai-infrallm-inferencedistributed-servingkv-cachekubernetes

原文：raw/dynamo-architecture-analysis.md · 仓库：https://github.com/ai-dynamo/dynamo · 分析版本 1.2.0（commit 7997117，2026-05-15）

一句话定位

dynamo 是 NVIDIA 开源的数据中心级 LLM 推理编排层：用 Rust 运行时 + Python 组件 + Go K8s Operator，把 SGLang / vLLM / TensorRT-LLM 等推理引擎拼成具备分离式 prefill/decode、KV 感知路由、四级 KV 缓存（KVBM）和 SLA 自动扩缩容的协调集群。它不替代推理引擎，而是让一群 GPU/Node 变成"一个协调的推理系统"。

核心架构图

flowchart TB Client["Client (OpenAI API)"] subgraph FE["FRONTEND — Rust axum + Python wrapper"] F1["/v1/chat/completions
validate → preprocess → migration → route
lib/llm/src/http · components/"] end subgraph PLANES["三平面 (Request · Control · Storage/Events)"] direction LR Router["KV-Aware Router (lib/kv-router)
· Radix tree of block hashes / wkr
· cost = prefill_load + decode_cost
− overlap credits"] Runtime["DistributedRuntime (lib/runtime)
· Discovery trait
· Component registry
· HealthCheckManager
· Pipeline framework"] Planner["Planner — Python
components/.../planner
· Prometheus scrape
· Throughput + Load scaling laws
· Emits ScalingDecision → K8s Op"] end subgraph WORK["WORKER POOL — Python wrappers via PyO3"] direction LR PW["Prefill Worker
SGLang / vLLM / TRT-LLM
+ KVBM hooks"] DW["Decode Worker
SGLang / vLLM / TRT-LLM
+ KVBM hooks"] PW -- "KV transfer (NIXL / GDS)" --> DW PW <-- "KV-events (NATS JetStream)" --> DW end subgraph STORE["STORAGE / EVENTS PLANE"] direction TB KVBM["KVBM Tier Hierarchy (lib/kvbm-*)
G1 GPU device memory ── LRU
G2 CPU pinned ── LFU
G3 NVMe / SSD (NIXL)
G4 S3 / Azure / Object (NIXL)
Consolidator dedupe by SequenceHash (128-bit PLH)"] Bus["NATS JetStream + Object Store
· kv-events subject
· Radix tree snapshots"] end subgraph K8S["K8s Operator — Go (deploy/operator/)"] Op["DGDR (request) → DGD (graph) → DCD (per-component pods)
AIConfigurator profiles → Planner picks → Grove places (NVL72 aware)"] end Client -- "HTTP / SSE" --> F1 F1 --> Router Router -- "TCP / NATS Core" --> WORK Runtime -- "register / watch" --> WORK Planner -- "patch DGD" --> Op WORK <--> KVBM KVBM <--> Bus Op -- "topology-aware gang sched" --> WORK classDef fe fill:#1e3a8a,stroke:#1e40af,color:#fff classDef pl fill:#7c2d12,stroke:#9a3412,color:#fff classDef wk fill:#14532d,stroke:#166534,color:#fff classDef st fill:#5b21b6,stroke:#6d28d9,color:#fff classDef k8 fill:#831843,stroke:#9d174d,color:#fff class FE,F1 fe class PLANES,Router,Runtime,Planner pl class WORK,PW,DW wk class STORE,KVBM,Bus st class K8S,Op k8

原 ASCII 图

                    ┌─────────────────────────────────────────────┐
                    │              Client (OpenAI API)            │
                    └────────────────────┬────────────────────────┘
                                         │ HTTP / SSE
┌────────────────────────────────────────▼─────────────────────────────────────┐
│ FRONTEND (Rust axum + Python wrapper)     lib/llm/src/http  +  components/  │
│   /v1/chat/completions  ─► validate ─► preprocess ─► migration ─► route     │
└────────────────────────────────────────┬─────────────────────────────────────┘
                                         │
            ┌────────────────────────────┼────────────────────────────┐
            │     REQUEST PLANE          │      CONTROL PLANE         │
            │   (TCP / NATS Core)        │   (etcd / K8s / file)      │
            │                            │                            │
            ▼                            ▼                            ▼
┌──────────────────────┐  ┌────────────────────────┐  ┌────────────────────────┐
│ KV-Aware Router      │  │ DistributedRuntime     │  │ Planner (Python)       │
│ lib/kv-router        │  │ lib/runtime            │  │ components/.../planner │
│ - Radix tree of      │  │ - Discovery trait      │  │ - Prometheus scrape    │
│   block hashes/wkr   │  │ - Component registry   │  │ - Throughput + Load    │
│ - Cost function:     │  │ - HealthCheckManager   │  │   scaling laws         │
│   prefill_load +     │  │ - Pipeline framework   │  │ - Emits ScalingDecision│
│   decode_cost -      │  │                        │  │   to K8s Operator      │
│   overlap credits    │  └─────────┬──────────────┘  └──────────┬─────────────┘
└──────────┬───────────┘            │                            │
           │                        │ register / watch           │ patch DGD
           ▼                        ▼                            ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKER POOL (Python wrappers around backends via PyO3)                      │
│                                                                             │
│   Prefill Worker        ────KV transfer (NIXL/GDS)────►   Decode Worker    │
│   ┌──────────────┐                                        ┌──────────────┐ │
│   │ SGLang/vLLM/ │      ◄────KV-events (NATS JetStream)─► │ SGLang/vLLM/ │ │
│   │ TRT-LLM      │                                        │ TRT-LLM      │ │
│   │ + KVBM hooks │                                        │ + KVBM hooks │ │
│   └──────┬───────┘                                        └──────┬───────┘ │
└──────────┼───────────────────────────────────────────────────────┼─────────┘
           │                                                       │
┌──────────▼───────────────────────────────────────────────────────▼─────────┐
│ STORAGE / EVENTS PLANE                                                      │
│                                                                             │
│   KVBM Tier Hierarchy (lib/kvbm-*)            NATS JetStream + Object Store │
│   G1: GPU device memory   ──LRU──┐            - kv-events subject           │
│   G2: CPU pinned          ──LFU──┤            - Radix tree snapshots        │
│   G3: NVMe/SSD (NIXL)            │                                          │
│   G4: S3/Azure/Object (NIXL)     │                                          │
│                                  │                                          │
│   Consolidator dedupes by         │                                          │
│   SequenceHash (128-bit PLH)      │                                          │
└─────────────────────────────────────────────────────────────────────────────┘
                                         ▲
                                         │ topology-aware gang sched
┌────────────────────────────────────────┴─────────────────────────────────────┐
│ K8s Operator (Go, deploy/operator/)                                          │
│   DGDR (request)  ─►  DGD (graph)  ─►  DCD (per-component pods)             │
│   AIConfigurator profiles ─► Planner picks ─► Grove places (NVL72 aware)    │
└──────────────────────────────────────────────────────────────────────────────┘

模块分层

层 / 模块	职责
HTTP 前端（Rust axum + Python wrapper）	OpenAI 兼容入口、SSE 流式、聚合、validation
预处理流水线	Tokenize、prompt template、多模态 decode、采样归一化
迁移层（`migration.rs`）	失败 worker 在飞请求自动迁移到新 worker（RetryManager）
KV-aware Router	XXH3 hash → radix tree → cost-based softmax 选 worker；多副本经 NATS JetStream 同步
分布式运行时（`lib/runtime`）	Runtime/Endpoint/Discovery/Transport 抽象；TCP+NATS 双平面
KVBM（KV Block Manager）	G1-G4 四级 KV 缓存、NIXL 零拷贝、TinyLFU 升降级、SequenceHash 去重
Backend wrapper（Python）	sglang/vllm/TRT-LLM 适配；通过 PyO3 接入 Rust 运行时
Planner（SLA 自动扩缩）	Prometheus scrape + state machine + 推 K8s operator
K8s 控制面（Go + Grove + Gateway plugin）	CRD: DGDR→DGD→DCD；拓扑感知 gang scheduling

关键约束：

性能敏感路径（HTTP、tokenize、路由、KVBM）全部在 Rust；backend 适配薄到只剩 "engine.generate + publish KV events"。
K8s operator 不做 placement，把拓扑感知外包给 Grove。
控制平面（discovery）和事件平面（KV events）默认分离：file/mem 用 ZMQ，etcd/K8s 用 NATS。

关键数据流

端到端请求路径（HTTP arrival → token streaming）：

flowchart TD S1["[1] HTTP POST /v1/chat/completions"] S2["[2] axum Router → handler_chat_completions
lib/llm/src/http/service/openai.rs:2012"] S3["[3] Validate (openai.rs:1233-1254)
+ model / temperature / token defaults"] S4["[4] Preprocessor pipeline (lib/llm/src/preprocessor/)
TokenizeOperator → PromptFormattingOperator → SamplingOperator
NvCreateChatCompletionRequest → PreprocessedRequest"] S5["[5] Migration layer (lib/llm/src/migration.rs:115)
RetryManager wraps; on CannotConnect/Disconnected/
EngineShutdown (L189) replays Context::with_id(...)"] S6["[6] KV-aware route decision
· XXH3 PLH blocks (block_size=128, LoRA-aware)
· ConcurrentRadixTree per worker → prefix overlap
· cost = prefill_load_scale × adj_prefill + decode
· softmax(−cost) sample"] S7["[7] Dispatch via Request Plane
TCP (pooled) or NATS Core → worker.generate"] S8["[8] Worker (Python) calls backend engine
SGLang sgl.Engine.async_generate · vLLM AsyncLLMEngine.generate · TRT-LLM executor"] S9["[9] PrefillRouter picks prefill worker → runs prefill
disaggregated_params returned"] S10["[10] PrefillRouter picks decode worker
KV blocks via NIXL / GPUDirect-RDMA"] S11["[11] Decode generates tokens
· KVBM.OffloadManager computes SequenceHash per block
· publish to NATS 'kv-events' subject"] S12["[12] Tokens stream back through SSE
ChatCompletionAggregator collapses if stream=false"] S13["[13] Response to client (200, JSON or SSE)"] S1 --> S2 --> S3 --> S4 --> S5 --> S6 --> S7 --> S8 S8 -. "if disaggregated" .-> S9 S9 --> S10 --> S11 --> S12 --> S13 S8 -. "if monolithic" .-> S11

原 ASCII 图

[1] HTTP POST /v1/chat/completions
        │
        ▼
[2] axum Router (lib/llm/src/http/service/openai.rs:2012)
    └─► handler_chat_completions
        │
        ▼
[3] Validate (openai.rs:1233-1254)
    + Apply model/temperature/token defaults
        │
        ▼
[4] Preprocessor pipeline (lib/llm/src/preprocessor/)
    TokenizeOperator → PromptFormattingOperator → SamplingOperator
    NvCreateChatCompletionRequest ──► PreprocessedRequest
        │
        ▼
[5] Migration layer (lib/llm/src/migration.rs:115)
    RetryManager wraps the call; on CannotConnect/Disconnected/
    EngineShutdown (line 189) replays with Context::with_id(...)
        │
        ▼
[6] KV-aware route decision
    ├─ Hash prompt → PLH blocks (XXH3, block_size=128, LoRA-aware)
    ├─ Query ConcurrentRadixTree per worker for prefix overlap
    └─ Selector logit (lib/kv-router/src/scheduling/selector.rs:161):
       cost = prefill_load_scale × adjusted_prefill_blocks + decode_cost_blocks
       softmax(−cost) sample
        │
        ▼
[7] Dispatch via Request Plane
    └─ TCP (pooled) or NATS Core → worker generate endpoint
        │
        ▼
[8] Worker (Python) calls backend engine
    ├─ SGLang: sgl.Engine.async_generate(...)
    ├─ vLLM:   AsyncLLMEngine.generate(...)
    └─ TRT-LLM: trtllm executor
        │
        ▼ [if disaggregated]
[9] PrefillRouter picks prefill worker → runs prefill
    └─ disaggregated_params returned
        │
        ▼
[10] PrefillRouter picks decode worker
     └─ KV blocks transferred via NIXL/GPUDirect-RDMA
        │
        ▼
[11] Decode generates tokens
     ├─ Each new block: KVBM.OffloadManager computes SequenceHash
     └─ publish to NATS "kv-events" subject
        │
        ▼
[12] Tokens stream back through SSE
     └─ ChatCompletionAggregator collapses if stream=false
        │
        ▼
[13] Response to client (status 200, JSON or SSE)

容错路径： worker 在 [8]–[11] 任意阶段挂掉 → RetryManager 检测 is_migratable() 错误 → 用同一个 PreprocessedRequest 重发到新 worker。guided decoding 和 n>1 sampling 因状态机不可复制而禁用迁移。

Planner 自动扩缩容循环：

flowchart TD Tick["Tick scheduler
load ~10s · throughput ~60s · "agg""] Gather["_gather_tick_input()
· Prometheus: TTFT, ITL, ISL, OSL, QPS
· FPM subscriber (ForwardPassMetrics)
· per-worker queue depth"] Tp["Throughput branch
predict next-win traffic × safety
→ replicas LB"] Ld["Load branch
estimate latency from queue + FPM
vs SLA threshold"] Cor["Correction factors
prefill_correction = actual_ttft / expected_ttft
decode_correction = actual_itl / expected_itl"] Decide["ScalingDecision(num_prefill, num_decode) | None
load > throughput → use Load branch"] Apply["_apply_effects() → K8s operator
→ patches DGD replicas"] Diag["Prometheus counters + JSON diagnostics"] Tick --> Gather --> Tp & Ld & Cor Tp --> Decide Ld --> Decide Cor --> Decide Decide --> Apply --> Diag

原 ASCII 图

            ┌──────────────────────────────────────────────────────┐
            │  Tick scheduler (load ~10s, throughput ~60s, "agg")  │
            └────────────────────────┬─────────────────────────────┘
                                     │
                ┌────────────────────▼────────────────────┐
                │  _gather_tick_input()                   │
                │  - Prometheus: TTFT, ITL, ISL, OSL, QPS │
                │  - FPM subscriber (ForwardPassMetrics)  │
                │  - per-worker queue depth               │
                └────────────────────┬────────────────────┘
                                     │
        ┌────────────────────────────┼────────────────────────────┐
        │                            │                            │
        ▼                            ▼                            ▼
┌──────────────────┐   ┌──────────────────────┐   ┌──────────────────────┐
│ Throughput branch│   │  Load branch         │   │ Correction factors:  │
│ predict next-win │   │  estimate latency    │   │ prefill_correction = │
│ traffic × safety │   │  from queue + FPM    │   │   actual_ttft /      │
│ → replicas LB    │   │  vs SLA threshold    │   │   expected_ttft      │
└────────┬─────────┘   └──────────┬───────────┘   │ decode_correction =  │
         │                        │               │   actual_itl /       │
         └──── load > throughput ─┤               │   expected_itl       │
                                  │               └──────────────────────┘
                                  ▼
                   ScalingDecision(num_prefill, num_decode) | None
                                  │
                                  ▼
                    _apply_effects() ──► K8s operator
                                       ──► patches DGD replicas
                                  │
                                  ▼
                   Prometheus counters + JSON diagnostics

KV-aware Router cost function：

adjusted_prefill_blocks = max(
    prefill_blocks
    - overlap_score_credit * device_overlap_blocks
    - host_cache_hit_weight * host_overlap_blocks
    - disk_cache_hit_weight * disk_overlap_blocks
    - shared_cache_multiplier * shared_beyond_blocks,
    0,
)
cost = prefill_load_scale * adjusted_prefill_blocks + decode_blocks

设计决策与哲学

三平面架构解耦：请求平面（低延迟，TCP/NATS Core）、控制平面（desired-state，etcd/K8s/file）、存储+事件平面（KV 可见性，NATS JetStream + Object Store）三者独立演进。事件平面持久化保证 router 副本重启后能 replay。
Rust 内核 + Python 适配 + Go 控制器三语言协作：性能敏感路径全在 Rust（1000 个 .rs 文件）；backend 适配做到薄薄一层 Python（896 个 .py）；K8s CRD 控制循环用 Go（258 个 .go）。
请求迁移是默认能力：RetryManager 把 worker 死亡变成对客户端透明的事件。这是 Dynamo 区别于纯推理引擎（vllm、sglang）的根本特征——它把"集群"当作一等公民。
KV 块的全局身份 = SequenceHash：128-bit PositionalLineageHash（XXH3, seed=1337，混入 LoRA id）让一个 KV 块在 GPU/CPU/SSD/远端 + 多 worker 之间有同一个名字。Consolidator 去重、router 前缀匹配、KVBM 升降级都用它。
KVBM 四级层次（G1-G4）与 NIXL 统一传输：G1=GPU device、G2=CPU pinned、G3=NVMe/SSD、G4=S3/Azure。LRU 管 G1→G2，TinyLFU + presence filter 管 G2→G3。所有层都包装成 NIXL MemType，使得不同 tier 间的代码路径几乎同形。参见 kv-cache-offload。
KV-aware 路由 ≠ 最大化命中率：成本函数同时惩罚 prefix overlap 不足和当前 worker 负载。多 router 副本通过 NATS JetStream 同步活跃块视图。softmax(−cost) 采样而非 argmin，留出概率分散负载。
AIConfigurator → Planner → Operator 三段式 SLA 闭环：AIConfigurator 离线扫 10K+ TP/EP/DEP 配置选 Pareto 前沿 → Planner 在线决策扩缩 → Operator 物化 K8s 资源。这是 1.0 "zero-config DGDR" 的实现基础。
拓扑感知外包给 Grove：Operator 不做 NVL72/rack/host placement，把 component group 翻译成 Grove PodCliqueSet + PodCliqueScalingGroup 交给外部 scheduler，只读 Grove condition 反传 DGD status。
Discovery backend 是 trait，不是硬编码：KVStoreDiscovery（etcd/file/memory）/KubeDiscoveryClient（EndpointSlice）/MockDiscovery 三种实现可热切。模型注册有身份冲突检测——不同模型注册同一 endpoint 会被拒绝（LoRA adapter 除外）。
canary 健康检查 + 自愈再注册：每个 endpoint 独立 health check task，超时则标 NotReady 并触发 re-register，K8s 拉起新 pod 后立即恢复路由。

KVBM 四级层次（核心组件深入）

flowchart TD G1["G1 — GPU Memory (lib/kvbm-engine)
Fastest, smallest capacity
Active compute blocks"] G2["G2 — CPU / Host Memory (lib/kvbm-logical pools)
Pinned DRAM staging
µs-latency RDMA ready (lib/kvbm-physical)"] G3["G3 — NVMe / SSD (lib/kvbm-physical/storage)
Persistent warm cache
ms-latency disk ops"] G4["G4 — Object Storage (lib/llm/block_manager/storage/object.rs)
S3 / MinIO / Azure Blob
Unlimited capacity, seconds+ latency"] G1 -- "Offload G1→G2 (LRU pop_lru)" --> G2 G2 -- "Offload G2→G3 (TinyLFU + presence filter)" --> G3 G3 -- "Offload G3→G4 (NIXL OBJ backend)" --> G4 classDef hot fill:#7c2d12,stroke:#9a3412,color:#fff classDef warm fill:#5b21b6,stroke:#6d28d9,color:#fff classDef cold fill:#1e3a8a,stroke:#1e40af,color:#fff classDef cool fill:#14532d,stroke:#166534,color:#fff class G1 hot class G2 warm class G3 cold class G4 cool

原 ASCII 图

┌─────────────────────────────────────────────────────┐
│ GPU Memory (G1)                    lib/kvbm-engine │
│ - Fastest, smallest capacity                        │
│ - Active compute blocks                             │
└──────────────┬──────────────────────────────────────┘
               │ Offload G1→G2 (LRU pop_lru)
               ↓
┌─────────────────────────────────────────────────────┐
│ CPU/Host Memory (G2)     lib/kvbm-logical (pools)  │
│ - Pinned DRAM staging                               │
│ - µs-latency RDMA ready  lib/kvbm-physical         │
└──────────────┬──────────────────────────────────────┘
               │ Offload G2→G3 (TinyLFU + presence filter)
               ↓
┌─────────────────────────────────────────────────────┐
│ NVMe/SSD (G3)           lib/kvbm-physical/storage  │
│ - Persistent warm cache                             │
│ - ms-latency disk ops                               │
└──────────────┬──────────────────────────────────────┘
               │ Offload G3→G4 (NIXL OBJ backend)
               ↓
┌─────────────────────────────────────────────────────┐
│ Object Storage (G4)      lib/llm/block_manager/    │
│ - S3/MinIO/Azure Blob    storage/object.rs         │
│ - Unlimited capacity, seconds+ latency              │
└─────────────────────────────────────────────────────┘

块生命周期 8 阶段：Allocate → Fill → Schedule → Compute → Hash（128-bit SequenceHash）→ Register → Consolidate（按 SequenceHash 去重）→ Evict/Restore（weak ref demotion）。

NIXL 把 GPUDirect RDMA、NVMe-oF、对象存储都包装成统一 MemType，G1↔G2↔G3↔G4 的传输代码路径同形。详见 raw 中的"关键组件深入解读 / KVBM" 节。

NVIDIA Dynamo 架构与设计思路分析

一句话定位

核心架构图

模块分层

关键数据流

设计决策与哲学

KVBM 四级层次（核心组件深入）

相关页面

相关页面（来自 frontmatter）