
NVIDIA Dynamo Architecture and Design Analysis

2026-05-15 · Source: dynamo-architecture-analysis.md

architecture · ai-infra · llm-inference · distributed-serving · kv-cache · kubernetes

Original: raw/dynamo-architecture-analysis.md · Repository: https://github.com/ai-dynamo/dynamo · Version analyzed: 1.2.0 (commit 7997117, 2026-05-15)

Positioning in one sentence

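Dynamo is NVIDIA's open-source, datacenter-scale orchestration layer for LLM inference. Using a Rust runtime, Python components, and a Go K8s Operator, it assembles inference engines such as SGLang, vLLM, and TensorRT-LLM into a coordinated cluster with disaggregated prefill/decode, KV-aware routing, a four-tier KV cache (KVBM), and SLA-driven autoscaling. It does not replace the inference engine; it turns a group of GPUs/nodes into "one coordinated inference system".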

Core architecture diagram

flowchart TB Client["Client (OpenAI API)"] subgraph FE["FRONTEND — Rust axum + Python wrapper"] F1["/v1/chat/completions
validate → preprocess → migration → route
lib/llm/src/http · components/"] end subgraph PLANES["Three planes (Request · Control · Storage/Events)"] direction LR Router["KV-Aware Router (lib/kv-router)
· Radix tree of block hashes / wkr
· cost = prefill_load + decode_cost
− overlap credits"] Runtime["DistributedRuntime (lib/runtime)
· Discovery trait
· Component registry
· HealthCheckManager
· Pipeline framework"] Planner["Planner — Python
components/.../planner
· Prometheus scrape
· Throughput + Load scaling laws
· Emits ScalingDecision → K8s Op"] end subgraph WORK["WORKER POOL — Python wrappers via PyO3"] direction LR PW["Prefill Worker
SGLang / vLLM / TRT-LLM
+ KVBM hooks"] DW["Decode Worker
SGLang / vLLM / TRT-LLM
+ KVBM hooks"] PW -- "KV transfer (NIXL / GDS)" --> DW PW <-- "KV-events (NATS JetStream)" --> DW end subgraph STORE["STORAGE / EVENTS PLANE"] direction TB KVBM["KVBM Tier Hierarchy (lib/kvbm-*)
G1 GPU device memory ── LRU
G2 CPU pinned ── LFU
G3 NVMe / SSD (NIXL)
G4 S3 / Azure / Object (NIXL)
Consolidator dedupe by SequenceHash (128-bit PLH)"] Bus["NATS JetStream + Object Store
· kv-events subject
· Radix tree snapshots"] end subgraph K8S["K8s Operator — Go (deploy/operator/)"] Op["DGDR (request) → DGD (graph) → DCD (per-component pods)
AIConfigurator profiles → Planner picks → Grove places (NVL72 aware)"] end Client -- "HTTP / SSE" --> F1 F1 --> Router Router -- "TCP / NATS Core" --> WORK Runtime -- "register / watch" --> WORK Planner -- "patch DGD" --> Op WORK <--> KVBM KVBM <--> Bus Op -- "topology-aware gang sched" --> WORK classDef fe fill:#1e3a8a,stroke:#1e40af,color:#fff classDef pl fill:#7c2d12,stroke:#9a3412,color:#fff classDef wk fill:#14532d,stroke:#166534,color:#fff classDef st fill:#5b21b6,stroke:#6d28d9,color:#fff classDef k8 fill:#831843,stroke:#9d174d,color:#fff class FE,F1 fe class PLANES,Router,Runtime,Planner pl class WORK,PW,DW wk class STORE,KVBM,Bus st class K8S,Op k8
Original ASCII diagram
                    ┌─────────────────────────────────────────────┐
                    │              Client (OpenAI API)            │
                    └────────────────────┬────────────────────────┘
                                         │ HTTP / SSE
┌────────────────────────────────────────▼─────────────────────────────────────┐
│ FRONTEND (Rust axum + Python wrapper)     lib/llm/src/http  +  components/  │
│   /v1/chat/completions  ─► validate ─► preprocess ─► migration ─► route     │
└────────────────────────────────────────┬─────────────────────────────────────┘
                                         │
            ┌────────────────────────────┼────────────────────────────┐
            │     REQUEST PLANE          │      CONTROL PLANE         │
            │   (TCP / NATS Core)        │   (etcd / K8s / file)      │
            │                            │                            │
            ▼                            ▼                            ▼
┌──────────────────────┐  ┌────────────────────────┐  ┌────────────────────────┐
│ KV-Aware Router      │  │ DistributedRuntime     │  │ Planner (Python)       │
│ lib/kv-router        │  │ lib/runtime            │  │ components/.../planner │
│ - Radix tree of      │  │ - Discovery trait      │  │ - Prometheus scrape    │
│   block hashes/wkr   │  │ - Component registry   │  │ - Throughput + Load    │
│ - Cost function:     │  │ - HealthCheckManager   │  │   scaling laws         │
│   prefill_load +     │  │ - Pipeline framework   │  │ - Emits ScalingDecision│
│   decode_cost -      │  │                        │  │   to K8s Operator      │
│   overlap credits    │  └─────────┬──────────────┘  └──────────┬─────────────┘
└──────────┬───────────┘            │                            │
           │                        │ register / watch           │ patch DGD
           ▼                        ▼                            ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKER POOL (Python wrappers around backends via PyO3)                      │
│                                                                             │
│   Prefill Worker        ────KV transfer (NIXL/GDS)────►   Decode Worker    │
│   ┌──────────────┐                                        ┌──────────────┐ │
│   │ SGLang/vLLM/ │      ◄────KV-events (NATS JetStream)─► │ SGLang/vLLM/ │ │
│   │ TRT-LLM      │                                        │ TRT-LLM      │ │
│   │ + KVBM hooks │                                        │ + KVBM hooks │ │
│   └──────┬───────┘                                        └──────┬───────┘ │
└──────────┼───────────────────────────────────────────────────────┼─────────┘
           │                                                       │
┌──────────▼───────────────────────────────────────────────────────▼─────────┐
│ STORAGE / EVENTS PLANE                                                      │
│                                                                             │
│   KVBM Tier Hierarchy (lib/kvbm-*)            NATS JetStream + Object Store │
│   G1: GPU device memory   ──LRU──┐            - kv-events subject           │
│   G2: CPU pinned          ──LFU──┤            - Radix tree snapshots        │
│   G3: NVMe/SSD (NIXL)            │                                          │
│   G4: S3/Azure/Object (NIXL)     │                                          │
│                                  │                                          │
│   Consolidator dedupes by         │                                          │
│   SequenceHash (128-bit PLH)      │                                          │
└─────────────────────────────────────────────────────────────────────────────┘
                                         ▲
                                         │ topology-aware gang sched
┌────────────────────────────────────────┴─────────────────────────────────────┐
│ K8s Operator (Go, deploy/operator/)                                          │
│   DGDR (request)  ─►  DGD (graph)  ─►  DCD (per-component pods)             │
│   AIConfigurator profiles ─► Planner picks ─► Grove places (NVL72 aware)    │
└──────────────────────────────────────────────────────────────────────────────┘

Module layers

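| Layer / module | Responsibility |
| --- | --- |
| HTTP frontend (Rust axum + Python wrapper) | OpenAI-compatible entry point, SSE streaming, aggregation, validation |
| Preprocessing pipeline | Tokenization, prompt templating, multimodal decoding, sampling-parameter normalization |
| Migration layer (migration.rs) | Automatically migrates in-flight requests from a failed worker to a new worker (RetryManager) |
| KV-aware Router | XXH3 hash → radix tree → cost-based softmax worker selection; multiple replicas synchronize via NATS JetStream |
| Distributed runtime (lib/runtime) | Runtime/Endpoint/Discovery/Transport abstractions; dual TCP + NATS planes |
| KVBM (KV Block Manager) | G1-G4 four-tier KV cache, NIXL zero-copy transfers, TinyLFU promotion/demotion, SequenceHash deduplication |
| Backend wrapper (Python) | SGLang/vLLM/TRT-LLM adapters; connected to the Rust runtime through PyO3 |
| Planner (SLA autoscaling) | Prometheus scrape + state machine + pushes decisions to the K8s operator |
| K8s control plane (Go + Grove + Gateway plugin) | CRDs: DGDR→DGD→DCD; topology-aware gang scheduling |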

Key constraints:

Key data flows

End-to-end request path (HTTP arrival → token streaming):

flowchart TD S1["[1] HTTP POST /v1/chat/completions"] S2["[2] axum Router → handler_chat_completions
lib/llm/src/http/service/openai.rs:2012"] S3["[3] Validate (openai.rs:1233-1254)
+ model / temperature / token defaults"] S4["[4] Preprocessor pipeline (lib/llm/src/preprocessor/)
TokenizeOperator → PromptFormattingOperator → SamplingOperator
NvCreateChatCompletionRequest → PreprocessedRequest"] S5["[5] Migration layer (lib/llm/src/migration.rs:115)
RetryManager wraps; on CannotConnect/Disconnected/
EngineShutdown (L189) replays Context::with_id(...)"] S6["[6] KV-aware route decision
· XXH3 PLH blocks (block_size=128, LoRA-aware)
· ConcurrentRadixTree per worker → prefix overlap
· cost = prefill_load_scale × adj_prefill + decode
· softmax(−cost) sample"] S7["[7] Dispatch via Request Plane
TCP (pooled) or NATS Core → worker.generate"] S8["[8] Worker (Python) calls backend engine
SGLang sgl.Engine.async_generate · vLLM AsyncLLMEngine.generate · TRT-LLM executor"] S9["[9] PrefillRouter picks prefill worker → runs prefill
disaggregated_params returned"] S10["[10] PrefillRouter picks decode worker
KV blocks via NIXL / GPUDirect-RDMA"] S11["[11] Decode generates tokens
· KVBM.OffloadManager computes SequenceHash per block
· publish to NATS 'kv-events' subject"] S12["[12] Tokens stream back through SSE
ChatCompletionAggregator collapses if stream=false"] S13["[13] Response to client (200, JSON or SSE)"] S1 --> S2 --> S3 --> S4 --> S5 --> S6 --> S7 --> S8 S8 -. "if disaggregated" .-> S9 S9 --> S10 --> S11 --> S12 --> S13 S8 -. "if monolithic" .-> S11
Original ASCII diagram
[1] HTTP POST /v1/chat/completions
        │
        ▼
[2] axum Router (lib/llm/src/http/service/openai.rs:2012)
    └─► handler_chat_completions
        │
        ▼
[3] Validate (openai.rs:1233-1254)
    + Apply model/temperature/token defaults
        │
        ▼
[4] Preprocessor pipeline (lib/llm/src/preprocessor/)
    TokenizeOperator → PromptFormattingOperator → SamplingOperator
    NvCreateChatCompletionRequest ──► PreprocessedRequest
        │
        ▼
[5] Migration layer (lib/llm/src/migration.rs:115)
    RetryManager wraps the call; on CannotConnect/Disconnected/
    EngineShutdown (line 189) replays with Context::with_id(...)
        │
        ▼
[6] KV-aware route decision
    ├─ Hash prompt → PLH blocks (XXH3, block_size=128, LoRA-aware)
    ├─ Query ConcurrentRadixTree per worker for prefix overlap
    └─ Selector logit (lib/kv-router/src/scheduling/selector.rs:161):
       cost = prefill_load_scale × adjusted_prefill_blocks + decode_cost_blocks
       softmax(−cost) sample
        │
        ▼
[7] Dispatch via Request Plane
    └─ TCP (pooled) or NATS Core → worker generate endpoint
        │
        ▼
[8] Worker (Python) calls backend engine
    ├─ SGLang: sgl.Engine.async_generate(...)
    ├─ vLLM:   AsyncLLMEngine.generate(...)
    └─ TRT-LLM: trtllm executor
        │
        ▼ [if disaggregated]
[9] PrefillRouter picks prefill worker → runs prefill
    └─ disaggregated_params returned
        │
        ▼
[10] PrefillRouter picks decode worker
     └─ KV blocks transferred via NIXL/GPUDirect-RDMA
        │
        ▼
[11] Decode generates tokens
     ├─ Each new block: KVBM.OffloadManager computes SequenceHash
     └─ publish to NATS "kv-events" subject
        │
        ▼
[12] Tokens stream back through SSE
     └─ ChatCompletionAggregator collapses if stream=false
        │
        ▼
[13] Response to client (status 200, JSON or SSE)

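Fault-tolerance path: if a worker dies at any stage in [8]–[11], RetryManager detects an is_migratable() error and resends the same PreprocessedRequest to a new worker. Migration is disabled for guided decoding and n>1 sampling, because their state machines cannot be replicated.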
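
The retry behaviour described above can be illustrated with a minimal Python sketch. It is not the code in migration.rs: `WorkerError`, `migration_allowed`, `generate_with_migration`, and the `max_migrations` limit are hypothetical names introduced for illustration, and workers are modelled as plain callables rather than streaming endpoints. Only the three error kinds and the two exclusions (guided decoding, n>1) come from the text.

```python
from dataclasses import dataclass


class WorkerError(Exception):
    """Hypothetical error type that carries a failure kind as a string."""

    def __init__(self, kind: str):
        super().__init__(kind)
        self.kind = kind


# Failure kinds the request-path diagram lists as triggers for a replay.
MIGRATABLE_KINDS = {"CannotConnect", "Disconnected", "EngineShutdown"}


def is_migratable(err: WorkerError) -> bool:
    return err.kind in MIGRATABLE_KINDS


@dataclass(frozen=True)
class PreprocessedRequest:
    request_id: str
    token_ids: tuple
    n: int = 1
    guided_decoding: bool = False


def migration_allowed(req: PreprocessedRequest) -> bool:
    # Guided decoding and n>1 sampling are excluded from migration.
    return req.n == 1 and not req.guided_decoding


def generate_with_migration(req, workers, max_migrations=3):
    """Try workers in order, replaying the same request on migratable failures."""
    failures = 0
    for worker in workers:
        try:
            return worker(req)
        except WorkerError as err:
            failures += 1
            can_retry = is_migratable(err) and migration_allowed(req)
            if not can_retry or failures > max_migrations:
                raise
    raise WorkerError("NoWorkersLeft")
```

In this sketch the request object is immutable and reused unchanged for every attempt, which mirrors the "same PreprocessedRequest" wording; a real streaming implementation would also have to decide what to do with tokens already sent to the client.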

Planner autoscaling loop:

flowchart TD Tick["Tick scheduler
load ~10s · throughput ~60s · "agg""] Gather["_gather_tick_input()
· Prometheus: TTFT, ITL, ISL, OSL, QPS
· FPM subscriber (ForwardPassMetrics)
· per-worker queue depth"] Tp["Throughput branch
predict next-win traffic × safety
→ replicas LB"] Ld["Load branch
estimate latency from queue + FPM
vs SLA threshold"] Cor["Correction factors
prefill_correction = actual_ttft / expected_ttft
decode_correction = actual_itl / expected_itl"] Decide["ScalingDecision(num_prefill, num_decode) | None
load > throughput → use Load branch"] Apply["_apply_effects() → K8s operator
→ patches DGD replicas"] Diag["Prometheus counters + JSON diagnostics"] Tick --> Gather --> Tp & Ld & Cor Tp --> Decide Ld --> Decide Cor --> Decide Decide --> Apply --> Diag
Original ASCII diagram
            ┌──────────────────────────────────────────────────────┐
            │  Tick scheduler (load ~10s, throughput ~60s, "agg")  │
            └────────────────────────┬─────────────────────────────┘
                                     │
                ┌────────────────────▼────────────────────┐
                │  _gather_tick_input()                   │
                │  - Prometheus: TTFT, ITL, ISL, OSL, QPS │
                │  - FPM subscriber (ForwardPassMetrics)  │
                │  - per-worker queue depth               │
                └────────────────────┬────────────────────┘
                                     │
        ┌────────────────────────────┼────────────────────────────┐
        │                            │                            │
        ▼                            ▼                            ▼
┌──────────────────┐   ┌──────────────────────┐   ┌──────────────────────┐
│ Throughput branch│   │  Load branch         │   │ Correction factors:  │
│ predict next-win │   │  estimate latency    │   │ prefill_correction = │
│ traffic × safety │   │  from queue + FPM    │   │   actual_ttft /      │
│ → replicas LB    │   │  vs SLA threshold    │   │   expected_ttft      │
└────────┬─────────┘   └──────────┬───────────┘   │ decode_correction =  │
         │                        │               │   actual_itl /       │
         └──── load > throughput ─┤               │   expected_itl       │
                                  │               └──────────────────────┘
                                  ▼
                   ScalingDecision(num_prefill, num_decode) | None
                                  │
                                  ▼
                    _apply_effects() ──► K8s operator
                                       ──► patches DGD replicas
                                  │
                                  ▼
                   Prometheus counters + JSON diagnostics
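
To make the decision step of the loop concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the Planner's code: `correction`, `throughput_replicas`, and `decide` are hypothetical helpers, the capacity model (predicted load with a safety margin divided by corrected per-replica capacity) is an assumed scaling law, and "load > throughput → use Load branch" is read here as "take the larger of the two replica counts". Only the correction-factor ratios and the `ScalingDecision(num_prefill, num_decode) | None` shape come from the diagram.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScalingDecision:
    num_prefill: int
    num_decode: int


def correction(actual: float, expected: float) -> float:
    """actual / expected, as in the diagram; 1.0 if no expectation is available."""
    return actual / expected if expected > 0 else 1.0


def throughput_replicas(predicted_load: float, per_replica_capacity: float,
                        correction_factor: float = 1.0, safety: float = 1.2) -> int:
    """Assumed scaling law: a replica lower bound from predicted traffic.

    A correction factor above 1 means workers are slower than profiled,
    so the effective capacity of each replica shrinks.
    """
    effective_capacity = per_replica_capacity / max(correction_factor, 1e-9)
    return max(1, math.ceil(predicted_load * safety / effective_capacity))


def decide(current: ScalingDecision,
           tp_prefill: int, tp_decode: int,
           load_prefill: int, load_decode: int) -> Optional[ScalingDecision]:
    """Combine both branches; return None when nothing needs to change."""
    # Assumed reading: the load branch wins whenever it asks for more replicas.
    target = ScalingDecision(max(tp_prefill, load_prefill),
                             max(tp_decode, load_decode))
    return None if target == current else target
```

Returning `None` for "no change" keeps the apply step trivial: the caller only patches the DGD replica counts when a decision object is actually produced.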

KV-aware Router cost function:

adjusted_prefill_blocks = max(
    prefill_blocks
    - overlap_score_credit * device_overlap_blocks
    - host_cache_hit_weight * host_overlap_blocks
    - disk_cache_hit_weight * disk_overlap_blocks
    - shared_cache_multiplier * shared_beyond_blocks,
    0,
)
cost = prefill_load_scale * adjusted_prefill_blocks + decode_blocks
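
The formula above, together with the `softmax(−cost)` sampling step from the request-path diagram, can be sketched in runnable Python as follows. This is an illustration, not the selector in lib/kv-router: the default weight values and the `temperature` parameter are placeholders assumed for the example, and `softmax_probs` and `pick_worker` are hypothetical helper names.

```python
import math
import random


def adjusted_prefill_blocks(prefill_blocks, device_overlap_blocks=0,
                            host_overlap_blocks=0, disk_overlap_blocks=0,
                            shared_beyond_blocks=0, *,
                            overlap_score_credit=1.0,      # placeholder value
                            host_cache_hit_weight=0.5,     # placeholder value
                            disk_cache_hit_weight=0.25,    # placeholder value
                            shared_cache_multiplier=0.0):  # placeholder value
    """Prefill work left after crediting cached blocks, clamped at zero."""
    remaining = (prefill_blocks
                 - overlap_score_credit * device_overlap_blocks
                 - host_cache_hit_weight * host_overlap_blocks
                 - disk_cache_hit_weight * disk_overlap_blocks
                 - shared_cache_multiplier * shared_beyond_blocks)
    return max(remaining, 0.0)


def worker_cost(adjusted_prefill, decode_blocks, prefill_load_scale=1.0):
    return prefill_load_scale * adjusted_prefill + decode_blocks


def softmax_probs(costs, temperature=1.0):
    """softmax(-cost / temperature), shifted by the max logit for stability."""
    logits = [-c / temperature for c in costs]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_worker(costs, temperature=1.0, rng=random):
    """Sample a worker index; cheaper workers are proportionally more likely."""
    probs = softmax_probs(costs, temperature)
    return rng.choices(range(len(costs)), weights=probs, k=1)[0]
```

Sampling from the softmax rather than always taking the minimum-cost worker spreads load across workers with similar costs, while a worker holding a long cached prefix still wins most of the time.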

Design decisions and philosophy

KVBM four-tier hierarchy (core component deep dive)

flowchart TD G1["G1 — GPU Memory (lib/kvbm-engine)
Fastest, smallest capacity
Active compute blocks"] G2["G2 — CPU / Host Memory (lib/kvbm-logical pools)
Pinned DRAM staging
µs-latency RDMA ready (lib/kvbm-physical)"] G3["G3 — NVMe / SSD (lib/kvbm-physical/storage)
Persistent warm cache
ms-latency disk ops"] G4["G4 — Object Storage (lib/llm/block_manager/storage/object.rs)
S3 / MinIO / Azure Blob
Unlimited capacity, seconds+ latency"] G1 -- "Offload G1→G2 (LRU pop_lru)" --> G2 G2 -- "Offload G2→G3 (TinyLFU + presence filter)" --> G3 G3 -- "Offload G3→G4 (NIXL OBJ backend)" --> G4 classDef hot fill:#7c2d12,stroke:#9a3412,color:#fff classDef warm fill:#5b21b6,stroke:#6d28d9,color:#fff classDef cold fill:#1e3a8a,stroke:#1e40af,color:#fff classDef cool fill:#14532d,stroke:#166534,color:#fff class G1 hot class G2 warm class G3 cold class G4 cool
Original ASCII diagram
┌─────────────────────────────────────────────────────┐
│ GPU Memory (G1)                    lib/kvbm-engine │
│ - Fastest, smallest capacity                        │
│ - Active compute blocks                             │
└──────────────┬──────────────────────────────────────┘
               │ Offload G1→G2 (LRU pop_lru)
               ↓
┌─────────────────────────────────────────────────────┐
│ CPU/Host Memory (G2)     lib/kvbm-logical (pools)  │
│ - Pinned DRAM staging                               │
│ - µs-latency RDMA ready  lib/kvbm-physical         │
└──────────────┬──────────────────────────────────────┘
               │ Offload G2→G3 (TinyLFU + presence filter)
               ↓
┌─────────────────────────────────────────────────────┐
│ NVMe/SSD (G3)           lib/kvbm-physical/storage  │
│ - Persistent warm cache                             │
│ - ms-latency disk ops                               │
└──────────────┬──────────────────────────────────────┘
               │ Offload G3→G4 (NIXL OBJ backend)
               ↓
┌─────────────────────────────────────────────────────┐
│ Object Storage (G4)      lib/llm/block_manager/    │
│ - S3/MinIO/Azure Blob    storage/object.rs         │
│ - Unlimited capacity, seconds+ latency              │
└─────────────────────────────────────────────────────┘
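
The offload chain in the diagram can be mimicked with a small Python sketch. It is deliberately simplified and is not how KVBM is implemented: every tier here uses plain LRU eviction (the diagram names LRU only for G1→G2 and TinyLFU with a presence filter for G2→G3), blocks are ordinary Python objects rather than GPU/host/disk memory, and `TieredKVCache` with its methods is a hypothetical class introduced for illustration.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy G1..G4 hierarchy: overflow in one tier is demoted to the next."""

    def __init__(self, capacities):
        # capacities: max blocks per tier, ordered G1..G4; None means unbounded.
        self.capacities = list(capacities)
        self.tiers = [OrderedDict() for _ in self.capacities]

    def put(self, seq_hash, block, tier=0):
        store = self.tiers[tier]
        store[seq_hash] = block
        store.move_to_end(seq_hash)  # mark as most recently used
        cap = self.capacities[tier]
        while cap is not None and len(store) > cap:
            victim_hash, victim = store.popitem(last=False)  # least recently used
            if tier + 1 < len(self.tiers):
                self.put(victim_hash, victim, tier + 1)  # demote one tier down
            # Blocks evicted from the last tier are simply dropped.

    def get(self, seq_hash):
        """Return (block, tier_found) and restore the block to G1, or (None, None)."""
        for tier, store in enumerate(self.tiers):
            if seq_hash in store:
                block = store.pop(seq_hash)
                self.put(seq_hash, block, 0)
                return block, tier
        return None, None

    def tier_of(self, seq_hash):
        for tier, store in enumerate(self.tiers):
            if seq_hash in store:
                return tier
        return None
```

For example, with `TieredKVCache([2, 2, 2, None])`, inserting three blocks pushes the oldest one from G1 into G2, and reading it back restores it to G1 while demoting whichever G1 block was used least recently.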

The block lifecycle has 8 stages: Allocate → Fill → Schedule → Compute → Hash (128-bit SequenceHash) → Register → Consolidate (deduplicate by SequenceHash) → Evict/Restore (weak ref demotion).
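
The Hash, Register, and Consolidate stages can be illustrated with a short Python sketch of chained block hashing. Several things here are assumptions rather than facts about Dynamo: BLAKE2b with a 16-byte digest stands in for the XXH3-based hash so the example needs only the standard library, the block size is shrunk from 128 to 4 tokens for readability, the `salt` argument is one hypothetical way to make hashes LoRA-aware, and `sequence_hashes`, `BlockRegistry`, and `overlap_blocks` are illustrative names.

```python
import hashlib

BLOCK_SIZE = 4  # the source uses block_size=128; shrunk here for readability


def sequence_hashes(token_ids, block_size=BLOCK_SIZE, salt=b""):
    """Return one 128-bit chained hash (hex) per full block of tokens.

    Each hash covers the parent hash plus the block's tokens, so two
    sequences share a hash only if their entire prefix is identical.
    """
    hashes = []
    parent = b"\x00" * 16
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        h = hashlib.blake2b(digest_size=16)  # stand-in for XXH3
        h.update(salt)    # hypothetical: e.g. a LoRA adapter identifier
        h.update(parent)
        for tok in token_ids[start:start + block_size]:
            h.update(int(tok).to_bytes(4, "little"))
        parent = h.digest()
        hashes.append(parent.hex())
    return hashes


class BlockRegistry:
    """Deduplicating store keyed by sequence hash."""

    def __init__(self):
        self._blocks = {}

    def register(self, seq_hash, payload):
        """Return True if the block was stored, False if it was a duplicate."""
        if seq_hash in self._blocks:
            return False
        self._blocks[seq_hash] = payload
        return True

    def __len__(self):
        return len(self._blocks)


def overlap_blocks(hashes_a, hashes_b):
    """Number of leading blocks two hash chains have in common."""
    count = 0
    for a, b in zip(hashes_a, hashes_b):
        if a != b:
            break
        count += 1
    return count
```

Because the hashes are chained, comparing two hash lists block by block yields the shared-prefix length directly, which is the kind of overlap count a KV-aware router needs when it credits cached blocks.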

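NIXL wraps GPUDirect RDMA, NVMe-oF, and object storage in a unified MemType, so the transfer code paths for G1↔G2↔G3↔G4 all have the same shape. For details, see the "Key component deep dive / KVBM" section of the raw source.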

Related pages