llm-wiki wiki · entities 2026-06-12

A CNCF Sandbox distributed LLM inference serving stack, organized around Router/EPP, InferencePool, model servers, KV cache management, P/D disaggregation, autoscaling, batch, benchmarking, and a simulator. See src-llm-d-architecture for details.

Architecture boundaries

llm-d is a serving system, not an inference engine. llm-d-router provides the LLM-aware entry point, and llm-d-kv-cache is the library for KV-cache-aware routing and indexing. Surrounding projects fill in the rest: llm-d-batch-gateway, llm-d-benchmark, llm-d-workload-variant-autoscaler, and llm-d-inference-sim. Like aibrix, it is a K8s serving control plane, and like dynamo, it is a distributed serving system, but it follows the Gateway API / InferencePool standardization path more closely.
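The idea behind KV-cache-aware routing can be illustrated with a minimal sketch. This is not llm-d's actual scoring algorithm or API. It assumes a hypothetical scorer that splits the prompt into fixed-size token blocks, hashes them with chained hashes, counts how many leading blocks each endpoint already holds in its KV cache, and subtracts a penalty for queue depth. `BLOCK_SIZE`, `block_hashes`, `Endpoint`, `prefix_hits`, `pick_endpoint`, and `load_weight` are all illustrative names.

```python
from dataclasses import dataclass, field
from hashlib import sha256

BLOCK_SIZE = 4  # tokens per KV block (hypothetical value, kept small for the example)


def block_hashes(tokens: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Chained hashes of full token blocks; each hash identifies the whole prefix up to that block."""
    hashes: list[str] = []
    parent = ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        digest = sha256(f"{parent}|{','.join(map(str, block))}".encode()).hexdigest()
        hashes.append(digest)
        parent = digest
    return hashes


@dataclass
class Endpoint:
    """Hypothetical view of a model-server pod: cached block hashes plus queued requests."""
    name: str
    cached_blocks: set[str] = field(default_factory=set)
    queue_depth: int = 0


def prefix_hits(hashes: list[str], endpoint: Endpoint) -> int:
    """Number of leading prompt blocks already cached on the endpoint (stops at the first miss)."""
    hits = 0
    for h in hashes:
        if h not in endpoint.cached_blocks:
            break
        hits += 1
    return hits


def pick_endpoint(tokens: list[int], endpoints: list[Endpoint], load_weight: float = 0.5) -> Endpoint:
    """Choose the endpoint with the best cache-reuse score minus a queue-depth penalty."""
    hashes = block_hashes(tokens)

    def score(ep: Endpoint) -> float:
        return prefix_hits(hashes, ep) - load_weight * ep.queue_depth

    return max(endpoints, key=score)


if __name__ == "__main__":
    prompt = list(range(12))  # 3 full blocks
    warm = Endpoint("pod-a", set(block_hashes(list(range(8)))), queue_depth=1)
    cold = Endpoint("pod-b", queue_depth=0)
    print(pick_endpoint(prompt, [warm, cold]).name)
```

Chained hashing means a block hash stands for the entire prefix before it, so a match can only count contiguous leading blocks. The load penalty keeps the router from sending every request to the pod with the warmest cache.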

Selection guide

| Need | Reference |
| --- | --- |
| Gateway API / InferencePool routing | src-llm-d-router-architecture |
| KV-cache-aware routing / indexing | src-llm-d-kv-cache-architecture |
| Full K8s serving stack | src-llm-d-architecture |
| OpenAI Batch API / offline inference | src-llm-d-batch-gateway-architecture |
| Serving benchmarks / reproducing experiments | src-llm-d-benchmark-architecture |
| Multi-variant autoscaling | src-llm-d-workload-variant-autoscaler-architecture |
| GPU-free control-plane and routing validation | src-llm-d-inference-sim-architecture |