SGLang 架构与设计思路分析

2026-05-15 · 来源：sglang-architecture-analysis.md

architecturellm-inferenceai-infrakv-cachespeculative-decoding

原文：raw/sglang-architecture-analysis.md · 仓库：https://github.com/sgl-project/sglang · 分析版本：HEAD 50f4058（main，2026-05-13）· 范围：python/sglang/

一句话定位

sglang 是面向大规模 LLM 推理的高性能引擎：4 进程异步流水线（HTTP / TokenizerManager / Scheduler / DetokenizerManager）把 CPU 分词、GPU 推理、CPU 反分词彻底解耦；radix-attention 把 KV 缓存复用从 vllm 的 16-token block 粒度做到 token 级，叠加 7 套可插拔投机解码（EAGLE / EAGLE-v2 / 多层 EAGLE / FrozenKV-MTP / NGRAM / DFLASH / Standalone）+ prefill-decode-disaggregation（Mooncake / NIXL / Mori / Ascend 4 后端）+ 10+ attention 后端（FlashInfer / FA3 / Triton / FlashMLA / NSA / DSV4 / ...），是 vLLM 之外的主流开源推理栈。

核心架构图

flowchart TB Client[HTTP Client] Client --> L["launch_server.py
Process 0: HTTP main
default / --grpc-mode / --encoder-only / --use-ray"] L -->|FastAPI/uvicorn| A["Protocol Adapters
OpenAI / Anthropic / Ollama
entrypoints/openai/serving_chat.py
→ GenerateReqInput"] A -->|in-process| TM["TokenizerManager (main proc)
generate_request L516
tokenize → input_ids
send_to_scheduler.send_pyobj"] TM -->|ZMQ PUSH
scheduler_input_ipc| SCH subgraph SCH["Scheduler subprocess (TP rank 0..N)"] direction TB EL["event_loop_normal/_overlap L1537
recv_requests L1656
get_next_batch_to_run L2485
run_batch → ModelRunner.forward
process_batch_result L3170"] subgraph CORE["Inference Core"] direction TB SB["ScheduleBatch (in-flight)
Req[]: prefix_indices / extend_input_len / out_cache_loc
forward_mode: EXTEND/DECODE/MIXED"] MEM["Memory subsystem
RadixCache (tree, match_prefix/insert)
ReqToTokenPool (req → token idx)
TokenToKVPool (MHA/MLA/NSA)"] MR["ModelRunner (model_executor/)
ForwardBatch → AttentionBackend
FlashInfer / FA3 / Triton / FlashMLA / ...
CUDA Graph runner / Sampling"] MIX["Optional Mixins (composed via inheritance)
SpeculativeWorker (EAGLE-2/NGRAM/MTP/DFLASH/Standalone)
SchedulerDisaggregation Prefill/Decode Mixin
SchedulerPP / SchedulerDPAttn / SchedulerDllm"] SB <-->|cache lookup| MEM SB --> MR MR -.composes.-> MIX end EL --> CORE end SCH -->|ZMQ PUSH
detokenizer_ipc| DET["DetokenizerManager (subproc)
event_loop L140
tokenizer.batch_decode → BatchStrOutput"] DET -->|ZMQ PUSH
tokenizer_ipc| TM TM --> Client subgraph DISAGG["External cluster (P/D-disagg mode, optional)"] direction LR PI[Prefill instance
full SGLang] DI[Decode instance
full SGLang] PI -->|Mooncake / NIXL
Mori / Ascend| DI end SCH -.optional.-> DISAGG classDef proc fill:#1f2530,stroke:#6ea8fe,color:#e6e6e6 classDef core fill:#262b38,stroke:#83c4ff,color:#e6e6e6 class TM,SCH,DET,L proc class SB,MEM,MR,MIX core

原 ASCII 图

                              ┌────────────────────────────────────────────────────┐
                              │   sglang/launch_server.py  (Process 0: HTTP main)  │
                              │   ├─ Default: http_server.py:launch_server()       │
                              │   ├─ --grpc-mode → grpc_server.py:serve_grpc       │
                              │   ├─ --encoder-only → disaggregation/encode_*      │
                              │   └─ --use-ray → ray/http_server.py                │
                              └───────────────────┬────────────────────────────────┘
                                                  │ FastAPI/uvicorn
                                                  v
                            ┌─────────────────────────────────────────────┐
                            │  OpenAI / Anthropic / Ollama adapters       │
                            │  entrypoints/openai/serving_chat.py …       │
                            │  (ChatCompletionRequest → GenerateReqInput) │
                            └─────────────────────┬───────────────────────┘
                                                  │ in-process call
                                                  v
       ╔═══════════════════════════════════════════════════════════════════════════╗
       ║  TokenizerManager  (managers/tokenizer_manager.py)  — main process        ║
       ║  • async generate_request() @ L516                                        ║
       ║  • tokenize text → input_ids                                              ║
       ║  • send_to_scheduler.send_pyobj(TokenizedGenerateReqInput)                ║
       ║  • _wait_one_response() ← stream from detokenizer                         ║
       ╚════════════════╤═══════════════════════════════════╤══════════════════════╝
                        │ ZMQ PUSH                          ▲ ZMQ PULL
        scheduler_input_ipc_name                      tokenizer_ipc_name
                        v                                   │
       ╔═══════════════════════════════════════╗  ╔════════╧═══════════════════════╗
       ║  Scheduler (subprocess, TP rank 0..N) ║  ║  DetokenizerManager (subproc.) ║
       ║  managers/scheduler.py                ║  ║  managers/detokenizer_manager  ║
       ║  • event_loop_normal/_overlap @ 1537  ║  ║  • event_loop @ L140           ║
       ║  • recv_requests @ 1656               ║  ║  • batch_decode(token_ids)     ║
       ║  • get_next_batch_to_run @ 2485       ║  ║    → BatchStrOutput            ║
       ║  • run_batch → ModelRunner.forward    ║  ╚════════╤═══════════════════════╝
       ║  • process_batch_result @ 3170        ║           ▲ ZMQ PULL
       ║    → send BatchTokenIDOutput          ║───────────┘ detokenizer_ipc_name
       ╚════════════════╤══════════════════════╝
                        │  owns the GPU
                        v
       ┌──────────────────────────────────────────────────────────────────────────┐
       │                    Inference Core (within Scheduler proc)                │
       │                                                                          │
       │  ┌──── ScheduleBatch (in-flight) ────┐    ┌──── Mem subsystem ──────┐   │
       │  │ Req[]:                            │    │ RadixCache (tree)       │   │
       │  │   prefix_indices  (cached KV)     │◄──▶│   match_prefix / insert │   │
       │  │   extend_input_len (new tokens)   │    │ ReqToTokenPool          │   │
       │  │   out_cache_loc   (new KV slots)  │    │   (req → token idx)     │   │
       │  │ forward_mode: EXTEND/DECODE/MIXED │    │ TokenToKVPool           │   │
       │  └─────────────────┬─────────────────┘    │   MHA/MLA/NSA variants  │   │
       │                    │                      └─────────────────────────┘   │
       │                    v                                                    │
       │  ┌──── ModelRunner (model_executor/) ────────────────────────────────┐  │
       │  │  ForwardBatch (forward_batch_info.py)                             │  │
       │  │  → AttentionBackend (FlashInfer / FA3 / Triton / FlashMLA / ...)  │  │
       │  │  → CUDA Graph runner (decode replay) / piecewise / breakable      │  │
       │  │  → Sampling (constrained vocab mask, grammar)                     │  │
       │  └───────────────────────────────────────────────────────────────────┘  │
       │                                                                          │
       │  ┌──── Optional mixins (composed into Scheduler via inheritance) ────┐  │
       │  │ • SpeculativeWorker  (EAGLE-2 / NGRAM / MTP / DFLASH / Standalone)│  │
       │  │ • SchedulerDisaggregationPrefillMixin                             │  │
       │  │ • SchedulerDisaggregationDecodeMixin                              │  │
       │  │ • SchedulerPPMixin   (pipeline-parallel)                          │  │
       │  │ • SchedulerDPAttnMixin (DP-attention)                             │  │
       │  │ • SchedulerDllmMixin (diffusion-LLM)                              │  │
       │  └───────────────────────────────────────────────────────────────────┘  │
       └──────────────────────────────────────────────────────────────────────────┘

  External cluster (P/D-disagg mode):
                ┌──────────────────┐    KV transfer    ┌──────────────────┐
                │ Prefill instance │ ════════════════> │ Decode instance  │
                │  (full SGLang)   │  Mooncake / NIXL  │  (full SGLang)   │
                └──────────────────┘  Mori / Ascend    └──────────────────┘

模块分层

层 / 模块	职责
HTTP / RPC 入口	FastAPI/uvicorn + gRPC + 离线 Engine API；`launch_subprocesses()` 启动 4 进程流水线
协议适配	OpenAI / Anthropic / Ollama 三套协议统一翻译成 `GenerateReqInput`，下游无感切换
TokenizerManager	主进程做分词 + ZMQ 出/入；DetokenizerManager 在独立进程做反分词；ZMQ 三段管道
Scheduler（GPU 主进程）	事件循环 + 准入控制 + batch 组装 + 调用 ModelRunner；用 8+ Mixin 拼装 disagg/PP/DPAttn/Dllm 横切特性
Memory 子系统	4 套 RadixCache 变体（vanilla / hi / mamba / swa / cpp）+ 两级 pool（Req→Token, Token→KV）+ 4 种 evict 策略
ModelRunner	ForwardMode 调度（EXTEND/DECODE/MIXED/TARGET_VERIFY/DRAFT_EXTEND）；CUDA Graph 预捕获 decode 路径
Attention 后端	10+ 可插拔后端：FlashInfer / FA3-4 / Triton / FlashMLA / NSA / DSV4 / FlexAttention / TorchNative / Wave / AITER / Intel-AMX
投机解码	7 算法走 `BaseSpecWorker` + `spec_registry`：EAGLE-2 / EAGLE-v2 / 多层 EAGLE / NGRAM / FrozenKV-MTP / DFLASH / Standalone
P/D 分离	prefill / decode 节点独立扩容 + 5 KV transfer backend：mooncake / NIXL / Mori / Ascend / Fake
结构化输出	xgrammar / outlines / llguidance / reasoner backend；在 sampling 前 apply vocab mask
模型库	100+ 模型（LLaMA / Qwen / DeepSeek / Mixtral / Gemma / GPT-OSS / 多模态…）
多模态	image / audio / video 预处理 + KV 缓存；audio 走 Whisper / Qwen-ASR adapter
分布式	TP / PP / DP / EP + 专家并行 load balancing
前端 DSL	SGLang DSL（fork / gen / select）—— 论文里的 "Structured Generation Language"

分层约束：TokenizerManager 不持有 GPU；Mixin 之间互不依赖（disagg prefill / decode mixin 互斥）；RadixCache 4 变体按 attention 类型（MHA/MLA/Mamba/SWA）启动时绑死，不可混用；speculative + chunked prefill + disagg 三选二。

关键数据流：端到端请求生命周期

sequenceDiagram autonumber participant C as HTTP Client participant H as FastAPI
http_server.py participant TM as TokenizerManager
(main proc) participant SC as Scheduler
(subproc) participant MR as ModelRunner
(GPU) participant DT as DetokenizerManager
(subproc) C->>H: POST /v1/chat/completions H->>H: parse ChatCompletionRequest
build GenerateReqInput H->>TM: tokenizer_manager.generate_request(obj) TM->>TM: tokenizer(text) → input_ids
build TokenizedGenerateReqInput TM->>SC: ZMQ PUSH send_pyobj
(scheduler_input_ipc) activate SC Note right of SC: event_loop_normal()
scheduler.py:1537 SC->>SC: 1. recv_requests() — drain waiting_queue SC->>SC: 2. get_next_batch_to_run
RadixCache.match_prefix
PrefillAdder budget check
chunked prefill / policy (LPM/FCFS/LOF/DFS_WEIGHT) SC->>MR: 3. run_batch(batch) → ForwardBatch Note over MR: ForwardMode = EXTEND/DECODE/MIXED
DECODE: CUDA graph replay
EXTEND/MIXED: native attention kernel
sampling (+ constrained vocab mask) MR-->>SC: logits + sampled token_ids SC->>SC: 4. process_batch_result
write output_ids
tree_cache.insert if finished SC->>DT: ZMQ PUSH BatchTokenIDOutput
(detokenizer_ipc) deactivate SC DT->>DT: tokenizer.batch_decode → strings
build BatchStrOutput DT->>TM: ZMQ PUSH BatchStrOutput
(tokenizer_ipc) TM-->>C: SSE stream / final JSON

原 ASCII 图

Client HTTP POST /v1/chat/completions
   │
   v
┌──────────────────────────────────────────────────────────────────────┐
│ FastAPI handler (http_server.py) → OpenAIServingChat                 │
│   • parse ChatCompletionRequest                                      │
│   • build GenerateReqInput (text or input_ids, sampling_params,      │
│     stream, return_logprob, ...)                                     │
└────────────────────────────┬─────────────────────────────────────────┘
                             │ tokenizer_manager.generate_request(obj)
                             v
┌──────────────────────────────────────────────────────────────────────┐
│ TokenizerManager.generate_request()  (tokenizer_manager.py:516)      │
│   • tokenizer(text) → input_ids                                      │
│   • build TokenizedGenerateReqInput                                  │
│   • send_to_scheduler.send_pyobj(obj)        ─── ZMQ PUSH ───┐       │
│   • await _wait_one_response(rid)   ◄──── ZMQ PULL ────┐     │       │
└──────────────────────────────────────────────────────────│─────│─────┘
                                                          │     │
       scheduler_input_ipc_name (ZMQ PUSH/PULL) ◄─────────│─────┘
                             │                            │
                             v                            │
┌──────────────────────────────────────────────────────────────────────┐
│ Scheduler.event_loop_normal()  (scheduler.py:1537)                   │
│   step 1: recv_requests()                  → drain waiting_queue     │
│   step 2: get_next_batch_to_run()                                    │
│           • RadixCache.match_prefix(req)   → req.prefix_indices      │
│           • PrefillAdder budget check:                               │
│             rem_input_tokens, rem_chunk_tokens, rem_total_tokens     │
│           • chunked prefill if extend_input_len > chunk_size         │
│           • policy = LPM / FCFS / LOF / DFS_WEIGHT                   │
│   step 3: run_batch(batch)                                           │
│           → ModelRunner.forward(ForwardBatch)                        │
│             • ForwardMode = EXTEND / DECODE / MIXED                  │
│             • DECODE: CUDA graph replay (固定 BS)                    │
│             • EXTEND/MIXED: 原生 attention kernel                    │
│             • AttentionBackend.forward_decode / forward_extend       │
│             • sampling (with constrained vocab mask if any)          │
│   step 4: process_batch_result()                                     │
│           • write output_ids                                         │
│           • if finished: tree_cache.insert(req.last_node, new_kv)    │
│           • send_to_detokenizer.send_pyobj(BatchTokenIDOutput) ──┐   │
└──────────────────────────────────────────────────────────────────│───┘
                                                                  │
       detokenizer_ipc_name (ZMQ PUSH/PULL)  ◄───────────────────┘
                             │
                             v
┌──────────────────────────────────────────────────────────────────────┐
│ DetokenizerManager.event_loop()  (detokenizer_manager.py:140)        │
│   • tokenizer.batch_decode(token_ids) → strings                      │
│   • build BatchStrOutput                                             │
│   • send_to_tokenizer.send_pyobj(BatchStrOutput) ─── ZMQ PUSH ───┐   │
└──────────────────────────────────────────────────────────────────│───┘
                                                                  │
       tokenizer_ipc_name (ZMQ PUSH/PULL)  ◄───────────────────── ┘
                             │
                             v
TokenizerManager (main proc) ── SSE stream / final JSON ──> HTTP Client

流水线 overlap：event_loop_overlap（scheduler.py:1564）让 GPU 算 batch N 的同时 CPU 准备 batch N+1，TokenizerManager 同时分词 batch N+2，DetokenizerManager 反分词 batch N−1 —— 4 进程同时活跃。

RadixCache 复用 + KV pool 联动

flowchart TD subgraph RC["RadixCache (radix_cache.py)"] direction TB Root((Root)) Root --> NA["[101,102,103] Node A
value=[kv_0, kv_1, kv_2]"] NA --> NB["[104] Node B
value=[kv_3]"] NA --> NC["[104,105] Node C
value=[kv_4, kv_5]"] Root --> ND["[101,102,106] Node D
value=[kv_6, kv_7, kv_8]"] Match["match_prefix([101,102,103,104,X])
walk → A → B
prefix_indices = [kv_0, kv_1, kv_2, kv_3] (device)
last_node = B
extend_input_len = len(input) − 4"] end subgraph KV["TokenToKVPool (memory_pool.py:789, MHA/MLA/NSA)"] direction TB Slots["Physical GPU K, V tensors — flat token-indexed
[kv_0][kv_1][kv_2][kv_3][kv_4][kv_5][kv_6][kv_7]...
⌞ Node A ⌟⌞ B ⌟⌞ C ⌟⌞ D ⌟"] Ops["alloc(n) → out_cache_loc
free(indices) ← eviction"] end RC -->|value pointers| KV subgraph RP["ReqToTokenPool (memory_pool.py:128)
shape [num_reqs+1, max_ctx_len] int32"] direction TB Rows["row i = req_pool_indices[i] = full KV index list
req_0: [kv_0, kv_1, kv_2, kv_3, kv_NEW_0, kv_NEW_1, ...]
req_1: [kv_6, kv_7, kv_8, kv_NEW_2, ...]"] Kernel["attention kernel:
k_cache[req_to_token[i, :seq_lens[i]]] = ..."] Rows --> Kernel end KV -.read via gather.-> Kernel

原 ASCII 图

┌─────────────────────────────────────────────────────────────┐
│ RADIXCACHE (radix_cache.py)                                 │
│ Root                                                        │
│  ├─ [101, 102, 103] → Node A  (value=[kv_0, kv_1, kv_2])    │
│  │   ├─ [104] → Node B  (value=[kv_3])                      │
│  │   └─ [104, 105] → Node C  (value=[kv_4, kv_5])           │
│  └─ [101, 102, 106] → Node D  (value=[kv_6, kv_7, kv_8])    │
│                                                             │
│  match_prefix(input_ids=[101,102,103,104,X]):               │
│    walk tree → match A → match B → return                   │
│      prefix_indices = [kv_0, kv_1, kv_2, kv_3]   (device)   │
│      last_node = B                                          │
│      extend_input_len = len(input) - 4                      │
└─────────────────────────┬───────────────────────────────────┘
                          │ value pointers
                          v
┌─────────────────────────────────────────────────────────────┐
│ TokenToKVPool (memory_pool.py:789, MHA/MLA/NSA variants)    │
│ Physical GPU K, V tensors, flat token-indexed:              │
│  [kv_0] [kv_1] [kv_2] [kv_3] [kv_4] [kv_5] [kv_6] [kv_7]... │
│   └─Node A──────┘ └─B─┘ └─C─────┘ └─D───────────────┘       │
│                                                             │
│ alloc(n) → returns n new token slots (out_cache_loc)        │
│ free(indices) → returns slots after eviction                │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ ReqToTokenPool (memory_pool.py:128)                         │
│ Shape: [num_reqs+1, max_ctx_len], int32                     │
│ Row i = req_pool_indices[i] holds full KV index list:       │
│   req_0: [kv_0, kv_1, kv_2, kv_3, kv_NEW_0, kv_NEW_1, ...]  │
│   req_1: [kv_6, kv_7, kv_8, kv_NEW_2, ...]                  │
│                                                             │
│ 在 attention kernel 里：                                    │
│   for i in batch:                                           │
│     k_cache[req_to_token[i, :seq_lens[i]]] = ...            │
└─────────────────────────────────────────────────────────────┘

投机解码（EAGLE-2 默认）

flowchart TD Start(["Decode step
target accepted last token T_n"]) Start --> Draft["Draft model (eagle_worker.py:91)
• 1 step forward
• build_tree_kernel_efficient (L818)
propose topk=5 per position, K steps deep
→ tree of ~K×5 draft tokens
→ tree_mask + positions + retrieve_idx"] Draft -->|draft_tokens + tree metadata| Target["Target model (single forward pass)
• ForwardMode.TARGET_VERIFY
• EagleVerifyInput.custom_mask = tree_mask
• compute logits for ALL tree leaves in 1 fwd
• sample top-1 per leaf"] Target -->|verified_tokens| Accept["Accept prefix (longest path draft == verify)
• rollback KV cache to divergence
• on_verify_complete_cpu() → adaptive_spec_params
(adjusts topk / depth)
• emit 1~K+1 accepted tokens"] Accept -.next step.-> Start classDef stage fill:#1f2530,stroke:#83c4ff,color:#e6e6e6 class Draft,Target,Accept stage

原 ASCII 图

Decode step (target accepted last token T_n)
   │
   v
┌─────────────────────────────────────────────────────────────┐
│ Draft model (eagle_worker.py:91)                            │
│  • 1 step forward                                           │
│  • build_tree_kernel_efficient (line 818)                   │
│    propose topk=5 candidates per position, K steps deep     │
│    → tree of ~K×5 draft tokens                              │
│    → tree_mask (谁能 attend 谁) + positions + retrieve_idx  │
└────────────────────────┬────────────────────────────────────┘
                         │ draft_tokens + tree metadata
                         v
┌─────────────────────────────────────────────────────────────┐
│ Target model (single forward pass)                          │
│  • ForwardMode.TARGET_VERIFY                                │
│  • EagleVerifyInput.custom_mask = tree_mask                 │
│  • compute logits for ALL tree leaves in 1 fwd              │
│  • sample top-1 per leaf                                    │
└────────────────────────┬────────────────────────────────────┘
                         │ verified_tokens
                         v
┌─────────────────────────────────────────────────────────────┐
│ Accept prefix (longest path where draft == verify)          │
│  • rollback KV cache to divergence point                    │
│  • on_verify_complete_cpu() feed adaptive controller        │
│    (adaptive_spec_params.py adjusts topk/depth)             │
│  • emit accepted tokens (1~K+1 per step)                    │
└─────────────────────────────────────────────────────────────┘

P/D 分离请求流

flowchart TD Client([Client request]) Client --> P["Prefill instance (full SGLang scheduler)
• PrefillBootstrapQueue: handshake + KV slot prealloc
• Run forward (prefill only)
• release_req_to_metadata_buffer()
• Queue KVSender (begin transfer)"] P --> KV["KV transfer backend (one of):
• Mooncake — Moonshot KV store, ~80KB
• NIXL — NVIDIA GPUDirect RDMA
• Mori — ByteDance KV protocol, ~58KB
• Ascend — Huawei NPU native
• Fake — testing
poll_and_all_reduce() sync across ranks"] KV --> D["Decode instance (full SGLang scheduler)
• DecodePreallocQueue: reserve KV
• Wait for KV transfer completion
• DecodeTransferQueue → WaitingQueue → RunningBatch
• Skip prefill, populate forward meta
• Run decode loop, stream tokens"] D --> Client classDef inst fill:#1f2530,stroke:#83c4ff,color:#e6e6e6 classDef transfer fill:#2a2a3a,stroke:#ffa657,color:#e6e6e6 class P,D inst class KV transfer

原 ASCII 图

Client request
   │
   v
┌──────────────────────────────────────────┐
│ Prefill instance (full SGLang scheduler) │
│  • PrefillBootstrapQueue: handshake +    │
│    KV slot prealloc                      │
│  • Run forward (prefill only)            │
│  • release_req_to_metadata_buffer()      │
│  • Queue KVSender (begin transfer)       │
└──────────────────┬───────────────────────┘
                   │
                   v
┌──────────────────────────────────────────┐
│ KV transfer backend (one of):            │
│  • Mooncake (Moonshot KV store, ~80KB)   │
│  • NIXL (NVIDIA GPUDirect RDMA)          │
│  • Mori (ByteDance KV protocol, ~58KB)   │
│  • Ascend (Huawei NPU native)            │
│  • Fake (testing)                        │
│  poll_and_all_reduce() sync across ranks │
└──────────────────┬───────────────────────┘
                   │
                   v
┌──────────────────────────────────────────┐
│ Decode instance (full SGLang scheduler)  │
│  • DecodePreallocQueue: reserve KV       │
│  • Wait for KV transfer completion       │
│  • DecodeTransferQueue → WaitingQueue    │
│    → RunningBatch                        │
│  • Skip prefill, populate forward meta   │
│  • Run decode loop, stream tokens        │
└──────────────────┬───────────────────────┘
                   │
                   v
                Client

设计决策与哲学

4 进程异步流水线：CPU 分词 / GPU 推理 / CPU 反分词彻底进程化隔离，zmq pyobj 串联 + Scheduler 内部 event_loop_overlap 进一步重叠上轮 GPU 与本轮 CPU 准备 —— 任何 stage 阻塞不卡其他 stage
radix-attention 取代 paged-attention：vLLM 16-token block 粒度浪费太多复用机会，LLM workload 里 system prompt / few-shot / agent template 是高度共享的前缀，SGLang 用 token-level radix 树 + flat KV pool 把"任意 token 边界共享"做到极致 —— 论文里 RadixAttention 在 tree-of-thought / few-shot 场景 throughput 1.6-6.4× over vLLM
两级 KV pool：ReqToTokenPool（req → token 索引列表）+ TokenToKVPool（token 索引 → 物理 KV 槽位）的双跳让 radix 树叶直接指向 KV pool 位置，attention kernel 用 device-resident req_to_token 张量做一次 gather —— 既保留 token 级粒度，又能让 kernel 高效访存
Scheduler 用 Mixin 拼装：4000+ 行的 scheduler.py 通过 10+ 个 Mixin 把横切关注点解耦 —— open-closed 友好，新加 disagg 后端 / PP 形态不改主类。代价是新人定位某个 hook 实现要跨 8-10 个 mixin 文件
7 套投机解码共生：BaseSpecWorker + spec_registry 注册表让 EAGLE-2 / EAGLE-v2 / 多层 EAGLE / NGRAM / FrozenKV-MTP / DFLASH / Standalone 共存 —— 不同 workload 选最优。NGRAM 零模型成本适合 retrieval-heavy；MTP 适合 DeepSeek-V3 原生 multi-token-predict 头；EAGLE-2 适合通用模型
prefill-decode-disaggregation + 4 transfer backend：prefill（compute-bound）和 decode（memory-bound）独立扩容，KV 通过 RDMA / mooncake / NIXL / Mori / Ascend 跨节点传输。kv_events.py 用 ZMQ 发布 KV 事件给 router / observability
10+ Attention 后端：AttentionBackend ABC + attention_registry.py 注册表 —— FlashInfer 默认 / FA3-4 / Triton / FlashMLA / NSA / DSV4 / FlexAttention / TorchNative / Wave / AITER / Intel-AMX，按硬件 / 模型 / 性能曲线选
CUDA Graph 只给 DECODE：DECODE 形状固定（batch_size × 1 token），适合预捕获 graph replay 消除 CPU launch overhead；EXTEND/MIXED 形状变化走原生 kernel。这是 SGLang 把 decode latency 压到接近 kernel-only 的关键
统一三大协议入口：单 backend 支持 OpenAI / Anthropic / Ollama + gRPC + 原生 Engine SDK，下游 client 无缝切换

关键组件深入：RadixCache（核心创新）

TreeNode (mem_cache/radix_cache.py:206)：key: RadixKey（变长 token 序列，支持 bigram 模式给 EAGLE 用）+ value: torch.Tensor（指向 TokenToKVPool 的 KV 索引张量）+ children: dict[int, TreeNode] + lock_ref: int（引用计数 >0 时禁止 evict）+ evicted: bool（已 evict 的占位，支持增量恢复）。

match_prefix (line 360)：从 root walk，每层比 input_ids[i:] 和子节点 key；部分匹配时 split 节点（原节点截到匹配长度，剩余 token 移到新子节点）；返回 (device_indices, last_node) —— 前者是已命中 KV 索引张量（直接喂 attention kernel），后者给后续 insert 用。

insert (line 420)：请求完成时在 req.last_node 下挂新节点，key = req.fill_ids[len(prefix):]，value = req.out_cache_loc（本次 forward 新写入的 KV 槽位）；父节点 lock_ref 递减，允许 evict。

Eviction (line 560)：LRU / LFU / FIFO / SLRU / Priority 4 策略，pop evictable_leaves 堆顶，free 该节点 value 指向的 KV 槽位，递归向上 unlock。

vs vLLM 关键差异：vLLM BlockManager 固定 16 token block + block table，无法在 token 7 处分叉；SGLang radix 树天然支持任意 token 边界 split，value 是连续区间索引而非 block 列表。

与同类对比

维度	SGLang	vllm	TensorRT-LLM	TGI
KV 复用粒度	token 级（radix-attention）	16-token block（paged-attention）	block	block
前缀共享	任意分叉点自动	同 block 才共享	prefix cache	prefix cache
投机解码	7 算法	EAGLE / Medusa	EAGLE / Medusa / ReDrafter	speculative-decoding
P/D 分离	5 后端	实验	✅	实验
Attention 后端	10+	FlashAttn / xFormers / TorchSDPA	TRT 内核	FlashAttn
结构化输出	4 backend	outlines	有限	outlines
多协议入口	OpenAI / Anthropic / Ollama / gRPC / Engine	OpenAI	TensorRT 原生	TGI 原生
国产硬件	Ascend NPU / Wave / AITER 一等公民	实验	—	—
前端 DSL	SGLang DSL（fork/gen/select）	—	—	—
典型适用	agent / tool-use / RAG / 结构化生成 / 多模态	通用 LLM serving	极致延迟	HF 生态

SGLang 架构与设计思路分析

一句话定位

核心架构图

模块分层

关键数据流：端到端请求生命周期

RadixCache 复用 + KV pool 联动

投机解码（EAGLE-2 默认）

P/D 分离请求流

设计决策与哲学

关键组件深入：RadixCache（核心创新）

与同类对比

相关页面

相关页面（来自 frontmatter）