Blade-KVT (KV Transfer)#
Overview#
KVT (KV Transfer) is a high-performance, zero-overhead KV Cache transfer module designed for distributed LLM inference scenarios. It handles efficient KV Cache transmission between two nodes, supporting multiple model architectures and cache layouts.
Design Goals#
KVT’s design originates from the following core requirements:
Bypass Design: No major changes to the main step flow, enabling sidecar-style integration
Zero Overhead: KV Cache transfer adds no extra load to the step execution path
Full CUDA Graph Compatibility: Supports CUDA Graph optimization without introducing CPU synchronization points
Generality: Supports multiple attention implementations and model architectures (FlashAttention, GDN, DSA, etc.) and their cache layouts
Core Architecture#
KVT consists of four core modules:
1. Access Layer (Python Binding)#
Handles Python-side integration with the following functions:
CUDA Event notification for layer computation completion, enabling Full CUDA Graph compatibility
Supports full CUDA Graph on the P (prefill) node; the D (decode) node requires no logic execution
2. ParseBlock (Block Parser)#
Calculates the list of IpcBlocks to send based on layer and request information:
struct IpcBlock {
    src_off: usize, // Source offset
    dst_off: usize, // Destination offset
    len: usize,     // Transfer length
}
Design Evolution:
Early KVT assumed a cache shape of (num_blocks, block_size, 2, num_heads, head_dim), meaning the K and V of each token reside in the same block.
However, vLLM with FlashAttention uses a cache shape of (2, num_blocks, block_size, num_kv_heads, head_size), meaning K and V are stored in separate regions, so the K and V of one logical block are no longer contiguous.
Solution: Extract ParseBlock as a pluggable strategy:
During initialization, block_size_bytes and token_size_bytes are still calculated assuming “token kv together”
ParseBlock reinterprets offsets based on the actual layout during parsing
This design enables support for new architectures (e.g., Qwen3-Next GDN, DeepSeek DSA) by simply adding new ParseBlock implementations.
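The pluggable strategy described above can be sketched as a trait with one implementation per cache layout. This is a minimal illustration, not KVT's actual code: the `Layout` struct, the `ParseBlock` trait signature, and the `TokenKvTogether` and `SplitKv` names are all hypothetical. It also assumes that source and destination have the same `num_blocks`, and that `block_size_bytes` covers both K and V, so each half is `block_size_bytes / 2`.

```rust
/// One contiguous copy, mirroring the IpcBlock struct shown earlier.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct IpcBlock {
    pub src_off: usize,
    pub dst_off: usize,
    pub len: usize,
}

/// Hypothetical layout parameters, computed as if K and V of a token sit together.
#[derive(Debug, Clone, Copy)]
pub struct Layout {
    pub block_size_bytes: usize, // K + V bytes of one block
    pub num_blocks: usize,
}

/// Hypothetical strategy trait: turns (source block, destination block) pairs into copies.
pub trait ParseBlock {
    fn parse(&self, layout: &Layout, pairs: &[(usize, usize)]) -> Vec<IpcBlock>;
}

/// Layout (num_blocks, block_size, 2, ...): each block is one contiguous span.
pub struct TokenKvTogether;

impl ParseBlock for TokenKvTogether {
    fn parse(&self, layout: &Layout, pairs: &[(usize, usize)]) -> Vec<IpcBlock> {
        pairs
            .iter()
            .map(|&(src, dst)| IpcBlock {
                src_off: src * layout.block_size_bytes,
                dst_off: dst * layout.block_size_bytes,
                len: layout.block_size_bytes,
            })
            .collect()
    }
}

/// Layout (2, num_blocks, block_size, ...): all K blocks first, then all V blocks.
pub struct SplitKv;

impl ParseBlock for SplitKv {
    fn parse(&self, layout: &Layout, pairs: &[(usize, usize)]) -> Vec<IpcBlock> {
        let half = layout.block_size_bytes / 2; // bytes of K (or V) in one block
        let plane = layout.num_blocks * half; // bytes of the whole K region
        let mut out = Vec::with_capacity(pairs.len() * 2);
        for &(src, dst) in pairs {
            // plane_idx 0 is the K region, plane_idx 1 is the V region.
            for plane_idx in 0..2 {
                let base = plane_idx * plane;
                out.push(IpcBlock {
                    src_off: base + src * half,
                    dst_off: base + dst * half,
                    len: half,
                });
            }
        }
        out
    }
}
```

In this sketch, the same initialization parameters feed both strategies. Only the offset arithmetic differs, which is why a new layout needs only a new implementation of the trait.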
3. Control Layer#
Responsible for:
Remote connection maintenance
Listening for layer computation completion signals (CUDA Event)
Scheduling data transfer via the transport layer
Error handling and fault tolerance for transfer failures
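The responsibilities above can be pictured as a small loop that waits for per-layer completion signals and hands each finished layer to the transport. The sketch below is an illustration under stated assumptions rather than KVT's control layer: a standard-library channel stands in for CUDA Event notification, `run_control_loop` and its `transfer` callback are hypothetical names, and bounded retry is just one possible fault-tolerance policy.

```rust
use std::sync::mpsc::Receiver;

/// Waits for layer-completion signals and schedules one transfer per layer.
///
/// `signals` stands in for CUDA Event notifications (an assumption of this sketch).
/// `transfer` stands in for the transport layer and is retried up to `max_attempts` times.
/// Returns, for every signalled layer, whether its transfer eventually succeeded.
pub fn run_control_loop<F>(
    signals: Receiver<usize>,
    max_attempts: u32,
    mut transfer: F,
) -> Vec<(usize, bool)>
where
    F: FnMut(usize) -> Result<(), String>,
{
    let mut report = Vec::new();
    // Iterating a Receiver blocks until the next signal arrives and
    // ends once every sender has been dropped.
    for layer in signals {
        let mut ok = false;
        for _ in 0..max_attempts {
            if transfer(layer).is_ok() {
                ok = true;
                break;
            }
        }
        report.push((layer, ok));
    }
    report
}
```

Because the loop runs on its own receiver, the producer of the signals never waits on a transfer, which matches the sidecar-style goal stated earlier.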
4. Transport Layer#
Handles Vec<IpcBlock> transmission, supporting multiple backends:
GPU Direct RDMA (GDR): Direct GPU memory access, lowest latency
TCP: Bypasses the GPU/GDR path and is isolated from EP (expert parallel) all2all traffic
Shared Memory: Single-node multi-GPU scenarios
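One way to picture multiple backends behind a single interface is a trait that accepts a list of `IpcBlock` copies. The sketch below is illustrative only: the `Transport` trait, `TransferError`, and the `LocalCopy` backend are hypothetical names, and `LocalCopy` simply copies into a local byte buffer as a stand-in for a real GDR, TCP, or shared-memory backend.

```rust
/// One contiguous copy, mirroring the IpcBlock struct shown earlier.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct IpcBlock {
    pub src_off: usize,
    pub dst_off: usize,
    pub len: usize,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum TransferError {
    OutOfBounds { block_index: usize },
}

/// Hypothetical backend interface: send every block, return bytes transferred.
pub trait Transport {
    fn send(&mut self, src: &[u8], blocks: &[IpcBlock]) -> Result<usize, TransferError>;
}

/// Stand-in backend whose destination is a local byte buffer.
pub struct LocalCopy {
    pub dst: Vec<u8>,
}

impl Transport for LocalCopy {
    fn send(&mut self, src: &[u8], blocks: &[IpcBlock]) -> Result<usize, TransferError> {
        // Validate every block first so a bad request never writes partially.
        for (i, b) in blocks.iter().enumerate() {
            let src_end = b.src_off.checked_add(b.len);
            let dst_end = b.dst_off.checked_add(b.len);
            let in_bounds = matches!(
                (src_end, dst_end),
                (Some(s), Some(d)) if s <= src.len() && d <= self.dst.len()
            );
            if !in_bounds {
                return Err(TransferError::OutOfBounds { block_index: i });
            }
        }
        let mut sent = 0;
        for b in blocks {
            self.dst[b.dst_off..b.dst_off + b.len]
                .copy_from_slice(&src[b.src_off..b.src_off + b.len]);
            sent += b.len;
        }
        Ok(sent)
    }
}
```

With an interface of this shape, the control layer could choose a backend per connection without changing how blocks are parsed.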
Physical Layout Abstraction#
KVT provides the following transfer abstraction:
Each layer → One GPU memory region
↓
Memory → Multiple Blocks (same byte size)
↓
Block → Multiple Tokens (same byte size)
Physical layout parameters passed during initialization:
block_size_bytes: Byte size per block
token_size_bytes: Byte size per token
num_blocks: Number of blocks
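The three parameters are enough to locate any token inside a layer's memory region. The sketch below shows the arithmetic, assuming blocks are packed back to back and every block holds a whole number of tokens. The `PhysicalLayout` type and its methods are hypothetical and only illustrate the abstraction.

```rust
/// Hypothetical holder for the three layout parameters of one layer.
#[derive(Debug, Clone, Copy)]
pub struct PhysicalLayout {
    block_size_bytes: usize,
    token_size_bytes: usize,
    num_blocks: usize,
}

impl PhysicalLayout {
    /// Returns None unless every block holds a whole, non-zero number of tokens.
    pub fn new(block_size_bytes: usize, token_size_bytes: usize, num_blocks: usize) -> Option<Self> {
        if token_size_bytes == 0
            || block_size_bytes == 0
            || block_size_bytes % token_size_bytes != 0
        {
            return None;
        }
        Some(Self { block_size_bytes, token_size_bytes, num_blocks })
    }

    /// Number of tokens stored in one block.
    pub fn tokens_per_block(&self) -> usize {
        self.block_size_bytes / self.token_size_bytes
    }

    /// Total byte size of the layer's memory region.
    pub fn region_bytes(&self) -> usize {
        self.num_blocks * self.block_size_bytes
    }

    /// Byte offset of a token within the region, or None if out of range.
    pub fn token_offset(&self, block: usize, token: usize) -> Option<usize> {
        if block >= self.num_blocks || token >= self.tokens_per_block() {
            return None;
        }
        Some(block * self.block_size_bytes + token * self.token_size_bytes)
    }
}
```

For example, with 4096-byte blocks and 256-byte tokens, token 3 of block 2 starts at byte 2 * 4096 + 3 * 256 = 8960.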
Integration with HybridConnector#
KVT works in coordination with HybridConnector:
┌─────────────────────────────────────────────────────┐
│                vLLM Engine (Python)                 │
│                                                     │
│  ┌─────────────┐     ┌──────────────────────────┐   │
│  │  Scheduler  │     │     HybridConnector      │   │
│  └─────────────┘     │  ┌────────────────────┐  │   │
│                      │  │    KVT ( C++ )     │  │   │
│                      │  │  ┌──────────────┐  │  │   │
│                      │  │  │  ParseBlock  │  │  │   │
│                      │  │  └──────────────┘  │  │   │
│                      │  └────────────────────┘  │   │
│                      └──────────────────────────┘   │
└─────────────────────────────────────────────────────┘
Responsibility Division:
KVT: Handles low-level KV Cache transfer
HybridConnector: Manages request lifecycle, fault tolerance, and Backend coordination
Project Status#
✅ FlashAttention cache layout support
✅ Full CUDA Graph compatibility
✅ GDR transfer support
✅ TCP transfer support
✅ Qwen3-Next GDN support
✅ DeepSeek DSA support