Llumnix Documentation
Llumnix is a full-stack solution for distributed LLM inference serving, featuring fully dynamic request scheduling for modern LLM deployments.
Getting Started
New to Llumnix? Start here.
Quick Start – Deploy your first Llumnix cluster in minutes
Deployment Guide – Full deployment guide covering all modes: neutral, PD, and PD-KVS
Benchmark – Performance benchmarks for Llumnix on Kubernetes
Llumlet Configuration – Configure Llumlet
Development Guide
Development Setup – Environment setup, build commands, and guide to unit and end-to-end tests
Build Images – Manually build and push component images
Design
Understand how Llumnix works internally.
Architecture Overview – Full-stack component design
Gateway
Gateway Architecture – Gateway architecture and basic functionality
PDD Forwarding Protocol – Prefill-Decode disaggregation forwarding protocol
Batch Inference – Batch inference support
Traffic Splitting – Split traffic across internal and external endpoints with fallback
Scheduler
Scheduling Policy Framework – Scheduling policy design
Instant and Accurate Load – Instance load observation and modeling
Cache-aware Scheduling – Scheduling with KV cache state awareness
Predictor-Enhanced Scheduling – Online latency prediction for accurate prefill load estimation
SLO-aware Scheduling – Scheduling with SLO awareness
Adaptive PD Scheduling – Adaptive P/D role assignment to maximize SLO attainment
Rescheduler – Continuous rescheduling via request migration
Llumlet
Llumlet and Llumlet Proxy – Engine-side agent bridging the local engine and the global scheduler
Real-time Instance Status Tracking – How Llumnix tracks engine state with minimal delay and overhead
Migration – Request migration implementation