Llumnix Documentation
Llumnix is a full-stack solution for distributed LLM inference serving, featuring fully dynamic request scheduling for modern LLM deployments.
Getting Started
New to Llumnix? Start here.
Quick Start – Deploy your first Llumnix cluster in minutes
Deployment Guide – Full deployment guide covering all modes: neutral, PD, and PD-KVS
Benchmark – Performance benchmarks for Llumnix on Kubernetes
Llumlet Configuration – Configure Llumlet
Development Guide
Development Setup – Environment setup, build commands, and guide to unit and end-to-end tests
Build Images – Manually build and push component images
Design
Understand how Llumnix works internally.
Architecture Overview – Full-stack component design
Gateway
Gateway Architecture – Gateway architecture and basic functionality
PDD Forwarding Protocol – Prefill-Decode disaggregation forwarding protocol
Batch Inference – Batch inference support
Traffic Splitting – Split traffic across internal and external endpoints with fallback
Scheduler
Scheduling Policy Framework – Scheduling policy design
Instant and Accurate Load – Instance load observation and modeling
Cache-aware Scheduling – Scheduling with KV cache state awareness
Predictor-Enhanced Scheduling – Online latency prediction for accurate prefill load estimation
SLO-aware Scheduling – Scheduling with SLO awareness
Adaptive PD Scheduling – Adaptive P/D role assignment to maximize SLO attainment
Rescheduler – Continuous rescheduling via request migration
Llumlet
Llumlet and Llumlet Proxy – Engine-side agent bridging the local engine and the global scheduler
Real-time Instance Status Tracking – How Llumnix tracks engine state with minimal delay and overhead
Migration – Request migration implementation