# Scheduler Configuration Guide

## Overview
For detailed design information about the Scheduler, please refer to the Scheduler Architecture Design Document.
## Policy Framework

### Usage and Extension Guidelines
**Choosing a policy and mode**

- SLO-restricted workloads with profiling data: SLO + full-mode.
- Utilization-focused workloads with cache locality needs: load-balance + full-mode + cache-aware scheduling.
- Simple deployments without CMS: load-balance + lite-mode, with `num_requests` or `num_tokens` as the load metric.
**Switching between full-mode and lite-mode**

- Flag: `--enable-full-mode-scheduling`.
- Full-mode (default): `--enable-full-mode-scheduling=true`.
- Lite-mode: `--enable-full-mode-scheduling=false`.
- Advanced scheduling features require full-mode.
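As a hedged illustration, the toggle looks like the following; the binary name and invocation style are assumptions, only the flag itself appears in this guide:

```shell
# Full-mode (the default): advanced scheduling features available
./scheduler --enable-full-mode-scheduling=true

# Lite-mode: basic load metrics only
./scheduler --enable-full-mode-scheduling=false
```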
**Extending with new policies**

- Reuse existing metrics, filters, and selectors; rewire combinations per infer type.
- For new signals: implement `instanceSchedulingMetric` in `metrics.go`, register in `getSchedulingMetric`, and configure in `schedule_policy_registry.go`.
## Cache-Aware Scheduling

### Prerequisites
Llumnix’s cache-aware scheduling requires the inference engine to use a global KV cache store (e.g. Mooncake). Llumnix queries the global KV cache store’s metadata service to obtain prefix cache hit information for each request, so this feature is only applicable when such a store is deployed.
Cache-aware scheduling is disabled by default. It is only supported in full-mode scheduling; the scheduler forcibly disables it when full-mode is not enabled.
### Configuration

Key configuration flags (`cmd/config/config.go`):
| Flag | Default | Description |
|---|---|---|
|  |  | Enable or disable cache-aware scheduling |
|  |  | Minimum prompt length to trigger cache-aware logic |
|  |  | KVS metadata service backend |
|  |  | Hash algorithm for prompt chunking |
|  |  | Token chunk size |
|  |  | Whether to hash the last incomplete chunk |
|  |  | Retry count for metadata service queries |
|  |  | Interval between retries |
|  |  | Duration to treat metadata service as down after failures |
Environment variables:
| Variable | Default | Description |
|---|---|---|
|  |  | Hash seed for algorithms that require one (e.g. |
### Deployment Example

For a complete Kubernetes deployment example with PD disaggregation, Mooncake as the KVS, and cache-aware scheduling enabled, see `deploy/pd-kvs/full-mode-scheduling/load-balance/`.
## Predictor-Enhanced Scheduling

### Configuration

#### Scheduler Flags
| Flag | Default | Description |
|---|---|---|
|  |  | Enable predictor-enhanced scheduling |
|  |  | Maximum tokens per prefill batch; must match the inference engine's `max_num_batched_tokens` |
|  |  | Minimum profiling samples before fitting the predictor model |
#### Engine Environment Variables

| Variable | Default | Description |
|---|---|---|
|  |  | Set to |
|  |  | Number of profiling samples to collect per instance |
### Constraints

- **Full-mode only**: Requires `--enable-full-mode-scheduling=true` and CMS. The scheduler forcibly disables this feature when full-mode is not enabled.
- **`max-num-batched-tokens` alignment**: The scheduler-side `--max-num-batched-tokens` must match the inference engine's `max_num_batched_tokens` configuration. A mismatch causes inaccurate step simulation and degrades prediction quality.
## SLO-Aware Scheduling

### Configuration
| Flag | Setting | Description |
|---|---|---|
|  |  | Must be |
|  |  | Set to |
|  | (required) | Path to TTFT profiling JSON file |
|  | (required) | Path to TPOT profiling JSON file |
|  |  | Target TTFT SLO in milliseconds |
|  |  | Target TPOT SLO in milliseconds |
|  |  | Multiplier for TTFT dispatch threshold |
|  |  | Multiplier for TPOT dispatch threshold |
The effective dispatch threshold is computed as `SLO * DispatchThreshold`. Instances with predicted latency exceeding this threshold are filtered out. If no instances meet the threshold, a scheduling error (`ErrorNoAvailableEndpoint`) is returned.
### Best Practices

- **Accurate profiling data**: Collect profiling data on the same hardware configuration and engine launch parameters as production. Inaccurate profiling data leads to poor latency predictions.
- **Conservative thresholds**: Start with `--ttft-slo-dispatch-threshold` and `--tpot-slo-dispatch-threshold` values slightly below 1.0 to allow some margin for prediction errors.
- **Monitor actual latencies**: Compare predicted latencies with actual observed latencies and adjust profiling data if systematic bias is detected.
### Deployment Example

For a complete Kubernetes deployment example with SLO-aware scheduling, see `deploy/slo-aware/base/`.
## Adaptive PD Scheduling

### Configuration
| Flag | Setting | Description |
|---|---|---|
|  |  | Requires full-mode scheduling. |
|  |  | Must be set to |
|  |  | Enable adaptive PD scheduling. |
|  |  | TPOT SLO target (ms). |
|  |  | Fraction of TPOT SLO used as the dispatch filter threshold. |
|  |  | Enable colocated reschedule mode (standalone rescheduling is also supported). |
|  |  | Reschedule interval. |
|  |  | Reschedule policies. |
|  |  | Fraction of TPOT SLO above which overload rescheduling triggers. |
|  |  | Fraction of TPOT SLO below which underload rescheduling triggers. |
|  |  | Enable local accounting of instance status. |
### Deployment Example

For a complete Kubernetes deployment example with adaptive PD scheduling, see `deploy/slo-aware/adaptive-pd/`.
## Rescheduler

### Configuration

#### Core Rescheduling Flags
| Flag | Default | Description |
|---|---|---|
|  |  | Enable rescheduling |
|  |  | Comma-separated list of rescheduling policies |
|  |  | Interval between rescheduling iterations |
|  |  | Run rescheduler inside scheduler process |
|  |  | Run rescheduler as separate process |
#### Load Balance Configuration

| Flag | Default | Description |
|---|---|---|
|  |  | Load metric for decode instances |
|  |  | Threshold for source/destination filtering: instances >= this value are migration sources, instances < this value are migration destinations |
|  |  | Load metric for neutral instances |
|  |  | Threshold for source/destination filtering: instances >= this value are migration sources, instances < this value are migration destinations |
|  |  | Minimum load difference required to trigger migration |
|  |  | Balancing scope: |
#### Adaptive PD Configuration

| Flag | Default | Description |
|---|---|---|
|  |  | Enable adaptive PD scheduling |
|  |  | Must be set to |
|  |  | TPOT SLO target (ms) |
|  |  | Fraction of TPOT SLO used as dispatch/destination filter threshold |
|  |  | Fraction of TPOT SLO above which mitigating rescheduling triggers (source filter) |
|  |  | Fraction of TPOT SLO below which consolidating rescheduling triggers (source filter) |
|  |  | Rescheduling policies for adaptive PD |
|  |  | Interval between rescheduling iterations (use |
Note: For details, see Adaptive PD Scheduling.
#### Failover Configuration

| Flag | Default | Description |
|---|---|---|
|  |  | Failure domain: |
|  |  | Time after which an instance is considered stale |
#### Migration Request Configuration

| Flag | Default | Description |
|---|---|---|
|  |  | Migration request selection rule: |
|  |  | Migration request selection order: |
|  |  | Number of requests/tokens or KV cache ratio to migrate |
#### gRPC Configuration

| Flag | Default | Description |
|---|---|---|
|  |  | Size of gRPC connection pool per instance |
|  |  | Timeout for gRPC migration calls |
### Deployment Modes

| Mode | Flag | Process |
|---|---|---|
| Colocated |  | Inside scheduler process |
| Standalone |  | Separate process |