# Scheduler Configuration Guide

## Overview
For detailed design information about the Scheduler, please refer to the Scheduler Architecture Design Document.
## Policy Framework

### Usage and Extension Guidelines
**Choosing a policy and mode**

- SLO-restricted workloads with profiling data: SLO + full-mode.
- Utilization-focused workloads with cache locality needs: load-balance + full-mode + cache-aware scheduling.
- Simple deployments without CMS: load-balance + lite-mode, with `num_requests` or `num_tokens` as the load metric.
**Switching between full-mode and lite-mode**

- Flag: `--enable-full-mode-scheduling`.
- Full-mode (default): `--enable-full-mode-scheduling=true`.
- Lite-mode: `--enable-full-mode-scheduling=false`.
- Advanced scheduling features require full-mode.
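As a hedged illustration, the toggle looks like the following; the binary name and invocation style are assumptions, only the flag itself appears in this guide:

```shell
# Full-mode (the default): advanced scheduling features available
./scheduler --enable-full-mode-scheduling=true

# Lite-mode: basic load metrics only
./scheduler --enable-full-mode-scheduling=false
```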
**Extending with new policies**

- Reuse existing metrics, filters, and selectors; rewire combinations per infer type.
- For new signals: implement `instanceSchedulingMetric` in `metrics.go`, register in `getSchedulingMetric`, and configure in `schedule_policy_registry.go`.
## Cache-Aware Scheduling

### Prerequisites
Llumnix’s cache-aware scheduling requires the inference engine to use a global KV cache store (e.g. Mooncake). Llumnix queries the global KV cache store’s metadata service to obtain prefix cache hit information for each request, so this feature is only applicable when such a store is deployed.
Cache-aware scheduling is disabled by default. It is only supported in full-mode scheduling; the scheduler forcibly disables it when full-mode is not enabled.
### Configuration

Key configuration flags (`cmd/config/config.go`):
| Flag | Default | Description |
|---|---|---|
|  |  | Enable or disable cache-aware scheduling |
|  |  | Minimum prompt length to trigger cache-aware logic |
|  |  | KVS metadata service backend |
|  |  | Hash algorithm for prompt chunking |
|  |  | Token chunk size |
|  |  | Whether to hash the last incomplete chunk |
|  |  | Retry count for metadata service queries |
|  |  | Interval between retries |
|  |  | Duration to treat metadata service as down after failures |
Environment variables:
| Variable | Default | Description |
|---|---|---|
|  |  | Hash seed for algorithms that require one (e.g. |
### Deployment Example

For a complete Kubernetes deployment example with PD disaggregation, Mooncake as the KVS, and cache-aware scheduling enabled, see `deploy/pd-kvs/full-mode-scheduling/load-balance/`.
## Predictor-Enhanced Scheduling

### Configuration

#### Scheduler Flags
| Flag | Default | Description |
|---|---|---|
|  |  | Enable predictor-enhanced scheduling |
|  |  | Maximum tokens per prefill batch; must match the inference engine's `max_num_batched_tokens` |
|  |  | Minimum profiling samples before fitting the predictor model |
#### Engine Environment Variables

| Variable | Default | Description |
|---|---|---|
|  |  | Set to |
|  |  | Number of profiling samples to collect per instance |
### Constraints

- **Full-mode only**: Requires `--enable-full-mode-scheduling=true` and CMS. The scheduler forcibly disables this feature when full-mode is not enabled.
- **`max-num-batched-tokens` alignment**: The scheduler-side `--max-num-batched-tokens` must match the inference engine's `max_num_batched_tokens` configuration. A mismatch causes inaccurate step simulation and degrades prediction quality.
## SLO-Aware Scheduling

### Configuration
| Flag | Setting | Description |
|---|---|---|
|  |  | Must be |
|  |  | Set to |
|  | (required) | Path to TTFT profiling JSON file |
|  | (required) | Path to TPOT profiling JSON file |
|  |  | Target TTFT SLO in milliseconds |
|  |  | Target TPOT SLO in milliseconds |
|  |  | Multiplier for TTFT dispatch threshold |
|  |  | Multiplier for TPOT dispatch threshold |
The effective dispatch threshold is computed as `SLO * DispatchThreshold`. Instances with predicted latency exceeding this threshold are filtered out. If no instances meet the threshold, a scheduling error (`ErrorNoAvailableEndpoint`) is returned.
### Best Practices

- **Accurate profiling data**: Collect profiling data on the same hardware configuration and engine launch parameters as production. Inaccurate profiling data leads to poor latency predictions.
- **Conservative thresholds**: Start with `--ttft-slo-dispatch-threshold` and `--tpot-slo-dispatch-threshold` values slightly below 1.0 to allow some margin for prediction errors.
- **Monitor actual latencies**: Compare predicted latencies with actual observed latencies and adjust profiling data if systematic bias is detected.
### Deployment Example

For a complete Kubernetes deployment example with SLO-aware scheduling, see `deploy/slo-aware/base/`.
## Adaptive PD Scheduling

### Configuration
| Flag | Setting | Description |
|---|---|---|
|  |  | Requires full-mode scheduling. |
|  |  | Must be set to |
|  |  | Enable adaptive PD scheduling. |
|  |  | TPOT SLO target (ms). |
|  |  | Fraction of TPOT SLO used as the dispatch filter threshold. |
|  |  | Enable colocated reschedule mode (standalone rescheduling is also supported). |
|  |  | Reschedule interval. |
|  |  | Reschedule policies. |
|  |  | Fraction of TPOT SLO above which overload rescheduling triggers. |
|  |  | Fraction of TPOT SLO below which underload rescheduling triggers. |
|  |  | Enable local accounting of instance status. |
### Deployment Example

For a complete Kubernetes deployment example with adaptive PD scheduling, see `deploy/slo-aware/adaptive-pd/`.
## Rescheduler

### Configuration

#### Core Rescheduling Flags
| Flag | Default | Description |
|---|---|---|
|  |  | Enable rescheduling |
|  |  | Comma-separated list of rescheduling policies |
|  |  | Interval between rescheduling iterations |
|  |  | Run rescheduler inside scheduler process |
|  |  | Run rescheduler as separate process |
#### Load Balance Configuration

| Flag | Default | Description |
|---|---|---|
|  |  | Load metric for decode instances |
|  |  | Threshold for source/destination filtering: instances >= this value are migration sources, instances < this value are migration destinations |
|  |  | Load metric for neutral instances |
|  |  | Threshold for source/destination filtering: instances >= this value are migration sources, instances < this value are migration destinations |
|  |  | Minimum load difference required to trigger migration |
|  |  | Balancing scope: |
#### Adaptive PD Configuration

| Flag | Default | Description |
|---|---|---|
|  |  | Enable adaptive PD scheduling |
|  |  | Must be set to |
|  |  | TPOT SLO target (ms) |
|  |  | Fraction of TPOT SLO used as dispatch/destination filter threshold |
|  |  | Fraction of TPOT SLO above which mitigating rescheduling triggers (source filter) |
|  |  | Fraction of TPOT SLO below which consolidating rescheduling triggers (source filter) |
|  |  | Rescheduling policies for adaptive PD |
|  |  | Interval between rescheduling iterations (use |
Note: For details, see Adaptive PD Scheduling.
#### Failover Configuration

| Flag | Default | Description |
|---|---|---|
|  |  | Failure domain: |
|  |  | Time after which an instance is considered stale |
#### Migration Request Configuration

| Flag | Default | Description |
|---|---|---|
|  |  | Migration request selection rule: |
|  |  | Migration request selection order: |
|  |  | Number of requests/tokens or KV cache ratio to migrate |
#### gRPC Configuration

| Flag | Default | Description |
|---|---|---|
|  |  | Size of gRPC connection pool per instance |
|  |  | Timeout for gRPC migration calls |
### Deployment Modes

| Mode | Flag | Process |
|---|---|---|
| Colocated |  | Inside scheduler process |
| Standalone |  | Separate process |