# SLO-aware Scheduling
SLO-aware (Service Level Objective) scheduling is a latency-aware scheduling policy that routes each request to the instance predicted to deliver the lowest latency, keeping requests within their TTFT (Time-To-First-Token) and TPOT (Time-Per-Output-Token) SLOs.
## Overview
The SLO policy leverages latency prediction to make informed scheduling decisions. Unlike the load-balance policy, which minimizes load, the SLO policy minimizes predicted latency, making it suitable for latency-sensitive workloads with strict SLO requirements.
Key characteristics:

- **Full-mode only**: Requires `--enable-full-mode-scheduling=true` and CMS for accurate instance state.
- **Profiling-based prediction**: Uses pre-collected profiling data to predict TTFT and TPOT.
- **SLO-aware filtering**: Filters out instances predicted to exceed SLO thresholds.
- **Adaptive PD integration (optional)**: Supports adaptive prefill-decode disaggregation when enabled. Refer to Adaptive PD for more details.
## Generating Profiling Data

Profiling data should be collected from benchmark runs on your target hardware:

1. Run benchmarks with various batch sizes and token lengths.
2. Collect TTFT and TPOT distributions.
3. Compute p50 values for each configuration point.
4. Format the results as JSON according to the profiling data schema.
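The steps above can be sketched end to end. This is an illustrative script, not part of the scheduler: the raw `samples` data and the `percentile` helper are hypothetical, and nearest-rank is just one reasonable percentile definition.

```python
# Sketch: turn raw TTFT benchmark samples into the profiling-data schema.
# The sample data and helper names here are illustrative, not part of the scheduler.
import json
import statistics
from datetime import datetime, timezone

# Hypothetical raw measurements: prompt length -> TTFT samples in milliseconds.
samples = {
    128: [41.0, 42.0, 44.5, 58.0, 63.0],
    512: [88.0, 90.5, 95.0, 121.0, 130.0],
}

def percentile(values, p):
    """Nearest-rank percentile over a sorted copy of `values`."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, round(p / 100 * (len(ordered) - 1))))
    return ordered[idx]

results = [
    {
        "tokens_num": tokens,
        "mean": round(statistics.mean(vals), 1),
        "p50": percentile(vals, 50),
        "p95": percentile(vals, 95),
        "p99": percentile(vals, 99),
    }
    for tokens, vals in sorted(samples.items())
]

profile = {
    "metadata": {
        "model": "model-name",
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "description": "TTFT profiling results",
    },
    "results": results,
}
print(json.dumps(profile, indent=2))
```

The TPOT file is produced the same way, keyed by `batch_size` and `tokens_per_request` instead of `tokens_num`.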
## Profiling Data Format

Profiling data is provided as JSON files using the schemas below. Currently, only the p50 values are used.

**TTFT Profiling Data** (`--ttft-profiling-data-path`):
```json
{
  "metadata": {
    "model": "model-name",
    "timestamp": "2026-01-01T00:00:00Z",
    "description": "TTFT profiling results"
  },
  "results": [
    {
      "tokens_num": 128,
      "mean": 45.2,
      "p50": 42.0,
      "p95": 58.1,
      "p99": 62.3
    }
  ]
}
```
**TPOT Profiling Data** (`--tpot-profiling-data-path`):
```json
{
  "metadata": {
    "model": "model-name",
    "timestamp": "2026-01-01T00:00:00Z"
  },
  "results": [
    {
      "batch_size": 16,
      "tokens_per_request": 8,
      "mean": 12.5,
      "p50": 11.8,
      "p95": 15.2
    }
  ]
}
```
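As a minimal sketch of consuming these files, the snippet below parses the TPOT schema into a lookup keyed by `(batch_size, tokens_per_request)`, keeping only p50 since only p50 values are currently used. The `load_tpot_profile` helper is illustrative, not the scheduler's actual loader.

```python
# Sketch: load TPOT profiling JSON (schema above) into a p50 lookup table.
import json

def load_tpot_profile(raw: str) -> dict[tuple[int, int], float]:
    """Index p50 TPOT values by (batch_size, tokens_per_request)."""
    doc = json.loads(raw)
    table = {}
    for point in doc["results"]:
        key = (point["batch_size"], point["tokens_per_request"])
        table[key] = point["p50"]
    return table

raw = """{
  "metadata": {"model": "model-name", "timestamp": "2026-01-01T00:00:00Z"},
  "results": [
    {"batch_size": 16, "tokens_per_request": 8, "mean": 12.5, "p50": 11.8, "p95": 15.2},
    {"batch_size": 32, "tokens_per_request": 8, "mean": 16.0, "p50": 15.1, "p95": 19.4}
  ]
}"""
table = load_tpot_profile(raw)
```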
## Latency Prediction

### LatencyPredictor

The `LatencyPredictor` (defined in `predict_utils.go`) uses interpolation-based prediction from the profiling data:

- **TTFT prediction**: Based on prefill tokens, decode batch size, and decode tokens. Uses chunked-prefill modeling when applicable.
- **TPOT prediction**: Based on decode batch size and decode tokens.
### Prediction Algorithm

The `InterpolationPredictor` (defined in `interpolation_predictor.go`) performs bilinear interpolation:

1. Finds the bounding box of profiling points around the target parameters.
2. Computes a weighted interpolation between the four corner points.
3. Returns the predicted latency value.
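The three steps can be illustrated with a small standalone sketch. The grid values are made up, and the real implementation lives in `interpolation_predictor.go`; this only mirrors the bounding-box-plus-weighted-blend idea, with out-of-range inputs clamped to the grid edges (an assumption, since the document does not specify extrapolation behavior).

```python
# Sketch of the bilinear-interpolation step: given p50 latencies at the four
# profiling points bounding (batch_size, tokens), blend them by distance.
from bisect import bisect_right

# Illustrative profiling grid: (batch_size, tokens_per_request) -> p50 TPOT in ms.
grid = {
    (16, 8): 11.8, (16, 16): 13.0,
    (32, 8): 15.1, (32, 16): 17.2,
}
batch_axis = [16, 32]
token_axis = [8, 16]

def bounding(axis, x):
    """Return (lo, hi) grid coordinates around x, with x clamped to the axis."""
    x = min(max(x, axis[0]), axis[-1])
    i = min(bisect_right(axis, x), len(axis) - 1)
    return axis[i - 1], axis[i], x

def predict_tpot(batch, tokens):
    b0, b1, batch = bounding(batch_axis, batch)
    t0, t1, tokens = bounding(token_axis, tokens)
    wb = (batch - b0) / (b1 - b0)   # weight toward the upper batch corner
    wt = (tokens - t0) / (t1 - t0)  # weight toward the upper token corner
    return ((1 - wb) * (1 - wt) * grid[(b0, t0)] +
            wb * (1 - wt) * grid[(b1, t0)] +
            (1 - wb) * wt * grid[(b0, t1)] +
            wb * wt * grid[(b1, t1)])
```

For example, `predict_tpot(24, 8)` falls halfway between the batch-16 and batch-32 rows, so it blends their p50 values equally.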
## Scheduling Pipeline

### Prefill Stage

Metrics:

- `predicted_ttft`: Predicted time-to-first-token latency.

Filters:

- `failoverFilter` (global): Blocks instances in failure domains with unhealthy instances.
- `schedulabilityFilter` (single-instance): Blocks unschedulable instances.
- `stalenessFilter` (single-instance): Blocks instances with stale status data.
- `metricBasedFilter` (single-instance): Blocks instances where `predicted_ttft > TtftSlo * TtftSloDispatchThreshold`.

Selector:

- `metricBasedSelector`: Selects the instance with the lowest `predicted_ttft`.
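A minimal sketch of this filter-then-select step, assuming predicted TTFT values are already computed per instance (the `pick_instance` helper and instance names are hypothetical; the decode stage works the same way with `predicted_tpot`):

```python
# Sketch of the prefill dispatch decision: drop instances whose predicted TTFT
# exceeds TtftSlo * TtftSloDispatchThreshold, then pick the lowest prediction.
# Instance data and names are illustrative.

def pick_instance(predicted_ttft_ms, ttft_slo_ms, dispatch_threshold):
    """predicted_ttft_ms: mapping of instance name -> predicted TTFT (ms)."""
    limit = ttft_slo_ms * dispatch_threshold
    candidates = {name: t for name, t in predicted_ttft_ms.items() if t <= limit}
    if not candidates:
        # Mirrors the ErrorNoAvailableEndpoint scheduling error.
        raise RuntimeError("no available endpoint under the TTFT threshold")
    return min(candidates, key=candidates.get)

preds = {"inst-a": 150.0, "inst-b": 95.0, "inst-c": 210.0}
best = pick_instance(preds, ttft_slo_ms=200.0, dispatch_threshold=0.9)  # limit = 180 ms
```

Here `inst-c` is filtered out (210 ms > 180 ms) and `inst-b` wins with the lowest prediction.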
### Decode Stage

Metrics:

- `predicted_tpot`: Predicted time-per-output-token latency.

Filters:

- `failoverFilter` (global): Blocks instances in failure domains.
- `schedulabilityFilter` (single-instance): Blocks unschedulable instances.
- `stalenessFilter` (single-instance): Blocks instances with stale status data.
- `metricBasedFilter` (single-instance): Blocks instances where `predicted_tpot > TpotSlo * TpotSloDispatchThreshold`.

Selector:

- `metricBasedSelector`: Selects the instance with the lowest `predicted_tpot`.
## Policy Configuration

| Flag | Setting | Description |
|---|---|---|
| `--enable-full-mode-scheduling` | `true` | Must be `true`; the SLO policy requires full-mode scheduling |
| *(scheduler policy flag)* | *(SLO policy)* | Set to the SLO policy to enable it |
| `--ttft-profiling-data-path` | (required) | Path to TTFT profiling JSON file |
| `--tpot-profiling-data-path` | (required) | Path to TPOT profiling JSON file |
| `--ttft-slo` | | Target TTFT SLO in milliseconds |
| `--tpot-slo` | | Target TPOT SLO in milliseconds |
| `--ttft-slo-dispatch-threshold` | | Multiplier for the TTFT dispatch threshold |
| `--tpot-slo-dispatch-threshold` | | Multiplier for the TPOT dispatch threshold |
The effective dispatch threshold is computed as `SLO * DispatchThreshold`. Instances whose predicted latency exceeds this threshold are filtered out. If no instance meets the threshold, a scheduling error (`ErrorNoAvailableEndpoint`) is returned.
## Best Practices

- **Accurate profiling data**: Collect profiling data with the same hardware configuration and engine launch parameters as production. Inaccurate profiling data leads to poor latency predictions.
- **Conservative thresholds**: Start with `--ttft-slo-dispatch-threshold` and `--tpot-slo-dispatch-threshold` values slightly below 1.0 to allow some margin for prediction errors.
- **Monitor actual latencies**: Compare predicted latencies with actual observed latencies, and adjust the profiling data if systematic bias is detected.