Observability

Observability#

Llumnix provides built-in observability at four levels: request performance, component diagnostics, fine-grained instance state, and engine-native metrics. Each component exposes a /metrics HTTP endpoint. Prometheus Operator CRDs (ServiceMonitor / PodMonitor) scrape these endpoints, and pre-built Grafana dashboards in deploy/observability/ visualize the collected data.

Monitoring Setup#

deploy/base/monitoring.yaml defines Prometheus Operator resources to scrape metrics from Llumnix components:

ServiceMonitor (llumnix-control-plane): scrapes Gateway (port 8089) and Scheduler (port 8088) via /metrics every 10s.
PodMonitor (llumnix-engine-neutral / llumnix-engine-prefill / llumnix-engine-decode): scrapes engine pods managed by LeaderWorkerSet (neutral, prefill, decode) via /metrics every 10s, with relabeling to extract infer_type and model labels.

monitoring.yaml is included in deploy/base/kustomization.yaml. All deployment configurations under deploy/ reference this base and inherit monitoring automatically. Note that the engine PodMonitors match only LeaderWorkerSet-based pods; deployment examples using plain Deployments (e.g., traffic-mirror/, traffic-splitting/) are not covered by the engine PodMonitors — only Gateway and Scheduler metrics are collected for those examples.

Grafana Dashboards#

Pre-built Grafana dashboard JSON files are located in deploy/observability/.

Llumnix Request Dashboard#

llumnix-request-dashboard.json — end-user request-level metrics.

Panel	Key Metrics	Type	Description
Request Rate	`request_total`	Counter	Total inference requests processed, partitioned by status code
Request Rate by Status	`request_total`	Counter	Request rate broken down by HTTP status code
Request Retry & Fallback Rate	`request_retry_total`, `request_fallback_total`, `request_fallback_retry_success_total`	Counter	Retry count, fallback count, and successful retries after fallback rate-limit (429)
Input / Output Token Throughput	`request_input_tokens_total`, `request_output_tokens_total`	Counter	Cumulative input (prompt) and output (completion) token counts
Input / Output Token Distribution	`request_input_tokens`, `request_output_tokens`	Histogram	Per-request token count distribution
E2E Latency	`request_e2e_latency_seconds`	Histogram	End-to-end request latency in seconds
TTFT	`request_ttft_milliseconds`	Histogram	Time to first token in milliseconds
TPOT	`request_tpot_milliseconds`	Histogram	Time per output token: (E2E − TTFT) / (output_tokens − 1) in milliseconds
ITL	`request_itl_milliseconds`	Histogram	Inter-token latency between consecutive output tokens in milliseconds
Prefix Cache Hit Ratio	`request_prefix_cache_hit_percent`	Histogram	Prefix cache hit ratio on the selected instance per request (0–100%)
Max Prefix Cache Hit Ratio	`request_max_prefix_cache_hit_percent`	Histogram	Maximum prefix cache hit ratio across all instances per request (0–100%)

Llumnix Component Dashboard#

llumnix-component-dashboard.json — internal component-level metrics for Gateway, Scheduler, and system runtime.

Panel	Key Metrics	Type	Description
Queue Duration	`request_queue_duration_milliseconds`	Histogram	Request queue waiting duration in milliseconds
Preprocess Duration	`request_preprocess_duration_milliseconds`	Histogram	Request preprocessing duration in milliseconds
Schedule Duration	`request_schedule_duration_milliseconds`	Histogram	Request scheduling phase duration in milliseconds
Postprocess Duration	`request_postprocess_duration_milliseconds`	Histogram	Response postprocessing duration in milliseconds
Gateway Requests	`gateway_pending_requests`, `gateway_current_requests`	Gauge	Pending and total in-flight requests in the Gateway
Scheduling Events	`scheduler_scheduling_total`, `scheduler_scheduling_failed_total`	Counter	Total scheduling attempts and failures
Rescheduling Events	`scheduler_rescheduling_total`, `scheduler_rescheduling_failed_total`	Counter	Total rescheduling operations and failures
CMS Refresh Metadata Duration	`scheduler_cms_refresh_metadata_duration_milliseconds`	Histogram	CMS instance metadata refresh duration
CMS Refresh Status Duration	`scheduler_cms_refresh_status_duration_milliseconds`	Histogram	CMS instance status refresh duration
Full-Mode Schedule Duration	`request_full_mode_schedule_duration_milliseconds`	Histogram	Full-mode scheduling decision duration per request
Query Prefix Cache Hit Duration	`request_query_prefix_cache_hit_duration_milliseconds`	Histogram	Duration of querying KVS for prefix cache hit
Calc Prefix Cache Hit Duration	`request_calc_prefix_cache_hit_duration_milliseconds`	Histogram	Duration of calculating prefix cache hit length
Uptime / Goroutines / Go Memory	`uptime_seconds`, `go_goroutines`, `go_memstats_alloc_bytes`	Gauge	System runtime diagnostics

Llumnix CMS Dashboard#

llumnix-cms-dashboard.json — per-instance CMS status, split into Prefill and Decode sections. Each section includes:

Panel	Key Metrics	Type	Description
CMS Requests	`instance_cms_running_requests`, `instance_cms_waiting_requests`, `instance_cms_loading_requests`, `instance_cms_scheduler_waiting_to_decode_requests`, `instance_cms_scheduler_running_to_decode_requests`, `instance_cms_hybrid_scheduler_waiting_to_decode_requests`	Gauge	Per-instance request counts by state
CMS Used Tokens	`instance_cms_used_gpu_tokens`	Gauge	GPU tokens currently used per instance
CMS Prefill Tokens	`instance_cms_uncomputed_tokens_all_waiting_prefills`, `instance_cms_uncomputed_tokens_scheduler_running_prefills`, `instance_cms_unallocated_tokens_scheduler_running_prefills`	Gauge	Uncomputed/unallocated tokens for prefill requests
CMS Decode Tokens	`instance_cms_unallocated_tokens_hybrid_scheduler_waiting_decodes`, `instance_cms_hybrid_scheduler_waiting_to_decode_tokens`, `instance_cms_scheduler_waiting_to_decode_tokens`, `instance_cms_scheduler_running_to_decode_tokens`, `instance_cms_tokens_loading_requests`	Gauge	Tokens for decode and loading requests
CMS Inflight Dispatch Requests	`instance_cms_inflight_dispatch_requests`, `instance_cms_inflight_dispatch_prefill_requests`, `instance_cms_inflight_dispatch_decode_requests`	Gauge	Inflight dispatch request counts
CMS Inflight Dispatch Tokens	`instance_cms_uncomputed_tokens_inflight_dispatch_prefill_requests`, `instance_cms_tokens_inflight_dispatch_decode_requests`	Gauge	Tokens for inflight dispatch requests
CMS KV Cache Usage Ratio	`instance_cms_kv_cache_usage_ratio_projected`	Gauge	Projected KV cache usage ratio per instance
CMS Decode Batch Size	`instance_cms_decode_batch_size`	Gauge	Decode batch size per instance
CMS All Prefill/Decode Tokens	`instance_cms_all_prefills_tokens_num`, `instance_cms_all_decodes_tokens_num`	Gauge	Total tokens for all prefill/decode requests per instance

The dashboard also includes a Selected Instance Scheduling Metrics section showing the scheduling decision context for the chosen instance:

Panel	Key Metrics	Type	Description
Selected Instance KV Cache Usage Ratio	`selected_instance_kv_cache_usage_ratio_projected`	Histogram	Projected KV cache usage ratio on the selected instance
Selected Instance Decode Batch Size	`selected_instance_decode_batch_size`	Histogram	Decode batch size on the selected instance
Selected Instance Prefill/Decode Tokens	`selected_instance_all_prefills_tokens_num`, `selected_instance_all_decodes_tokens_num`	Histogram	Total prefill/decode tokens on the selected instance
Selected Instance Predicted TTFT	`selected_instance_predicted_ttft`	Histogram	Predicted TTFT for the selected instance in milliseconds
Selected Instance Predicted TPOT	`selected_instance_predicted_tpot`	Histogram	Predicted TPOT for the selected instance in milliseconds

Llumnix LRS Dashboard#

llumnix-lrs-dashboard.json — per-instance Local Real-time State (LRS), split into Prefill and Decode sections.

Panel	Key Metrics	Type	Description
LRS Requests	`instance_lrs_running_requests`, `instance_lrs_waiting_requests`, `instance_lrs_total_requests`	Gauge	Running, waiting, and total requests on a backend endpoint
LRS Tokens	`instance_lrs_running_tokens`, `instance_lrs_waiting_tokens`, `instance_lrs_total_tokens`	Gauge	Running, waiting, and total token counts on a backend endpoint

vLLM Dashboard#

vllm-dashboard.json — engine-native vLLM metrics.

Panel	Key Metrics	Type	Description
E2E Request Latency	`vllm:e2e_request_latency_seconds`	Histogram	End-to-end request latency from vLLM
Time To First Token Latency	`vllm:time_to_first_token_seconds`	Histogram	TTFT from vLLM
Inter-Token Latency	`vllm:inter_token_latency_seconds`	Histogram	ITL from vLLM
Scheduler State	`vllm:num_requests_running`, `vllm:num_requests_waiting`	Gauge	vLLM internal scheduler state (running/waiting)
Cache Utilization	`vllm:kv_cache_usage_perc`	Gauge	KV cache utilization ratio
Token Throughput	`vllm:prompt_tokens_total`, `vllm:generation_tokens_total`	Counter	Token generation throughput
Finish Reason	`vllm:request_success_total`	Counter	Request completion reason distribution
Queue Time	`vllm:request_queue_time_seconds`	Histogram	Time spent in vLLM queue
Prefill and Decode Time	`vllm:request_prefill_time_seconds`, `vllm:request_decode_time_seconds`	Histogram	Per-request prefill and decode time
Request Prompt/Generation Length	`vllm:request_prompt_tokens`, `vllm:request_generation_tokens`	Histogram	Prompt and generation length distributions (heatmap)