Traffic Splitting#
Introduction#
In production LLM serving, backend instances may become unavailable due to crashes, overload, or rate limiting. Without a traffic management layer, such failures propagate directly to users. The service router addresses this by splitting request traffic across internal (Llumnix-managed) and external (e.g. third-party API) endpoints, and falling back to alternative endpoints when the primary route fails.
The service router supports two routing policies — weight-based and prefix-based — and a fallback chain that activates on scheduling failures, HTTP errors, or unmatched routes.
Design and implementation#
```mermaid
graph TB
    Request([Request])
    subgraph Gateway
        Router["Service Router"]
    end
    Router -->|RouteInternal| Scheduler["Scheduler"]
    Router -->|RouteExternal| ExternalEP["External Endpoints"]
    Router -->|RouteUnknown| Fallback["Fallback Chain"]
    Scheduler --> Internal["Internal Instances<br/>(Llumnix-managed)"]
    Scheduler -->|scheduling failure| Fallback
    Internal -->|HTTP error after retries| Fallback
    ExternalEP -->|HTTP error| Fallback
    Fallback --> FB1["Fallback Endpoints"]
    Request --> Router
```
Routing policies#
The ServiceRouter (pkg/gateway/router/service_router.go) evaluates each incoming request and produces one of three route types:
RouteInternal: dispatch to internal Llumnix-managed instances via the scheduler.
RouteExternal: proxy directly to an external endpoint (e.g. a third-party model API).
RouteUnknown: no matching route found; proceed to the fallback chain.
Two mutually exclusive policies determine how the route is selected:
Weight-based routing (--route-policy weight): each configured endpoint carries an integer weight. The router draws a random number in the range of the total weight and selects the endpoint whose weight interval contains it. Traffic is therefore distributed in proportion to the weights; for example, weights 50/50 yield an approximately even split.
Prefix-based routing (--route-policy prefix): each endpoint declares a model name prefix pattern (e.g. Qwen/Qwen3-*). The router matches the request’s model name against all patterns and selects the endpoint with the longest matching prefix. An exact match (a pattern with no wildcard) takes priority over any prefix match. The catch-all pattern * matches any model name but has the shortest possible prefix, so it loses to every more specific match and serves as the default route.
For both policies, an endpoint with base_url set to "local" is treated as the internal route.
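The weight-based selection described above can be sketched as a cumulative-weight lookup. This is a minimal illustration, not the code in service_router.go: the Endpoint struct, pickByWeight, and totalWeight are hypothetical names, and the random draw is passed in as a parameter so the selection itself stays deterministic.

```go
package main

import (
	"fmt"
	"math/rand"
)

// RouteType mirrors the three route types named in the text.
type RouteType int

const (
	RouteUnknown RouteType = iota
	RouteInternal
	RouteExternal
)

// Endpoint is a hypothetical, trimmed-down view of one route-config entry.
type Endpoint struct {
	BaseURL string
	Weight  int
}

// totalWeight sums the positive weights of all endpoints.
func totalWeight(eps []Endpoint) int {
	total := 0
	for _, ep := range eps {
		if ep.Weight > 0 {
			total += ep.Weight
		}
	}
	return total
}

// pickByWeight maps a draw in [0, totalWeight) to the endpoint whose
// cumulative weight interval contains it. An endpoint whose base URL is
// "local" yields RouteInternal; any other endpoint yields RouteExternal.
// A draw outside the range (or an empty config) yields RouteUnknown.
func pickByWeight(eps []Endpoint, draw int) (RouteType, Endpoint) {
	cumulative := 0
	for _, ep := range eps {
		if ep.Weight <= 0 {
			continue
		}
		cumulative += ep.Weight
		if draw < cumulative {
			if ep.BaseURL == "local" {
				return RouteInternal, ep
			}
			return RouteExternal, ep
		}
	}
	return RouteUnknown, Endpoint{}
}

func main() {
	eps := []Endpoint{{"local", 50}, {"http://vllm-external:8000", 50}}
	total := totalWeight(eps)
	counts := map[RouteType]int{}
	for i := 0; i < 10000; i++ {
		rt, _ := pickByWeight(eps, rand.Intn(total))
		counts[rt]++
	}
	fmt.Println("internal:", counts[RouteInternal], "external:", counts[RouteExternal])
}
```

With equal weights, the two counters should land close to each other over many draws, which is the "approximately even split" mentioned above.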
Fallback mechanism#
When the primary route fails, the gateway attempts fallback endpoints in the order they appear in --route-config. Only external endpoints with "fallback": true participate in the fallback chain.
Fallback triggers in three scenarios:
Scheduling failure: the scheduler cannot find an available internal instance (e.g. all instances are overloaded or unhealthy).
Internal HTTP error after retries: the request was dispatched to an internal instance, encountered a retryable error (5xx, network error), exhausted all retry attempts (--retry-max-count), and the response headers have not yet been sent to the client.
External HTTP error: the primary external endpoint returned a 4xx/5xx error.
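The three trigger conditions can be condensed into a single predicate. The sketch below is an illustration under assumptions: the Outcome struct and shouldFallback are hypothetical names, and "retries exhausted" is modeled as the number of retries used reaching the --retry-max-count value.

```go
package main

import "fmt"

// Outcome is a hypothetical summary of how the primary route ended.
type Outcome struct {
	Internal         bool // true for the internal route, false for an external one
	SchedulingFailed bool // the scheduler found no available instance
	NetworkError     bool // the request failed without an HTTP response
	StatusCode       int  // HTTP status of the last attempt, 0 if none
	RetriesUsed      int  // retries already spent on the internal route
	HeadersSent      bool // response headers already written to the client
}

// shouldFallback reports whether the outcome matches one of the three
// fallback triggers listed in the text.
func shouldFallback(o Outcome, retryMaxCount int) bool {
	if !o.Internal {
		// External route: any 4xx/5xx response triggers fallback.
		return o.StatusCode >= 400 && o.StatusCode <= 599
	}
	if o.SchedulingFailed {
		return true
	}
	// Internal route: only retryable errors count, and only once retries
	// are exhausted and nothing has been sent to the client yet.
	retryable := o.NetworkError || (o.StatusCode >= 500 && o.StatusCode <= 599)
	return retryable && o.RetriesUsed >= retryMaxCount && !o.HeadersSent
}

func main() {
	cases := []Outcome{
		{Internal: true, SchedulingFailed: true},
		{Internal: true, StatusCode: 503, RetriesUsed: 3},
		{Internal: true, StatusCode: 503, RetriesUsed: 3, HeadersSent: true},
		{Internal: false, StatusCode: 429},
	}
	for _, c := range cases {
		fmt.Printf("%+v -> fallback=%v\n", c, shouldFallback(c, 3))
	}
}
```

The HeadersSent check reflects the constraint that a response already partially delivered to the client cannot be transparently replaced by one from another endpoint.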
For each fallback attempt, the gateway proxies the request to the next fallback endpoint. If that endpoint also fails, the gateway moves to the next one in the chain until all fallback endpoints are exhausted.
When a fallback endpoint returns HTTP 429 (Too Many Requests), the gateway can optionally enqueue the request into a rate-limit retry queue (--fallback-retry-queue-enabled) that retries with exponential backoff, avoiding immediate rejection.
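As an illustration of the exponential backoff used by the retry queue, the sketch below doubles the delay on every retry and caps it at a maximum. The function name backoffDelay and the 100 ms / 1 s values are assumptions chosen for the example; they are not the gateway's defaults, and the real implementation may add jitter or use a different growth factor.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns the wait before retry number `attempt` (0-based):
// the initial delay doubled once per previous attempt, capped at maxDelay.
func backoffDelay(attempt int, initial, maxDelay time.Duration) time.Duration {
	delay := initial
	for i := 0; i < attempt; i++ {
		delay *= 2
		if delay >= maxDelay {
			// Returning early also keeps the value from overflowing.
			return maxDelay
		}
	}
	if delay > maxDelay {
		return maxDelay
	}
	return delay
}

func main() {
	initial := 100 * time.Millisecond // assumed value, not a documented default
	maxDelay := time.Second           // assumed value, not a documented default
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Printf("retry %d waits %v\n", attempt, backoffDelay(attempt, initial, maxDelay))
	}
}
```

With these illustrative values the delays are 100ms, 200ms, 400ms, 800ms, and then 1s for every later retry.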
Request dispatch flow#
The gateway’s dispatchRequest method (pkg/gateway/service/gateway_service.go) orchestrates the full lifecycle:
ServiceRouter.Route() determines the route type and target endpoint.
Based on the route type:
RouteInternal: acquire an instance from the scheduler, then execute the request with retry logic. On exhausted retries, trigger fallback if available.
RouteExternal: proxy the request to the external endpoint. On failure, trigger fallback if available.
RouteUnknown: proceed directly to the fallback chain.
The fallback chain iterates through configured fallback endpoints sequentially, with exponential backoff between attempts.
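The flow above can be sketched as follows. This is a simplified model rather than the actual dispatchRequest method: the Endpoint struct, sendFunc, fallbackChain, and dispatch are hypothetical names, and scheduler acquisition, internal retries, and the backoff between fallback attempts are all folded into the send callback or omitted.

```go
package main

import (
	"errors"
	"fmt"
)

type RouteType int

const (
	RouteUnknown RouteType = iota
	RouteInternal
	RouteExternal
)

// Endpoint is a hypothetical, trimmed-down view of one route-config entry.
type Endpoint struct {
	BaseURL  string
	Fallback bool
}

// sendFunc stands in for "execute or proxy the request against this endpoint".
type sendFunc func(ep Endpoint) error

var errUnavailable = errors.New("endpoint unavailable")

// fallbackChain keeps external endpoints flagged as fallback, in config order.
func fallbackChain(eps []Endpoint) []Endpoint {
	var chain []Endpoint
	for _, ep := range eps {
		if ep.Fallback && ep.BaseURL != "local" {
			chain = append(chain, ep)
		}
	}
	return chain
}

// dispatch tries the primary route (unless the route is unknown) and then
// walks the fallback chain. It returns the base URL that served the request.
// The text does not say whether an endpoint that just failed as primary is
// skipped in the chain; this sketch does not skip it.
func dispatch(rt RouteType, primary Endpoint, eps []Endpoint, send sendFunc) (string, error) {
	if rt != RouteUnknown {
		if err := send(primary); err == nil {
			return primary.BaseURL, nil
		}
	}
	for _, ep := range fallbackChain(eps) {
		if err := send(ep); err == nil {
			return ep.BaseURL, nil
		}
	}
	return "", errors.New("primary route and all fallback endpoints failed")
}

func main() {
	eps := []Endpoint{
		{BaseURL: "local"},
		{BaseURL: "http://vllm-external:8000", Fallback: true},
	}
	// Simulate an internal route whose retries are exhausted.
	send := func(ep Endpoint) error {
		if ep.BaseURL == "local" {
			return errUnavailable
		}
		return nil
	}
	served, err := dispatch(RouteInternal, eps[0], eps, send)
	fmt.Println("served by:", served, "err:", err)
}
```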
Configuration#
Route configuration flags#
| Flag | Default | Description |
|---|---|---|
| --route-policy | | Routing policy: weight or prefix |
| --route-config | | JSON array of route endpoint configurations |
| --retry-max-count | | Max retries for internal routing on retryable errors before triggering fallback |
| --fallback-retry-queue-enabled | | Enable retry queue for 429 responses from fallback endpoints |
| | | Max queued 429-retry tasks |
| | | Concurrent goroutines processing 429-retry tasks |
| | | Max 429 retries per request |
| | | Initial backoff delay (ms) for 429 retries |
| | | Max backoff delay (ms) for 429 retries |
Route config JSON format#
The --route-config flag accepts a JSON array. Each element describes one endpoint:
| Field | Type | Description |
|---|---|---|
| base_url | string | Endpoint URL. Set to "local" for the internal route |
| | string | API key for authentication. The gateway sets the … |
| | string | Reserved. Carried in the route config but not used to modify the proxied request; the original request model is forwarded as-is |
| fallback | bool | Whether this endpoint participates in the fallback chain |
| weight | int | Weight for weight-based routing |
| prefix | string | Model name prefix pattern for prefix-based routing (e.g. Qwen/Qwen3-*) |
Weight-based example:
```json
[
  {"base_url": "local", "weight": 50},
  {"base_url": "http://vllm-external:8000", "weight": 50, "fallback": true}
]
```
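To show how such a configuration might be consumed, the sketch below decodes the JSON array and computes each endpoint's expected share of traffic under weight-based routing. The RouteEndpoint struct, parseRouteConfig, and shares are hypothetical names; only the JSON keys that appear in the examples (base_url, weight, fallback, prefix) are modeled.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RouteEndpoint is a hypothetical Go view of one element of --route-config.
type RouteEndpoint struct {
	BaseURL  string `json:"base_url"`
	Weight   int    `json:"weight"`
	Fallback bool   `json:"fallback"`
	Prefix   string `json:"prefix"`
}

// parseRouteConfig decodes the JSON array passed to --route-config.
func parseRouteConfig(raw string) ([]RouteEndpoint, error) {
	var eps []RouteEndpoint
	if err := json.Unmarshal([]byte(raw), &eps); err != nil {
		return nil, fmt.Errorf("parse route config: %w", err)
	}
	return eps, nil
}

// shares returns each endpoint's expected fraction of traffic, i.e. its
// weight divided by the sum of all weights (all zeros if the sum is zero).
func shares(eps []RouteEndpoint) []float64 {
	total := 0
	for _, ep := range eps {
		total += ep.Weight
	}
	out := make([]float64, len(eps))
	if total == 0 {
		return out
	}
	for i, ep := range eps {
		out[i] = float64(ep.Weight) / float64(total)
	}
	return out
}

func main() {
	raw := `[
	  {"base_url": "local", "weight": 50},
	  {"base_url": "http://vllm-external:8000", "weight": 50, "fallback": true}
	]`
	eps, err := parseRouteConfig(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, s := range shares(eps) {
		fmt.Printf("%s internal=%v fallback=%v share=%.2f\n",
			eps[i].BaseURL, eps[i].BaseURL == "local", eps[i].Fallback, s)
	}
}
```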
Prefix-based example:
```json
[
  {"prefix": "Qwen/Qwen3-*", "base_url": "local"},
  {"prefix": "Qwen/Qwen2.5-*", "base_url": "http://vllm-external:8000", "fallback": true}
]
```
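The prefix matching rules (exact match first, then longest prefix, with * as the default) can be sketched as below. This is an illustration under assumptions rather than the router's actual code: Endpoint and matchPrefix are hypothetical names, and the wildcard is assumed to appear only as the final character of a pattern.

```go
package main

import (
	"fmt"
	"strings"
)

// Endpoint is a hypothetical, trimmed-down view of one route-config entry.
type Endpoint struct {
	Prefix  string
	BaseURL string
}

// matchPrefix returns the index of the endpoint that should serve the model,
// or -1 when no pattern matches (which corresponds to RouteUnknown).
func matchPrefix(eps []Endpoint, model string) int {
	best, bestLen := -1, -1
	for i, ep := range eps {
		if !strings.HasSuffix(ep.Prefix, "*") {
			// A pattern without a wildcard only matches exactly,
			// and an exact match beats every prefix match.
			if ep.Prefix == model {
				return i
			}
			continue
		}
		p := strings.TrimSuffix(ep.Prefix, "*")
		// The catch-all "*" leaves an empty prefix of length 0, so it
		// is chosen only when nothing longer matches.
		if strings.HasPrefix(model, p) && len(p) > bestLen {
			best, bestLen = i, len(p)
		}
	}
	return best
}

func main() {
	eps := []Endpoint{
		{"Qwen/Qwen3-*", "local"},
		{"Qwen/Qwen2.5-*", "http://vllm-external:8000"},
	}
	for _, model := range []string{"Qwen/Qwen3-8B", "Qwen/Qwen2.5-7B", "other/model"} {
		if i := matchPrefix(eps, model); i >= 0 {
			fmt.Println(model, "->", eps[i].BaseURL)
		} else {
			fmt.Println(model, "-> no match (fallback chain)")
		}
	}
}
```

With the example configuration above, a model name that matches neither pattern produces no route, so the request goes directly to the fallback chain unless a catch-all * endpoint is added.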
Deployment example#
For complete Kubernetes deployment examples with service routing, see:
Prefix-based routing: deploy/service-router/prefix/
Weight-based routing: deploy/service-router/weight/
Both examples deploy an internal vLLM instance (Llumnix-managed with service discovery) and an external standalone vLLM instance, with the gateway configured to route between them and fall back to the external service on internal failure.