Gateway Configuration Guide
Overview
For detailed design information about the Gateway, please refer to the Gateway Architecture Design Document.
PDD Forwarding Protocol
Configuration
Key configuration flags (cmd/config/config.go):
| Flag | Default | Description |
|---|---|---|
|  |  | PDD protocol type: vllm-kvt or vllm-mooncake |
|  |  | Enable staged scheduling mode; batched scheduling mode is used when false |
Deployment Example
For a complete Kubernetes deployment example with the vllm-mooncake PDD protocol, see
deploy/pd/full-mode-scheduling/load-balance. For the vllm-kvt PDD protocol, see
deploy/pd-kvs/full-mode-scheduling/load-balance.
Traffic Splitting
Configuration
| Flag | Default | Description |
|---|---|---|
|  |  | Routing policy: |
| --route-config |  | JSON array of route endpoint configurations |
|  |  | Max retries for internal routing on retryable errors before triggering fallback |
|  |  | Enable retry queue for 429 responses from fallback endpoints |
|  |  | Max queued 429-retry tasks |
|  |  | Concurrent goroutines processing 429-retry tasks |
|  |  | Max 429 retries per request |
|  |  | Initial backoff delay (ms) for 429 retries |
|  |  | Max backoff delay (ms) for 429 retries |
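The initial and maximum backoff delays listed above bound how long a 429 retry waits. The Go sketch below shows one common way such bounds are applied: the delay doubles on each attempt and is capped at the maximum. The doubling schedule, the function name backoffDelayMs, and the sample values are assumptions for illustration; the gateway's actual schedule is not documented here.

```go
package main

import "fmt"

// backoffDelayMs returns the wait (in ms) before retry number `attempt`
// (0-based). It starts at initialMs, doubles per attempt, and never
// exceeds maxMs. Doubling is an assumed schedule, not the gateway's
// confirmed behavior.
func backoffDelayMs(attempt, initialMs, maxMs int) int {
	if initialMs >= maxMs {
		return maxMs
	}
	d := initialMs
	for i := 0; i < attempt; i++ {
		// Stop before doubling would pass the cap (also avoids overflow).
		if d > maxMs/2 {
			return maxMs
		}
		d *= 2
	}
	return d
}

func main() {
	// Hypothetical values: 100 ms initial delay, 1000 ms cap.
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Printf("attempt %d: wait %d ms\n", attempt, backoffDelayMs(attempt, 100, 1000))
	}
}
```

With these sample values the delay grows from 100 ms until it reaches the 1000 ms cap and then stays there for all later attempts.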
Route Config JSON Format
The --route-config flag accepts a JSON array. Each element describes one endpoint:
| Field | Type | Description |
|---|---|---|
| base_url | string | Endpoint URL. Set to |
|  | string | API key for authentication. The gateway sets the |
|  | string | Reserved. Carried in the route config but not used to modify the proxied request; the original request model is forwarded as-is |
| fallback | bool | Whether this endpoint participates in the fallback chain |
| weight | int | Weight for weight-based routing |
| prefix | string | Model name prefix pattern for prefix-based routing (e.g. |
Weight-based example:

    [
      {
        "base_url": "local",
        "weight": 50
      },
      {
        "base_url": "http://vllm-external:8000",
        "weight": 50,
        "fallback": true
      }
    ]
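One way to read the weight field is as a share of traffic: with weights 50 and 50, each endpoint would receive roughly half of the requests. The Go sketch below illustrates that reading with a cumulative-weight lookup. The Route type and the pickByWeight and totalWeight functions are hypothetical names, and the gateway's actual selection algorithm is not described in this guide.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Route holds the fields used by the weight-based example.
// The type itself is illustrative, not the gateway's own.
type Route struct {
	BaseURL  string
	Weight   int
	Fallback bool
}

// totalWeight sums the positive weights of all routes.
func totalWeight(routes []Route) int {
	total := 0
	for _, rt := range routes {
		if rt.Weight > 0 {
			total += rt.Weight
		}
	}
	return total
}

// pickByWeight maps r, a number in [0, totalWeight), onto a route so that
// each route owns a slice of the range proportional to its weight.
func pickByWeight(routes []Route, r int) (Route, bool) {
	if r < 0 {
		return Route{}, false
	}
	for _, rt := range routes {
		if rt.Weight <= 0 {
			continue
		}
		if r < rt.Weight {
			return rt, true
		}
		r -= rt.Weight
	}
	return Route{}, false
}

func main() {
	routes := []Route{
		{BaseURL: "local", Weight: 50},
		{BaseURL: "http://vllm-external:8000", Weight: 50, Fallback: true},
	}
	total := totalWeight(routes)
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		if rt, ok := pickByWeight(routes, rand.Intn(total)); ok {
			counts[rt.BaseURL]++
		}
	}
	fmt.Println(counts) // roughly even split for equal weights
}
```

Separating the random draw from the lookup keeps pickByWeight deterministic, which makes the mapping easy to check by hand.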
Prefix-based example:

    [
      {
        "prefix": "Qwen/Qwen3-*",
        "base_url": "local"
      },
      {
        "prefix": "Qwen/Qwen2.5-*",
        "base_url": "http://vllm-external:8000",
        "fallback": true
      }
    ]
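To make the prefix example concrete, the Go sketch below parses a route config of this shape and looks up the endpoint for a request model. It assumes that a trailing `*` acts as a wildcard, that a pattern without one must match exactly, and that the first matching entry wins; none of these rules is confirmed by this guide. PrefixRoute, parseRoutes, and matchPrefix are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// PrefixRoute holds the fields used by the prefix-based example.
type PrefixRoute struct {
	Prefix   string `json:"prefix"`
	BaseURL  string `json:"base_url"`
	Fallback bool   `json:"fallback"`
}

// parseRoutes decodes a JSON array of route entries.
func parseRoutes(raw string) ([]PrefixRoute, error) {
	var routes []PrefixRoute
	if err := json.Unmarshal([]byte(raw), &routes); err != nil {
		return nil, err
	}
	return routes, nil
}

// matchPrefix returns the first route whose pattern matches model.
// Assumed semantics: "abc*" matches any model starting with "abc";
// a pattern without "*" must equal the model name.
func matchPrefix(routes []PrefixRoute, model string) (PrefixRoute, bool) {
	for _, rt := range routes {
		if strings.HasSuffix(rt.Prefix, "*") {
			if strings.HasPrefix(model, strings.TrimSuffix(rt.Prefix, "*")) {
				return rt, true
			}
		} else if rt.Prefix != "" && rt.Prefix == model {
			return rt, true
		}
	}
	return PrefixRoute{}, false
}

func main() {
	raw := `[
	  {"prefix": "Qwen/Qwen3-*", "base_url": "local"},
	  {"prefix": "Qwen/Qwen2.5-*", "base_url": "http://vllm-external:8000", "fallback": true}
	]`
	routes, err := parseRoutes(raw)
	if err != nil {
		fmt.Println("invalid route config:", err)
		return
	}
	for _, model := range []string{"Qwen/Qwen3-8B", "Qwen/Qwen2.5-7B-Instruct", "other/model"} {
		if rt, ok := matchPrefix(routes, model); ok {
			fmt.Printf("%s -> %s\n", model, rt.BaseURL)
		} else {
			fmt.Printf("%s -> no matching route\n", model)
		}
	}
}
```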
Deployment Example
For complete Kubernetes deployment examples with service routing, see:
- Prefix-based routing: deploy/service-router/prefix/
- Weight-based routing: deploy/service-router/weight/
Both examples deploy an internal vLLM instance (Llumnix-managed with service discovery) and an external standalone vLLM instance, with the gateway configured to route between them and fall back to the external service on internal failure.
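The fallback behavior described above can be sketched as a small control loop in Go: retry the internal endpoint a bounded number of times on retryable errors, then walk the endpoints marked as fallback. This is a simplified illustration under stated assumptions, not the gateway's implementation. In particular, routeWithFallback, sendFunc, and errRetryable are hypothetical names, maxRetries is treated as the number of retries after the first attempt, and a non-retryable internal error is assumed to go straight to fallback.

```go
package main

import (
	"errors"
	"fmt"
)

// errRetryable marks errors that justify retrying the internal endpoint.
var errRetryable = errors.New("retryable upstream error")

// sendFunc stands in for forwarding one request to one endpoint.
type sendFunc func(baseURL string) error

// routeWithFallback tries the internal endpoint up to 1+maxRetries times
// while the error is retryable, then tries each fallback endpoint in order.
// It returns the endpoint that served the request.
func routeWithFallback(internal string, fallbacks []string, maxRetries int, send sendFunc) (string, error) {
	if maxRetries < 0 {
		maxRetries = 0
	}
	var lastErr error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		lastErr = send(internal)
		if lastErr == nil {
			return internal, nil
		}
		if !errors.Is(lastErr, errRetryable) {
			break // assumed: non-retryable errors skip further internal retries
		}
	}
	for _, fb := range fallbacks {
		if lastErr = send(fb); lastErr == nil {
			return fb, nil
		}
	}
	return "", lastErr
}

func main() {
	// Simulated transport: the internal endpoint always fails.
	send := func(baseURL string) error {
		if baseURL == "local" {
			return errRetryable
		}
		return nil
	}
	served, err := routeWithFallback("local", []string{"http://vllm-external:8000"}, 2, send)
	fmt.Println(served, err)
}
```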