Gateway
The SIE gateway is a stateless Rust service that sits between clients and GPU worker pods. It handles routing, queue submission, resource pools, SIE server sidecar health, read-side config, and scale-from-zero orchestration. In Kubernetes, each worker pod runs the SIE server sidecar beside the Python sie-server adapter process; the sidecar pulls queued work and calls the adapter over IPC.
The page keeps the /docs/engine/router/ URL for compatibility, but the deployed component is sie-gateway.
When to Use the Gateway
Not every deployment needs a gateway. The deciding factor is whether you are running an elastic worker fleet:
- Single server (local dev or Docker): point the SDK at a standalone sie-server.
- Kubernetes clusters: use the gateway. It provides a stable client endpoint, worker discovery, queue-based inference, scale-from-zero, resource pools, and config read endpoints.
- Horizontal gateway replicas: supported. Each replica keeps its own in-memory registry and converges through bootstrap, NATS config deltas, and epoch polling.
| Setup | Use Gateway? | Why |
|---|---|---|
| Single Docker container | No | One sie-server process handles the request path |
| Kubernetes | Yes | Required for worker discovery, queue routing, scale-from-zero, and pool isolation |
Architecture
The gateway is stateless with respect to durable data. It owns in-memory routing state, but it does not persist config and it does not execute inference.

    Client request
      -> sie-gateway resolves model, bundle, machine profile, and pool
      -> gateway publishes msgpack work items to NATS JetStream
      -> matching worker pod's SIE server sidecar pulls, batches, and calls the sie-server adapter over UDS IPC
      -> SIE server sidecar publishes msgpack results to the gateway's NATS Core inbox
      -> gateway assembles and returns the HTTP response

Config writes are outside this hot path. Admin tooling writes to sie-config, and gateways mirror that state through /v1/configs/export, NATS deltas, and /v1/configs/epoch polling.
Request Routing
The gateway resolves every inference request to:
- Model and profile: the model path and optional :profile suffix.
- Bundle: selected by adapter compatibility, with the lowest numeric bundle priority winning by default.
- Machine profile: X-SIE-MACHINE-PROFILE header or SDK gpu parameter.
- Pool: default pool or explicit X-SIE-Pool / SDK pool/profile target.
- Queue subject: sie.work.{model}.{pool} on the pool’s JetStream stream, consumed by the SIE server sidecar inside matching worker pods.
The Rust gateway is queue-only for inference. If the queue transport is unavailable, the gateway returns 503.
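The resolution steps above can be sketched in a few lines of Python. This is an illustration only, not the gateway's Rust implementation: the `Bundle`, `Route`, `select_bundle`, and `resolve_route` names, the `adapters` field, and the adapter names in the usage note are hypothetical. It also assumes the model path is placed in the subject unchanged, whereas the real gateway may normalise it first.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bundle:
    name: str
    priority: int         # lower number wins
    adapters: frozenset   # hypothetical: adapter names this bundle can serve


@dataclass(frozen=True)
class Route:
    model: str
    profile: Optional[str]
    bundle: str
    machine_profile: Optional[str]
    pool: str
    subject: str


def split_model(path):
    """Split 'org/name:profile' into (model, profile); the suffix is optional."""
    model, sep, profile = path.partition(":")
    return model, (profile if sep else None)


def select_bundle(bundles, adapter):
    """Pick the compatible bundle with the lowest numeric priority."""
    compatible = [b for b in bundles if adapter in b.adapters]
    if not compatible:
        raise LookupError(f"no bundle supports adapter {adapter!r}")
    return min(compatible, key=lambda b: b.priority)


def resolve_route(path, headers, bundles, adapter):
    model, profile = split_model(path)
    bundle = select_bundle(bundles, adapter)
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    machine_profile = lowered.get("x-sie-machine-profile")
    pool = lowered.get("x-sie-pool", "default")
    # Assumption: the model path is used verbatim as a subject token.
    subject = f"sie.work.{model}.{pool}"
    return Route(model, profile, bundle.name, machine_profile, pool, subject)
```

For example, a request for `BAAI/bge-m3:fast` with `X-SIE-Pool: tenant-abc` would resolve to the subject `sie.work.BAAI/bge-m3.tenant-abc` under these assumptions, while a request without pool or machine-profile headers falls back to the `default` pool.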
GPU Routing
Requests can specify a target machine profile:

    # HTTP
    curl -X POST http://gateway:8080/v1/encode/BAAI/bge-m3 \
      -H "X-SIE-MACHINE-PROFILE: l4" \
      -H "Content-Type: application/json" \
      -d '{"items": [{"text": "Hello world"}]}'

    # SDK (Python)
    result = client.encode("BAAI/bge-m3", Item(text="hello"), gpu="l4")

    // SDK (TypeScript)
    const result = await client.encode("BAAI/bge-m3", { text: "hello" }, { gpu: "l4" });

If the caller omits a machine profile, the gateway can use the default configured route. Scale-from-zero returns 202 when the selected (bundle, machine_profile) has no fresh SIE server sidecar health and the caller did not pin an explicit pool.
202 Scale-from-Zero
When no healthy SIE server sidecar has recently published health for the selected (bundle, machine_profile) tuple and the caller did not pin a specific pool, the gateway returns:

    HTTP/1.1 202 Accepted
    Retry-After: 120
    Content-Type: application/json

    {
      "status": "provisioning",
      "gpu": "l4",
      "bundle": "default",
      "estimated_wait_s": 180,
      "message": "No worker available for GPU type 'l4'. Provisioning in progress."
    }

The SDK handles this automatically with wait_for_capacity=True. See Scale-from-Zero for details.
202 is only for capacity provisioning. Unknown models fail fast with 404 once the gateway registry has bootstrapped. Incompatible explicit bundle choices fail with 409.
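For callers that do not use the SDK, the 202 flow can be handled with a small retry loop that honours `Retry-After`. The sketch below is not the SDK's `wait_for_capacity` implementation. The `call_with_capacity_wait` helper, its parameters, and the `(status, headers, body)` response shape are assumptions made for illustration, and the HTTP call itself is injected as `send` so the loop stays independent of any HTTP client.

```python
import time
from typing import Callable, Mapping, Tuple

# Hypothetical response shape: (status code, headers, decoded JSON body).
Response = Tuple[int, Mapping[str, str], dict]


def call_with_capacity_wait(
    send: Callable[[], Response],
    max_wait_s: float = 600.0,
    default_retry_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Repeat `send` while the gateway answers 202, up to `max_wait_s`."""
    waited = 0.0
    while True:
        status, headers, body = send()
        if status != 202:
            # 200, 404, 409, 503 and so on are returned to the caller as-is.
            return status, headers, body
        try:
            delay = float(headers.get("Retry-After", default_retry_s))
        except ValueError:
            # Retry-After may also be an HTTP-date; fall back to the default.
            delay = default_retry_s
        if waited + delay > max_wait_s:
            raise TimeoutError(f"no capacity after waiting {waited:.0f}s")
        sleep(delay)
        waited += delay
```

Only 202 is retried here. A 404 (unknown model) or 409 (incompatible bundle) is returned immediately, because waiting will not change those outcomes.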
Sidecar Health And Discovery
The production Helm path runs the SIE server sidecar inside each worker pod and uses NATS health. The sidecar publishes sie.health.<worker_id> heartbeats with the worker pod’s bundle, machine profile, queue depth, loaded models, and bundle_config_hash; the gateway builds its routing registry from those heartbeats.
| Mode | Used for | Health source |
|---|---|---|
| nats | Default chart path with SIE server sidecar | sie.health.<worker_id> heartbeats from the SIE server sidecar |
| ws | Local status diagnostics | Python sie-server /ws/status stream |
| static | Explicit local diagnostics | Operator-provided worker URLs |
The gateway still owns pool state through Kubernetes ConfigMaps and Leases. Kubernetes is not on the inference request path; queued work moves through NATS JetStream.
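To make the relationship between heartbeats and the 202 decision concrete, here is a minimal in-memory registry sketch. It is not the gateway's code: the `HealthRegistry` class, the payload keys (`bundle`, `machine_profile`, `queue_depth`), and the 15-second freshness window are assumptions chosen for illustration. The real heartbeat encoding and staleness threshold may differ.

```python
import time
from dataclasses import dataclass


@dataclass
class Heartbeat:
    worker_id: str
    bundle: str
    machine_profile: str
    queue_depth: int
    received_at: float


class HealthRegistry:
    """Keeps the latest heartbeat per worker and answers freshness queries."""

    PREFIX = "sie.health."

    def __init__(self, fresh_for_s=15.0, clock=time.monotonic):
        self._fresh_for_s = fresh_for_s   # hypothetical staleness window
        self._clock = clock
        self._by_worker = {}

    def record(self, subject, payload):
        if not subject.startswith(self.PREFIX):
            raise ValueError(f"not a health subject: {subject!r}")
        worker_id = subject[len(self.PREFIX):]
        # Payload keys are assumed; only the latest heartbeat is kept.
        self._by_worker[worker_id] = Heartbeat(
            worker_id,
            payload["bundle"],
            payload["machine_profile"],
            payload.get("queue_depth", 0),
            self._clock(),
        )

    def fresh_workers(self, bundle, machine_profile):
        now = self._clock()
        return [
            hb for hb in self._by_worker.values()
            if hb.bundle == bundle
            and hb.machine_profile == machine_profile
            and now - hb.received_at <= self._fresh_for_s
        ]

    def needs_provisioning(self, bundle, machine_profile, pool_pinned):
        # Mirrors the documented rule: 202 only when no fresh sidecar health
        # exists and the caller did not pin an explicit pool.
        return not pool_pinned and not self.fresh_workers(bundle, machine_profile)
```

The clock is injectable so that staleness can be exercised in tests without sleeping.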
Local Diagnostics
For hand-run gateway processes that inspect a standalone sie-server /ws/status, list worker URLs explicitly:

    sie-gateway serve \
      -w http://worker-1:8080 \
      -w http://worker-2:8080 \
      -w http://worker-3:8080

With queue-mode SIE server sidecar routing, the chart leaves gateway.healthMode empty and renders the routing-safe default, nats.
Resource Pools
Resource pools reserve dedicated worker pods for tenant isolation. Pool worker pods only serve requests for that pool.
Create a Pool

    client = SIEClient("http://gateway:8080")

    # Reserve 2 L4 workers for this tenant
    client.create_pool("tenant-abc", {"l4": 2})

    # Route requests to the pool
    result = client.encode(
        "BAAI/bge-m3",
        Item(text="hello"),
        gpu="tenant-abc/l4",  # pool_name/gpu_type
    )

    # Check pool status
    info = client.get_pool("tenant-abc")

    # Cleanup
    client.delete_pool("tenant-abc")

Pool Lifecycle
- Pools are represented in Kubernetes ConfigMaps and Leases.
- The SDK renews pool leases automatically in a background thread.
- Pools expire after their TTL unless renewed.
- The default pool is protected and cannot be deleted.
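The lifecycle rules above can be modelled with a toy in-memory lease table. This is only a sketch of the rules (TTL expiry, renewal, protected default pool). It does not use the Kubernetes Lease API, and the `PoolLeases` class and its methods are hypothetical names, not part of the SDK.

```python
import time


class PoolLeases:
    """Toy in-memory model of pool TTL rules (not the Kubernetes Lease API)."""

    PROTECTED = "default"

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        # The default pool never expires in this model.
        self._expires_at = {self.PROTECTED: float("inf")}

    def create(self, name, ttl_s):
        self._expires_at[name] = self._clock() + ttl_s

    def renew(self, name, ttl_s):
        if name not in self.active():
            raise KeyError(f"pool {name!r} is missing or already expired")
        self._expires_at[name] = self._clock() + ttl_s

    def delete(self, name):
        if name == self.PROTECTED:
            raise PermissionError("the default pool cannot be deleted")
        self._expires_at.pop(name, None)

    def active(self):
        now = self._clock()
        # Drop leases whose TTL has lapsed, then report what is left.
        self._expires_at = {n: t for n, t in self._expires_at.items() if t > now}
        return sorted(self._expires_at)
```

A background renewer, like the one the SDK runs, would call something equivalent to `renew` at an interval shorter than the TTL, so a pool only disappears once its client stops renewing.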
Config Read Surface
The gateway serves read-side config endpoints from its in-memory registry:
| Endpoint | Purpose |
|---|---|
| GET /v1/configs/models | List models known to this gateway |
| GET /v1/configs/models/{id} | Return model YAML from the gateway registry |
| GET /v1/configs/models/{id}/status | Report per-replica config-hash readiness |
| GET /v1/configs/bundles | List known bundles and visible SIE server sidecar health counts |
| GET /v1/configs/bundles/{id} | Return bundle YAML |
| POST /v1/configs/resolve | Dry-run model or explicit bundle override to bundle routing |
The gateway is not a config write authority. POST /v1/configs/models is not registered on the gateway and returns 405 Method Not Allowed; send writes to sie-config.
Bootstrap and Recovery
On startup, the gateway:
- Optionally loads filesystem seeds from SIE_BUNDLES_DIR and SIE_MODELS_DIR if an escape-hatch config map is mounted.
- Reads GET /v1/configs/epoch to capture the authoritative epoch and bundle-set hash.
- Fetches bundles from sie-config with GET /v1/configs/bundles{,/{id}}.
- Fetches model state with GET /v1/configs/export.
- Subscribes to sie.config.models._all for live deltas.
- Polls GET /v1/configs/epoch every 30 seconds to catch missed deltas or bundle-set drift.
/readyz does not wait for sie-config. A fresh gateway can be ready before the first config bootstrap succeeds; during that window, typed requests may return 404 until the registry is populated.
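The epoch poll acts as a safety net behind the NATS deltas. A simplified reconcile step might look like the sketch below. It assumes the epoch endpoint yields an `(epoch, bundle_set_hash)` pair and that a full re-export is acceptable whenever either value has moved. The real gateway may refetch bundles and models separately, and `GatewayConfigState`, `reconcile`, `fetch_epoch`, and `fetch_export` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class GatewayConfigState:
    epoch: int = 0
    bundles_hash: str = ""
    models: Dict[str, str] = field(default_factory=dict)


def reconcile(
    state: GatewayConfigState,
    fetch_epoch: Callable[[], Tuple[int, str]],
    fetch_export: Callable[[], Dict[str, str]],
) -> bool:
    """Resync from sie-config when the local mirror is behind.

    Returns True when a full export was applied, False when already current.
    """
    remote_epoch, remote_hash = fetch_epoch()
    if remote_epoch <= state.epoch and remote_hash == state.bundles_hash:
        return False
    # Assumption: one export call is enough to rebuild the in-memory mirror.
    state.models = dict(fetch_export())
    state.epoch = remote_epoch
    state.bundles_hash = remote_hash
    return True
```

Calling `reconcile` on a 30-second timer would reproduce the polling behaviour described above. When deltas arrive normally, the poll finds nothing to do and returns `False`.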
Health & Status
The gateway aggregates SIE server sidecar health records:
| Endpoint | Description |
|---|---|
| GET /healthz | Gateway liveness |
| GET /readyz | Gateway readiness; intentionally independent of sie-config reachability |
| GET /health | Cluster summary: worker count, GPU count, models loaded |
| GET /v1/models | Model list from the gateway registry |
| WS /ws/cluster-status | Real-time cluster metrics stream |
Cluster Health Example

    curl http://gateway:8080/health

    {
      "status": "healthy",
      "worker_count": 3,
      "gpu_count": 3,
      "models_loaded": 12,
      "configured_gpu_types": ["l4", "a100-80gb"],
      "live_gpu_types": ["l4"]
    }

Metrics
Important gateway metrics include:
| Metric | Purpose |
|---|---|
| sie_gateway_requests_total | HTTP requests by endpoint, status, and machine profile |
| sie_gateway_request_latency_seconds | Gateway request latency |
| sie_gateway_pending_demand | KEDA scale-from-zero trigger by machine profile and bundle |
| sie_gateway_worker_queue_depth | Per-worker queue depth |
| sie_gateway_config_epoch | Highest config epoch applied on this gateway |
| sie_gateway_config_bootstrap_degraded | Whether bootstrap has been failing long enough to alert |
| sie_gateway_config_deltas_total | NATS config-delta processing outcomes |
| sie_gateway_nats_connected | Gateway NATS connection state |
What’s Next
- Scale-from-Zero - the 202 flow and cold start handling
- Config API - runtime config writes and gateway readiness polling
- Kubernetes in GCP - full deployment with the gateway
- Monitoring - metrics and dashboards