
Gateway

The SIE gateway is a stateless Rust service that sits between clients and GPU worker pods. It handles routing, queue submission, resource pools, SIE server sidecar health, read-side config, and scale-from-zero orchestration. In Kubernetes, each worker pod runs the SIE server sidecar beside the Python sie-server adapter process; the sidecar pulls queued work and calls the adapter over IPC.

The page keeps the /docs/engine/router/ URL for compatibility, but the deployed component is sie-gateway.

Not every deployment needs a gateway. The deciding factor is whether you are running an elastic worker fleet:

  • Single server (local dev or Docker): point the SDK at a standalone sie-server.
  • Kubernetes clusters: use the gateway. It provides a stable client endpoint, worker discovery, queue-based inference, scale-from-zero, resource pools, and config read endpoints.
  • Horizontal gateway replicas: supported. Each replica keeps its own in-memory registry and converges through bootstrap, NATS config deltas, and epoch polling.

| Setup | Use Gateway? | Why |
|---|---|---|
| Single Docker container | No | One sie-server process handles the request path |
| Kubernetes | Yes | Required for worker discovery, queue routing, scale-from-zero, and pool isolation |

Gateway architecture: SDK/HTTP Client to gateway, NATS queue, and GPU worker pods

The gateway is stateless with respect to durable data. It owns in-memory routing state, but it does not persist config and it does not execute inference.

Client request
-> sie-gateway resolves model, bundle, machine profile, and pool
-> gateway publishes msgpack work items to NATS JetStream
-> matching worker pod's SIE server sidecar pulls, batches, and calls the sie-server adapter over UDS IPC
-> SIE server sidecar publishes msgpack results to the gateway's NATS Core inbox
-> gateway assembles and returns the HTTP response

Config writes are outside this hot path. Admin tooling writes to sie-config, and gateways mirror that state through /v1/configs/export, NATS deltas, and /v1/configs/epoch polling.


The gateway resolves every inference request to:

  1. Model and profile: the model path and optional :profile suffix.
  2. Bundle: selected by adapter compatibility, with the lowest numeric bundle priority winning by default.
  3. Machine profile: X-SIE-MACHINE-PROFILE header or SDK gpu parameter.
  4. Pool: default pool or explicit X-SIE-Pool / SDK pool/profile target.
  5. Queue subject: sie.work.{model}.{pool} on the pool’s JetStream stream, consumed by the SIE server sidecar inside matching worker pods.

The Rust gateway is queue-only for inference. If the queue transport is unavailable, the gateway returns 503.
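The resolution steps above can be sketched in a few lines of Python. This is an illustrative model only, not the gateway's Rust implementation: the `Bundle` shape, the `adapter` string, and the pool name used in the example are assumptions, and the sketch applies no escaping to the model ID when it fills in the subject template.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Bundle:
    # Hypothetical shape; the real bundle schema is not shown on this page.
    name: str
    priority: int
    adapters: frozenset


def split_model_profile(path: str) -> Tuple[str, Optional[str]]:
    """Split 'org/model:profile' into (model, profile); the suffix is optional."""
    model, sep, profile = path.partition(":")
    return model, (profile if sep and profile else None)


def pick_bundle(bundles: Iterable[Bundle], adapter: str) -> Bundle:
    """Among adapter-compatible bundles, the lowest numeric priority wins."""
    compatible = [b for b in bundles if adapter in b.adapters]
    if not compatible:
        raise LookupError(f"no bundle supports adapter {adapter!r}")
    return min(compatible, key=lambda b: b.priority)


def queue_subject(model: str, pool: str) -> str:
    """Fill in the sie.work.{model}.{pool} template (no escaping applied here)."""
    return f"sie.work.{model}.{pool}"
```

For example, `split_model_profile("BAAI/bge-m3:fast")` yields the model ID and the profile separately, and `pick_bundle` prefers a bundle with priority 10 over one with priority 20 when both support the requested adapter.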

Requests can specify a target machine profile:

# HTTP
curl -X POST http://gateway:8080/v1/encode/BAAI/bge-m3 \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -H "Content-Type: application/json" \
  -d '{"items": [{"text": "Hello world"}]}'

# SDK
result = client.encode("BAAI/bge-m3", Item(text="hello"), gpu="l4")

If the caller omits a machine profile, the gateway falls back to the default configured route.

When no healthy SIE server sidecar has recently published health for the selected (bundle, machine_profile) tuple and the caller did not pin a specific pool, the gateway returns:

HTTP/1.1 202 Accepted
Retry-After: 120
Content-Type: application/json

{
  "status": "provisioning",
  "gpu": "l4",
  "bundle": "default",
  "estimated_wait_s": 180,
  "message": "No worker available for GPU type 'l4'. Provisioning in progress."
}

The SDK handles this automatically with wait_for_capacity=True. See Scale-from-Zero for details.

202 is only for capacity provisioning. Unknown models fail fast with 404 once the gateway registry has bootstrapped. Incompatible explicit bundle choices fail with 409.
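For clients that talk to the gateway over raw HTTP rather than through the SDK, the 202 handling might look like the sketch below. It is a minimal illustration, not the SDK's `wait_for_capacity` implementation: `send` is a hypothetical callable that performs one request and returns `(status, headers, body)`, and the wait budget and fallback delay are arbitrary assumed values.

```python
import time
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str], bytes]


def call_with_capacity_wait(
    send: Callable[[], Response],
    max_wait_s: float = 600.0,      # assumed overall budget
    default_delay_s: float = 5.0,   # assumed fallback when Retry-After is unusable
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Repeat send() while the gateway answers 202, honoring Retry-After."""
    waited = 0.0
    while True:
        status, headers, body = send()
        if status != 202:
            # 200, 404, 409, 503 and so on are returned to the caller unchanged.
            return status, headers, body
        try:
            # Assumes headers are keyed exactly as 'Retry-After', in seconds.
            delay = float(headers.get("Retry-After", default_delay_s))
        except ValueError:
            delay = default_delay_s
        if waited + delay > max_wait_s:
            raise TimeoutError(f"still provisioning after {waited:.0f}s")
        sleep(delay)
        waited += delay
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays. Only 202 is retried, which matches the statement above that 404 and 409 are fail-fast errors rather than provisioning states.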


The production Helm path runs the SIE server sidecar inside each worker pod and uses NATS health. The sidecar publishes sie.health.<worker_id> heartbeats with the worker pod’s bundle, machine profile, queue depth, loaded models, and bundle_config_hash; the gateway builds its routing registry from those heartbeats.

| Mode | Used for | Health source |
|---|---|---|
| nats | Default chart path with SIE server sidecar | sie.health.<worker_id> heartbeats from the SIE server sidecar |
| ws | Local status diagnostics | Python sie-server /ws/status stream |
| static | Explicit local diagnostics | Operator-provided worker URLs |
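To make the heartbeat-driven registry concrete, here is a small Python model of the idea. It is a sketch under assumptions: the field names mirror the ones listed above but the wire format is not shown on this page, and the 15-second freshness window is a made-up value rather than the gateway's actual setting.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Heartbeat:
    # Subset of the heartbeat fields named above; exact names are assumed.
    worker_id: str
    bundle: str
    machine_profile: str
    queue_depth: int


class HealthRegistry:
    """Keeps the latest heartbeat per worker and answers freshness queries."""

    def __init__(self, fresh_for_s: float = 15.0):  # assumed freshness window
        self.fresh_for_s = fresh_for_s
        self._seen: Dict[str, Tuple[float, Heartbeat]] = {}

    def record(self, hb: Heartbeat, now: float) -> None:
        # A newer heartbeat from the same worker replaces the older one.
        self._seen[hb.worker_id] = (now, hb)

    def has_fresh_worker(self, bundle: str, machine_profile: str, now: float) -> bool:
        # True if any worker for this (bundle, machine_profile) reported recently.
        return any(
            now - seen_at <= self.fresh_for_s
            and hb.bundle == bundle
            and hb.machine_profile == machine_profile
            for seen_at, hb in self._seen.values()
        )
```

In this model, a false result from `has_fresh_worker` corresponds to the condition under which the gateway answers 202 for an unpinned request.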

The gateway still owns pool state through Kubernetes ConfigMaps and Leases. Kubernetes is not on the inference request path; queued work moves through NATS JetStream.

For hand-run gateway processes that inspect a standalone sie-server /ws/status, list worker URLs explicitly:

sie-gateway serve \
  -w http://worker-1:8080 \
  -w http://worker-2:8080 \
  -w http://worker-3:8080

With queue-mode SIE server sidecar routing, the chart leaves gateway.healthMode empty and renders the routing-safe default, nats.


Resource pools reserve dedicated worker pods for tenant isolation. Pool worker pods only serve requests for that pool.

client = SIEClient("http://gateway:8080")

# Reserve 2 L4 workers for this tenant
client.create_pool("tenant-abc", {"l4": 2})

# Route requests to the pool
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello"),
    gpu="tenant-abc/l4",  # pool_name/gpu_type
)

# Check pool status
info = client.get_pool("tenant-abc")

# Cleanup
client.delete_pool("tenant-abc")

  • Pools are represented in Kubernetes ConfigMaps and Leases.
  • The SDK renews pool leases automatically in a background thread.
  • Pools expire after their TTL unless renewed.
  • The default pool is protected and cannot be deleted.
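The `pool_name/gpu_type` convention used in the `gpu` argument above is easy to parse by hand when building tooling around the SDK. The helper below is a hypothetical illustration of that convention, not an SDK function.

```python
from typing import Optional, Tuple


def split_gpu_target(target: str) -> Tuple[Optional[str], str]:
    """Split 'pool_name/gpu_type' into (pool, gpu_type); pool is None if absent."""
    pool, sep, gpu_type = target.rpartition("/")
    if not sep:
        # No slash: a bare GPU type such as 'l4'.
        return None, target
    if not pool or not gpu_type:
        raise ValueError(f"malformed target: {target!r}")
    return pool, gpu_type
```

With this helper, `"tenant-abc/l4"` splits into the pool `tenant-abc` and the GPU type `l4`, while a bare `"l4"` carries no pool.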

The gateway serves read-side config endpoints from its in-memory registry:

| Endpoint | Purpose |
|---|---|
| GET /v1/configs/models | List models known to this gateway |
| GET /v1/configs/models/{id} | Return model YAML from the gateway registry |
| GET /v1/configs/models/{id}/status | Report per-replica config-hash readiness |
| GET /v1/configs/bundles | List known bundles and visible SIE server sidecar health counts |
| GET /v1/configs/bundles/{id} | Return bundle YAML |
| POST /v1/configs/resolve | Dry-run the resolution of a model, or an explicit bundle override, to its bundle route |

The gateway is not a config write authority. POST /v1/configs/models is not registered on the gateway and returns 405 Method Not Allowed; send writes to sie-config.

On startup, the gateway:

  1. Optionally loads filesystem seeds from SIE_BUNDLES_DIR and SIE_MODELS_DIR if an escape-hatch config map is mounted.
  2. Reads GET /v1/configs/epoch to capture the authoritative epoch and bundle-set hash.
  3. Fetches bundles from sie-config with GET /v1/configs/bundles and GET /v1/configs/bundles/{id}.
  4. Fetches model state with GET /v1/configs/export.
  5. Subscribes to sie.config.models._all for live deltas.
  6. Polls GET /v1/configs/epoch every 30 seconds to catch missed deltas or bundle-set drift.
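The epoch poll in the last step amounts to a drift check between the gateway's local view and sie-config. A minimal sketch of that comparison follows; it assumes the epoch is a monotonically increasing integer and that the response carries a bundle-set hash, and the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigEpoch:
    # Hypothetical field names for the /v1/configs/epoch payload.
    epoch: int
    bundle_set_hash: str


def needs_resync(local: ConfigEpoch, remote: ConfigEpoch) -> bool:
    """True when a delta was missed (remote epoch ahead) or the bundle set drifted."""
    return (
        remote.epoch > local.epoch
        or remote.bundle_set_hash != local.bundle_set_hash
    )
```

When `needs_resync` is true, a gateway in this model would re-fetch bundles and the model export, which is the same path it uses at bootstrap.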

/readyz does not wait for sie-config. A fresh gateway can be ready before the first config bootstrap succeeds; during that window, typed requests may return 404 until the registry is populated.


The gateway aggregates SIE server sidecar health records:

| Endpoint | Description |
|---|---|
| GET /healthz | Gateway liveness |
| GET /readyz | Gateway readiness; intentionally independent of sie-config reachability |
| GET /health | Cluster summary: worker count, GPU count, models loaded |
| GET /v1/models | Model list from the gateway registry |
| WS /ws/cluster-status | Real-time cluster metrics stream |

curl http://gateway:8080/health

{
  "status": "healthy",
  "worker_count": 3,
  "gpu_count": 3,
  "models_loaded": 12,
  "configured_gpu_types": ["l4", "a100-80gb"],
  "live_gpu_types": ["l4"]
}
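One practical use of the /health payload is spotting GPU types that are configured but currently have no live workers. The sketch below works on the sample response shown above; reading such types as "scaled to zero" is an interpretation, not something the payload states.

```python
import json


def gpu_types_without_workers(health: dict) -> list:
    """Return configured GPU types that do not appear in live_gpu_types."""
    live = set(health.get("live_gpu_types", []))
    return [g for g in health.get("configured_gpu_types", []) if g not in live]


sample = json.loads("""
{
  "status": "healthy",
  "worker_count": 3,
  "gpu_count": 3,
  "models_loaded": 12,
  "configured_gpu_types": ["l4", "a100-80gb"],
  "live_gpu_types": ["l4"]
}
""")

print(gpu_types_without_workers(sample))  # ['a100-80gb']
```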

Important gateway metrics include:

| Metric | Purpose |
|---|---|
| sie_gateway_requests_total | HTTP requests by endpoint, status, and machine profile |
| sie_gateway_request_latency_seconds | Gateway request latency |
| sie_gateway_pending_demand | KEDA scale-from-zero trigger by machine profile and bundle |
| sie_gateway_worker_queue_depth | Per-worker queue depth |
| sie_gateway_config_epoch | Highest config epoch applied on this gateway |
| sie_gateway_config_bootstrap_degraded | Whether bootstrap has been failing long enough to alert |
| sie_gateway_config_deltas_total | NATS config-delta processing outcomes |
| sie_gateway_nats_connected | Gateway NATS connection state |
