---
title: Architecture
description: High-level architecture of SIE from client SDK to GPU inference.
canonical_url: https://superlinked.com/docs/engine/architecture
last_updated: 2026-06-11
---

SIE has two runtime shapes. In standalone Docker, a single `sie-server` process exposes the server API. In a Kubernetes cluster, the request path is split across a Rust gateway, NATS, GPU worker pods, and a dedicated config service. Each worker pod runs the SIE server sidecar beside the Python `sie-server` adapter process.

## System Overview

![SIE system architecture: Client, Gateway, NATS, worker pod with SIE server sidecar, and sie-server adapter layers](/diagrams/system-arch.svg)

In production Kubernetes deployments, the hot path is intentionally separate from the config control plane:

```text
Client SDK
  -> sie-gateway (Rust, stateless inference edge)
  -> NATS JetStream queue
  -> SIE server sidecar inside the worker pod
  -> UDS IPC
  -> sie-server adapter process
  -> NATS Core result inbox
  -> sie-gateway response

Admin tooling
  -> sie-config (Python, single-writer config control plane)
  -> config store + NATS config deltas
  -> gateways and worker pods converge asynchronously
```

---

## Components

### Client SDK

Source: [packages/sie_sdk/src/sie_sdk/client/sync.py](https://github.com/superlinked/sie/blob/main/packages/sie_sdk/src/sie_sdk/client/sync.py)

The SDK provides `encode()`, `score()`, and `extract()` methods. It handles:

- **msgpack serialization**: Binary wire format, faster and smaller than JSON
- **Automatic 202 retry**: Waits for scale-from-zero with `wait_for_capacity=True`
- **Pool management**: Background lease renewal for resource pools
- **Numpy integration**: Returns native numpy arrays for embeddings

Framework integrations (LangChain, LlamaIndex, etc.) wrap the SDK with framework-specific interfaces.
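The automatic 202 retry described above can be pictured as a loop that resends the request until the server stops answering `202 Accepted`. The following is a minimal sketch, not the SDK's actual code: `Response`, `call_with_capacity_wait`, `max_wait_s`, and `default_delay_s` are hypothetical names, and the sketch assumes `Retry-After` carries a number of seconds.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Response:
    """Hypothetical stand-in for an HTTP response (not an SDK type)."""
    status: int
    headers: Dict[str, str] = field(default_factory=dict)
    body: bytes = b""


def call_with_capacity_wait(
    send: Callable[[], Response],
    max_wait_s: float = 300.0,
    default_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Resend the request while the server answers 202, up to max_wait_s."""
    waited = 0.0
    while True:
        resp = send()
        if resp.status != 202:
            return resp
        # Assumption: Retry-After is in seconds; fall back if it is missing
        # or not numeric (for example an HTTP-date).
        try:
            delay = float(resp.headers.get("Retry-After", default_delay_s))
        except ValueError:
            delay = default_delay_s
        if waited + delay > max_wait_s:
            raise TimeoutError(f"no capacity after waiting {waited:.0f}s")
        sleep(delay)
        waited += delay
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.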

### Gateway

Source: [packages/sie_gateway/src/handlers/proxy.rs](https://github.com/superlinked/sie/blob/main/packages/sie_gateway/src/handlers/proxy.rs)

The gateway is a stateless Rust service that sits between clients and Kubernetes worker pods. It is optional for single-server setups but required for elastic Kubernetes clusters.

**Responsibilities:**
- Resolves model, bundle, machine profile, and pool from its in-memory registry
- Publishes inference work to NATS JetStream
- Returns `202 Accepted` with `Retry-After` when the target worker pool is scaled to zero
- Serves read-side config endpoints from its local registry mirror
- Manages resource pools for capacity isolation
- Tracks worker-pod health and bundle config hashes from SIE server sidecar NATS heartbeats

The gateway does not own config writes. `POST /v1/configs/models`, `GET /v1/configs/export`, and `GET /v1/configs/epoch` belong to `sie-config`.
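Two of the responsibilities listed above, heartbeat-based health tracking and the `202 Accepted` response for a pool scaled to zero, can be sketched together. This is an illustration in Python rather than the gateway's Rust code. The `Heartbeat` fields other than `bundle_config_hash`, the staleness window, and the `Retry-After` value are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Heartbeat:
    worker_id: str
    pool: str                 # assumption: heartbeats can be mapped to a pool
    bundle_config_hash: str
    received_at: float        # seconds, from a monotonic clock


class WorkerRegistry:
    """In-memory view of worker pods, refreshed from heartbeats."""

    def __init__(self, stale_after_s: float = 15.0) -> None:
        # Hypothetical window after which a silent worker counts as gone.
        self.stale_after_s = stale_after_s
        self._workers: Dict[str, Heartbeat] = {}

    def observe(self, hb: Heartbeat) -> None:
        # The latest heartbeat per worker replaces the previous one.
        self._workers[hb.worker_id] = hb

    def live_workers(self, pool: str, now: float) -> List[Heartbeat]:
        return [
            hb for hb in self._workers.values()
            if hb.pool == pool and now - hb.received_at <= self.stale_after_s
        ]


def admission_response(
    registry: WorkerRegistry, pool: str, now: float, retry_after_s: int = 5
) -> Optional[Tuple[int, Dict[str, str]]]:
    """Return a 202 response when the pool has no live workers, else None."""
    if not registry.live_workers(pool, now):
        return 202, {"Retry-After": str(retry_after_s)}
    return None  # caller goes on to publish the work item
```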

### Config Service

Source: [packages/sie_config/src/sie_config/config_api.py](https://github.com/superlinked/sie/blob/main/packages/sie_config/src/sie_config/config_api.py)

`sie-config` is the authoritative config control plane. It runs as a single writer, persists API-added model configs, and publishes runtime config deltas:

- `POST /v1/configs/models` appends new models or profiles.
- `GET /v1/configs/export` gives gateways a full snapshot for bootstrap and drift recovery.
- `GET /v1/configs/epoch` exposes the authoritative model-write epoch and bundle-set hash.
- `GET /v1/configs/bundles{,/{id}}` lets gateways fetch the bundle set baked into the `sie-config` image.

Gateways bootstrap from `sie-config`, subscribe to `sie.config.models._all` for live deltas, and poll `/v1/configs/epoch` to recover any missed NATS messages.
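The bootstrap, delta, and poll sequence above can be sketched as a small registry mirror. The payload shapes here (`epoch`, `models`, `model_id`, `config`) and the assumption that the epoch is an integer that increases by one per write are illustrative only; the real response schemas are defined by `sie-config`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class RegistryMirror:
    """Local copy of the model config held by a gateway (hypothetical shape)."""
    epoch: int = 0
    models: Dict[str, dict] = field(default_factory=dict)

    def load_snapshot(self, snapshot: dict) -> None:
        # Bootstrap or drift recovery: replace everything with the export.
        self.epoch = snapshot["epoch"]
        self.models = dict(snapshot["models"])

    def apply_delta(self, delta: dict) -> bool:
        """Apply a live delta only if it is the next epoch in sequence."""
        if delta["epoch"] != self.epoch + 1:
            return False  # duplicate or gap; the epoch poll repairs gaps
        self.models[delta["model_id"]] = delta["config"]
        self.epoch = delta["epoch"]
        return True


def reconcile(
    mirror: RegistryMirror,
    fetch_epoch: Callable[[], int],
    fetch_export: Callable[[], dict],
) -> bool:
    """Poll step: reload the full snapshot if the mirror is behind."""
    if fetch_epoch() > mirror.epoch:
        mirror.load_snapshot(fetch_export())
        return True
    return False
```

In this sketch a missed NATS message shows up as an epoch gap. The delta is rejected, and the next poll notices the mirror is behind and reloads the snapshot.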

### SIE Server Sidecar

Source: [packages/sie_server_sidecar/src/main.rs](https://github.com/superlinked/sie/blob/main/packages/sie_server_sidecar/src/main.rs)

Every queue-mode worker pod includes the SIE server sidecar. The Helm chart enables it by default and renders the container as `worker-sidecar`.
The image repository is `ghcr.io/superlinked/sie-server-sidecar`. The sidecar owns the queue half of the worker pod's cluster hot path:

- Pulls work from the pool's NATS JetStream stream
- Validates subjects and reply inboxes before processing payloads
- Forms batches by model, operation, and LoRA key
- Calls the Python `sie-server` adapter process over Unix domain socket IPC
- Frames results, publishes them to the gateway inbox, and ACKs or NAKs JetStream messages
- Publishes `sie.health.<worker_id>` heartbeats with queue depth, loaded model state, and the current `bundle_config_hash`
- Applies bundle-scoped config deltas to the `sie-server` adapter through IPC and reconciles missed deltas from `sie-config`

The SIE server sidecar does not load model weights and does not link GPU libraries. It keeps queue, batching, payload, and framing work in Rust while the `sie-server` adapter remains responsible for model execution.
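Batch formation by model, operation, and LoRA key amounts to grouping queued items on that three-part key and splitting each group into bounded chunks. The sketch below shows the idea in Python for readability; the sidecar itself is written in Rust, and `WorkItem`, `form_batches`, and `max_batch_size` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

BatchKey = Tuple[str, str, Optional[str]]


@dataclass(frozen=True)
class WorkItem:
    model: str
    operation: str            # e.g. "encode", "score", or "extract"
    lora: Optional[str]       # None when no LoRA adapter is requested
    payload: bytes


def form_batches(
    items: List[WorkItem], max_batch_size: int = 32
) -> List[Tuple[BatchKey, List[WorkItem]]]:
    """Group items that can share one GPU call, then cap each batch size."""
    groups: Dict[BatchKey, List[WorkItem]] = defaultdict(list)
    for item in items:
        groups[(item.model, item.operation, item.lora)].append(item)

    batches: List[Tuple[BatchKey, List[WorkItem]]] = []
    for key, members in groups.items():
        # Split oversized groups into consecutive chunks, preserving order.
        for start in range(0, len(members), max_batch_size):
            batches.append((key, members[start:start + max_batch_size]))
    return batches
```

Keeping the LoRA key in the batch key means items that need different adapter weights never land in the same batch.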

### Worker Adapter (sie-server)

Source: [packages/sie_server/src/sie_server/main.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/main.py)

`sie-server` remains the Python model-execution process. In standalone Docker it exposes the server API. In Kubernetes queue mode, it runs beside the SIE server sidecar inside each worker pod and exposes an IPC server to execute prepared batches.

It owns:

- Model registry and GPU lifecycle
- Adapter-specific preprocessing and model-path selection
- Model loading, LoRA loading, and memory pressure eviction
- GPU inference through PyTorch, Flash Attention, SGLang, and other adapter backends
- The single-process API used by Docker deployments

Queue-mode batches from the SIE server sidecar arrive as fully formed GPU work. The `sie-server` adapter can still retokenize an item when that is the safe execution path.

---

## Wire Protocol

Source: [packages/sie_sdk/src/sie_sdk/client/sync.py](https://github.com/superlinked/sie/blob/main/packages/sie_sdk/src/sie_sdk/client/sync.py)

SIE uses **msgpack** as the default wire format instead of JSON:

| Format | Encode speed | Decode speed | Size | Numpy support |
|--------|-------------|-------------|------|---------------|
| msgpack | Fast | Fast | ~50% of JSON | Native via msgpack-numpy |
| JSON | Slower | Slower | Baseline | Requires list conversion |

The SDK sends and receives msgpack automatically. The OpenAI-compatible `/v1/embeddings` endpoint uses JSON for compatibility.

Inside a Kubernetes cluster, gateway-to-sidecar work items, IPC frames between the SIE server sidecar and the `sie-server` adapter, and sidecar-to-gateway results are msgpack as well. JSON is reserved for low-frequency control-plane APIs and client requests that explicitly negotiate JSON.
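To see why a binary format suits embedding payloads, compare the size of a vector stored as raw float32 bytes with the same vector rendered as JSON text. The sketch uses only the standard library and is not msgpack itself: `array` stands in for any binary encoding of numeric data, and the measured ratio depends on the values and is not a benchmark of SIE.

```python
import json
import random
from array import array

random.seed(0)
embedding = [random.uniform(-1.0, 1.0) for _ in range(1024)]

# Binary encoding: each value is stored as a fixed-width C float.
binary = array("f", embedding).tobytes()

# Text encoding: each value becomes a run of decimal digits plus separators.
text = json.dumps(embedding).encode("utf-8")

# Decoding the binary form restores a typed buffer without parsing digits.
decoded = array("f")
decoded.frombytes(binary)

print(f"binary: {len(binary)} bytes, JSON: {len(text)} bytes")
```

The binary form also loads straight into a typed buffer, whereas JSON produces a Python list that still has to be converted before numeric use. The float32 round trip trades some precision for size.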

---

## Model Cache Hierarchy

Source: [packages/sie_server/src/sie_server/core/model_loader.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/core/model_loader.py)

Model weights are resolved through a 3-tier cache:

![Model cache hierarchy: Local Cache, Cluster Cache, HuggingFace Hub](/diagrams/cache-hierarchy.svg)

**Local disk cache** uses LRU eviction when disk usage exceeds `SIE_DISK_PRESSURE_THRESHOLD_PERCENT` (default: 85%).

**Cluster cache** is useful for Kubernetes deployments where multiple worker pods share the same S3/GCS bucket, avoiding redundant downloads from HuggingFace.
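A tiered lookup of this kind can be sketched as trying each source in turn and returning the first hit. The order (local, then cluster, then Hub) follows the diagram; the fetcher callables and the `should_evict` helper are hypothetical and do not mirror `model_loader.py`.

```python
from typing import Callable, Optional, Tuple

# A fetcher returns a path to the weights, or None on a cache miss.
Fetcher = Callable[[str], Optional[str]]


def resolve_weights(
    model_id: str, local: Fetcher, cluster: Fetcher, hub: Fetcher
) -> Tuple[str, str]:
    """Return (tier, path) from the first tier that has the model."""
    for tier, fetch in (("local", local), ("cluster", cluster), ("hub", hub)):
        path = fetch(model_id)
        if path is not None:
            return tier, path
    raise FileNotFoundError(f"model {model_id!r} not found in any tier")


def should_evict(
    used_bytes: int, total_bytes: int, threshold_percent: float = 85.0
) -> bool:
    """True when disk usage exceeds the pressure threshold."""
    return 100.0 * used_bytes / total_bytes > threshold_percent
```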

---

## Deployment Modes

### Standalone Docker

```text
Client → sie-server (single GPU)
```

The simplest setup. The SDK points at one `sie-server` process. Good for development and small production deployments.

### Multi-Bundle (Docker Compose)

```text
Client → sie-server:8080 (default bundle)
Client → sie-server:8081 (sglang bundle)
```

Multiple containers, each running a different bundle. The client routes each request to the correct port.
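Client-side routing in this mode can be as simple as a lookup from model to bundle to base URL. In the sketch below the ports come from the diagram above, while the `localhost` host, the model names, and the helper name are placeholders chosen for illustration.

```python
from typing import Dict

# Ports match the diagram; the host is an assumption for a local Compose setup.
BUNDLE_BASE_URLS: Dict[str, str] = {
    "default": "http://localhost:8080",
    "sglang": "http://localhost:8081",
}

# Hypothetical model names, mapped to the bundle that serves them.
MODEL_TO_BUNDLE: Dict[str, str] = {
    "example/embedder": "default",
    "example/llm-embedder": "sglang",
}


def base_url_for(model: str) -> str:
    """Pick the container that serves the bundle for this model."""
    try:
        bundle = MODEL_TO_BUNDLE[model]
    except KeyError:
        raise ValueError(f"no bundle configured for model {model!r}") from None
    return BUNDLE_BASE_URLS[bundle]
```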

### Cluster (Kubernetes)

```text
Client → sie-gateway → NATS JetStream → SIE server sidecar → UDS IPC → sie-server adapter
Admin → sie-config → config store + NATS config deltas → gateway + worker pod
```

Full production setup with GPU routing, autoscaling, and observability. See [Kubernetes in GCP](/docs/deployment/cloud-gcp/) or [AWS](/docs/deployment/cloud-aws/).

---

## What's Next

- [Request Pipeline](/docs/engine/) - detailed preprocessing, batching, and GPU inference flow
- [Gateway](/docs/engine/router/) - routing, queueing, load balancing, and resource pools
- [Config API](/docs/engine/config-api/) - runtime model config writes and readiness polling
- [Adapters](/docs/engine/adapters/) - compute engine abstraction layer
