---
title: How to deploy SIE
description: Run SIE as a single Docker server or as a Kubernetes cluster with gateway, config service, NATS, and GPU worker pods.
canonical_url: https://superlinked.com/docs/deployment
last_updated: 2026-06-11
---

SIE has two deployment paths. Use Docker for a single `sie-server` with no external SIE services. Use Kubernetes when you need the clustered runtime: `sie-gateway`, `sie-config`, NATS JetStream, and GPU worker pods. Each worker pod runs the SIE server sidecar beside the Python `sie-server` adapter container.

---

## Which Deployment Path Should I Use?

**Use Docker if:**
- You are running on a single server or VM
- You are in development or running a low-traffic service
- You want the simplest possible setup

**Use Kubernetes if:**
- You need horizontal scaling or autoscaling to zero
- You need high availability across multiple nodes
- You are deploying on GCP or AWS with GPU node pools

| | Docker | Kubernetes |
|---|---|---|
| Setup time | Minutes | Hours |
| Scaling | Manual | Automatic |
| High availability | No | Yes |
| Scale-to-zero | No | Yes |
| Best for | Dev, single-server | Production, high traffic |

See [Kubernetes on GCP](https://superlinked.com/docs/deployment/cloud-gcp/) and [Kubernetes on AWS](https://superlinked.com/docs/deployment/cloud-aws/) for cloud-specific guides.

---

## Getting Started With Docker

The fastest way to run SIE is a single `docker run`:

```bash
# CPU only
docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default

# With GPU (recommended)
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default
```

The server starts on port 8080. Models load on first request; no pre-configuration is needed.
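
Scripts that start the container usually need to know when it is ready. The sketch below waits for the published port to accept TCP connections. It assumes an open port is a good enough readiness signal; the function name `wait_for_port` is illustrative, and no SIE health endpoint is assumed because none is documented on this page.

```bash
#!/usr/bin/env bash
# Wait until a TCP port accepts connections, or give up after a timeout.
# Assumption: an open port is only a rough readiness signal. It does not
# prove that a model has loaded, since models load on first request.
wait_for_port() {
  local host="$1" port="$2" timeout_s="${3:-60}"
  local deadline=$(( SECONDS + timeout_s ))
  while (( SECONDS < deadline )); do
    # Bash's /dev/tcp pseudo-device opens a TCP connection. The subshell
    # closes the connection again when it exits.
    if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Usage, after `docker run -p 8080:8080 ...` in another terminal:
#   wait_for_port 127.0.0.1 8080 120 && echo "sie-server is accepting connections"
```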

### Common Options

```bash
# Persistent model cache (avoids re-downloading on restart)
docker run --gpus all \
  -p 8080:8080 \
  -v ~/.cache/sie:/root/.cache/sie \
  ghcr.io/superlinked/sie-server:latest-cuda12-default

# Custom port
docker run --gpus all -p 3000:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default

# Specific models only (faster startup)
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default \
  sie-server serve -m BAAI/bge-m3,BAAI/bge-reranker-v2-m3

# Persistent Hugging Face cache (skip re-downloading weights)
docker run --gpus all -p 8080:8080 \
  -v ~/.cache/huggingface:/app/.cache/huggingface \
  ghcr.io/superlinked/sie-server:latest-cuda12-default

# Different bundle (e.g. SGLang backend for large LLM embeddings)
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-sglang
```
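
The commands above differ only in accelerator, bundle, and host port, so a small helper can assemble them. This is a sketch: it assumes tags follow the `latest-<accelerator>-<bundle>` pattern seen in the three tags on this page, and other combinations may not be published. The function names are illustrative.

```bash
#!/usr/bin/env bash
# Build an sie-server image reference from an accelerator and a bundle.
# Assumption: tags follow latest-<accelerator>-<bundle>, inferred from
# latest-cpu-default, latest-cuda12-default, and latest-cuda12-sglang.
sie_image() {
  local accelerator="${1:-cuda12}" bundle="${2:-default}"
  printf 'ghcr.io/superlinked/sie-server:latest-%s-%s\n' "$accelerator" "$bundle"
}

# Print (rather than run) a docker command for the chosen options.
# The GPU flag is omitted for the CPU image.
sie_docker_cmd() {
  local accelerator="$1" bundle="$2" host_port="${3:-8080}"
  local gpu_flag=""
  [[ "$accelerator" == "cpu" ]] || gpu_flag="--gpus all "
  printf 'docker run %s-p %s:8080 %s\n' \
    "$gpu_flag" "$host_port" "$(sie_image "$accelerator" "$bundle")"
}

# Usage:
#   sie_docker_cmd cuda12 sglang 3000
```

Printing the command instead of executing it makes it easy to review before running, or to paste into a systemd unit or CI job.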

See the full [Docker deployment guide](https://superlinked.com/docs/deployment/docker/).

---

## What Hardware Does SIE Need?

### Minimum Specs

| Component | Minimum | Recommended |
|---|---|---|
| CPU | 4 cores | 8+ cores |
| RAM | 8GB | 16GB+ |
| GPU | Optional | Any NVIDIA GPU with 16GB+ VRAM |
| Disk | 20GB | 100GB+ for model cache |

### GPU Recommendations by Workload

| GPU | VRAM | Best for |
|---|---|---|
| T4 | 16GB | Development, light production |
| L4 | 24GB | Standard production (recommended starting point) |
| A100 40GB | 40GB | High-throughput or large model serving |
| A100 80GB | 80GB | 7B+ parameter models |

See [Hardware and Capacity](https://superlinked.com/docs/deployment/resources/) for full sizing guidance.
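
A quick way to compare a Linux host against the minimum specs is a short pre-flight script. The sketch below checks CPU cores, RAM, and free disk against the minimums in the table (4 cores, 8GB, 20GB). It assumes GNU coreutils and that the model cache lives under `$HOME`; the function names are illustrative.

```bash
#!/usr/bin/env bash
# check_min <label> <actual> <minimum>
# Prints a verdict and returns 1 if the actual value is below the minimum.
check_min() {
  local label="$1" actual="$2" minimum="$3"
  if (( actual >= minimum )); then
    echo "ok   ${label}: ${actual} (minimum ${minimum})"
  else
    echo "FAIL ${label}: ${actual} (minimum ${minimum})"
    return 1
  fi
}

# Compare this host against the documented minimums. The GPU is optional,
# so it is not checked here.
check_host() {
  local cores ram_gb disk_gb status=0
  cores=$(nproc)
  # MemTotal is reported in KiB. Round up so a machine sold as "8GB" that
  # reports slightly under 8 GiB still passes.
  ram_gb=$(awk '/^MemTotal:/ { printf "%d", ($2 + 1048575) / 1048576 }' /proc/meminfo)
  # Free space in GiB on the filesystem that holds the home directory.
  disk_gb=$(df -BG --output=avail "$HOME" | tail -n 1 | tr -dc '0-9')
  check_min "CPU cores" "$cores" 4 || status=1
  check_min "RAM (GB)" "$ram_gb" 8 || status=1
  check_min "Free disk (GB)" "$disk_gb" 20 || status=1
  return "$status"
}

# Usage:
#   check_host || echo "host is below the documented minimums"
```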

---

## When Should I Move to Kubernetes?

Move from Docker to Kubernetes when you need:

- **Autoscaling** to handle traffic spikes by spinning up additional workers
- **Scale-to-zero** to save costs by scaling down during idle periods
- **High availability** with multiple replicas to survive node failures
- **Multi-region** deployment to serve users in different geographies

Note: A cluster that has scaled to zero takes 5 to 7 minutes to cold start. Set `wait_for_capacity=True` in the Python SDK (or `waitForCapacity: true` in TypeScript) to handle this gracefully. See [Scale-from-Zero and Autoscaling](https://superlinked.com/docs/deployment/autoscaling/).
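
For clients that do not use the SDKs, such as shell scripts or CI smoke tests, a retry loop with a deadline longer than the cold start serves a similar purpose. This is a generic sketch and not how the SDK option is implemented; `retry_until` and `send_request` are hypothetical names.

```bash
#!/usr/bin/env bash
# Retry a command until it succeeds or a deadline passes.
# retry_until <deadline_seconds> <interval_seconds> <command> [args...]
retry_until() {
  local deadline_s="$1" interval_s="$2"
  shift 2
  local start=$SECONDS
  until "$@"; do
    if (( SECONDS - start >= deadline_s )); then
      echo "gave up after ${deadline_s}s: $*" >&2
      return 1
    fi
    sleep "$interval_s"
  done
}

# Usage: allow up to 8 minutes for workers to scale from zero, which leaves
# some headroom over the 5 to 7 minute cold start.
# `send_request` is a hypothetical stand-in for your own client call.
#   retry_until 480 15 send_request
```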

---

## Kubernetes Cluster Prerequisites

These requirements apply to any Kubernetes install path. The Terraform examples for GCP and AWS provision a cluster that satisfies all of them. Operators using `helm install` against an existing cluster must confirm each item first.

### Cluster

- **Kubernetes 1.29 or newer.** The AWS Terraform example pins to 1.35; the GCP example follows the cluster's release channel. Older versions are untested.
- **Worker nodes with NVIDIA GPUs** (L4, A100 40GB, or A100 80GB). CPU-only worker pools exist for local testing but are not a supported production target.
- **NVIDIA device plugin installed** and exposing `nvidia.com/gpu` as a schedulable resource. GKE ships this on GPU node pools automatically; EKS does not.
- **Node disk ≥ 350Gi** per GPU node. Workers cache models in a 300Gi `emptyDir` (no PVC, no storage class needed for the cache itself).
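
When installing against an existing cluster, the items above can be checked with `kubectl`. The sketch below assumes `kubectl` already points at the target cluster; the function names are illustrative. The version helper only parses a version string, and the node listing shows whether each node exposes `nvidia.com/gpu` and how much ephemeral storage it offers.

```bash
#!/usr/bin/env bash
# Return success if a Kubernetes version string such as "v1.31.2" or
# "v1.30.5-gke.1014001" is at least 1.29.
k8s_version_ok() {
  local version="${1#v}"           # strip the leading "v"
  local major="${version%%.*}"     # text before the first dot
  local rest="${version#*.}"       # everything after the first dot
  local minor="${rest%%[!0-9]*}"   # leading digits of the minor version
  (( major > 1 || (major == 1 && minor >= 29) ))
}

# List each node with its kubelet version, allocatable GPUs, and allocatable
# ephemeral storage. Nodes without the NVIDIA device plugin show <none> in
# the GPU column.
list_gpu_nodes() {
  kubectl get nodes -o custom-columns='NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion,GPU:.status.allocatable.nvidia\.com/gpu,DISK:.status.allocatable.ephemeral-storage'
}

# Usage:
#   list_gpu_nodes
#   k8s_version_ok "v1.28.9" || echo "cluster is older than 1.29"
```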

### In-cluster components

- **Ingress controller.** The chart defaults to `ingressClassName: nginx`. Install ingress-nginx if you plan to expose the gateway publicly. Port-forward works for smoke tests and internal-only setups.
- **cert-manager** (optional). Required only if you want the chart to issue Let's Encrypt certificates via HTTP-01. Bringing your own certificate in a `kubernetes.io/tls` Secret is also supported and is the default.
- **Storage class.** Only matters if you enable the `sie-config` PVC (1Gi, default off). The cluster default class is fine.
- **KEDA, Prometheus, Loki, Alloy, DCGM Exporter.** Packaged as optional sub-charts (`keda.install=true`, `kube-prometheus-stack.install=true`, etc.). Skip them for a minimal smoke test; enable for autoscaling and observability.
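
The optional sub-charts are enabled with Helm `--set` flags of the form `<name>.install=true`. The sketch below builds those flags and prints a `helm install` command without running it. The release name, namespace, and chart reference are placeholders you must supply, and only `keda` and `kube-prometheus-stack` are spelled out on this page, so check the chart's values for the other keys.

```bash
#!/usr/bin/env bash
# Turn sub-chart names into Helm --set flags that enable them.
subchart_flags() {
  local name flags=()
  for name in "$@"; do
    flags+=(--set "${name}.install=true")
  done
  printf '%s\n' "${flags[*]}"
}

# Print the helm command for review instead of executing it.
# print_helm_install <release> <namespace> <chart-ref> [sub-chart...]
# The chart reference is a placeholder: this page does not state the OCI path.
print_helm_install() {
  local release="$1" namespace="$2" chart="$3"
  shift 3
  printf 'helm install %s %s --namespace %s %s\n' \
    "$release" "$chart" "$namespace" "$(subchart_flags "$@")"
}

# Usage (SIE_CHART is a variable you set to the chart reference):
#   print_helm_install sie sie "$SIE_CHART" keda kube-prometheus-stack
```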

### Cluster identity

- **Workload Identity (GCP) or IRSA (AWS)** bound to a service account named `sie-server` in the SIE release namespace. This is how worker pods read the model cache bucket (GCS or S3) without static credentials. The Terraform examples create and bind this for you.
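
One way to confirm the binding is to inspect the annotations on the `sie-server` service account. The sketch below assumes the binding is annotation-based: `iam.gke.io/gcp-service-account` on GKE and `eks.amazonaws.com/role-arn` on EKS. GKE also supports bindings that leave no annotation, so a result of `none` is a prompt to investigate rather than proof of a missing binding. The function names are illustrative.

```bash
#!/usr/bin/env bash
# Classify a service account's annotations (passed as one string).
identity_kind() {
  case "$1" in
    *iam.gke.io/gcp-service-account*) echo "gcp-workload-identity" ;;
    *eks.amazonaws.com/role-arn*)     echo "aws-irsa" ;;
    *)                                echo "none" ;;
  esac
}

# Read the annotations of the sie-server service account in a namespace
# and report which identity mechanism they point to.
check_identity() {
  local namespace="$1" annotations
  annotations=$(kubectl get serviceaccount sie-server -n "$namespace" \
    -o jsonpath='{.metadata.annotations}')
  identity_kind "$annotations"
}

# Usage (replace "sie" with your release namespace):
#   check_identity sie
```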

### Network egress

The cluster must reach:

- `ghcr.io` for chart images (`sie-gateway`, `sie-server`, `sie-server-sidecar`, `sie-config`) and the OCI chart itself
- `huggingface.co` for model weights on first request (unless you pre-populate a cluster cache bucket via `sie-admin cache weights sync`)

Air-gapped environments must mirror both sources and point `workers.common.clusterCache.url` at a pre-populated S3 or GCS bucket.
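
A simple probe can confirm that both hosts are reachable before an install. The sketch below treats any HTTP response, including 401 or 404, as proof of reachability, because only a failed connection yields code `000` from `curl`. The result reflects the machine the script runs on, so run it from a pod or node inside the cluster to test cluster egress. The `/v2/` path and the function names are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Return success if the argument is a real three-digit HTTP status code.
# curl reports 000 when it could not connect at all.
http_reachable() {
  [[ "$1" =~ ^[0-9]{3}$ && "$1" != "000" ]]
}

# Probe a URL and report whether any HTTP response came back.
probe() {
  local url="$1" code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url" || true)
  if http_reachable "$code"; then
    echo "reachable   ${url} (HTTP ${code})"
  else
    echo "UNREACHABLE ${url}"
    return 1
  fi
}

# Usage:
#   probe https://ghcr.io/v2/
#   probe https://huggingface.co
```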

### Tokens and secrets

- **`HF_TOKEN`.** Required for gated Hugging Face models (e.g. `google/embeddinggemma-300m`, `naver/splade-v3`). Optional for the `BAAI/bge-m3` smoke test.
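
A deploy script can fail early when a gated model is requested without a token. The sketch below hard-codes only the two gated models named above, so the list is not exhaustive. It assumes the token is supplied through the `HF_TOKEN` environment variable; the function names are illustrative.

```bash
#!/usr/bin/env bash
# Return success if the model is one of the gated examples from this page.
# The list is illustrative; other gated models exist.
needs_hf_token() {
  case "$1" in
    google/embeddinggemma-300m|naver/splade-v3) return 0 ;;
    *) return 1 ;;
  esac
}

# Fail with a message if a gated model is requested while HF_TOKEN is empty.
require_hf_token() {
  local model="$1"
  if needs_hf_token "$model" && [[ -z "${HF_TOKEN:-}" ]]; then
    echo "HF_TOKEN is not set but ${model} is gated" >&2
    return 1
  fi
}

# Usage:
#   require_hf_token naver/splade-v3 || exit 1
```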

For cloud-account-level requirements (GCP project, GPU quotas, IAM roles, API enablement), see the **Prerequisites** section on the [GCP](https://superlinked.com/docs/deployment/cloud-gcp/) or [AWS](https://superlinked.com/docs/deployment/cloud-aws/) page.

---

## Frequently Asked Questions

**Can SIE run without a GPU?**
Yes. SIE runs on CPU and works well for development and low-traffic workloads. For production inference at scale, a GPU is strongly recommended, especially for batch encoding. See [Hardware and Capacity](https://superlinked.com/docs/deployment/resources/).

**How do I monitor a SIE deployment?**
SIE exposes Prometheus metrics and structured logs. See [Monitoring and Observability](https://superlinked.com/docs/deployment/monitoring/) for dashboards, alerting, and log configuration.

**How do I tune SIE for better performance?**
The main levers are batch size, worker concurrency, and model preloading. See [Performance Tuning](https://superlinked.com/docs/deployment/tuning/) for a step-by-step guide.

**How do I upgrade SIE without downtime?**
See the [Upgrade Runbook](https://superlinked.com/docs/deployment/upgrades/) for rolling upgrade procedures on both Docker and Kubernetes.

**Is there a managed cloud option?**
Superlinked offers managed SIE deployments for teams that do not want to manage infrastructure themselves. [Contact us](https://superlinked.com/) to learn more.
