Bring SIE up in a cluster with no public internet access. The worker pods normally pull model weights from HuggingFace and container images from GHCR; both of those need to come from inside your network instead.
This guide covers a typical air-gapped flow:
1. Snapshot model weights on a workstation that has internet access.
2. Mirror the snapshot to private S3-compatible storage reachable from the cluster.
3. Configure the chart to read weights from that store and skip HuggingFace.
4. Mirror the SIE container images to a private registry.
5. Verify first inference with no egress.
The same pattern works for “restricted egress” clusters that allow private object storage but block public HuggingFace.
The result is a directory in HuggingFace cache layout (./offline-weights/models--BAAI--bge-m3/snapshots/<sha>/...) that the chart can mount as HF_HUB_CACHE. The cache layout stores both blob files and snapshot symlinks, so the on-disk and mirrored sizes will be roughly 2x the model’s raw byte count. This is expected; it is not accidental duplication.
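As a sanity check before mirroring, the layout described above can be inspected with a few lines of Python. This is a minimal sketch, not part of SIE or huggingface_hub: the helper names repo_cache_dir and snapshot_sizes are illustrative, and the folder naming simply follows the models--BAAI--bge-m3 path shown above. It reports the real bytes under blobs/ and the bytes a symlink-following copy of snapshots/ would transfer on top of them.

```python
import os
from pathlib import Path


def repo_cache_dir(cache_root: str, repo_id: str) -> Path:
    """Map a model repo id to its folder in the HuggingFace cache layout."""
    return Path(cache_root) / ("models--" + repo_id.replace("/", "--"))


def snapshot_sizes(repo_dir: Path) -> tuple[int, int]:
    """Return (blob_bytes, snapshot_bytes) for one cached repo.

    blob_bytes counts the real files under blobs/; snapshot_bytes counts what
    a copy that follows the symlinks under snapshots/ would transfer.
    """
    def tree_bytes(root: Path) -> int:
        total = 0
        for dirpath, _dirs, files in os.walk(root, followlinks=True):
            for name in files:
                # os.stat follows symlinks, so a link counts as its target size.
                total += os.stat(os.path.join(dirpath, name)).st_size
        return total

    return tree_bytes(repo_dir / "blobs"), tree_bytes(repo_dir / "snapshots")


if __name__ == "__main__":
    repo = repo_cache_dir("./offline-weights", "BAAI/bge-m3")
    blobs, snaps = snapshot_sizes(repo)
    print(f"{repo}: blobs={blobs} bytes, snapshots={snaps} bytes")
```

If the two numbers are about equal, the snapshot is complete and the doubled mirrored size is explained by the layout rather than by a bad sync.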
Point the chart’s workers.common.clusterCache at the mirrored bucket. The Python sie-server adapter containers in worker pods read weights from there instead of HuggingFace.
# values-offline.yaml
workers:
  common:
    extraEnv:
      - name: SIE_HF_FALLBACK
        value: "false"
    clusterCache:
      enabled: true
      url: s3://sie-models-private/weights/  # or gs:// for GCS
    hfCache:
      home: /models/huggingface
      tokenSecret: ""

# Skip HF token wiring entirely in air-gapped clusters
hfToken:
  create: false
For S3, worker pods authenticate via IRSA (EKS) or static credentials supplied through extraEnv. For GCS, they use Workload Identity (GKE). For MinIO or other S3-compatibles, mount credentials via a secret and pass them through workers.common.extraEnv.
sie-server is only published with -{platform}-{bundle} suffixes. ghcr.io/superlinked/sie-server:v0.6.6 (plain) does not exist, and the chart’s worker template assembles the full tag from workers.common.image.tag + -${platform}-${bundle} at install time.
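The tag assembly described above can be mirrored in a few lines when scripting the image list. This is a sketch of the naming rule only; the function name worker_image is illustrative, and the example values (v0.6.6, cuda12, default) are the ones used elsewhere in this guide.

```python
def worker_image(tag: str, platform: str, bundle: str,
                 repository: str = "ghcr.io/superlinked/sie-server") -> str:
    """Compose the full image reference: <repository>:<tag>-<platform>-<bundle>."""
    for name, value in (("tag", tag), ("platform", platform), ("bundle", bundle)):
        if not value:
            # A missing part would yield a tag that is not published.
            raise ValueError(f"{name} must be non-empty; there is no plain ':<tag>' image")
    return f"{repository}:{tag}-{platform}-{bundle}"


if __name__ == "__main__":
    print(worker_image("v0.6.6", "cuda12", "default"))
    # ghcr.io/superlinked/sie-server:v0.6.6-cuda12-default
```

Generating the reference this way keeps the mirrored tag identical to the one the worker template will request at install time.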
The ghcr.io/superlinked/sie-server-sidecar image backs the SIE server sidecar in Kubernetes. Helm renders that sidecar as the worker-sidecar container for release compatibility.
The chart also pulls NATS images via the bundled nats sub-chart when nats.install=true, which is the default. For a truly air-gapped cluster, where no node has public egress, these must be mirrored too:
| Image | Source |
| --- | --- |
| nats:2.12.6-alpine | docker.io / nats.io |
| natsio/nats-server-config-reloader:0.21.1 | docker.io |
| natsio/nats-box:0.19.3 | docker.io |
If you enable optional sub-charts (keda.install=true, kube-prometheus-stack.install=true, dcgm-exporter.install=true, loki.install=true, alloy.install=true), each pulls additional images. Run helm template oci://ghcr.io/superlinked/charts/sie-cluster --version 0.6.6 -f values-offline.yaml | grep -oE 'image:.*' | sort -u to extract the full set for your config.
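If the grep pipeline is awkward to reuse in a mirroring script, the same extraction can be done in Python on the rendered manifest text. This is a rough equivalent under the assumption that every image appears on its own image: line, as in standard Kubernetes manifests; the function name extract_images is illustrative.

```python
import re

# Matches `image: ref` and `- image: "ref"`, capturing only the reference.
IMAGE_LINE = re.compile(r"""^\s*(?:-\s*)?image:\s*["']?([^"'\s#]+)""")


def extract_images(rendered: str) -> list[str]:
    """Return the sorted, de-duplicated image references in rendered manifests."""
    found = set()
    for line in rendered.splitlines():
        match = IMAGE_LINE.match(line)
        if match:
            found.add(match.group(1))
    return sorted(found)


if __name__ == "__main__":
    sample = """
    containers:
      - name: nats
        image: nats:2.12.6-alpine
      - name: box
        image: "natsio/nats-box:0.19.3"
    """
    for image in extract_images(sample):
        print(image)
```

In practice the input would be the saved output of the helm template command above rather than the inline sample.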
Mirror the SIE images once:
TAG=v0.6.6
PLATFORM=cuda12  # or `cpu` for a CPU-only worker pool
BUNDLE=default
# sie-server: platform/bundle suffix is required. There is no plain `:$TAG` tag.
Note on architecture mismatch: docker pull on a host whose architecture differs from the cluster nodes’ (e.g. an arm64 Mac mirroring images for an amd64 EKS cluster) will silently pull the wrong platform unless you pass --platform, and the subsequent docker push will publish a multi-arch index with only the pulled platforms. Worker pods on a mismatched node arch will then fail with no match for platform in manifest. For arch-safe mirroring use crane (brew install crane); it copies all platforms without going through the host’s container runtime:
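When several images need mirroring, the crane invocations can be generated rather than typed by hand. The sketch below only builds crane copy SRC DST command strings; it does not run them. The registry name registry.internal.example is a placeholder, and keeping the source repository path under the private registry is an assumption about your registry layout, not something the chart requires.

```python
import shlex


def retarget(image: str, private_registry: str) -> str:
    """Rewrite an image reference to live under the private registry."""
    first, _, rest = image.partition("/")
    # Docker convention: the first path component is a registry host only if it
    # contains a dot or a colon, or is exactly "localhost".
    has_host = bool(rest) and ("." in first or ":" in first or first == "localhost")
    path = rest if has_host else image
    return f"{private_registry}/{path}"


def crane_copy_commands(images: list[str], private_registry: str) -> list[str]:
    """Build one `crane copy SRC DST` command line per image."""
    return [
        shlex.join(["crane", "copy", img, retarget(img, private_registry)])
        for img in images
    ]


if __name__ == "__main__":
    images = [
        "ghcr.io/superlinked/sie-server:v0.6.6-cuda12-default",
        "nats:2.12.6-alpine",
    ]
    for cmd in crane_copy_commands(images, "registry.internal.example"):
        print(cmd)
```

Printing the commands first makes it easy to review the destination names before anything is pushed.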
To mirror the chart itself as well (recommended for fully air-gapped clusters), pull it once with helm pull oci://ghcr.io/superlinked/charts/sie-cluster --version 0.6.6 and install from the local .tgz:
For a CPU worker pool (workers.common.platform: cpu, workers.pools.cpu.enabled: true, useful for local clusters or small offline deployments without a GPU):
python3 -c "
from sie_sdk import SIEClient
client = SIEClient('http://localhost:8080')
result = client.encode('BAAI/bge-m3', {'text': 'hello world'})
print(result)
"
The first request still pays the cold-start cost, but the weight load now comes from your private store rather than HuggingFace. CPU inference will be substantially slower than GPU for the same model.
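To confirm that the successful request really happened without egress, it can help to check from inside the cluster that the public endpoints are unreachable. The following is a minimal sketch using only the standard library; the function name egress_blocked is illustrative, and it tests plain TCP reachability, which is an approximation of "no egress" rather than a full audit.

```python
import socket


def egress_blocked(host: str, port: int = 443, timeout: float = 3.0,
                   connect=socket.create_connection) -> bool:
    """Return True if a TCP connection to host:port cannot be opened."""
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # DNS failure, connection refusal, and timeout all count as blocked.
        return True
    conn.close()
    return False
```

Calling egress_blocked("huggingface.co") and egress_blocked("ghcr.io") from a worker pod should both return True in an air-gapped cluster. The connect parameter exists only so the function can be exercised without touching the network.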