---
title: Configuration
description: Environment variables for SIE server and cluster components.
canonical_url: https://superlinked.com/docs/reference/configuration
last_updated: 2026-06-11
---

SIE reads runtime configuration from environment variables. CLI arguments override environment variables, which in turn override built-in defaults. In Kubernetes, Helm values are rendered separately into the gateway, config service, and worker-pod containers. Each worker pod runs the SIE server sidecar alongside the Python `sie-server` adapter.

## Server Configuration

Source: [packages/sie_server/src/sie_server/cli.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/cli.py)

Core settings for device selection, model loading, and server behavior.

| Variable | Default | Description |
|----------|---------|-------------|
| `SIE_DEVICE` | `auto` | Inference device. Options: `auto` (detect GPU), `cuda`, `cuda:0`, `mps`, `cpu` |
| `SIE_MODELS_DIR` | `./models` | Path to model configs directory. Supports local paths, `s3://`, or `gs://` URLs |
| `SIE_MODEL_FILTER` | None | Comma-separated list of model names to load. If unset, all models are available |
| `SIE_GPU_TYPE` | Auto-detected | Override detected GPU type for routing (e.g., `l4`, `a100-80gb`, `h100`) |
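The filter semantics in the table (unset means every model is available, otherwise only listed names load) can be illustrated with a small shell sketch. `model_allowed` is a hypothetical helper written for this page, not SIE's parser, and the model names other than `bge-m3` are placeholders. The sketch matches exact names and does not trim whitespace around commas.

```bash
#!/usr/bin/env bash
# Hypothetical helper illustrating SIE_MODEL_FILTER semantics (not SIE's parser):
# unset or empty means every model is available; otherwise only exact names from
# the comma-separated list pass. This sketch does not trim whitespace.
model_allowed() {
  local name=$1 entry
  local -a entries
  if [[ -z ${SIE_MODEL_FILTER:-} ]]; then
    return 0
  fi
  IFS=',' read -r -a entries <<< "$SIE_MODEL_FILTER"
  for entry in "${entries[@]}"; do
    if [[ $entry == "$name" ]]; then
      return 0
    fi
  done
  return 1
}

export SIE_DEVICE=cuda:0                 # pin inference to the first GPU
export SIE_MODEL_FILTER=bge-m3,e5-large  # hypothetical model names
model_allowed bge-m3 && echo "bge-m3 is loaded"
model_allowed clip-vit || echo "clip-vit is not loaded"
```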

### Cache Configuration

Source: [packages/sie_server/src/sie_server/core/disk_cache.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/core/disk_cache.py)

Control where model weights are stored and retrieved.

| Variable | Default | Description |
|----------|---------|-------------|
| `SIE_LOCAL_CACHE` | Value of `HF_HOME` | Local cache directory for model weights |
| `SIE_CLUSTER_CACHE` | None | Cluster cache URL for shared weights (`s3://` or `gs://`) |
| `SIE_HF_FALLBACK` | `true` | Allow Hugging Face Hub downloads when both the local and cluster caches miss |

**Cache resolution order:**
1. Local cache (`SIE_LOCAL_CACHE`)
2. Cluster cache (`SIE_CLUSTER_CACHE`)
3. Hugging Face Hub (if `SIE_HF_FALLBACK=true`)
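The lookup order above can be sketched as a function that lists the sources SIE would consult, in order, for a given environment. This only models the order; `disk_cache.py` performs the real lookups and downloads. Two assumptions are labeled in the comments: the fallback to Hugging Face's own default directory when `HF_HOME` is unset, and treating only the literal `true` as enabling the Hub fallback.

```bash
#!/usr/bin/env bash
# Prints weight sources in the documented lookup order (order only; no I/O).
cache_sources() {
  # Assumption: an unset HF_HOME falls back to Hugging Face's default location.
  echo "local ${SIE_LOCAL_CACHE:-${HF_HOME:-$HOME/.cache/huggingface}}"
  if [[ -n ${SIE_CLUSTER_CACHE:-} ]]; then
    echo "cluster $SIE_CLUSTER_CACHE"
  fi
  # Assumption: only the literal value "true" enables the Hub fallback here.
  if [[ ${SIE_HF_FALLBACK:-true} == true ]]; then
    echo "hub huggingface.co"
  fi
}

SIE_LOCAL_CACHE=/mnt/nvme/cache \
SIE_CLUSTER_CACHE=s3://my-bucket/weights/ \
SIE_HF_FALLBACK=false cache_sources
# prints:
# local /mnt/nvme/cache
# cluster s3://my-bucket/weights/
```

Disabling the fallback this way is useful for air-gapped clusters where every model must already be in one of the two caches.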

---

## Batching Configuration

Source: [packages/sie_server/src/sie_server/config/engine.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/config/engine.py)

Source: [packages/sie_server_sidecar/src/config.rs](https://github.com/superlinked/sie/blob/main/packages/sie_server_sidecar/src/config.rs)

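Control request batching for GPU efficiency. Standalone `sie-server` uses the Python batching knobs (`SIE_MAX_BATCH_REQUESTS`, `SIE_MAX_BATCH_WAIT_MS`, `SIE_MAX_CONCURRENT_REQUESTS`). Kubernetes queue-mode clusters use the SIE server sidecar knobs (`SIE_RUST_PIPELINE_DEPTH`, `SIE_BATCHER_*`, `SIE_ADAPTIVE_*`).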

| Variable | Default | Description |
|----------|---------|-------------|
| `SIE_MAX_BATCH_REQUESTS` | `64` | Maximum requests per batch |
| `SIE_MAX_BATCH_WAIT_MS` | `10` | Maximum milliseconds to wait for a batch to fill |
| `SIE_MAX_CONCURRENT_REQUESTS` | `512` | Maximum concurrent requests (queue size) |
| `SIE_RUST_PIPELINE_DEPTH` | `2` | Queue-mode SIE server sidecar IPC dispatch depth |
| `SIE_BATCHER_COALESCE_MS` | `5` | Queue-mode SIE server sidecar coalesce window in milliseconds |
| `SIE_BATCHER_MAX_BATCH_REQUESTS` | `12` | Queue-mode SIE server sidecar item cap per batch |
| `SIE_ADAPTIVE_MIN_QUANTUM_MS` | `2` | Queue-mode pull-loop wait floor in milliseconds |
| `SIE_ADAPTIVE_MAX_QUANTUM_MS` | `15` | Queue-mode pull-loop wait ceiling in milliseconds |
| `SIE_ADAPTIVE_TARGET_P50_MS` | `50` | Queue-mode pull-loop p50 latency target in milliseconds |

**Tuning guidance:**
- For higher throughput on high-memory GPUs, increase `SIE_MAX_BATCH_REQUESTS` (standalone, e.g. Docker) or the Helm value `workerSidecar.batcher.maxBatchRequests` (queue mode)
- For lower latency at the cost of smaller batches, decrease `SIE_MAX_BATCH_WAIT_MS` (standalone, e.g. Docker) or the Helm value `workerSidecar.batcher.coalesceMs` (queue mode)
- Size `SIE_MAX_CONCURRENT_REQUESTS` to absorb expected burst traffic, since it caps the standalone request queue

Prefer Helm values for queue-mode clusters, for example `workers.common.workerSidecar.batcher.coalesceMs`, so the chart and rendered environment stay in sync.
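The throughput and latency trade-off behind these knobs can be seen in a toy size-or-deadline batcher. This is a simplified model, not SIE's batcher: it assumes a batch closes when it reaches the request cap or when the wait window, measured from the batch's first request, has passed. A real batcher dispatches at the deadline without waiting for the next arrival, but the resulting grouping is the same. The first argument plays the role of `SIE_MAX_BATCH_REQUESTS` and the second of `SIE_MAX_BATCH_WAIT_MS` (or the sidecar coalesce window).

```bash
#!/usr/bin/env bash
# Toy model of size-or-deadline batching (NOT SIE's actual batcher).
# Usage: simulate_batches MAX_REQ MAX_WAIT_MS ARRIVAL_MS...
# Arrival times must be ascending; prints the size of each batch.
simulate_batches() {
  local max_req=$1 max_wait=$2
  shift 2
  local -a sizes=()
  local start=0 count=0 t
  for t in "$@"; do
    # Close the open batch if it is full or its wait window has expired.
    if (( count > 0 && (count >= max_req || t > start + max_wait) )); then
      sizes+=("$count")
      count=0
    fi
    if (( count == 0 )); then
      start=$t   # the first request of a batch starts its wait window
    fi
    count=$(( count + 1 ))
  done
  if (( count > 0 )); then
    sizes+=("$count")
  fi
  echo "${sizes[*]}"
}

# Same traffic, two wait windows: a longer wait yields fewer, larger batches.
simulate_batches 64 10 0 1 2 3 12 13 14 30   # prints: 4 3 1
simulate_batches 64 1  0 1 2 3 12 13 14 30   # prints: 2 2 2 1 1
```

With the 10 ms window the first request waits up to 10 ms for company; with 1 ms it dispatches almost immediately, so each GPU call carries fewer items.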

---

## Gateway and Cluster Configuration

Source: [packages/sie_gateway/src/config.rs](https://github.com/superlinked/sie/blob/main/packages/sie_gateway/src/config.rs)

Source: [packages/sie_server_sidecar/src/main.rs](https://github.com/superlinked/sie/blob/main/packages/sie_server_sidecar/src/main.rs)

Source: [packages/sie_config/src/sie_config/config_api.py](https://github.com/superlinked/sie/blob/main/packages/sie_config/src/sie_config/config_api.py)

Helm normally renders these variables in Kubernetes. Set them by hand only when running the Rust gateway, `sie-server-sidecar`, or `sie-config` outside Helm.

| Variable | Default | Description |
|----------|---------|-------------|
| `SIE_NATS_URL` | None | NATS URL for queued inference, result inboxes, SIE server sidecar health, and config deltas |
| `SIE_GATEWAY_HEALTH_MODE` | `ws` (raw CLI), `nats` (Helm) | Health source used by the gateway. Helm renders `nats` for the SIE server sidecar path |
| `SIE_GATEWAY_CONFIGURED_GPUS` | None | Comma-separated machine profiles available for routing and scale-from-zero |
| `SIE_CONFIG_SERVICE_URL` | None | `sie-config` base URL used by the gateway and by SIE server sidecar drift polling |
| `SIE_PAYLOAD_STORE_URL` | None | Shared payload store for large queued requests (`s3://`, `gs://`, or local path) |
| `SIE_ADMIN_TOKEN` | None | Admin bearer token for config writes and config export reads |
| `SIE_AUTH_MODE` | `none` | Gateway auth mode: `none`, `static`, or `token` |
| `SIE_AUTH_TOKEN`, `SIE_AUTH_TOKENS` | None | Bearer tokens accepted by protected gateway routes |
| `SIE_NATS_CONFIG_TRUSTED_PRODUCERS` | `sie-config` | Comma-separated producer IDs trusted for config-delta subjects |

The static worker URL and Kubernetes endpoint discovery variables (`SIE_GATEWAY_WORKERS`, `SIE_GATEWAY_KUBERNETES`, `SIE_GATEWAY_K8S_*`) only serve local diagnostics with WebSocket health (`SIE_GATEWAY_HEALTH_MODE=ws`). Queue-mode Helm deployments route work through NATS instead.
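For a hand-run gateway outside Helm, an environment along these lines is a reasonable starting point. It is a sketch, assuming a NATS server on its default client port (4222) on the same host. The `sie-config` address, machine-profile names, and token values are hypothetical placeholders, and the sketch assumes that `static` auth mode checks requests against `SIE_AUTH_TOKEN`.

```bash
#!/usr/bin/env bash
# Hand-run gateway environment (placeholder values; not rendered by Helm).
export SIE_NATS_URL=nats://localhost:4222              # NATS default client port
export SIE_GATEWAY_HEALTH_MODE=nats                    # same health source Helm renders
export SIE_CONFIG_SERVICE_URL=http://localhost:8081    # hypothetical sie-config address
export SIE_GATEWAY_CONFIGURED_GPUS=l4,a100-80gb        # hypothetical machine profiles
export SIE_PAYLOAD_STORE_URL=/var/lib/sie/payloads     # a local path only works if all components share it
export SIE_AUTH_MODE=static
export SIE_AUTH_TOKEN=change-me                        # assumed token source for static mode
export SIE_ADMIN_TOKEN=change-me-too                   # bearer token for config writes and exports
```

Clients would then send `Authorization: Bearer change-me` on protected routes. Replace both tokens with long random values before exposing the gateway.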

---

## SIE Server Sidecar Configuration

Source: [packages/sie_server_sidecar/src/config.rs](https://github.com/superlinked/sie/blob/main/packages/sie_server_sidecar/src/config.rs)

The SIE server sidecar runs beside the Python `sie-server` adapter in each worker pod.
Helm renders the sidecar container as `worker-sidecar`.
The sidecar pulls from JetStream, batches by model and operation, calls the adapter over Unix domain socket IPC, publishes results, and emits sidecar health over NATS.

| Variable | Default | Description |
|----------|---------|-------------|
| `SIE_POOL` | `_default` | Worker pool name, also used in JetStream stream and subject names |
| `SIE_BUNDLE` | `default` | Bundle ID used for the durable consumer and config subscription |
| `SIE_IPC_SOCKET_PATH` | `/tmp/sie-ipc.sock` | Unix socket path for SIE server sidecar to `sie-server` adapter IPC |
| `SIE_MAX_CONCURRENT_BATCHES` | `4` | Maximum concurrent sidecar batches |
| `SIE_IPC_POOL_SIZE` | Matches `SIE_MAX_CONCURRENT_BATCHES` when unset | Concurrent IPC connections to the Python `sie-server` adapter |
| `SIE_WORKER_METRICS_PORT` | `9095` | SIE server sidecar `/metrics`, `/healthz`, and `/readyz` port |
| `SIE_WORKER_ID` | Pod hostname or generated UUID | Stable worker ID surfaced in logs, results, and NATS health |
| `SIE_MACHINE_PROFILE` | `SIE_POOL` | Machine-profile label used by gateway routing |
| `SIE_GPU_COUNT` | `1` | GPU count advertised in SIE server sidecar health |
| `SIE_GATEWAY_URL` | None | Gateway base URL for worker-side pool admission checks |
| `SIE_POOL_ADMISSION_ENABLED` | `true` | Enable SIE server sidecar pool admission before pulling work |
| `SIE_WORKER_PING_INTERVAL_MS` | `2000` | IPC ping cadence used for SIE server sidecar readiness |
| `SIE_WORKER_READYZ_STALE_MULT` | `3` | Readiness staleness multiplier applied to the ping interval |
| `SIE_WORKER_CONFIG_POLL_INTERVAL_MS` | `30000` | Worker-side config epoch poll interval |
| `SIE_WORKER_CONFIG_FULL_EXPORT_INTERVAL_MS` | `300000` | Slow full-export reconcile interval. Set `0` to disable after startup |
| `SIE_HEALTH_PUBLISH_INTERVAL_MS` | `5000` | NATS SIE server sidecar health publish interval |

Batch and pull-loop knobs for the SIE server sidecar are listed in [Batching Configuration](#batching-configuration).
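Two sidecar values are derived from others: the IPC pool size falls back to `SIE_MAX_CONCURRENT_BATCHES`, and the readiness staleness window is the ping interval times the multiplier. The sketch below reproduces that arithmetic from the documented defaults. It assumes `/readyz` reports stale once no successful ping has landed within that window; `config.rs` remains the source of truth.

```bash
#!/usr/bin/env bash
# Derives effective sidecar values from the documented defaults (illustrative).
sidecar_effective() {
  local batches=${SIE_MAX_CONCURRENT_BATCHES:-4}
  local ipc_pool=${SIE_IPC_POOL_SIZE:-$batches}        # falls back to batch concurrency
  local ping_ms=${SIE_WORKER_PING_INTERVAL_MS:-2000}
  local mult=${SIE_WORKER_READYZ_STALE_MULT:-3}
  # Assumption: readiness goes stale after ping_ms * mult without a successful ping.
  echo "ipc_pool=$ipc_pool readyz_stale_after_ms=$(( ping_ms * mult ))"
}

sidecar_effective
# prints (with none of the variables set): ipc_pool=4 readyz_stale_after_ms=6000
SIE_MAX_CONCURRENT_BATCHES=8 SIE_WORKER_PING_INTERVAL_MS=1000 sidecar_effective
# prints: ipc_pool=8 readyz_stale_after_ms=3000
```

To inspect readiness by hand, query the metrics port directly, for example `curl -fsS http://<pod-ip>:9095/readyz`.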

---

## Memory Configuration

Source: [packages/sie_server/src/sie_server/core/memory.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/core/memory.py)

Control memory pressure thresholds and LRU eviction behavior.

| Variable | Default | Description |
|----------|---------|-------------|
| `SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT` | `85` | VRAM usage percent that triggers LRU eviction (0-100) |
| `SIE_DISK_CACHE_ENABLED` | `true` | Enable LRU disk cache for model weights |
| `SIE_DISK_PRESSURE_THRESHOLD_PERCENT` | `85` | Disk usage percent that triggers LRU eviction of cached weights |
| `SIE_IDLE_EVICT_S` | (unset) | Unload models idle for N seconds. Disabled by default; set e.g. `300` for a 5-minute idle TTL. |
| `SIE_PRELOAD_MODELS` | (unset) | Comma-separated list of model IDs to load eagerly at server startup instead of lazily on first request. |

**How LRU eviction works:**
1. Background monitor checks memory usage periodically
2. When usage exceeds `SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT`, the least-recently-used model is evicted
3. Evicted models are reloaded on demand when the next request for them arrives
4. Set `SIE_IDLE_EVICT_S` to also evict models that have been idle for too long, regardless of memory pressure
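The pressure-driven part of the steps above can be illustrated with a toy LRU loop (bash 4+ for associative arrays). This is not `memory.py`: the model names, sizes, and last-used timestamps are hypothetical, "usage" is simply the sum of loaded model sizes over a fixed 16 GiB budget, and the sketch keeps evicting until usage is back at or below the threshold, whereas SIE's monitor may evict one model per check.

```bash
#!/usr/bin/env bash
# Toy LRU eviction under a memory-pressure threshold (NOT SIE's memory.py).
declare -A last_used=([bge-m3]=100 [e5-large]=40 [clip-vit]=75)  # last request time (s)
declare -A vram_gb=([bge-m3]=6 [e5-large]=4 [clip-vit]=5)        # resident size (GiB)
total_gb=16
threshold=${SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT:-85}

used_percent() {
  local sum=0 m
  for m in "${!vram_gb[@]}"; do
    sum=$(( sum + ${vram_gb[$m]} ))
  done
  echo $(( sum * 100 / total_gb ))
}

# Evict the least-recently-used model until usage is at or below the threshold.
evict_lru() {
  local m victim oldest
  while (( $(used_percent) > threshold )); do
    victim="" oldest=""
    for m in "${!last_used[@]}"; do
      if [[ -z $oldest ]] || (( ${last_used[$m]} < oldest )); then
        victim=$m oldest=${last_used[$m]}
      fi
    done
    [[ -z $victim ]] && break    # nothing left to evict
    echo "evict $victim"
    unset "last_used[$victim]" "vram_gb[$victim]"
  done
}

evict_lru   # 15 of 16 GiB is 93% > 85%, so this prints: evict e5-large
```

An evicted model simply disappears from the tables here; in SIE the next request for it triggers a reload from the weight caches.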

---

## Logging Configuration

Source: [packages/sie_server/src/sie_server/core/logging.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/core/logging.py)

Control log format and verbosity.

| Variable | Default | Description |
|----------|---------|-------------|
| `SIE_LOG_JSON` | `false` | Enable structured JSON logging for Loki compatibility |

With `SIE_LOG_JSON=true`, each log line is a JSON object with structured fields such as:

```json
{
  "timestamp": "2025-12-18T10:30:00Z",
  "level": "INFO",
  "logger": "sie_server.core.registry",
  "message": "Inference completed",
  "model": "bge-m3",
  "request_id": "abc123",
  "trace_id": "def456",
  "latency_ms": 45.2
}
```

---

## Tracing Configuration

Source: [packages/sie_server/src/sie_server/observability/tracing.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/observability/tracing.py)

Enable OpenTelemetry distributed tracing.

| Variable | Default | Description |
|----------|---------|-------------|
| `SIE_TRACING_ENABLED` | `false` | Enable OpenTelemetry tracing |

When tracing is enabled, SIE respects standard OpenTelemetry environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `OTEL_SERVICE_NAME` | `sie-server` | Service name in traces |
| `OTEL_TRACES_EXPORTER` | `otlp` | Exporter type (`otlp`, `console`, `none`) |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | `http://localhost:4317` | OTLP collector endpoint |
| `OTEL_TRACES_SAMPLER` | `always_on` | Sampling strategy |
| `OTEL_TRACES_SAMPLER_ARG` | `1.0` | Sampling rate (for `traceidratio` sampler) |
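For example, to keep roughly 10% of traces and ship them to a collector, the standard OpenTelemetry sampler `traceidratio` takes its probability from `OTEL_TRACES_SAMPLER_ARG`. This sketch assumes SIE passes the standard sampler names through to the OpenTelemetry SDK; the service name is a hypothetical example.

```bash
#!/usr/bin/env bash
# Sample about 10% of traces and export them over OTLP (gRPC port 4317).
export SIE_TRACING_ENABLED=true
export OTEL_SERVICE_NAME=sie-server-pool-a            # hypothetical per-pool service name
export OTEL_TRACES_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
export OTEL_TRACES_SAMPLER=traceidratio               # standard OTel ratio sampler
export OTEL_TRACES_SAMPLER_ARG=0.1                    # keep ~10% of trace IDs
```

For local debugging, `OTEL_TRACES_EXPORTER=console` prints spans to stdout instead of requiring a collector. `parentbased_traceidratio` is the standard variant that honors an upstream sampling decision.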

---

## Performance Configuration

Source: [packages/sie_server/src/sie_server/config/engine.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/config/engine.py)

Advanced settings for preprocessing parallelism, attention implementation, and compute precision.

| Variable | Default | Description |
|----------|---------|-------------|
| `SIE_PREPROCESSOR_WORKERS` | `4` | Number of preprocessing worker threads |
| `SIE_IMAGE_WORKERS` | `4` | Image preprocessing worker threads (for VLMs) |
| `SIE_ATTENTION_BACKEND` | `auto` | Attention implementation: `auto`, `flash_attention_2`, `sdpa`, `eager` |
| `SIE_DEFAULT_COMPUTE_PRECISION` | `float16` | Default compute precision: `float16`, `bfloat16`, `float32` |
| `SIE_INSTRUMENTATION` | `false` | Enable detailed batch statistics for debugging |
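As an illustrative combination for NVIDIA Ampere or newer GPUs (such as A100 or L4), which support `bfloat16` natively, the settings below trade `float16`'s extra mantissa precision for `bfloat16`'s wider exponent range. Mapping `sdpa` to PyTorch's `scaled_dot_product_attention` is an assumption based on the common naming, and the worker count is a placeholder to tune against available CPU cores.

```bash
#!/usr/bin/env bash
# Illustrative precision and attention settings for an Ampere-or-newer GPU.
export SIE_DEFAULT_COMPUTE_PRECISION=bfloat16   # wider exponent range than float16
export SIE_ATTENTION_BACKEND=sdpa               # presumably PyTorch scaled_dot_product_attention
export SIE_PREPROCESSOR_WORKERS=8               # placeholder; size to available CPU cores
export SIE_INSTRUMENTATION=false                # keep batch statistics off in production
```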

---

## LoRA Configuration

Source: [packages/sie_server/src/sie_server/config/engine.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/config/engine.py)

Control LoRA adapter loading behavior.

| Variable | Default | Description |
|----------|---------|-------------|
| `SIE_MAX_LORAS_PER_MODEL` | `10` | Maximum LoRA adapters to keep loaded per model |

When the limit is reached, the least-recently-used LoRA adapter is evicted.

---

## Example: Production Configuration

```bash
# High-throughput production setup
export SIE_DEVICE=cuda
export SIE_MODELS_DIR=s3://my-bucket/models/
export SIE_CLUSTER_CACHE=s3://my-bucket/weights/
export SIE_LOCAL_CACHE=/mnt/nvme/cache

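# Standalone batching sized for a high-memory GPU such as A100-80GB
# (queue-mode Helm clusters set workerSidecar.batcher values instead)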
export SIE_MAX_BATCH_REQUESTS=128
export SIE_MAX_BATCH_WAIT_MS=5
export SIE_MAX_CONCURRENT_REQUESTS=1024

# Memory management
export SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT=90

# Observability
export SIE_LOG_JSON=true
export SIE_TRACING_ENABLED=true
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
```

---

## Example: Development Configuration

```bash
# Local development setup
export SIE_DEVICE=mps  # or cuda, cpu
export SIE_MODELS_DIR=./models

# Smaller batches and a shorter wait for faster iteration
export SIE_MAX_BATCH_REQUESTS=8
export SIE_MAX_BATCH_WAIT_MS=1

# Detailed batch statistics for debugging
export SIE_INSTRUMENTATION=true
```

---

## What's Next

- [CLI Reference](/docs/reference/cli/) - Command-line options that map to these variables
- [HTTP API Reference](/docs/reference/api/) - Endpoints exposed by the configured server