Configuration

SIE uses environment variables for runtime configuration. CLI arguments override environment variables, which override defaults. In Kubernetes, Helm values render the gateway, config service, and worker-pod containers separately. The worker pod contains the SIE server sidecar and the Python sie-server adapter.
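
The precedence rule above can be sketched in a few lines. This is a minimal illustration, not SIE's actual loader: `resolve_setting` and its parameters are hypothetical names, and the environment is passed in as a plain mapping so the example is reproducible.

```python
import os


def resolve_setting(name, cli_value=None, default=None, environ=None):
    """Return the CLI value if given, else the environment value, else the default.

    Hypothetical helper: it only illustrates the documented precedence
    (CLI argument > environment variable > default).
    """
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    if name in environ:
        return environ[name]
    return default


# An explicit mapping stands in for the process environment.
env = {"SIE_DEVICE": "cuda"}
print(resolve_setting("SIE_DEVICE", cli_value="cpu", default="auto", environ=env))  # cpu
print(resolve_setting("SIE_DEVICE", default="auto", environ=env))  # cuda
print(resolve_setting("SIE_DEVICE", default="auto", environ={}))  # auto
```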

Server Configuration

Core settings for device selection, model loading, and server behavior.

Variable	Default	Description
`SIE_DEVICE`	`auto`	Inference device. Options: `auto` (detect GPU), `cuda`, `cuda:0`, `mps`, `cpu`
`SIE_MODELS_DIR`	`./models`	Path to model configs directory. Supports local paths, `s3://`, or `gs://` URLs
`SIE_MODEL_FILTER`	None	Comma-separated list of model names to load. If unset, all models are available
`SIE_GPU_TYPE`	Auto-detected	Override detected GPU type for routing (e.g., `l4`, `a100-80gb`, `h100`)

Cache Configuration

Control where model weights are stored and retrieved.

Variable	Default	Description
`SIE_LOCAL_CACHE`	`HF_HOME`	Local cache directory for model weights
`SIE_CLUSTER_CACHE`	None	Cluster cache URL for shared weights (`s3://` or `gs://`)
`SIE_HF_FALLBACK`	`true`	Allow HuggingFace Hub downloads after cache miss

Cache resolution order:

1. Local cache (SIE_LOCAL_CACHE)
2. Cluster cache (SIE_CLUSTER_CACHE)
3. HuggingFace Hub (if SIE_HF_FALLBACK=true)
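
As a rough sketch, the resolution order amounts to trying each tier in turn and stopping at the first hit. The function and lookup callables below are hypothetical, and raising an error when every tier misses with the fallback disabled is an assumption, since the behavior in that case is not described here.

```python
def resolve_weights(model_id, local_lookup, cluster_lookup, hf_fallback=True):
    """Return (tier, location) for the first tier that holds the weights.

    local_lookup and cluster_lookup are hypothetical callables that return
    a location string on a hit and None on a miss.
    """
    location = local_lookup(model_id)
    if location is not None:
        return ("local", location)

    location = cluster_lookup(model_id)
    if location is not None:
        return ("cluster", location)

    if hf_fallback:
        # Corresponds to SIE_HF_FALLBACK=true: fall through to the Hub.
        return ("hub", model_id)

    # Assumption: with the fallback disabled, a miss everywhere is an error.
    raise FileNotFoundError(f"weights for {model_id!r} not found in any cache")


# Dictionaries stand in for the two caches; the paths are made up.
local = {"bge-m3": "/mnt/nvme/cache/bge-m3"}
cluster = {"other-model": "s3://my-bucket/weights/other-model"}

print(resolve_weights("bge-m3", local.get, cluster.get))
print(resolve_weights("other-model", local.get, cluster.get))
print(resolve_weights("uncached-model", local.get, cluster.get))
```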

Batching Configuration

Control request batching behavior for GPU efficiency. Standalone sie-server uses the Python batching knobs. Kubernetes queue-mode clusters use the SIE server sidecar knobs.

Variable	Default	Description
`SIE_MAX_BATCH_REQUESTS`	`64`	Maximum requests per batch
`SIE_MAX_BATCH_WAIT_MS`	`10`	Maximum milliseconds to wait for batch to fill
`SIE_MAX_CONCURRENT_REQUESTS`	`512`	Maximum concurrent requests (queue size)
`SIE_RUST_PIPELINE_DEPTH`	`2`	Queue-mode SIE server sidecar IPC dispatch depth
`SIE_BATCHER_COALESCE_MS`	`5`	Queue-mode SIE server sidecar coalesce window in milliseconds
`SIE_BATCHER_MAX_BATCH_REQUESTS`	`12`	Queue-mode SIE server sidecar item cap per batch
`SIE_ADAPTIVE_MIN_QUANTUM_MS`	`2`	Queue-mode pull-loop wait floor
`SIE_ADAPTIVE_MAX_QUANTUM_MS`	`15`	Queue-mode pull-loop wait ceiling
`SIE_ADAPTIVE_TARGET_P50_MS`	`50`	Queue-mode pull-loop latency target

Tuning guidance:

Increase SIE_MAX_BATCH_REQUESTS (Docker) or workerSidecar.batcher.maxBatchRequests (Helm) for higher throughput on high-memory GPUs
Decrease SIE_MAX_BATCH_WAIT_MS (Docker) or workerSidecar.batcher.coalesceMs (Helm) for lower latency at the cost of smaller batches
Set SIE_MAX_CONCURRENT_REQUESTS based on expected burst traffic

Prefer Helm values for queue-mode clusters, for example workers.common.workerSidecar.batcher.coalesceMs, so the chart and rendered environment stay in sync.
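
To make the trade-off between batch size and wait time concrete, here is a small simulation. It is a simplified model under stated assumptions, not SIE's batcher: `form_batches` is a hypothetical function that closes a batch when it reaches the request cap or when the wait window since its first request has elapsed.

```python
def form_batches(arrivals_ms, max_batch_requests=64, max_batch_wait_ms=10):
    """Group sorted arrival times (in ms) into batches.

    A batch opens with its first request and closes when it holds
    max_batch_requests items or when a later request arrives more than
    max_batch_wait_ms after the batch opened, whichever comes first.
    """
    batches = []
    current = []
    opened_at = None
    for t in arrivals_ms:
        # The wait window of the open batch has expired: flush it first.
        if current and t - opened_at > max_batch_wait_ms:
            batches.append(current)
            current = []
        if not current:
            opened_at = t
        current.append(t)
        # The size cap has been reached: flush immediately.
        if len(current) >= max_batch_requests:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


arrivals = [0, 1, 2, 3, 20, 21]
print(form_batches(arrivals, max_batch_requests=64, max_batch_wait_ms=10))
# [[0, 1, 2, 3], [20, 21]]
print(form_batches(arrivals, max_batch_requests=8, max_batch_wait_ms=1))
# [[0, 1], [2, 3], [20, 21]]
```

In this model, shrinking the wait window splits the same traffic into more, smaller batches, which matches the latency-for-throughput trade described in the tuning guidance.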

Gateway and Cluster Configuration

Helm normally renders these variables in Kubernetes. Set them by hand only when running the Rust gateway, sie-server-sidecar, or sie-config outside Helm.

Variable	Default	Description
`SIE_NATS_URL`	None	NATS URL for queued inference, result inboxes, SIE server sidecar health, and config deltas
`SIE_GATEWAY_HEALTH_MODE`	`ws` (raw CLI), `nats` (Helm)	Health source used by the gateway. Helm renders `nats` for the SIE server sidecar path
`SIE_GATEWAY_CONFIGURED_GPUS`	None	Comma-separated machine profiles available for routing and scale-from-zero
`SIE_CONFIG_SERVICE_URL`	None	`sie-config` base URL used by gateway and SIE server sidecar drift polling
`SIE_PAYLOAD_STORE_URL`	None	Shared payload store for large queued requests (`s3://`, `gs://`, or local path)
`SIE_ADMIN_TOKEN`	None	Admin bearer token for config writes and config export reads
`SIE_AUTH_MODE`	`none`	Gateway auth mode: `none`, `static`, or `token`
`SIE_AUTH_TOKEN`, `SIE_AUTH_TOKENS`	None	Bearer tokens accepted by protected gateway routes
`SIE_NATS_CONFIG_TRUSTED_PRODUCERS`	`sie-config`	Comma-separated producer IDs trusted for config-delta subjects

Static worker URL and Kubernetes endpoint discovery variables (SIE_GATEWAY_WORKERS, SIE_GATEWAY_KUBERNETES, SIE_GATEWAY_K8S_*) are local diagnostics for WebSocket health. Queue-mode Helm deployments route through NATS.

SIE Server Sidecar Configuration

The SIE server sidecar runs beside the Python sie-server adapter in each worker pod. Helm renders the sidecar container as worker-sidecar. The sidecar pulls from JetStream, batches by model and operation, calls the adapter over Unix domain socket IPC, publishes results, and emits sidecar health over NATS.

Variable	Default	Description
`SIE_POOL`	`_default`	Worker pool name, also used in JetStream stream and subject names
`SIE_BUNDLE`	`default`	Bundle ID used for the durable consumer and config subscription
`SIE_IPC_SOCKET_PATH`	`/tmp/sie-ipc.sock`	Unix socket path for SIE server sidecar to `sie-server` adapter IPC
`SIE_MAX_CONCURRENT_BATCHES`	`4`	Maximum concurrent sidecar batches
`SIE_IPC_POOL_SIZE`	Matches `SIE_MAX_CONCURRENT_BATCHES` when unset	Concurrent IPC connections to the Python `sie-server` adapter
`SIE_WORKER_METRICS_PORT`	`9095`	SIE server sidecar `/metrics`, `/healthz`, and `/readyz` port
`SIE_WORKER_ID`	Pod hostname or generated UUID	Stable worker ID surfaced in logs, results, and NATS health
`SIE_MACHINE_PROFILE`	`SIE_POOL`	Machine-profile label used by gateway routing
`SIE_GPU_COUNT`	`1`	GPU count advertised in SIE server sidecar health
`SIE_GATEWAY_URL`	None	Gateway base URL for worker-side pool admission checks
`SIE_POOL_ADMISSION_ENABLED`	`true`	Enable SIE server sidecar pool admission before pulling work
`SIE_WORKER_PING_INTERVAL_MS`	`2000`	IPC ping cadence used for SIE server sidecar readiness
`SIE_WORKER_READYZ_STALE_MULT`	`3`	Readiness staleness multiplier applied to the ping interval
`SIE_WORKER_CONFIG_POLL_INTERVAL_MS`	`30000`	Worker-side config epoch poll interval
`SIE_WORKER_CONFIG_FULL_EXPORT_INTERVAL_MS`	`300000`	Slow full-export reconcile interval. Set `0` to disable after startup
`SIE_HEALTH_PUBLISH_INTERVAL_MS`	`5000`	NATS SIE server sidecar health publish interval

Batch and pull-loop knobs for the SIE server sidecar are listed in Batching Configuration.
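
Several sidecar defaults are derived from other settings. The sketch below shows one way to read them; `sidecar_settings` is a hypothetical helper, and treating the readiness staleness threshold as the ping interval multiplied by `SIE_WORKER_READYZ_STALE_MULT` is an interpretation of the table above, not confirmed sidecar behavior.

```python
def sidecar_settings(environ):
    """Resolve a few sidecar settings, including defaults derived from others."""
    max_batches = int(environ.get("SIE_MAX_CONCURRENT_BATCHES", "4"))
    # SIE_IPC_POOL_SIZE matches SIE_MAX_CONCURRENT_BATCHES when unset.
    ipc_pool_size = int(environ.get("SIE_IPC_POOL_SIZE", max_batches))

    pool = environ.get("SIE_POOL", "_default")
    # SIE_MACHINE_PROFILE defaults to the pool name.
    machine_profile = environ.get("SIE_MACHINE_PROFILE", pool)

    ping_ms = int(environ.get("SIE_WORKER_PING_INTERVAL_MS", "2000"))
    stale_mult = int(environ.get("SIE_WORKER_READYZ_STALE_MULT", "3"))

    return {
        "max_concurrent_batches": max_batches,
        "ipc_pool_size": ipc_pool_size,
        "pool": pool,
        "machine_profile": machine_profile,
        # Assumed meaning: readiness goes stale after this many milliseconds.
        "readyz_stale_after_ms": ping_ms * stale_mult,
    }


print(sidecar_settings({}))
print(sidecar_settings({"SIE_POOL": "l4", "SIE_MAX_CONCURRENT_BATCHES": "8"}))
```

With the defaults, this reading gives a staleness window of 2000 ms × 3 = 6000 ms.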

Memory Configuration

Control memory pressure thresholds and LRU eviction behavior.

Variable	Default	Description
`SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT`	`85`	VRAM usage percent that triggers LRU eviction (0-100)
`SIE_DISK_CACHE_ENABLED`	`true`	Enable LRU disk cache for model weights
`SIE_DISK_PRESSURE_THRESHOLD_PERCENT`	`85`	Disk usage percent that triggers LRU eviction of cached weights
`SIE_IDLE_EVICT_S`	(unset)	Unload models idle for N seconds. Disabled by default; for example, set `300` for a 5-minute idle TTL
`SIE_PRELOAD_MODELS`	(unset)	Comma-separated list of model IDs to load eagerly at server startup instead of lazily on first request

How LRU eviction works:

A background monitor checks memory usage periodically
When usage exceeds SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT, the least-recently-used model is evicted
Evicted models are reloaded on demand when the next request for them arrives
Set SIE_IDLE_EVICT_S to also evict models that have been idle for too long, regardless of memory pressure
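
The eviction policy can be illustrated with a small bookkeeping class. This is a sketch under assumptions, not the server's implementation: `ModelLRU` and its methods are hypothetical names, timestamps are passed in explicitly, and a model counts as idle once it has been unused for at least the configured number of seconds.

```python
from collections import OrderedDict


class ModelLRU:
    """Track model usage order and pick eviction candidates."""

    def __init__(self, threshold_percent=85):
        self.threshold_percent = threshold_percent
        # model_id -> last-used time in seconds, least recently used first.
        self._models = OrderedDict()

    def touch(self, model_id, now):
        """Record a request for model_id at time now."""
        self._models[model_id] = now
        self._models.move_to_end(model_id)

    def evict_for_pressure(self, usage_percent):
        """Evict the least-recently-used model if usage exceeds the threshold."""
        if usage_percent > self.threshold_percent and self._models:
            model_id, _ = self._models.popitem(last=False)
            return model_id
        return None

    def evict_idle(self, now, idle_evict_s):
        """Evict every model idle for at least idle_evict_s (None disables)."""
        if idle_evict_s is None:
            return []
        idle = [m for m, last in self._models.items() if now - last >= idle_evict_s]
        for model_id in idle:
            del self._models[model_id]
        return idle


lru = ModelLRU(threshold_percent=85)
lru.touch("model-a", now=0)
lru.touch("model-b", now=10)
lru.touch("model-a", now=20)  # model-a becomes the most recently used

print(lru.evict_for_pressure(usage_percent=90))  # model-b
print(lru.evict_for_pressure(usage_percent=80))  # None
print(lru.evict_idle(now=400, idle_evict_s=300))  # ['model-a']
```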

Logging Configuration

Control log format and verbosity.

Variable	Default	Description
`SIE_LOG_JSON`	`false`	Enable structured JSON logging for Loki compatibility

JSON log format includes structured fields:

{
  "timestamp": "2025-12-18T10:30:00Z",
  "level": "INFO",
  "logger": "sie_server.core.registry",
  "message": "Inference completed",
  "model": "bge-m3",
  "request_id": "abc123",
  "trace_id": "def456",
  "latency_ms": 45.2
}
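
For readers who want to produce or test against a similar shape, the following is a minimal formatter built on the standard `logging` module. It is not SIE's logger: `JsonFormatter` is a hypothetical class that only mirrors the field names shown above, and the optional fields are supplied through `extra`.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render a log record as one JSON object per line."""

    # Optional fields copied from the record when present.
    EXTRA_FIELDS = ("model", "request_id", "trace_id", "latency_ms")

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for field in self.EXTRA_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("sie_server.core.registry")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "Inference completed",
    extra={"model": "bge-m3", "request_id": "abc123", "latency_ms": 45.2},
)
```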

Tracing Configuration

Enable OpenTelemetry distributed tracing.

Variable	Default	Description
`SIE_TRACING_ENABLED`	`false`	Enable OpenTelemetry tracing

When tracing is enabled, SIE respects standard OpenTelemetry environment variables:

Variable	Default	Description
`OTEL_SERVICE_NAME`	`sie-server`	Service name in traces
`OTEL_TRACES_EXPORTER`	`otlp`	Exporter type (`otlp`, `console`, `none`)
`OTEL_EXPORTER_OTLP_ENDPOINT`	`http://localhost:4317`	OTLP collector endpoint
`OTEL_TRACES_SAMPLER`	`always_on`	Sampling strategy
`OTEL_TRACES_SAMPLER_ARG`	`1.0`	Sampling rate (for `traceidratio` sampler)
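
To show what a sampling rate means in practice, the sketch below illustrates the idea behind ratio-based sampling: a trace is kept when the low 64 bits of its trace ID fall below a bound proportional to the ratio. It is a simplified illustration, not the OpenTelemetry SDK's code, and `should_sample` is a hypothetical name.

```python
TRACE_ID_LIMIT = (1 << 64) - 1  # mask for the low 64 bits of a trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Decide deterministically whether a trace is kept at the given ratio."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


# Every service that sees the same trace ID reaches the same decision,
# so a trace is either kept everywhere or dropped everywhere.
print(should_sample(0x1234, ratio=1.0))  # True
print(should_sample(0x1234, ratio=0.0))  # False
print(should_sample((1 << 63) - 1, ratio=0.5))  # True
print(should_sample(1 << 63, ratio=0.5))  # False
```

Under this scheme, setting `OTEL_TRACES_SAMPLER=traceidratio` with `OTEL_TRACES_SAMPLER_ARG=0.1` would keep roughly one trace in ten.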

Performance Configuration

Advanced settings for compute precision and preprocessing.

Variable	Default	Description
`SIE_PREPROCESSOR_WORKERS`	`4`	Number of preprocessing worker threads
`SIE_IMAGE_WORKERS`	`4`	Image preprocessing worker threads (for VLMs)
`SIE_ATTENTION_BACKEND`	`auto`	Attention implementation: `auto`, `flash_attention_2`, `sdpa`, `eager`
`SIE_DEFAULT_COMPUTE_PRECISION`	`float16`	Default compute precision: `float16`, `bfloat16`, `float32`
`SIE_INSTRUMENTATION`	`false`	Enable detailed batch statistics for debugging

LoRA Configuration

Control LoRA adapter loading behavior.

Variable	Default	Description
`SIE_MAX_LORAS_PER_MODEL`	`10`	Maximum LoRA adapters to keep loaded per model

When the limit is reached, the least-recently-used LoRA adapter is evicted.

Example: Production Configuration

# High-throughput production setup
export SIE_DEVICE=cuda
export SIE_MODELS_DIR=s3://my-bucket/models/
export SIE_CLUSTER_CACHE=s3://my-bucket/weights/
export SIE_LOCAL_CACHE=/mnt/nvme/cache

# Batching optimized for A100-80GB
export SIE_MAX_BATCH_REQUESTS=128
export SIE_MAX_BATCH_WAIT_MS=5
export SIE_MAX_CONCURRENT_REQUESTS=1024

# Memory management
export SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT=90

# Observability
export SIE_LOG_JSON=true
export SIE_TRACING_ENABLED=true
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317

Example: Development Configuration

# Local development setup
export SIE_DEVICE=mps  # or cuda, cpu
export SIE_MODELS_DIR=./models

# Lower batching for faster iteration
export SIE_MAX_BATCH_REQUESTS=8
export SIE_MAX_BATCH_WAIT_MS=1

# Detailed batch statistics for debugging
export SIE_INSTRUMENTATION=true

What’s Next

CLI Reference - Command-line options that map to these variables
HTTP API Reference - Endpoints exposed by the configured server