Configuration
SIE uses environment variables for runtime configuration. CLI arguments override environment variables, which override defaults. In Kubernetes, Helm values render the gateway, config service, and worker-pod containers separately. The worker pod contains the SIE server sidecar and the Python sie-server adapter.
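The precedence rule above (CLI argument, then environment variable, then default) can be sketched as a small lookup. This is a minimal illustration, not SIE's actual loader: the `resolve_setting` helper and its parameters are hypothetical names introduced here.

```python
import os


def resolve_setting(name, cli_value=None, default=None, env=None):
    """Return the effective value for one setting.

    Order: explicit CLI value, then the environment variable, then the default.
    `resolve_setting` is a hypothetical helper, not part of SIE.
    """
    env = os.environ if env is None else env
    if cli_value is not None:
        return cli_value  # CLI arguments win over everything else
    if name in env:
        return env[name]  # environment variables win over defaults
    return default


# Example: the CLI value shadows the environment variable.
example_env = {"SIE_DEVICE": "cuda"}
print(resolve_setting("SIE_DEVICE", cli_value="cpu", default="auto", env=example_env))
```

Passing `env` explicitly keeps the helper easy to test without touching the real process environment.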
Server Configuration
Core settings for device selection, model loading, and server behavior.
| Variable | Default | Description |
|---|---|---|
| SIE_DEVICE | auto | Inference device. Options: auto (detect GPU), cuda, cuda:0, mps, cpu |
| SIE_MODELS_DIR | ./models | Path to model configs directory. Supports local paths, s3://, or gs:// URLs |
| SIE_MODEL_FILTER | None | Comma-separated list of model names to load. If unset, all models are available |
| SIE_GPU_TYPE | Auto-detected | Override detected GPU type for routing (e.g., l4, a100-80gb, h100) |
Cache Configuration
Control where model weights are stored and retrieved.
| Variable | Default | Description |
|---|---|---|
| SIE_LOCAL_CACHE | HF_HOME | Local cache directory for model weights |
| SIE_CLUSTER_CACHE | None | Cluster cache URL for shared weights (s3:// or gs://) |
| SIE_HF_FALLBACK | true | Allow HuggingFace Hub downloads after cache miss |
Cache resolution order:
- Local cache (SIE_LOCAL_CACHE)
- Cluster cache (SIE_CLUSTER_CACHE)
- HuggingFace Hub (if SIE_HF_FALLBACK=true)
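The resolution order can be sketched as a three-tier lookup. This is only an illustration under stated assumptions: `resolve_weights_source`, `env_flag`, and the two lookup callables are hypothetical stand-ins for real filesystem and s3:// or gs:// checks, and the accepted boolean spellings are assumed rather than taken from SIE.

```python
def env_flag(value, default=True):
    """Interpret a string such as "true" or "0" as a boolean (parsing rules are assumed)."""
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def resolve_weights_source(model_id, in_local_cache, in_cluster_cache=None, hf_fallback=True):
    """Return the tier that would serve the weights: local, cluster, or huggingface.

    `in_local_cache` and `in_cluster_cache` are hypothetical callables that
    report whether a tier already holds the weights for `model_id`.
    """
    if in_local_cache(model_id):
        return "local"
    if in_cluster_cache is not None and in_cluster_cache(model_id):
        return "cluster"
    if hf_fallback:
        return "huggingface"
    raise FileNotFoundError(f"{model_id}: not cached and Hub fallback is disabled")


# Example: local miss, cluster hit.
print(resolve_weights_source("bge-m3", lambda m: False, lambda m: True))
```

With the fallback disabled and both caches missing, the sketch raises instead of downloading, which mirrors how a cache-only deployment would be expected to fail fast.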
Batching Configuration
Control request batching behavior for GPU efficiency. Standalone sie-server uses the Python batching knobs. Kubernetes queue-mode clusters use the SIE server sidecar knobs.
| Variable | Default | Description |
|---|---|---|
| SIE_MAX_BATCH_REQUESTS | 64 | Maximum requests per batch |
| SIE_MAX_BATCH_WAIT_MS | 10 | Maximum milliseconds to wait for batch to fill |
| SIE_MAX_CONCURRENT_REQUESTS | 512 | Maximum concurrent requests (queue size) |
| SIE_RUST_PIPELINE_DEPTH | 2 | Queue-mode SIE server sidecar IPC dispatch depth |
| SIE_BATCHER_COALESCE_MS | 5 | Queue-mode SIE server sidecar coalesce window in milliseconds |
| SIE_BATCHER_MAX_BATCH_REQUESTS | 12 | Queue-mode SIE server sidecar item cap per batch |
| SIE_ADAPTIVE_MIN_QUANTUM_MS | 2 | Queue-mode pull-loop wait floor |
| SIE_ADAPTIVE_MAX_QUANTUM_MS | 15 | Queue-mode pull-loop wait ceiling |
| SIE_ADAPTIVE_TARGET_P50_MS | 50 | Queue-mode pull-loop latency target |
Tuning guidance:
- Increase Docker SIE_MAX_BATCH_REQUESTS or Helm workerSidecar.batcher.maxBatchRequests for higher throughput on high-memory GPUs
- Decrease Docker SIE_MAX_BATCH_WAIT_MS or Helm workerSidecar.batcher.coalesceMs for lower latency at the cost of smaller batches
- Set SIE_MAX_CONCURRENT_REQUESTS based on expected burst traffic
Prefer Helm values for queue-mode clusters, for example workers.common.workerSidecar.batcher.coalesceMs, so the chart and rendered environment stay in sync.
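To build intuition for how the batch-size cap and the wait window interact, the sketch below groups request arrival times into batches. It is a simplified model, not SIE's batcher: `form_batches` is a hypothetical function, and it assumes a batch opens on its first request and closes when it is full or the wait window has elapsed.

```python
def form_batches(arrivals_ms, max_batch_requests=64, max_wait_ms=10):
    """Group arrival timestamps (in ms) into batches.

    Simplified model: a batch opens when its first request arrives and closes
    once it holds `max_batch_requests` items or `max_wait_ms` has passed.
    """
    batches = []
    current = []
    deadline = None
    for t in sorted(arrivals_ms):
        # Close the open batch if it is full or its wait window has expired.
        if current and (t > deadline or len(current) >= max_batch_requests):
            batches.append(current)
            current = []
        if not current:
            deadline = t + max_wait_ms  # the wait window starts at the first request
        current.append(t)
    if current:
        batches.append(current)
    return batches


arrivals = [0, 1, 2, 20, 21]
print(form_batches(arrivals))                        # [[0, 1, 2], [20, 21]]
print(form_batches(arrivals, max_batch_requests=2))  # [[0, 1], [2], [20, 21]]
```

Lowering the cap splits the early burst into smaller batches, while shortening the window would separate requests that arrive further apart. Both trade batch size for latency, as the tuning guidance describes.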
Gateway and Cluster Configuration
Helm normally renders these variables in Kubernetes. Set them by hand only when running the Rust gateway, sie-server-sidecar, or sie-config outside Helm.
| Variable | Default | Description |
|---|---|---|
| SIE_NATS_URL | None | NATS URL for queued inference, result inboxes, SIE server sidecar health, and config deltas |
| SIE_GATEWAY_HEALTH_MODE | ws (raw CLI), nats (via Helm) | Health source used by the gateway. Helm renders nats for the SIE server sidecar path |
| SIE_GATEWAY_CONFIGURED_GPUS | None | Comma-separated machine profiles available for routing and scale-from-zero |
| SIE_CONFIG_SERVICE_URL | None | sie-config base URL used by gateway and SIE server sidecar drift polling |
| SIE_PAYLOAD_STORE_URL | None | Shared payload store for large queued requests (s3://, gs://, or local path) |
| SIE_ADMIN_TOKEN | None | Admin bearer token for config writes and config export reads |
| SIE_AUTH_MODE | none | Gateway auth mode: none, static, or token |
| SIE_AUTH_TOKEN, SIE_AUTH_TOKENS | None | Bearer tokens accepted by protected gateway routes |
| SIE_NATS_CONFIG_TRUSTED_PRODUCERS | sie-config | Comma-separated producer IDs trusted for config-delta subjects |
Static worker URL and Kubernetes endpoint discovery variables (SIE_GATEWAY_WORKERS, SIE_GATEWAY_KUBERNETES, SIE_GATEWAY_K8S_*) are local diagnostics for WebSocket health. Queue-mode Helm deployments route through NATS.
SIE Server Sidecar Configuration
The SIE server sidecar runs beside the Python sie-server adapter in each worker pod.
Helm renders the sidecar container as worker-sidecar.
The sidecar pulls from JetStream, batches by model and operation, calls the adapter over Unix domain socket IPC, publishes results, and emits sidecar health over NATS.
| Variable | Default | Description |
|---|---|---|
| SIE_POOL | _default | Worker pool name, also used in JetStream stream and subject names |
| SIE_BUNDLE | default | Bundle ID used for the durable consumer and config subscription |
| SIE_IPC_SOCKET_PATH | /tmp/sie-ipc.sock | Unix socket path for SIE server sidecar to sie-server adapter IPC |
| SIE_MAX_CONCURRENT_BATCHES | 4 | Maximum concurrent sidecar batches |
| SIE_IPC_POOL_SIZE | Matches SIE_MAX_CONCURRENT_BATCHES when unset | Concurrent IPC connections to the Python sie-server adapter |
| SIE_WORKER_METRICS_PORT | 9095 | SIE server sidecar /metrics, /healthz, and /readyz port |
| SIE_WORKER_ID | Pod hostname or generated UUID | Stable worker ID surfaced in logs, results, and NATS health |
| SIE_MACHINE_PROFILE | SIE_POOL | Machine-profile label used by gateway routing |
| SIE_GPU_COUNT | 1 | GPU count advertised in SIE server sidecar health |
| SIE_GATEWAY_URL | None | Gateway base URL for worker-side pool admission checks |
| SIE_POOL_ADMISSION_ENABLED | true | Enable SIE server sidecar pool admission before pulling work |
| SIE_WORKER_PING_INTERVAL_MS | 2000 | IPC ping cadence used for SIE server sidecar readiness |
| SIE_WORKER_READYZ_STALE_MULT | 3 | Readiness staleness multiplier applied to the ping interval |
| SIE_WORKER_CONFIG_POLL_INTERVAL_MS | 30000 | Worker-side config epoch poll interval |
| SIE_WORKER_CONFIG_FULL_EXPORT_INTERVAL_MS | 300000 | Slow full-export reconcile interval. Set 0 to disable after startup |
| SIE_HEALTH_PUBLISH_INTERVAL_MS | 5000 | NATS SIE server sidecar health publish interval |
Batch and pull-loop knobs for the SIE server sidecar are listed in Batching Configuration.
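Two of the sidecar defaults are derived from other settings: the IPC pool size follows SIE_MAX_CONCURRENT_BATCHES when unset, and the readiness staleness window is the ping interval times the multiplier (2000 ms x 3 = 6000 ms by default). The sketch below works through that arithmetic. The `sidecar_settings` and `is_ready` helpers are hypothetical, and the assumption that readiness fails once the last successful ping is older than the window is an interpretation of the table, not confirmed behavior.

```python
import os


def sidecar_settings(env=None):
    """Compute derived sidecar values from environment-style settings (illustrative only)."""
    env = os.environ if env is None else env
    max_batches = int(env.get("SIE_MAX_CONCURRENT_BATCHES", "4"))
    # The IPC pool size falls back to the concurrent-batch limit when unset.
    ipc_pool = int(env.get("SIE_IPC_POOL_SIZE", max_batches))
    ping_ms = int(env.get("SIE_WORKER_PING_INTERVAL_MS", "2000"))
    stale_mult = int(env.get("SIE_WORKER_READYZ_STALE_MULT", "3"))
    return {
        "max_concurrent_batches": max_batches,
        "ipc_pool_size": ipc_pool,
        "readyz_stale_after_ms": ping_ms * stale_mult,
    }


def is_ready(ms_since_last_ping, settings):
    """Assumed rule: ready while the last successful ping is within the staleness window."""
    return ms_since_last_ping <= settings["readyz_stale_after_ms"]


defaults = sidecar_settings({})
print(defaults["readyz_stale_after_ms"])  # 6000
print(is_ready(5000, defaults), is_ready(7000, defaults))  # True False
```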
Memory Configuration
Control memory pressure thresholds and LRU eviction behavior.
| Variable | Default | Description |
|---|---|---|
| SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT | 85 | VRAM usage percent that triggers LRU eviction (0-100) |
| SIE_DISK_CACHE_ENABLED | true | Enable LRU disk cache for model weights |
| SIE_DISK_PRESSURE_THRESHOLD_PERCENT | 85 | Disk usage percent that triggers LRU eviction of cached weights |
| SIE_IDLE_EVICT_S | (unset) | Unload models idle for N seconds. Disabled by default; set e.g. 300 for a 5-minute idle TTL. |
| SIE_PRELOAD_MODELS | (unset) | Comma-separated list of model IDs to eagerly load at server startup, instead of lazy on first request. |
How LRU eviction works:
- Background monitor checks memory usage periodically
- When usage exceeds SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT, the least-recently-used model is evicted
- Models are reloaded on demand when the next request arrives
- Set SIE_IDLE_EVICT_S to also evict models that have been idle for too long, regardless of memory pressure
Logging Configuration
Control log format and verbosity.
| Variable | Default | Description |
|---|---|---|
| SIE_LOG_JSON | false | Enable structured JSON logging for Loki compatibility |
JSON log format includes structured fields:
```json
{
  "timestamp": "2025-12-18T10:30:00Z",
  "level": "INFO",
  "logger": "sie_server.core.registry",
  "message": "Inference completed",
  "model": "bge-m3",
  "request_id": "abc123",
  "trace_id": "def456",
  "latency_ms": 45.2
}
```

Tracing Configuration
Enable OpenTelemetry distributed tracing.
| Variable | Default | Description |
|---|---|---|
| SIE_TRACING_ENABLED | false | Enable OpenTelemetry tracing |
When tracing is enabled, SIE respects standard OpenTelemetry environment variables:
| Variable | Default | Description |
|---|---|---|
| OTEL_SERVICE_NAME | sie-server | Service name in traces |
| OTEL_TRACES_EXPORTER | otlp | Exporter type (otlp, console, none) |
| OTEL_EXPORTER_OTLP_ENDPOINT | http://localhost:4317 | OTLP collector endpoint |
| OTEL_TRACES_SAMPLER | always_on | Sampling strategy |
| OTEL_TRACES_SAMPLER_ARG | 1.0 | Sampling rate (for traceidratio sampler) |
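As an illustration of how these variables combine, the sketch below builds an environment mapping that turns tracing on and samples a fraction of traces with the standard traceidratio sampler. The `tracing_env` helper is hypothetical, and the values mirror the defaults in the tables above.

```python
def tracing_env(sample_ratio, endpoint="http://localhost:4317", service_name="sie-server"):
    """Build environment variables for ratio-based trace sampling (illustrative helper)."""
    if not 0.0 <= sample_ratio <= 1.0:
        raise ValueError("sample_ratio must be between 0.0 and 1.0")
    return {
        "SIE_TRACING_ENABLED": "true",
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_TRACES_EXPORTER": "otlp",
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        # traceidratio samples the fraction of traces given in OTEL_TRACES_SAMPLER_ARG.
        "OTEL_TRACES_SAMPLER": "traceidratio",
        "OTEL_TRACES_SAMPLER_ARG": str(sample_ratio),
    }


# Example: sample roughly 10% of traces.
for key, value in tracing_env(0.1).items():
    print(f"{key}={value}")
```

The resulting mapping can be merged into the process environment before the server starts.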
Performance Configuration
Advanced settings for compute precision and preprocessing.
| Variable | Default | Description |
|---|---|---|
| SIE_PREPROCESSOR_WORKERS | 4 | Number of preprocessing worker threads |
| SIE_IMAGE_WORKERS | 4 | Image preprocessing worker threads (for VLMs) |
| SIE_ATTENTION_BACKEND | auto | Attention implementation: auto, flash_attention_2, sdpa, eager |
| SIE_DEFAULT_COMPUTE_PRECISION | float16 | Default compute precision: float16, bfloat16, float32 |
| SIE_INSTRUMENTATION | false | Enable detailed batch statistics for debugging |
LoRA Configuration
Control LoRA adapter loading behavior.
| Variable | Default | Description |
|---|---|---|
| SIE_MAX_LORAS_PER_MODEL | 10 | Maximum LoRA adapters to keep loaded per model |
When the limit is reached, the least-recently-used LoRA adapter is evicted.
Example: Production Configuration
```bash
# High-throughput production setup
export SIE_DEVICE=cuda
export SIE_MODELS_DIR=s3://my-bucket/models/
export SIE_CLUSTER_CACHE=s3://my-bucket/weights/
export SIE_LOCAL_CACHE=/mnt/nvme/cache

# Batching optimized for A100-80GB
export SIE_MAX_BATCH_REQUESTS=128
export SIE_MAX_BATCH_WAIT_MS=5
export SIE_MAX_CONCURRENT_REQUESTS=1024

# Memory management
export SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT=90

# Observability
export SIE_LOG_JSON=true
export SIE_TRACING_ENABLED=true
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
```

Example: Development Configuration
```bash
# Local development setup
export SIE_DEVICE=mps  # or cuda, cpu
export SIE_MODELS_DIR=./models

# Lower batching for faster iteration
export SIE_MAX_BATCH_REQUESTS=8
export SIE_MAX_BATCH_WAIT_MS=1

# Detailed batch statistics for debugging
export SIE_INSTRUMENTATION=true
```

What’s Next
- CLI Reference - Command-line options that map to these variables
- HTTP API Reference - Endpoints exposed by the configured server