Configuration

SIE uses environment variables for runtime configuration. CLI arguments override environment variables, which override defaults. In Kubernetes, Helm values render the gateway, config service, and worker-pod containers separately. The worker pod contains the SIE server sidecar and the Python sie-server adapter.
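The precedence rule above can be sketched as a lookup that checks each layer in turn. This is a minimal illustration, not SIE's actual loader: the function name `resolve_setting` and the shape of the `cli_args` mapping are assumptions made for the example.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    name: str,
    cli_args: Mapping[str, Optional[str]],
    defaults: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the effective value: CLI argument, then environment, then default."""
    env = os.environ if env is None else env
    # 1. An explicitly passed CLI argument wins.
    if cli_args.get(name) is not None:
        return cli_args[name]
    # 2. Otherwise fall back to the environment variable.
    if name in env:
        return env[name]
    # 3. Finally use the documented default (None if there is none).
    return defaults.get(name)


# Hypothetical inputs for illustration only.
defaults = {"SIE_DEVICE": "auto", "SIE_MODELS_DIR": "./models"}
env = {"SIE_DEVICE": "cuda"}

print(resolve_setting("SIE_DEVICE", {"SIE_DEVICE": "cpu"}, defaults, env))  # cpu
print(resolve_setting("SIE_DEVICE", {}, defaults, env))                     # cuda
print(resolve_setting("SIE_MODELS_DIR", {}, defaults, env))                 # ./models
```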

Core settings for device selection, model loading, and server behavior.

| Variable | Default | Description |
| --- | --- | --- |
| SIE_DEVICE | auto | Inference device. Options: auto (detect GPU), cuda, cuda:0, mps, cpu |
| SIE_MODELS_DIR | ./models | Path to model configs directory. Supports local paths, s3://, or gs:// URLs |
| SIE_MODEL_FILTER | None | Comma-separated list of model names to load. If unset, all models are available |
| SIE_GPU_TYPE | Auto-detected | Override detected GPU type for routing (e.g., l4, a100-80gb, h100) |
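As an illustration of how a comma-separated SIE_MODEL_FILTER value might be interpreted, the sketch below keeps only the named models and returns everything when the filter is unset. The helper `apply_model_filter` is hypothetical, and the whitespace trimming and the treatment of an empty string as "unset" are assumptions rather than documented behavior.

```python
from typing import Iterable, List, Optional


def apply_model_filter(available: Iterable[str], raw_filter: Optional[str]) -> List[str]:
    """Keep only the models named in a comma-separated filter string."""
    names = list(available)
    # Unset (or blank) filter: every model stays available.
    if raw_filter is None or not raw_filter.strip():
        return names
    # Split on commas and ignore surrounding whitespace and empty entries.
    wanted = {part.strip() for part in raw_filter.split(",") if part.strip()}
    return [name for name in names if name in wanted]


# "model-a" and "model-b" are placeholder names for the example.
models = ["bge-m3", "model-a", "model-b"]
print(apply_model_filter(models, "bge-m3, model-b"))  # ['bge-m3', 'model-b']
print(apply_model_filter(models, None))               # ['bge-m3', 'model-a', 'model-b']
```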

Control where model weights are stored and retrieved.

| Variable | Default | Description |
| --- | --- | --- |
| SIE_LOCAL_CACHE | HF_HOME | Local cache directory for model weights |
| SIE_CLUSTER_CACHE | None | Cluster cache URL for shared weights (s3:// or gs://) |
| SIE_HF_FALLBACK | true | Allow HuggingFace Hub downloads after cache miss |

Cache resolution order:

  1. Local cache (SIE_LOCAL_CACHE)
  2. Cluster cache (SIE_CLUSTER_CACHE)
  3. HuggingFace Hub (if SIE_HF_FALLBACK=true)
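The resolution order above amounts to a three-step fallback. The following sketch assumes the two cache lookups are supplied as callables; `resolve_weights_source` and its return labels are hypothetical names for this example and do not describe SIE's internal API.

```python
from typing import Callable, Optional


def resolve_weights_source(
    model_id: str,
    in_local_cache: Callable[[str], bool],
    in_cluster_cache: Optional[Callable[[str], bool]],
    hf_fallback: bool = True,
) -> str:
    """Return which tier would serve the weights for `model_id`."""
    # 1. Local cache (SIE_LOCAL_CACHE) is always checked first.
    if in_local_cache(model_id):
        return "local"
    # 2. Cluster cache is consulted only when SIE_CLUSTER_CACHE is configured.
    if in_cluster_cache is not None and in_cluster_cache(model_id):
        return "cluster"
    # 3. HuggingFace Hub is the last resort, and only if fallback is allowed.
    if hf_fallback:
        return "huggingface"
    raise FileNotFoundError(f"{model_id}: not cached and SIE_HF_FALLBACK is disabled")


local = {"bge-m3"}
cluster = {"bge-m3", "model-a"}  # "model-a" and "model-b" are placeholder names
print(resolve_weights_source("bge-m3", local.__contains__, cluster.__contains__))   # local
print(resolve_weights_source("model-a", local.__contains__, cluster.__contains__))  # cluster
print(resolve_weights_source("model-b", local.__contains__, cluster.__contains__))  # huggingface
```

With `hf_fallback=False`, a miss in both caches raises an error instead of reaching out to the Hub, which is the behavior an air-gapped deployment would want from this sketch.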

Control request batching behavior for GPU efficiency. Standalone sie-server uses the Python batching knobs. Kubernetes queue-mode clusters use the SIE server sidecar knobs.

| Variable | Default | Description |
| --- | --- | --- |
| SIE_MAX_BATCH_REQUESTS | 64 | Maximum requests per batch |
| SIE_MAX_BATCH_WAIT_MS | 10 | Maximum milliseconds to wait for batch to fill |
| SIE_MAX_CONCURRENT_REQUESTS | 512 | Maximum concurrent requests (queue size) |
| SIE_RUST_PIPELINE_DEPTH | 2 | Queue-mode SIE server sidecar IPC dispatch depth |
| SIE_BATCHER_COALESCE_MS | 5 | Queue-mode SIE server sidecar coalesce window in milliseconds |
| SIE_BATCHER_MAX_BATCH_REQUESTS | 12 | Queue-mode SIE server sidecar item cap per batch |
| SIE_ADAPTIVE_MIN_QUANTUM_MS | 2 | Queue-mode pull-loop wait floor |
| SIE_ADAPTIVE_MAX_QUANTUM_MS | 15 | Queue-mode pull-loop wait ceiling |
| SIE_ADAPTIVE_TARGET_P50_MS | 50 | Queue-mode pull-loop latency target |

Tuning guidance:

  • For higher throughput on high-memory GPUs, increase SIE_MAX_BATCH_REQUESTS (Docker) or workerSidecar.batcher.maxBatchRequests (Helm)
  • For lower latency at the cost of smaller batches, decrease SIE_MAX_BATCH_WAIT_MS (Docker) or workerSidecar.batcher.coalesceMs (Helm)
  • Set SIE_MAX_CONCURRENT_REQUESTS based on expected burst traffic
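To make the throughput and latency trade-off concrete, here is a minimal sketch of a size-or-timeout batcher: it closes a batch when it reaches the request cap or when the wait window expires, whichever comes first. It is not SIE's batcher; `collect_batch` is a hypothetical helper, and starting the wait window when the first request arrives is an assumption of this example.

```python
import queue
import time
from typing import Any, List


def collect_batch(
    requests: "queue.Queue[Any]", max_batch_requests: int, max_wait_ms: float
) -> List[Any]:
    """Block for the first request, then fill until the batch is full or the wait expires."""
    batch = [requests.get()]  # wait indefinitely for the first item
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < max_batch_requests:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # wait window elapsed: ship a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived before the deadline
    return batch


pending: "queue.Queue[int]" = queue.Queue()
for i in range(5):
    pending.put(i)

first = collect_batch(pending, max_batch_requests=3, max_wait_ms=50)
second = collect_batch(pending, max_batch_requests=3, max_wait_ms=50)
print(first, second)
```

With five queued requests and a cap of three, the first call should return a full batch immediately, while the second returns the two leftovers only after the wait window runs out. A larger cap raises the ceiling on batch size, and a shorter wait bounds the latency added to a partial batch.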

Prefer Helm values for queue-mode clusters, for example workers.common.workerSidecar.batcher.coalesceMs, so the chart and rendered environment stay in sync.


Helm normally renders these variables in Kubernetes. Set them by hand only when running the Rust gateway, sie-server-sidecar, or sie-config outside Helm.

| Variable | Default | Description |
| --- | --- | --- |
| SIE_NATS_URL | None | NATS URL for queued inference, result inboxes, SIE server sidecar health, and config deltas |
| SIE_GATEWAY_HEALTH_MODE | ws (raw CLI), nats (via Helm) | Health source used by the gateway. Helm renders nats for the SIE server sidecar path |
| SIE_GATEWAY_CONFIGURED_GPUS | None | Comma-separated machine profiles available for routing and scale-from-zero |
| SIE_CONFIG_SERVICE_URL | None | sie-config base URL used by gateway and SIE server sidecar drift polling |
| SIE_PAYLOAD_STORE_URL | None | Shared payload store for large queued requests (s3://, gs://, or local path) |
| SIE_ADMIN_TOKEN | None | Admin bearer token for config writes and config export reads |
| SIE_AUTH_MODE | none | Gateway auth mode: none, static, or token |
| SIE_AUTH_TOKEN, SIE_AUTH_TOKENS | None | Bearer tokens accepted by protected gateway routes |
| SIE_NATS_CONFIG_TRUSTED_PRODUCERS | sie-config | Comma-separated producer IDs trusted for config-delta subjects |

Static worker URL and Kubernetes endpoint discovery variables (SIE_GATEWAY_WORKERS, SIE_GATEWAY_KUBERNETES, SIE_GATEWAY_K8S_*) are local diagnostics for WebSocket health. Queue-mode Helm deployments route through NATS.


The SIE server sidecar runs beside the Python sie-server adapter in each worker pod. Helm renders the sidecar container as worker-sidecar. The sidecar pulls from JetStream, batches by model and operation, calls the adapter over Unix domain socket IPC, publishes results, and emits sidecar health over NATS.

| Variable | Default | Description |
| --- | --- | --- |
| SIE_POOL | _default | Worker pool name, also used in JetStream stream and subject names |
| SIE_BUNDLE | default | Bundle ID used for the durable consumer and config subscription |
| SIE_IPC_SOCKET_PATH | /tmp/sie-ipc.sock | Unix socket path for SIE server sidecar to sie-server adapter IPC |
| SIE_MAX_CONCURRENT_BATCHES | 4 | Maximum concurrent sidecar batches |
| SIE_IPC_POOL_SIZE | Matches SIE_MAX_CONCURRENT_BATCHES when unset | Concurrent IPC connections to the Python sie-server adapter |
| SIE_WORKER_METRICS_PORT | 9095 | SIE server sidecar /metrics, /healthz, and /readyz port |
| SIE_WORKER_ID | Pod hostname or generated UUID | Stable worker ID surfaced in logs, results, and NATS health |
| SIE_MACHINE_PROFILE | SIE_POOL | Machine-profile label used by gateway routing |
| SIE_GPU_COUNT | 1 | GPU count advertised in SIE server sidecar health |
| SIE_GATEWAY_URL | None | Gateway base URL for worker-side pool admission checks |
| SIE_POOL_ADMISSION_ENABLED | true | Enable SIE server sidecar pool admission before pulling work |
| SIE_WORKER_PING_INTERVAL_MS | 2000 | IPC ping cadence used for SIE server sidecar readiness |
| SIE_WORKER_READYZ_STALE_MULT | 3 | Readiness staleness multiplier applied to the ping interval |
| SIE_WORKER_CONFIG_POLL_INTERVAL_MS | 30000 | Worker-side config epoch poll interval |
| SIE_WORKER_CONFIG_FULL_EXPORT_INTERVAL_MS | 300000 | Slow full-export reconcile interval. Set 0 to disable after startup |
| SIE_HEALTH_PUBLISH_INTERVAL_MS | 5000 | NATS SIE server sidecar health publish interval |
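Two of the sidecar settings are derived from others: the IPC pool size follows SIE_MAX_CONCURRENT_BATCHES when unset, and the readiness staleness window is the ping interval scaled by the multiplier. The sketch below works through that arithmetic. The function `sidecar_derived_settings` is hypothetical, and reading the staleness window as a plain product with an integer multiplier is an assumption based on the descriptions above.

```python
from typing import Dict, Mapping


def sidecar_derived_settings(env: Mapping[str, str]) -> Dict[str, int]:
    """Compute settings that depend on other sidecar variables."""
    max_batches = int(env.get("SIE_MAX_CONCURRENT_BATCHES", "4"))
    ping_ms = int(env.get("SIE_WORKER_PING_INTERVAL_MS", "2000"))
    stale_mult = int(env.get("SIE_WORKER_READYZ_STALE_MULT", "3"))
    return {
        # Falls back to the concurrent-batch limit when not set explicitly.
        "ipc_pool_size": int(env.get("SIE_IPC_POOL_SIZE", max_batches)),
        # Assumed: readiness goes stale once the last ping is older than this.
        "readyz_stale_after_ms": ping_ms * stale_mult,
    }


print(sidecar_derived_settings({}))
# {'ipc_pool_size': 4, 'readyz_stale_after_ms': 6000}
print(sidecar_derived_settings({"SIE_MAX_CONCURRENT_BATCHES": "8",
                                "SIE_WORKER_PING_INTERVAL_MS": "1000"}))
# {'ipc_pool_size': 8, 'readyz_stale_after_ms': 3000}
```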

Batch and pull-loop knobs for the SIE server sidecar (SIE_BATCHER_*, SIE_ADAPTIVE_*, and SIE_RUST_PIPELINE_DEPTH) are listed with the batching settings above.


Control memory pressure thresholds and LRU eviction behavior.

| Variable | Default | Description |
| --- | --- | --- |
| SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT | 85 | VRAM usage percent that triggers LRU eviction (0-100) |
| SIE_DISK_CACHE_ENABLED | true | Enable LRU disk cache for model weights |
| SIE_DISK_PRESSURE_THRESHOLD_PERCENT | 85 | Disk usage percent that triggers LRU eviction of cached weights |
| SIE_IDLE_EVICT_S | (unset) | Unload models idle for N seconds. Disabled by default; set e.g. 300 for a 5-minute idle TTL. |
| SIE_PRELOAD_MODELS | (unset) | Comma-separated list of model IDs to eagerly load at server startup, instead of lazy on first request. |

How LRU eviction works:

  1. Background monitor checks memory usage periodically
  2. When usage exceeds SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT, the least-recently-used model is evicted
  3. Models are re-loaded on-demand when the next request arrives
  4. Set SIE_IDLE_EVICT_S to also evict models that have been idle for too long, regardless of memory pressure
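The eviction steps above can be sketched with an ordered map that tracks when each model was last used. This is an illustrative model of the policy, not SIE's implementation: the class `ModelLRU`, the explicit `now` timestamps, and evicting exactly one model per monitor tick under pressure are all assumptions of the example.

```python
from collections import OrderedDict
from typing import List, Optional


class ModelLRU:
    """Tracks loaded models in least-recently-used order."""

    def __init__(self, pressure_threshold_percent: float = 85.0,
                 idle_evict_s: Optional[float] = None) -> None:
        self.threshold = pressure_threshold_percent
        self.idle_evict_s = idle_evict_s
        self._last_used: "OrderedDict[str, float]" = OrderedDict()

    def touch(self, model: str, now: float) -> None:
        """Record a request: load the model if needed and mark it most recent."""
        self._last_used[model] = now
        self._last_used.move_to_end(model)

    def check(self, usage_percent: float, now: float) -> List[str]:
        """One monitor tick: return the models evicted on this pass."""
        evicted: List[str] = []
        # The idle TTL applies regardless of memory pressure.
        if self.idle_evict_s is not None:
            for model, last in list(self._last_used.items()):
                if now - last >= self.idle_evict_s:
                    del self._last_used[model]
                    evicted.append(model)
        # Above the threshold, drop the least-recently-used model.
        if usage_percent > self.threshold and self._last_used:
            model, _ = self._last_used.popitem(last=False)
            evicted.append(model)
        return evicted


# "model-a" and "model-b" are placeholder names for the example.
cache = ModelLRU(pressure_threshold_percent=85, idle_evict_s=300)
cache.touch("model-a", now=0)
cache.touch("model-b", now=10)
cache.touch("model-a", now=20)                # model-a is now the most recent
print(cache.check(usage_percent=90, now=30))   # ['model-b']
print(cache.check(usage_percent=50, now=400))  # ['model-a']
```

The first tick evicts `model-b` because usage exceeds the threshold and it is the least recently used. The second tick evicts `model-a` purely on idle time, even though memory usage is below the threshold.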

Control log format and verbosity.

| Variable | Default | Description |
| --- | --- | --- |
| SIE_LOG_JSON | false | Enable structured JSON logging for Loki compatibility |

JSON log format includes structured fields:

{
  "timestamp": "2025-12-18T10:30:00Z",
  "level": "INFO",
  "logger": "sie_server.core.registry",
  "message": "Inference completed",
  "model": "bge-m3",
  "request_id": "abc123",
  "trace_id": "def456",
  "latency_ms": 45.2
}
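Structured fields make the logs easy to filter outside of Loki as well. As a small sketch, the helper below scans log lines and yields records whose `latency_ms` meets a threshold. It assumes one JSON object per line, which is typical for log shippers but is not stated above; `slow_requests` is a hypothetical name.

```python
import json
from typing import Dict, Iterable, Iterator


def slow_requests(lines: Iterable[str], min_latency_ms: float) -> Iterator[Dict]:
    """Yield parsed log records whose latency_ms is at least the threshold."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as startup banners
        if not isinstance(record, dict):
            continue
        latency = record.get("latency_ms")
        if isinstance(latency, (int, float)) and latency >= min_latency_ms:
            yield record


sample = [
    '{"level": "INFO", "message": "Inference completed", "model": "bge-m3", '
    '"request_id": "abc123", "latency_ms": 45.2}',
    "plain text line that is not JSON",
    '{"level": "INFO", "message": "Inference completed", "model": "bge-m3", '
    '"request_id": "xyz789", "latency_ms": 3.1}',
]
for rec in slow_requests(sample, min_latency_ms=40):
    print(rec["request_id"], rec["latency_ms"])  # abc123 45.2
```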

Enable OpenTelemetry distributed tracing.

| Variable | Default | Description |
| --- | --- | --- |
| SIE_TRACING_ENABLED | false | Enable OpenTelemetry tracing |

When tracing is enabled, SIE respects standard OpenTelemetry environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| OTEL_SERVICE_NAME | sie-server | Service name in traces |
| OTEL_TRACES_EXPORTER | otlp | Exporter type (otlp, console, none) |
| OTEL_EXPORTER_OTLP_ENDPOINT | http://localhost:4317 | OTLP collector endpoint |
| OTEL_TRACES_SAMPLER | always_on | Sampling strategy |
| OTEL_TRACES_SAMPLER_ARG | 1.0 | Sampling rate (for traceidratio sampler) |
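OTEL_TRACES_SAMPLER_ARG only takes effect with a ratio-based sampler such as traceidratio, so sampling a fraction of requests means changing both variables together. The sketch below builds such an environment for a child process. The helper `tracing_env` is hypothetical, and how the server process is launched with it is left to the caller.

```python
import os
from typing import Dict, Mapping, Optional


def tracing_env(
    sample_ratio: float,
    endpoint: str,
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with ratio-based trace sampling enabled."""
    if not 0.0 <= sample_ratio <= 1.0:
        raise ValueError("sample_ratio must be between 0.0 and 1.0")
    env = dict(os.environ if base is None else base)
    env.update({
        "SIE_TRACING_ENABLED": "true",
        # traceidratio is the standard OpenTelemetry ratio sampler name.
        "OTEL_TRACES_SAMPLER": "traceidratio",
        "OTEL_TRACES_SAMPLER_ARG": str(sample_ratio),
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
    })
    return env


env = tracing_env(0.1, "http://collector:4317", base={})
print(env["OTEL_TRACES_SAMPLER"], env["OTEL_TRACES_SAMPLER_ARG"])  # traceidratio 0.1
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when starting the server from a script.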

Advanced settings for compute precision and preprocessing.

| Variable | Default | Description |
| --- | --- | --- |
| SIE_PREPROCESSOR_WORKERS | 4 | Number of preprocessing worker threads |
| SIE_IMAGE_WORKERS | 4 | Image preprocessing worker threads (for VLMs) |
| SIE_ATTENTION_BACKEND | auto | Attention implementation: auto, flash_attention_2, sdpa, eager |
| SIE_DEFAULT_COMPUTE_PRECISION | float16 | Default compute precision: float16, bfloat16, float32 |
| SIE_INSTRUMENTATION | false | Enable detailed batch statistics for debugging |

Control LoRA adapter loading behavior.

| Variable | Default | Description |
| --- | --- | --- |
| SIE_MAX_LORAS_PER_MODEL | 10 | Maximum LoRA adapters to keep loaded per model |

When the limit is reached, the least-recently-used LoRA adapter is evicted.


# High-throughput production setup
export SIE_DEVICE=cuda
export SIE_MODELS_DIR=s3://my-bucket/models/
export SIE_CLUSTER_CACHE=s3://my-bucket/weights/
export SIE_LOCAL_CACHE=/mnt/nvme/cache
# Batching optimized for A100-80GB
export SIE_MAX_BATCH_REQUESTS=128
export SIE_MAX_BATCH_WAIT_MS=5
export SIE_MAX_CONCURRENT_REQUESTS=1024
# Memory management
export SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT=90
# Observability
export SIE_LOG_JSON=true
export SIE_TRACING_ENABLED=true
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317

# Local development setup
export SIE_DEVICE=mps # or cuda, cpu
export SIE_MODELS_DIR=./models
# Lower batching for faster iteration
export SIE_MAX_BATCH_REQUESTS=8
export SIE_MAX_BATCH_WAIT_MS=1
# Debug logging
export SIE_INSTRUMENTATION=true
