Performance Tuning

SIE provides several tuning parameters that affect throughput, latency, and resource usage. This guide covers the main configuration options.

Batching groups requests to maximize GPU utilization. The tuning surface depends on deployment mode:

| Mode | Runtime that forms batches | Primary knobs |
|---|---|---|
| Standalone sie-server | Python sie-server | SIE_MAX_BATCH_WAIT_MS, SIE_MAX_BATCH_REQUESTS, per-model max_batch_tokens |
| Kubernetes queue mode | SIE server sidecar inside the worker pod | workers.common.workerSidecar.batcher.*, pipelineDepth, adaptive.* |

max_batch_tokens: maximum total cost per batch. For text, cost equals token count. Default: 16384 tokens.

The batch cost limit has a built-in default in BatchConfig and is configured per model, not via an environment variable.

Maximum time to wait for more requests before processing a batch. Default: 10ms.

# Environment variable
export SIE_MAX_BATCH_WAIT_MS=20

Lower values reduce latency for sparse traffic. Higher values improve batching efficiency under load.

Maximum number of requests per batch. Default: 64.

# Environment variable
export SIE_MAX_BATCH_REQUESTS=128

This is a secondary limit. Cost-based batching typically triggers first for text workloads.
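The interaction of the three limits can be illustrated with a minimal sketch. This is not SIE's batcher: `BatchLimits` and `PendingBatch` are hypothetical names, and the flush rules are an assumption based only on the descriptions above (a batch closes when the cost cap, the request cap, or the wait window is reached).

```python
from dataclasses import dataclass, field


@dataclass
class BatchLimits:
    # Defaults mirror the documented values; the class itself is hypothetical.
    max_batch_tokens: int = 16384
    max_batch_requests: int = 64
    max_batch_wait_ms: float = 10.0


@dataclass
class PendingBatch:
    limits: BatchLimits
    costs: list = field(default_factory=list)
    opened_at_ms: float = 0.0

    def total_cost(self) -> int:
        return sum(self.costs)

    def try_add(self, cost: int, now_ms: float) -> bool:
        """Add a request if it fits; return False if the batch must flush first."""
        if not self.costs:
            # Assumption: a lone oversized request still forms its own batch.
            self.opened_at_ms = now_ms
            self.costs.append(cost)
            return True
        if len(self.costs) >= self.limits.max_batch_requests:
            return False
        if self.total_cost() + cost > self.limits.max_batch_tokens:
            return False
        self.costs.append(cost)
        return True

    def should_flush(self, now_ms: float) -> bool:
        """True once any of the three limits has been reached."""
        if not self.costs:
            return False
        return (
            len(self.costs) >= self.limits.max_batch_requests
            or self.total_cost() >= self.limits.max_batch_tokens
            or now_ms - self.opened_at_ms >= self.limits.max_batch_wait_ms
        )
```

With the default limits and 512-token requests, the cost cap is reached after 32 requests (32 × 512 = 16384), well before the 64-request cap. That is the sense in which the request limit is secondary for text workloads.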

In Kubernetes, the SIE server sidecar pulls from JetStream and sends fully formed RunBatch IPC calls to the sie-server adapter. The production defaults are intentionally conservative:

| Helm value | Default | Purpose |
|---|---|---|
| workers.common.workerSidecar.pipelineDepth | 2 | One Python batch active and one queued behind it |
| workers.common.workerSidecar.batcher.coalesceMs | 5 | Server-sidecar batch coalesce window |
| workers.common.workerSidecar.batcher.maxBatchRequests | 12 | Hard item cap per SIE server sidecar batch |
| workers.common.workerSidecar.adaptive.minQuantumMs | 2 | Pull-loop coalesce floor |
| workers.common.workerSidecar.adaptive.maxQuantumMs | 15 | Pull-loop coalesce ceiling |
| workers.common.workerSidecar.adaptive.targetP50Ms | 50 | Pull-loop latency target |

Treat these as a group. Raising batcher.maxBatchRequests without also reviewing pipelineDepth and the adaptive.* wait settings can improve throughput while hurting p50 and p95 latency.

For low-latency Docker workloads, reduce SIE_MAX_BATCH_WAIT_MS to 5ms or less. For high-throughput Docker workloads, increase SIE_MAX_BATCH_WAIT_MS and SIE_MAX_BATCH_REQUESTS.

For Kubernetes queue-mode workloads, start with the Helm SIE server sidecar defaults. Increase batcher.maxBatchRequests only after sie_worker_scheduler_batch_items, sie_worker_backend_process_seconds, and sie_gateway_request_latency_seconds show that the GPU is underfed and latency has room.

SIE uses reactive LRU eviction to manage GPU memory. No static VRAM budget is required.

When memory usage exceeds the configured threshold percentage, the least-recently-used model is evicted. Default: 85%.

# Environment variable
export SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT=85

Lower values keep more headroom for inference spikes. Higher values allow more models to stay loaded.

The memory manager checks pressure at two points:

  1. Before loading: If above threshold, evict LRU model first
  2. After each batch: Background check for gradual memory growth

Models are tracked by last-use time. The oldest model is evicted first.

# From memory.py - LRU tracking
def touch(self, model_name: str) -> None:
    if model_name in self._models:
        self._models[model_name].touch()
        self._models.move_to_end(model_name)
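To show how LRU ordering and the pressure threshold fit together, here is a minimal sketch. It is not the contents of memory.py: `LruModelRegistry`, `register`, and `relieve_pressure` are hypothetical names, model sizes are passed in by the caller, and evicting repeatedly until usage drops to the threshold is an assumption.

```python
from collections import OrderedDict


class LruModelRegistry:
    """Hypothetical registry that illustrates threshold-driven LRU eviction."""

    def __init__(self, threshold_percent: float = 85.0):
        self.threshold_percent = threshold_percent
        # Maps model name -> size in bytes; the oldest entry comes first.
        self._models = OrderedDict()

    def register(self, name: str, size_bytes: int) -> None:
        self._models[name] = size_bytes
        self._models.move_to_end(name)

    def touch(self, name: str) -> None:
        # Mark a model as most recently used.
        if name in self._models:
            self._models.move_to_end(name)

    def relieve_pressure(self, used_bytes: int, total_bytes: int) -> list:
        """Evict least-recently-used models while usage exceeds the threshold."""
        evicted = []
        while self._models and 100.0 * used_bytes / total_bytes > self.threshold_percent:
            name, size = self._models.popitem(last=False)  # oldest entry
            used_bytes -= size
            evicted.append(name)
        return evicted
```

In this sketch the same `relieve_pressure` call would serve both check points: once before a load and once after each batch.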

Memory tracking adapts to your hardware:

| Device | Memory source |
|---|---|
| CUDA | NVML device memory query |
| MPS | PyTorch allocated memory |
| CPU | System RAM via psutil |

The attention implementation affects inference speed significantly.

| Backend | Requirements | Speedup vs. eager |
|---|---|---|
| flash_attention_2 | Ampere+ GPU, flash-attn package | 2-4x |
| sdpa | PyTorch 2.0+ | 1.5-2x |
| eager | Any | Baseline |

# Auto-select best available (default)
export SIE_ATTENTION_BACKEND=auto
# Force specific backend
export SIE_ATTENTION_BACKEND=flash_attention_2
export SIE_ATTENTION_BACKEND=sdpa

Auto mode selects Flash Attention 2 if available, then SDPA, then eager.

Flash Attention 2 requires:

  • CUDA compute capability 8.0+ (Ampere or newer, e.g. A100, RTX 30xx, RTX 40xx)
  • The flash-attn package installed
  • FP16 or BF16 compute precision (not FP32)

If requirements are not met, the server uses SDPA automatically.
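The fallback order can be written as a small pure function. This is a sketch of the documented rules, not SIE's selection code: `select_attention_backend` and its parameters are hypothetical, and the caller is assumed to have already detected the GPU's compute capability and whether flash-attn is importable.

```python
from typing import Optional, Tuple


def select_attention_backend(
    compute_capability: Optional[Tuple[int, int]],
    flash_attn_installed: bool,
    precision: str,
    sdpa_available: bool = True,
) -> str:
    """Return the first backend whose requirements are met."""
    flash_ok = (
        compute_capability is not None        # None means no CUDA device
        and compute_capability >= (8, 0)      # Ampere or newer
        and flash_attn_installed
        and precision in ("float16", "bfloat16")
    )
    if flash_ok:
        return "flash_attention_2"
    if sdpa_available:                        # PyTorch 2.0+
        return "sdpa"
    return "eager"
```

For example, an Ampere GPU with flash-attn installed but `precision="float32"` resolves to `sdpa`, because Flash Attention 2 requires FP16 or BF16.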

Control the precision used for model inference:

# Options: float16, bfloat16, float32
export SIE_DEFAULT_COMPUTE_PRECISION=float16

| Precision | Memory | Speed | Compatibility |
|---|---|---|---|
| float16 | Low | Fast | All CUDA GPUs |
| bfloat16 | Low | Fast | Ampere+, MPS, CPU |
| float32 | High | Slow | All devices |

BF16 offers better numerical stability than FP16 for some models. FP32 is mainly for debugging.
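The memory difference between precisions follows from bytes per parameter: 2 for float16 and bfloat16, 4 for float32. A rough estimate of weight memory alone can be sketched as below. The helper name is hypothetical, and the figure excludes activations, caches, and framework overhead.

```python
# Storage size of one parameter in each supported precision.
BYTES_PER_PARAM = {"float16": 2, "bfloat16": 2, "float32": 4}


def weight_memory_gib(num_params: int, precision: str) -> float:
    """Approximate memory for model weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30
```

For a hypothetical 500M-parameter model this gives about 0.93 GiB in float16 and about 1.86 GiB in float32, so halving the precision roughly doubles how many models fit under the eviction threshold.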

Tokenization and image processing run in a CPU thread pool.

# Environment variable
export SIE_PREPROCESSOR_WORKERS=8

Default: 4. Increase for high request rates. Decrease on memory-constrained systems.

The thread pool is shared across all models. Both tokenization and image preprocessing use the same pool.
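A shared preprocessing pool of this kind can be sketched with the standard library. The function names and the whitespace "tokenizer" are stand-ins for illustration; only the SIE_PREPROCESSOR_WORKERS variable and its default of 4 come from this guide.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def make_preprocessor_pool() -> ThreadPoolExecutor:
    """Create one pool sized from SIE_PREPROCESSOR_WORKERS (default 4)."""
    workers = int(os.environ.get("SIE_PREPROCESSOR_WORKERS", "4"))
    return ThreadPoolExecutor(max_workers=workers, thread_name_prefix="preproc")


def tokenize(text: str) -> list:
    # Stand-in for a real tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def preprocess_all(pool: ThreadPoolExecutor, texts: list) -> list:
    # map() preserves input order even though work runs concurrently.
    return list(pool.map(tokenize, texts))
```

Because every model submits to the same pool, a burst of image preprocessing can delay tokenization for other models. That contention is the main reason to raise the worker count under high request rates.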

The most commonly used tuning parameters can be set via environment variables with the SIE_ prefix:

| Variable | Default | Description |
|---|---|---|
| SIE_MAX_BATCH_REQUESTS | 64 | Max requests per batch |
| SIE_MAX_BATCH_WAIT_MS | 10 | Max wait time (ms) |
| SIE_MAX_CONCURRENT_REQUESTS | 512 | Request queue size |
| SIE_RUST_PIPELINE_DEPTH | 2 | Queue-mode SIE server sidecar IPC pipeline depth |
| SIE_BATCHER_COALESCE_MS | 5 | Queue-mode SIE server sidecar batch coalesce window |
| SIE_BATCHER_MAX_BATCH_REQUESTS | 12 | Queue-mode SIE server sidecar max items per batch |
| SIE_ADAPTIVE_MIN_QUANTUM_MS | 2 | Queue-mode pull-loop wait floor |
| SIE_ADAPTIVE_MAX_QUANTUM_MS | 15 | Queue-mode pull-loop wait ceiling |
| SIE_ADAPTIVE_TARGET_P50_MS | 50 | Queue-mode pull-loop latency target |
| SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT | 85 | Eviction trigger (%) |
| SIE_PREPROCESSOR_WORKERS | 4 | CPU thread pool size |
| SIE_ATTENTION_BACKEND | auto | Attention implementation |
| SIE_DEFAULT_COMPUTE_PRECISION | float16 | Model precision |
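One way to see how the SIE_ prefix maps onto settings is a small loader that falls back to the documented defaults. This is an illustrative sketch, not SIE's configuration code: `TuningConfig` and `load_tuning_config` are hypothetical, and only the standalone-server variables are included.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TuningConfig:
    # Field names are the variable names without the SIE_ prefix, lowercased.
    max_batch_requests: int = 64
    max_batch_wait_ms: int = 10
    max_concurrent_requests: int = 512
    memory_pressure_threshold_percent: int = 85
    preprocessor_workers: int = 4
    attention_backend: str = "auto"
    default_compute_precision: str = "float16"


def load_tuning_config(env=None) -> TuningConfig:
    """Build a config from SIE_* variables, keeping defaults for unset ones."""
    env = os.environ if env is None else env
    overrides = {}
    for f in fields(TuningConfig):
        raw = env.get("SIE_" + f.name.upper())
        if raw is not None:
            # Coerce the string to the type of the field's default (int or str).
            overrides[f.name] = type(f.default)(raw)
    return TuningConfig(**overrides)
```

Passing a plain dict instead of `os.environ` makes the mapping easy to check, for example `load_tuning_config({"SIE_MAX_BATCH_WAIT_MS": "20"}).max_batch_wait_ms` evaluates to `20`.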

Use the eval runner to measure the impact of tuning changes:

# Performance benchmark
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf -s sie
# Compare before/after
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf -s sie,targets

The perf eval reports throughput (items/sec), latency percentiles, and GPU utilization.

See the Evals documentation for the full benchmarking workflow.
