
Monitoring & Observability

SIE exposes monitoring at each runtime layer: gateway, config service, and worker pods. Inside each Kubernetes worker pod, the SIE server sidecar owns queue health while the Python sie-server adapter owns model execution. Use health endpoints for orchestration, Prometheus metrics for alerting, WebSocket streams for interactive status, and sie-top for terminal inspection.

SIE exposes Kubernetes-compatible health probes for liveness and readiness checks. In Docker, the Python sie-server process owns these endpoints. In Kubernetes, the gateway, config service, and both containers inside each worker pod have their own health contract.

| Component | /healthz | /readyz |
| --- | --- | --- |
| sie-gateway | Process liveness, returns ok | Process readiness. It does not wait for SIE server sidecar health or sie-config |
| SIE server sidecar (worker-sidecar container) | Process liveness | Fresh IPC Ping to the in-pod Python process and no active drain |
| sie-server | Python process liveness | Adapter process ready to receive work |
| sie-config | Config process liveness | Registry initialized and able to serve config endpoints |

curl http://localhost:8080/healthz
# Returns: ok

Use /healthz for Kubernetes liveness probes. A failed check triggers container restart.

curl http://localhost:8080/readyz
# Returns: ok

Use /readyz for Kubernetes readiness probes. On the gateway, readiness means the process can accept traffic and return 202 for cold-start capacity; worker-pod availability is exposed through /health, inference responses, and metrics.

Kubernetes configuration:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
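
Outside Kubernetes, the same endpoints can be polled from a script, for example to hold a deployment step until a server reports ready. The sketch below is an illustration rather than part of SIE: `probe` and `wait_until` are hypothetical helper names, and treating an HTTP 200 response as healthy is an assumption based on the `ok` responses shown above.

```python
import time
import urllib.request
from typing import Callable


def probe(base_url: str, path: str, timeout: float = 2.0) -> bool:
    """Return True when GET base_url + path answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, timeouts, and non-2xx HTTP responses.
        return False


def wait_until(check: Callable[[], bool], attempts: int = 30,
               delay: float = 1.0,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A call such as `wait_until(lambda: probe("http://localhost:8080", "/readyz"))` would then block until the readiness endpoint answers or the attempts run out. The `sleep` parameter is injectable only so the loop can be exercised without real waiting.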

SIE exposes Prometheus-format metrics at /metrics. Cluster deployments use component prefixes so dashboards can separate request edge, queue runtime, config, and adapter work.

Gateway metrics (sie_gateway_ prefix):

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sie_gateway_requests_total | Counter | endpoint, status, machine_profile | Gateway request count |
| sie_gateway_request_latency_seconds | Histogram | endpoint, machine_profile | Gateway request latency |
| sie_gateway_pending_demand | Gauge | machine_profile, bundle | KEDA scale-from-zero trigger |
| sie_gateway_worker_queue_depth | Gauge | worker, machine_profile, bundle | Queue depth from SIE server sidecar health |
| sie_gateway_config_epoch | Gauge | none | Highest config epoch applied on this gateway |
| sie_gateway_nats_connected | Gauge | none | Gateway NATS connection state |

Config service metrics (sie_config_ prefix):

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sie_config_http_requests_total | Counter | method, path, status | Config API request count |
| sie_config_http_request_duration_seconds | Histogram | method, path | Config API request latency |
| sie_config_epoch | Gauge | none | Authoritative persisted config epoch |
| sie_config_models_total | Gauge | source | Models known to the registry by origin (api or filesystem) |
| sie_config_nats_connected | Gauge | none | Config publisher NATS connection state |
| sie_config_nats_publishes_total | Counter | result | Config-delta publish attempts (success, partial, failure) |
| sie_config_store_writes_total | Counter | op, result | ConfigStore writes and epoch increments by result |

SIE server sidecar metrics (sie_worker_ prefix):

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sie_worker_messages_received_total | Counter | none | JetStream messages pulled |
| sie_worker_messages_acked_total | Counter | none | JetStream messages ACKed |
| sie_worker_messages_naked_total | Counter | none | JetStream messages NAKed |
| sie_worker_backend_process_seconds | Histogram | backend, operation, model, result | IPC batch processing time in the sie-server adapter |
| sie_worker_scheduler_batch_items | Histogram | model, operation, lora | Items per batch formed by the SIE server sidecar |
| sie_worker_ipc_request_seconds | Histogram | method, result | SIE server sidecar to sie-server adapter IPC latency |
| sie_worker_config_epoch | Gauge | none | Highest config epoch applied by this SIE server sidecar |
| sie_worker_nats_redelivery_total | Counter | none | JetStream redelivery count |

sie-server metrics (sie_ prefix):

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sie_requests_total | Counter | model, endpoint, status | Requests processed by standalone sie-server or Python sie-server adapter |
| sie_request_duration_seconds | Histogram | model, endpoint, phase | Adapter-side request duration breakdown |
| sie_batch_size | Histogram | model | Items per Python batch |
| sie_model_loaded | Gauge | model, device | Model load state |
| sie_model_memory_bytes | Gauge | model, device | GPU memory usage per model |

The sie_request_duration_seconds histogram tracks latency by phase:

| Phase | Description |
| --- | --- |
| total | End-to-end request latency |
| queue | Time spent waiting in the request queue |
| tokenize | Tokenization and preprocessing time |
| inference | GPU inference time |

Duration buckets (seconds): 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0

Batch size buckets: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024
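
Bucket boundaries limit how precisely a quantile can be read from a histogram, because the estimate is interpolated inside whichever bucket contains the target rank. The sketch below approximates the linear interpolation that Prometheus documents for `histogram_quantile`. It is a simplified illustration, not the Prometheus implementation; `estimate_quantile` is a hypothetical helper and the counts in the usage note are made up.

```python
import math
from typing import Sequence

# Duration bucket upper bounds (seconds) as listed above; Prometheus adds +Inf.
DURATION_BUCKETS = [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25,
                    0.5, 1.0, 2.5, 5.0, 10.0, 30.0]


def estimate_quantile(q: float, bounds: Sequence[float],
                      cumulative: Sequence[float]) -> float:
    """Estimate quantile q from cumulative bucket counts.

    `cumulative` holds one count per finite bound plus a final +Inf count.
    """
    if len(cumulative) != len(bounds) + 1:
        raise ValueError("expected one count per bound plus a +Inf count")
    total = cumulative[-1]
    if total == 0:
        return math.nan  # no observations
    rank = q * total
    for i, upper in enumerate(bounds):
        if cumulative[i] >= rank:
            lower = bounds[i - 1] if i > 0 else 0.0
            below = cumulative[i - 1] if i > 0 else 0.0
            in_bucket = cumulative[i] - below
            if in_bucket == 0:
                return upper
            # Assume observations are spread evenly inside the bucket.
            return lower + (upper - lower) * (rank - below) / in_bucket
    # The rank falls in the +Inf bucket: report the highest finite bound.
    return bounds[-1]
```

With 100 observations of which 10 fall at or below 0.025 s and 60 at or below 0.05 s, the median lands in the 0.025 to 0.05 bucket and is interpolated there. Any latency above 30 s can only be reported as 30 s, which is worth keeping in mind when choosing alert thresholds.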

Helm can create the ServiceMonitors for gateway, SIE server sidecar, config, and observability sub-charts. For a manual Prometheus scrape, target each component separately:

# prometheus.yml
scrape_configs:
  - job_name: 'sie-gateway'
    static_configs:
      - targets: ['gateway:8080']
    metrics_path: /metrics
  - job_name: 'sie-worker-sidecar'
    static_configs:
      - targets: ['worker:9095']
    metrics_path: /metrics
    scrape_interval: 15s
  - job_name: 'sie-config'
    static_configs:
      - targets: ['sie-config:8080']
    metrics_path: /metrics
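
For a quick check without a Prometheus server, the text served at /metrics can be read directly. The sketch below is a minimal parser for the Prometheus text exposition format that sums one metric by one label. `sum_by_label` is a hypothetical helper, and the sample values, the HELP text, and the `encode` and `error` label values are assumptions made for illustration.

```python
import re
from collections import defaultdict

# metric_name{labels} value [timestamp]
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Illustrative scrape output; real values will differ.
SAMPLE = """\
# HELP sie_requests_total Requests processed
# TYPE sie_requests_total counter
sie_requests_total{model="bge-m3",endpoint="encode",status="success"} 120
sie_requests_total{model="bge-m3",endpoint="encode",status="error"} 3
sie_requests_total{model="e5-base-v2",endpoint="encode",status="success"} 40
sie_model_loaded{model="bge-m3",device="cuda:0"} 1
"""


def sum_by_label(text: str, metric: str, label: str) -> dict[str, float]:
    """Sum the samples of `metric`, grouped by the value of `label`."""
    totals: dict[str, float] = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None or match.group(1) != metric:
            continue
        labels = dict(_LABEL.findall(match.group(2) or ""))
        totals[labels.get(label, "")] += float(match.group(3))
    return dict(totals)
```

Calling `sum_by_label(SAMPLE, "sie_requests_total", "model")` groups the three request samples into two per-model totals. Counters are cumulative, so a rate still requires two scrapes and a subtraction.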

The sie-top command provides a real-time terminal interface for monitoring SIE servers.

pip install 'sie-admin[top]'
# Monitor local server (auto-detects mode)
sie-top
# Monitor specific server
sie-top localhost:8080
# Force Python sie-server status mode
sie-top --worker worker-0.sie.svc:8080
# Force cluster mode (connect to gateway)
sie-top --cluster gateway.example.com:8080

Mode is auto-detected by probing the gateway /health endpoint. Use --worker for a Python sie-server status endpoint or --cluster for gateway cluster status.

The TUI displays:

  • Server info: Version, uptime, user, PID
  • GPU table: Device name, memory usage, compute utilization, trend sparkline
  • Model table: Name, state, device, memory, queue depth, QPS sparkline
  • Detail panel: Selected GPU or model with 60-second history charts

Keyboard shortcuts:

| Key | Action |
| --- | --- |
| j / Down | Move selection down |
| k / Up | Move selection up |
| ? | Show help |
| q | Quit |

The Python sie-server process streams real-time status over WebSocket at /ws/status, pushing an update every 200 ms. In Kubernetes, the gateway also exposes /ws/cluster-status for aggregate cluster status, while routing health comes from SIE server sidecar NATS heartbeats.

import asyncio
import json

import websockets


async def monitor():
    # Connect to the status stream and print a summary of each update.
    async with websockets.connect("ws://localhost:8080/ws/status") as ws:
        async for message in ws:
            status = json.loads(message)
            print(f"Loaded models: {status['loaded_models']}")
            print(f"GPU type: {status['gpu']}")


asyncio.run(monitor())

Example status message:

{
  "timestamp": 1703001234.567,
  "gpu": "l4",
  "loaded_models": ["bge-m3", "e5-base-v2"],
  "server": {
    "version": "0.1.0",
    "uptime_seconds": 3600,
    "user": "sie",
    "working_dir": "/app",
    "pid": 1
  },
  "gpus": [
    {
      "device": "cuda:0",
      "name": "NVIDIA L4",
      "gpu_type": "l4",
      "utilization_pct": 45,
      "memory_used_bytes": 8589934592,
      "memory_total_bytes": 23622320128,
      "memory_threshold_pct": 85
    }
  ],
  "models": [
    {
      "name": "bge-m3",
      "state": "loaded",
      "device": "cuda:0",
      "memory_bytes": 2147483648,
      "queue_depth": 0,
      "queue_pending_items": 0,
      "config": {
        "hf_id": "BAAI/bge-m3",
        "adapter": "bge_m3",
        "inputs": ["text"],
        "outputs": ["dense", "sparse"]
      }
    }
  ],
  "counters": {},
  "histograms": {}
}

Model states:

| State | Description |
| --- | --- |
| available | Config loaded, weights not in memory |
| loading | Weights currently loading to GPU |
| loaded | Ready for inference |
| unloading | Weights being evicted from GPU |
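
A status message can be reduced to a couple of health signals before it is displayed or logged. The sketch below works on the field names from the example message above. `gpu_memory_report` and `models_not_loaded` are hypothetical helpers, and reading memory_threshold_pct as a percentage of total GPU memory is an assumption.

```python
def gpu_memory_report(status: dict) -> list[dict]:
    """Summarize memory use per GPU and flag devices above their threshold."""
    report = []
    for gpu in status.get("gpus", []):
        total = gpu["memory_total_bytes"]
        used_pct = 100.0 * gpu["memory_used_bytes"] / total if total else 0.0
        report.append({
            "device": gpu["device"],
            "used_pct": round(used_pct, 1),
            # Assumption: the threshold is a percentage of total memory.
            "over_threshold": used_pct > gpu["memory_threshold_pct"],
        })
    return report


def models_not_loaded(status: dict) -> list[str]:
    """Return the names of models whose state is anything other than loaded."""
    return [m["name"] for m in status.get("models", [])
            if m["state"] != "loaded"]
```

For the example message, the single L4 reports 8 GiB used out of 22 GiB, which is about 36.4% and below the 85% threshold. Both helpers take the already-decoded dictionary, so they can be called on `status` inside a WebSocket receive loop.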

SIE includes pre-built Grafana dashboards in the Helm chart at deploy/helm/sie-cluster/files/dashboards/. These are automatically provisioned when deploying with Grafana’s sidecar.

Example queries for common panels:

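# Successful request rate by model
sum(rate(sie_requests_total{status="success"}[5m])) by (model)

# P99 end-to-end latency by model
histogram_quantile(0.99,
  sum(rate(sie_request_duration_seconds_bucket{phase="total"}[5m])) by (le, model)
)

# GPU memory used by loaded models, per device
sum(sie_model_memory_bytes) by (device)

# Queue depth by model
sum(sie_queue_depth) by (model)

# Median batch size by model
histogram_quantile(0.5,
  sum(rate(sie_batch_size_bucket[5m])) by (le, model)
)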
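
The same queries can be run from a script through the Prometheus HTTP API, whose instant-query endpoint is /api/v1/query. The sketch below only builds the request URL and flattens an instant-vector response. `instant_query_url` and `vector_by_label` are hypothetical helpers, and the Prometheus address in the usage note is an assumption about your deployment.

```python
from urllib.parse import urlencode


def instant_query_url(prometheus_url: str, promql: str) -> str:
    """Build the URL for a Prometheus instant query."""
    base = prometheus_url.rstrip("/")
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


def vector_by_label(response: dict, label: str) -> dict[str, float]:
    """Map one label value to the sample value for each series in a vector result."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    data = response["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']}")
    # Each sample is {"metric": {...labels}, "value": [timestamp, "value"]}.
    return {s["metric"].get(label, ""): float(s["value"][1])
            for s in data["result"]}
```

For example, `instant_query_url("http://prometheus:9090", 'sum(sie_model_memory_bytes) by (device)')` yields a URL that can be fetched with any HTTP client, and the decoded JSON body can be passed to `vector_by_label(body, "device")`.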

The sie-cluster chart can render pre-configured Prometheus alert rules:

| Alert | Severity | Condition | Description |
| --- | --- | --- | --- |
| SIEWorkerDown | critical | SIE server sidecar scrape target down for 2 min | A SIE server sidecar scrape target is unreachable |
| SIENoHealthyWorkers | critical | No SIE server sidecar scrape targets healthy for 1 min | No healthy SIE server sidecar targets are reporting |
| SIEWorkerHighQueueDepth | warning | Queue depth > 50 for 5 min | SIE server sidecar queue depth is high; consider scaling up |
| SIEGPUMemoryHigh | warning | GPU memory > 90% for 5 min | Risk of OOM, LRU eviction may be insufficient |
| SIEGPUTemperatureHigh | warning | GPU temp > 80°C for 5 min | GPU throttling likely, check cooling |
| SIEGPUECCErrors | critical | Double-bit ECC errors increase over 1h | Hardware issue likely |
| SIEGatewayDown | critical | Gateway scrape target down for 1 min | Traffic cannot be routed |
| SIEHighErrorRate | warning | Gateway 5xx rate > 5% for 5 min | Server or model errors spiking |
| SIEHighLatency | warning | p95 latency > 5s for 5 min | Request latency is elevated |
| SIEConfigDown | critical | Config scrape target down for 2 min | Config writes are blocked; gateways serve cached state |
| SIEProvisioningStuck | warning | Pod Pending for 10 min | Check scheduling events and GPU capacity |
| SIEScaleUpFailed | warning | FailedScheduling event in 10 min | Likely insufficient GPU capacity |

Alert rules are included in the sie-cluster chart when kube-prometheus-stack is installed or alertRules.enabled is true:

helm upgrade --install sie oci://ghcr.io/superlinked/charts/sie-cluster \
--version 0.1.10 \
-n sie \
-f helm-values.yaml \
--set alertRules.enabled=true

Add custom alerts to your Prometheus configuration:

# Alert when P99 latency exceeds 5 seconds
- alert: SIEHighLatency
  expr: |
    histogram_quantile(0.99,
      sum(rate(sie_request_duration_seconds_bucket{phase="total"}[5m])) by (le, model)
    ) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High P99 latency for model {{ $labels.model }}"

SIE supports both human-readable and structured JSON logging.

Enable verbose logging with --verbose or -v:

sie-server serve --verbose

Enable JSON format for Loki and log aggregation systems:

sie-server serve --json-logs

Or via environment variable:

export SIE_LOG_JSON=true
sie-server serve

Example JSON log entry:

{
  "timestamp": "2025-12-18T10:30:00.123Z",
  "level": "INFO",
  "logger": "sie_server.api.encode",
  "message": "Inference completed",
  "model": "bge-m3",
  "request_id": "abc123",
  "trace_id": "def456",
  "latency_ms": 45.2,
  "batch_size": 16,
  "gpu_type": "l4"
}

JSON logs include optional fields when available:

| Field | Description |
| --- | --- |
| model | Model name for the request |
| request_id | Unique request identifier |
| trace_id | OpenTelemetry trace ID |
| latency_ms | Request latency in milliseconds |
| batch_size | Number of items in the batch |
| gpu_type | Detected GPU type |
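
Because these fields are optional, a log consumer should tolerate entries that lack them. The sketch below filters a stream of log lines for slow requests. `slow_requests` is a hypothetical helper, and it assumes each JSON entry is emitted on a single line, which the pretty-printed example above does not confirm.

```python
import json
from typing import Iterable, Iterator


def slow_requests(lines: Iterable[str], threshold_ms: float) -> Iterator[dict]:
    """Yield JSON log records whose latency_ms exceeds threshold_ms."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # human-readable or truncated line
        if not isinstance(record, dict):
            continue
        latency = record.get("latency_ms")  # optional field
        if isinstance(latency, (int, float)) and latency > threshold_ms:
            yield record
```

Passing an open log file or `sys.stdin` as `lines` lets the filter run over a captured file or a piped stream, and each yielded record keeps its request_id and trace_id for correlation with traces.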
