Monitoring & Observability
SIE exposes monitoring at each runtime layer: gateway, config service, and worker pods. Inside each Kubernetes worker pod, the SIE server sidecar owns queue health while the Python sie-server adapter owns model execution. Use health endpoints for orchestration, Prometheus metrics for alerting, WebSocket streams for interactive status, and sie-top for terminal inspection.
Health Endpoints
SIE exposes Kubernetes-compatible health probes for liveness and readiness checks. In Docker, the Python sie-server process owns these endpoints. In Kubernetes, the gateway, config service, and both containers inside each worker pod have their own health contract.
| Component | /healthz | /readyz |
|---|---|---|
| sie-gateway | Process liveness, returns ok | Process readiness. It does not wait for SIE server sidecar health or sie-config |
| SIE server sidecar (worker-sidecar container) | Process liveness | Fresh IPC Ping to the in-pod Python process and no active drain |
| sie-server | Python process liveness | Adapter process ready to receive work |
| sie-config | Config process liveness | Registry initialized and able to serve config endpoints |
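The two probes in the table can also be polled from a script. The following is a minimal sketch, assuming each component serves /healthz and /readyz over plain HTTP and answers with status 200 when healthy. The `http_status` and `probe` helpers are illustrative names, not part of SIE.

```python
from typing import Callable, Dict
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as exc:
        # Non-2xx responses still carry a status code.
        return exc.code
    except (URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def probe(base_url: str, fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Check liveness and readiness for one component (assumes 200 means healthy)."""
    base = base_url.rstrip("/")
    return {
        "live": fetch(f"{base}/healthz") == 200,
        "ready": fetch(f"{base}/readyz") == 200,
    }
```

Calling `probe("http://localhost:8080")` returns a dictionary with `live` and `ready` flags. The `fetch` argument is injectable so the logic can be exercised without a running server.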
Liveness
```bash
curl http://localhost:8080/healthz
# Returns: ok
```

Use /healthz for Kubernetes liveness probes. A failed check triggers a container restart.
Readiness
```bash
curl http://localhost:8080/readyz
# Returns: ok
```

Use /readyz for Kubernetes readiness probes. On the gateway, readiness means the process can accept traffic and return 202 for cold-start capacity; worker-pod availability is exposed through /health, inference responses, and metrics.
Kubernetes configuration:
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```

Prometheus Metrics
SIE exposes Prometheus-format metrics at /metrics. Cluster deployments use component prefixes so dashboards can separate request edge, queue runtime, config, and adapter work.
Gateway Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| sie_gateway_requests_total | Counter | endpoint, status, machine_profile | Gateway request count |
| sie_gateway_request_latency_seconds | Histogram | endpoint, machine_profile | Gateway request latency |
| sie_gateway_pending_demand | Gauge | machine_profile, bundle | KEDA scale-from-zero trigger |
| sie_gateway_worker_queue_depth | Gauge | worker, machine_profile, bundle | Queue depth from SIE server sidecar health |
| sie_gateway_config_epoch | Gauge | none | Highest config epoch applied on this gateway |
| sie_gateway_nats_connected | Gauge | none | Gateway NATS connection state |
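To inspect a single gauge without a Prometheus server, the text returned by /metrics can be parsed directly. The sketch below is a simplified parser for the Prometheus text exposition format that only handles label values without escaped quotes. The sample payload and its label values (`l4`, `default`) are made up for illustration.

```python
import re
from typing import Dict, List, Tuple

# Matches sample lines such as: name{label="value",...} 12.5
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

Sample = Tuple[Dict[str, str], float]


def parse_metric(text: str, name: str) -> List[Sample]:
    """Return (labels, value) pairs for every sample of one metric."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match and match.group(1) == name:
            labels = dict(LABEL_RE.findall(match.group(2) or ""))
            samples.append((labels, float(match.group(3))))
    return samples


# Hypothetical scrape output, not captured from a real gateway.
EXAMPLE = """\
# TYPE sie_gateway_pending_demand gauge
sie_gateway_pending_demand{machine_profile="l4",bundle="default"} 3
sie_gateway_nats_connected 1
"""

for labels, value in parse_metric(EXAMPLE, "sie_gateway_pending_demand"):
    print(labels, value)
```

For anything beyond ad hoc inspection, a maintained parser is the safer choice, since the exposition format also allows escaped label values and optional timestamps.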
Config Service Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| sie_config_http_requests_total | Counter | method, path, status | Config API request count |
| sie_config_http_request_duration_seconds | Histogram | method, path | Config API request latency |
| sie_config_epoch | Gauge | none | Authoritative persisted config epoch |
| sie_config_models_total | Gauge | source | Models known to the registry by origin (api or filesystem) |
| sie_config_nats_connected | Gauge | none | Config publisher NATS connection state |
| sie_config_nats_publishes_total | Counter | result | Config-delta publish attempts (success, partial, failure) |
| sie_config_store_writes_total | Counter | op, result | ConfigStore writes and epoch increments by result |
SIE Server Sidecar Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| sie_worker_messages_received_total | Counter | none | JetStream messages pulled |
| sie_worker_messages_acked_total | Counter | none | JetStream messages ACKed |
| sie_worker_messages_naked_total | Counter | none | JetStream messages NAKed |
| sie_worker_backend_process_seconds | Histogram | backend, operation, model, result | IPC batch processing time in the sie-server adapter |
| sie_worker_scheduler_batch_items | Histogram | model, operation, lora | Items per batch formed by the SIE server sidecar |
| sie_worker_ipc_request_seconds | Histogram | method, result | SIE server sidecar to sie-server adapter IPC latency |
| sie_worker_config_epoch | Gauge | none | Highest config epoch applied by this SIE server sidecar |
| sie_worker_nats_redelivery_total | Counter | none | JetStream redelivery count |
Python sie-server Adapter Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| sie_requests_total | Counter | model, endpoint, status | Requests processed by standalone sie-server or Python sie-server adapter |
| sie_request_duration_seconds | Histogram | model, endpoint, phase | Adapter-side request duration breakdown |
| sie_batch_size | Histogram | model | Items per Python batch |
| sie_model_loaded | Gauge | model, device | Model load state |
| sie_model_memory_bytes | Gauge | model, device | GPU memory usage per model |
Duration Phases
The sie_request_duration_seconds histogram tracks latency by phase:

| Phase | Description |
|---|---|
| total | End-to-end request latency |
| queue | Time spent waiting in the request queue |
| tokenize | Tokenization and preprocessing time |
| inference | GPU inference time |
Histogram Buckets
Duration buckets (seconds): 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0
Batch size buckets: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024
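Because observations are only counted per bucket, quantiles derived from these histograms are estimates whose resolution depends on the bucket boundaries. The sketch below roughly mirrors the linear interpolation that PromQL's histogram_quantile applies within a bucket. The function name and the sample counts are illustrative assumptions, not SIE data.

```python
from typing import Sequence

DURATION_BUCKETS = [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25,
                    0.5, 1.0, 2.5, 5.0, 10.0, 30.0]


def estimate_quantile(q: float, bounds: Sequence[float],
                      cumulative: Sequence[float]) -> float:
    """Estimate quantile q (0..1) from cumulative bucket counts.

    bounds holds the finite upper bounds. cumulative has one extra
    trailing entry for the +Inf bucket, which is the total count.
    """
    if len(cumulative) != len(bounds) + 1:
        raise ValueError("cumulative needs one entry per bound plus +Inf")
    total = cumulative[-1]
    if total == 0:
        return float("nan")
    rank = q * total
    for i, upper in enumerate(bounds):
        if cumulative[i] >= rank:
            lower = bounds[i - 1] if i > 0 else 0.0
            below = cumulative[i - 1] if i > 0 else 0.0
            in_bucket = cumulative[i] - below
            if in_bucket == 0:
                return lower
            # Assume observations are spread evenly inside the bucket.
            return lower + (upper - lower) * (rank - below) / in_bucket
    # Rank falls in the +Inf bucket: report the largest finite bound.
    return bounds[-1]


# Made-up cumulative counts for 100 requests (last entry is +Inf).
COUNTS = [0, 0, 10, 40, 70, 90, 98, 100, 100, 100, 100, 100, 100, 100]

print(estimate_quantile(0.99, DURATION_BUCKETS, COUNTS))
```

With these counts the 99th request falls in the 0.25 to 0.5 second bucket, so the estimate can only land somewhere inside that range. Latencies above 30 seconds are reported as 30 because no finite bucket covers them.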
Scrape Configuration
Helm can create the ServiceMonitors for gateway, SIE server sidecar, config, and observability sub-charts. For a manual Prometheus scrape, target each component separately:

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'sie-gateway'
    static_configs:
      - targets: ['gateway:8080']
    metrics_path: /metrics

  - job_name: 'sie-worker-sidecar'
    static_configs:
      - targets: ['worker:9095']
    metrics_path: /metrics
    scrape_interval: 15s

  - job_name: 'sie-config'
    static_configs:
      - targets: ['sie-config:8080']
    metrics_path: /metrics
```

sie-top TUI
The sie-top command provides a real-time terminal interface for monitoring SIE servers.
Installation
```bash
pip install 'sie-admin[top]'

# Monitor local server (auto-detects mode)
sie-top

# Monitor specific server
sie-top localhost:8080

# Force Python sie-server status mode
sie-top --worker worker-0.sie.svc:8080

# Force cluster mode (connect to gateway)
sie-top --cluster gateway.example.com:8080
```

Mode is auto-detected by probing the gateway /health endpoint. Use --worker for a Python sie-server status endpoint or --cluster for gateway cluster status.
Features
The TUI displays:
- Server info: Version, uptime, user, PID
- GPU table: Device name, memory usage, compute utilization, trend sparkline
- Model table: Name, state, device, memory, queue depth, QPS sparkline
- Detail panel: Selected GPU or model with 60-second history charts
Keyboard shortcuts:
| Key | Action |
|---|---|
| j / Down | Move selection down |
| k / Up | Move selection up |
| ? | Show help |
| q | Quit |
WebSocket Status
The Python sie-server process streams real-time status over WebSocket at /ws/status. Updates are pushed every 200 ms. In Kubernetes, the gateway also exposes /ws/cluster-status for aggregate cluster status, while routing health comes from SIE server sidecar NATS heartbeats.
Connection
```python
import asyncio
import json

import websockets


async def monitor():
    # Each message is one JSON status snapshot from the server.
    async with websockets.connect("ws://localhost:8080/ws/status") as ws:
        async for message in ws:
            status = json.loads(message)
            print(f"Loaded models: {status['loaded_models']}")
            print(f"GPU type: {status['gpu']}")


# Run until the connection closes or the process is interrupted.
asyncio.run(monitor())
```

Status Message Format
```json
{
  "timestamp": 1703001234.567,
  "gpu": "l4",
  "loaded_models": ["bge-m3", "e5-base-v2"],
  "server": {
    "version": "0.1.0",
    "uptime_seconds": 3600,
    "user": "sie",
    "working_dir": "/app",
    "pid": 1
  },
  "gpus": [
    {
      "device": "cuda:0",
      "name": "NVIDIA L4",
      "gpu_type": "l4",
      "utilization_pct": 45,
      "memory_used_bytes": 8589934592,
      "memory_total_bytes": 23622320128,
      "memory_threshold_pct": 85
    }
  ],
  "models": [
    {
      "name": "bge-m3",
      "state": "loaded",
      "device": "cuda:0",
      "memory_bytes": 2147483648,
      "queue_depth": 0,
      "queue_pending_items": 0,
      "config": {
        "hf_id": "BAAI/bge-m3",
        "adapter": "bge_m3",
        "inputs": ["text"],
        "outputs": ["dense", "sparse"]
      }
    }
  ],
  "counters": {},
  "histograms": {}
}
```

Model States
| State | Description |
|---|---|
| available | Config loaded, weights not in memory |
| loading | Weights currently loading to GPU |
| loaded | Ready for inference |
| unloading | Weights being evicted from GPU |
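A status message can be reduced to a few health signals before it is displayed or logged. The following is a minimal sketch that counts models per state and flags GPUs above their memory threshold. It assumes memory_threshold_pct is a percentage of memory_total_bytes, which the message format suggests but does not state. Both helper names are hypothetical.

```python
import json
from collections import Counter
from typing import Any, Dict, List


def gpus_over_threshold(status: Dict[str, Any]) -> List[str]:
    """Return devices whose memory use exceeds their memory_threshold_pct."""
    hot = []
    for gpu in status.get("gpus", []):
        used_pct = 100.0 * gpu["memory_used_bytes"] / gpu["memory_total_bytes"]
        if used_pct > gpu["memory_threshold_pct"]:
            hot.append(gpu["device"])
    return hot


def models_by_state(status: Dict[str, Any]) -> Dict[str, int]:
    """Count models per state: available, loading, loaded, unloading."""
    return dict(Counter(model["state"] for model in status.get("models", [])))


# Trimmed copy of the status message format shown earlier.
MESSAGE = json.loads("""
{
  "gpus": [{"device": "cuda:0", "memory_used_bytes": 8589934592,
            "memory_total_bytes": 23622320128, "memory_threshold_pct": 85}],
  "models": [{"name": "bge-m3", "state": "loaded", "device": "cuda:0"}]
}
""")

print(gpus_over_threshold(MESSAGE))
print(models_by_state(MESSAGE))
```

In the sample message the GPU uses about 36% of its memory, well under the 85% threshold, so no device is flagged.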
Grafana Dashboards
SIE includes pre-built Grafana dashboards in the Helm chart at deploy/helm/sie-cluster/files/dashboards/. These are automatically provisioned when deploying with Grafana’s sidecar.
Example queries for common panels:
Request Rate
```promql
sum(rate(sie_requests_total{status="success"}[5m])) by (model)
```

P99 Latency

```promql
histogram_quantile(0.99, sum(rate(sie_request_duration_seconds_bucket{phase="total"}[5m])) by (le, model))
```

GPU Memory Usage

```promql
sum(sie_model_memory_bytes) by (device)
```

Queue Depth

```promql
sum(sie_queue_depth) by (model)
```

Batch Efficiency

```promql
histogram_quantile(0.5, sum(rate(sie_batch_size_bucket[5m])) by (le, model))
```

Alert Rules
The sie-cluster chart can render pre-configured Prometheus alert rules:

| Alert | Severity | Condition | Description |
|---|---|---|---|
| SIEWorkerDown | critical | SIE server sidecar scrape target down for 2 min | A SIE server sidecar scrape target is unreachable |
| SIENoHealthyWorkers | critical | No SIE server sidecar scrape targets healthy for 1 min | No healthy SIE server sidecar targets are reporting |
| SIEWorkerHighQueueDepth | warning | Queue depth > 50 for 5 min | SIE server sidecar queue depth is high; consider scaling up |
| SIEGPUMemoryHigh | warning | GPU memory > 90% for 5 min | Risk of OOM, LRU eviction may be insufficient |
| SIEGPUTemperatureHigh | warning | GPU temp > 80°C for 5 min | GPU throttling likely, check cooling |
| SIEGPUECCErrors | critical | Double-bit ECC errors increase over 1h | Hardware issue likely |
| SIEGatewayDown | critical | Gateway scrape target down for 1 min | Traffic cannot be routed |
| SIEHighErrorRate | warning | Gateway 5xx rate > 5% for 5 min | Server or model errors spiking |
| SIEHighLatency | warning | p95 latency > 5s for 5 min | Request latency is elevated |
| SIEConfigDown | critical | Config scrape target down for 2 min | Config writes are blocked; gateways serve cached state |
| SIEProvisioningStuck | warning | Pod Pending for 10 min | Check scheduling events and GPU capacity |
| SIEScaleUpFailed | warning | FailedScheduling event in 10 min | Likely insufficient GPU capacity |
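Most conditions in the table pair a threshold with a duration, such as "Queue depth > 50 for 5 min". The sketch below is a simplified model of that behavior: the alert fires only after the condition has held at every evaluation across the whole window, and any dip resets the timer. The 60-second evaluation interval and the sample values are assumptions for illustration, and the function is not how Prometheus is implemented internally.

```python
from typing import Iterable, Optional, Tuple


def first_firing_time(samples: Iterable[Tuple[float, float]],
                      threshold: float,
                      for_seconds: float) -> Optional[float]:
    """Return the timestamp at which the alert would start firing, or None.

    samples are (timestamp_seconds, value) pairs in time order, one per
    rule evaluation.
    """
    pending_since: Optional[float] = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true
            if ts - pending_since >= for_seconds:
                return ts
        else:
            pending_since = None  # condition cleared, reset the timer
    return None


# Made-up queue depths, evaluated every 60 seconds.
DEPTHS = [60, 70, 40, 55, 80, 75, 90, 66, 58]
SAMPLES = [(i * 60.0, depth) for i, depth in enumerate(DEPTHS)]

print(first_firing_time(SAMPLES, threshold=50, for_seconds=300))
```

The dip to 40 at the third evaluation resets the window, so the alert does not fire until the depth has stayed above 50 for a further five minutes.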
Installing Alert Rules
Alert rules are included in the sie-cluster chart when kube-prometheus-stack is installed or alertRules.enabled is true:

```bash
helm upgrade --install sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --version 0.1.10 \
  -n sie \
  -f helm-values.yaml \
  --set alertRules.enabled=true
```

Custom Alerts
Add custom alerts to your Prometheus configuration:

```yaml
# Alert when P99 latency exceeds 5 seconds
- alert: SIEHighLatency
  expr: |
    histogram_quantile(0.99,
      sum(rate(sie_request_duration_seconds_bucket{phase="total"}[5m])) by (le, model)
    ) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High P99 latency for model {{ $labels.model }}"
```

Logging
SIE supports both human-readable and structured JSON logging.
Log Levels
Enable verbose logging with --verbose or -v:

```bash
sie-server serve --verbose
```

JSON Logging
Enable JSON format for Loki and log aggregation systems:

```bash
sie-server serve --json-logs
```

Or via environment variable:

```bash
export SIE_LOG_JSON=true
sie-server serve
```

JSON Log Format
```json
{
  "timestamp": "2025-12-18T10:30:00.123Z",
  "level": "INFO",
  "logger": "sie_server.api.encode",
  "message": "Inference completed",
  "model": "bge-m3",
  "request_id": "abc123",
  "trace_id": "def456",
  "latency_ms": 45.2,
  "batch_size": 16,
  "gpu_type": "l4"
}
```

Structured Fields
JSON logs include optional fields when available:

| Field | Description |
|---|---|
| model | Model name for the request |
| request_id | Unique request identifier |
| trace_id | OpenTelemetry trace ID |
| latency_ms | Request latency in milliseconds |
| batch_size | Number of items in the batch |
| gpu_type | Detected GPU type |
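Because these fields are optional, consumers of the JSON logs should tolerate records that omit them. The following is a minimal sketch that filters slow requests from a log stream, assuming one JSON object per line. The `slow_requests` helper and the sample lines are illustrative, with the first line shortened from the log format shown earlier.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def slow_requests(lines: Iterable[str],
                  min_latency_ms: float) -> Iterator[Dict[str, Any]]:
    """Yield parsed log records whose latency_ms meets the cutoff."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip human-readable or truncated lines
        if not isinstance(record, dict):
            continue
        latency = record.get("latency_ms")  # optional field
        if isinstance(latency, (int, float)) and latency >= min_latency_ms:
            yield record


# Hypothetical log lines. Only the first carries latency_ms.
LOG_LINES = [
    '{"level": "INFO", "message": "Inference completed", "model": "bge-m3",'
    ' "request_id": "abc123", "latency_ms": 45.2, "batch_size": 16}',
    '{"level": "INFO", "message": "Server started"}',
    "plain text line that is not JSON",
]

for record in slow_requests(LOG_LINES, min_latency_ms=40):
    print(record["request_id"], record["model"], record["latency_ms"])
```

The same function works on a file handle or on `sys.stdin`, since both iterate line by line.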
What’s Next
- Scale-from-Zero - autoscaling lifecycle and troubleshooting
- Troubleshooting - common issues and solutions
- CLI Reference - all server options
- API Reference - endpoint documentation