Monitoring & Observability

SIE exposes monitoring at each runtime layer: gateway, config service, and worker pods. Inside each Kubernetes worker pod, the SIE server sidecar owns queue health while the Python sie-server adapter owns model execution. Use health endpoints for orchestration, Prometheus metrics for alerting, WebSocket streams for interactive status, and sie-top for terminal inspection.

Health Endpoints

SIE exposes Kubernetes-compatible health probes for liveness and readiness checks. In Docker, the Python sie-server process owns these endpoints. In Kubernetes, the gateway, the config service, and the two containers in every worker pod each have their own health contract.

Component	`/healthz`	`/readyz`
`sie-gateway`	Process liveness, returns `ok`	Process readiness. It does not wait for SIE server sidecar health or `sie-config`
SIE server sidecar (`worker-sidecar` container)	Process liveness	Fresh IPC `Ping` to the in-pod Python process and no active drain
`sie-server`	Python process liveness	Adapter process ready to receive work
`sie-config`	Config process liveness	Registry initialized and able to serve config endpoints

Liveness

curl http://localhost:8080/healthz
# Returns: ok

Use /healthz for Kubernetes liveness probes. When the probe fails repeatedly (failureThreshold consecutive failures, 3 by default), Kubernetes restarts the container.

Readiness

curl http://localhost:8080/readyz
# Returns: ok

Use /readyz for Kubernetes readiness probes. On the gateway, readiness means the process can accept traffic and return 202 for cold-start capacity; worker-pod availability is exposed through /health, inference responses, and metrics.

Kubernetes configuration:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
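
Outside Kubernetes, the same two endpoints can be polled from a script, for example in a deployment smoke test. The sketch below is a minimal illustration and not part of SIE: `check_probe` and its `opener` parameter are hypothetical names, and the function treats any 2xx response as passing.

```python
import urllib.error
import urllib.request


def check_probe(base_url, path, timeout=2.0, opener=urllib.request.urlopen):
    """Return True if GET base_url + path answers with a 2xx status.

    `opener` is injectable (hypothetical parameter) so the function can be
    exercised without a live server.
    """
    url = base_url.rstrip("/") + path
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx HTTPError all count as failing.
        return False
```

For example, `check_probe("http://localhost:8080", "/readyz")` would return `True` once the process reports ready, and `False` while it is unreachable or answering with an error status.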

Prometheus Metrics

SIE exposes Prometheus-format metrics at /metrics. Cluster deployments use component prefixes so dashboards can separate request edge, queue runtime, config, and adapter work.

Gateway Metrics

Metric	Type	Labels	Description
`sie_gateway_requests_total`	Counter	`endpoint`, `status`, `machine_profile`	Gateway request count
`sie_gateway_request_latency_seconds`	Histogram	`endpoint`, `machine_profile`	Gateway request latency
`sie_gateway_pending_demand`	Gauge	`machine_profile`, `bundle`	KEDA scale-from-zero trigger
`sie_gateway_worker_queue_depth`	Gauge	`worker`, `machine_profile`, `bundle`	Queue depth from SIE server sidecar health
`sie_gateway_config_epoch`	Gauge	none	Highest config epoch applied on this gateway
`sie_gateway_nats_connected`	Gauge	none	Gateway NATS connection state

Config Service Metrics

Metric	Type	Labels	Description
`sie_config_http_requests_total`	Counter	`method`, `path`, `status`	Config API request count
`sie_config_http_request_duration_seconds`	Histogram	`method`, `path`	Config API request latency
`sie_config_epoch`	Gauge	none	Authoritative persisted config epoch
`sie_config_models_total`	Gauge	`source`	Models known to the registry by origin (`api` or `filesystem`)
`sie_config_nats_connected`	Gauge	none	Config publisher NATS connection state
`sie_config_nats_publishes_total`	Counter	`result`	Config-delta publish attempts (`success`, `partial`, `failure`)
`sie_config_store_writes_total`	Counter	`op`, `result`	ConfigStore writes and epoch increments by result

SIE Server Sidecar Metrics

Metric	Type	Labels	Description
`sie_worker_messages_received_total`	Counter	none	JetStream messages pulled
`sie_worker_messages_acked_total`	Counter	none	JetStream messages ACKed
`sie_worker_messages_naked_total`	Counter	none	JetStream messages NAKed
`sie_worker_backend_process_seconds`	Histogram	`backend`, `operation`, `model`, `result`	IPC batch processing time in the `sie-server` adapter
`sie_worker_scheduler_batch_items`	Histogram	`model`, `operation`, `lora`	Items per batch formed by the SIE server sidecar
`sie_worker_ipc_request_seconds`	Histogram	`method`, `result`	SIE server sidecar to `sie-server` adapter IPC latency
`sie_worker_config_epoch`	Gauge	none	Highest config epoch applied by this SIE server sidecar
`sie_worker_nats_redelivery_total`	Counter	none	JetStream redelivery count

Python `sie-server` Adapter Metrics

Metric	Type	Labels	Description
`sie_requests_total`	Counter	`model`, `endpoint`, `status`	Requests processed by standalone `sie-server` or Python `sie-server` adapter
`sie_request_duration_seconds`	Histogram	`model`, `endpoint`, `phase`	Adapter-side request duration breakdown
`sie_batch_size`	Histogram	`model`	Items per Python batch
`sie_model_loaded`	Gauge	`model`, `device`	Model load state
`sie_model_memory_bytes`	Gauge	`model`, `device`	GPU memory usage per model

Duration Phases

The sie_request_duration_seconds histogram tracks latency by phase:

Phase	Description
`total`	End-to-end request latency
`queue`	Time spent waiting in the request queue
`tokenize`	Tokenization and preprocessing time
`inference`	GPU inference time

Histogram Buckets

Duration buckets (seconds): 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0

Batch size buckets: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024
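
The bucket boundaries limit the resolution of any quantile computed from these histograms. As a rough illustration, the sketch below estimates a quantile from cumulative bucket counts by interpolating linearly inside the matching bucket, which is the general approach PromQL's `histogram_quantile` takes. The function name `estimate_quantile` and the sample counts are invented for the example; this is a simplification, not Prometheus's exact implementation.

```python
import math


def estimate_quantile(q, cumulative):
    """Estimate quantile q from cumulative histogram buckets.

    `cumulative` is a list of (upper_bound, cumulative_count) pairs sorted by
    bound and ending with (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in cumulative:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation within the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical counts: 50 requests <= 0.05 s, 90 <= 0.1 s, 100 <= 0.25 s.
sample = [(0.05, 50), (0.1, 90), (0.25, 100), (math.inf, 100)]
p99 = estimate_quantile(0.99, sample)
```

With these sample counts the estimate lands at about 0.235 s, inside the 0.1 to 0.25 bucket. The true p99 could be anywhere in that bucket, which is why wide buckets make tail-latency estimates coarse.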

Scrape Configuration

Helm can create the ServiceMonitors for gateway, SIE server sidecar, config, and observability sub-charts. For a manual Prometheus scrape, target each component separately:

# prometheus.yml
scrape_configs:
  - job_name: 'sie-gateway'
    static_configs:
      - targets: ['gateway:8080']
    metrics_path: /metrics

  - job_name: 'sie-worker-sidecar'
    static_configs:
      - targets: ['worker:9095']
    metrics_path: /metrics
    scrape_interval: 15s

  - job_name: 'sie-config'
    static_configs:
      - targets: ['sie-config:8080']
    metrics_path: /metrics
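
When spot-checking a scrape target by hand, it can help to parse the text exposition format directly. The sketch below is a simplified parser that assumes well-formed output; `parse_metrics` is an illustrative helper, not part of SIE. It skips comment lines (`# HELP`, `# TYPE`) and ignores optional timestamps.

```python
import re

# name, optional {labels}, then the sample value
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        match = _SAMPLE.match(line)
        if not match:
            continue  # tolerate lines this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Feeding it the body of a `/metrics` response yields tuples that can be filtered by name, for instance to list `sie_model_memory_bytes` per `model` label.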

sie-top TUI

The sie-top command provides a real-time terminal interface for monitoring SIE servers.

Installation

pip install 'sie-admin[top]'

Usage

# Monitor local server (auto-detects mode)
sie-top

# Monitor specific server
sie-top localhost:8080

# Force Python sie-server status mode
sie-top --worker worker-0.sie.svc:8080

# Force cluster mode (connect to gateway)
sie-top --cluster gateway.example.com:8080

Mode is auto-detected by probing the gateway /health endpoint. Use --worker for a Python sie-server status endpoint or --cluster for gateway cluster status.

Features

The TUI displays:

Server info: Version, uptime, user, PID
GPU table: Device name, memory usage, compute utilization, trend sparkline
Model table: Name, state, device, memory, queue depth, QPS sparkline
Detail panel: Selected GPU or model with 60-second history charts

Keyboard shortcuts:

Key	Action
`j` / `Down`	Move selection down
`k` / `Up`	Move selection up
`?`	Show help
`q`	Quit

WebSocket Status

The Python sie-server process streams real-time status over WebSocket at /ws/status, pushing an update every 200 ms. In Kubernetes, the gateway also exposes /ws/cluster-status for aggregate cluster status, while routing health comes from SIE server sidecar NATS heartbeats.

Connection

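import asyncio
import json

import websockets  # third-party: pip install websockets

async def monitor():
    # Connect to the Python sie-server status stream and print each update.
    async with websockets.connect("ws://localhost:8080/ws/status") as ws:
        async for message in ws:
            status = json.loads(message)
            print(f"Loaded models: {status['loaded_models']}")
            print(f"GPU type: {status['gpu']}")

# Run until interrupted (Ctrl+C) or the server closes the connection.
asyncio.run(monitor())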

Status Message Format

{
  "timestamp": 1703001234.567,
  "gpu": "l4",
  "loaded_models": ["bge-m3", "e5-base-v2"],
  "server": {
    "version": "0.1.0",
    "uptime_seconds": 3600,
    "user": "sie",
    "working_dir": "/app",
    "pid": 1
  },
  "gpus": [
    {
      "device": "cuda:0",
      "name": "NVIDIA L4",
      "gpu_type": "l4",
      "utilization_pct": 45,
      "memory_used_bytes": 8589934592,
      "memory_total_bytes": 23622320128,
      "memory_threshold_pct": 85
    }
  ],
  "models": [
    {
      "name": "bge-m3",
      "state": "loaded",
      "device": "cuda:0",
      "memory_bytes": 2147483648,
      "queue_depth": 0,
      "queue_pending_items": 0,
      "config": {
        "hf_id": "BAAI/bge-m3",
        "adapter": "bge_m3",
        "inputs": ["text"],
        "outputs": ["dense", "sparse"]
      }
    }
  ],
  "counters": {},
  "histograms": {}
}
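
A consumer of this stream might compare each GPU's memory use against its threshold. The sketch below works on one decoded status message and uses only the fields shown above. It assumes `memory_threshold_pct` is a percentage of `memory_total_bytes`, which the message format suggests but does not state; `gpus_over_threshold` is a hypothetical helper name.

```python
def gpus_over_threshold(status):
    """Return (device, used_pct) for each GPU above its memory threshold.

    Assumption: memory_threshold_pct is a percentage of memory_total_bytes.
    """
    flagged = []
    for gpu in status.get("gpus", []):
        total = gpu["memory_total_bytes"]
        if total <= 0:
            continue  # avoid dividing by zero on malformed entries
        used_pct = 100.0 * gpu["memory_used_bytes"] / total
        if used_pct > gpu["memory_threshold_pct"]:
            flagged.append((gpu["device"], round(used_pct, 1)))
    return flagged
```

For the sample message above, the L4 is using roughly 36% of its memory, well under the 85% threshold, so the function would return an empty list.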

Model States

State	Description
`available`	Config loaded, weights not in memory
`loading`	Weights currently loading to GPU
`loaded`	Ready for inference
`unloading`	Weights being evicted from GPU

Grafana Dashboards

SIE includes pre-built Grafana dashboards in the Helm chart at deploy/helm/sie-cluster/files/dashboards/. These are automatically provisioned when deploying with Grafana’s sidecar.

Example queries for common panels:

Request Rate

sum(rate(sie_requests_total{status="success"}[5m])) by (model)

P99 Latency

histogram_quantile(0.99,
  sum(rate(sie_request_duration_seconds_bucket{phase="total"}[5m])) by (le, model)
)

GPU Memory Usage

sum(sie_model_memory_bytes) by (device)

Queue Depth

sum(sie_queue_depth) by (model)

Batch Efficiency

histogram_quantile(0.5,
  sum(rate(sie_batch_size_bucket[5m])) by (le, model)
)
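
The same queries can be run outside Grafana through the Prometheus HTTP API's instant-query endpoint (`/api/v1/query`). The sketch below is one way to do that with the standard library. The Prometheus address is an assumption that depends on your deployment, and `build_query_url` and `instant_query` are illustrative names.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(prometheus_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return prometheus_url.rstrip("/") + "/api/v1/query?" + params


def instant_query(prometheus_url, promql, timeout=5.0):
    """Run the query and return the list of result series."""
    url = build_query_url(prometheus_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    return payload["data"]["result"]
```

Assuming Prometheus is reachable at `http://prometheus:9090`, calling `instant_query("http://prometheus:9090", "sum(sie_model_memory_bytes) by (device)")` would return one series per `device` label.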

Alert Rules

The sie-cluster chart can render pre-configured Prometheus alert rules:

Alert	Severity	Condition	Description
`SIEWorkerDown`	critical	SIE server sidecar scrape target down for 2 min	A SIE server sidecar scrape target is unreachable
`SIENoHealthyWorkers`	critical	No SIE server sidecar scrape targets healthy for 1 min	No healthy SIE server sidecar targets are reporting
`SIEWorkerHighQueueDepth`	warning	Queue depth > 50 for 5 min	SIE server sidecar queue depth is high; consider scaling up
`SIEGPUMemoryHigh`	warning	GPU memory > 90% for 5 min	Risk of OOM; LRU eviction may be insufficient
`SIEGPUTemperatureHigh`	warning	GPU temp > 80°C for 5 min	GPU throttling likely; check cooling
`SIEGPUECCErrors`	critical	Double-bit ECC errors increase over 1h	Hardware issue likely
`SIEGatewayDown`	critical	Gateway scrape target down for 1 min	Traffic cannot be routed
`SIEHighErrorRate`	warning	Gateway 5xx rate > 5% for 5 min	Server or model errors spiking
`SIEHighLatency`	warning	p95 latency > 5s for 5 min	Request latency is elevated
`SIEConfigDown`	critical	Config scrape target down for 2 min	Config writes are blocked; gateways serve cached state
`SIEProvisioningStuck`	warning	Pod Pending for 10 min	Check scheduling events and GPU capacity
`SIEScaleUpFailed`	warning	FailedScheduling event in 10 min	Likely insufficient GPU capacity

Installing Alert Rules

Alert rules are included in the sie-cluster chart when kube-prometheus-stack is installed or alertRules.enabled is true:

helm upgrade --install sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --version 0.1.10 \
  -n sie \
  -f helm-values.yaml \
  --set alertRules.enabled=true

Custom Alerts

Add custom alerts to your Prometheus configuration:

# Alert when P99 latency exceeds 5 seconds
- alert: SIEHighLatency
  expr: |
    histogram_quantile(0.99,
      sum(rate(sie_request_duration_seconds_bucket{phase="total"}[5m])) by (le, model)
    ) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High P99 latency for model {{ $labels.model }}"

Logging

SIE supports both human-readable and structured JSON logging.

Log Levels

Enable verbose logging with --verbose or -v:

sie-server serve --verbose

JSON Logging

Enable JSON format for Loki and log aggregation systems:

sie-server serve --json-logs

Or via environment variable:

export SIE_LOG_JSON=true
sie-server serve

JSON Log Format

{
  "timestamp": "2025-12-18T10:30:00.123Z",
  "level": "INFO",
  "logger": "sie_server.api.encode",
  "message": "Inference completed",
  "model": "bge-m3",
  "request_id": "abc123",
  "trace_id": "def456",
  "latency_ms": 45.2,
  "batch_size": 16,
  "gpu_type": "l4"
}

Structured Fields

JSON logs include optional fields when available:

Field	Description
`model`	Model name for the request
`request_id`	Unique request identifier
`trace_id`	OpenTelemetry trace ID
`latency_ms`	Request latency in milliseconds
`batch_size`	Number of items in the batch
`gpu_type`	Detected GPU type
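
Because these fields are optional, any consumer has to tolerate records that lack them. The sketch below filters slow requests from a stream of log lines. It assumes one JSON object per line and uses only the field names shown above; `slow_requests` is a hypothetical helper, not an SIE tool.

```python
import json


def slow_requests(lines, threshold_ms):
    """Yield parsed log records whose latency_ms exceeds threshold_ms."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict):
            continue
        latency = record.get("latency_ms")  # optional field, may be absent
        if isinstance(latency, (int, float)) and latency > threshold_ms:
            yield record
```

Any iterable of lines works as input, such as an open log file. Each yielded record keeps its `model`, `request_id`, and `trace_id` fields when they are present, so a slow request can be correlated with traces.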

What’s Next

Scale-from-Zero - autoscaling lifecycle and troubleshooting
Troubleshooting - common issues and solutions
CLI Reference - all server options
API Reference - endpoint documentation