---
title: Monitoring & Observability
description: Health checks, Prometheus metrics, real-time TUI, and observability for SIE servers.
canonical_url: https://superlinked.com/docs/deployment/monitoring
last_updated: 2026-06-11
---

SIE exposes monitoring at each runtime layer: gateway, config service, and worker pods. Inside each Kubernetes worker pod, the SIE server sidecar owns queue health while the Python `sie-server` adapter owns model execution. Use health endpoints for orchestration, Prometheus metrics for alerting, WebSocket streams for interactive status, and `sie-top` for terminal inspection.

## Health Endpoints

Source: [packages/sie_gateway/src/handlers/health.rs](https://github.com/superlinked/sie/blob/main/packages/sie_gateway/src/handlers/health.rs)

Source: [packages/sie_server_sidecar/src/readiness.rs](https://github.com/superlinked/sie/blob/main/packages/sie_server_sidecar/src/readiness.rs)

Source: [packages/sie_server/src/sie_server/api/health.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/api/health.py)

Source: [packages/sie_config/src/sie_config/health.py](https://github.com/superlinked/sie/blob/main/packages/sie_config/src/sie_config/health.py)

SIE exposes Kubernetes-compatible health probes for liveness and readiness checks. In Docker, the Python `sie-server` process owns these endpoints. In Kubernetes, the gateway, config service, and both containers inside each worker pod have their own health contract.

| Component | `/healthz` | `/readyz` |
|-----------|------------|-----------|
| `sie-gateway` | Process liveness, returns `ok` | Process readiness. It does not wait for SIE server sidecar health or `sie-config` |
| SIE server sidecar (`worker-sidecar` container) | Process liveness | Fresh IPC `Ping` to the in-pod Python process and no active drain |
| `sie-server` | Python process liveness | Adapter process ready to receive work |
| `sie-config` | Config process liveness | Registry initialized and able to serve config endpoints |

### Liveness

```bash
curl http://localhost:8080/healthz
# Returns: ok
```

Use `/healthz` for Kubernetes liveness probes. When the probe fails `failureThreshold` consecutive times (3 by default), the kubelet restarts the container.

### Readiness

```bash
curl http://localhost:8080/readyz
# Returns: ok
```

Use `/readyz` for Kubernetes readiness probes. On the gateway, readiness means the process can accept traffic and return `202` for cold-start capacity; worker-pod availability is exposed through `/health`, inference responses, and metrics.

**Kubernetes configuration:**

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```
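Outside Kubernetes, for example in a CI job or a Docker Compose startup script, the same endpoints can gate traffic. The following is a minimal sketch that assumes the `localhost:8080` address from the curl examples and treats any non-200 answer or connection error as "not ready". The helper names `probe` and `wait_until_ready` are illustrative and not part of SIE.

```python
import time
import urllib.error
import urllib.request


def probe(base_url, path, timeout=2.0):
    """Return True when GET base_url + path answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url.rstrip("/") + path, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a 4xx/5xx response: treat as not healthy.
        return False


def wait_until_ready(base_url, attempts=30, delay=1.0, check=probe):
    """Poll /readyz until it succeeds or the attempts run out."""
    for _ in range(attempts):
        if check(base_url, "/readyz"):
            return True
        time.sleep(delay)
    return False
```

Calling `wait_until_ready("http://localhost:8080")` blocks for up to about 30 seconds and returns `False` if the server never reports ready. The `check` parameter exists only so the polling loop can be exercised without a running server.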

## Prometheus Metrics

Source: [packages/sie_gateway/src/metrics.rs](https://github.com/superlinked/sie/blob/main/packages/sie_gateway/src/metrics.rs)

Source: [packages/sie_config/src/sie_config/metrics.py](https://github.com/superlinked/sie/blob/main/packages/sie_config/src/sie_config/metrics.py)

Source: [packages/sie_server_sidecar/src/metrics.rs](https://github.com/superlinked/sie/blob/main/packages/sie_server_sidecar/src/metrics.rs)

Source: [packages/sie_server/src/sie_server/observability/metrics.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/observability/metrics.py)

SIE exposes Prometheus-format metrics at `/metrics`. Cluster deployments use component prefixes so dashboards can separate request edge, queue runtime, config, and adapter work.

### Gateway Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `sie_gateway_requests_total` | Counter | `endpoint`, `status`, `machine_profile` | Gateway request count |
| `sie_gateway_request_latency_seconds` | Histogram | `endpoint`, `machine_profile` | Gateway request latency |
| `sie_gateway_pending_demand` | Gauge | `machine_profile`, `bundle` | KEDA scale-from-zero trigger |
| `sie_gateway_worker_queue_depth` | Gauge | `worker`, `machine_profile`, `bundle` | Queue depth from SIE server sidecar health |
| `sie_gateway_config_epoch` | Gauge | none | Highest config epoch applied on this gateway |
| `sie_gateway_nats_connected` | Gauge | none | Gateway NATS connection state |

### Config Service Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `sie_config_http_requests_total` | Counter | `method`, `path`, `status` | Config API request count |
| `sie_config_http_request_duration_seconds` | Histogram | `method`, `path` | Config API request latency |
| `sie_config_epoch` | Gauge | none | Authoritative persisted config epoch |
| `sie_config_models_total` | Gauge | `source` | Models known to the registry by origin (`api` or `filesystem`) |
| `sie_config_nats_connected` | Gauge | none | Config publisher NATS connection state |
| `sie_config_nats_publishes_total` | Counter | `result` | Config-delta publish attempts (`success`, `partial`, `failure`) |
| `sie_config_store_writes_total` | Counter | `op`, `result` | ConfigStore writes and epoch increments by result |

### SIE Server Sidecar Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `sie_worker_messages_received_total` | Counter | none | JetStream messages pulled |
| `sie_worker_messages_acked_total` | Counter | none | JetStream messages ACKed |
| `sie_worker_messages_naked_total` | Counter | none | JetStream messages NAKed |
| `sie_worker_backend_process_seconds` | Histogram | `backend`, `operation`, `model`, `result` | IPC batch processing time in the `sie-server` adapter |
| `sie_worker_scheduler_batch_items` | Histogram | `model`, `operation`, `lora` | Items per batch formed by the SIE server sidecar |
| `sie_worker_ipc_request_seconds` | Histogram | `method`, `result` | SIE server sidecar to `sie-server` adapter IPC latency |
| `sie_worker_config_epoch` | Gauge | none | Highest config epoch applied by this SIE server sidecar |
| `sie_worker_nats_redelivery_total` | Counter | none | JetStream redelivery count |

### Python `sie-server` Adapter Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `sie_requests_total` | Counter | `model`, `endpoint`, `status` | Requests processed by standalone `sie-server` or Python `sie-server` adapter |
| `sie_request_duration_seconds` | Histogram | `model`, `endpoint`, `phase` | Adapter-side request duration breakdown |
| `sie_batch_size` | Histogram | `model` | Items per Python batch |
| `sie_model_loaded` | Gauge | `model`, `device` | Model load state |
| `sie_model_memory_bytes` | Gauge | `model`, `device` | GPU memory usage per model |

### Duration Phases

The `sie_request_duration_seconds` histogram tracks latency by phase:

| Phase | Description |
|-------|-------------|
| `total` | End-to-end request latency |
| `queue` | Time spent waiting in the request queue |
| `tokenize` | Tokenization and preprocessing time |
| `inference` | GPU inference time |

### Histogram Buckets

**Duration buckets (seconds):** 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0

**Batch size buckets:** 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024
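Bucket boundaries limit the resolution of any quantile derived from these histograms. The sketch below mirrors the linear interpolation that Prometheus documents for `histogram_quantile`, applied to the duration buckets listed above. The function name `estimate_quantile` and the sample counts are illustrative assumptions, not SIE code or measured data.

```python
import math

# Duration bucket upper bounds in seconds, copied from the list above.
DURATION_BUCKETS = [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0]


def estimate_quantile(q, upper_bounds, cumulative_counts):
    """Estimate quantile q from cumulative bucket counts.

    cumulative_counts holds one entry per upper bound plus a final +Inf entry.
    """
    bounds = list(upper_bounds) + [math.inf]
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    for i, count in enumerate(cumulative_counts):
        if count >= rank:
            break
    if math.isinf(bounds[i]):
        # Observations beyond the last finite bound: report that bound.
        return bounds[i - 1]
    lower = bounds[i - 1] if i > 0 else 0.0
    prev = cumulative_counts[i - 1] if i > 0 else 0
    if count == prev:
        return bounds[i]
    # Assume observations are spread evenly inside the bucket.
    return lower + (bounds[i] - lower) * (rank - prev) / (count - prev)
```

With cumulative counts of `[0, 0, 10, 40, 80, 95, 99, 100, 100, 100, 100, 100, 100, 100]`, the estimated median falls inside the 0.025 to 0.05 second bucket. Because the last finite bound is 30 seconds, a quantile computed this way never exceeds 30 even when individual requests take longer.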

### Scrape Configuration

Helm can create the ServiceMonitors for gateway, SIE server sidecar, config, and observability sub-charts. For a manual Prometheus scrape, target each component separately:

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'sie-gateway'
    static_configs:
      - targets: ['gateway:8080']
    metrics_path: /metrics

  - job_name: 'sie-worker-sidecar'
    static_configs:
      - targets: ['worker:9095']
    metrics_path: /metrics
    scrape_interval: 15s

  - job_name: 'sie-config'
    static_configs:
      - targets: ['sie-config:8080']
    metrics_path: /metrics
```
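Before pointing Prometheus at a target, it can help to confirm that the endpoint serves the expected metric families. The following is a minimal sketch using only the standard library. It handles the common `name{labels} value` line shape of the Prometheus text format, not every corner of the specification, and the helper names are illustrative.

```python
import re
import urllib.request

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url, timeout=5.0):
    """Download the raw /metrics payload from one scrape target."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, `parse_metrics(fetch_metrics("http://gateway:8080/metrics"))` should contain names starting with `sie_gateway_` when the target is the gateway from the scrape configuration above.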

## sie-top TUI

> **Caution — Work in progress:**
>
> `sie-admin` and `sie-top` are not yet released. The CLI and package name may change before general availability.

Source: [packages/sie_admin/src/sie_admin/top/app.py](https://github.com/superlinked/sie/blob/main/packages/sie_admin/src/sie_admin/top/app.py)

The `sie-top` command provides a real-time terminal interface for monitoring SIE servers.

### Installation

```bash
pip install 'sie-admin[top]'
```

### Usage

```bash
# Monitor local server (auto-detects mode)
sie-top

# Monitor specific server
sie-top localhost:8080

# Force Python sie-server status mode
sie-top --worker worker-0.sie.svc:8080

# Force cluster mode (connect to gateway)
sie-top --cluster gateway.example.com:8080
```

Mode is auto-detected by probing the gateway `/health` endpoint. Use `--worker` for a Python `sie-server` status endpoint or `--cluster` for gateway cluster status.

### Features

The TUI displays:

- **Server info:** Version, uptime, user, PID
- **GPU table:** Device name, memory usage, compute utilization, trend sparkline
- **Model table:** Name, state, device, memory, queue depth, QPS sparkline
- **Detail panel:** Selected GPU or model with 60-second history charts

**Keyboard shortcuts:**

| Key | Action |
|-----|--------|
| `j` / `Down` | Move selection down |
| `k` / `Up` | Move selection up |
| `?` | Show help |
| `q` | Quit |

## WebSocket Status

Source: [packages/sie_server/src/sie_server/api/ws.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/api/ws.py)

The Python `sie-server` process streams real-time status over WebSocket at `/ws/status` and pushes an update every 200 ms. In Kubernetes, the gateway also exposes `/ws/cluster-status` for aggregate cluster status, while routing health comes from SIE server sidecar NATS heartbeats.

### Connection

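```python
import asyncio
import json

import websockets  # third-party: pip install websockets


async def monitor():
    # Each message is one JSON status snapshot from the Python sie-server process.
    async with websockets.connect("ws://localhost:8080/ws/status") as ws:
        async for message in ws:
            status = json.loads(message)
            print(f"Loaded models: {status['loaded_models']}")
            print(f"GPU type: {status['gpu']}")


if __name__ == "__main__":
    asyncio.run(monitor())
```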

### Status Message Format

```json
{
  "timestamp": 1703001234.567,
  "gpu": "l4",
  "loaded_models": ["bge-m3", "e5-base-v2"],
  "server": {
    "version": "0.1.0",
    "uptime_seconds": 3600,
    "user": "sie",
    "working_dir": "/app",
    "pid": 1
  },
  "gpus": [
    {
      "device": "cuda:0",
      "name": "NVIDIA L4",
      "gpu_type": "l4",
      "utilization_pct": 45,
      "memory_used_bytes": 8589934592,
      "memory_total_bytes": 23622320128,
      "memory_threshold_pct": 85
    }
  ],
  "models": [
    {
      "name": "bge-m3",
      "state": "loaded",
      "device": "cuda:0",
      "memory_bytes": 2147483648,
      "queue_depth": 0,
      "queue_pending_items": 0,
      "config": {
        "hf_id": "BAAI/bge-m3",
        "adapter": "bge_m3",
        "inputs": ["text"],
        "outputs": ["dense", "sparse"]
      }
    }
  ],
  "counters": {},
  "histograms": {}
}
```
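A client can derive per-GPU memory pressure from the `gpus` array. The sketch below assumes that `memory_threshold_pct` is a percentage of total GPU memory that can be compared directly against used memory. That interpretation is an assumption, and `gpu_memory_report` is an illustrative helper rather than part of SIE.

```python
def gpu_memory_report(status):
    """Return (device, used_pct, over_threshold) for each GPU in a status message."""
    report = []
    for gpu in status.get("gpus", []):
        total = gpu["memory_total_bytes"]
        used_pct = 100.0 * gpu["memory_used_bytes"] / total if total else 0.0
        # Assumption: the threshold is expressed as a percent of total memory.
        over = used_pct >= gpu.get("memory_threshold_pct", 100)
        report.append((gpu["device"], round(used_pct, 1), over))
    return report
```

For the sample message above, 8 GiB used out of 22 GiB total gives about 36.4%, well below the 85% threshold.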

### Model States

| State | Description |
|-------|-------------|
| `available` | Config loaded, weights not in memory |
| `loading` | Weights currently loading to GPU |
| `loaded` | Ready for inference |
| `unloading` | Weights being evicted from GPU |
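The four states can be modeled as an enumeration when processing `/ws/status` messages. This is a minimal sketch with illustrative names (`ModelState`, `count_states`, `ready_models`). It assumes the `models` entries carry the `name` and `state` fields shown in the status message format and raises `ValueError` on any state outside this table.

```python
from collections import Counter
from enum import Enum


class ModelState(str, Enum):
    AVAILABLE = "available"
    LOADING = "loading"
    LOADED = "loaded"
    UNLOADING = "unloading"


def count_states(status):
    """Count models per state in one status message."""
    counts = Counter(ModelState(m["state"]) for m in status.get("models", []))
    return {state.value: counts.get(state, 0) for state in ModelState}


def ready_models(status):
    """Names of models that can serve inference right now."""
    return sorted(
        m["name"]
        for m in status.get("models", [])
        if m["state"] == ModelState.LOADED.value
    )
```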

## Grafana Dashboards

SIE includes pre-built Grafana dashboards in the Helm chart at [`deploy/helm/sie-cluster/files/dashboards/`](https://github.com/superlinked/sie/tree/main/deploy/helm/sie-cluster/files/dashboards). These are automatically provisioned when deploying with Grafana's sidecar.

Example queries for common panels:

### Request Rate

```txt
sum(rate(sie_requests_total{status="success"}[5m])) by (model)
```

### P99 Latency

```txt
histogram_quantile(0.99,
  sum(rate(sie_request_duration_seconds_bucket{phase="total"}[5m])) by (le, model)
)
```

### GPU Memory Usage

```txt
sum(sie_model_memory_bytes) by (device)
```

### Queue Depth

```txt
sum(sie_queue_depth) by (model)
```

### Batch Efficiency

```txt
histogram_quantile(0.5,
  sum(rate(sie_batch_size_bucket[5m])) by (le, model)
)
```
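The same expressions can be run outside Grafana through the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch that assumes a Prometheus server reachable at a URL you supply, such as the hypothetical `http://prometheus:9090`. The helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(prometheus_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return prometheus_url.rstrip("/") + "/api/v1/query?" + params


def vector_to_dict(payload, label):
    """Map one label's value to the sample value of an instant-vector response."""
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    return {
        series["metric"].get(label, ""): float(series["value"][1])
        for series in payload["data"]["result"]
    }


def query(prometheus_url, promql, label="model"):
    """Run an instant query and return {label value: sample value}."""
    url = build_query_url(prometheus_url, promql)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return vector_to_dict(json.load(resp), label)
```

For example, `query("http://prometheus:9090", 'sum(rate(sie_requests_total{status="success"}[5m])) by (model)')` returns the per-model request rate as a dictionary keyed by model name.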

## Alert Rules

Source: [deploy/helm/sie-cluster/files/alerts/sie-rules.yaml](https://github.com/superlinked/sie/blob/main/deploy/helm/sie-cluster/files/alerts/sie-rules.yaml)

The `sie-cluster` chart can render pre-configured Prometheus alert rules:

| Alert | Severity | Condition | Description |
|-------|----------|-----------|-------------|
| `SIEWorkerDown` | critical | SIE server sidecar scrape target down for 2 min | A SIE server sidecar scrape target is unreachable |
| `SIENoHealthyWorkers` | critical | No SIE server sidecar scrape targets healthy for 1 min | No healthy SIE server sidecar targets are reporting |
| `SIEWorkerHighQueueDepth` | warning | Queue depth > 50 for 5 min | SIE server sidecar queue depth is high; consider scaling up |
| `SIEGPUMemoryHigh` | warning | GPU memory > 90% for 5 min | Risk of OOM, LRU eviction may be insufficient |
| `SIEGPUTemperatureHigh` | warning | GPU temp > 80°C for 5 min | GPU throttling likely, check cooling |
| `SIEGPUECCErrors` | critical | Double-bit ECC errors increase over 1h | Hardware issue likely |
| `SIEGatewayDown` | critical | Gateway scrape target down for 1 min | Traffic cannot be routed |
| `SIEHighErrorRate` | warning | Gateway 5xx rate > 5% for 5 min | Server or model errors spiking |
| `SIEHighLatency` | warning | p95 latency > 5s for 5 min | Request latency is elevated |
| `SIEConfigDown` | critical | Config scrape target down for 2 min | Config writes are blocked; gateways serve cached state |
| `SIEProvisioningStuck` | warning | Pod Pending for 10 min | Check scheduling events and GPU capacity |
| `SIEScaleUpFailed` | warning | FailedScheduling event in 10 min | Likely insufficient GPU capacity |

### Installing Alert Rules

Alert rules are included in the `sie-cluster` chart when kube-prometheus-stack is installed or `alertRules.enabled` is true:

```bash
helm upgrade --install sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --version 0.1.10 \
  -n sie \
  -f helm-values.yaml \
  --set alertRules.enabled=true
```

### Custom Alerts

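Add custom alerts under `rules:` in a rule group of your Prometheus rules file. Give each alert a name that does not collide with the chart's built-in alerts. The example below uses `SIEHighP99Latency` to stay distinct from the built-in p95 `SIEHighLatency` alert:

```yaml
# Alert when P99 latency exceeds 5 seconds
- alert: SIEHighP99Latency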
  expr: |
    histogram_quantile(0.99,
      sum(rate(sie_request_duration_seconds_bucket{phase="total"}[5m])) by (le, model)
    ) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High P99 latency for model {{ $labels.model }}"
```

---

## Logging

Source: [packages/sie_server/src/sie_server/core/logging.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/core/logging.py)

SIE supports both human-readable and structured JSON logging.

### Log Levels

Enable verbose logging with `--verbose` or `-v`:

```bash
sie-server serve --verbose
```

### JSON Logging

Enable JSON format for Loki and other log aggregation systems:

```bash
sie-server serve --json-logs
```

Or via environment variable:

```bash
export SIE_LOG_JSON=true
sie-server serve
```

### JSON Log Format

```json
{
  "timestamp": "2025-12-18T10:30:00.123Z",
  "level": "INFO",
  "logger": "sie_server.api.encode",
  "message": "Inference completed",
  "model": "bge-m3",
  "request_id": "abc123",
  "trace_id": "def456",
  "latency_ms": 45.2,
  "batch_size": 16,
  "gpu_type": "l4"
}
```

### Structured Fields

JSON logs include optional fields when available:

| Field | Description |
|-------|-------------|
| `model` | Model name for the request |
| `request_id` | Unique request identifier |
| `trace_id` | OpenTelemetry trace ID |
| `latency_ms` | Request latency in milliseconds |
| `batch_size` | Number of items in the batch |
| `gpu_type` | Detected GPU type |
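Structured fields make ad hoc filtering straightforward. The sketch below assumes that JSON logging emits one JSON object per line. It skips anything that does not parse, and `slow_requests` is an illustrative helper, not part of SIE.

```python
import json


def slow_requests(lines, threshold_ms=100.0):
    """Yield parsed log records whose latency_ms exceeds the threshold."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip human-readable or truncated lines
        if not isinstance(record, dict):
            continue
        latency = record.get("latency_ms")
        # latency_ms is optional, so records without it are ignored.
        if isinstance(latency, (int, float)) and latency > threshold_ms:
            yield record
```

Passing `sys.stdin` as `lines` lets a script built around this function consume piped server output. Each yielded record keeps the other fields, such as `model` and `request_id`, for correlation.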

## What's Next

- [Scale-from-Zero](/docs/deployment/autoscaling/) - autoscaling lifecycle and troubleshooting
- [Troubleshooting](/docs/reference/troubleshooting/) - common issues and solutions
- [CLI Reference](/docs/reference/cli/) - all server options
- [API Reference](/docs/reference/api/) - endpoint documentation
