---
title: Release Notes
description: SIE release history.
canonical_url: https://superlinked.com/docs/reference/release-notes
last_updated: 2026-06-15
---

{/* AUTOGENERATED by scripts/fetch-release-notes.mjs. Do not edit by hand. */}

Latest version: **v0.6.6** (2026-06-14).

## v0.6.6 (2026-06-14)

### Highlights

- **Reliability and operations:** align pool-scoped bundle hashes; avoid sticky missing bundle hashes; clarify missing profile inheritance; fail closed on missing bundle metadata; stabilize keda all-marker e2e

### Bug Fixes

* **config:** align pool-scoped bundle hashes
* **config:** avoid sticky missing bundle hashes
* **config:** clarify missing profile inheritance
* **config:** fail closed on missing bundle metadata
* **tilt:** stabilize keda all-marker e2e

## v0.6.5 (2026-06-13)

### Highlights

- **New capabilities:** demand-side token-reduction benchmark (Req 12, #1311); add describe_image tool (caption + zero-shot tags) — Req 12 #1310; cap describe_image payload size before cluster calls; claude.ai connector surface — OAuth bridge + skill ZIP (Req 12 #1312); sie_mcp edge with docs_to_markdown tool (Req 12 #1306)
- **Reliability and operations:** harden replace snapshot IPC; return retryable OpenAI provisioning errors; bind OAuth authorization codes to client_id; doctor classifies probe read-timeouts as cold, not unreachable; align GPU memory pressure defaults
- **Performance:** skip tag embedding when top_k &lt;= 0

### Features

* **bench:** demand-side token-reduction benchmark (Req 12, #1311)
* **mcp:** add describe_image tool (caption + zero-shot tags) — Req 12 #1310
* **mcp:** cap describe_image payload size before cluster calls
* **mcp:** claude.ai connector surface — OAuth bridge + skill ZIP (Req 12 #1312)
* **mcp:** sie_mcp edge with docs_to_markdown tool (Req 12 #1306)
* **mcp:** structured extraction + structured generation tools (Req 12 #1308)
* **mcp:** wire measured token-reduction figures into savings metadata
* **tools:** add sie doctor — per-capability cluster diagnostics
* **tools:** Florence-2 fallback for image OCR
* **tools:** sie_tools — Claude Code context-offload client for managed clusters

### Bug Fixes

* align GPU memory pressure defaults
* **config:** detect bundle config hash drift
* **config:** fingerprint model pool ownership
* **config:** harden replace snapshot IPC
* **config:** replace drifted export snapshots
* **deps:** bump sidecar prometheus for protobuf advisory
* **gateway:** address provisioning review feedback
* **gateway:** align provisioning contract docs
* **gateway:** decode native media JSON bytes
* **gateway:** dereference structured output schema refs
* **gateway:** make provisioning non-2xx universally
* **gateway:** preserve ref sibling schema semantics
* **gateway:** return retryable OpenAI provisioning errors
* **mcp:** address review feedback on structured tools
* **mcp:** bind OAuth authorization codes to client_id
* **mcp:** blank-env fallback for model ids; honor SIE_MCP_IMAGE_TOP_K=0
* **mcp:** deep-copy committed token-reduction figures in build_metadata
* **mcp:** validate embedding shapes in _top_k_tags
* **sdk:** normalize score image payloads for wire transport
* **sdk:** normalize score images for wire transport
* **server:** guard readiness for removed configs
* **server:** honor pool-aware model configs
* **server:** render Qwen3-VL reranker document images in user prompt
* **sie-cluster:** add spot toleration to AKS worker pool
* **tools:** address doctor review feedback
* **tools:** doctor classifies probe read-timeouts as cold, not unreachable
* **worker:** keep SGLang loads off event loop

### Performance Improvements

* **mcp:** skip tag embedding when top_k &lt;= 0

## v0.6.4 (2026-06-11)

### Highlights

- **New capabilities:** add `grant_admin_to_creator` (opt-in AAD-RBAC for caller); lock model-cache storage account to cluster VNet by default; install kubelogin and convert kubeconfig after AKS get-credentials; wire Azure provider tooling; add azure (AKS) terraform module; ship values-aks.yaml AKS overlay with the Azure module
- **Reliability and operations:** harden storage_allowed_ip_ranges CIDR validation; harden release guarded merge checks; resubscribe stale NATS health stream; emit `az aks get-credentials --overwrite-existing`; drop unreachable final_registry guard so ACR path can fire

### Features

* **azure-terraform:** add `grant_admin_to_creator` (opt-in AAD-RBAC for caller)
* **azure-terraform:** lock model-cache storage account to cluster VNet by default
* **cluster:** install kubelogin and convert kubeconfig after AKS get-credentials
* **cluster:** wire Azure provider tooling
* **deploy:** add azure (AKS) terraform module
* **helm:** ship values-aks.yaml AKS overlay with the Azure module

### Bug Fixes

* **azure-terraform:** emit `az aks get-credentials --overwrite-existing`
* **azure-terraform:** harden storage_allowed_ip_ranges CIDR validation
* **ci:** harden release guarded merge checks
* **cluster:** address review feedback on Azure provider wiring
* **cluster:** drop unreachable final_registry guard so ACR path can fire
* **cluster:** set TF_VAR_* on Azure destroy path (same as create)
* **deploy:** revert system pool default to Standard_D4s_v3 (zoned everywhere)
* **gateway:** resubscribe stale NATS health stream
* **sidecar:** preserve msgpack work item payloads

## v0.6.3 (2026-06-10)

### Highlights

- **New capabilities:** add azure blob payload store support; add server-side copy fast path for cloud weight sync; informational generation eval CI gate over committed floors; add vision (image) input to generate(); preserve text/image content-part ordering
- **Reliability and operations:** harden cloud cache sync paths; clear HIGH Dependabot alerts (docling, rustls-webpki); ensure cloud weight sync creates local parents; evict stale gateway workers on shutdown; fall back to relay on S3/GCS server-side copy failure
- **Performance:** engage conformant image preprocessing for v1 (1.8x)

### Features

* add azure blob payload store support
* add server-side copy fast path for cloud weight sync
* **bench:** informational generation eval CI gate over committed floors
* **generate:** add vision (image) input to generate()
* **generate:** preserve text/image content-part ordering
* support azure blob cluster cache
* **tester-cluster:** rtx6000 g7e.4xlarge + sglang preload + hf-token wiring

### Bug Fixes

* address azure cache review feedback
* address final cloud storage review issues
* **bench:** harden generation eval gate per review
* **deps:** clear HIGH Dependabot alerts (docling, rustls-webpki)
* ensure cloud weight sync creates local parents
* evict stale gateway workers on shutdown
* fall back to relay on S3/GCS server-side copy failure
* **generate:** address CodeRabbit review on vision input
* **generate:** address huronat review on vision input (F2-F8)
* **generate:** image-free content_parts field must not shadow layout
* **generate:** reject both-present image-bearing content layouts
* harden cloud cache sync paths
* **loadtest-ci:** self-heal orphaned cluster + stale lock in preflight
* **nemo_colembed:** trim left-padding rows from v1 conformant doc embeddings
* normalize local weight sync destination
* **quality-adapter:** gate v1 Vidore3 on English; finalize ?lang= plumbing
* **skill:** add bash language tag to hfCache --set fenced block (MD040)
* **skill:** move inline comments off shell continuation lines so the helm snippet pastes cleanly
* support cloud source weight sync
* **tester-cluster:** update rtx6000-spot machineType doc to g7e.4xlarge to match terraform

### Performance Improvements

* **nemo_colembed:** engage conformant image preprocessing for v1 (1.8x)

## v0.6.2 (2026-06-08)

### Highlights

- **New capabilities:** defer sie-config NATS startup and honor log levels; refresh KEDA Tilt local dev branch; M4 dense encoders — mxbai-embed-large-v1, arctic-embed-l-v2.0, modernbert-embed-base; add daily guarded stable releases
- **Reliability and operations:** accept dense dim in qwen3 vl embedding adapter; preserve model query templates in mteb eval; scale single-profile bundles on gpu-agnostic demand; consolidate runtime ninja install; install ninja in cuda runtime

### Features

* defer sie-config NATS startup and honor log levels
* **dev:** refresh KEDA Tilt local dev branch
* **models:** M4 dense encoders — mxbai-embed-large-v1, arctic-embed-l-v2.0, modernbert-embed-base
* **release:** add daily guarded stable releases

### Bug Fixes

* accept dense dim in qwen3 vl embedding adapter
* **bench:** preserve model query templates in mteb eval
* **dev:** address KEDA Tilt PR review
* **helm:** scale single-profile bundles on gpu-agnostic demand
* **server:** consolidate runtime ninja install
* **server:** install ninja in cuda runtime
* **server:** install ninja in CUDA SGLang runtime
* **terraform:** deny non-HTTPS access on state and quality-eval S3 buckets

## v0.6.1 (2026-06-07)

### Highlights

- **New capabilities:** configure GPU disk sizing and generate smoke; support static queue pools
- **Reliability and operations:** fail fast on invalid static pool config; pin kind smoke workers to default queue pool; canonicalize static queue pool names; stabilize GPU disk Terraform test

### Features

* configure GPU disk sizing and generate smoke
* **gateway:** support static queue pools

### Bug Fixes

* address GPU disk review comments
* **ci:** pin kind smoke workers to default queue pool
* **gateway:** canonicalize static queue pool names
* **gateway:** fail fast on invalid static pool config
* stabilize GPU disk Terraform test

## v0.6.0 (2026-06-07)

### Highlights

- **Breaking change:** Queue work subjects and pool streams use the new sie.work.\{pool\}.\{machine_profile\}.\{bundle\}.\{model\} shape only; legacy subject filters are intentionally not preserved. Workers now subscribe to `sie.work.*.<poolName>` instead of `sie.work.*.default`. Deployed alone (without the matching gateway/sidecar update that publishes/filters on the new subject), this breaks routing on every cluster. To preserve the old shared-queue behavior, set `workers.common.queuePool: "default"` explicitly.
- **New capabilities:** route work by queue pool lanes; default SIE_POOL to pool name (not "default")
- **Reliability and operations:** harden queue lane admission; bump vitest 2.1.9 -&gt; 4.1.0 (CVE-2026-47429); align lane defaults and tilt e2e; preserve worker-group queue defaults

### ⚠ BREAKING CHANGES

* **gateway:** Queue work subjects and pool streams use the new sie.work.\{pool\}.\{machine_profile\}.\{bundle\}.\{model\} shape only; legacy subject filters are intentionally not preserved.
* **helm:** workers will subscribe to `sie.work.*.<poolName>` instead of `sie.work.*.default`. Deployed alone (without the matching gateway/sidecar update that publishes/filters on the new subject) this will break routing on every cluster. To preserve the old shared-queue behavior, set `workers.common.queuePool: "default"` explicitly.

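The `workers.common.queuePool` override named in the Helm breaking change above can be written as a values fragment. This is a minimal sketch: only the key path and the value `"default"` come from the note itself, and the rest of the values file is assumed to stay unchanged.

```yaml
# Helm values override (sketch). The key path is taken from the breaking-change note.
workers:
  common:
    # Keep the pre-v0.6.0 shared-queue behavior instead of per-pool lanes.
    # Assumption: because the key sits under workers.common, it applies to every worker pool.
    queuePool: "default"
```

Leaving the key unset should select the new behavior, where each worker pool subscribes under its own pool name. Per the note, that is only safe once the matching gateway and sidecar versions are deployed.
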
### Features

* **gateway:** route work by queue pool lanes
* **helm:** default SIE_POOL to pool name (not "default")

### Bug Fixes

* **deps:** bump vitest 2.1.9 -&gt; 4.1.0 (CVE-2026-47429)
* **gateway:** harden queue lane admission
* **helm:** align lane defaults and tilt e2e
* **helm:** preserve worker-group queue defaults

## v0.5.0 (2026-06-04)

### Highlights

- **Breaking change:** `workers.pools.<name>.bundle` (string), `workers.pools.<name>.minReplicas`, `workers.pools.<name>.maxReplicas`, `workers.pools.<name>.extraEnv`, and `workers.pools.<name>.imageBundle` are replaced by `workers.pools.<name>.bundles.<bundle>.{minReplicas, maxReplicas, extraEnv, imageBundle, enabled}`. `workers.common.bundle` is removed (no longer consumed). StatefulSet, ScaledObject, PDB, and image-prepull DaemonSet names change from `worker-{pool}` to `worker-{pool}-{bundle}`, so in-place upgrades require deleting the old resources first.
- **New capabilities:** agent-jobs text-gen readiness — code/SQL/tools/guard evals + Qwen3.6-27B + precision routing; transfer sie-cluster claude skill; P(unsafe) logprob threshold for CHECK POLICY precision; split worker pools into pool × bundles schema; surface code/sql/guard capabilities; resolve job aliases in configs/resolve; add sglang worker pool for generative models
- **Reliability and operations:** expose unauthenticated metrics scrape port safely for prom; preserve gateway metrics scrape labels; drop unsupported ebnf advertisement + restore guardian a100 guard threshold; fail-fast on missing Spider DBs + order-sensitive SQL exec accuracy

### ⚠ BREAKING CHANGES

* **helm:** `workers.pools.<name>.bundle` (string), `workers.pools.<name>.minReplicas`, `workers.pools.<name>.maxReplicas`, `workers.pools.<name>.extraEnv`, and `workers.pools.<name>.imageBundle` are replaced by `workers.pools.<name>.bundles.<bundle>.{minReplicas, maxReplicas, extraEnv, imageBundle, enabled}`. `workers.common.bundle` is removed (no longer consumed). StatefulSet, ScaledObject, PDB, and image-prepull DaemonSet names change from `worker-{pool}` to `worker-{pool}-{bundle}`, so in-place upgrades require deleting the old resources first.

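To illustrate the schema change, the fragments below show one pool before and after migration. The pool name `l4`, the bundle name `default`, and the replica counts are hypothetical placeholders; only the key names come from the note above. `extraEnv` and `imageBundle` are omitted because their value shapes are not stated here, but per the note they move under the bundle in the same way.

```yaml
# Before (pre-v0.5.0): one bundle per pool, settings directly on the pool.
workers:
  pools:
    l4:                 # hypothetical pool name
      bundle: default   # hypothetical bundle name (string)
      minReplicas: 0
      maxReplicas: 2
```

```yaml
# After (v0.5.0): settings are keyed by bundle under the pool.
workers:
  pools:
    l4:                 # same hypothetical pool
      bundles:
        default:        # bundle name becomes a map key
          enabled: true
          minReplicas: 0
          maxReplicas: 2
```

With these placeholder names and the naming pattern in the note, the rendered StatefulSet, ScaledObject, PDB, and image-prepull DaemonSet would move from `worker-l4` to `worker-l4-default`. That rename is why the old resources must be deleted before an in-place upgrade.
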
### Features

* agent-jobs text-gen readiness — code/SQL/tools/guard evals + Qwen3.6-27B + precision routing
* **agents:** transfer sie-cluster claude skill
* **guard:** P(unsafe) logprob threshold for CHECK POLICY precision
* **helm:** split worker pools into pool × bundles schema
* **models:** surface code/sql/guard capabilities; resolve job aliases in configs/resolve
* **tester-cluster:** add sglang worker pool for generative models

### Bug Fixes

* **agents:** address sie cluster review comments
* **bench:** fail-fast on missing Spider DBs + order-sensitive SQL exec accuracy
* **gateway:** expose unauthenticated metrics scrape port safely for prom
* **guard:** reject multi-candidate sampling + keep logprobs consistent on rewrite
* **guard:** robust verdict thresholding, logprob hygiene, decoded-token logprobs
* **helm:** fail-fast on missing/invalid bundle replica bounds
* **helm:** preserve gateway metrics scrape labels
* **helm:** use sidecar binary for image pre-pull
* **models:** drop unsupported ebnf advertisement + restore guardian a100 guard threshold
* **sie_server:** honor params.instruction in Florence-2 extract
* **tester-cluster:** cap rtx6000 default bundle to avoid over-subscription
* **tools:** via-SIE EBNF response_format shape + request/preload model split

## v0.4.2 (2026-06-03)

### Highlights

- **New capabilities:** 5-domain generation bench + via-sie quality matrix + gateway schema gaps; add e0-02 all-minilm time-share experiment; land coalesce_ms=5 + max_batch_requests=12 as Rust defaults; add --via-sie smoke path (route through sie_server); add min_tokens + system_prompt + temperature for G4 retry; close Qwen3.6-27B gap — min_tokens=10 + max=768 + ctx=4096
- **Reliability and operations:** set verbose=True on SIEServer so launch errors surface; document worker-sidecar metrics wiring; gate sidecar nats reconnect refresh; harden sidecar config recovery; budget loadtest barrier timeouts
- **Performance:** anchor min_batch_cost floor at max_batch_tokens // 4; tighten adaptive wait ceiling + revert gte-multilingual 32k; rebind vision Conv3d patch-embed to F.linear; raise max_batch_tokens 16k → 32k to stop IPC-batch shred

### Features

* 5-domain generation bench + via-sie quality matrix + gateway schema gaps
* add e0-02 all-minilm time-share experiment
* **batch_config:** land coalesce_ms=5 + max_batch_requests=12 as Rust defaults
* **bench-27b:** add --via-sie smoke path (route through sie_server)
* **bench-27b:** add min_tokens + system_prompt + temperature for G4 retry
* **bench-27b:** close Qwen3.6-27B gap — min_tokens=10 + max=768 + ctx=4096
* **bench-27b:** launch full SIE stack (NATS+worker+gateway) for --via-sie
* **bench+model:** via-sie 4-task n=300 sweep + NEXTN smaller-draft on 27B
* **bench:** 0.6B via-sie validated; harness + 27B config gains
* **bench:** 5-shot CoT for CaseHOLD (item 5 — close 27B target gap)
* **bench:** fix Qwen3-0.6B GPQA (parrot bug) + 27B diagnostics; final matrix
* **bench:** improve perf eval output handling
* **docling:** accept image input + run on OCR-bench quality path
* **gateway+worker:** chat surface accepts min_tokens + chat_template_kwargs
* **gateway:** strengthen generation isolation guardrails
* **latency:** tighten FetchExpiryController defaults to 2/15/50
* **model+bench:** RTX-PRO-6000 FP8 profile for Qwen3.6-27B + 6000 validation
* **model:** bump Qwen3-0.6B serving context 1024→4096 for prod simple-task use
* **models:** add Marqo/marqo-fashionSigLIP (SigLIP open_clip, fashion image-text)
* **ocr:** docling accepts images + quality eval prefers documents
* reconcile live worker config in sidecar
* RTX PRO 6000 FP8 profile for Qwen3.6-27B + SIE-on-6000 generative benchmark matrix
* **scheduler:** load-aware pipeline_depth autotune (S14 follow-up)
* **scheduler:** production-parity defaults + serial pipeline (carveout p99 fix)
* **scheduler:** restore SIE_RUST_PIPELINE_DEPTH=2 default (deep-saturation fix)
* **scheduler:** SIE_PULL_QUANTUM_INCLUDE_QUEUE_MS for Py-main parity
* **scheduler:** SIE_RUST_WAVE_CADENCE env toggle (default on)
* **scheduler:** step adaptive controller once per wave (Python parity)
* **sidecar:** add worker config and pool admission reconciliation
* **sidecar:** wire generation direct dispatch
* **sie_server:** add MinerU2.5-Pro-2604-1.2B doc OCR adapter
* **sie_server:** carve out QueueExecutor + IPC types for Rust worker POC
* **sie_server:** integrate MinerU2.5-Pro-2604-1.2B doc OCR adapter
* **sie_server:** UDS msgpack IPC server for Rust worker sidecar
* **sie_worker_rust:** close parity gaps with Python pull loop + smoke test
* **sie_worker_rust:** scaffold Rust worker sidecar crate (Phase 1c)
* **sie_worker_rust:** wire end-to-end NATS -&gt; IPC -&gt; publish loop (Phase 1d)
* **sie-bench:** synchronize loadtest measurement start
* **worker/rust:** IPC connection pool — lift the sidecar's last serialization bottleneck
* **worker/rust:** narrate the hot path — structured INFO, slow-RPC + heartbeat-streak WARNs, full error chains
* **worker:** introduce InferenceBackend trait + BackendRouter
* **worker:** native Candle BERT backend behind `candle` feature

### Bug Fixes

* accept dense_dim in dense adapters
* **adapters:** replace Qwen3-VL vision Conv3d patch-embed with matmul
* **adapters:** route Qwen3-VL VLMs through flash attention (Vidore3 throughput)
* address pr review quality issues
* **bench-27b:** drop bundle from SIEServer (sie-server rejects bundle+models combo)
* **bench-27b:** set verbose=True on SIEServer so launch errors surface
* **bench-27b:** skip chat_template_kwargs on via-sie (gateway rejects unsupported field)
* **bench-27b:** wait for sie-server /healthz (not /health)
* **bench:** bump casehold/gpqa max_tokens to 2048 (CoT truncation)
* **bench:** let via-SIE smoke serve a profile-variant model end-to-end
* **bench:** resolve CPU deps for quality server
* **catalog:** include eval-matrix tasks so dispatch filter accepts them
* **ci:** address analyzer findings and stale queue test
* **ci:** avoid nested mise in integration fixture
* **ci:** keep sidecar out of warm cache
* **ci:** refresh gateway openapi contract
* correct e0 vm runbook paths
* **deploy:** add sidecar registry resources
* **deploy:** address server sidecar review feedback
* **deploy:** align server sidecar naming
* **deploy:** align server sidecar naming and kind preload smoke
* **deploy:** align tilt sidecar image naming
* **deploy:** document worker-sidecar metrics wiring
* **deploy:** keep sidecar on GHCR by default
* **deploy:** normalize server sidecar naming
* **deploy:** publish server sidecar image
* **deploy:** rename sidecar container to worker-sidecar
* **deploy:** wire SIE server sidecar for kind smoke
* **deploy:** wire worker sidecar image across kind and cloud
* gate sidecar nats reconnect refresh
* **gateway+server:** queue is the only mode — kill direct-mode cruft
* **gateway:** suppress H9 first-chunk-fallback on single-worker pools
* harden sidecar config recovery
* **impact-map:** keep profiles distinct when adapter_options differ
* keep generation machinery off default queue path
* **loader:** wire profile runtime.default_sampling into the adapter
* **modal:** report actual GPU on remote, not stale env-default
* **model:** bump Qwen3.6-27B default/h100 mem_fraction_static 0.85 → 0.92
* **orchestrator:** thread CLI -p profile through to client.extract
* preserve worker batch identity and publish image
* **product:** update design audit for topical docs
* **quality_eval:** take results-bearing JSON envelope in load_eval_json
* **quality:** batch3 of CodeQL findings + bench KIE bug
* **quality:** batch3 of CodeQL findings + bench KIE root-cause
* **quality:** batches 1+2 of CodeQL quality findings
* **quality:** close CodeQL quality-tab findings
* **quality:** drop redundant inline imports in donut + registry
* **quality:** repair adapter eval harness regressions
* remove e0 preflight httpx dependency
* require rust sidecar for queue workers
* **review:** 0.6B ctx test 1024-&gt;4096, loader except logs, README gaps resolved
* **review:** recompute 27B target delta_vs_baseline for the 2048 scores
* run directory creation
* run e0 vm scripts via uv
* **scheduler:** autotune signal — observed_p50/target_p50 ratio
* scope bundle config hash cache per registry
* **security:** bump astro to ^6.4.2 for website
* **security:** bump gateway deps to patched versions
* **security:** bump product/gtm Python lockfiles
* **security:** bump product/gtm/content/slides npm transitives
* **security:** bump root pnpm deps + add overrides for transitives
* **security:** bump root Python deps to patched versions
* **security:** bump sie_dashboard npm deps to patched versions
* **security:** bump sie_ts_sdk standalone pnpm transitives
* **security:** bump sst to ^4 to drop vulnerable aws-sdk v2
* **security:** cap vite at ^6 + add Node engines to website
* **security:** close ~190 Dependabot alerts across 9 manifests
* **security:** sanitize one-pager template with DOMPurify
* **security:** use Reflect.construct for WebSocket headers shim
* **sie_bench:** send SIE profile via X-SIE-MACHINE-PROFILE header
* **sie_bench:** use rapidfuzz for OmniDocBench edit distance
* **sie_server:** clear CUDA cache on uncovered VLM paths + drop private sem _value access
* **sie_server:** VLM cache clears on uncovered paths + drop private sem _value access
* **sie-bench:** budget loadtest barrier timeouts
* slow sidecar nats consumer reconcile
* **smoke:** launch sie_server worker with -b sglang, not -m &lt;model&gt;
* **smoke:** preload the target model in via-sie worker
* **test:** restore donut helper call contract
* **worker-sidecar:** harden queue carveout contracts
* **worker/rust:** one long-lived pull stream — kill 30s ack_wait stall
* **worker/rust:** re-copy src after cargo chef cook so real build isn't a stub
* **worker/rust:** set CUDA_COMPUTE_CAP at build time (default 89, L4)
* **worker/rust:** stop shipping the cargo-chef stub binary as the real build
* **worker:** harden Candle backend + align dispatcher error contract
* **worker:** harden payload store + error paths; surface silent success bugs
* **worker:** SGLang adapter accepts min_new_tokens kwarg + 27B via-sie validated

### Performance Improvements

* **adaptive:** anchor min_batch_cost floor at max_batch_tokens // 4
* **batching:** tighten adaptive wait ceiling + revert gte-multilingual 32k
* **glm_ocr:** rebind vision Conv3d patch-embed to F.linear
* **gte-multilingual-base:** raise max_batch_tokens 16k → 32k to stop IPC-batch shred
* **mineru_vl:** O(L) incremental no-repeat-ngram for greedy decode
* **ocr:** swap pure-Python Levenshtein DP for rapidfuzz
* **rope_flash:** vectorize CLS/mean pooling, eliminate per-item .item() sync
* **server:** FP16 on GPU, coalesce sized for IPC bursts, starvation self-heal

### Reverts

* restore adaptive batching defaults to 15/50ms
* **scheduler:** drop depth autotune (signal didn't pan out in S17)

## v0.4.1 (2026-05-28)

### Highlights

- **New capabilities:** add Qwen3.6-27B model + migrate to CUDA 12.9
- **Reliability and operations:** isolate generation direct dispatch from shared queues; resolve 18 open CodeQL alerts; use SHA256 (not SHA1) for actor_id log tag; colocate tests under infra/, update sync contract

### Features

* **server:** add Qwen3.6-27B model + migrate to CUDA 12.9

### Bug Fixes

* isolate generation direct dispatch from shared queues
* **security:** resolve 18 open CodeQL alerts
* **security:** use SHA256 (not SHA1) for actor_id log tag
* **terraform-sync:** colocate tests under infra/, update sync contract

### Reverts

* **security:** drop advanced CodeQL setup

## v0.4.0 (2026-05-27)

### Highlights

- **Breaking change:** fail-closed authentication (default-deny)
- **New capabilities:** generation quality-gate scoring core (roadmap §5, trust-critical); generation-quality regression gate over the existing scorers; add cohere measurements for us-east-1; add openai measurements for us-east-1; add voyage measurements for us-east-1; regex/EBNF response_format + developer role (roadmap 1.7)
- **Reliability and operations:** forward provision_timeout_s in SIEImageTextWrapper.encode; raise image-task eval timeouts to fix Flickr30k nightly; exclude favicon + OG image from auth middleware; keep public surfaces vague about what's behind auth; revert NextAuth function-form, use try/catch on Resource
- **Performance:** warm one Lambda, bump timeout, narrow S3 verdict fetch

### ⚠ BREAKING CHANGES

* **gateway:** fail-closed authentication (default-deny)

### Features

* **bench:** generation quality-gate scoring core (roadmap §5, trust-critical)
* **bench:** generation-quality regression gate over the existing scorers
* **benchmarks:** add cohere measurements for us-east-1
* **benchmarks:** add openai measurements for us-east-1
* **benchmarks:** add voyage measurements for us-east-1
* **chat:** regex/EBNF response_format + developer role (roadmap 1.7)
* **dashboard:** add executive quality summary widget on landing
* **dashboard:** add hover-tooltip on 'to verify' explaining WARN
* **dashboard:** brand alignment foundation - palette, fonts, header
* **dashboard:** brand favicon, opengraph image, light-mode hover fix
* **dashboard:** brand foundation - palette, fonts, header logo
* **dashboard:** brand surfaces - retokenise landing page + widget
* **dashboard:** public preview shell on / so Slack unfurls work
* **dashboard:** retokenise badges - brand-elevated neutrals, DM Mono labels
* **dashboard:** retokenise landing surfaces onto brand palette
* **dashboard:** retokenise loadtest pages onto brand surfaces
* **dashboard:** retokenise quality pages onto brand surfaces
* **dashboard:** retokenise shared components and sign-in onto brand palette
* **dashboard:** treat WARN as passing-with-verify, shade gauge amber
* **gateway,sdk,server:** add native generate endpoint with improved admission control and validation
* **gateway,sdk,server:** add OpenAI-compatible chat completions with streaming and sampling extensions
* **gateway,server:** add multi-turn tool-use support with OpenAI-compatible message format
* **gateway:** /v1/completions (legacy OpenAI Completions, raw-prompt)
* **gateway:** /v1/completions streaming (text_completion SSE)
* **gateway:** /v1/generate accepts seed/logprobs/logit_bias/n/best_of/lora_adapter (M8)
* **gateway:** /v1/responses (OpenAI Responses API, MVP)
* **gateway:** /v1/responses structured array input (conversation history)
* **gateway+worker:** per-choice OpenAI streaming for n&gt;1 (H4, H5, M4)
* **gateway:** accept OpenAI multimodal content-parts; reject images (no VL model)
* **gateway:** add routing salt + byte-preserving key mode (M11)
* **gateway:** advertise lora_adapters on /v1/models + pre-validate unknown names
* **gateway:** fail-closed authentication (default-deny)
* **gateway:** meaningful system_fingerprint on chat responses (roadmap 1.3/§5)
* **gateway:** refactor streaming and routing with improved error handling and metrics
* **gateway:** register /v1/moderations as explicit 501 (roadmap 1.8, phase 3)
* **gateway:** serve a rendered API reference at /docs (Redoc)
* **gateway:** unify /v1/embeddings on the OpenAI error envelope (roadmap 1.4)
* **generation:** best_of — over-generate + rank by logprob, return top n
* **generation:** complete M4 req2 generation primitive with streaming, structured outputs, and routing
* **generation:** multi-candidate n&gt;1 (non-streaming) end-to-end (roadmap 1.5)
* **generation:** multi-LoRA serving (one base, N adapters, per-request) (roadmap 6.2)
* **generation:** ship generate() primitive — Qwen3.5-4B + NEXTN/MTP + xgrammar, adapter perf at parity with raw SGLang
* **generation:** streaming n&gt;1 — per-candidate SSE interleave
* **helm/sie-cluster:** bundle cert-manager + trust-manager (opt-in) with self-signed TLS mode
* **openapi:** add tool_calls support to chat completion schema
* **python-sdk:** expose typed params for chat n/logprobs/lora_adapter/etc (M7)
* **quality-eval:** add heartbeat logging and improve long-running process observability
* **routing:** cache-aware (prefix-hash) routing (roadmap §6.3)
* **sie_bench:** add Cohere as a first-class eval source
* **sie_bench:** add Cohere multimodal embeddings
* **sie_bench:** add Cohere rerank backend for native MTEB rerank tasks
* **sie_bench:** add OpenAI Embeddings as a first-class eval source
* **sie_bench:** add Voyage provider source plumbing
* **sie_bench:** implement Voyage text embedding runner
* **sie_dashboard:** add /quality/compare to diff two quality runs
* **terraform:** add default_tags Project=sie/Cluster on all AWS providers
* **terraform:** add on-demand RTX 6000 baseline pool to tester-cluster
* **terraform:** add uniform project=sie label across all GCP clusters
* **terraform:** idle-stop and on-demand wake for quality-eval runner fleet
* **terraform:** scale quality-eval fleet to 5+5, smart-wake, 4h timeout
* **terraform:** wake-runners retries + watchdog queued-jobs backstop
* **tester-cluster:** add on-demand L4 worker pool + capacityType node pins
* **ts-sdk:** handle 202 provisioning in chatCompletions + expose missing fields (H1+M6)
* **worker:** SGLang owns grammar; worker preflight opt-in only (H8, ADR-0002)
* **worker:** wire mixed-pool fairness scheduler into the pull-loop (opt-in)
* **worker:** WorkClassScheduler core for mixed-pool fairness (roadmap §6.1)

### Bug Fixes

* **bench:** declare olmocr[bench] dep for OCR-bench quality eval
* **bench:** drop --with-deps from playwright install (no sudo on g7e)
* **bench:** forward provision_timeout_s in SIEImageTextWrapper.encode
* **bench:** install playwright chromium for OCR-bench KaTeX rendering
* **bench:** playwright install --with-deps for OCR-bench chromium
* **bench:** raise image-task eval timeouts to fix Flickr30k nightly
* **bench:** respect similarity() inputs in ColBERT/ColPali wrappers
* **bench:** run olmocr.bench.tests off orchestrator's asyncio loop
* centralize worker_id subject normalization (M5)
* **chart:** pin image-prepull DaemonSets to GPU nodes
* **ci:** copy assets/ into the gateway Docker build (redoc bundle)
* **dashboard:** close remaining 'SIE Dashboard' leaks in &lt;title&gt; and og:image:alt
* **dashboard:** drop edge runtime on opengraph-image for OpenNext
* **dashboard:** exclude favicon + OG image from auth middleware
* **dashboard:** keep public surfaces vague about what's behind auth
* **dashboard:** pick healthy daily via coverage + health gates
* **dashboard:** require &gt;=50 pairs on main-run fallback in nightly picker
* **dashboard:** revert NextAuth function-form, use try/catch on Resource
* **dashboard:** short tab title for authed users, neutral for unauth
* **dev:** set explicit auth opt-in for local gateway launchers (post fail-closed)
* **gateway,sdk,server:** add wire-level validation, improve resource cleanup, and enhance observability across request lifecycle
* **gateway,sdk,server:** prevent metric cardinality DoS and fix non-idempotent retry logic
* **gateway,sdk,server:** strengthen validation and eliminate silent failures across request lifecycle
* **gateway,sdk,server:** validate numeric fields and improve error handling
* **gateway:** add NATS config trusted producers helm override
* **gateway:** document ModelCapabilities in OpenAPI + refresh on profile delta-update
* **gateway:** generation timeouts bypass legacy request-timeout ceiling (H7)
* **gateway:** scope LoRA adapter capabilities per profile (M10)
* **gateway:** strict allow-list + 400 contract on /v1/completions (H3)
* **gateway:** strict allow-list + 400 contract on /v1/responses (H2)
* **gateway:** tighten chat sampler/token-cap + tool-history validation (M1, M13)
* **gateway:** trust chart-rendered sie-config pod name for NATS deltas
* **generation:** cancel tombstone prevents first-chunk fallback double-execution (H9)
* **generation:** LoRA lora_path is a top-level /generate field, not a sampling param
* **generation:** tighten lossy tool-control flags (M14)
* **grammar:** resolve tokenizer adapter for Outlines processor factories and remove anchors from regex patterns
* **helm/sie-cluster:** guard validateTls probe against deployments with nil labels
* **helm/sie-cluster:** guard validateTls probe against nil deployment labels
* **helm/sie-cluster:** include cert-manager mode in presence-check gate
* **helm/sie-cluster:** label-based cert-manager detection + bidirectional runtime check
* **helm/sie-cluster:** make self-signed root-CA namespace configurable
* **helm/sie-cluster:** one-step bundled cert-manager install with self-signed TLS
* **helm/sie-cluster:** probe cert-manager controllers cluster-wide
* **helm/sie-cluster:** regenerate Chart.lock with synced digest
* **helm:** trim and drop empty entries in ingress.hosts
* **modal:** exclude cargo target/ from sandbox image mount
* **pytorch_embedding:** accept and forward revision kwarg
* **quality_eval:** tolerate stdout noise around eval JSON envelope
* **quality-eval:** handle paginated jobs API and align last two filters
* **release-docker,warm-cache:** address review findings
* **server,gateway:** add GPU-aware health probes to detect and recover from wedged CUDA contexts
* **server,gateway:** GPU-aware health probes to detect & recover from wedged CUDA contexts
* **sie_bench:** register OPENAI_SOURCE so --save-targets openai actually saves
* **sie_dashboard:** address compare-page review nits
* **sie_dashboard:** label compare log links with run IDs
* **sie_server:** base64-decode JSON image inputs
* **sie_server:** enforce media bytes contract at every consumer
* **sie_server:** install cv2 system libs for docling extract
* **streaming:** no-silent-drop on chunk-queue backpressure (H6)
* **terraform:** detect in-flight workflow runs via explicit status query
* **terraform:** drop redundant Project overrides in quality-eval-l4
* **terraform:** drop watchdog_idle_minutes default to 5, ignore PIP drift
* **terraform:** grant ec2:DescribeInstanceStatus to wake role
* **terraform:** per-runner idle-stop via GitHub Actions runners API
* **terraform:** require positive activity observation before idle-stop
* **terraform:** scope quality-eval IAM on Role tag instead of Project
* **terraform:** seed GPU node-group desired_size from min_size
* **terraform:** watchdog backstop covers queued-status runs
* **terraform:** watchdog grants HANG_MINUTES grace from LaunchTime
* **terraform:** watchdog ignores LastBusyAt older than LaunchTime
* **tester-cluster:** pin worker pool nodeSelectors to gpu-type as well

### Performance Improvements

* **dashboard:** warm one Lambda, bump timeout, narrow S3 verdict fetch

### Reverts

* **release-docker:** drop matrix consolidation, keep deps push retry

## v0.3.4 (2026-05-14)

### Highlights

- **New capabilities:** default payload store to model-cache bucket /payloads; typed InputTooLongError for extract 400 INPUT_TOO_LONG
- **Reliability and operations:** bump dev-g6-spot to g6.2xlarge so default worker pool fits; default workers to shared queue pool; pin opencv-python-headless to drop X11 runtime deps; tolerate config conflicts in bootstrap, gate on sie-config ready

### Features

* **infra:** default payload store to model-cache bucket /payloads
* **sdks:** typed InputTooLongError for extract 400 INPUT_TOO_LONG

### Bug Fixes

* **aws-example:** bump dev-g6-spot to g6.2xlarge
* **aws-example:** bump dev-g6-spot to g6.2xlarge so default worker pool fits
* **chart:** default workers to shared queue pool
* **cluster.py,aws.py:** address review suggestions 1 & 2
* **deps:** pin opencv-python-headless to drop X11 runtime deps
* **gateway:** tolerate config conflicts in bootstrap, gate on sie-config
* **gateway:** tolerate config conflicts in bootstrap, gate on sie-config ready
* **sdk:** widen sie-sdk requires-python to &gt;=3.12
* **terraform-aws:** default ECR creation off, prefix repo names with project_name
* **terraform-aws:** trim slashes from ecr_repository_prefix
* **terraform-google-sie:** wait for identity pool before binding WI
* **terraform:** relax required_version from ~&gt; 1.14.3 to &gt;= 1.14

## v0.3.3 (2026-05-13)

### Highlights

- **New capabilities:** add ColQwen3 + Nemotron ColEmbed v2 visual doc retrieval; add text classification task support; add post-download load timeout with stall-based download bounds; add scope-able workflow_dispatch with model/profile/task filters; add new INPUT_TOO_LONG ErrorCode; enforce overflow_policy in gliclass adapter
- **Reliability and operations:** surface empty matrix and add measurement-mode for unbaselined adapters; emit task_class in quality-adapter JSON output; score detection eval predictions from result["objects"]; annotate empty-diff path that bypasses impact_map; capture real exit code from impact_map in resolve-impact.sh

### Features

* **adapters:** add ColQwen3 + Nemotron ColEmbed v2 visual doc retrieval
* **extraction:** add text classification task support
* **model-loader:** add post-download load timeout with stall-based download bounds
* **quality-adapter:** add scope-able workflow_dispatch with model/profile/task filters
* **server:** add new INPUT_TOO_LONG ErrorCode
* **server:** enforce overflow_policy in gliclass adapter
* **server:** route INPUT_TOO_LONG to HTTP 400 in extract API
* **server:** validate overflow_policy in resolve_runtime_options

### Bug Fixes

* **bench:** emit task_class in quality-adapter JSON output
* **bench:** score detection eval predictions from result["objects"]
* **ci:** annotate empty-diff path that bypasses impact_map
* **ci:** capture real exit code from impact_map in resolve-impact.sh
* **ci:** collapse adapter-equivalent profiles in quality-adapter matrix
* **ci:** pin mise to 2026.5.5 in loadtest workflows
* **ci:** surface empty matrix and add measurement-mode for unbaselined adapters
* **docker:** install libspatialindex-c6 in worker images
* **gliclass:** catch IndexError empty-tensor crash as InputTooLongError
* **gliclass:** raise InputTooLongError from argmax-empty backstop
* **probe-chart:** make 3-OLD/2-NEW sample asymmetry explicit in title
* **quality-adapter:** namespace quality_eval tests to avoid conftest collision
* **quality-adapter:** split adapter_paths on commas before --changed-dirs
* **quality-adapter:** split Pair column so pair_key | stops breaking the table
* **quality-adapter:** split Pair column to stop pair_key | breaking the table

## v0.3.2 (2026-05-08)

### Highlights

- **New capabilities:** default score_pairs() in BaseAdapter + baseline reranking targets; bump cold-start schema to v6 with deserialize/warmup split; per-model perf concurrency defaults for OCR adapters; adapter-triggered quality eval on persistent L4 runner; make destroy conditional on workflow_dispatch input; nightly loadtest pipeline + baseline recorder
- **Reliability and operations:** raise loadtest job timeout to GH Actions ceiling (360 min / 6h); clarify experimental NATS health mode; lang tag on fenced block; fail fast on invalid scenarios; surface error/no_results rows in MD; harden parse_label against unexpected filenames; adapt OCR perf Item shape to model.inputs and fail loudly on errors

### Features

* **adapters:** default score_pairs() in BaseAdapter + baseline reranking targets
* **bench,charts:** bump cold-start schema to v6 with deserialize/warmup split
* **bench:** per-model perf concurrency defaults for OCR adapters
* **ci:** adapter-triggered quality eval on persistent L4 runner
* **ci:** make destroy conditional on workflow_dispatch input
* **ci:** nightly loadtest pipeline + baseline recorder
* **colbert:** add score_pairs support and expand model coverage
* **dashboard:** add status and kind filters to runs list
* **dashboard:** introduce run-group concept (run = 3 scenarios)
* **dashboard:** loadtest results dashboard (Next.js + SST + DynamoDB)
* **dashboard:** render every metric in the perf-lab archive
* **dashboard:** scaffold loadtest dashboard (Next.js + SST)
* **dashboard:** status and kind filters on runs list
* **dashboard:** track run_status; gh-API one-time backfill
* **docling:** add ocr profile defaulting do_ocr=true
* **gateway:** expose OpenAPI contract
* **gateway:** unify API errors and align probe contracts
* **helm:** expose probes value trees for worker/gateway/config
* **helm:** tighten startup/readiness probes for faster pod-ready
* **helm:** TLS termination via cert-manager + BYO matrix docs
* **helm:** wire probe templates to values trees
* **infra:** opt-in S3 cluster model cache
* **ltfr:** cache-vs-no-cache compare chart, with-cache run data, and 8 single-mode chart refresh
* **matrix:** add task_class stamping to eval measurements
* **server,bench:** split deserialize/warmup in cold-start instrumentation (v6)
* **server:** cap torch CPU threads at worker startup
* **sie_server:** per-stage timing markers in lifespan for engine_boot attribution
* **sie_server:** split adapter.warmup() out of load() with cold-start log markers
* **tools:** bump cold-start bench to v5 with scenario flag
* **tools:** LTFR per-scenario bench tooling + results (issue #652)
* **tools:** ltfr-bench orchestrator (issue #652)

### Bug Fixes

* **bench:** adapt OCR perf Item shape to model.inputs and fail loudly on errors
* **bench:** address CodeRabbit review on PR #779
* **bench:** correctly detect v6 split presence in flattened runs[]
* **bench:** derive emitted gpu_load_s from v6 deserialize+warmup split when available
* **chart:** pass --cluster-cache to sie-server and correct populate command in docs
* **charts:** vertical legend so 'image pull + container init' and 'node prov' aren't clipped
* **ci+terraform:** three deterministic root causes for loadtest pipeline
* **ci:** address CodeRabbit findings on quality-adapter PR
* **ci:** address CodeRabbit's second-pass review on quality-adapter
* **ci:** address CodeRabbit's third-pass review on quality-adapter
* **ci:** auto-clear stale terraform state lock from prior runner crashes
* **ci:** drop double cuda12 suffix + force codebuild for missing images
* **ci:** forensic dump on argo failure + LB-ENI release before destroy
* **ci:** gate stale-lock clear behind force_unlock input + pass --aws-region to destroy
* **ci:** override registry/gpu-selector/tolerations + Python heredoc
* **ci:** parse markdown bench output → result.json synthesis
* **ci:** pass WORKFLOW env to run_scenarios.sh in loadtest.yml
* **ci:** preflight env-var check in run_scenarios + finalize scripts
* **ci:** provision GH PAT secret + in-cluster github-token before bootstrap
* **ci:** raise loadtest job timeout to GH Actions ceiling (360 min / 6h)
* **ci:** read bench-config from local clone, not raw.githubusercontent.com
* **ci:** right-size bench pod + worker pod resources for cluster shape
* **cluster:** move orphan-LB sweep into cmd_destroy, drop parallel script
* **cluster:** use project_name (not example name) for orphan-LB VPC tag
* **cluster:** use project_name for orphan-LB VPC tag lookup
* **dashboard,ci:** keep run_status consistent between S3 and DynamoDB
* **dashboard,ci:** wire real Prometheus matrix shape + extend headlines
* **dashboard:** drop time-based legacy run grouping (was unsafe)
* **dashboard:** GPU util shown as 0-100 (was being multiplied by 100 again)
* **dashboard:** include duration_seconds in run-meta.json (was DynamoDB-only)
* **dashboard:** normalize array-shaped searchParams before .trim()
* **deps:** bump plotly to &gt;=6.1.1 for kaleido compat
* **docker:** stub bundles/ and models/ in deps stage
* **docling:** cache DocumentConverter per (device, ocr_enabled)
* **docling:** mark adapter unloaded in unload()
* **docling:** thread device through PdfPipelineOptions accelerator_options
* **gateway:** address latest coderabbit contract notes
* **gateway:** address PR review for NATS health mode
* **gateway:** address probe and SDK review findings
* **gateway:** align CreatePoolRequest OpenAPI with runtime validation
* **gateway:** clarify experimental NATS health mode
* **gateway:** close remaining review contract gaps
* **gateway:** preserve embeddings timing headers
* **gateway:** preserve scale-from-zero request path
* **gateway:** reject unsupported embeddings token arrays
* **helm:** validate ACME server and privateKeySecretRef in validateTls
* **infra:** grant kms:Decrypt to workers when model cache uses SSE-KMS
* **infra:** normalize whitespace-only model cache string inputs
* **infra:** treat empty model_cache_kms_key_id as unset
* **infra:** use flat lifecycle key for s3-bucket module v5
* **loadtest-ci:** force-delete orphan elbv2 LBs before terraform destroy
* **loadtest-ci:** force-delete orphan LBs and stop swallowing destroy failures
* **loadtest-ci:** poll workflow phase instead of argo submit --wait
* **loadtest-ci:** poll workflow phase instead of relying on argo --wait
* **ltfr-bench,notes:** lang tag on fenced block; fail fast on invalid scenarios; surface error/no_results rows in MD
* **ltfr-bench:** hoist imports to top; guard payload.results shape
* **ltfr-bench:** mark request_failed rows in scenario-a/b MD tables
* **ltfr-bench:** preserve failure context in aggregated rows; add request_failed status
* **ltfr-bench:** treat no_results cells as failures in exit code
* **ltfr-charts:** strip legend clip-path so labels render full width
* **ltfr:** tighten UID/timestamp guards in capture_image_pull_events
* **multi_pod_cold_start:** raise on ASG terminate fail; UID-filter pull events; isolate scenario-c pod
* **paddleocr_vl:** pass use_cache=True to generate
* **paddleocr_vl:** pass use_cache=True to generate to enable KV-cache
* **review:** tighten score_pairs options handling and query text validation
* **sie-server:** include model and bundle directories in wheel distribution
* **terraform:** detect HF-cache EBS by NVMe model + size, not by Linux name
* **terraform:** set resolve_conflicts_on_* = OVERWRITE on EKS addons
* **tools:** drop module-level docstrings (AGENTS.md rule)
* **tools:** guard fig_per_cell_table aggregation against empty results
* **tools:** guard mean() against empty engine_boot_s in aggregate()
* **tools:** harden parse_label against unexpected filenames
* **tools:** mark cold-start-bench.py executable (EXE001)
* **tools:** remove module docstring from cold_start_charts.py (repo rule)

## v0.3.1 (2026-04-29)

### Highlights

- **New capabilities:** add OmniDocBench OCR quality loader; support /v1/score with dense/sparse/colbert/hybrid modes; add Marqo/marqo-ecommerce-embeddings-B via open_clip backend
- **Reliability and operations:** add terminal failed state to model registry (sie-test#85)

### Features

* **bench:** add OmniDocBench OCR quality loader
* **bge-m3:** support /v1/score with dense/sparse/colbert/hybrid modes
* **siglip:** add Marqo/marqo-ecommerce-embeddings-B via open_clip backend

### Bug Fixes

* **server:** add terminal failed state to model registry (sie-test#85)

## v0.3.0 (2026-04-29)

### Highlights

- **Breaking change:** openapi.json is now a committed artifact that must be regenerated and committed when API changes are made
- **New capabilities:** add GLM-OCR adapter; add Qwen3-VL-Embedding-2B and Qwen3-VL-Reranker-2B multimodal adapters; add GLiNER2 and GLiNER-bi adapters; add Qwen3-Reranker-0.6B and 4B causal LM reranker support; add SigLIP 2 base-patch16-224 vision-language encoder; add minimal cache weights snapshot command for offline deployments
- **Reliability and operations:** retry only transient connection errors under wait_for_capacity; surface unrouteable models loudly and helm-repo-add on pristine hosts; emit identical NATS payload to bundle and _all subjects; surface mixed-profile unrouteable models and keep snapshot consistent on writes; add retry logic for deadsnakes PPA to handle Launchpad outages
- **Performance:** cache JPEG-encoded corpus images across queries; lazily JPEG-encode corpus images on first use; cache SDK version parse, integer audit latency, UUIDv7; cut hot-path allocations, fuse numpy decode, tighten backpressure

### ⚠ BREAKING CHANGES

* **openapi:** openapi.json is now a committed artifact that must be regenerated and committed when API changes are made
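
One way to catch a stale committed spec is to compare it against a freshly generated one in CI. The sketch below is an illustration only: `spec_drift` is a hypothetical helper, and how the fresh spec is generated is not described in these notes, so both documents are passed in as text.

```python
import json

_MISSING = object()


def spec_drift(committed_text: str, generated_text: str) -> list[str]:
    """Return the top-level keys whose content differs between two OpenAPI documents.

    Hypothetical helper: both arguments are raw JSON text, for example the
    committed openapi.json and the output of whatever export step the repo uses.
    """
    committed = json.loads(committed_text)
    generated = json.loads(generated_text)
    drift = []
    for key in sorted(set(committed) | set(generated)):
        # Compare parsed values so key order and whitespace do not count as drift.
        if committed.get(key, _MISSING) != generated.get(key, _MISSING):
            drift.append(key)
    return drift
```

A CI step could fail the build whenever `spec_drift(...)` returns a non-empty list, which signals that the spec needs to be regenerated and committed.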

### Features

* **adapters:** add GLM-OCR adapter
* **adapters:** add Qwen3-VL-Embedding-2B and Qwen3-VL-Reranker-2B multimodal adapters
* add GLiNER2 and GLiNER-bi adapters
* add Qwen3-Reranker-0.6B and 4B causal LM reranker support
* add SigLIP 2 base-patch16-224 vision-language encoder
* **admin:** add minimal cache weights snapshot command for offline deployments
* **bench:** honor SIE_BENCH_SERVER_READY_TIMEOUT in eval orchestrator
* **ci:** nightly loadtest gate against dedicated EKS cluster
* **ci:** nightly loadtest gate, ephemeral cluster per run
* **extract:** add Docling adapter for PDF/DOCX/HTML extraction
* **extract:** add Docling adapter for PDF/DOCX/HTML parsing
* **extract:** plumb document items and structured `data` results
* **observability:** add Prometheus metrics to sie-config and expand sie-gateway coverage
* **observability:** Prometheus metrics for sie-config and sie-gateway
* **oom:** implement defensive exception fan-out and improve recovery metrics
* **oom:** improve error semantics and budget exhaustion detection
* **openapi:** add static spec export and validation
* **router:** import Rust gateway source tree
* **server:** add reactive OOM recovery and proactive idle eviction
* **social:** daily social content pipeline with 5-source drafts + engagement
* **types:** add `document` input modality across SDKs, server, and metadata

### Bug Fixes

* **adapters:** add input validation guards for empty/failed visual inputs
* **adapters:** address review findings for Qwen3-VL adapters
* **adapters:** clarify video placeholder, validate token IDs, fix torch_dtype key
* add client-side hour filter to search_x_posts (was date-level only)
* address CodeRabbit review feedback
* address follow-up PR review nits
* address remaining CodeRabbit feedback (round 2)
* address review findings — negative truncation guard, score() options, constant dedup
* **bench:** show correct unit labels for MP/s throughput in --print-gap
* **bundles:** declare Qwen3-VL adapters in default bundle
* **ci:** use Blacksmith runner in CI
* **client:** retry only transient connection errors under wait_for_capacity
* **cluster:** address PR #701 review comments
* **cluster:** correct kubectl flag combo and reorder LB sweep before helm uninstall
* **cluster:** helm uninstall before terraform destroy to clean up AWS LB leftovers
* **cluster:** unblock end-to-end `mise run cluster create --build`
* **config,cluster:** surface unrouteable models loudly and helm-repo-add on pristine hosts
* **config:** emit identical NATS payload to bundle and _all subjects
* **config:** surface mixed-profile unrouteable models and keep snapshot consistent on writes
* **docker:** add retry logic for deadsnakes PPA to handle Launchpad outages
* **docker:** propagate failure when all add-apt-repository retries exhausted
* **docling:** per-task converter, hf_revision guard, callable typing (CodeRabbit)
* **docs:** Update packages/sie_server/Dockerfile.cuda11
* fail closed on missing/unparsable timestamps in lookback filter
* **gateway,config,sdk:** resiliency, concurrency, and cross-service hash parity
* **gateway,config:** address PR review -- 404 for unknown models, 202 on default routing, full YAML propagation
* **gateway,config:** harden auth, trusted NATS producers, and recovery path; drop gateway HA default
* **gateway,sdk:** map upstream timeouts to 503+MODEL_LOADING for SDK retry
* **gateway:** add GET /v1/models/\{model\} detail route
* **gateway:** address PR #716 review feedback
* **gateway:** align /v1/models error and list shapes
* **gateway:** drop double-counted REQUEST_COUNT / REQUEST_LATENCY emit
* **gateway:** emit X-SIE-Error-Code header on model-loading 503
* **gateway:** keep record_request async to match main's call shape
* **gateway:** make sie-config single source of truth for bundles with live resync
* **gateway:** normalize model ids in NATS work subjects + docs/tooling/ha cleanup
* **gateway:** pre-instantiate request/demand metric families on startup
* **gateway:** prioritize epoch-rewind branch; harden no-thrash test; correct arch-guide on ephemeral restart
* guard score() and score_pairs() against empty input lists
* **helm:** default clusterRouting to "queue" on import-sie-router-rust
* **helm:** enable NATS + JetStream by default to match queue clusterRouting
* **helm:** fail fast when gateway has no bundle source
* **kind-smoke:** add --no-pool-isolation for static clusters + contract-drift fixes
* **kind-smoke:** address bot review feedback
* **kind-smoke:** enable configStore and harden config/gateway tests
* **kind-smoke:** enable JetStream on test NATS and drop duplicate subchart
* **kind-smoke:** wire sie-config image and helm overrides into kind cluster fixture
* **kind-smoke:** wire sie-config image into kind cluster fixture
* **observability:** address PR review blockers on metrics PR
* **sdk:** cluster cache prefix probe uses list, not head
* **sdk:** cluster cache prefix probe uses list, not head (Refs #732, #654)
* **sdk:** has_children filters folder-marker objects (Refs #732, #654)
* **sdk:** preserve caller-supplied document format over inferred (CodeRabbit)
* **sdk:** retry mid-flight transport disconnects, not just timeouts
* **sdk:** retry on connection errors and generic 503s
* **sie_config:** address PR review feedback
* **terraform/aws:** set 100GB root volume on cpu node group to avoid DiskPressure
* **terraform/gcp:** undo router→gateway rename on GCP Cloud Router + NAT
* **tests:** include sie-config in expected missing-image list
* **tests:** restore docker gateway smoke test after router rename
* **tmux-scripts:** improve robustness of session parsing and argument handling
* **types:** adapt to ty 0.0.32 stricter ignore handling
* use searchTerms for X tweet-scraper actor (was searchQueries)
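
Several fixes in this release concern how clients treat 503 responses (model-loading 503s carry an `X-SIE-Error-Code` header, and generic 503s are also retried). The sketch below shows how a client might tell the two cases apart. It is not the SDK's implementation: `classify_response` and its return labels are hypothetical, the header value `MODEL_LOADING` is assumed from the commit messages above, and only 503 handling is covered.

```python
def classify_response(status: int, headers: dict[str, str]) -> str:
    """Classify a gateway response for retry purposes (hypothetical helper).

    Returns "model_loading" for a 503 tagged as model loading,
    "generic_503" for any other 503, and "no_retry" otherwise.
    """
    if status != 503:
        return "no_retry"
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    # Assumed header value; the notes only name the header and the error code.
    if normalized.get("x-sie-error-code") == "MODEL_LOADING":
        return "model_loading"
    return "generic_503"
```

A caller could wait longer between attempts for `"model_loading"` than for `"generic_503"`, since a model load typically takes longer than a transient outage.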

### Performance Improvements

* **bench:** cache JPEG-encoded corpus images across queries
* **bench:** lazily JPEG-encode corpus images on first use
* **docker:** add --link + move ARG BUNDLE to eliminate cross-bundle layer noise
* **docker:** normalize mtimes so shared venv layer is dedupable
* **docker:** reorder stages for maximum BuildKit cache reuse
* **docker:** split worker venv into shared + bundle-specific layers
* **gateway:** cache SDK version parse, integer audit latency, UUIDv7
* **gateway:** cut hot-path allocations, fuse numpy decode, tighten backpressure
* **gateway:** fuse msgpack_numpy decode into the response path
* **gateway:** move score-endpoint unwrap instead of cloning
* **gateway:** pass msgpack items through as rmpv::Value
* **gateway:** publish work items concurrently + borrow shared fields
* **gateway:** tighten cold-pool backpressure + cheaper QPS counter
* **gateway:** trim per-request work on the inference hot path

## v0.2.0 (2026-04-17)

### Highlights

- **Breaking change:** Removed `--model` CLI args from worker startup; use `SIE_PRELOAD_MODELS` env var or `--preload` flag instead
- **New capabilities:** add ModernBERT flash dense embedding support with fallback mechanism; add OCR quality benchmarks with olmOCR-bench; add pages/sec throughput metric for OCR perf eval; add perf metrics to OCR eval pipeline; also report query throughput in mpix/s for image queries
- **Reliability and operations:** add missing NATS Helm repo to release workflow; don't set NODE_AUTH_TOKEN for OIDC npm publishes; harden affinity spill with bounds check, clamp, and debug log; make rejected requests visible to KEDA scaling metrics; remove redundant tokenizer validation and unused template parameter

### ⚠ BREAKING CHANGES

* **workers:** Removed `--model` CLI args from worker startup; use `SIE_PRELOAD_MODELS` env var or `--preload` flag instead

### Features

* **adapters:** add ModernBERT flash dense embedding support with fallback mechanism
* **bench:** add OCR quality benchmarks (olmOCR-bench)
* **bench:** add OCR quality benchmarks with olmOCR-bench
* **bench:** add pages/sec throughput metric for OCR perf eval
* **bench:** add perf metrics to OCR eval pipeline
* **bench:** also report query throughput in mpix/s for image queries
* **benchmarks:** add MTEB NFCorpus evaluation results for ModernBERT-based embedders
* **bench:** report vision corpus throughput in mpix/s
* **bench:** report vision corpus throughput in mpix/s instead of items/s
* **deps:** migrate from pynvml to nvidia-ml-py package
* **haystack:** add haystack_integrations namespace aliases
* **haystack:** add namespace-convention aliases
* **observability:** add anonymous usage telemetry
* **sdk:** add max_concurrency param to SIEAsyncClient to prevent connection pool exhaustion
* **server:** add lightonai/LightOnOCR-2-1B OCR adapter with next bundle
* **workers:** implement model preloading at startup to reduce first-request latency

### Bug Fixes

* **adapters:** remove redundant tokenizer validation and unused template parameter
* address PR review — panel title, namespace variable
* **bench:** handle unloaded images in pixel count computation
* **bench:** use concurrent async requests for OCR perf eval
* **bench:** validate image entries before computing pixel counts
* **bench:** validate pixel counts before using them for image corpus throughput
* **build:** downgrade dockerfile syntax version to 1 for broader compatibility
* **ci:** add missing NATS Helm repo to release workflow
* **ci:** don't set NODE_AUTH_TOKEN for OIDC npm publishes
* **dashboard:** queue routing dashboard accuracy and usability
* **docs:** add update date to portfolio header
* **docs:** correct PR reference in reranker reclassification note
* **docs:** populate reranker data and simplify table header
* **docs:** update stale model counts after reranker reclassification
* **haystack:** rename namespace alias to sie
* install uv via curl instead of COPY --from ghcr.io
* preload smoke test checks model.loaded instead of nonexistent workers field
* **readme:** heading format
* **release:** add LanceDB integrations to release-please config
* **router:** add overflow spill to break model affinity deadlock
* **router:** harden affinity spill with bounds check, clamp, and debug log
* **router:** make rejected requests visible to KEDA scaling metrics
* **sie-bench:** account for in-flight drain in throughput calculation
* **sie-bench:** use union wall-clock for multiprocess throughput merge
* **tester-cluster:** patient KEDA scale-down for worker pools
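
The "union wall-clock" fix for multiprocess throughput can be read as dividing total items by the length of the union of the per-process time windows, rather than by the sum of those windows. The sketch below illustrates that reading only; the function names are hypothetical and this is not the sie-bench code.

```python
def union_wall_clock(intervals: list[tuple[float, float]]) -> float:
    """Total time covered by at least one (start, end) interval."""
    total = 0.0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close out the previous merged span.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlap: extend the current merged span.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def merged_throughput(items: list[int], intervals: list[tuple[float, float]]) -> float:
    """Items per second across processes, using union wall-clock time."""
    elapsed = union_wall_clock(intervals)
    if elapsed <= 0:
        raise ValueError("no elapsed time to divide by")
    return sum(items) / elapsed
```

With two processes running over `(0, 10)` and `(5, 15)`, the union is 15 seconds instead of the 20 seconds a plain sum would give, so overlapping work is not counted twice in the denominator.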

## v0.1.10 (2026-04-09)

### Highlights

- **New capabilities:** add async, chunking, and streaming to Weaviate document enricher; improve DLQ routing and score response handling; implement Config Management API with NATS-based distribution and review fixes; add LanceDB integration (Python + TypeScript); queue routing dashboard, NATS Prometheus exporter, router image tag
- **Reliability and operations:** correct cluster routing condition, stream max_age units, and reconnect state ordering; add recreate strategy for router deployment when nats config restore is enabled; restore Chart.yaml deps from main, keep appVersion v-prefix; queue routing dashboard PromQL for NATS wait; configurable NATS fetch budget, Helm-wired queue params
- **Performance:** decouple scanner and SIE batch sizes in enrich_table; stream enrich_table batch-by-batch instead of full materialization; use Lance scanner for column projection in enrich_table; bypass FastAPI for hot proxy paths via raw ASGI middleware

### Features

* add async, chunking, and streaming to Weaviate document enricher
* **dlq,pull-loop:** improve DLQ routing and score response handling
* implement Config Management API with NATS-based distribution and review fixes
* **integrations:** add LanceDB integration (Python + TypeScript)
* **observability:** queue routing dashboard + NATS exporter + router image tag
* **observability:** queue routing dashboard, NATS prom exporter, router image tag
* **sdk:** add get_model() and configure LanceDB release workflows
* **terraform:** add AWS eval-eu EKS cluster with multi-GPU support
* **terraform:** add evaluation cluster setup for AWS with multi-GPU support and updated configurations
* **terraform:** add node labels, adjust pool sizes for tester cluster

### Bug Fixes

* **config,queue,nats:** correct cluster routing condition, stream max_age units, and reconnect state ordering
* handle BytesIO images in LlamaIndex and validate Weaviate classify config
* **helm:** add recreate strategy for router deployment when nats config restore is enabled
* **helm:** restore Chart.yaml deps from main, keep appVersion v-prefix
* **helm:** use generic release-please updater for appVersion
* **helm:** use generic updater for both Chart.yaml version fields
* **helm:** use l4-spot/rtx6000-spot naming convention for spot profiles
* **integrations:** address CodeRabbit review findings for LanceDB PR
* **observability:** queue routing dashboard PromQL for NATS wait
* **queue-routing:** configurable NATS fetch budget, Helm-wired queue params
* **queue-routing:** resolve bugs, add configurable NATS params, fix score wire format
* **queue-routing:** score response format and DLQ fallback routing key
* **release:** use NPM_TOKEN for initial sie-lancedb publish
* **router:** use "scores" key in queue-mode score responses
* **terraform:** add GPU subnet coverage validation
* **terraform:** relax AZ validation and clarify defaults
* **terraform:** review fixes for tester cluster infra
* **terraform:** switch tester-cluster to us-east-2
* **terraform:** switch tester-cluster to us-east-2 and update deployment docs
* **terraform:** validate gpu_node_groups for duplicate and reserved names
* **test:** add buildx builder pause recovery and improve build error diagnostics
* update adapter tests and address code review feedback
* use OCI registry URI for helm chart in README

### Performance Improvements

* **lancedb:** decouple scanner and SIE batch sizes in enrich_table
* **lancedb:** stream enrich_table batch-by-batch instead of full materialization
* **lancedb:** use Lance scanner for column projection in enrich_table
* **router:** bypass FastAPI for hot proxy paths via raw ASGI middleware
* **router:** reduce thread pool pressure by inlining small deserialization
* **router:** remove msgpack_numpy global patch and BaseHTTPMiddleware
* **router:** replace stdlib json with orjson for 3-10x faster serialization
* **sdk+router:** lazy msgpack_numpy.patch and pure ASGI middleware
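
The `enrich_table` entries above describe streaming batch-by-batch and decoupling the scanner batch size from the SIE batch size. A minimal sketch of that re-batching idea follows. The `rebatch` helper and its parameter names are hypothetical illustrations, not part of the sie-lancedb API, and plain Python lists stand in for scanner output.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def rebatch(scanner_batches: Iterable[List[T]], request_size: int) -> Iterator[List[T]]:
    """Re-chunk batches of one size into batches of another, lazily.

    Hypothetical helper: `scanner_batches` stands in for whatever the table
    scanner yields, and `request_size` for the batch size sent per request.
    Only a small buffer is held in memory at any time.
    """
    if request_size <= 0:
        raise ValueError("request_size must be positive")

    pending: List[T] = []
    for batch in scanner_batches:
        pending.extend(batch)
        # Emit as many full request-sized batches as the buffer allows.
        while len(pending) >= request_size:
            yield pending[:request_size]
            pending = pending[request_size:]

    # Flush the final partial batch, if any.
    if pending:
        yield pending


# Scanner batches of up to 3 rows, re-chunked into requests of 2 rows.
scanner_output = [[1, 2, 3], [4, 5, 6], [7]]
requests = list(rebatch(scanner_output, request_size=2))
```

Because `rebatch` is a generator, the read size and the request size can be tuned independently without materializing the whole table.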

## v0.1.9 (2026-04-02)

### Highlights

- **Reliability and operations:** increase docker smoke test timeouts and add retry; include $platform in worker image tag format; revert pool names to machine profile names; remove --provenance flag (requires public repo)

### Bug Fixes

* **helm:** include $platform in worker image tag format
* **helm:** revert pool names to machine profile names
* increase docker smoke test timeouts and add retry
* remove --provenance flag (requires public repo)

## v0.1.8 (2026-04-01)

### Highlights

- **Reliability and operations:** add sie-qdrant and sie-weaviate to release-please config; point sync-terraform default repos to production; remove --provenance from npm publish for private repo; correct image.tag comment to reflect actual format; remove duplicate platform suffix from worker image tag

### Bug Fixes

* add sie-qdrant and sie-weaviate to release-please config
* **ci:** point sync-terraform default repos to production
* **ci:** remove --provenance from npm publish for private repo
* **helm:** correct image.tag comment to reflect actual format
* **helm:** remove duplicate platform suffix from worker image tag
* remove internal-only references from COMPATIBILITY.md

## v0.1.7 (2026-04-01)

### Highlights

- **New capabilities:** add profiling script for sparse encoding hot path; add GitHub Actions workflow to sync Terraform modules to registry repos; apply QoL improvements from PR #484 review comments; switch default GPU from g5 (A10G) to g6 (L4); add rerank/score support to TEI runner; implement configurable document length limits and custom prefix token registration
- **Reliability and operations:** restore triggering ref for source checkout; restore quality by enabling causal attention and QK-normalization; restore dev-l4-spot zones to us-central1 for GPU availability; check /metrics endpoint in test_prometheus_metrics_exist; add per-attempt timeout to lease renewal fetch
- **Performance:** optimize MoE expert dispatch with sorted-expert routing; batch MaxSim scoring across documents on GPU; batch sparse aggregation with segment_reduce and fuse relu; batch split_embeddings + validate ColBERT performance

### Features

* **adapters:** add profiling script for sparse encoding hot path
* add GitHub Actions workflow to sync Terraform modules to registry repos
* apply QoL improvements from PR #484 review comments
* **aws:** switch default GPU from g5 (A10G) to g6 (L4)
* **bench:** add rerank/score support to TEI runner
* **colbert:** implement configurable document length limits and custom prefix token registration
* **deploy:** move namespace, SA, and HF token secret management to Helm chart
* **deploy:** prepare Terraform modules for public registry publishing
* **deploy:** rewrite example module sources to registry references
* **deploy:** rewrite Helm and internal references for public release
* **deploy:** two-artifact model — GCP Terraform infra-only, batteries-included Helm chart
* **docker:** add --docker-platform flag to docker build task
* extend create_pool API/SDK with minimum_worker_count and bundle
* **helm:** add batteries-included sub-chart dependencies to sie-cluster
* **helm:** add image pre-pull DaemonSet for GPU worker pools
* **helm:** add step to build Helm chart dependencies in Kind smoke tests
* **helm:** default router to image-embedded model configs
* **helm:** enable image pre-pull DaemonSet by default
* **helm:** port health gates from Terraform to Helm post-install hooks
* **helm:** remove prometheus alias, bump to v0.2.0, standardize chart
* **infra:** add Modal GPU sandbox for remote benchmark execution
* **infra:** add rollout warning and explicit image_type for GCFS
* **infra:** enable GCFS image streaming on GPU node pools
* **infra:** set min_node_count=1 on L4 spot GPU node pools
* **integrations:** add Qdrant integration
* **integrations:** add Qdrant integration with native sparse vector support
* **integrations:** add Weaviate v4 integration with Go module spec
* multiprocess loadtest + SDK aiohttp migration
* **sdk:** add version negotiation headers between SDK and server
* **sdk:** default wait_for_capacity=True and timeout=900s
* **sdk:** version negotiation header (SDK ↔ server)
* **sie-bench:** add dataset/input_type fields for mTEB corpus inputs
* **sie-bench:** built-in multiprocess loadtest mode
* **skills:** add eval-model skill for HF model assessment
* **skills:** add eval-model skill for HF model integration assessment
* sync Terraform modules to registry repos
* **tei-runner:** add /embed_sparse support for sparse models
* **tei:** add /embed_sparse support and auto-detect pooling mode
* **terraform/aws:** restore cluster autoscaler helm release to infra module
* **terraform/aws:** strip k8s resources, restructure as infra module with examples
* **terraform:** add cluster name and artifact registry variables; update node pool configuration
* **terraform:** add EBS CSI driver, NVIDIA device plugin, default StorageClass
* **terraform:** strip gcp k8s/ layer; examples use infra-only module
* **tools:** add ColBERT query vs document profiling script
* **tools:** add dense P50 latency profiling script

### Bug Fixes

* **adapters:** sort IDF unique_ids to satisfy SparseVector contract
* add missing production example to tf validate; fix tempfile leak; remove module docstring
* address PR #478 review feedback
* address PR review — GPU alert formula, kubectl parsing, CI path filter
* address review feedback for npm publish
* address review findings - race prevention, cleanup, lighter checkout
* **alloy:** add stage.cri\{\} before stage.json to unwrap CRI log envelopes
* **alloy:** explicitly set configMap name and key for sub-chart wiring
* **alloy:** scope pod discovery to current node via field selector
* **bench:** complete g5 to g6 migration in AWS eval configs and GPU mapping
* **benchmarks:** use TEI /embed_all for ColBERT multi-vector models
* **bench:** skip loading candidates_model for single-model servers
* **chart:** update home URL and Helm install command in README
* CI compatibility and consistent env var usage
* CI compatibility for sync-terraform workflow
* **ci:** add contents: read permission to publish-pypi-oidc job
* **ci:** add helm repo add + dep build to kind-smoke workflow
* **ci:** g5 refactored to g6 already
* **cluster:** build concrete helm command in status from infra_outputs
* **cluster:** guard helm/kubectl post-create log when outputs are empty
* **deploy:** clean terraform init artifacts before push
* **deploy:** correct smoke test TypedDict access and helm dry-run args
* **deploy:** correct StatefulSet rollout semantics, PDB scope, and KEDA pause
* **deploy:** remove dangling kubernetes_namespace_v1.sie references from health_gates.tf
* **deploy:** restore triggering ref for source checkout
* **deploy:** update default destination repos for GCP and AWS modules
* **deploy:** use triggering ref for source checkout in sync-terraform
* disable LoRA adapter layers after loading to prevent quality corruption
* **docs:** clarify optional image push in AWS and GCP README files
* **docs:** update Helm chart path in AWS and GCP README files
* fix integration test
* **helm,hook:** deploy/helm/sie-cluster/templates/hooks/prometheus-ready-test.yaml
* **helm:** add before-hook-creation to Job delete policies; document count==0 expectation
* **helm:** address coderabbit findings on health gate hooks
* **helm:** address non-blocking review findings from PR #336
* **helm:** address review findings in batteries-included sub-chart PR
* **helm:** address reviewer suggestions for health gate hooks
* **helm:** address second-pass review findings
* **helm:** aggregate buckets by le in p95 latency alert
* **helm:** bump chart version to 0.1.1 (patch, not minor)
* **helm:** clarify prometheusAddress comment — ignored when sub-chart is installed
* **helm:** correct kube-prometheus-stack semver constraint and remove hardcoded grafana password
* **helm:** correct misleading validation comment in router-deployment.yaml
* **helm:** don't emit ScaledObject CRDs unless KEDA is confirmed present
* **helm:** downscope KEDA RBAC to Role/RoleBinding; remove runtime apk installs
* **helm:** fix loki service URL and extract alloy config to file
* **helm:** fix three blocking review issues in sub-chart dependencies
* **helm:** improve temporary values file handling in helm_template function
* **helm:** improve, simplify, and modularize sie-cluster chart
* **helm:** move 'app.kubernetes.io/part-of' label to selector labels for consistency
* **helm:** remove autoscaling.enabled from values-aws.yaml
* **helm:** render KEDA ScaledObjects via post-install hook to avoid CRD chicken-and-egg
* **helm:** replace hardcoded namespace in provisioning alert rules
* **helm:** require non-empty hfToken.value when hfToken.create is true
* **helm:** sub-chart naming, Loki compactor, event exporter ECR, Grafana folders
* **helm:** use autoscaling.prometheusAddress in prometheus hook; remove stub health_gates.tf
* **helm:** use full FQDN for Prometheus service in KEDA and health gates
* **helm:** use router.service.port in NOTES.txt instead of hardcoded 8080
* **infra:** update min_node_count default in top-level GCP module
* normalize SDK version warned-set key to major.minor
* pool error types, add pool/progress test coverage
* **profiling:** add flash variant registry, device validation, top-level import
* **profiling:** sync GPU before tensor timing, move script to tools/
* **profiling:** use in-place relu_ to match production code path
* **qwen3:** restore quality by enabling causal attention and QK-normalization
* **readme:** correct helm chart path
* **readme:** correct helm install command
* **release:** track all package versions via release-please extra-files
* **release:** track TS SDK version.ts via release-please
* replace corrupted bge-m3 NanoFiQA2018 target + set bfloat16 precision
* review items
* **router:** increase pool lease TTL to survive rolling upgrades
* **router:** resolve default pool GPU for scale-up when gpu/pool omitted
* **router:** use effective_pool instead of pool_name for default pool GPU extraction
* **sdk:** defer aiohttp session creation to fix "no running event loop" in SIEAsyncClient
* **sie_bench:** improve --print-gap report accuracy and readability
* **tei-runner:** validate /embed_all returns per-token embeddings
* **tei-runner:** validate output_type in TEIRunner init
* **terraform/aws:** add full -backend-config flags to production init command
* **terraform/aws:** add precondition asserting &gt;=2 GPU-capable AZs exist
* **terraform/aws:** address review findings post-restructure
* **terraform/aws:** correct helm chart path in dev-g5-spot example comment
* **terraform/aws:** filter VPC AZs to only zones offering the GPU instance type
* **terraform/aws:** fix invalid splat on instance type offerings locations
* **terraform/aws:** remove provider aws block from child module
* **terraform/aws:** use var.project_name in VPC subnet cluster tags
* **terraform:** add validation for GPU node pool zones to ensure they match the configured region
* **terraform:** restore dev-l4-spot zones to us-central1 for GPU availability
* **terraform:** update GPU instance type description for clarity and add dev-g6-spot example
* **terraform:** update stale k8s module references in comments
* **terraform:** upgrade AWS modules and fix deprecations
* **test:** check /metrics endpoint in test_prometheus_metrics_exist
* **test:** update EKS tests from g5 to g6 after GPU instance type change
* **ts-sdk:** add per-attempt timeout to lease renewal fetch
* use prepack instead of prepublishOnly
* validate minimum_worker_count input and soften docstrings
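
The "normalize SDK version warned-set key to major.minor" fix above suggests that version-mismatch warnings are deduplicated per `major.minor` release line rather than per exact version. The sketch below shows that idea under that assumption. The names `major_minor`, `should_warn`, and `_warned` are hypothetical and are not taken from the SDK.

```python
from typing import Set

# Hypothetical module-level set of release lines already warned about.
_warned: Set[str] = set()


def major_minor(version: str) -> str:
    """Reduce a version such as 'v0.1.7' or '0.1.7' to its 'major.minor' key."""
    parts = version.strip().lstrip("v").split(".")
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"not a major.minor[.patch] version: {version!r}")
    return f"{parts[0]}.{parts[1]}"


def should_warn(server_version: str) -> bool:
    """Return True only the first time a given major.minor line is seen."""
    key = major_minor(server_version)
    if key in _warned:
        return False
    _warned.add(key)
    return True


first = should_warn("0.1.6")    # new release line 0.1
repeat = should_warn("v0.1.7")  # same line, different patch
other = should_warn("0.2.0")    # new release line 0.2
```

Keying on `major.minor` means a patch bump on the server does not trigger a second warning for the same compatibility line.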

### Performance Improvements

* **adapter:** optimize MoE expert dispatch with sorted-expert routing
* **adapters:** batch MaxSim scoring across documents on GPU
* **adapters:** batch sparse aggregation with segment_reduce and fuse relu
* **adapters:** batch split_embeddings + validate ColBERT performance
* **adapters:** batch split_embeddings in ColBERT adapters
* **adapters:** eliminate GPU overhead from IDF query encode path
* **florence2:** greedy decoding for OCR (-23% P50)
* **florence2:** switch OCR configs from beam search to greedy decoding
* **server,bench:** add batch coalescing, query warmup, and benchmark stability improvements
* **server:** dispatch immediately when worker is idle
* **server:** optimize BertFlashAdapter inference path (+35% corpus throughput)
* **server:** reduce batch wait timeout 10ms → 2ms for lower Doc P50
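
For readers unfamiliar with the MaxSim scoring mentioned above: in ColBERT-style late interaction, each query token is matched to its best-scoring document token and those maxima are summed. The pure-Python sketch below illustrates the definition only. It is not the adapter's GPU implementation, and the function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best similarity to any document token."""
    if not doc_tokens:
        raise ValueError("document has no token embeddings")
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def maxsim_many(query_tokens: Sequence[Vector], docs: Sequence[Sequence[Vector]]) -> List[float]:
    """Score one query against several documents, one MaxSim score per document."""
    return [maxsim(query_tokens, doc) for doc in docs]


query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # matches both query tokens exactly
doc_b = [[0.5, 0.5]]              # one token, partial match to each
scores = maxsim_many(query, [doc_a, doc_b])
```

A batched GPU version would compute the same quantity with one similarity tensor across all documents instead of a Python loop per document.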

## v0.1.6 (2026-03-12)

### Highlights

- **Breaking change:** remove `florence2` and `gliner` standalone bundles — extraction adapters (gliner, glirel, gliclass) are now included in the `default` bundle
- **New capabilities:** add native MTEB reranking task support with MRR metric; encode-dense matrix eval — 3 models × 8 tasks; add date-prefixed versioning for chronological filename ordering; reorganize and expand model size lookup table with alphabetical ordering; add detailed perf metrics, metric filter, and threshold selector; add marimo benchmark dashboard notebook
- **Reliability and operations:** restore /var/cache/apt mounts, keep /var/lib/apt removed; add HF_TOKEN auth and config kwargs and fix stella models; add dense projection support to Qwen2FlashAdapter; apply query_template from runtime options in SentenceTransformerDenseAdapter; replace pip install with uv add in docker error messages
- **Performance:** vectorize GTE sparse encode path; vectorize tokenization and packing for gte-multilingual-base; vectorize tokenization and packing to reduce throughput gap; switch Qwen/GTE models to flash attention adapter

### Breaking Changes

* **bundles:** remove `florence2` and `gliner` standalone bundles — extraction adapters (gliner, glirel, gliclass) are now included in the `default` bundle

### Features

* **bench:** add native MTEB reranking task support with MRR metric
* **bench:** encode-dense matrix eval — 3 models × 8 tasks
* **benchmarks:** add date-prefixed versioning for chronological filename ordering
* **benchmarks:** reorganize and expand model size lookup table with alphabetical ordering
* **benchview:** add detailed perf metrics, metric filter, and threshold selector
* **benchview:** add marimo benchmark dashboard notebook
* **benchview:** add perf metric selector to Model Size tab
* **benchview:** detailed perf metrics, metric filter, threshold selector
* **router,bench,sdk:** improve throughput with inflight tracking, batching, and connection pooling
* **server:** typed request parsing with msgspec
* **sie_server:** add gliner, glirel, and gliclass extraction dependencies to the default bundle

### Bug Fixes

* **adapter:** add dense projection support to Qwen2FlashAdapter
* **adapter:** apply query_template from runtime options in SentenceTransformerDenseAdapter
* apply CodeRabbit auto-fixes
* **bench:** replace pip install with uv add in docker error messages
* **benchview:** add missing statistics import and use _median helper
* **bundles:** include sglang bundle in default cluster and eval-matrix configs
* **client:** update websocket header parameter name from extra_headers to additional_headers
* **colbert:** enable native mode fallback for non-CUDA devices and add Matryoshka truncation
* **deps:** cap timm upper bound and fix lazy handler init
* **docker:** clear stale apt lists before update to prevent 404s
* **docker:** remove all apt cache mounts from Dockerfiles
* **docker:** remove no-op /var/lib/apt cache mount from apt RUN blocks
* **docker:** restore /var/cache/apt mounts, keep /var/lib/apt removed
* **helm:** increase CPU worker pool memory limits for expanded default bundle
* **model:** add missing query_template to stella_en_400M_v5
* **models:** switch all-MiniLM-L6-v2 to SentenceTransformerDenseAdapter
* **multilingual-e5-large-instruct:** use instruct query template, NFCorpus 0.3521 → 0.3567
* replace invalid HTML entities in SVG with XML numeric entities
* **rope_flash:** clear cached _rope_dummy on unload and use torch.cat for packing
* **router:** also resolve pool-derived GPU names to spot variants
* **router:** resolve bare GPU types to spot variants for KEDA scaling
* **sdk:** resolve sync/async client inconsistencies in score() and encode()
* **server:** centralize request validation to prevent 500s from malformed items
* **server:** use BertFlashAdapter for e5-small-v2, resolve e5 perf anomalies
* **server:** use BertFlashAdapter for intfloat/e5-small-v2 and remove stale benchmarks
* set compute_precision to bfloat16 for stella_en_1.5B_v5
* **sie_server:** add HF_TOKEN auth and config kwargs and fix stella models
* **sie_server:** resolve BGE-M3 linear weights loading for HF model IDs and fix test fixtures
* **sie_server:** support NV-Embed-v2 with PyTorch embedding adapter
* **splade:** align special token filtering and guard empty batches
* **test:** always rebuild Docker images to pick up code changes
* **typecheck:** move ty type checker from mise tool to uv dependency

### Performance Improvements

* **adapters:** vectorize GTE sparse encode path
* **rope_flash:** vectorize tokenization and packing for gte-multilingual-base
* **rope_flash:** vectorize tokenization and packing to reduce throughput gap
* **server:** switch Qwen/GTE models to flash attention adapter
* **splade:** vectorize tokenization and sparse aggregation (1.5x throughput)
* **splade:** vectorize tokenization and sparse aggregation in SPLADEFlashAdapter

## v0.1.5 (2026-02-27)

### Highlights

- **New capabilities:** add GLiNER v2.5 model configs; stream request bodies through proxy instead of buffering; add classification model configs for GLiClass-large and cross-encoder NLI
- **Reliability and operations:** fix release pipeline cache collision and smoke test timeout; strip content-length header from streamed proxy responses
- **Performance:** stream response body to eliminate bytes.join bottleneck

### Features

* **models:** add GLiNER v2.5 model configs
* **router:** stream request bodies through proxy instead of buffering
* **sie_server:** add classification model configs for GLiClass-large and cross-encoder NLI

### Bug Fixes

* fix release pipeline cache collision and smoke test timeout
* **router:** strip content-length header from streamed proxy responses

### Performance Improvements

* **router:** stream response body to eliminate bytes.join bottleneck

## v0.1.4 (2026-02-27)

### Highlights

- **Reliability and operations:** revert sharing=locked, add cache-read-only for build step

### Bug Fixes

* revert sharing=locked, add cache-read-only for build step

## v0.1.3 (2026-02-26)

### Highlights

- **Reliability and operations:** revert token lifetime extension, re-auth before push instead; revert token lifetime, re-auth before push

### Bug Fixes

* revert token lifetime extension, re-auth before push instead
* revert token lifetime, re-auth before push

## v0.1.2 (2026-02-26)

### Highlights

- **Reliability and operations:** fix release image builds failing from GCP token expiry

### Bug Fixes

* fix release image builds failing from GCP token expiry

## v0.1.1 (2026-02-26)

### Highlights

- **Reliability and operations:** update bundle definitions to replace legacy and gte-qwen2 with gliner

### Bug Fixes

* **bundles:** update bundle definitions to replace legacy and gte-qwen2 with gliner

## v0.1.0 (2026-02-26)

### Highlights

- **Breaking change:** HTTP 409 dependency conflict responses are removed from all API endpoints; the DEPENDENCY_CONFLICT error code no longer exists; .beads/ issue tracking data removed from repository
- **New capabilities:** add X-SIE-Worker response header for per-worker metrics tracking; add encode-image-text measurements to benchmarks dir; add encode-multivector perf measurements; add encode-multivector performance measurements; add encode-visual-document perf measurements; add encode-visual-document performance measurements
- **Reliability and operations:** increase helm install timeout from 10m to 15m; add trailing empty line to gitignore; align release-images workflow with docker task flags; register GLiClass and DeBERTa models in bundles; build and deploy gliner bundle in Kind smoke tests
- **Performance:** add connection pooling load test results (Feb 24); pool httpx client and add X-SIE-Worker header in router proxy; pool httpx client in router proxy to eliminate per-request TCP overhead; move transformers imports to module level

### ⚠ BREAKING CHANGES

* **deps:** HTTP 409 dependency conflict responses are removed from all API endpoints; the DEPENDENCY_CONFLICT error code no longer exists
* .beads/ issue tracking data removed from repository
* **deps:** model config files no longer support the `dependencies` field

### Features

* add X-SIE-Worker response header for per-worker metrics tracking
* **benchmarks:** add encode-image-text measurements to benchmarks dir
* **benchmarks:** add encode-multivector perf measurements
* **benchmarks:** add encode-multivector performance measurements
* **benchmarks:** add encode-visual-document perf measurements
* **benchmarks:** add encode-visual-document performance measurements
* **benchmarks:** add extract-detection L4-SPOT performance measurements
* **benchmarks:** add extract-kie-docvqa measurements to benchmarks dir
* **benchmarks:** add extract-relation L4-SPOT performance measurement
* **benchmarks:** add score-colbert perf measurements
* **benchmarks:** add score-colbert performance measurements
* **models:** add encode-image-text measurements
* **models:** add extract-detection measurements
* **models:** add extract-kie-docvqa measurements
* **models:** add extract-relation measurements
* **router:** add structured audit logging for API requests

### Bug Fixes

* **.claude:** add trailing empty line to gitignore
* align release-images workflow with docker task flags
* **bundles:** register GLiClass and DeBERTa models in bundles
* **ci:** build and deploy gliner bundle in Kind smoke tests
* **colbert:** remove CUDA requirement and improve device compatibility
* **eval:** read 'sie_id' instead of 'name' from model configs in runner
* **extract:** use dict access for Entity TypedDict in sort
* **gliner:** relax stale transformers&lt;4.52 pin
* increase helm install timeout from 10m to 15m
* reduce cpu-gliner resource requests for Kind CI
* **router:** read 'sie_id' instead of 'name' from model configs
* **server:** migrate NLI adapter to classifications and improve API consistency
* **server:** migrate nli_classification adapter and improve type annotations
* **server:** populate classifications instead of entities in GLiClass adapter
* use manifest mode for release-please and reset to v0.0.0
* use nested .gitignore for .claude/ directory

### Performance Improvements

* add connection pooling load test results (Feb 24)
* pool httpx client and add X-SIE-Worker header in router proxy
* pool httpx client in router proxy to eliminate per-request TCP overhead
* **pytorch-embedding:** move transformers imports to module level
* **server:** use uvloop as default event loop for uvicorn

### Reverts

* keep CONTRIBUTING.md clone URLs pointing to sie.git

### Miscellaneous Chores

* remove beads, agent prompts, mypy refs; consolidate ty config

### Code Refactoring

* **deps:** move adapter dependencies from per-adapter pyproject.toml to bundle YAML
* **deps:** remove model-level dependencies feature