Release Notes
Latest version: v0.6.6 (2026-06-14).
v0.6.6 (2026-06-14)
Highlights
- Reliability and operations: align pool-scoped bundle hashes; avoid sticky missing bundle hashes; clarify missing profile inheritance; fail closed on missing bundle metadata; stabilize keda all-marker e2e
Bug Fixes
- config: align pool-scoped bundle hashes
- config: avoid sticky missing bundle hashes
- config: clarify missing profile inheritance
- config: fail closed on missing bundle metadata
- tilt: stabilize keda all-marker e2e
v0.6.5 (2026-06-13)
Highlights
- New capabilities: demand-side token-reduction benchmark (Req 12, #1311); add describe_image tool (caption + zero-shot tags); add describe_image tool (caption + zero-shot tags) — Req 12 #1310; cap describe_image payload size before cluster calls; claude.ai connector surface — OAuth bridge + skill ZIP (Req 12 #1312); sie_mcp edge with docs_to_markdown tool (Req 12 #1306)
- Reliability and operations: harden replace snapshot IPC; return retryable OpenAI provisioning errors; bind OAuth authorization codes to client_id; doctor classifies probe read-timeouts as cold, not unreachable; align GPU memory pressure defaults
- Performance: skip tag embedding when top_k <= 0
Features
- bench: demand-side token-reduction benchmark (Req 12, #1311)
- mcp: add describe_image tool (caption + zero-shot tags)
- mcp: add describe_image tool (caption + zero-shot tags) — Req 12 #1310
- mcp: cap describe_image payload size before cluster calls
- mcp: claude.ai connector surface — OAuth bridge + skill ZIP (Req 12 #1312)
- mcp: sie_mcp edge with docs_to_markdown tool (Req 12 #1306)
- mcp: structured extraction + structured generation tools (Req 12 #1308)
- mcp: wire measured token-reduction figures into savings metadata
- tools: add sie doctor — per-capability cluster diagnostics
- tools: Florence-2 fallback for image OCR
- tools: sie_tools — Claude Code context-offload client for managed clusters
Bug Fixes
- align GPU memory pressure defaults
- config: detect bundle config hash drift
- config: fingerprint model pool ownership
- config: harden replace snapshot IPC
- config: replace drifted export snapshots
- deps: bump sidecar prometheus for protobuf advisory
- gateway: address provisioning review feedback
- gateway: align provisioning contract docs
- gateway: decode native media JSON bytes
- gateway: dereference structured output schema refs
- gateway: make provisioning non-2xx universally
- gateway: preserve ref sibling schema semantics
- gateway: return retryable OpenAI provisioning errors
- mcp: address review feedback on structured tools
- mcp: bind OAuth authorization codes to client_id
- mcp: blank-env fallback for model ids; honor SIE_MCP_IMAGE_TOP_K=0
- mcp: deep-copy committed token-reduction figures in build_metadata
- mcp: validate embedding shapes in _top_k_tags
- sdk: normalize score image payloads for wire transport
- sdk: normalize score images for wire transport
- server: guard readiness for removed configs
- server: honor pool-aware model configs
- server: render qwen3 vl reranker document images in user prompt
- server: render Qwen3-VL reranker document images in user prompt
- sie-cluster: add spot toleration to AKS worker pool
- tools: address doctor review feedback
- tools: doctor classifies probe read-timeouts as cold, not unreachable
- worker: keep SGLang loads off event loop
Performance Improvements
- mcp: skip tag embedding when top_k <= 0
v0.6.4 (2026-06-11)
Highlights
- New capabilities: add grant_admin_to_creator (opt-in AAD-RBAC for caller); lock model-cache storage account to cluster VNet by default; install kubelogin and convert kubeconfig after AKS get-credentials; wire Azure provider tooling; add azure (AKS) terraform module; ship values-aks.yaml AKS overlay with the Azure module
- Reliability and operations: harden storage_allowed_ip_ranges CIDR validation; harden release guarded merge checks; resubscribe stale NATS health stream; emit az aks get-credentials --overwrite-existing; drop unreachable final_registry guard so ACR path can fire
Features
- azure-terraform: add grant_admin_to_creator (opt-in AAD-RBAC for caller)
- azure-terraform: lock model-cache storage account to cluster VNet by default
- cluster: install kubelogin and convert kubeconfig after AKS get-credentials
- cluster: wire Azure provider tooling
- deploy: add azure (AKS) terraform module
- helm: ship values-aks.yaml AKS overlay with the Azure module
Bug Fixes
- azure-terraform: emit az aks get-credentials --overwrite-existing
- azure-terraform: harden storage_allowed_ip_ranges CIDR validation
- ci: harden release guarded merge checks
- cluster: address review feedback on Azure provider wiring
- cluster: drop unreachable final_registry guard so ACR path can fire
- cluster: set TF_VAR_* on Azure destroy path (same as create)
- deploy: revert system pool default to Standard_D4s_v3 (zoned everywhere)
- gateway: resubscribe stale NATS health stream
- sidecar: preserve msgpack work item payloads
v0.6.3 (2026-06-10)
Highlights
- New capabilities: add azure blob payload store support; add server-side copy fast path for cloud weight sync; informational generation eval CI gate over committed floors; add vision (image) input to generate(); preserve text/image content-part ordering; vision (image) input for generate()
- Reliability and operations: harden cloud cache sync paths; clear HIGH Dependabot alerts (docling, rustls-webpki); ensure cloud weight sync creates local parents; evict stale gateway workers on shutdown; fall back to relay on S3/GCS server-side copy failure
- Performance: engage conformant image preprocessing for v1; engage conformant image preprocessing for v1 (1.8x)
Features
- add azure blob payload store support
- add server-side copy fast path for cloud weight sync
- bench: informational generation eval CI gate over committed floors
- generate: add vision (image) input to generate()
- generate: preserve text/image content-part ordering
- generate: vision (image) input for generate()
- support azure blob cluster cache
- tester-cluster: rtx6000 g7e.4xlarge + sglang preload + hf-token wiring
Bug Fixes
- address azure cache review feedback
- address final cloud storage review issues
- bench: harden generation eval gate per review
- deps: clear HIGH Dependabot alerts (docling, rustls-webpki)
- ensure cloud weight sync creates local parents
- evict stale gateway workers on shutdown
- fall back to relay on S3/GCS server-side copy failure
- generate: address CodeRabbit review on vision input
- generate: address huronat review on vision input (F2-F8)
- generate: image-free content_parts field must not shadow layout
- generate: reject both-present image-bearing content layouts
- harden cloud cache sync paths
- loadtest-ci: self-heal orphaned cluster + stale lock in preflight
- nemo_colembed: trim left-padding rows from v1 conformant doc embeddings
- normalize local weight sync destination
- quality-adapter: gate v1 Vidore3 on English; finalize ?lang= plumbing
- skill: add bash language tag to hfCache --set fenced block (MD040)
- skill: move inline comments off shell continuation lines so the helm snippet pastes cleanly
- support cloud source weight sync
- tester-cluster: update rtx6000-spot machineType doc to g7e.4xlarge to match terraform
Performance Improvements
- nemo_colembed: engage conformant image preprocessing for v1
- nemo_colembed: engage conformant image preprocessing for v1 (1.8x)
v0.6.2 (2026-06-08)
Highlights
- New capabilities: defer sie-config NATS startup and honor log levels; refresh KEDA Tilt local dev branch; M4 dense encoders — mxbai-embed-large-v1, arctic-embed-l-v2.0, modernbert-embed-base; add daily guarded stable releases
- Reliability and operations: accept dense dim in qwen3 vl embedding adapter; preserve model query templates in mteb eval; scale single-profile bundles on gpu-agnostic demand; consolidate runtime ninja install; install ninja in cuda runtime
Features
- defer sie-config NATS startup and honor log levels
- dev: refresh KEDA Tilt local dev branch
- models: M4 dense encoders — mxbai-embed-large-v1, arctic-embed-l-v2.0, modernbert-embed-base
- release: add daily guarded stable releases
Bug Fixes
- accept dense dim in qwen3 vl embedding adapter
- bench: preserve model query templates in mteb eval
- dev: address KEDA Tilt PR review
- helm: scale single-profile bundles on gpu-agnostic demand
- server: consolidate runtime ninja install
- server: install ninja in cuda runtime
- server: install ninja in CUDA SGLang runtime
- terraform: deny non-HTTPS access on state and quality-eval S3 buckets
v0.6.1 (2026-06-07)
Highlights
- New capabilities: configure GPU disk sizing and generate smoke; support static queue pools
- Reliability and operations: fail fast on invalid static pool config; pin kind smoke workers to default queue pool; canonicalize static queue pool names; stabilize GPU disk Terraform test
Features
- configure GPU disk sizing and generate smoke
- gateway: support static queue pools
Bug Fixes
- address GPU disk review comments
- ci: pin kind smoke workers to default queue pool
- gateway: canonicalize static queue pool names
- gateway: fail fast on invalid static pool config
- stabilize GPU disk Terraform test
v0.6.0 (2026-06-07)
Highlights
- Breaking change: Queue work subjects and pool streams use the new sie.work.{pool}.{machine_profile}.{bundle}.{model} shape only; legacy subject filters are intentionally not preserved. Workers will subscribe to sie.work.*.<poolName> instead of sie.work.*.default. Deployed alone (without the matching gateway/sidecar update that publishes/filters on the new subject) this will break routing on every cluster. To preserve the old shared-queue behavior, set workers.common.queuePool: "default" explicitly.
- New capabilities: route work by queue pool lanes; default SIE_POOL to pool name (not “default”)
- Reliability and operations: harden queue lane admission; bump vitest 2.1.9 -> 4.1.0 (CVE-2026-47429); align lane defaults and tilt e2e; preserve worker-group queue defaults
⚠ BREAKING CHANGES
- gateway: Queue work subjects and pool streams use the new sie.work.{pool}.{machine_profile}.{bundle}.{model} shape only; legacy subject filters are intentionally not preserved.
- helm: workers will subscribe to sie.work.*.<poolName> instead of sie.work.*.default. Deployed alone (without the matching gateway/sidecar update that publishes/filters on the new subject) this will break routing on every cluster. To preserve the old shared-queue behavior, set workers.common.queuePool: "default" explicitly.
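As a rough illustration of the subject shape named in the gateway note, the sketch below assembles a sie.work.{pool}.{machine_profile}.{bundle}.{model} subject from its four parts. The helper name, the example token values, and the validation rule are assumptions made for illustration; they are not taken from the gateway code, and the notes do not say how names containing dots are encoded.

```python
# Hypothetical helper illustrating the subject shape from the breaking-change note.
# It is not the gateway's implementation; the name and checks are assumptions.

def work_subject(pool: str, machine_profile: str, bundle: str, model: str) -> str:
    """Build a subject shaped like sie.work.{pool}.{machine_profile}.{bundle}.{model}."""
    tokens = [pool, machine_profile, bundle, model]
    for token in tokens:
        # NATS separates tokens with "." and reserves "*" and ">" as wildcards,
        # so a literal token must not contain them (or whitespace).
        if not token or any(ch.isspace() or ch in ".*>" for ch in token):
            raise ValueError(f"invalid subject token: {token!r}")
    return ".".join(["sie", "work", *tokens])


# Example values are made up for the demonstration.
print(work_subject("default", "l4", "sglang", "qwen3-0_6b"))
```

Rejecting tokens that contain a dot keeps the subject at exactly six tokens, which is what a fixed-shape subject filter would rely on.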
Features
- gateway: route work by queue pool lanes
- helm: default SIE_POOL to pool name (not “default”)
Bug Fixes
- deps: bump vitest 2.1.9 -> 4.1.0 (CVE-2026-47429)
- gateway: harden queue lane admission
- helm: align lane defaults and tilt e2e
- helm: preserve worker-group queue defaults
v0.5.0 (2026-06-04)
Highlights
- Breaking change: workers.pools.<name>.bundle (string), workers.pools.<name>.minReplicas, workers.pools.<name>.maxReplicas, workers.pools.<name>.extraEnv, and workers.pools.<name>.imageBundle are replaced by workers.pools.<name>.bundles.<bundle>.{minReplicas, maxReplicas, extraEnv, imageBundle, enabled}. workers.common.bundle is removed (no longer consumed). StatefulSet, ScaledObject, PDB, and image-prepull DaemonSet names change from worker-{pool} to worker-{pool}-{bundle}, so in-place upgrades require deleting the old resources first.
- New capabilities: agent-jobs text-gen readiness — code/SQL/tools/guard evals + Qwen3.6-27B + precision routing; transfer sie-cluster claude skill; P(unsafe) logprob threshold for CHECK POLICY precision; split worker pools into pool × bundles schema; surface code/sql/guard capabilities; resolve job aliases in configs/resolve; add sglang worker pool for generative models
- Reliability and operations: expose unauthenticated metrics scrape port; expose unauthenticated metrics scrape port safely for prom; preserve gateway metrics scrape labels; drop unsupported ebnf advertisement + restore guardian a100 guard threshold; fail-fast on missing Spider DBs + order-sensitive SQL exec accuracy
⚠ BREAKING CHANGES
- helm: workers.pools.<name>.bundle (string), workers.pools.<name>.minReplicas, workers.pools.<name>.maxReplicas, workers.pools.<name>.extraEnv, and workers.pools.<name>.imageBundle are replaced by workers.pools.<name>.bundles.<bundle>.{minReplicas, maxReplicas, extraEnv, imageBundle, enabled}. workers.common.bundle is removed (no longer consumed). StatefulSet, ScaledObject, PDB, and image-prepull DaemonSet names change from worker-{pool} to worker-{pool}-{bundle}, so in-place upgrades require deleting the old resources first.
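The schema change in the helm note can be pictured as a transformation of the workers values mapping. The sketch below is a hypothetical migration helper, not tooling shipped with the chart: it moves the per-pool keys listed in the note under bundles.<bundle>, drops workers.common.bundle, and shows the old and new resource names. The pool name and replica counts in the example are made up, and the new enabled key is left unset because the note does not state its default.

```python
import copy

# Per-pool keys that the note says moved under bundles.<bundle>.
MOVED_KEYS = ("minReplicas", "maxReplicas", "extraEnv", "imageBundle")


def migrate_workers(workers: dict) -> dict:
    """Return a copy of a `workers` values mapping in the pool x bundles shape."""
    out = copy.deepcopy(workers)
    # workers.common.bundle is removed (no longer consumed).
    out.get("common", {}).pop("bundle", None)
    for pool in out.get("pools", {}).values():
        if "bundle" not in pool:
            continue  # already in the new shape, or nothing to move
        name = pool.pop("bundle")
        entry = {key: pool.pop(key) for key in MOVED_KEYS if key in pool}
        pool.setdefault("bundles", {})[name] = entry
    return out


def resource_names(pool: str, bundle: str) -> tuple[str, str]:
    """Old and new names for the StatefulSet, ScaledObject, PDB, and pre-pull DaemonSet."""
    return f"worker-{pool}", f"worker-{pool}-{bundle}"


# Example input in the old shape (values are illustrative only).
old = {
    "common": {"bundle": "default"},
    "pools": {"l4": {"bundle": "default", "minReplicas": 0, "maxReplicas": 2}},
}
new = migrate_workers(old)
print(new["pools"]["l4"])
print(resource_names("l4", "default"))
```

Because the resource names change, the old worker-{pool} objects are not updated in place; as the note says, they have to be deleted before the upgrade.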
Features
- agent-jobs text-gen readiness — code/SQL/tools/guard evals + Qwen3.6-27B + precision routing
- agents: transfer sie-cluster claude skill
- guard: P(unsafe) logprob threshold for CHECK POLICY precision
- helm: split worker pools into pool × bundles schema
- models: surface code/sql/guard capabilities; resolve job aliases in configs/resolve
- tester-cluster: add sglang worker pool for generative models
Bug Fixes
- agents: address sie cluster review comments
- bench: fail-fast on missing Spider DBs + order-sensitive SQL exec accuracy
- gateway: expose unauthenticated metrics scrape port
- gateway: expose unauthenticated metrics scrape port safely for prom
- guard: reject multi-candidate sampling + keep logprobs consistent on rewrite
- guard: robust verdict thresholding, logprob hygiene, decoded-token logprobs
- helm: fail-fast on missing/invalid bundle replica bounds
- helm: preserve gateway metrics scrape labels
- helm: use sidecar binary for image pre-pull
- models: drop unsupported ebnf advertisement + restore guardian a100 guard threshold
- sie_server: honor params.instruction in Florence-2 extract
- tester-cluster: cap rtx6000 default bundle to avoid over-subscription
- tools: via-SIE EBNF response_format shape + request/preload model split
v0.4.2 (2026-06-03)
Highlights
- New capabilities: 5-domain generation bench + via-sie quality matrix + gateway schema gaps; add e0-02 all-minilm time-share experiment; land coalesce_ms=5 + max_batch_requests=12 as Rust defaults; add --via-sie smoke path (route through sie_server); add min_tokens + system_prompt + temperature for G4 retry; close Qwen3.6-27B gap — min_tokens=10 + max=768 + ctx=4096
- Reliability and operations: set verbose=True on SIEServer so launch errors surface; document worker-sidecar metrics wiring; gate sidecar nats reconnect refresh; harden sidecar config recovery; budget loadtest barrier timeouts
- Performance: anchor min_batch_cost floor at max_batch_tokens // 4; tighten adaptive wait ceiling + revert gte-multilingual 32k; rebind vision Conv3d patch-embed to F.linear; raise max_batch_tokens 16k → 32k to stop IPC-batch shred
Features
- 5-domain generation bench + via-sie quality matrix + gateway schema gaps
- add e0-02 all-minilm time-share experiment
- batch_config: land coalesce_ms=5 + max_batch_requests=12 as Rust defaults
- bench-27b: add --via-sie smoke path (route through sie_server)
- bench-27b: add min_tokens + system_prompt + temperature for G4 retry
- bench-27b: close Qwen3.6-27B gap — min_tokens=10 + max=768 + ctx=4096
- bench-27b: launch full SIE stack (NATS+worker+gateway) for --via-sie
- bench+model: via-sie 4-task n=300 sweep + NEXTN smaller-draft on 27B
- bench: 0.6B via-sie validated; harness + 27B config gains
- bench: 5-shot CoT for CaseHOLD (item 5 — close 27B target gap)
- bench: fix Qwen3-0.6B GPQA (parrot bug) + 27B diagnostics; final matrix
- bench: improve perf eval output handling
- docling: accept image input + run on OCR-bench quality path
- gateway+worker: chat surface accepts min_tokens + chat_template_kwargs
- gateway: strengthen generation isolation guardrails
- latency: tighten FetchExpiryController defaults to 2/15/50
- model+bench: RTX-PRO-6000 FP8 profile for Qwen3.6-27B + 6000 validation
- model: bump Qwen3-0.6B serving context 1024→4096 for prod simple-task use
- models: add Marqo/marqo-fashionSigLIP (SigLIP open_clip, fashion image-text)
- ocr: docling accepts images + quality eval prefers documents
- reconcile live worker config in sidecar
- RTX PRO 6000 FP8 profile for Qwen3.6-27B + SIE-on-6000 generative benchmark matrix
- scheduler: load-aware pipeline_depth autotune (S14 follow-up)
- scheduler: production-parity defaults + serial pipeline (carveout p99 fix)
- scheduler: restore SIE_RUST_PIPELINE_DEPTH=2 default (deep-saturation fix)
- scheduler: SIE_PULL_QUANTUM_INCLUDE_QUEUE_MS for Py-main parity
- scheduler: SIE_RUST_WAVE_CADENCE env toggle (default on)
- scheduler: step adaptive controller once per wave (Python parity)
- sidecar: add worker config and pool admission reconciliation
- sidecar: wire generation direct dispatch
- sie_server: add MinerU2.5-Pro-2604-1.2B doc OCR adapter
- sie_server: carve out QueueExecutor + IPC types for Rust worker POC
- sie_server: integrate MinerU2.5-Pro-2604-1.2B doc OCR adapter
- sie_server: UDS msgpack IPC server for Rust worker sidecar
- sie_worker_rust: close parity gaps with Python pull loop + smoke test
- sie_worker_rust: scaffold Rust worker sidecar crate (Phase 1c)
- sie_worker_rust: wire end-to-end NATS -> IPC -> publish loop (Phase 1d)
- sie-bench: synchronize loadtest measurement start
- worker/rust: IPC connection pool — lift the sidecar’s last serialization bottleneck
- worker/rust: narrate the hot path — structured INFO, slow-RPC + heartbeat-streak WARNs, full error chains
- worker: introduce InferenceBackend trait + BackendRouter
- worker: native Candle BERT backend behind candle feature
Bug Fixes
- accept dense_dim in dense adapters
- adapters: replace Qwen3-VL vision Conv3d patch-embed with matmul
- adapters: route Qwen3-VL VLMs through flash attention (Vidore3 throughput)
- address pr review quality issues
- bench-27b: drop bundle from SIEServer (sie-server rejects bundle+models combo)
- bench-27b: set verbose=True on SIEServer so launch errors surface
- bench-27b: skip chat_template_kwargs on via-sie (gateway rejects unsupported field)
- bench-27b: wait for sie-server /healthz (not /health)
- bench: bump casehold/gpqa max_tokens to 2048 (CoT truncation)
- bench: let via-SIE smoke serve a profile-variant model end-to-end
- bench: resolve CPU deps for quality server
- catalog: include eval-matrix tasks so dispatch filter accepts them
- ci: address analyzer findings and stale queue test
- ci: avoid nested mise in integration fixture
- ci: keep sidecar out of warm cache
- ci: refresh gateway openapi contract
- correct e0 vm runbook paths
- deploy: add sidecar registry resources
- deploy: address server sidecar review feedback
- deploy: align server sidecar naming
- deploy: align server sidecar naming and kind preload smoke
- deploy: align tilt sidecar image naming
- deploy: document worker-sidecar metrics wiring
- deploy: keep sidecar on GHCR by default
- deploy: normalize server sidecar naming
- deploy: publish server sidecar image
- deploy: rename sidecar container to worker-sidecar
- deploy: wire SIE server sidecar for kind smoke
- deploy: wire worker sidecar image across kind and cloud
- gate sidecar nats reconnect refresh
- gateway+server: queue is the only mode — kill direct-mode cruft
- gateway: suppress H9 first-chunk-fallback on single-worker pools
- harden sidecar config recovery
- impact-map: keep profiles distinct when adapter_options differ
- keep generation machinery off default queue path
- loader: wire profile runtime.default_sampling into the adapter
- modal: report actual GPU on remote, not stale env-default
- model: bump Qwen3.6-27B default/h100 mem_fraction_static 0.85 → 0.92
- orchestrator: thread CLI -p profile through to client.extract
- preserve worker batch identity and publish image
- product: update design audit for topical docs
- quality_eval: take results-bearing JSON envelope in load_eval_json
- quality: batch3 of CodeQL findings + bench KIE bug
- quality: batch3 of CodeQL findings + bench KIE root-cause
- quality: batches 1+2 of CodeQL quality findings
- quality: close CodeQL quality-tab findings
- quality: drop redundant inline imports in donut + registry
- quality: repair adapter eval harness regressions
- remove e0 preflight httpx dependency
- require rust sidecar for queue workers
- review: 0.6B ctx test 1024->4096, loader except logs, README gaps resolved
- review: recompute 27B target delta_vs_baseline for the 2048 scores
- run directory creation
- run e0 vm scripts via uv
- scheduler: autotune signal — observed_p50/target_p50 ratio
- scope bundle config hash cache per registry
- security: bump astro to ^6.4.2 for website
- security: bump gateway deps to patched versions
- security: bump product/gtm Python lockfiles
- security: bump product/gtm/content/slides npm transitives
- security: bump root pnpm deps + add overrides for transitives
- security: bump root Python deps to patched versions
- security: bump sie_dashboard npm deps to patched versions
- security: bump sie_ts_sdk standalone pnpm transitives
- security: bump sst to ^4 to drop vulnerable aws-sdk v2
- security: cap vite at ^6 + add Node engines to website
- security: close ~190 Dependabot alerts across 9 manifests
- security: sanitize one-pager template with DOMPurify
- security: use Reflect.construct for WebSocket headers shim
- sie_bench: send SIE profile via X-SIE-MACHINE-PROFILE header
- sie_bench: use rapidfuzz for OmniDocBench edit distance
- sie_server: clear CUDA cache on uncovered VLM paths + drop private sem _value access
- sie_server: VLM cache clears on uncovered paths + drop private sem _value access
- sie-bench: budget loadtest barrier timeouts
- slow sidecar nats consumer reconcile
- smoke: launch sie_server worker with -b sglang, not -m <model>
- smoke: preload the target model in via-sie worker
- test: restore donut helper call contract
- worker-sidecar: harden queue carveout contracts
- worker/rust: one long-lived pull stream — kill 30s ack_wait stall
- worker/rust: re-copy src after cargo chef cook so real build isn’t a stub
- worker/rust: set CUDA_COMPUTE_CAP at build time (default 89, L4)
- worker/rust: stop shipping the cargo-chef stub binary as the real build
- worker: harden Candle backend + align dispatcher error contract
- worker: harden payload store + error paths; surface silent success bugs
- worker: SGLang adapter accepts min_new_tokens kwarg + 27B via-sie validated
Performance Improvements
- adaptive: anchor min_batch_cost floor at max_batch_tokens // 4
- batching: tighten adaptive wait ceiling + revert gte-multilingual 32k
- glm_ocr: rebind vision Conv3d patch-embed to F.linear
- gte-multilingual-base: raise max_batch_tokens 16k → 32k to stop IPC-batch shred
- mineru_vl: O(L) incremental no-repeat-ngram for greedy decode
- ocr: swap pure-Python Levenshtein DP for rapidfuzz
- rope_flash: vectorize CLS/mean pooling, eliminate per-item .item() sync
- server: FP16 on GPU, coalesce sized for IPC bursts, starvation self-heal
Reverts
- restore adaptive batching defaults to 15/50ms
- scheduler: drop depth autotune (signal didn’t pan out in S17)
v0.4.1 (2026-05-28)
Highlights
- New capabilities: add Qwen3.6-27B model + migrate to CUDA 12.9
- Reliability and operations: isolate generation direct dispatch from shared queues; resolve 18 open CodeQL alerts; use SHA256 (not SHA1) for actor_id log tag; colocate tests under infra/, update sync contract
Features
- server: add Qwen3.6-27B model + migrate to CUDA 12.9
Bug Fixes
- isolate generation direct dispatch from shared queues
- security: resolve 18 open CodeQL alerts
- security: use SHA256 (not SHA1) for actor_id log tag
- terraform-sync: colocate tests under infra/, update sync contract
Reverts
- security: drop advanced CodeQL setup
v0.4.0 (2026-05-27)
Highlights
- Breaking change: fail-closed authentication (default-deny)
- New capabilities: generation quality-gate scoring core (roadmap §5, trust-critical); generation-quality regression gate over the existing scorers; add cohere measurements for us-east-1; add openai measurements for us-east-1; add voyage measurements for us-east-1; regex/EBNF response_format + developer role (roadmap 1.7)
- Reliability and operations: forward provision_timeout_s in SIEImageTextWrapper.encode; raise image-task eval timeouts to fix Flickr30k nightly; exclude favicon + OG image from auth middleware; keep public surfaces vague about what’s behind auth; revert NextAuth function-form, use try/catch on Resource
- Performance: warm one Lambda, bump timeout, narrow S3 verdict fetch
⚠ BREAKING CHANGES
- gateway: fail-closed authentication (default-deny)
Features
- bench: generation quality-gate scoring core (roadmap §5, trust-critical)
- bench: generation-quality regression gate over the existing scorers
- benchmarks: add cohere measurements for us-east-1
- benchmarks: add openai measurements for us-east-1
- benchmarks: add voyage measurements for us-east-1
- chat: regex/EBNF response_format + developer role (roadmap 1.7)
- dashboard: add executive quality summary widget on landing
- dashboard: add hover-tooltip on ‘to verify’ explaining WARN
- dashboard: brand alignment foundation - palette, fonts, header
- dashboard: brand favicon, opengraph image, light-mode hover fix
- dashboard: brand foundation - palette, fonts, header logo
- dashboard: brand surfaces - retokenise landing page + widget
- dashboard: public preview shell on / so Slack unfurls work
- dashboard: retokenise badges - brand-elevated neutrals, DM Mono labels
- dashboard: retokenise landing surfaces onto brand palette
- dashboard: retokenise loadtest pages onto brand surfaces
- dashboard: retokenise quality pages onto brand surfaces
- dashboard: retokenise shared components and sign-in onto brand palette
- dashboard: treat WARN as passing-with-verify, shade gauge amber
- gateway,sdk,server: add native generate endpoint with improved admission control and validation
- gateway,sdk,server: add OpenAI-compatible chat completions with streaming and sampling extensions
- gateway,server: add multi-turn tool-use support with OpenAI-compatible message format
- gateway: /v1/completions (legacy OpenAI Completions, raw-prompt)
- gateway: /v1/completions streaming (text_completion SSE)
- gateway: /v1/generate accepts seed/logprobs/logit_bias/n/best_of/lora_adapter (M8)
- gateway: /v1/responses (OpenAI Responses API, MVP)
- gateway: /v1/responses structured array input (conversation history)
- gateway+worker: per-choice OpenAI streaming for n>1 (H4, H5, M4)
- gateway: accept OpenAI multimodal content-parts; reject images (no VL model)
- gateway: add routing salt + byte-preserving key mode (M11)
- gateway: advertise lora_adapters on /v1/models + pre-validate unknown names
- gateway: fail-closed authentication (default-deny)
- gateway: meaningful system_fingerprint on chat responses (roadmap 1.3/§5)
- gateway: refactor streaming and routing with improved error handling and metrics
- gateway: register /v1/moderations as explicit 501 (roadmap 1.8, phase 3)
- gateway: serve a rendered API reference at /docs (Redoc)
- gateway: unify /v1/embeddings on the OpenAI error envelope (roadmap 1.4)
- generation: best_of — over-generate + rank by logprob, return top n
- generation: complete M4 req2 generation primitive with streaming, structured outputs, and routing
- generation: multi-candidate n>1 (non-streaming) end-to-end (roadmap 1.5)
- generation: multi-LoRA serving (one base, N adapters, per-request) (roadmap 6.2)
- generation: ship generate() primitive — Qwen3.5-4B + NEXTN/MTP + xgrammar, adapter perf at parity with raw SGLang
- generation: streaming n>1 — per-candidate SSE interleave
- helm/sie-cluster: bundle cert-manager + trust-manager (opt-in) with self-signed TLS mode
- openapi: add tool_calls support to chat completion schema
- python-sdk: expose typed params for chat n/logprobs/lora_adapter/etc (M7)
- quality-eval: add heartbeat logging and improve long-running process observability
- routing: cache-aware (prefix-hash) routing (roadmap §6.3)
- sie_bench: add Cohere as a first-class eval source
- sie_bench: add Cohere multimodal embeddings
- sie_bench: add Cohere rerank backend for native MTEB rerank tasks
- sie_bench: add OpenAI Embeddings as a first-class eval source
- sie_bench: add Voyage provider source plumbing
- sie_bench: implement Voyage text embedding runner
- sie_dashboard: add /quality/compare to diff two quality runs
- terraform: add default_tags Project=sie/Cluster on all AWS providers
- terraform: add on-demand RTX 6000 baseline pool to tester-cluster
- terraform: add uniform project=sie label across all GCP clusters
- terraform: idle-stop and on-demand wake for quality-eval runner fleet
- terraform: scale quality-eval fleet to 5+5, smart-wake, 4h timeout
- terraform: wake-runners retries + watchdog queued-jobs backstop
- tester-cluster: add on-demand L4 worker pool + capacityType node pins
- ts-sdk: handle 202 provisioning in chatCompletions + expose missing fields (H1+M6)
- worker: SGLang owns grammar; worker preflight opt-in only (H8, ADR-0002)
- worker: wire mixed-pool fairness scheduler into the pull-loop (opt-in)
- worker: WorkClassScheduler core for mixed-pool fairness (roadmap §6.1)
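The best_of entry in the Features list above describes over-generating candidates, ranking them by logprob, and returning the top n. A minimal sketch of that selection step follows. The Candidate type, the function names, and the choice of mean per-token log probability as the ranking score are assumptions for illustration; the notes do not say whether the server ranks by a sum or a mean.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    token_logprobs: list[float]  # one log probability per generated token


def score(candidate: Candidate) -> float:
    # Assumption: rank by mean per-token log probability, so a longer
    # completion is not penalized just for having more tokens.
    if not candidate.token_logprobs:
        return float("-inf")
    return sum(candidate.token_logprobs) / len(candidate.token_logprobs)


def select_best(candidates: list[Candidate], n: int) -> list[Candidate]:
    """Keep the n highest-scoring candidates out of the best_of that were generated."""
    if not 1 <= n <= len(candidates):
        raise ValueError("n must be between 1 and the number of candidates")
    return sorted(candidates, key=score, reverse=True)[:n]


# Three generated candidates (best_of=3), of which the caller asked for n=2.
generated = [
    Candidate("a", [-0.1, -0.2]),
    Candidate("b", [-1.0]),
    Candidate("c", [-0.05]),
]
print([c.text for c in select_best(generated, 2)])
```

The cost of best_of scales with the number of candidates generated, not the number returned, which is why n is bounded by the candidate count here.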
Bug Fixes
- bench: declare olmocr[bench] dep for OCR-bench quality eval
- bench: drop --with-deps from playwright install (no sudo on g7e)
- bench: forward provision_timeout_s in SIEImageTextWrapper.encode
- bench: install playwright chromium for OCR-bench KaTeX rendering
- bench: playwright install --with-deps for OCR-bench chromium
- bench: raise image-task eval timeouts to fix Flickr30k nightly
- bench: respect similarity() inputs in ColBERT/ColPali wrappers
- bench: run olmocr.bench.tests off orchestrator’s asyncio loop
- centralize worker_id subject normalization (M5)
- chart: pin image-prepull DaemonSets to GPU nodes
- ci: copy assets/ into the gateway Docker build (redoc bundle)
- dashboard: close remaining ‘SIE Dashboard’ leaks in <title> and og:image:alt
- dashboard: drop edge runtime on opengraph-image for OpenNext
- dashboard: exclude favicon + OG image from auth middleware
- dashboard: keep public surfaces vague about what’s behind auth
- dashboard: pick healthy daily via coverage + health gates
- dashboard: require >=50 pairs on main-run fallback in nightly picker
- dashboard: revert NextAuth function-form, use try/catch on Resource
- dashboard: short tab title for authed users, neutral for unauth
- dev: set explicit auth opt-in for local gateway launchers (post fail-closed)
- gateway,sdk,server: add wire-level validation, improve resource cleanup, and enhance observability across request lifecycle
- gateway,sdk,server: prevent metric cardinality DoS and fix non-idempotent retry logic
- gateway,sdk,server: strengthen validation and eliminate silent failures across request lifecycle
- gateway,sdk,server: validate numeric fields and improve error handling
- gateway: add NATS config trusted producers helm override
- gateway: document ModelCapabilities in OpenAPI + refresh on profile delta-update
- gateway: generation timeouts bypass legacy request-timeout ceiling (H7)
- gateway: scope LoRA adapter capabilities per profile (M10)
- gateway: strict allow-list + 400 contract on /v1/completions (H3)
- gateway: strict allow-list + 400 contract on /v1/responses (H2)
- gateway: tighten chat sampler/token-cap + tool-history validation (M1, M13)
- gateway: trust chart-rendered sie-config pod name for NATS deltas
- generation: cancel tombstone prevents first-chunk fallback double-execution (H9)
- generation: LoRA lora_path is a top-level /generate field, not a sampling param
- generation: tighten lossy tool-control flags (M14)
- grammar: resolve tokenizer adapter for Outlines processor factories and remove anchors from regex patterns
- helm/sie-cluster: guard validateTls probe against deployments with nil labels
- helm/sie-cluster: guard validateTls probe against nil deployment labels
- helm/sie-cluster: include cert-manager mode in presence-check gate
- helm/sie-cluster: label-based cert-manager detection + bidirectional runtime check
- helm/sie-cluster: make self-signed root-CA namespace configurable
- helm/sie-cluster: one-step bundled cert-manager install with self-signed TLS
- helm/sie-cluster: probe cert-manager controllers cluster-wide
- helm/sie-cluster: regenerate Chart.lock with synced digest
- helm: trim and drop empty entries in ingress.hosts
- modal: exclude cargo target/ from sandbox image mount
- pytorch_embedding: accept and forward revision kwarg
- quality_eval: tolerate stdout noise around eval JSON envelope
- quality-eval: handle paginated jobs API and align last two filters
- release-docker,warm-cache: address review findings
- server,gateway: add GPU-aware health probes to detect and recover from wedged CUDA contexts
- server,gateway: GPU-aware health probes to detect & recover from wedged CUDA contexts
- sie_bench: register OPENAI_SOURCE so --save-targets openai actually saves
- sie_dashboard: address compare-page review nits
- sie_dashboard: label compare log links with run IDs
- sie_server: base64-decode JSON image inputs
- sie_server: enforce media bytes contract at every consumer
- sie_server: install cv2 system libs for docling extract
- streaming: no-silent-drop on chunk-queue backpressure (H6)
- terraform: detect in-flight workflow runs via explicit status query
- terraform: drop redundant Project overrides in quality-eval-l4
- terraform: drop watchdog_idle_minutes default to 5, ignore PIP drift
- terraform: grant ec2:DescribeInstanceStatus to wake role
- terraform: per-runner idle-stop via GitHub Actions runners API
- terraform: require positive activity observation before idle-stop
- terraform: scope quality-eval IAM on Role tag instead of Project
- terraform: seed GPU node-group desired_size from min_size
- terraform: watchdog backstop covers queued-status runs
- terraform: watchdog grants HANG_MINUTES grace from LaunchTime
- terraform: watchdog ignores LastBusyAt older than LaunchTime
- tester-cluster: pin worker pool nodeSelectors to gpu-type as well
Performance Improvements
- dashboard: warm one Lambda, bump timeout, narrow S3 verdict fetch
Reverts
- release-docker: drop matrix consolidation, keep deps push retry
v0.3.4 (2026-05-14)
Highlights
- New capabilities: default payload store to model-cache bucket /payloads; typed InputTooLongError for extract 400 INPUT_TOO_LONG
- Reliability and operations: bump dev-g6-spot to g6.2xlarge; bump dev-g6-spot to g6.2xlarge so default worker pool fits; default workers to shared queue pool; pin opencv-python-headless to drop X11 runtime deps; tolerate config conflicts in bootstrap, gate on sie-config
Features
- infra: default payload store to model-cache bucket /payloads
- sdks: typed InputTooLongError for extract 400 INPUT_TOO_LONG
Bug Fixes
- aws-example: bump dev-g6-spot to g6.2xlarge
- aws-example: bump dev-g6-spot to g6.2xlarge so default worker pool fits
- chart: default workers to shared queue pool
- cluster.py,aws.py: address review suggestions 1 & 2
- deps: pin opencv-python-headless to drop X11 runtime deps
- gateway: tolerate config conflicts in bootstrap, gate on sie-config
- gateway: tolerate config conflicts in bootstrap, gate on sie-config ready
- sdk: widen sie-sdk requires-python to >=3.12
- terraform-aws: default ECR creation off, prefix repo names with project_name
- terraform-aws: trim slashes from ecr_repository_prefix
- terraform-google-sie: wait for identity pool before binding WI
- terraform: relax required_version from ~> 1.14.3 to >= 1.14
v0.3.3 (2026-05-13)
Highlights
- New capabilities: add ColQwen3 + Nemotron ColEmbed v2 visual doc retrieval; add text classification task support; add post-download load timeout with stall-based download bounds; add scope-able workflow_dispatch with model/profile/task filters; add new INPUT_TOO_LONG ErrorCode; enforce overflow_policy in gliclass adapter
- Reliability and operations: surface empty matrix and add measurement-mode for unbaselined adapters; emit task_class in quality-adapter JSON output; score detection eval predictions from result["objects"]; annotate empty-diff path that bypasses impact_map; capture real exit code from impact_map in resolve-impact.sh
Features
- adapters: add ColQwen3 + Nemotron ColEmbed v2 visual doc retrieval
- extraction: add text classification task support
- model-loader: add post-download load timeout with stall-based download bounds
- quality-adapter: add scope-able workflow_dispatch with model/profile/task filters
- server: add new INPUT_TOO_LONG ErrorCode
- server: enforce overflow_policy in gliclass adapter
- server: route INPUT_TOO_LONG to HTTP 400 in extract API
- server: validate overflow_policy in resolve_runtime_options
Bug Fixes
- bench: emit task_class in quality-adapter JSON output
- bench: score detection eval predictions from result["objects"]
- ci: annotate empty-diff path that bypasses impact_map
- ci: capture real exit code from impact_map in resolve-impact.sh
- ci: collapse adapter-equivalent profiles in quality-adapter matrix
- ci: pin mise to 2026.5.5 in loadtest workflows
- ci: surface empty matrix and add measurement-mode for unbaselined adapters
- docker: install libspatialindex-c6 in worker images
- gliclass: catch IndexError empty-tensor crash as InputTooLongError
- gliclass: raise InputTooLongError from argmax-empty backstop
- probe-chart: make 3-OLD/2-NEW sample asymmetry explicit in title
- quality-adapter: namespace quality_eval tests to avoid conftest collision
- quality-adapter: split adapter_paths on commas before --changed-dirs
- quality-adapter: split Pair column so pair_key | stops breaking the table
- quality-adapter: split Pair column to stop pair_key | breaking the table
v0.3.2 (2026-05-08)
Highlights
- New capabilities: default score_pairs() in BaseAdapter + baseline reranking targets; bump cold-start schema to v6 with deserialize/warmup split; per-model perf concurrency defaults for OCR adapters; adapter-triggered quality eval on persistent L4 runner; make destroy conditional on workflow_dispatch input; nightly loadtest pipeline + baseline recorder
- Reliability and operations: raise loadtest job timeout to GH Actions ceiling (360 min / 6h); clarify experimental NATS health mode; lang tag on fenced block; fail fast on invalid scenarios; surface error/no_results rows in MD; harden parse_label against unexpected filenames; adapt OCR perf Item shape to model.inputs and fail loudly on errors
Features
- adapters: default score_pairs() in BaseAdapter + baseline reranking targets
- bench,charts: bump cold-start schema to v6 with deserialize/warmup split
- bench: per-model perf concurrency defaults for OCR adapters
- ci: adapter-triggered quality eval on persistent L4 runner
- ci: make destroy conditional on workflow_dispatch input
- ci: nightly loadtest pipeline + baseline recorder
- colbert: add score_pairs support and expand model coverage
- dashboard: add status and kind filters to runs list
- dashboard: introduce run-group concept (run = 3 scenarios)
- dashboard: loadtest results dashboard (Next.js + SST + DynamoDB)
- dashboard: render every metric in the perf-lab archive
- dashboard: scaffold loadtest dashboard (Next.js + SST)
- dashboard: status and kind filters on runs list
- dashboard: track run_status; gh-API one-time backfill
- docling: add ocr profile defaulting do_ocr=true
- gateway: expose OpenAPI contract
- gateway: unify API errors and align probe contracts
- helm: expose probes value trees for worker/gateway/config
- helm: tighten startup/readiness probes for faster pod-ready
- helm: TLS termination via cert-manager + BYO matrix docs
- helm: wire probe templates to values trees
- infra: opt-in S3 cluster model cache
- ltfr: cache-vs-no-cache compare chart, with-cache run data, and 8 single-mode chart refresh
- matrix: add task_class stamping to eval measurements
- server,bench: split deserialize/warmup in cold-start instrumentation (v6)
- server: cap torch CPU threads at worker startup
- sie_server: per-stage timing markers in lifespan for engine_boot attribution
- sie_server: split adapter.warmup() out of load() with cold-start log markers
- tools: bump cold-start bench to v5 with scenario flag
- tools: LTFR per-scenario bench tooling + results (issue #652)
- tools: ltfr-bench orchestrator (issue #652)
Bug Fixes
- bench: adapt OCR perf Item shape to model.inputs and fail loudly on errors
- bench: address CodeRabbit review on PR #779
- bench: correctly detect v6 split presence in flattened runs[]
- bench: derive emitted gpu_load_s from v6 deserialize+warmup split when available
- chart: pass --cluster-cache to sie-server and correct populate command in docs
- charts: vertical legend so ‘image pull + container init’ and ‘node prov’ aren’t clipped
- ci+terraform: three deterministic root causes for loadtest pipeline
- ci: address CodeRabbit findings on quality-adapter PR
- ci: address CodeRabbit’s second-pass review on quality-adapter
- ci: address CodeRabbit’s third-pass review on quality-adapter
- ci: auto-clear stale terraform state lock from prior runner crashes
- ci: drop double cuda12 suffix + force codebuild for missing images
- ci: forensic dump on argo failure + LB-ENI release before destroy
- ci: gate stale-lock clear behind force_unlock input + pass --aws-region to destroy
- ci: override registry/gpu-selector/tolerations + Python heredoc
- ci: parse markdown bench output → result.json synthesis
- ci: pass WORKFLOW env to run_scenarios.sh in loadtest.yml
- ci: preflight env-var check in run_scenarios + finalize scripts
- ci: provision GH PAT secret + in-cluster github-token before bootstrap
- ci: raise loadtest job timeout to GH Actions ceiling (360 min / 6h)
- ci: read bench-config from local clone, not raw.githubusercontent.com
- ci: right-size bench pod + worker pod resources for cluster shape
- cluster: move orphan-LB sweep into cmd_destroy, drop parallel script
- cluster: use project_name (not example name) for orphan-LB VPC tag
- cluster: use project_name for orphan-LB VPC tag lookup
- dashboard,ci: keep run_status consistent between S3 and DynamoDB
- dashboard,ci: wire real Prometheus matrix shape + extend headlines
- dashboard: drop time-based legacy run grouping (was unsafe)
- dashboard: GPU util shown as 0-100 (was being multiplied by 100 again)
- dashboard: include duration_seconds in run-meta.json (was DynamoDB-only)
- dashboard: normalize array-shaped searchParams before .trim()
- deps: bump plotly to >=6.1.1 for kaleido compat
- docker: stub bundles/ and models/ in deps stage
- docling: cache DocumentConverter per (device, ocr_enabled)
- docling: mark adapter unloaded in unload()
- docling: thread device through PdfPipelineOptions accelerator_options
- gateway: address latest coderabbit contract notes
- gateway: address PR review for NATS health mode
- gateway: address probe and SDK review findings
- gateway: align CreatePoolRequest OpenAPI with runtime validation
- gateway: clarify experimental NATS health mode
- gateway: close remaining review contract gaps
- gateway: preserve embeddings timing headers
- gateway: preserve scale-from-zero request path
- gateway: reject unsupported embeddings token arrays
- helm: validate ACME server and privateKeySecretRef in validateTls
- infra: grant kms:Decrypt to workers when model cache uses SSE-KMS
- infra: normalize whitespace-only model cache string inputs
- infra: treat empty model_cache_kms_key_id as unset
- infra: use flat lifecycle key for s3-bucket module v5
- loadtest-ci: force-delete orphan elbv2 LBs before terraform destroy
- loadtest-ci: force-delete orphan LBs and stop swallowing destroy failures
- loadtest-ci: poll workflow phase instead of argo submit --wait
- loadtest-ci: poll workflow phase instead of relying on argo --wait
- ltfr-bench,notes: lang tag on fenced block; fail fast on invalid scenarios; surface error/no_results rows in MD
- ltfr-bench: hoist imports to top; guard payload.results shape
- ltfr-bench: mark request_failed rows in scenario-a/b MD tables
- ltfr-bench: preserve failure context in aggregated rows; add request_failed status
- ltfr-bench: treat no_results cells as failures in exit code
- ltfr-charts: strip legend clip-path so labels render full width
- ltfr: tighten UID/timestamp guards in capture_image_pull_events
- multi_pod_cold_start: raise on ASG terminate fail; UID-filter pull events; isolate scenario-c pod
- paddleocr_vl: pass use_cache=True to generate
- paddleocr_vl: pass use_cache=True to generate to enable KV-cache
- review: tighten score_pairs options handling and query text validation
- sie-server: include model and bundle directories in wheel distribution
- terraform: detect HF-cache EBS by NVMe model + size, not by Linux name
- terraform: set resolve_conflicts_on_* = OVERWRITE on EKS addons
- tools: drop module-level docstrings (AGENTS.md rule)
- tools: guard fig_per_cell_table aggregation against empty results
- tools: guard mean() against empty engine_boot_s in aggregate()
- tools: harden parse_label against unexpected filenames
- tools: mark cold-start-bench.py executable (EXE001)
- tools: remove module docstring from cold_start_charts.py (repo rule)
v0.3.1 (2026-04-29)
Highlights
- New capabilities: add OmniDocBench OCR quality loader; support /v1/score with dense/sparse/colbert/hybrid modes; add Marqo/marqo-ecommerce-embeddings-B via open_clip backend
- Reliability and operations: add terminal failed state to model registry (sie-test#85)
Features
- bench: add OmniDocBench OCR quality loader
- bge-m3: support /v1/score with dense/sparse/colbert/hybrid modes
- siglip: add Marqo/marqo-ecommerce-embeddings-B via open_clip backend
Bug Fixes
- server: add terminal failed state to model registry (sie-test#85)
v0.3.0 (2026-04-29)
Highlights
- Breaking change: openapi.json is now a committed artifact that must be regenerated and committed when API changes are made
- New capabilities: add GLM-OCR adapter; add Qwen3-VL-Embedding-2B and Qwen3-VL-Reranker-2B multimodal adapters; add GLiNER2 and GLiNER-bi adapters; add Qwen3-Reranker-0.6B and 4B causal LM reranker support; add SigLIP 2 base-patch16-224 vision-language encoder; add minimal cache weights snapshot command for offline deployments
- Reliability and operations: retry only transient connection errors under wait_for_capacity; surface unrouteable models loudly and helm-repo-add on pristine hosts; emit identical NATS payload to bundle and _all subjects; surface mixed-profile unrouteable models and keep snapshot consistent on writes; add retry logic for deadsnakes PPA to handle Launchpad outages
- Performance: cache JPEG-encoded corpus images across queries; lazily JPEG-encode corpus images on first use; cache SDK version parse, integer audit latency, UUIDv7; cut hot-path allocations, fuse numpy decode, tighten backpressure
⚠ BREAKING CHANGES
- openapi: openapi.json is now a committed artifact that must be regenerated and committed when API changes are made
Features
- adapters: add GLM-OCR adapter
- adapters: add Qwen3-VL-Embedding-2B and Qwen3-VL-Reranker-2B multimodal adapters
- add GLiNER2 and GLiNER-bi adapters
- add Qwen3-Reranker-0.6B and 4B causal LM reranker support
- add SigLIP 2 base-patch16-224 vision-language encoder
- admin: add minimal cache weights snapshot command for offline deployments
- bench: honor SIE_BENCH_SERVER_READY_TIMEOUT in eval orchestrator
- ci: nightly loadtest gate against dedicated EKS cluster
- ci: nightly loadtest gate, ephemeral cluster per run
- extract: add Docling adapter for PDF/DOCX/HTML extraction
- extract: add Docling adapter for PDF/DOCX/HTML parsing
- extract: plumb document items and structured `data` results
- observability: add Prometheus metrics to sie-config and expand sie-gateway coverage
- observability: Prometheus metrics for sie-config and sie-gateway
- oom: implement defensive exception fan-out and improve recovery metrics
- oom: improve error semantics and budget exhaustion detection
- openapi: add static spec export and validation
- router: import Rust gateway source tree
- server: add reactive OOM recovery and proactive idle eviction
- social: daily social content pipeline with 5-source drafts + engagement
- types: add `document` input modality across SDKs, server, and metadata
Bug Fixes
- adapters: add input validation guards for empty/failed visual inputs
- adapters: address review findings for Qwen3-VL adapters
- adapters: clarify video placeholder, validate token IDs, fix torch_dtype key
- add client-side hour filter to search_x_posts (was date-level only)
- address CodeRabbit review feedback
- address follow-up PR review nits
- address remaining CodeRabbit feedback (round 2)
- address review findings — negative truncation guard, score() options, constant dedup
- bench: show correct unit labels for MP/s throughput in --print-gap
- bundles: declare Qwen3-VL adapters in default bundle
- ci: use Blacksmith runner in CI
- client: retry only transient connection errors under wait_for_capacity
- cluster: address PR #701 review comments
- cluster: correct kubectl flag combo and reorder LB sweep before helm uninstall
- cluster: helm uninstall before terraform destroy to clean up AWS LB leftovers
- cluster: unblock end-to-end `mise run cluster create --build`
- config,cluster: surface unrouteable models loudly and helm-repo-add on pristine hosts
- config: emit identical NATS payload to bundle and _all subjects
- config: surface mixed-profile unrouteable models and keep snapshot consistent on writes
- docker: add retry logic for deadsnakes PPA to handle Launchpad outages
- docker: propagate failure when all add-apt-repository retries exhausted
- docling: per-task converter, hf_revision guard, callable typing (CodeRabbit)
- docs: Update packages/sie_server/Dockerfile.cuda11
- fail closed on missing/unparsable timestamps in lookback filter
- gateway,config,sdk: resiliency, concurrency, and cross-service hash parity
- gateway,config: address PR review — 404 for unknown models, 202 on default routing, full YAML propagation
- gateway,config: harden auth, trusted NATS producers, and recovery path; drop gateway HA default
- gateway,sdk: map upstream timeouts to 503+MODEL_LOADING for SDK retry
- gateway: add GET /v1/models/{model} detail route
- gateway: address PR #716 review feedback
- gateway: align /v1/models error and list shapes
- gateway: drop double-counted REQUEST_COUNT / REQUEST_LATENCY emit
- gateway: emit X-SIE-Error-Code header on model-loading 503
- gateway: keep record_request async to match main’s call shape
- gateway: make sie-config single source of truth for bundles with live resync
- gateway: normalize model ids in NATS work subjects + docs/tooling/ha cleanup
- gateway: pre-instantiate request/demand metric families on startup
- gateway: prioritize epoch-rewind branch; harden no-thrash test; correct arch-guide on ephemeral restart
- guard score() and score_pairs() against empty input lists
- helm: default clusterRouting to “queue” on import-sie-router-rust
- helm: enable NATS + JetStream by default to match queue clusterRouting
- helm: fail fast when gateway has no bundle source
- kind-smoke: add --no-pool-isolation for static clusters + contract-drift fixes
- kind-smoke: address bot review feedback
- kind-smoke: enable configStore and harden config/gateway tests
- kind-smoke: enable JetStream on test NATS and drop duplicate subchart
- kind-smoke: wire sie-config image and helm overrides into kind cluster fixture
- kind-smoke: wire sie-config image into kind cluster fixture
- observability: address PR review blockers on metrics PR
- sdk: cluster cache prefix probe uses list, not head
- sdk: cluster cache prefix probe uses list, not head (Refs #732, #654)
- sdk: has_children filters folder-marker objects (Refs #732, #654)
- sdk: preserve caller-supplied document format over inferred (CodeRabbit)
- sdk: retry mid-flight transport disconnects, not just timeouts
- sdk: retry on connection errors and generic 503s
- sie_config: address PR review feedback
- terraform/aws: set 100GB root volume on cpu node group to avoid DiskPressure
- terraform/gcp: undo router→gateway rename on GCP Cloud Router + NAT
- tests: include sie-config in expected missing-image list
- tests: restore docker gateway smoke test after router rename
- tmux-scripts: improve robustness of session parsing and argument handling
- types: adapt to ty 0.0.32 stricter ignore handling
- use searchTerms for X tweet-scraper actor (was searchQueries)
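Several of the gateway and SDK fixes above concern which responses a client should retry: upstream timeouts are mapped to 503 with MODEL_LOADING, the gateway emits an X-SIE-Error-Code header on model-loading 503s, and the SDK retries generic 503s. A minimal sketch of that classification follows. The function name `retry_reason` and the assumption that the header carries the literal value `MODEL_LOADING` are illustrative only; this is not the SDK's actual code.

```python
from typing import Mapping, Optional


def retry_reason(status: int, headers: Mapping[str, str]) -> Optional[str]:
    """Return why a response looks retryable, or None if it does not.

    Hypothetical helper: assumes a model-loading 503 is marked with
    the header X-SIE-Error-Code: MODEL_LOADING.
    """
    if status != 503:
        return None
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if lowered.get("x-sie-error-code") == "MODEL_LOADING":
        return "model_loading"
    return "generic_503"
```

A caller could use the returned reason to choose a backoff, for example waiting longer when a model is still loading than after a generic 503.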
Performance Improvements
- bench: cache JPEG-encoded corpus images across queries
- bench: lazily JPEG-encode corpus images on first use
- docker: add --link + move ARG BUNDLE to eliminate cross-bundle layer noise
- docker: normalize mtimes so shared venv layer is dedupable
- docker: reorder stages for maximum BuildKit cache reuse
- docker: split worker venv into shared + bundle-specific layers
- gateway: cache SDK version parse, integer audit latency, UUIDv7
- gateway: cut hot-path allocations, fuse numpy decode, tighten backpressure
- gateway: fuse msgpack_numpy decode into the response path
- gateway: move score-endpoint unwrap instead of cloning
- gateway: pass msgpack items through as rmpv::Value
- gateway: publish work items concurrently + borrow shared fields
- gateway: tighten cold-pool backpressure + cheaper QPS counter
- gateway: trim per-request work on the inference hot path
v0.2.0 (2026-04-17)
Highlights
- Breaking change: Removed `--model` CLI args from worker startup; use `SIE_PRELOAD_MODELS` env var or `--preload` flag instead
- New capabilities: add ModernBERT flash dense embedding support with fallback mechanism; add OCR quality benchmarks (olmOCR-bench); add OCR quality benchmarks with olmOCR-bench; add pages/sec throughput metric for OCR perf eval; add perf metrics to OCR eval pipeline; also report query throughput in mpix/s for image queries
- Reliability and operations: add missing NATS Helm repo to release workflow; don’t set NODE_AUTH_TOKEN for OIDC npm publishes; harden affinity spill with bounds check, clamp, and debug log; make rejected requests visible to KEDA scaling metrics; remove redundant tokenizer validation and unused template parameter
⚠ BREAKING CHANGES
- workers: Removed `--model` CLI args from worker startup; use `SIE_PRELOAD_MODELS` env var or `--preload` flag instead
Features
- adapters: add ModernBERT flash dense embedding support with fallback mechanism
- bench: add OCR quality benchmarks (olmOCR-bench)
- bench: add OCR quality benchmarks with olmOCR-bench
- bench: add pages/sec throughput metric for OCR perf eval
- bench: add perf metrics to OCR eval pipeline
- bench: also report query throughput in mpix/s for image queries
- benchmarks: add MTEB NFCorpus evaluation results for ModernBERT-based embedders
- bench: report vision corpus throughput in mpix/s
- bench: report vision corpus throughput in mpix/s instead of items/s
- deps: migrate from pynvml to nvidia-ml-py package
- haystack: add haystack_integrations namespace aliases
- haystack: add namespace-convention aliases
- observability: add anonymous usage telemetry
- sdk: add max_concurrency param to SIEAsyncClient to prevent connection pool exhaustion
- server: add lightonai/LightOnOCR-2-1B OCR adapter with next bundle
- workers: implement model preloading at startup to reduce first-request latency
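The `max_concurrency` entry above bounds the number of in-flight requests so that a burst of calls cannot exhaust the connection pool. The general technique can be sketched with `asyncio.Semaphore`. The helper `bounded_gather` below is hypothetical and is not how `SIEAsyncClient` is necessarily implemented.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")


async def bounded_gather(
    factories: List[Callable[[], Awaitable[T]]],
    max_concurrency: int,
) -> List[T]:
    """Run the given coroutine factories with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(factory: Callable[[], Awaitable[T]]) -> T:
        # Each task waits here until one of the permits is free.
        async with semaphore:
            return await factory()

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(run(factory) for factory in factories))
```

Passing factories rather than already-created coroutines keeps each request from starting until a permit is available.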
Bug Fixes
- adapters: remove redundant tokenizer validation and unused template parameter
- address PR review — panel title, namespace variable
- bench: handle unloaded images in pixel count computation
- bench: use concurrent async requests for OCR perf eval
- bench: validate image entries before computing pixel counts
- bench: validate pixel counts before using them for image corpus throughput
- build: downgrade dockerfile syntax version to 1 for broader compatibility
- ci: add missing NATS Helm repo to release workflow
- ci: don’t set NODE_AUTH_TOKEN for OIDC npm publishes
- dashboard: queue routing dashboard accuracy and usability
- docs: add update date to portfolio header
- docs: correct PR reference in reranker reclassification note
- docs: populate reranker data and simplify table header
- docs: update stale model counts after reranker reclassification
- haystack: rename namespace alias to sie
- install uv via curl instead of COPY --from ghcr.io
- preload smoke test checks model.loaded instead of nonexistent workers field
- readme: heading format
- release: add LanceDB integrations to release-please config
- router: add overflow spill to break model affinity deadlock
- router: harden affinity spill with bounds check, clamp, and debug log
- router: make rejected requests visible to KEDA scaling metrics
- sie-bench: account for in-flight drain in throughput calculation
- sie-bench: use union wall-clock for multiprocess throughput merge
- tester-cluster: patient KEDA scale-down for worker pools
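The sie-bench fix above merges throughput across processes using a union wall-clock. One reading of that phrase, sketched here as an assumption rather than as the tool's actual code, is to divide the total item count by the length of the union of the per-process time intervals, so that overlapping time is counted once. The function names and the `(start_s, end_s, items)` tuple shape are hypothetical.

```python
from typing import Iterable, List, Tuple


def union_wall_clock(intervals: Iterable[Tuple[float, float]]) -> float:
    """Total length covered by the union of (start, end) intervals."""
    total = 0.0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Disjoint from the running interval: close it and start a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping: extend the running interval.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def merged_throughput(runs: List[Tuple[float, float, int]]) -> float:
    """Items per second over the union wall-clock of all runs."""
    items = sum(count for _, _, count in runs)
    elapsed = union_wall_clock((start, end) for start, end, _ in runs)
    if elapsed <= 0:
        raise ValueError("runs cover no wall-clock time")
    return items / elapsed
```

For two processes that each handle 100 items over 0-10 s and 5-15 s, the union covers 15 s, so the merged figure is 200 / 15 items per second. Summing the two per-process rates would instead give 20 items per second.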
v0.1.10 (2026-04-09)
Highlights
- New capabilities: add async, chunking, and streaming to Weaviate document enricher; improve DLQ routing and score response handling; implement Config Management API with NATS-based distribution and review fixes; add LanceDB integration (Python + TypeScript); queue routing dashboard + NATS exporter + router image tag; queue routing dashboard, NATS prom exporter, router image tag
- Reliability and operations: correct cluster routing condition, stream max_age units, and reconnect state ordering; add recreate strategy for router deployment when nats config restore is enabled; restore Chart.yaml deps from main, keep appVersion v-prefix; queue routing dashboard PromQL for NATS wait; configurable NATS fetch budget, Helm-wired queue params
- Performance: decouple scanner and SIE batch sizes in enrich_table; stream enrich_table batch-by-batch instead of full materialization; use Lance scanner for column projection in enrich_table; bypass FastAPI for hot proxy paths via raw ASGI middleware
Features
- add async, chunking, and streaming to Weaviate document enricher
- dlq,pull-loop: improve DLQ routing and score response handling
- implement Config Management API with NATS-based distribution and review fixes
- integrations: add LanceDB integration (Python + TypeScript)
- observability: queue routing dashboard + NATS exporter + router image tag
- observability: queue routing dashboard, NATS prom exporter, router image tag
- sdk: add get_model() and configure LanceDB release workflows
- terraform: add AWS eval-eu EKS cluster with multi-GPU support
- terraform: add evaluation cluster setup for AWS with multi-GPU support and updated configurations
- terraform: add node labels, adjust pool sizes for tester cluster
Bug Fixes
- config,queue,nats: correct cluster routing condition, stream max_age units, and reconnect state ordering
- handle BytesIO images in LlamaIndex and validate Weaviate classify config
- helm: add recreate strategy for router deployment when nats config restore is enabled
- helm: restore Chart.yaml deps from main, keep appVersion v-prefix
- helm: use generic release-please updater for appVersion
- helm: use generic updater for both Chart.yaml version fields
- helm: use l4-spot/rtx6000-spot naming convention for spot profiles
- integrations: address CodeRabbit review findings for LanceDB PR
- observability: queue routing dashboard PromQL for NATS wait
- queue-routing: configurable NATS fetch budget, Helm-wired queue params
- queue-routing: resolve bugs, add configurable NATS params, fix score wire format
- queue-routing: score response format and DLQ fallback routing key
- release: use NPM_TOKEN for initial sie-lancedb publish
- router: use “scores” key in queue-mode score responses
- terraform: add GPU subnet coverage validation
- terraform: relax AZ validation and clarify defaults
- terraform: review fixes for tester cluster infra
- terraform: switch tester-cluster to us-east-2
- terraform: Switch tester-cluster to us-east-2 and update deployment docs
- terraform: validate gpu_node_groups for duplicate and reserved names
- test: add buildx builder pause recovery and improve build error diagnostics
- update adapter tests and address code review feedback
- use OCI registry URI for helm chart in README
Performance Improvements
- lancedb: decouple scanner and SIE batch sizes in enrich_table
- lancedb: stream enrich_table batch-by-batch instead of full materialization
- lancedb: use Lance scanner for column projection in enrich_table
- router: bypass FastAPI for hot proxy paths via raw ASGI middleware
- router: reduce thread pool pressure by inlining small deserialization
- router: remove msgpack_numpy global patch and BaseHTTPMiddleware
- router: replace stdlib json with orjson for 3-10x faster serialization
- sdk+router: lazy msgpack_numpy.patch and pure ASGI middleware
v0.1.9 (2026-04-02)
Highlights
- Reliability and operations: increase docker smoke test timeouts and add retry; include $platform in worker image tag format; revert pool names to machine profile names; remove --provenance flag (requires public repo)
Bug Fixes
- helm: include $platform in worker image tag format
- helm: revert pool names to machine profile names
- increase docker smoke test timeouts and add retry
- remove --provenance flag (requires public repo)
v0.1.8 (2026-04-01)
Highlights
- Reliability and operations: add sie-qdrant and sie-weaviate to release-please config; point sync-terraform default repos to production; remove --provenance from npm publish for private repo; correct image.tag comment to reflect actual format; remove duplicate platform suffix from worker image tag
Bug Fixes
- add sie-qdrant and sie-weaviate to release-please config
- ci: point sync-terraform default repos to production
- ci: remove --provenance from npm publish for private repo
- helm: correct image.tag comment to reflect actual format
- helm: remove duplicate platform suffix from worker image tag
- remove internal-only references from COMPATIBILITY.md
v0.1.7 (2026-04-01)
Highlights
- New capabilities: add profiling script for sparse encoding hot path; add GitHub Actions workflow to sync Terraform modules to registry repos; apply QoL improvements from PR #484 review comments; switch default GPU from g5 (A10G) to g6 (L4); add rerank/score support to TEI runner; implement configurable document length limits and custom prefix token registration
- Reliability and operations: restore triggering ref for source checkout; restore quality by enabling causal attention and QK-normalization; restore dev-l4-spot zones to us-central1 for GPU availability; check /metrics endpoint in test_prometheus_metrics_exist; add per-attempt timeout to lease renewal fetch
- Performance: optimize MoE expert dispatch with sorted-expert routing; batch MaxSim scoring across documents on GPU; batch sparse aggregation with segment_reduce and fuse relu; batch split_embeddings + validate ColBERT performance
Features
- adapters: add profiling script for sparse encoding hot path
- add GitHub Actions workflow to sync Terraform modules to registry repos
- apply QoL improvements from PR #484 review comments
- aws: switch default GPU from g5 (A10G) to g6 (L4)
- bench: add rerank/score support to TEI runner
- colbert: implement configurable document length limits and custom prefix token registration
- deploy: move namespace, SA, and HF token secret management to Helm chart
- deploy: prepare Terraform modules for public registry publishing
- deploy: rewrite example module sources to registry references
- deploy: rewrite Helm and internal references for public release
- deploy: two-artifact model — GCP Terraform infra-only, batteries-included Helm chart
- docker: add --docker-platform flag to docker build task
- extend create_pool API/SDK with minimum_worker_count and bundle
- helm: add batteries-included sub-chart dependencies to sie-cluster
- helm: add image pre-pull DaemonSet for GPU worker pools
- helm: add step to build Helm chart dependencies in Kind smoke tests
- helm: default router to image-embedded model configs
- helm: enable image pre-pull DaemonSet by default
- helm: port health gates from Terraform to Helm post-install hooks
- helm: remove prometheus alias, bump to v0.2.0, standardize chart
- infra: add Modal GPU sandbox for remote benchmark execution
- infra: add rollout warning and explicit image_type for GCFS
- infra: enable GCFS image streaming on GPU node pools
- infra: set min_node_count=1 on L4 spot GPU node pools
- integrations: add Qdrant integration
- integrations: add Qdrant integration with native sparse vector support
- integrations: add Weaviate v4 integration with Go module spec
- multiprocess loadtest + SDK aiohttp migration
- sdk: add version negotiation headers between SDK and server
- sdk: default wait_for_capacity=True and timeout=900s
- sdk: version negotiation header (SDK ↔ server)
- sie-bench: add dataset/input_type fields for mTEB corpus inputs
- sie-bench: built-in multiprocess loadtest mode
- skills: add eval-model skill for HF model assessment
- skills: add eval-model skill for HF model integration assessment
- sync Terraform modules to registry repos
- tei-runner: add /embed_sparse support for sparse models
- tei: add /embed_sparse support and auto-detect pooling mode
- terraform/aws: restore cluster autoscaler helm release to infra module
- terraform/aws: strip k8s resources, restructure as infra module with examples
- terraform: add cluster name and artifact registry variables; update node pool configuration
- terraform: add EBS CSI driver, NVIDIA device plugin, default StorageClass
- terraform: strip gcp k8s/ layer; examples use infra-only module
- tools: add ColBERT query vs document profiling script
- tools: add dense P50 latency profiling script
Bug Fixes
- adapters: sort IDF unique_ids to satisfy SparseVector contract
- add missing production example to tf validate; fix tempfile leak; remove module docstring
- address PR #478 review feedback
- address PR review — GPU alert formula, kubectl parsing, CI path filter
- address review feedback for npm publish
- address review findings - race prevention, cleanup, lighter checkout
- alloy: add stage.cri{} before stage.json to unwrap CRI log envelopes
- alloy: explicitly set configMap name and key for sub-chart wiring
- alloy: scope pod discovery to current node via field selector
- bench: complete g5 to g6 migration in AWS eval configs and GPU mapping
- benchmarks: use TEI /embed_all for ColBERT multi-vector models
- bench: skip loading candidates_model for single-model servers
- chart: update home URL and Helm install command in README
- CI compatibility and consistent env var usage
- CI compatibility for sync-terraform workflow
- ci: add contents: read permission to publish-pypi-oidc job
- ci: add helm repo add + dep build to kind-smoke workflow
- ci: g5 refactored to g6 already
- cluster: build concrete helm command in status from infra_outputs
- cluster: guard helm/kubectl post-create log when outputs are empty
- deploy: clean terraform init artifacts before push
- deploy: correct smoke test TypedDict access and helm dry-run args
- deploy: correct StatefulSet rollout semantics, PDB scope, and KEDA pause
- deploy: remove dangling kubernetes_namespace_v1.sie references from health_gates.tf
- deploy: restore triggering ref for source checkout
- deploy: update default destination repos for GCP and AWS modules
- deploy: use triggering ref for source checkout in sync-terraform
- disable LoRA adapter layers after loading to prevent quality corruption
- docs: clarify optional image push in AWS and GCP README files
- docs: update Helm chart path in AWS and GCP README files
- fix integration test
- helm,hook: deploy/helm/sie-cluster/templates/hooks/prometheus-ready-test.yaml
- helm: add before-hook-creation to Job delete policies; document count==0 expectation
- helm: address coderabbit findings on health gate hooks
- helm: address non-blocking review findings from PR #336
- helm: address review findings in batteries-included sub-chart PR
- helm: address reviewer suggestions for health gate hooks
- helm: address second-pass review findings
- helm: aggregate buckets by le in p95 latency alert
- helm: bump chart version to 0.1.1 (patch, not minor)
- helm: clarify prometheusAddress comment — ignored when sub-chart is installed
- helm: correct kube-prometheus-stack semver constraint and remove hardcoded grafana password
- helm: correct misleading validation comment in router-deployment.yaml
- helm: don’t emit ScaledObject CRDs unless KEDA is confirmed present
- helm: downscope KEDA RBAC to Role/RoleBinding; remove runtime apk installs
- helm: fix loki service URL and extract alloy config to file
- helm: fix three blocking review issues in sub-chart dependencies
- helm: improve temporary values file handling in helm_template function
- helm: improve, simplify, and modularize sie-cluster chart
- helm: move ‘app.kubernetes.io/part-of’ label to selector labels for consistency
- helm: remove autoscaling.enabled from values-aws.yaml
- helm: render KEDA ScaledObjects via post-install hook to avoid CRD chicken-and-egg
- helm: replace hardcoded namespace in provisioning alert rules
- helm: require non-empty hfToken.value when hfToken.create is true
- helm: sub-chart naming, Loki compactor, event exporter ECR, Grafana folders
- helm: use autoscaling.prometheusAddress in prometheus hook; remove stub health_gates.tf
- helm: use full FQDN for Prometheus service in KEDA and health gates
- helm: use router.service.port in NOTES.txt instead of hardcoded 8080
- infra: update min_node_count default in top-level GCP module
- normalize SDK version warned-set key to major.minor
- pool error types, add pool/progress test coverage
- profiling: add flash variant registry, device validation, top-level import
- profiling: sync GPU before tensor timing, move script to tools/
- profiling: use in-place relu_ to match production code path
- qwen3: restore quality by enabling causal attention and QK-normalization
- readme: correct helm chart path
- readme: correct helm install command
- release: track all package versions via release-please extra-files
- release: track TS SDK version.ts via release-please
- replace corrupted bge-m3 NanoFiQA2018 target + set bfloat16 precision
- review items
- router: increase pool lease TTL to survive rolling upgrades
- router: resolve default pool GPU for scale-up when gpu/pool omitted
- router: use effective_pool instead of pool_name for default pool GPU extraction
- sdk: defer aiohttp session creation to fix “no running event loop” in SIEAsyncClient
- sie_bench: improve --print-gap report accuracy and readability
- tei-runner: validate /embed_all returns per-token embeddings
- tei-runner: validate output_type in TEIRunner init
- terraform/aws: add full -backend-config flags to production init command
- terraform/aws: add precondition asserting >=2 GPU-capable AZs exist
- terraform/aws: address review findings post-restructure
- terraform/aws: correct helm chart path in dev-g5-spot example comment
- terraform/aws: filter VPC AZs to only zones offering the GPU instance type
- terraform/aws: fix invalid splat on instance type offerings locations
- terraform/aws: remove provider aws block from child module
- terraform/aws: use var.project_name in VPC subnet cluster tags
- terraform: add validation for GPU node pool zones to ensure they match the configured region
- terraform: restore dev-l4-spot zones to us-central1 for GPU availability
- terraform: update GPU instance type description for clarity and add dev-g6-spot example
- terraform: update stale k8s module references in comments
- terraform: upgrade AWS modules and fix deprecations
- test: check /metrics endpoint in test_prometheus_metrics_exist
- test: update EKS tests from g5 to g6 after GPU instance type change
- ts-sdk: add per-attempt timeout to lease renewal fetch
- use prepack instead of prepublishOnly
- validate minimum_worker_count input and soften docstrings
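The "normalize SDK version warned-set key to major.minor" fix above suggests that version-mismatch warnings are deduplicated per minor release rather than per patch release. A minimal sketch of that idea follows. The helper names `major_minor` and `should_warn` are hypothetical and not taken from the SDK, and the real warning logic may differ.

```python
# Hypothetical sketch: deduplicate version warnings by major.minor.
# None of these names come from the SIE SDK.

_warned: set[str] = set()


def major_minor(version: str) -> str:
    """Reduce a version such as '0.1.7' or 'v0.1.7' to '0.1'."""
    parts = version.strip().lstrip("v").split(".")
    if len(parts) < 2 or not (parts[0].isdigit() and parts[1].isdigit()):
        raise ValueError(f"unrecognized version string: {version!r}")
    return f"{parts[0]}.{parts[1]}"


def should_warn(server_version: str) -> bool:
    """Return True only the first time a given major.minor is seen."""
    key = major_minor(server_version)
    if key in _warned:
        return False
    _warned.add(key)
    return True
```

With this keying, a second patch release in the same minor series (for example `0.1.6` followed by `0.1.7`) does not trigger a repeat warning, while a new minor series does.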
Performance Improvements
- adapter: optimize MoE expert dispatch with sorted-expert routing
- adapters: batch MaxSim scoring across documents on GPU
- adapters: batch sparse aggregation with segment_reduce and fuse relu
- adapters: batch split_embeddings + validate ColBERT performance
- adapters: batch split_embeddings in ColBERT adapters
- adapters: eliminate GPU overhead from IDF query encode path
- florence2: greedy decoding for OCR (-23% P50)
- florence2: switch OCR configs from beam search to greedy decoding
- server,bench: add batch coalescing, query warmup, and benchmark stability improvements
- server: dispatch immediately when worker is idle
- server: optimize BertFlashAdapter inference path (+35% corpus throughput)
- server: reduce batch wait timeout 10ms → 2ms for lower Doc P50
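For readers unfamiliar with the MaxSim scoring named in the adapters entries above, here is a minimal pure-Python sketch of the late-interaction score used by ColBERT-style models. Each query token takes its best dot-product match among the document tokens, and those maxima are summed. The function names and toy vectors are illustrative assumptions. The release batches this computation across documents on GPU, which this sketch does not attempt.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the max similarity to any document token."""
    if not query_tokens or not doc_tokens:
        return 0.0
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def maxsim_scores(
    query_tokens: Sequence[Vector], docs: Sequence[Sequence[Vector]]
) -> list[float]:
    """Score one query against several documents (a loop here, not a GPU batch)."""
    return [maxsim_score(query_tokens, doc) for doc in docs]
```

A batched implementation would typically pad the documents to a common length and compute all similarities in one tensor operation instead of looping per document.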
v0.1.6 (2026-03-12)
Highlights
- Breaking change: remove `florence2` and `gliner` standalone bundles — extraction adapters (gliner, glirel, gliclass) are now included in the `default` bundle
- New capabilities: add native MTEB reranking task support with MRR metric; encode-dense matrix eval — 3 models × 8 tasks; add date-prefixed versioning for chronological filename ordering; reorganize and expand model size lookup table with alphabetical ordering; add detailed perf metrics, metric filter, and threshold selector; add marimo benchmark dashboard notebook
- Reliability and operations: restore /var/cache/apt mounts, keep /var/lib/apt removed; add HF_TOKEN auth and config kwargs and fix stella models; add dense projection support to Qwen2FlashAdapter; apply query_template from runtime options in SentenceTransformerDenseAdapter; replace pip install with uv add in docker error messages
- Performance: vectorize GTE sparse encode path; vectorize tokenization and packing for gte-multilingual-base; vectorize tokenization and packing to reduce throughput gap; switch Qwen/GTE models to flash attention adapter
Breaking Changes
- bundles: remove `florence2` and `gliner` standalone bundles — extraction adapters (gliner, glirel, gliclass) are now included in the `default` bundle
Features
- bench: add native MTEB reranking task support with MRR metric
- bench: encode-dense matrix eval — 3 models × 8 tasks
- benchmarks: add date-prefixed versioning for chronological filename ordering
- benchmarks: reorganize and expand model size lookup table with alphabetical ordering
- benchview: add detailed perf metrics, metric filter, and threshold selector
- benchview: add marimo benchmark dashboard notebook
- benchview: add perf metric selector to Model Size tab
- benchview: detailed perf metrics, metric filter, threshold selector
- router,bench,sdk: improve throughput with inflight tracking, batching, and connection pooling
- server: typed request parsing with msgspec
- sie_server: add gliner, glirel, and gliclass extraction dependencies to the default bundle
Bug Fixes
- adapter: add dense projection support to Qwen2FlashAdapter
- adapter: apply query_template from runtime options in SentenceTransformerDenseAdapter
- apply CodeRabbit auto-fixes
- bench: replace pip install with uv add in docker error messages
- benchview: add missing statistics import and use _median helper
- bundles: include sglang bundle in default cluster and eval-matrix configs
- client: update websocket header parameter name from extra_headers to additional_headers
- colbert: enable native mode fallback for non-CUDA devices and add Matryoshka truncation
- deps: cap timm upper bound and fix lazy handler init
- docker: clear stale apt lists before update to prevent 404s
- docker: remove all apt cache mounts from Dockerfiles
- docker: remove no-op /var/lib/apt cache mount from apt RUN blocks
- docker: restore /var/cache/apt mounts, keep /var/lib/apt removed
- helm: increase CPU worker pool memory limits for expanded default bundle
- model: add missing query_template to stella_en_400M_v5
- models: switch all-MiniLM-L6-v2 to SentenceTransformerDenseAdapter
- multilingual-e5-large-instruct: use instruct query template, NFCorpus 0.3521 → 0.3567
- replace invalid HTML entities in SVG with XML numeric entities
- rope_flash: clear cached _rope_dummy on unload and use torch.cat for packing
- router: also resolve pool-derived GPU names to spot variants
- router: resolve bare GPU types to spot variants for KEDA scaling
- sdk: resolve sync/async client inconsistencies in score() and encode()
- server: centralize request validation to prevent 500s from malformed items
- server: use BertFlashAdapter for e5-small-v2, resolve e5 perf anomalies
- server: use BertFlashAdapter for intfloat/e5-small-v2 and remove stale benchmarks
- set compute_precision to bfloat16 for stella_en_1.5B_v5
- sie_server: add HF_TOKEN auth and config kwargs and fix stella models
- sie_server: resolve BGE-M3 linear weights loading for HF model IDs and fix test fixtures
- sie_server: support NV-Embed-v2 with PyTorch embedding adapter
- splade: align special token filtering and guard empty batches
- test: always rebuild Docker images to pick up code changes
- typecheck: move ty type checker from mise tool to uv dependency
Performance Improvements
- adapters: vectorize GTE sparse encode path
- rope_flash: vectorize tokenization and packing for gte-multilingual-base
- rope_flash: vectorize tokenization and packing to reduce throughput gap
- server: switch Qwen/GTE models to flash attention adapter
- splade: vectorize tokenization and sparse aggregation (1.5x throughput)
- splade: vectorize tokenization and sparse aggregation in SPLADEFlashAdapter
v0.1.5 (2026-02-27)
Highlights
- New capabilities: add GLiNER v2.5 model configs; stream request bodies through proxy instead of buffering; add classification model configs for GLiClass-large and cross-encoder NLI
- Reliability and operations: release pipeline cache collision and smoke test timeout; strip content-length header from streamed proxy responses
- Performance: stream response body to eliminate bytes.join bottleneck
Features
- models: add GLiNER v2.5 model configs
- router: stream request bodies through proxy instead of buffering
- sie_server: add classification model configs for GLiClass-large and cross-encoder NLI
Bug Fixes
- release pipeline cache collision and smoke test timeout
- router: strip content-length header from streamed proxy responses
Performance Improvements
- router: stream response body to eliminate bytes.join bottleneck
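The router streaming change above can be illustrated by contrasting two ways of relaying an upstream body. This is a sketch under stated assumptions, not the router's actual code: `upstream_chunks`, `relay_buffered`, and `relay_streamed` are hypothetical names, and the `send` callback stands in for whatever the ASGI server provides.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def upstream_chunks() -> AsyncIterator[bytes]:
    """Stand-in for a worker response arriving in pieces."""
    for chunk in (b"alpha", b"beta", b"gamma"):
        await asyncio.sleep(0)  # yield control, as real network I/O would
        yield chunk


async def relay_buffered(body: AsyncIterator[bytes]) -> bytes:
    """Buffering approach: collect every chunk, then join into one payload."""
    parts = [chunk async for chunk in body]
    return b"".join(parts)


async def relay_streamed(
    body: AsyncIterator[bytes], send: Callable[[bytes], Awaitable[None]]
) -> int:
    """Streaming approach: forward each chunk as it arrives, no join step."""
    total = 0
    async for chunk in body:
        await send(chunk)
        total += len(chunk)
    return total
```

The streamed variant never holds the whole body in memory and skips the final copy that `bytes.join` performs. The trade-off is that the total length is not known up front, which fits with the v0.1.5 fix that strips the content-length header from streamed proxy responses.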
v0.1.4 (2026-02-27)
Highlights
- Reliability and operations: revert sharing=locked, add cache-read-only for build step
Bug Fixes
- revert sharing=locked, add cache-read-only for build step
v0.1.3 (2026-02-26)
Highlights
- Reliability and operations: revert token lifetime extension, re-auth before push instead; revert token lifetime, re-auth before push
Bug Fixes
- revert token lifetime extension, re-auth before push instead
- revert token lifetime, re-auth before push
v0.1.2 (2026-02-26)
Highlights
- Reliability and operations: release image builds failing from GCP token expiry
Bug Fixes
- release image builds failing from GCP token expiry
v0.1.1 (2026-02-26)
Highlights
- Reliability and operations: update bundle definitions to replace legacy and gte-qwen2 with gliner
Bug Fixes
- bundles: update bundle definitions to replace legacy and gte-qwen2 with gliner
v0.1.0 (2026-02-26)
Highlights
- Breaking change: HTTP 409 dependency conflict responses are removed from all API endpoints; the DEPENDENCY_CONFLICT error code no longer exists; .beads/ issue tracking data removed from repository
- New capabilities: add X-SIE-Worker response header for per-worker metrics tracking; add encode-image-text measurements to benchmarks dir; add encode-multivector perf measurements; add encode-multivector performance measurements; add encode-visual-document perf measurements; add encode-visual-document performance measurements
- Reliability and operations: increase helm install timeout from 10m to 15m; add trailing empty line to gitignore; align release-images workflow with docker task flags; register GLiClass and DeBERTa models in bundles; build and deploy gliner bundle in Kind smoke tests
- Performance: add connection pooling load test results (Feb 24); pool httpx client and add X-SIE-Worker header in router proxy; pool httpx client in router proxy to eliminate per-request TCP overhead; move transformers imports to module level
⚠ BREAKING CHANGES
- deps: HTTP 409 dependency conflict responses are removed from all API endpoints; the DEPENDENCY_CONFLICT error code no longer exists
- .beads/ issue tracking data removed from repository
- deps: model config files no longer support the `dependencies` field
Features
- add X-SIE-Worker response header for per-worker metrics tracking
- benchmarks: add encode-image-text measurements to benchmarks dir
- benchmarks: add encode-multivector perf measurements
- benchmarks: add encode-multivector performance measurements
- benchmarks: add encode-visual-document perf measurements
- benchmarks: add encode-visual-document performance measurements
- benchmarks: add extract-detection L4-SPOT performance measurements
- benchmarks: add extract-kie-docvqa measurements to benchmarks dir
- benchmarks: add extract-relation L4-SPOT performance measurement
- benchmarks: add score-colbert perf measurements
- benchmarks: add score-colbert performance measurements
- models: add encode-image-text measurements
- models: add extract-detection measurements
- models: add extract-kie-docvqa measurements
- models: add extract-relation measurements
- router: add structured audit logging for API requests
Bug Fixes
- .claude: add trailing empty line to gitignore
- align release-images workflow with docker task flags
- bundles: register GLiClass and DeBERTa models in bundles
- ci: build and deploy gliner bundle in Kind smoke tests
- colbert: remove CUDA requirement and improve device compatibility
- eval: read ‘sie_id’ instead of ‘name’ from model configs in runner
- extract: use dict access for Entity TypedDict in sort
- gliner: relax stale transformers<4.52 pin
- increase helm install timeout from 10m to 15m
- reduce cpu-gliner resource requests for Kind CI
- router: read ‘sie_id’ instead of ‘name’ from model configs
- server: migrate NLI adapter to classifications and improve API consistency
- server: migrate nli_classification adapter and improve type annotations
- server: populate classifications instead of entities in GLiClass adapter
- use manifest mode for release-please and reset to v0.0.0
- use nested .gitignore for .claude/ directory
Performance Improvements
- add connection pooling load test results (Feb 24)
- pool httpx client and add X-SIE-Worker header in router proxy
- pool httpx client in router proxy to eliminate per-request TCP overhead
- pytorch-embedding: move transformers imports to module level
- server: use uvloop as default event loop for uvicorn
Reverts
- keep CONTRIBUTING.md clone URLs pointing to sie.git
Miscellaneous Chores
- remove beads, agent prompts, mypy refs; consolidate ty config
Code Refactoring
- deps: move adapter dependencies from per-adapter pyproject.toml to bundle YAML
- deps: remove model-level dependencies feature