How Does SIE Compare to Infinity?
SIE (Superlinked Inference Engine) and Infinity are both open-source servers for self-hosting text embedding and reranking models. Infinity is a lightweight, fast single-model server with a focus on OpenAI-compatible API endpoints. SIE is a broader inference platform with multi-model support, LoRA hot-loading, GPU cluster management via Terraform and Helm, and first-class support for document processing workloads.
Quick comparison
| | SIE | Infinity |
|---|---|---|
| Model types | Embeddings, rerankers, OCR, extraction | Embeddings, rerankers, CLIP |
| Multi-model per deployment | ✓ (shared GPU cluster) | Limited (one model per instance typical) |
| LoRA hot-loading | ✓ | ✗ |
| GPU cluster (Terraform + Helm) | ✓ | Manual |
| AWS / GCP Terraform modules | ✓ | ✗ |
| SDK | ✓ (sie-sdk) | OpenAI-compatible REST |
| OpenAI-compatible API | ✓ | ✓ (primary design goal) |
| Dynamic batching | ✓ | ✓ |
| INT8 / quantisation | ✓ | ✓ |
| Licence | Apache 2.0 | MIT |
| Backed by | Superlinked | Michael Feil (open source) |
What is Infinity?
Infinity is a high-throughput embedding inference server created by Michael Feil. Its primary design goals are:
- OpenAI API compatibility: drop-in replacement for OpenAI’s embedding endpoint, making it easy to swap without changing client code
- Speed: aggressive batching, CUDA optimisations, and Flash Attention for high throughput
- Simplicity: minimal configuration, designed to be started with a single Docker command

    docker run michaelf34/infinity:latest \
      v2 --model-name-or-path BAAI/bge-m3 --port 7997

It’s a strong choice for teams that need a quick self-hosted replacement for the OpenAI embeddings API.
When should you use Infinity?
Infinity is a good fit when:
- You want an OpenAI API drop-in: your existing code uses openai.embeddings.create() and you want to swap to self-hosted without changing client code
- You need a single model served simply and quickly
- Your team prefers minimal configuration over infrastructure tooling
- You’re deploying on existing infrastructure and don’t need Terraform/Helm automation
When should you use SIE?
SIE is the better choice when:
- You need multiple models in one deployment (embedding + reranker + OCR)
- You want LoRA adapter hot-loading: swap domain-specific adapters per-request without server restart
- You’re deploying on AWS or GCP and want managed Terraform modules for the full cluster
- You need document processing capabilities (OCR, extraction) alongside embeddings
- You want a production-grade SDK rather than raw HTTP calls
- You need SOC2 Type 2 certified infrastructure
- You want built-in monitoring and GPU utilisation metrics
Performance comparison
Both servers implement dynamic batching and CUDA-optimised inference. For single-model, single-GPU benchmarks, Infinity and SIE achieve comparable throughput. Both are bottlenecked by the GPU, not the server layer.
The performance difference emerges at scale:
- Multi-model workloads: SIE’s shared GPU memory pool is more efficient than running separate Infinity instances per model
- Cluster scale: SIE’s auto-scaling handles traffic spikes; Infinity requires manual scaling
- Concurrent mixed workloads: encoding + reranking in the same pipeline benefits from SIE’s coordinated batching
See the SIE vs TEI vs OpenAI benchmark for detailed throughput and cost data.
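To make the dynamic batching mentioned above concrete, here is a toy sketch of the general technique: requests that arrive within a short window are grouped into one model call. It is not the implementation used by SIE or Infinity. The class name `DynamicBatcher`, the parameters `max_batch_size` and `max_wait_s`, and the `fake_encode` stand-in for a GPU forward pass are all hypothetical.

```python
import queue
import threading
import time
from concurrent.futures import Future


class DynamicBatcher:
    """Toy dynamic batcher (illustrative only, not SIE's or Infinity's code)."""

    def __init__(self, encode_batch, max_batch_size=32, max_wait_s=0.01):
        self._encode_batch = encode_batch      # callable: list[str] -> list of vectors
        self._max_batch_size = max_batch_size  # hypothetical tuning knob
        self._max_wait_s = max_wait_s          # hypothetical tuning knob
        self._queue = queue.Queue()
        self.batch_sizes = []                  # recorded so batching can be inspected
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, text):
        """Queue one text and return a Future that resolves to its vector."""
        future = Future()
        self._queue.put((text, future))
        return future

    def _run(self):
        while True:
            batch = [self._queue.get()]  # block until the first request arrives
            deadline = time.monotonic() + self._max_wait_s
            # Keep collecting until the batch is full or the wait window closes.
            while len(batch) < self._max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self._queue.get(timeout=remaining))
                except queue.Empty:
                    break
            self.batch_sizes.append(len(batch))
            try:
                outputs = self._encode_batch([text for text, _ in batch])
                for (_, future), output in zip(batch, outputs):
                    future.set_result(output)
            except Exception as exc:  # propagate model errors to every caller
                for _, future in batch:
                    future.set_exception(exc)


def fake_encode(texts):
    """Stand-in for a model call: one 1-d 'vector' per text."""
    return [[float(len(t))] for t in texts]


batcher = DynamicBatcher(fake_encode, max_batch_size=4, max_wait_s=0.05)
futures = [batcher.submit(t) for t in ["a", "bb", "ccc"]]
vectors = [f.result(timeout=1) for f in futures]
```

The trade-off sits in the two knobs: a longer wait window yields larger batches and better GPU utilisation, at the cost of added latency for the first request in each batch.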
Migration path: Infinity → SIE
If you’re using Infinity and want to move to SIE, the transition is straightforward. SIE exposes an OpenAI-compatible endpoint, so client code changes are minimal:

    # Before (Infinity or OpenAI)
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:7997", api_key="dummy")
    response = client.embeddings.create(model="BAAI/bge-m3", input=texts)
    vectors = [e.embedding for e in response.data]

    # After (SIE SDK: more features, same data)
    from sie_sdk import SIEClient
    from sie_sdk.types import Item

    client = SIEClient("http://localhost:8080")
    results = client.encode("BAAI/bge-m3", [Item(text=t) for t in texts])
    vectors = [r["dense"] for r in results]

Or keep using the OpenAI-compatible REST endpoint with the same client code; just update the base_url.
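Before switching traffic, it can be worth checking that both servers return equivalent vectors for the same inputs. The following is a minimal sketch of such a check in plain Python. The helper names and the 0.99 cosine threshold are assumptions made for illustration and come from neither project. Small numeric differences between servers are plausible (for example from different precision settings), so the check compares direction rather than exact values.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def find_mismatches(old_vectors, new_vectors, threshold=0.99):
    """Return indices whose old/new vectors fall below the (assumed) threshold."""
    if len(old_vectors) != len(new_vectors):
        raise ValueError("both servers must return one vector per input")
    return [
        i
        for i, (old, new) in enumerate(zip(old_vectors, new_vectors))
        if cosine_similarity(old, new) < threshold
    ]


# Example with hand-written vectors: the first pair differs only in scale,
# the second pair points in a noticeably different direction.
old = [[1.0, 0.0], [0.6, 0.8]]
new = [[2.0, 0.0], [0.8, 0.6]]
mismatches = find_mismatches(old, new)
```

In practice, `old` would hold the vectors returned by the Infinity client and `new` the vectors returned by SIE for the same `texts`.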
SIE vs Infinity vs TEI summary
| Use case | Recommended |
|---|---|
| Quick OpenAI drop-in, single model | Infinity |
| Single model, HuggingFace ecosystem | TEI |
| Production, multi-model, AWS/GCP | SIE |
| LoRA domain adaptation | SIE |
| Document processing + embeddings | SIE |
| Minimal devops, just need it working | Infinity or TEI |
Frequently asked questions
Is Infinity actively maintained?
Yes. Infinity is actively developed and has a growing community. It’s a legitimate production choice for single-model embedding serving.
Does SIE support the OpenAI embeddings API format?
Yes. SIE exposes an OpenAI-compatible /v1/embeddings endpoint, so you can use it as a drop-in replacement without changing OpenAI client code.
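For clients that do not use the OpenAI library, the request can also be built by hand. The sketch below uses only the Python standard library and follows the OpenAI embeddings wire format (a JSON body with `model` and `input`, and a response whose `data` items carry `index` and `embedding`). The helper names are hypothetical. The port 8080 base URL mirrors the SDK example in the migration section, and the placeholder `dummy` API key is an assumption; check your deployment's actual address and authentication requirements.

```python
import json
import urllib.request


def build_embeddings_request(base_url, model, texts, api_key="dummy"):
    """Build (but do not send) a POST request in the OpenAI embeddings format."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # placeholder key (assumption)
        },
        method="POST",
    )


def parse_embeddings_response(payload):
    """Extract vectors from an OpenAI-format response, ordered by input index."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


request = build_embeddings_request("http://localhost:8080", "BAAI/bge-m3", ["hello"])

# To send it against a running server:
# with urllib.request.urlopen(request) as resp:
#     vectors = parse_embeddings_response(json.load(resp))
```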
Can I run SIE and Infinity in the same pipeline?
In theory yes, but in practice you’d choose one. Both solve the same problem: self-hosted GPU inference for embedding models.