
How Does SIE Compare to Infinity?

SIE (Superlinked Inference Engine) and Infinity are both open-source servers for self-hosting text embedding and reranking models. Infinity is a lightweight, fast single-model server with a focus on OpenAI-compatible API endpoints. SIE is a broader inference platform with multi-model support, LoRA hot-loading, GPU cluster management via Terraform and Helm, and first-class support for document processing workloads.

Quick comparison

Feature	SIE	Infinity
Model types	Embeddings, rerankers, OCR, extraction	Embeddings, rerankers, CLIP
Multi-model per deployment	✓ (shared GPU cluster)	Limited (typically one model per instance)
LoRA hot-loading	✓	✗
GPU cluster (Terraform + Helm)	✓	Manual
AWS / GCP Terraform modules	✓	✗
SDK	✓ (`sie-sdk`)	OpenAI-compatible REST
OpenAI-compatible API	✓	✓ (primary design goal)
Dynamic batching	✓	✓
INT8 / quantisation	✓	✓
Licence	Apache 2.0	MIT
Backed by	Superlinked	Michael Feil (open source)

What is Infinity?

Infinity is a high-throughput embedding inference server created by Michael Feil. Its primary design goals are:

OpenAI API compatibility: drop-in replacement for OpenAI’s embedding endpoint, making it easy to swap without changing client code
Speed: aggressive batching, CUDA optimisations, and Flash Attention for high throughput
Simplicity: minimal configuration, designed to be started with a single Docker command

docker run --gpus all -p 7997:7997 michaelf34/infinity:latest \
  v2 --model-id BAAI/bge-m3 --port 7997

It’s a strong choice for teams that need a quick self-hosted replacement for the OpenAI embeddings API.

When should you use Infinity?

Infinity is a good fit when:

You want an OpenAI API drop-in: your existing code uses openai.embeddings.create() and you want to swap to self-hosted without changing client code
You need a single model served simply and quickly
Your team prefers minimal configuration over infrastructure tooling
You’re deploying on existing infrastructure and don’t need Terraform/Helm automation

When should you use SIE?

SIE is the better choice when:

You need multiple models in one deployment (embedding + reranker + OCR)
You want LoRA adapter hot-loading: swap domain-specific adapters per request without restarting the server
You’re deploying on AWS or GCP and want managed Terraform modules for the full cluster
You need document processing capabilities (OCR, extraction) alongside embeddings
You want a production-grade SDK rather than raw HTTP calls
You need SOC2 Type 2 certified infrastructure
You want built-in monitoring and GPU utilisation metrics

Performance comparison

Both servers implement dynamic batching and CUDA-optimised inference. In single-model, single-GPU benchmarks, Infinity and SIE achieve comparable throughput, because both are bottlenecked by the GPU rather than the server layer.
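To make the term concrete, here is a minimal sketch of dynamic batching in general. It is not taken from either server's code: the `Request` type, the `form_batches` helper, and the `max_batch_size` and `max_wait_ms` parameters are all assumptions made for illustration. A batch closes when it is full, or when the next request arrives too long after the batch's first request.

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival_ms: float  # when the request reached the server
    text: str          # payload to embed


def form_batches(requests, max_batch_size=4, max_wait_ms=5.0):
    """Group requests into batches (illustrative, not a real server's logic).

    A batch is closed when it already holds max_batch_size requests, or when
    the next request arrives more than max_wait_ms after the batch's first.
    """
    batches = []
    current = []
    for req in sorted(requests, key=lambda r: r.arrival_ms):
        batch_full = len(current) >= max_batch_size
        waited_too_long = bool(current) and (
            req.arrival_ms - current[0].arrival_ms > max_wait_ms
        )
        if batch_full or waited_too_long:
            batches.append(current)
            current = []
        current.append(req)
    if current:
        batches.append(current)
    return batches


arrivals = [0, 1, 2, 3, 4, 20]
demo = form_batches([Request(t, f"doc {i}") for i, t in enumerate(arrivals)])
print([len(batch) for batch in demo])  # [4, 1, 1]
```

Larger batches keep the GPU busier per forward pass, while the wait limit caps the latency added to the first request in each batch.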

The performance difference emerges at scale:

Multi-model workloads: SIE’s shared GPU memory pool is more efficient than running separate Infinity instances per model
Cluster scale: SIE’s auto-scaling handles traffic spikes; Infinity requires manual scaling
Concurrent mixed workloads: encoding + reranking in the same pipeline benefits from SIE’s coordinated batching
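For readers unfamiliar with the mixed workload in the last point, the sketch below shows the shape of an encode-then-rerank pipeline. It is only an illustration: `embed` and `rerank_score` are toy stand-ins for real models, and the names `search`, `top_k`, and `top_n` are hypothetical, not part of SIE or Infinity.

```python
import math


def embed(text):
    """Stand-in encoder: a letter-count vector (hypothetical, not a real model)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank_score(query, doc):
    """Stand-in reranker: fraction of query words that appear in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


def search(query, docs, top_k=3, top_n=1):
    """Stage 1: shortlist top_k docs by embedding similarity.
    Stage 2: reorder the shortlist with the reranker and keep top_n."""
    q_vec = embed(query)
    shortlist = sorted(docs, key=lambda d: cosine(q_vec, embed(d)), reverse=True)[:top_k]
    return sorted(shortlist, key=lambda d: rerank_score(query, d), reverse=True)[:top_n]


docs = [
    "gpu inference server",
    "cooking pasta at home",
    "server for gpu embedding inference",
    "weekend hiking routes",
]
print(search("gpu embedding", docs, top_k=3, top_n=1))
```

In a real deployment both stages are model calls that compete for the same GPU.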

See the SIE vs TEI vs OpenAI benchmark for detailed throughput and cost data.

Migration path: Infinity → SIE

If you’re using Infinity and want to move to SIE, the transition is straightforward. SIE exposes an OpenAI-compatible endpoint, so client code changes are minimal:

# Before (Infinity or OpenAI)
from openai import OpenAI

texts = ["first document", "second document"]  # example inputs

client = OpenAI(base_url="http://localhost:7997", api_key="dummy")
response = client.embeddings.create(model="BAAI/bge-m3", input=texts)
vectors = [e.embedding for e in response.data]

# After (SIE SDK — more features, same data)
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")
results = client.encode("BAAI/bge-m3", [Item(text=t) for t in texts])
vectors = [r["dense"] for r in results]

Alternatively, keep your existing OpenAI client code and only update base_url so that it points at SIE's OpenAI-compatible REST endpoint.
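As a rough sketch of what the OpenAI-compatible route looks like on the wire, the snippet below builds (but does not send) an embeddings request using only the standard library. The request and response shapes follow the OpenAI embeddings API. The base URL `http://localhost:8080/v1` is an assumption that combines the port from the SDK example with the `/v1/embeddings` path, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_embeddings_request(base_url, model, texts, api_key="dummy"):
    """Build a POST request in the OpenAI embeddings format (not sent here)."""
    url = base_url.rstrip("/") + "/embeddings"
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def parse_embeddings_response(payload):
    """Extract vectors from an OpenAI-style response, ordered by input index."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


# Assumed address: adjust host, port, and prefix to match your deployment.
request = build_embeddings_request(
    "http://localhost:8080/v1", "BAAI/bge-m3", ["first document"]
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` and decoding the JSON body would yield a payload that `parse_embeddings_response` can consume, provided the server follows the OpenAI response format.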

SIE vs Infinity vs TEI summary

Use case	Recommended
Quick OpenAI drop-in, single model	Infinity
Single model, Hugging Face ecosystem	TEI (Text Embeddings Inference)
Production, multi-model, AWS/GCP	SIE
LoRA domain adaptation	SIE
Document processing + embeddings	SIE
Minimal DevOps, just need it working	Infinity or TEI

Frequently asked questions

Is Infinity actively maintained?

Yes. Infinity is actively developed and has a growing community. It’s a legitimate production choice for single-model embedding serving.

Does SIE support the OpenAI embeddings API format?

Yes. SIE exposes an OpenAI-compatible /v1/embeddings endpoint, so you can use it as a drop-in replacement without changing OpenAI client code.

Can I run SIE and Infinity in the same pipeline?

In theory yes, but in practice you’d choose one. Both solve the same problem: self-hosted GPU inference for embedding models.