How Does SIE Compare to Infinity?

SIE (Superlinked Inference Engine) and Infinity are both open-source servers for self-hosting text embedding and reranking models. Infinity is a lightweight, fast single-model server with a focus on OpenAI-compatible API endpoints. SIE is a broader inference platform with multi-model support, LoRA hot-loading, GPU cluster management via Terraform and Helm, and first-class support for document processing workloads.


Quick comparison

| Feature | SIE | Infinity |
| --- | --- | --- |
| Model types | Embeddings, rerankers, OCR, extraction | Embeddings, rerankers, CLIP |
| Multi-model per deployment | ✓ (shared GPU cluster) | Limited (one model per instance typical) |
| LoRA hot-loading | ✓ | |
| GPU cluster (Terraform + Helm) | ✓ | Manual |
| AWS / GCP Terraform modules | ✓ | |
| SDK | ✓ (sie-sdk) | OpenAI-compatible REST |
| OpenAI-compatible API | ✓ | ✓ (primary design goal) |
| Dynamic batching | ✓ | ✓ |
| INT8 / quantisation | | |
| Licence | Apache 2.0 | MIT |
| Backed by | Superlinked | Michael Feil (open source) |

What is Infinity?

Infinity is a high-throughput embedding inference server created by Michael Feil. Its primary design goals are:

  • OpenAI API compatibility: drop-in replacement for OpenAI’s embedding endpoint, making it easy to swap without changing client code
  • Speed: aggressive batching, CUDA optimisations, and Flash Attention for high throughput
  • Simplicity: minimal configuration, designed to be started with a single Docker command

docker run --gpus all -p 7997:7997 michaelf34/infinity:latest \
  v2 --model-name-or-path BAAI/bge-m3 --port 7997

It’s a strong choice for teams that need a quick self-hosted replacement for the OpenAI embeddings API.


When should you use Infinity?

Infinity is a good fit when:

  • You want an OpenAI API drop-in: your existing code uses openai.embeddings.create() and you want to swap to self-hosted without changing client code
  • You need a single model served simply and quickly
  • Your team prefers minimal configuration over infrastructure tooling
  • You’re deploying on existing infrastructure and don’t need Terraform/Helm automation
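
To make the "drop-in" point concrete, here is a minimal sketch of the request an OpenAI-style client sends for embeddings. It only builds the HTTP request and does not send it. The helper name `build_embeddings_request` is hypothetical, and the local base URL assumes Infinity is listening on port 7997 with its routes served directly under the base URL, as in the Docker command above. The only thing that differs between the hosted and the self-hosted call is the base URL and the model name.

```python
import json
from urllib.request import Request


def build_embeddings_request(base_url: str, model: str, texts: list[str],
                             api_key: str = "dummy") -> Request:
    """Build (but do not send) an OpenAI-style embeddings request.

    Hypothetical helper for illustration: OpenAI-style clients POST a JSON
    body with "model" and "input" to <base_url>/embeddings.
    """
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


texts = ["hello world"]

# Hosted OpenAI endpoint.
hosted = build_embeddings_request(
    "https://api.openai.com/v1", "text-embedding-3-small", texts
)
# Self-hosted server (assumed to be Infinity on localhost:7997).
local = build_embeddings_request("http://localhost:7997", "BAAI/bge-m3", texts)

print(hosted.full_url)  # https://api.openai.com/v1/embeddings
print(local.full_url)   # http://localhost:7997/embeddings
```

Because the body and headers have the same shape in both cases, existing client code usually only needs a different base URL and model name.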

When should you use SIE?

SIE is the better choice when:

  • You need multiple models in one deployment (embedding + reranker + OCR)
  • You want LoRA adapter hot-loading: swap domain-specific adapters per-request without server restart
  • You’re deploying on AWS or GCP and want managed Terraform modules for the full cluster
  • You need document processing capabilities (OCR, extraction) alongside embeddings
  • You want a production-grade SDK rather than raw HTTP calls
  • You need SOC2 Type 2 certified infrastructure
  • You want built-in monitoring and GPU utilisation metrics

Performance comparison

Both servers implement dynamic batching and CUDA-optimised inference. In single-model, single-GPU benchmarks, Infinity and SIE achieve comparable throughput because the GPU, not the server layer, is the bottleneck.
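
Dynamic batching can be sketched in a few lines. The simulation below is illustrative only and is not the scheduler used by Infinity or SIE: the names `EmbedRequest` and `dynamic_batches`, the batch size of 4, and the 10 ms wait window are all assumptions. A batch is closed either when it is full or when the next request arrives too long after the batch's first request.

```python
from dataclasses import dataclass


@dataclass
class EmbedRequest:
    arrival_ms: float  # when the request reached the server
    text: str


def dynamic_batches(requests, max_batch_size=4, max_wait_ms=10.0):
    """Group requests (sorted by arrival time) into batches.

    The current batch is closed when it is full, or when the next request
    arrives more than max_wait_ms after the batch's first request.
    """
    batches, current = [], []
    for req in requests:
        batch_is_full = len(current) >= max_batch_size
        waited_too_long = bool(current) and (
            req.arrival_ms - current[0].arrival_ms > max_wait_ms
        )
        if current and (batch_is_full or waited_too_long):
            batches.append(current)
            current = []
        current.append(req)
    if current:
        batches.append(current)
    return batches


# A burst of five requests, a pair 26 ms later, then a straggler.
arrivals = [0, 1, 2, 3, 4, 30, 31, 100]
reqs = [EmbedRequest(t, f"doc-{i}") for i, t in enumerate(arrivals)]
batches = dynamic_batches(reqs)

print([[r.text for r in b] for b in batches])
# [['doc-0', 'doc-1', 'doc-2', 'doc-3'], ['doc-4'], ['doc-5', 'doc-6'], ['doc-7']]
```

Bursts fill the GPU with full batches, while isolated requests are not held back longer than the wait window. That trade-off between throughput and latency is what both servers tune at the server layer.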

The performance difference emerges at scale:

  • Multi-model workloads: SIE’s shared GPU memory pool is more efficient than running separate Infinity instances per model
  • Cluster scale: SIE’s auto-scaling handles traffic spikes; Infinity requires manual scaling
  • Concurrent mixed workloads: encoding + reranking in the same pipeline benefits from SIE’s coordinated batching

See the SIE vs TEI vs OpenAI benchmark for detailed throughput and cost data.


Migration path: Infinity → SIE

If you’re using Infinity and want to move to SIE, the transition is straightforward. SIE exposes an OpenAI-compatible endpoint, so client code changes are minimal:

# Before (Infinity or OpenAI)
from openai import OpenAI

texts = ["first document", "second document"]  # example inputs

client = OpenAI(base_url="http://localhost:7997", api_key="dummy")
response = client.embeddings.create(model="BAAI/bge-m3", input=texts)
vectors = [e.embedding for e in response.data]

# After (SIE SDK: more features, same data)
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")
results = client.encode("BAAI/bge-m3", [Item(text=t) for t in texts])
vectors = [r["dense"] for r in results]

Alternatively, keep your existing OpenAI client code and change only the base_url to point at SIE's OpenAI-compatible endpoint.
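
Before switching traffic, it can be worth checking that both servers return near-identical vectors for the same model and inputs. The sketch below is one possible sanity check, not part of either project: `cosine_similarity` and `vectors_match` are hypothetical helpers, the 0.999 threshold is an assumption, and the sample vectors stand in for real responses from the old and new servers.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def vectors_match(old_vectors, new_vectors, threshold=0.999):
    """Return True if every pair of vectors points in almost the same direction."""
    if len(old_vectors) != len(new_vectors):
        return False
    return all(
        len(old) == len(new) and cosine_similarity(old, new) >= threshold
        for old, new in zip(old_vectors, new_vectors)
    )


# Placeholder data: in practice these would be the `vectors` lists returned
# by the old and new servers for the same `texts`.
old_vectors = [[1.0, 0.0], [0.6, 0.8]]
new_vectors = [[2.0, 0.0], [0.6, 0.8]]  # first vector differs only in scale

print(vectors_match(old_vectors, new_vectors))   # True
print(vectors_match([[1.0, 0.0]], [[0.0, 1.0]]))  # False
```

Cosine similarity ignores differences in vector length, so the check still passes if one server normalises its output and the other does not.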


SIE vs Infinity vs TEI summary

| Use case | Recommended |
| --- | --- |
| Quick OpenAI drop-in, single model | Infinity |
| Single model, HuggingFace ecosystem | TEI |
| Production, multi-model, AWS/GCP | SIE |
| LoRA domain adaptation | SIE |
| Document processing + embeddings | SIE |
| Minimal devops, just need it working | Infinity or TEI |

Frequently asked questions

Is Infinity actively maintained? Yes. Infinity is actively developed and has a growing community. It’s a legitimate production choice for single-model embedding serving.

Does SIE support the OpenAI embeddings API format? Yes. SIE exposes an OpenAI-compatible /v1/embeddings endpoint, so you can use it as a drop-in replacement without changing OpenAI client code.

Can I run SIE and Infinity in the same pipeline? In theory yes, but in practice you’d choose one. Both solve the same problem: self-hosted GPU inference for embedding models.

