google/siglip-so400m-patch14-224

SigLIP model pre-trained on WebLI at a resolution of 224x224. It was introduced in the paper Sigmoid Loss for Language Image Pre-Training by Zhai et al. and first released in this repository.

Architecture: SigLIP

Parameters: 877M

Tasks: Encode

Outputs: Dense

Dimensions: 1,152 (dense)

Max Sequence Length: 64 tokens

License: apache-2.0
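
As a rough illustration of the Encode task, the sketch below loads the checkpoint through the Hugging Face `transformers` library and returns dense image and text embeddings. It is only one possible way to run the model, not necessarily how this page's numbers were produced. The helper name `encode_pairs` and the three constants are illustrative; the constant values are copied from the fields above. Running it assumes `torch`, `transformers`, `sentencepiece`, and `Pillow` are installed and that the weights can be downloaded.

```python
MODEL_ID = "google/siglip-so400m-patch14-224"
EMBED_DIM = 1152       # dense output size listed on this page
MAX_TEXT_TOKENS = 64   # max sequence length listed on this page


def encode_pairs(image_paths, texts):
    """Hypothetical helper: embed images and texts with one forward pass.

    Returns (image_embeds, text_embeds) as torch tensors of shape
    (num_images, dim) and (num_texts, dim).
    """
    # Heavy imports are kept local so the module loads without them.
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID).eval()

    images = [Image.open(path).convert("RGB") for path in image_paths]
    inputs = processor(
        text=texts,
        images=images,
        padding="max_length",   # pad every text to the same fixed length
        truncation=True,        # cut texts longer than the limit
        max_length=MAX_TEXT_TOKENS,
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.image_embeds, outputs.text_embeds
```

Calling `encode_pairs(["photo.jpg"], ["a dog on a beach"])` with a real image path would yield one embedding per image and one per text; the file name here is a placeholder.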

Benchmarks

Flickr30kI2TRetrieval

general retrieval en

Image-to-text retrieval: retrieve captions from images

Corpus: 31,783 Queries: 1,000

Quality

NDCG@10: 0.8382

MAP@10: 0.7479

MRR@10: 0.9353
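
For readers unfamiliar with the three quality metrics, the following is a minimal per-query sketch in plain Python. It assumes binary relevance (a caption is either relevant to the query image or not) and normalizes average precision by the total number of relevant items; the benchmark harness may use a different convention, so treat this as an explanation of the metrics rather than a reproduction of the scores above. All function names are illustrative.

```python
import math


def dcg_at_k(gains, k):
    # Discounted cumulative gain: the item at 0-based position i
    # is discounted by log2(i + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    gains = [1.0 if doc in relevant_ids else 0.0 for doc in ranked_ids]
    # Ideal ranking puts every relevant item first.
    ideal = [1.0] * min(len(relevant_ids), k)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_ids[:k], start=1):
        if doc in relevant_ids:
            hits += 1
            total += hits / rank  # precision at this hit's rank
    return total / len(relevant_ids) if relevant_ids else 0.0


def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    for rank, doc in enumerate(ranked_ids[:k], start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0


# Toy query: two relevant captions, retrieved at ranks 2 and 4.
ranked = ["c3", "c1", "c7", "c2"]
relevant = {"c1", "c2"}
ndcg = ndcg_at_k(ranked, relevant)
ap = average_precision_at_k(ranked, relevant)
rr = reciprocal_rank_at_k(ranked, relevant)
```

The reported NDCG@10, MAP@10, and MRR@10 would then be the means of these per-query values over all queries.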

Performance L4-SPOT b1 c8

Corpus: 223 tok/s

Corpus p50: 395.0 ms

Query: 11.5 img/s

Query p50: 392.1 ms

Performance L4 b1 c16

Corpus: 577 tok/s

Corpus p50: 393.2 ms

Query: 7.0 mpix/s

Query p50: 372.1 ms
