google/siglip-so400m-patch14-384

SigLIP model pre-trained on WebLI at a resolution of 384x384. It was introduced in the paper "Sigmoid Loss for Language Image Pre-Training" by Zhai et al. and first released in this repository.

Architecture: SigLIP

Parameters: 878M

Tasks: Encode

Outputs: Dense

Dimensions: 1,152 (dense)

Max Sequence Length: 64 tokens

License: apache-2.0


Benchmarks

Flickr30kI2TRetrieval

general retrieval en

Image-to-text retrieval: retrieve captions from images

Corpus: 31,783 Queries: 1,000
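As a rough illustration of the retrieval task, the sketch below ranks caption embeddings against one image embedding. It assumes cosine similarity as the scoring function, which this page does not state, and it uses toy 3-dimensional vectors in place of the model's 1,152-dimensional outputs. The helper names `l2_normalize` and `rank_captions` are hypothetical and not part of any library.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def rank_captions(image_embedding, caption_embeddings, k=10):
    """Return the indices of the k captions most similar to the image query."""
    query = l2_normalize(image_embedding)
    scored = []
    for idx, caption in enumerate(caption_embeddings):
        cap = l2_normalize(caption)
        similarity = sum(q * c for q, c in zip(query, cap))
        scored.append((similarity, idx))
    # Highest similarity first; ties go to the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]

# Toy vectors standing in for real image and caption embeddings (assumption).
image = [1.0, 0.0, 0.0]
captions = [[0.0, 1.0, 0.0], [2.0, 0.1, 0.0], [0.5, 0.5, 0.0]]
top = rank_captions(image, captions, k=2)
print(top)
```

In a real run the image embedding would come from the image encoder and the caption embeddings from the text encoder, with the corpus holding every caption.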

Quality

nDCG@10: 0.9001

MAP@10: 0.8364

MRR@10: 0.9663
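The three quality figures are standard ranking metrics cut off at rank 10. The sketch below shows one common way to compute them for a single query with binary relevance. It is a minimal sketch under that assumption: the benchmark harness behind these numbers may use different conventions (for example, how average precision is normalized), and the function names are hypothetical. Reported scores would be the mean over all queries.

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked_ids[:k], start=1)
        if doc in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant item within the top k."""
    for rank, doc in enumerate(ranked_ids[:k], start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

def ap_at_k(ranked_ids, relevant_ids, k=10):
    """Average precision at k, normalized by min(number of relevant items, k)."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_ids[:k], start=1):
        if doc in relevant_ids:
            hits += 1
            total += hits / rank
    denom = min(len(relevant_ids), k)
    return total / denom if denom else 0.0

# One toy query: two relevant captions, found at ranks 2 and 4.
ranked = ["c3", "c1", "c7", "c2"]
relevant = {"c1", "c2"}
scores = (ndcg_at_k(ranked, relevant), ap_at_k(ranked, relevant), mrr_at_k(ranked, relevant))
print(scores)
```

MRR only rewards the first correct caption, while nDCG and MAP also penalize relevant captions that appear lower in the list. This is one reason MRR@10 can be the highest of the three when a query has several relevant captions.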

Performance L4-SPOT b1 c8

Corpus 202 tok/s

Corpus p50 523.6ms

Query 9.7 img/s

Query p50 711.3ms

Performance L4 b1 c16

Corpus 597 tok/s

Corpus p50 381.3ms

Query 5.7 mpix/s

Query p50 459.0ms

