google/siglip-so400m-patch14-224
SigLIP model pre-trained on WebLI at a resolution of 224x224. It was introduced in the paper "Sigmoid Loss for Language Image Pre-Training" by Zhai et al. and first released in this repository.
Benchmarks
Flickr30kI2TRetrieval
Image-to-text retrieval: retrieve captions from images
Corpus: 31,783; Queries: 1,000
Quality
nDCG@10: 0.8383
MAP@10: 0.7481
MRR@10: 0.9353
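The three quality figures above are standard ranking metrics truncated at rank 10. As a rough illustration of how such scores can be computed for a single query, here is a minimal sketch in plain Python. It assumes binary relevance (a caption is either a correct match for the image or not), and it normalizes average precision by min(number of relevant items, k); evaluation toolkits differ on that normalization, so this is not necessarily how the numbers above were produced. The function names and the example IDs are hypothetical.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG@k for one query."""
    # Discounted gain of each relevant item found in the top k.
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # Ideal ranking: all relevant items placed first.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """AP@k for one query; MAP@k is the mean of this over all queries."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this hit
    # Assumption: normalize by min(|relevant|, k); conventions vary.
    denom = min(len(relevant_ids), k)
    return precision_sum / denom if denom else 0.0


def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """RR@k for one query; MRR@k is the mean of this over all queries."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical example: one image query, captions ranked by similarity.
ranked = ["c3", "c1", "c9", "c2"]
relevant = {"c1", "c2"}
print(ndcg_at_k(ranked, relevant))
print(average_precision_at_k(ranked, relevant))  # 0.5
print(reciprocal_rank_at_k(ranked, relevant))    # 0.5
```

The reported benchmark values would then be the averages of these per-query scores over all queries.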
Performance L4-SPOT b1 c8
Corpus 223 tok/s
Corpus p50 395.0ms
Query 11.5 img/s
Query p50 392.1ms
Performance L4 b1 c16
Corpus 689 tok/s
Corpus p50 173.8ms
Query 10.6 mpix/s
Query p50 135.9ms