What are Convolutional Neural Networks (CNNs)?
A Convolutional Neural Network (CNN) is a deep learning architecture designed for processing grid-structured data, most commonly images. It uses convolutional layers to automatically learn spatial feature detectors (edges, textures, shapes) from data, rather than requiring hand-crafted features. CNNs are the foundation of modern computer vision and are increasingly used in multimodal search.
Why do CNNs matter for search and inference?
CNNs are directly relevant to inference pipelines when working with image or document data:
- Multimodal search: image embedding models (e.g. SigLIP, CLIP, Florence-2) use CNN or Vision Transformer backbones to encode images into vectors for semantic search
- Document processing: OCR and layout models use CNN layers to detect text regions and characters in scanned documents
- Visual RAG: retrieving information from product images, charts, or scanned PDFs requires CNN-based encoders
SIE supports multimodal models including Florence-2 and SigLIP, enabling image-and-text search pipelines entirely within your own infrastructure.
How does a CNN work?
CNNs process images through three main types of layers:
Convolutional layers
A learned filter (kernel) slides across the input image, computing dot products at each position to produce a feature map. Each filter detects a different pattern: early layers detect edges; deeper layers detect complex shapes.
Output[i,j] = Σ (Input[i+m, j+n] × Filter[m,n])Pooling layers
Downsample feature maps by taking the max or average value in each region. This reduces spatial dimensions, controls overfitting, and builds in local translation invariance.
Fully connected layers
Flatten the spatial feature maps and connect every neuron to the next layer, similar to a standard neural network. These layers combine features detected by earlier layers into final predictions.
CNN architecture evolution
| Architecture | Year | Key innovation |
|---|---|---|
| LeNet | 1998 | First successful CNN for digits |
| AlexNet | 2012 | Deep CNN + ReLU + GPU training |
| VGG | 2014 | Uniform 3×3 convolutions, depth |
| ResNet | 2015 | Residual connections, 152 layers |
| EfficientNet | 2019 | Compound scaling |
| Vision Transformer (ViT) | 2020 | Attention replaces convolutions |
Modern multimodal models like Florence-2 and CLIP combine CNN-style image encoders with transformer-based text encoders.
CNNs vs Vision Transformers (ViTs)
| CNN | Vision Transformer | |
|---|---|---|
| Inductive bias | Strong (locality, translation invariance) | Weak (learns from data) |
| Data efficiency | High | Lower (needs more data) |
| Global context | Limited (grows with depth) | From layer 1 (attention) |
| Speed | Fast | Slower for small images |
| State of the art | Competitive | Leading on large datasets |
Many production image embedding models use hybrid architectures combining both.
How do CNNs enable multimodal search with SIE?
SIE supports image encoding models with CNN and ViT backbones for multimodal search pipelines:
from sie_sdk import SIEClientfrom sie_sdk.types import Item
client = SIEClient("http://localhost:8080")
# Encode wine label images + tasting notes togetherimage_results = client.encode( "microsoft/Florence-2-base", [Item(images=[{"data": img, "format": "jpeg"}]) for img in images],)image_vectors = [r["dense"] for r in image_results]text_results = client.encode("BAAI/bge-m3", [Item(text=t) for t in tasting_notes])text_vectors = [r["dense"] for r in text_results]
# Store both in a vector DB for cross-modal searchSee the Wine Recommender example for a full multimodal search pipeline using Florence-2 and SigLIP.
Frequently asked questions
Do CNNs only work with images? No. 1D convolutions are used for time series and text (though transformers have largely replaced CNNs for NLP). 3D convolutions process video.
What is transfer learning for CNNs? Training a CNN from scratch requires large datasets. Transfer learning takes a pre-trained model (e.g. ResNet trained on ImageNet) and fine-tunes it on a smaller domain-specific dataset, dramatically reducing data and compute requirements.
How are CNNs used in OCR? OCR models use CNNs to detect and extract text regions in document images, typically feeding into a sequence model (RNN or transformer) that decodes the character sequence.