What are Convolutional Neural Networks (CNNs)?

A Convolutional Neural Network (CNN) is a deep learning architecture designed for processing grid-structured data, most commonly images. It uses convolutional layers to automatically learn spatial feature detectors (edges, textures, shapes) from data, rather than requiring hand-crafted features. CNNs are the foundation of modern computer vision and are increasingly used in multimodal search.

Why do CNNs matter for search and inference?

CNNs are directly relevant to inference pipelines when working with image or document data:

Multimodal search: image embedding models (e.g. SigLIP, CLIP, Florence-2) use CNN or Vision Transformer backbones to encode images into vectors for semantic search
Document processing: OCR and layout models use CNN layers to detect text regions and characters in scanned documents
Visual RAG: retrieving information from product images, charts, or scanned PDFs requires an image encoder, typically built on a CNN or Vision Transformer backbone

SIE supports multimodal models including Florence-2 and SigLIP, enabling image-and-text search pipelines entirely within your own infrastructure.

How does a CNN work?

CNNs process images through three main types of layers:

Convolutional layers

A learned filter (kernel) slides across the input image, computing dot products at each position to produce a feature map. Each filter detects a different pattern: early layers detect edges; deeper layers detect complex shapes.

Output[i,j] = Σ_m Σ_n (Input[i+m, j+n] × Filter[m,n])

Here m and n range over the rows and columns of the filter.
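
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming no padding and a stride of 1; the function name `conv2d`, the toy image, and the hand-written filter are assumptions made for the example, and real frameworks use optimized tensor operations instead of nested loops.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 sliding-window sum, following the formula."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Dot product of the kernel with the patch whose top-left corner is (i, j)
            output[i][j] = sum(
                image[i + m][j + n] * kernel[m][n]
                for m in range(kh)
                for n in range(kw)
            )
    return output


# A 4x5 image with a vertical edge: dark (0) on the left, bright (1) on the right
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
# A hand-written 2x2 vertical-edge filter (in a real CNN these weights are learned)
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)
```

The feature map is non-zero only in the column where the filter straddles the dark-to-bright boundary, which is what "detecting an edge" means in practice.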

Pooling layers

Downsample feature maps by taking the max or average value in each region. This reduces spatial dimensions, controls overfitting, and builds in local translation invariance.
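
As a rough sketch of max pooling, the plain-Python function below keeps the largest value in each window. The name `max_pool2d`, the 2x2 window, the stride of 2, and the toy values are all assumptions chosen for illustration.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the largest value in each size x size window."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + m][j * stride + n]
                for m in range(size)
                for n in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool2d(feature_map)  # 4x4 input -> 2x2 output

# Same map, but the strongest top-left activation (4) has moved one pixel
# to the right within its window
shifted = [
    [1, 3, 2, 0],
    [2, 4, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
print(pooled)
print(max_pool2d(shifted) == pooled)
```

Because the shifted activation stays inside the same pooling window, the pooled output does not change. This is the local translation invariance described above.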

Fully connected layers

Flatten the spatial feature maps and connect every neuron to the next layer, similar to a standard neural network. These layers combine features detected by earlier layers into final predictions.
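
The flatten-then-connect step might look like the following in plain Python. The helper names `flatten` and `dense`, the two small feature maps, and the weight values are hypothetical; in a trained network the weights and biases are learned.

```python
def flatten(feature_maps):
    """Turn a list of 2D feature maps into one flat vector."""
    return [value for fmap in feature_maps for row in fmap for value in row]


def dense(inputs, weights, biases):
    """Fully connected layer: every output is a weighted sum of every input."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


# Two hypothetical 2x2 pooled feature maps -> a vector of 8 values
feature_maps = [
    [[4, 2], [2, 8]],
    [[0, 1], [1, 0]],
]
features = flatten(feature_maps)

# Illustrative weights for 2 output classes, one row of 8 weights per class
weights = [
    [0.5, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
biases = [0.0, 1.0]

scores = dense(features, weights, biases)
predicted_class = scores.index(max(scores))
print(scores, predicted_class)
```

A classifier would normally pass the scores through softmax to obtain probabilities; that step is omitted here to keep the sketch short.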

CNN architecture evolution

Architecture	Year	Key innovation
LeNet	1998	First successful CNN for handwritten digit recognition
AlexNet	2012	Deep CNN + ReLU + GPU training
VGG	2014	Uniform 3×3 convolutions, depth
ResNet	2015	Residual connections, up to 152 layers
EfficientNet	2019	Compound scaling
Vision Transformer (ViT)	2020	Attention replaces convolutions

Modern multimodal models like Florence-2 and CLIP pair an image encoder (a CNN or a Vision Transformer, depending on the model and variant) with a transformer-based text component.

CNNs vs Vision Transformers (ViTs)

Aspect	CNN	Vision Transformer
Inductive bias	Strong (locality, translation invariance)	Weak (learns from data)
Data efficiency	High	Lower (needs more data)
Global context	Limited (grows with depth)	From layer 1 (attention)
Speed	Fast	Slower at high resolution (attention cost grows quadratically with patch count)
State of the art	Competitive	Leading on large datasets

Many production image embedding models use hybrid architectures combining both.
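
To make the "grows with depth" entry in the table concrete, the sketch below computes how many input pixels a single output value can see after a stack of convolution and pooling layers. It uses the standard receptive-field recurrence; the function name `receptive_field` and the layer stacks are assumptions for illustration.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, ordered from input to output."""
    field = 1  # width of the input region seen by one output value
    jump = 1   # distance in input pixels between neighbouring outputs
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


one_conv = receptive_field([(3, 1)])
three_convs = receptive_field([(3, 1)] * 3)
# 3x3 conv, 2x2 pool, 3x3 conv, 2x2 pool, 3x3 conv
with_pooling = receptive_field([(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)])

print(one_conv, three_convs, with_pooling)
```

A CNN has to stack layers or downsample before one unit covers a large part of the image, whereas self-attention in a ViT lets every patch interact with every other patch in the first layer.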

How do CNNs enable multimodal search with SIE?

SIE supports image encoding models with CNN and ViT backbones for multimodal search pipelines:

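from pathlib import Path

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Example inputs (hypothetical placeholders): label images as raw JPEG bytes,
# assuming that is what the "data" field expects, plus one tasting note per wine
images = [Path(p).read_bytes() for p in ["label_1.jpg", "label_2.jpg"]]
tasting_notes = ["Crisp, citrus, mineral finish", "Dark fruit, firm tannins"]

# Encode the wine label images with an image-capable model
image_results = client.encode(
    "microsoft/Florence-2-base",
    [Item(images=[{"data": img, "format": "jpeg"}]) for img in images],
)
image_vectors = [r["dense"] for r in image_results]

# Encode the tasting notes separately with a text embedding model
text_results = client.encode("BAAI/bge-m3", [Item(text=t) for t in tasting_notes])
text_vectors = [r["dense"] for r in text_results]

# Store both in a vector DB for cross-modal search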

See the Wine Recommender example for a full multimodal search pipeline using Florence-2 and SigLIP.
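
Once vectors are stored, retrieval usually comes down to nearest-neighbour search. The sketch below ranks stored vectors by cosine similarity in plain Python. It is not part of the SIE SDK: the `search` helper, the file names, and the 3-dimensional toy vectors are hypothetical, and it assumes the query and stored vectors come from a shared embedding space. A production pipeline would delegate this step to a vector database.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vector, index, top_k=2):
    """Return the top_k (item_id, score) pairs, best match first."""
    scored = [
        (item_id, cosine_similarity(query_vector, vector))
        for item_id, vector in index.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy vectors standing in for real image embeddings (hypothetical data)
index = {
    "label_riesling.jpg": [0.9, 0.1, 0.0],
    "label_malbec.jpg": [0.1, 0.9, 0.2],
    "label_syrah.jpg": [0.2, 0.8, 0.4],
}
query = [0.0, 1.0, 0.3]  # stands in for the embedding of a text query

results = search(query, index)
for item_id, score in results:
    print(item_id, round(score, 3))
```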

Frequently asked questions

Do CNNs only work with images? No. 1D convolutions are used for time series and text (though transformers have largely replaced CNNs for NLP). 3D convolutions process video.
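
As a small illustration of the 1D case, the sketch below slides a two-element filter along a sequence. The function name `conv1d`, the readings, and the hand-written difference filter are assumptions for the example; a trained 1D CNN would learn its filter weights.

```python
def conv1d(signal, kernel):
    """Valid, stride-1 sliding dot product over a 1D sequence."""
    k = len(kernel)
    return [
        sum(signal[i + m] * kernel[m] for m in range(k))
        for i in range(len(signal) - k + 1)
    ]


# A step change in a hypothetical sensor reading
readings = [1, 1, 1, 5, 5, 5]
# This filter responds to the difference between neighbouring values
difference_filter = [-1, 1]

response = conv1d(readings, difference_filter)
print(response)
```

The response is non-zero only where the signal jumps, the 1D counterpart of an edge detector on an image.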

What is transfer learning for CNNs? Training a CNN from scratch requires large datasets. Transfer learning takes a pre-trained model (e.g. ResNet trained on ImageNet) and fine-tunes it on a smaller domain-specific dataset, dramatically reducing data and compute requirements.

How are CNNs used in OCR? OCR models use CNNs to detect and extract text regions in document images, typically feeding into a sequence model (RNN or transformer) that decodes the character sequence.