Why did we open-source our inference engine? Read the post
← All Glossary Articles

What are Convolutional Neural Networks (CNNs)?

A Convolutional Neural Network (CNN) is a deep learning architecture designed for processing grid-structured data, most commonly images. It uses convolutional layers to automatically learn spatial feature detectors (edges, textures, shapes) from data, rather than requiring hand-crafted features. CNNs are the foundation of modern computer vision and are increasingly used in multimodal search.


Why do CNNs matter for search and inference?

CNNs are directly relevant to inference pipelines when working with image or document data:

  • Multimodal search: image embedding models (e.g. SigLIP, CLIP, Florence-2) use CNN or Vision Transformer backbones to encode images into vectors for semantic search
  • Document processing: OCR and layout models use CNN layers to detect text regions and characters in scanned documents
  • Visual RAG: retrieving information from product images, charts, or scanned PDFs requires CNN-based encoders

SIE supports multimodal models including Florence-2 and SigLIP, enabling image-and-text search pipelines entirely within your own infrastructure.


How does a CNN work?

CNNs process images through three main types of layers:

Convolutional layers

A learned filter (kernel) slides across the input image, computing dot products at each position to produce a feature map. Each filter detects a different pattern: early layers detect edges; deeper layers detect complex shapes.

Output[i,j] = Σ (Input[i+m, j+n] × Filter[m,n])

Pooling layers

Downsample feature maps by taking the max or average value in each region. This reduces spatial dimensions, controls overfitting, and builds in local translation invariance.

Fully connected layers

Flatten the spatial feature maps and connect every neuron to the next layer, similar to a standard neural network. These layers combine features detected by earlier layers into final predictions.


CNN architecture evolution

ArchitectureYearKey innovation
LeNet1998First successful CNN for digits
AlexNet2012Deep CNN + ReLU + GPU training
VGG2014Uniform 3×3 convolutions, depth
ResNet2015Residual connections, 152 layers
EfficientNet2019Compound scaling
Vision Transformer (ViT)2020Attention replaces convolutions

Modern multimodal models like Florence-2 and CLIP combine CNN-style image encoders with transformer-based text encoders.


CNNs vs Vision Transformers (ViTs)

CNNVision Transformer
Inductive biasStrong (locality, translation invariance)Weak (learns from data)
Data efficiencyHighLower (needs more data)
Global contextLimited (grows with depth)From layer 1 (attention)
SpeedFastSlower for small images
State of the artCompetitiveLeading on large datasets

Many production image embedding models use hybrid architectures combining both.


How do CNNs enable multimodal search with SIE?

SIE supports image encoding models with CNN and ViT backbones for multimodal search pipelines:

from sie_sdk import SIEClient
from sie_sdk.types import Item
client = SIEClient("http://localhost:8080")
# Encode wine label images + tasting notes together
image_results = client.encode(
"microsoft/Florence-2-base",
[Item(images=[{"data": img, "format": "jpeg"}]) for img in images],
)
image_vectors = [r["dense"] for r in image_results]
text_results = client.encode("BAAI/bge-m3", [Item(text=t) for t in tasting_notes])
text_vectors = [r["dense"] for r in text_results]
# Store both in a vector DB for cross-modal search

See the Wine Recommender example for a full multimodal search pipeline using Florence-2 and SigLIP.


Frequently asked questions

Do CNNs only work with images? No. 1D convolutions are used for time series and text (though transformers have largely replaced CNNs for NLP). 3D convolutions process video.

What is transfer learning for CNNs? Training a CNN from scratch requires large datasets. Transfer learning takes a pre-trained model (e.g. ResNet trained on ImageNet) and fine-tunes it on a smaller domain-specific dataset, dramatically reducing data and compute requirements.

How are CNNs used in OCR? OCR models use CNNs to detect and extract text regions in document images, typically feeding into a sequence model (RNN or transformer) that decodes the character sequence.


Self-hosted inference for search & document processing

Cut API costs by 50x, boost quality with 85+ SOTA models, and keep your data in your own cloud.

Github 2.0K

Contact us

Tell us about your use case and we'll get back to you shortly.