---
title: CNNs
description: Explore Convolutional Neural Networks (CNNs) and how they reshaped computer vision. Learn about network architecture, including convolutional, pooling, and fully connected layers, image recognition processes, and real-world applications in object detection, semantic segmentation, and medical imaging.
canonical_url: https://superlinked.com/glossary/convolutional-neural-networks
last_updated: 2026-06-11
---

# What are Convolutional Neural Networks (CNNs)?

A Convolutional Neural Network (CNN) is a deep learning architecture designed for processing grid-structured data, most commonly images. It uses convolutional layers to automatically learn spatial feature detectors (edges, textures, shapes) from data, rather than requiring hand-crafted features. CNNs are the foundation of modern computer vision and are increasingly used in multimodal search.

---

## Why do CNNs matter for search and inference?

CNNs are directly relevant to inference pipelines when working with image or document data:

- **Multimodal search**: image embedding models (e.g. SigLIP, CLIP, Florence-2) use CNN or Vision Transformer backbones to encode images into vectors for semantic search
- **Document processing**: OCR and layout models use CNN layers to detect text regions and characters in scanned documents
- **Visual RAG**: retrieving information from product images, charts, or scanned PDFs requires CNN-based encoders

SIE supports multimodal models including Florence-2 and SigLIP, enabling image-and-text search pipelines entirely within your own infrastructure.

---

## How does a CNN work?

CNNs process images through three main types of layers:

### Convolutional layers
A learned filter (kernel) slides across the input image, computing dot products at each position to produce a feature map. Each filter detects a different pattern: early layers detect edges; deeper layers detect complex shapes.

```
Output[i,j] = Σ (Input[i+m, j+n] × Filter[m,n])
```

### Pooling layers
Downsample feature maps by taking the max or average value in each region. This reduces spatial dimensions, controls overfitting, and builds in local translation invariance.

### Fully connected layers
Flatten the spatial feature maps and connect every neuron to the next layer, similar to a standard neural network. These layers combine features detected by earlier layers into final predictions.

---

## CNN architecture evolution

| Architecture | Year | Key innovation |
|---|---|---|
| LeNet | 1998 | First successful CNN for digits |
| AlexNet | 2012 | Deep CNN + ReLU + GPU training |
| VGG | 2014 | Uniform 3×3 convolutions, depth |
| ResNet | 2015 | Residual connections, 152 layers |
| EfficientNet | 2019 | Compound scaling |
| Vision Transformer (ViT) | 2020 | Attention replaces convolutions |

Modern multimodal models like Florence-2 and CLIP combine CNN-style image encoders with transformer-based text encoders.

---

## CNNs vs Vision Transformers (ViTs)

| | CNN | Vision Transformer |
|---|---|---|
| Inductive bias | Strong (locality, translation invariance) | Weak (learns from data) |
| Data efficiency | High | Lower (needs more data) |
| Global context | Limited (grows with depth) | From layer 1 (attention) |
| Speed | Fast | Slower for small images |
| State of the art | Competitive | Leading on large datasets |

Many production image embedding models use hybrid architectures combining both.

---

## How do CNNs enable multimodal search with SIE?

SIE supports image encoding models with CNN and ViT backbones for multimodal search pipelines:

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Encode wine label images + tasting notes together
image_results = client.encode(
    "microsoft/Florence-2-base",
    [Item(images=[{"data": img, "format": "jpeg"}]) for img in images],
)
image_vectors = [r["dense"] for r in image_results]
text_results = client.encode("BAAI/bge-m3", [Item(text=t) for t in tasting_notes])
text_vectors = [r["dense"] for r in text_results]

# Store both in a vector DB for cross-modal search
```

See the [Wine Recommender example](/docs/examples/wine-recommender) for a full multimodal search pipeline using Florence-2 and SigLIP.

---

## Frequently asked questions

**Do CNNs only work with images?**
No. 1D convolutions are used for time series and text (though transformers have largely replaced CNNs for NLP). 3D convolutions process video.

**What is transfer learning for CNNs?**
Training a CNN from scratch requires large datasets. Transfer learning takes a pre-trained model (e.g. ResNet trained on ImageNet) and fine-tunes it on a smaller domain-specific dataset, dramatically reducing data and compute requirements.

**How are CNNs used in OCR?**
OCR models use CNNs to detect and extract text regions in document images, typically feeding into a sequence model (RNN or transformer) that decodes the character sequence.

---

## Related resources

- [Wine Recommender multimodal search example](/docs/examples/wine-recommender)
- [Browse multimodal models on SIE](/models)
- [What is a transformer?](/glossary/transformers)
- [What is a neural network?](/glossary/neural-networks)
- [What is RAG?](/glossary/what-is-rag)
