---
title: Backpropagation
description: "Master the backpropagation algorithm: how neural networks learn through gradient descent, chain rule, and error propagation. Complete guide with examples and implementations."
canonical_url: https://superlinked.com/glossary/backpropagation
last_updated: 2026-06-11
---

# What is Backpropagation?

Backpropagation is the algorithm that computes how much each weight in a neural network contributed to the prediction error. It works backwards from the output layer to the input layer using the chain rule of calculus, producing gradients that an optimiser such as gradient descent then uses to update the weights.

---

## Why does backpropagation matter?

Without backpropagation, the gradient for each weight would have to be estimated separately, which is prohibitively expensive for multi-layer networks. Backpropagation computes the gradients for millions (or billions) of parameters in a single backward pass, at a cost that is a small constant multiple of the forward pass, which is what made modern deep learning practical. Every embedding model used in semantic search and RAG, including those hosted on SIE, was trained using backpropagation.

---

## How does backpropagation work?

Training with backpropagation runs in four steps each iteration:

**1. Forward pass**: input data flows through the network layer by layer, producing a prediction.

**2. Loss calculation**: a loss function (e.g. cross-entropy, MSE) measures how wrong the prediction was.

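**3. Backward pass**: starting from the loss, gradients are computed backwards through every layer using the chain rule. For a weight feeding a single neuron, where the pre-activation is the weighted sum of inputs before the activation function is applied:

```
∂Loss/∂Weight = ∂Loss/∂Output × ∂Output/∂PreActivation × ∂PreActivation/∂Weight
```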

**4. Weight update**: each weight is nudged in the direction that reduces the loss:

```
new_weight = old_weight - learning_rate × gradient
```

This cycle repeats over many batches, often thousands or more, until the loss stops improving.
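
The four steps can be sketched for the smallest possible case: one neuron with a sigmoid activation and a squared-error loss, trained on a single example. This is a minimal illustration in plain Python, not how a framework implements it; the function name `train_step` and all the numeric values are assumptions chosen for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(w, b, x, y, learning_rate):
    """One training iteration for a single sigmoid neuron (hypothetical helper)."""
    # 1. Forward pass: pre-activation, then activation.
    z = w * x + b
    pred = sigmoid(z)

    # 2. Loss calculation: squared error for one example.
    loss = (pred - y) ** 2

    # 3. Backward pass: multiply local derivatives (chain rule).
    dloss_dpred = 2.0 * (pred - y)
    dpred_dz = pred * (1.0 - pred)  # derivative of the sigmoid
    grad_w = dloss_dpred * dpred_dz * x    # dz/dw = x
    grad_b = dloss_dpred * dpred_dz * 1.0  # dz/db = 1

    # 4. Weight update: step against the gradient.
    w = w - learning_rate * grad_w
    b = b - learning_rate * grad_b
    return w, b, loss


# Illustrative starting values and data point (assumptions, not from a real model).
w, b = 0.5, 0.0
losses = []
for _ in range(200):
    w, b, loss = train_step(w, b, x=1.5, y=1.0, learning_rate=0.5)
    losses.append(loss)

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

In a real network the same pattern applies per layer, with matrices in place of the scalars `w` and `x`, and a framework's automatic differentiation replaces the hand-written backward pass.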

---

## What is the chain rule and why does it matter here?

The chain rule of calculus allows the derivative of a composite function to be broken into a product of simpler derivatives. A deep network is a composition of layers, so the gradient at any layer can be computed by multiplying local gradients together. Because each layer reuses the gradient already computed for the layer after it, the cost of the backward pass grows roughly linearly with depth instead of requiring a separate computation per weight.
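
As a worked illustration, the sketch below applies the chain rule by hand to a two-layer scalar network and checks the result against a finite-difference estimate. The network shape, the function names (`forward`, `backprop_grads`, `numeric_grad`), and the input values are all assumptions made for the example.

```python
import math


def forward(w1, w2, x):
    h = math.tanh(w1 * x)  # layer 1: weight, then tanh activation
    y = w2 * h             # layer 2: linear output
    return h, y


def loss_fn(w1, w2, x, target):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - target) ** 2


def backprop_grads(w1, w2, x, target):
    """Gradients of the loss with respect to w1 and w2 via the chain rule."""
    h, y = forward(w1, w2, x)
    dL_dy = y - target            # derivative of 0.5 * (y - target)^2
    dL_dw2 = dL_dy * h            # dy/dw2 = h
    dL_dh = dL_dy * w2            # reused when moving back to layer 1
    dh_dz = 1.0 - h ** 2          # derivative of tanh
    dL_dw1 = dL_dh * dh_dz * x    # dz/dw1 = x
    return dL_dw1, dL_dw2


def numeric_grad(f, value, eps=1e-6):
    """Central finite difference, used only to check the analytic result."""
    return (f(value + eps) - f(value - eps)) / (2 * eps)


w1, w2, x, target = 0.3, -0.8, 2.0, 1.0  # illustrative values
g1, g2 = backprop_grads(w1, w2, x, target)
n1 = numeric_grad(lambda v: loss_fn(v, w2, x, target), w1)
n2 = numeric_grad(lambda v: loss_fn(w1, v, x, target), w2)

print(f"w1: backprop {g1:.6f}, numeric {n1:.6f}")
print(f"w2: backprop {g2:.6f}, numeric {n2:.6f}")
```

Note how `dL_dh` is computed once and reused for the earlier layer. That reuse is what keeps the backward pass cheap as more layers are stacked.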

---

## What are the main challenges with backpropagation?

**Vanishing gradients**: in deep networks, gradients can shrink to near-zero as they propagate backwards, preventing early layers from learning. Mitigated with ReLU-family activations, residual connections (as in ResNets), and normalisation layers such as batch normalisation.
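
To get a feel for the scale of the effect, the sketch below multiplies the activation derivative across 20 layers. It is a deliberate simplification: it ignores the weights and assumes every layer sees the same pre-activation, and the helper `gradient_scale` is a hypothetical name for this example.

```python
import math


def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)  # never larger than 0.25


def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def gradient_scale(local_grad, pre_activation, depth):
    """Product of `depth` identical activation derivatives (weights ignored)."""
    return local_grad(pre_activation) ** depth


# Sigmoid at its steepest point (z = 0) still shrinks the signal 4x per layer.
sigmoid_scale = gradient_scale(sigmoid_grad, 0.0, depth=20)
# ReLU on an active unit passes the gradient through unchanged.
relu_scale = gradient_scale(relu_grad, 1.0, depth=20)

print(f"sigmoid, 20 layers: {sigmoid_scale:.2e}")
print(f"ReLU, 20 layers:    {relu_scale:.2e}")
```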

**Exploding gradients**: gradients can grow uncontrollably as they propagate backwards, especially in RNNs, causing unstable updates. Mitigated with gradient clipping.
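
One common form of gradient clipping rescales the whole gradient vector when its norm exceeds a threshold, which preserves its direction. The sketch below shows the idea on a flat list of numbers; the function name `clip_by_global_norm` and the threshold are assumptions for illustration, and deep learning frameworks ship their own equivalents.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale `grads` so that its L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already within bounds, leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]


# A gradient with norm 5 is scaled down to norm 1: approximately [0.6, 0.8].
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
# A gradient already under the threshold is returned as-is.
unchanged = clip_by_global_norm([0.3, 0.4], max_norm=1.0)

print(clipped, unchanged)
```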

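**Dying ReLU**: a ReLU neuron whose pre-activation is negative for every input outputs zero and receives zero gradient, so it stops learning. Mitigated with LeakyReLU or ELU activations, which keep a non-zero gradient for negative inputs.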
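
The sketch below shows why such a neuron cannot recover under plain ReLU. It uses a single neuron with a strongly negative bias, so every pre-activation is below zero; the helper `weight_gradient`, the inputs, and the slope of 0.01 are illustrative assumptions.

```python
def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def leaky_relu_grad(z, negative_slope=0.01):
    return 1.0 if z > 0 else negative_slope


def weight_gradient(inputs, w, b, activation_grad):
    """Sum of d(activation)/dw over the inputs, with the upstream gradient taken as 1."""
    return sum(activation_grad(w * x + b) * x for x in inputs)


inputs = [0.5, 1.0, 2.0]
w, b = 1.0, -10.0  # bias so negative that every pre-activation is below zero

relu_total = weight_gradient(inputs, w, b, relu_grad)         # no learning signal
leaky_total = weight_gradient(inputs, w, b, leaky_relu_grad)  # small but non-zero

print(relu_total, leaky_total)
```

With plain ReLU the weight gradient is exactly zero, so no update can ever move the neuron back into its active range. The leaky variant keeps a small gradient flowing.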

---

## Batch sizes: full batch vs mini-batch vs online

| Mode | Data per update | Gradient noise | Speed per update |
|---|---|---|---|
| Full batch | Entire dataset | Low | Slow |
| Mini-batch | Typically 32 to 512 samples | Medium | Fast (GPU-friendly) |
| Online (stochastic) | 1 sample | High | Very fast, but underuses parallel hardware |

Mini-batch is the standard for training modern embedding and language models. It balances stable gradients with efficient GPU utilisation.
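
In code, the three modes differ only in the batch size passed to the data iterator. The sketch below uses a hypothetical helper `iter_batches` and a toy dataset of ten items; real training pipelines usually also shuffle the data each epoch, which is omitted here.

```python
def iter_batches(dataset, batch_size):
    """Yield consecutive slices of `dataset`, each with at most `batch_size` items."""
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]


dataset = list(range(10))  # stand-in for ten training examples

full_batch = list(iter_batches(dataset, batch_size=len(dataset)))  # 1 update per epoch
mini_batch = list(iter_batches(dataset, batch_size=4))             # 3 updates per epoch
online = list(iter_batches(dataset, batch_size=1))                 # 10 updates per epoch

print(len(full_batch), len(mini_batch), len(online))
```

Each yielded batch would feed one forward pass, one backward pass, and one weight update, so a smaller batch size means more, noisier updates per pass over the data.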

---

## How does backpropagation relate to embedding models?

Every embedding model available on SIE (BGE-M3, E5, Jina, and others) was trained via backpropagation on large text corpora, typically using contrastive losses that pull similar texts closer and push dissimilar texts apart in vector space. Understanding backpropagation helps you reason about what fine-tuning or LoRA adaptation changes in a model.
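
To make the idea of a contrastive loss concrete, the sketch below computes an InfoNCE-style loss for one anchor, one positive, and a few negatives. It is one common formulation, shown here as an assumption and not as the exact loss used by any of the models named above; the function name `info_nce_loss`, the temperature, and the toy 2-D vectors are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Cross-entropy over cosine similarities, with the positive as the correct class."""
    a = normalize(anchor)
    candidates = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [dot(a, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]  # negative log-probability of the positive


# Positive close to the anchor: low loss.
good = info_nce_loss([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.2]])
# Positive far from the anchor while a negative is close: high loss.
bad = info_nce_loss([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.2]])

print(f"aligned pair: {good:.5f}, misaligned pair: {bad:.5f}")
```

During training, backpropagation carries the gradient of a loss like this back through the encoder, so the weights shift in the direction that raises the similarity of matching pairs and lowers it for mismatched ones.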

---

## Frequently asked questions

**Is backpropagation used in transformer models?**
Yes. Transformers are trained with backpropagation, typically combined with an Adam-family optimiser such as AdamW and, in many setups, gradient clipping. The architecture differs from earlier networks, but the training algorithm is the same.

**What's the difference between backpropagation and gradient descent?**
Backpropagation computes the gradients. Gradient descent uses those gradients to update the weights. They work together: backpropagation is the "how much did each weight contribute to the error" step; gradient descent is the "how do we change the weights in response" step.

**Does backpropagation happen at inference time?**
No. Backpropagation only occurs during training. At inference time (when SIE encodes your documents), the model weights are frozen and only a forward pass runs.

---

## Related resources

- [Browse embedding models trained via backpropagation](/models)
- [What is a neural network?](/glossary/neural-networks)
- [What is an optimizer?](/glossary/optimizer)
- [What is a loss function?](/glossary/loss-function)
- [What is a transformer?](/glossary/transformers)
