Model Training

What is Backpropagation?

Backpropagation is the algorithm at the heart of neural network training: it computes how much each weight contributed to the prediction error, so that each weight can be adjusted accordingly. It works backwards from the output layer to the input layer using the chain rule of calculus, calculating the gradients that guide weight updates via gradient descent.

Why does backpropagation matter?

Without backpropagation, training deep neural networks would be computationally impractical, because the effect of each weight on the loss would have to be estimated separately. Backpropagation computes the gradients for millions (or billions) of parameters in a single backward pass, at a cost comparable to the forward pass, which is what makes modern deep learning feasible. Every embedding model used in semantic search and RAG, including those hosted on SIE, was trained using backpropagation.

How does backpropagation work?

Training with backpropagation runs in four steps each iteration:

1. Forward pass: input data flows through the network layer by layer, producing a prediction.

2. Loss calculation: a loss function (e.g. cross-entropy, MSE) measures how wrong the prediction was.

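3. Backward pass: starting from the loss, gradients are computed backwards through every layer using the chain rule. In the formula, Output is the network's prediction and Activation is the output of the layer that the weight feeds into: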

∂Loss/∂Weight = ∂Loss/∂Output × ∂Output/∂Activation × ∂Activation/∂Weight

4. Weight update: each weight is nudged in the direction that reduces the loss:

new_weight = old_weight - learning_rate × gradient

This cycle repeats, one batch per iteration, until the loss stops improving (the model converges). In practice that usually takes many thousands of batches.
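The four steps can be sketched for the smallest possible case: a single sigmoid neuron fitted to one training example with a squared-error loss. This is an illustrative toy, not how any production model is trained; the names `train_step`, `w`, `b`, and the chosen values for the input, target, and learning rate are all assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, b, x, y, learning_rate):
    """One iteration of forward pass, loss, backward pass, and weight update."""
    # 1. Forward pass: compute the prediction.
    z = w * x + b
    y_hat = sigmoid(z)

    # 2. Loss calculation: squared error for a single example.
    loss = (y_hat - y) ** 2

    # 3. Backward pass: chain rule, one local derivative per stage.
    dloss_dyhat = 2.0 * (y_hat - y)      # d(loss)/d(prediction)
    dyhat_dz = y_hat * (1.0 - y_hat)     # derivative of the sigmoid
    grad_w = dloss_dyhat * dyhat_dz * x  # dz/dw = x
    grad_b = dloss_dyhat * dyhat_dz      # dz/db = 1

    # 4. Weight update: step against the gradient.
    w = w - learning_rate * grad_w
    b = b - learning_rate * grad_b
    return w, b, loss

# Hypothetical starting values and a single (input, target) pair.
w, b = 0.5, 0.0
losses = []
for _ in range(200):
    w, b, loss = train_step(w, b, x=1.5, y=1.0, learning_rate=0.5)
    losses.append(loss)

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
```

Real training applies the same loop to batches of examples and to every weight in the network, with the backward pass handled by an automatic differentiation library rather than written by hand.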

What is the chain rule and why does it matter here?

The chain rule of calculus allows the derivative of a composite function to be broken into a product of simpler derivatives. In a deep network with many stacked layers, this means the gradient at any layer can be computed by multiplying local gradients together. Because the gradient arriving at one layer is reused to compute the gradient for the layer before it, a full backward pass costs roughly as much as a forward pass, and that cost grows only linearly with depth.
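As a small check of this idea, the sketch below computes gradients for a two-layer scalar network by multiplying local derivatives, then compares them against a finite-difference estimate. The network shape (a tanh layer followed by a linear layer), the function names, and the numeric values are assumptions chosen for illustration.

```python
import math

def forward(w1, w2, x):
    h = math.tanh(w1 * x)  # layer 1: weighted input through tanh
    y = w2 * h             # layer 2: linear output
    return h, y

def loss_fn(w1, w2, x, target):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - target) ** 2

def backprop_grads(w1, w2, x, target):
    """Gradients of the loss w.r.t. w1 and w2 via the chain rule."""
    h, y = forward(w1, w2, x)
    dL_dy = y - target          # derivative of 0.5 * (y - target)^2
    dL_dw2 = dL_dy * h          # dy/dw2 = h
    dL_dh = dL_dy * w2          # gradient passed back to layer 1 (reused below)
    dh_dz = 1.0 - h * h         # derivative of tanh
    dL_dw1 = dL_dh * dh_dz * x  # dz/dw1 = x
    return dL_dw1, dL_dw2

def numeric_grad(f, value, eps=1e-6):
    """Central finite-difference estimate of df/dvalue."""
    return (f(value + eps) - f(value - eps)) / (2.0 * eps)

w1, w2, x, target = 0.3, -0.8, 2.0, 1.0
g1, g2 = backprop_grads(w1, w2, x, target)
n1 = numeric_grad(lambda v: loss_fn(v, w2, x, target), w1)
n2 = numeric_grad(lambda v: loss_fn(w1, v, x, target), w2)

print(f"w1: chain rule {g1:.6f}, finite difference {n1:.6f}")
print(f"w2: chain rule {g2:.6f}, finite difference {n2:.6f}")
```

The finite-difference method needs two extra forward passes per weight, while the chain-rule version gets every gradient from one backward sweep. That difference is why backpropagation scales to large networks.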

What are the main challenges with backpropagation?

Vanishing gradients: in deep networks, gradients are multiplied by a local derivative at every layer, so they can shrink to near-zero as they propagate backwards, preventing early layers from learning. Common mitigations are ReLU activations, residual connections (as in ResNets), and batch normalisation.
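A rough way to see the effect is to multiply the same local derivative over and over, as a stand-in for a gradient passing back through many layers. The sketch below is deliberately simplified: it assumes every layer has the same pre-activation value `z` and a weight of 1.0, which real networks do not, and the helper name `backprop_scale` is made up for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never larger than 0.25

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def backprop_scale(depth, local_grad, z=0.5, weight=1.0):
    """Factor applied to a gradient after passing back through `depth` layers.

    Simplifying assumption: every layer sees the same pre-activation z
    and has the same scalar weight.
    """
    scale = 1.0
    for _ in range(depth):
        scale *= weight * local_grad(z)
    return scale

for depth in (1, 5, 10, 20):
    print(depth,
          f"sigmoid: {backprop_scale(depth, sigmoid_grad):.3e}",
          f"relu: {backprop_scale(depth, relu_grad):.3e}")
```

With sigmoid activations the factor shrinks geometrically with depth, while ReLU passes the gradient through unchanged for positive inputs, which is one reason it helps.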

Exploding gradients: gradients can grow uncontrollably as they propagate backwards, especially in RNNs, where the same weights are applied at every time step. The usual mitigation is gradient clipping.
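One common form of gradient clipping rescales the whole gradient vector when its norm exceeds a threshold, which preserves its direction. The sketch below shows that idea on a plain list of floats; the function name `clip_by_global_norm` and the threshold value are assumptions for illustration, and deep learning frameworks ship their own equivalents.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        return list(grads), total_norm  # already small enough
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

# Hypothetical gradient with norm 5.0, clipped to norm 1.0.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after, clipped)
```

Clipping is applied between the backward pass and the weight update, so an unusually large gradient changes the step size but not the step direction.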

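Dying ReLU: a ReLU neuron whose input stays negative outputs zero and receives zero gradient, so it can stop learning permanently. Addressed with LeakyReLU or ELU activations, which keep a small non-zero gradient for negative inputs.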
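The difference shows up directly in the update rule. In the sketch below, a neuron starts with a strongly negative bias, so its pre-activation is negative. The helper `update`, the starting values, and the fixed upstream gradient are all hypothetical, chosen only to isolate the effect of the activation's derivative.

```python
def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, negative_slope=0.01):
    return 1.0 if z > 0 else negative_slope

def update(w, b, x, upstream_grad, act_grad, lr=0.1):
    """One gradient step for a single neuron with the given activation derivative."""
    z = w * x + b
    local = act_grad(z)  # zero for ReLU when z <= 0
    w = w - lr * upstream_grad * local * x
    b = b - lr * upstream_grad * local
    return w, b

# Pre-activation is 0.1 * 1.0 - 5.0 = -4.9, so the neuron is inactive.
# upstream_grad = -1.0 means the loss would fall if the output rose.
relu_w, relu_b = update(0.1, -5.0, x=1.0, upstream_grad=-1.0, act_grad=relu_grad)
leaky_w, leaky_b = update(0.1, -5.0, x=1.0, upstream_grad=-1.0, act_grad=leaky_relu_grad)

print("ReLU:     ", relu_w, relu_b)    # parameters unchanged
print("LeakyReLU:", leaky_w, leaky_b)  # parameters move slightly
```

With ReLU the parameters do not move at all, so the neuron stays inactive for this input. With LeakyReLU they drift slowly in the direction that could reactivate it.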

Batch sizes: full batch vs mini-batch vs online

Mode	Data per update	Gradient noise	Speed
Full batch	Entire dataset	Low	Slow per update
Mini-batch	Typically 32-512 samples	Medium	Fast (GPU-friendly)
Online (stochastic)	1 sample	High	Very fast per update

Mini-batch is the standard for training modern embedding and language models. It balances stable gradients with efficient GPU utilisation.
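The three modes differ only in how many samples feed each gradient estimate, so a single `batch_size` parameter can cover all of them. The sketch below fits a one-weight linear model to noiseless synthetic data; the data, learning rate, epoch count, and function names are assumptions for illustration.

```python
import random

def make_batches(data, batch_size, rng):
    """Shuffle the data and split it into consecutive batches."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]

def batch_gradient(w, batch):
    """Mean gradient of 0.5 * (w * x - y)^2 with respect to w over the batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, epochs=50, lr=0.1, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        for batch in make_batches(data, batch_size, rng):
            w -= lr * batch_gradient(w, batch)  # one update per batch
    return w

# Synthetic, noiseless data generated from y = 3x.
data = [(x / 10.0, 3.0 * x / 10.0) for x in range(1, 11)]

w_full = train(data, batch_size=len(data))  # 1 update per epoch
w_mini = train(data, batch_size=4)          # 3 updates per epoch
w_online = train(data, batch_size=1)        # 10 updates per epoch

print(f"full batch: {w_full:.4f}, mini-batch: {w_mini:.4f}, online: {w_online:.4f}")
```

For the same number of epochs, smaller batches make more updates, so they get closer to the true weight of 3.0 here. Because this data has no noise, the example does not show the downside of small batches: on real data, single-sample gradients are noisy and cannot use parallel hardware efficiently, which is the trade-off mini-batches are meant to balance.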

How does backpropagation relate to embedding models?

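Every embedding model available on SIE (BGE-M3, E5, Jina, and others) was trained via backpropagation on large text corpora, typically using contrastive losses that pull similar texts closer and push dissimilar texts apart in vector space. Understanding backpropagation helps you reason about what fine-tuning or LoRA adaptation changes in a model: both rely on the same gradient computation, but full fine-tuning updates the original weights, while LoRA keeps them frozen and trains small low-rank matrices added alongside them.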
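To make the idea of a contrastive loss concrete, here is a minimal sketch of an InfoNCE-style objective, one common choice, computed on hand-written 2-dimensional vectors. The source does not say which loss any particular model uses, so treat the function `info_nce_loss`, the temperature value, and the toy vectors as assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """Cross-entropy of picking the positive among positive + negatives."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]

query = [1.0, 0.0]

# Positive close to the query, negatives far away: low loss.
good = info_nce_loss(query, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])

# Positive far from the query, a negative close to it: high loss.
bad = info_nce_loss(query, [0.0, 1.0], [[0.9, 0.1]])

print(f"well-separated: {good:.6f}, poorly separated: {bad:.4f}")
```

During training, backpropagation carries the gradient of a loss like this back through the encoder, so the weights shift to make the positive pair more similar and the negatives less similar.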

Frequently asked questions

Is backpropagation used in transformer models? Yes. Transformers are trained with backpropagation, usually paired with an Adam-family optimiser (such as Adam or AdamW) and often with gradient clipping. The architecture differs from earlier networks, but the gradient computation is the same.

What’s the difference between backpropagation and gradient descent? Backpropagation computes the gradients. Gradient descent uses those gradients to update the weights. They work together: backpropagation answers “how does each weight affect the loss?”; gradient descent answers “how do we change the weights in response?”. Other optimisers, such as Adam, use the same backpropagated gradients with a different update rule.

Does backpropagation happen at inference time? No. Backpropagation only occurs during training. At inference time (when SIE encodes your documents), the model weights are frozen and only a forward pass runs.