Model Training

What is a Loss Function?

A loss function (also called a cost function or objective function) measures how wrong a model’s predictions are compared to the true values. During training, backpropagation computes the gradient of the loss with respect to the model weights, and the optimiser uses those gradients to adjust the weights so that the loss decreases. The choice of loss function determines what the model is optimised for, and directly shapes the representations it learns.
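To make the training loop concrete, here is a minimal sketch of gradient descent on a one-parameter model y ≈ w·x with a mean squared error loss. The function name `fit_slope`, the learning rate, and the toy data are illustrative assumptions, and the gradient is written out by hand rather than obtained through backpropagation.

```python
def fit_slope(xs, ys, lr=0.01, steps=500):
    """Fit y ≈ w * x by plain gradient descent on mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of L(w) = (1/n) Σ (w*x - y)² with respect to w
        grad = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        # Step against the gradient so the loss decreases
        w -= lr * grad
    return w


# Toy data generated by y = 2x, so the fitted slope should approach 2
w = fit_slope([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

In a real framework the gradient would come from automatic differentiation, but the update rule has the same shape.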

Why does the loss function matter?

The loss function is the signal that training optimises. A model becomes good at exactly what you measure, so choosing the wrong loss leads to models that are technically “optimal” by the metric but useless in practice.

For embedding models, the loss function determines the structure of the vector space. Contrastive and triplet losses train models where semantically similar texts cluster together, which is why embedding models work for semantic search.

Loss functions for regression

Mean Squared Error (MSE)

MSE = (1/n) Σ (y_pred - y_true)²

Penalises large errors heavily due to squaring. Sensitive to outliers. Default choice for regression.

Mean Absolute Error (MAE)

MAE = (1/n) Σ |y_pred - y_true|

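More robust to outliers than MSE because errors are penalised linearly rather than quadratically. The gradient has constant magnitude and does not shrink as the error approaches zero, which can make convergence near the minimum less smooth.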
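The difference in outlier sensitivity is easy to see numerically. The sketch below implements both formulas in plain Python and compares them on made-up predictions, once with small errors everywhere and once with a single large miss. The data values are illustrative assumptions.

```python
def mse(y_pred, y_true):
    """Mean squared error: (1/n) Σ (y_pred - y_true)²."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)


def mae(y_pred, y_true):
    """Mean absolute error: (1/n) Σ |y_pred - y_true|."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]
clean = [1.5, 2.5, 3.5, 4.5]      # every prediction is off by 0.5
outlier = [1.5, 2.5, 3.5, 14.0]   # the last prediction is off by 10

mse_clean, mae_clean = mse(clean, y_true), mae(clean, y_true)
mse_outlier, mae_outlier = mse(outlier, y_true), mae(outlier, y_true)
```

On this data the single outlier raises MSE from 0.25 to 25.1875, while MAE rises only from 0.5 to 2.875.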

Huber Loss

Combines MSE and MAE: quadratic for small errors, linear for large ones. Best of both worlds for noisy regression tasks.
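A minimal sketch of Huber loss, assuming the usual formulation with a threshold `delta` (not named in the text above) that sets where the loss switches from quadratic to linear:

```python
def huber(y_pred, y_true, delta=1.0):
    """Huber loss averaged over all examples.

    delta is the error size at which the loss switches from
    quadratic (MSE-like) to linear (MAE-like).
    """
    total = 0.0
    for p, t in zip(y_pred, y_true):
        err = abs(p - t)
        if err <= delta:
            total += 0.5 * err ** 2               # quadratic region
        else:
            total += delta * (err - 0.5 * delta)  # linear region
    return total / len(y_true)


small = huber([0.5], [0.0])   # error below delta, quadratic branch
large = huber([3.0], [0.0])   # error above delta, linear branch
```

The two branches meet at an error of exactly `delta`, so the loss and its gradient stay continuous across the switch.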

Loss functions for classification

Binary Cross-Entropy

Loss = -[y·log(p) + (1-y)·log(1-p)]

Standard loss for binary classification. Penalises confident wrong predictions heavily.
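The formula can be evaluated directly for a single example. In this sketch the `eps` clamp is an added assumption to avoid taking log(0); it is not part of the formula itself.

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """Loss for one example with true label y (0 or 1) and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # keep p away from 0 and 1
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


confident_right = binary_cross_entropy(1, 0.9)    # small loss
confident_wrong = binary_cross_entropy(1, 0.01)   # much larger loss
```

With a true label of 1, predicting 0.9 costs about 0.105, while predicting 0.01 costs about 4.6, which is the heavy penalty for confident mistakes described above.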

Categorical Cross-Entropy

Loss = -Σ yᵢ · log(pᵢ)

Standard loss for multi-class classification. Works with softmax output.
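With a one-hot target, the sum collapses to the negative log-probability of the true class. The sketch below pairs it with a numerically stable softmax; the function names and the use of a class index instead of a one-hot vector are illustrative choices.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(logits, target_index):
    """-Σ yᵢ · log(pᵢ) for a one-hot target, which reduces to -log(p_target)."""
    probs = softmax(logits)
    return -math.log(probs[target_index])


uniform = categorical_cross_entropy([0.0, 0.0, 0.0], 1)   # equals log(3)
confident = categorical_cross_entropy([5.0, 0.0, 0.0], 0)
```

Equal logits across three classes give a loss of log(3), the value a model gets by guessing uniformly.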

Focal Loss

Downweights easy examples so the model focuses on hard ones. Designed for class-imbalanced datasets.
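No formula is given above, so the sketch below assumes the commonly used binary form FL = -(1 - p_t)^γ · log(p_t), where p_t is the probability assigned to the true class and γ (the `gamma` parameter) controls how strongly easy examples are downweighted. The optional class-balancing weight found in some formulations is omitted.

```python
import math


def focal_loss(y, p, gamma=2.0, eps=1e-12):
    """Binary focal loss for one example (assumed form, no class weight)."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p   # probability given to the true class
    # (1 - p_t) ** gamma shrinks the loss when the model is already confident
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = focal_loss(1, 0.9)                # well-classified example
hard = focal_loss(1, 0.1)                # badly misclassified example
plain = focal_loss(1, 0.9, gamma=0.0)    # gamma = 0 recovers cross-entropy
```

With `gamma=2`, the easy example's loss is scaled by roughly 0.01 relative to cross-entropy, while the hard example keeps about 81% of its loss.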

Loss functions for embedding models

This is where loss functions become directly relevant to search infrastructure. Embedding models for semantic search are trained with metric learning losses that shape the vector space:

Contrastive Loss

Pulls positive pairs (semantically similar) together and pushes negative pairs apart:

Loss = y·d(a,b)² + (1-y)·max(margin - d(a,b), 0)²
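A direct transcription of this formula, assuming d is Euclidean distance and y = 1 marks a positive pair (both are assumptions; the formula itself does not fix the distance function):

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a, b, y, margin=1.0):
    """y = 1 for a similar pair, y = 0 for a dissimilar pair."""
    d = euclidean(a, b)
    # Positive pairs pay for any distance; negatives pay only inside the margin
    return y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2


near, far = [0.3, 0.4], [3.0, 4.0]     # distances 0.5 and 5.0 from the origin
origin = [0.0, 0.0]

pos_loss = contrastive_loss(origin, near, y=1)      # 0.5² = 0.25
neg_inside = contrastive_loss(origin, near, y=0)    # (1 - 0.5)² = 0.25
neg_outside = contrastive_loss(origin, far, y=0)    # beyond the margin, so 0
```

Once a negative pair is further apart than the margin it contributes nothing, so training effort goes to the pairs that are still too close.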

Triplet Loss

Takes an anchor, positive (similar), and negative (dissimilar) example. Minimises anchor-positive distance, maximises anchor-negative distance with a margin:

Loss = max(d(anchor, positive) - d(anchor, negative) + margin, 0)
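The same formula as a short sketch, again assuming Euclidean distance for d and a hypothetical default margin of 0.5:

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero once the negative is at least `margin` further away than the positive."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


anchor, positive = [0.0, 0.0], [0.0, 1.0]
easy_negative = [0.0, 3.0]   # far from the anchor, loss is zero
hard_negative = [0.0, 1.2]   # only slightly further than the positive

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
```

Easy negatives produce zero loss and therefore zero gradient, so only negatives close to the anchor contribute to training.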

Multiple Negatives Ranking Loss (MNRL)

Uses all other examples in a batch as negatives. Efficient and effective. Used to train many of the best open-source embedding models, including those available on SIE.

InfoNCE / NT-Xent

Contrastive loss used in self-supervised learning (SimCSE, CLIP). Maximises similarity of positive pairs relative to all negatives in the batch.
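Both MNRL and InfoNCE-style losses can be read as a softmax cross-entropy over in-batch similarity scores. The sketch below is one way to write that idea down, not the exact objective of any particular model or library: it assumes cosine similarity, and `scale` is a hypothetical parameter acting as an inverse temperature.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_batch_ranking_loss(queries, docs, scale=20.0):
    """queries[i] pairs with docs[i]; every docs[j] with j != i is a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        scores = [scale * cosine(q, d) for d in docs]
        # Stable log-sum-exp over the whole batch
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Cross-entropy with the matching document as the correct class
        total += log_z - scores[i]
    return total / len(queries)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_ranking_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_ranking_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
```

Because each query is scored against every document in the batch, a batch of size n provides n - 1 negatives per query at no extra encoding cost.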

How does loss function choice affect embedding quality?

Loss	Typical use	Properties
Contrastive	Pair-level similarity	Requires balanced pairs
Triplet	Retrieval, face recognition	Needs hard negative mining
MNRL	Sentence embeddings (STS, retrieval)	In-batch negatives, efficient
InfoNCE	Self-supervised pretraining	Large batch = better performance

Many state-of-the-art embedding models (such as BGE-M3, E5, and GTE) use variants of MNRL or InfoNCE during fine-tuning. SIE hosts these models so you get the benefit of carefully chosen loss functions without training from scratch.

Frequently asked questions

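Why can’t I use accuracy as a loss function?

Accuracy is a step function of the model’s outputs: it changes only when a prediction crosses the decision threshold, so its gradient is zero almost everywhere and undefined at the jumps. Gradient descent needs a loss whose gradient carries useful information, which is why differentiable (or at least piecewise-differentiable) surrogates such as cross-entropy are used instead.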
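A small numerical illustration of this point, using a hypothetical one-weight logistic model p = sigmoid(w·x) on made-up data: accuracy gives the same value for two quite different weights, while binary cross-entropy still distinguishes them.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def accuracy(w, xs, ys):
    """Fraction of examples classified correctly at a 0.5 threshold."""
    preds = [1 if sigmoid(w * x) >= 0.5 else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)


def mean_bce(w, xs, ys):
    """Average binary cross-entropy of the same model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(ys)


xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

acc_small, acc_large = accuracy(0.5, xs, ys), accuracy(2.0, xs, ys)
bce_small, bce_large = mean_bce(0.5, xs, ys), mean_bce(2.0, xs, ys)
```

Both weights reach an accuracy of 1.0, so accuracy offers no direction to move in, whereas the cross-entropy is lower at w = 2.0 and its gradient still points towards more confident correct predictions.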

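What loss function should I use for re-ranking?

For learning-to-rank models, ranking-specific objectives such as LambdaRank (the gradient scheme used by LambdaMART) and ListNet are designed around ranking metrics like NDCG and MRR. Because those metrics are not differentiable, these methods optimise them through surrogates or metric-weighted gradients rather than directly. For cross-encoder rerankers, binary cross-entropy (relevant vs not relevant) is most common.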

Does SIE use loss functions at inference time?

No. Loss functions are only used during training. At inference time, SIE runs a forward pass through frozen model weights to produce embeddings.