---
title: Loss Functions
description: "Master loss functions in machine learning: MSE, cross-entropy, MAE & more. Complete guide to choosing the right loss function for neural network training and model optimization."
canonical_url: https://superlinked.com/glossary/loss-function
last_updated: 2026-06-11
---

# What is a Loss Function?

A loss function (also called a cost function or objective function) measures how wrong a model's predictions are compared to the true values. During training, backpropagation computes the gradient of the loss with respect to the model weights, and the optimiser uses that gradient to adjust the weights and reduce the loss. The choice of loss function determines what the model is optimised for, and directly shapes the representations it learns.
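
As a minimal sketch of that training loop, the example below fits a single weight `w` with plain gradient descent on a mean squared error. The toy data, the learning rate `lr`, and the helper name `mse_and_grad` are illustrative assumptions, and the gradient is derived by hand rather than by backpropagation through a network.

```python
def mse_and_grad(w, xs, ys):
    """Return the MSE of the model y = w * x and its gradient with respect to w."""
    n = len(xs)
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    return loss, grad


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # toy targets generated by y = 2x

w = 0.0    # initial weight
lr = 0.05  # learning rate (assumed value)
for _ in range(200):
    loss, grad = mse_and_grad(w, xs, ys)
    w -= lr * grad  # step against the gradient to reduce the loss

final_loss, _ = mse_and_grad(w, xs, ys)
print(w, final_loss)  # w ends close to 2.0 and the loss close to 0
```

Swapping the loss changes the gradient, and therefore what the same loop converges to.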

---

## Why does the loss function matter?

The loss function is the signal that training optimises. A model becomes good at exactly what you measure, so choosing the wrong loss leads to models that are technically "optimal" by the metric but useless in practice.

For embedding models, the loss function determines the structure of the vector space. Contrastive and triplet losses train models where semantically similar texts cluster together, which is why embedding models work for semantic search.

---

## Loss functions for regression

**Mean Squared Error (MSE)**
```
MSE = (1/n) Σ (y_pred - y_true)²
```
Penalises large errors heavily due to squaring. Sensitive to outliers. Default choice for regression.

**Mean Absolute Error (MAE)**
```
MAE = (1/n) Σ |y_pred - y_true|
```
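More robust to outliers because errors are penalised linearly rather than quadratically. The gradient has constant magnitude and does not shrink as the error approaches zero, which can make convergence near the minimum less smooth.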
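
To illustrate the difference in outlier sensitivity, the sketch below evaluates both losses on the same targets, once with uniformly small errors and once with a single large error. The data values are made up for illustration.

```python
def mse(y_pred, y_true):
    """Mean squared error over two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)


def mae(y_pred, y_true):
    """Mean absolute error over two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]
clean = [1.5, 2.5, 3.5, 4.5]     # every prediction is off by 0.5
outlier = [1.5, 2.5, 3.5, 14.0]  # the last prediction is off by 10

print(mse(clean, y_true), mae(clean, y_true))      # 0.25 0.5
print(mse(outlier, y_true), mae(outlier, y_true))  # 25.1875 2.875
```

One bad prediction multiplies the MSE by roughly 100 but the MAE by less than 6.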

**Huber Loss**
Combines MSE and MAE: quadratic for errors below a threshold δ, linear for errors above it. This keeps gradients smooth near the minimum while limiting the influence of outliers, which suits noisy regression tasks.
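
A minimal sketch of the standard Huber formulation follows. The threshold parameter `delta` and its default of 1.0 are assumptions, since the text above does not fix a value.

```python
def huber(y_pred, y_true, delta=1.0):
    """Mean Huber loss: quadratic within delta of the target, linear beyond it."""
    total = 0.0
    for p, t in zip(y_pred, y_true):
        err = abs(p - t)
        if err <= delta:
            total += 0.5 * err ** 2              # MSE-like region
        else:
            total += delta * (err - 0.5 * delta)  # MAE-like region
    return total / len(y_true)


print(huber([0.5], [0.0]))  # small error, quadratic branch
print(huber([3.0], [0.0]))  # large error, linear branch
```

The two branches meet at `err == delta`, so the loss and its slope are continuous there.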

---

## Loss functions for classification

**Binary Cross-Entropy**
```
Loss = -[y·log(p) + (1-y)·log(1-p)]
```
Standard loss for binary classification. Penalises confident wrong predictions heavily.
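
The formula above can be sketched for a single example as follows. The clipping constant `eps` is an added assumption that keeps `log` away from zero; it is not part of the formula.

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """Cross-entropy for one example with label y in {0, 1} and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


# True label is 1; the loss grows sharply as the prediction becomes confidently wrong.
for p in (0.9, 0.5, 0.1, 0.01):
    print(p, binary_cross_entropy(1, p))
```

A prediction of 0.01 for a positive example costs roughly 44 times as much as a prediction of 0.9.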

**Categorical Cross-Entropy**
```
Loss = -Σ yᵢ · log(pᵢ)
```
Standard loss for multi-class classification. Works with softmax output.
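
As a sketch of how the softmax output and the loss fit together, the example below turns raw scores into probabilities and scores them against a one-hot label. The logits are made-up values and `eps` is an assumed guard against `log(0)`.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(y_onehot, probs, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted class probabilities."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_onehot, probs))


logits = [2.0, 1.0, 0.1]
probs = softmax(logits)

loss_if_class_0 = categorical_cross_entropy([1, 0, 0], probs)  # label matches the top score
loss_if_class_2 = categorical_cross_entropy([0, 0, 1], probs)  # label matches the lowest score
print(probs, loss_if_class_0, loss_if_class_2)
```

With a one-hot label only the true class contributes, so the loss reduces to the negative log-probability assigned to that class.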

**Focal Loss**
Downweights easy examples so the model focuses on hard ones. Designed for class-imbalanced datasets.
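
A minimal sketch of the binary case, using the common formulation in which cross-entropy is multiplied by `(1 - p_t) ** gamma`, where `p_t` is the probability assigned to the true class. The focusing parameter `gamma`, its default of 2.0, and the `eps` guard are assumptions not stated above, and the optional class-weighting term is omitted.

```python
import math


def focal_loss(y, p, gamma=2.0, eps=1e-12):
    """Binary focal loss for one example with label y in {0, 1} and predicted probability p."""
    p_t = p if y == 1 else 1.0 - p       # probability given to the true class
    p_t = min(max(p_t, eps), 1.0 - eps)  # avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = focal_loss(1, 0.9)  # well-classified example, strongly downweighted
hard = focal_loss(1, 0.1)  # misclassified example, almost full cross-entropy
print(easy, hard)
```

Setting `gamma=0.0` recovers plain binary cross-entropy, which makes the downweighting easy to compare.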

---

## Loss functions for embedding models

This is where loss functions become directly relevant to search infrastructure. Embedding models for semantic search are trained with **metric learning losses** that shape the vector space:

**Contrastive Loss**
Pulls positive pairs (semantically similar) together and pushes negative pairs apart:
```
Loss = y·d(a,b)² + (1-y)·max(margin - d(a,b), 0)²
```
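
The formula can be sketched as below, taking `d` to be Euclidean distance and `margin=1.0`. Both are assumptions, since the formula does not fix the distance function or the margin.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a, b, y, margin=1.0):
    """y = 1 for a positive pair, y = 0 for a negative pair."""
    d = euclidean(a, b)
    return y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2


print(contrastive_loss([0.0, 0.0], [3.0, 4.0], y=1))  # distant positive pair: large loss
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], y=0))  # distant negative pair: zero loss
print(contrastive_loss([0.0, 0.0], [0.6, 0.0], y=0))  # close negative pair: penalised
```

Negative pairs stop contributing once they are further apart than the margin, so training effort goes to the pairs that are still too close.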

**Triplet Loss**
Takes an anchor, positive (similar), and negative (dissimilar) example. Minimises anchor-positive distance, maximises anchor-negative distance with a margin:
```
Loss = max(d(anchor, positive) - d(anchor, negative) + margin, 0)
```
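
A minimal sketch of the formula, again assuming Euclidean distance for `d` and a margin of 1.0 (neither is fixed by the text):

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(gap + margin, 0.0)


anchor, positive = [0.0, 0.0], [1.0, 0.0]
easy_negative = [5.0, 0.0]  # already far from the anchor
hard_negative = [1.5, 0.0]  # only slightly further away than the positive

print(triplet_loss(anchor, positive, easy_negative))  # 0.0
print(triplet_loss(anchor, positive, hard_negative))  # 0.5
```

Triplets that already satisfy the margin produce zero loss and therefore no gradient, so only the harder triplets move the model.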

**Multiple Negatives Ranking Loss (MNRL)**
Uses all other examples in a batch as negatives. Efficient and effective. Used to train many of the best open-source embedding models including those available on SIE.

**InfoNCE / NT-Xent**
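Contrastive losses used in self-supervised and weakly supervised pretraining (SimCLR, SimCSE, CLIP). They maximise the similarity of positive pairs relative to all negatives in the batch, with a temperature parameter that scales the similarities before the softmax.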
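
In their simplest one-directional form, both MNRL and InfoNCE can be sketched as softmax cross-entropy over a batch similarity matrix in which the matching pair is the correct class. The example below assumes cosine similarity, a temperature of 0.05, and toy two-dimensional vectors; the function name `in_batch_contrastive_loss` is hypothetical and this is not any particular library's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_batch_contrastive_loss(queries, docs, temperature=0.05):
    """queries[i] and docs[i] are a positive pair; every docs[j] with j != i is a negative."""
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, d) / temperature for d in docs]
        m = max(logits)  # log-sum-exp with max subtraction for stability
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_sum_exp - logits[i])  # negative log-softmax of the positive
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each query is closest to its own document
swapped = [[0.1, 0.9], [0.9, 0.1]]  # each query is closest to the other document

print(in_batch_contrastive_loss(queries, aligned))  # near zero
print(in_batch_contrastive_loss(queries, swapped))  # large
```

Because every other item in the batch serves as a negative, a larger batch gives each positive pair more negatives at no extra labelling cost.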

---

## How does loss function choice affect embedding quality?

| Loss | Typical use | Properties |
|---|---|---|
| Contrastive | Pair-level similarity | Requires balanced pairs |
| Triplet | Retrieval, face recognition | Needs hard negative mining |
| MNRL | Sentence embeddings (STS, retrieval) | In-batch negatives, efficient |
| InfoNCE | Self-supervised pretraining | Large batch = better performance |

Most SOTA embedding models (BGE-M3, E5, GTE) use variants of MNRL or InfoNCE during fine-tuning. SIE hosts these models so you get the benefit of carefully chosen loss functions without training from scratch.

---

## Frequently asked questions

**Why can't I use accuracy as a loss function?**
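Accuracy is piecewise constant: it only changes when a prediction crosses the decision threshold, so its gradient is zero almost everywhere and undefined at the jumps. Gradient descent needs a loss whose gradient (or subgradient) indicates which direction improves the predictions, which is why differentiable surrogates such as cross-entropy are used instead.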
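
A small sketch of the problem: when predictions improve without crossing the threshold, accuracy does not move, while cross-entropy does. The probabilities and the 0.5 threshold are illustrative assumptions.

```python
import math


def accuracy(y_true, probs, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == t) for p, t in zip(preds, y_true)) / len(y_true)


def mean_bce(y_true, probs):
    """Mean binary cross-entropy over a batch."""
    total = sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, probs))
    return -total / len(y_true)


y_true = [1, 0, 1]
before = [0.60, 0.40, 0.55]  # correct but not confident
after = [0.70, 0.30, 0.65]   # same decisions, more confident

print(accuracy(y_true, before), accuracy(y_true, after))  # 1.0 1.0
print(mean_bce(y_true, before), mean_bce(y_true, after))  # the second value is lower
```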

**What loss function should I use for re-ranking?**
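For learning-to-rank models, ranking-specific objectives (LambdaRank, the gradient-boosted LambdaMART that builds on it, ListNet) optimise differentiable surrogates of ranking quality, since metrics like NDCG and MRR are themselves not differentiable. For cross-encoder rerankers, binary cross-entropy (relevant vs not-relevant) is most common.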

**Does SIE use loss functions at inference time?**
No. Loss functions are only used during training. At inference time, SIE runs a forward pass through frozen model weights to produce embeddings.

---

## Related resources

- [What is backpropagation?](/glossary/backpropagation)
- [What is an optimizer?](/glossary/optimizer)
- [What is a reranker?](/glossary/what-is-a-reranker)
- [Browse embedding models on SIE](/models)
- [What is semantic search?](/glossary/what-is-semantic-search)