---
title: Optimizer
description: "Complete guide to machine learning optimizers: SGD, Adam, RMSprop & gradient descent. Learn neural network training algorithms with real-world examples."
canonical_url: https://superlinked.com/glossary/optimizer
last_updated: 2026-06-11
---

# What is an Optimizer in Machine Learning?

An optimizer is the algorithm that updates a neural network's weights during training to minimise the loss function. It uses the gradients computed by backpropagation to determine how much and in which direction to adjust each weight. The choice of optimizer affects training speed, stability, and the quality of the final model.

---

## Why does the optimizer matter?

Plain gradient descent (moving weights in the direction of steepest loss reduction) works in principle but is often slow and can be unstable in practice. Modern optimizers add mechanisms such as momentum, per-parameter adaptive learning rates, and bias correction that typically make training faster and more reliable.

Every embedding model hosted on SIE was trained using an optimizer (typically Adam or AdamW). Understanding optimizers helps you reason about fine-tuning behaviour and training stability when adapting models for your domain.

---

## How does gradient descent work?

The basic update rule:

```
w = w - η × ∂L/∂w
```

Where:
- `w` = weight
- `η` = learning rate (step size)
- `∂L/∂w` = gradient of the loss with respect to the weight

**The problem:** a single fixed learning rate is hard to choose. If it is too large, training diverges; if it is too small, training is painfully slow. Gradients also tend to oscillate across narrow valleys of the loss surface, which slows progress along the valley floor.
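To make the learning rate trade-off concrete, here is a minimal sketch that applies the update rule to a toy loss `L(w) = (w - 3)²`. The loss, the starting point, and the three learning rates are illustrative assumptions, not values from any real training run.

```python
def gradient_descent(lr, steps=50, w=0.0):
    """Minimise the toy loss L(w) = (w - 3)^2 with w = w - lr * dL/dw."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # dL/dw for the toy loss
        w = w - lr * grad
    return w

w_small = gradient_descent(lr=0.001)  # too small: still far from 3 after 50 steps
w_good = gradient_descent(lr=0.1)     # reasonable: ends very close to 3
w_large = gradient_descent(lr=1.1)    # too large: each step overshoots further

print(w_small, w_good, w_large)
```

On this loss each step multiplies the distance to the minimum by `1 - 2·lr`, so any learning rate above 1.0 makes the distance grow instead of shrink.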

---

## Main optimizer algorithms

### SGD with Momentum
Adds a velocity term that accumulates past gradients, smoothing oscillations and speeding up convergence in consistent directions:

```
v = β·v - η·∇L
w = w + v
```

Offers fine-grained control over training, but requires careful learning rate tuning.
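The following sketch applies the momentum update to a toy "narrow valley" loss, `L(x, y) = 0.5·(x² + 50·y²)`, and compares it with plain gradient descent (β = 0) at the same learning rate. The loss, starting point, and hyperparameters are assumptions chosen only for illustration.

```python
def grad(w):
    """Gradient of the toy loss L(x, y) = 0.5 * (x^2 + 50 * y^2)."""
    x, y = w
    return (x, 50.0 * y)

def loss(w):
    x, y = w
    return 0.5 * (x * x + 50.0 * y * y)

def run(lr, beta, steps=100):
    w = (1.0, 1.0)   # starting point (assumed)
    v = (0.0, 0.0)   # velocity starts at zero
    for _ in range(steps):
        g = grad(w)
        v = tuple(beta * vi - lr * gi for vi, gi in zip(v, g))  # v = beta*v - lr*grad
        w = tuple(wi + vi for wi, vi in zip(w, v))              # w = w + v
    return loss(w)

plain_loss = run(lr=0.01, beta=0.0)     # beta = 0 reduces to plain gradient descent
momentum_loss = run(lr=0.01, beta=0.9)  # velocity accumulates along the shallow x direction

print(plain_loss, momentum_loss)
```

With the same learning rate, the momentum run makes much faster progress along the shallow `x` direction, which is where plain gradient descent spends most of its time.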

### Adam (Adaptive Moment Estimation)
One of the most widely used optimizers. It maintains per-parameter adaptive learning rates using running estimates of the first moment (mean) and second moment (uncentered variance) of the gradients:

```
m = β₁·m + (1-β₁)·∇L        # first moment (mean)
v = β₂·v + (1-β₂)·(∇L)²     # second moment (uncentered variance), element-wise square
m̂ = m / (1-β₁ᵗ)             # bias correction at step t
v̂ = v / (1-β₂ᵗ)
w = w - η·m̂/(√v̂ + ε)
```

The default hyperparameters (β₁=0.9, β₂=0.999, ε=1e-8) work well across many tasks. Adam usually converges quickly and is less sensitive to the learning rate than plain SGD, although the learning rate still needs tuning.
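As a rough illustration, the sketch below runs Adam on a single scalar weight with the toy loss `L(w) = (w - 3)²`. It includes the bias-corrected moments m̂ and v̂ that compensate for `m` and `v` starting at zero. The function name `adam_minimise`, the toy loss, and the learning rate of 0.1 are assumptions for this example only.

```python
import math

def adam_minimise(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Scalar Adam: returns the weight after `steps` updates."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (mean of squared gradients)
        m_hat = m / (1 - beta1 ** t)           # bias correction for zero initialisation
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

def toy_grad(w):
    return 2.0 * (w - 3.0)  # gradient of the toy loss (w - 3)^2

w_first = adam_minimise(toy_grad, w=0.0, steps=1)
w_final = adam_minimise(toy_grad, w=0.0, steps=1000)
print(w_first, w_final)
```

On the very first step the bias-corrected ratio m̂/√v̂ has magnitude close to 1, so the weight moves by roughly `lr` regardless of how large the gradient is. This is one reason Adam is less sensitive to gradient scale than plain gradient descent.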

### AdamW
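Adam with **weight decay decoupled** from the gradient update. In plain Adam, an L2 penalty added to the gradient is rescaled by the adaptive term, so parameters with large gradients end up regularised less than intended. AdamW instead applies the decay directly to the weights. It is the default for training transformer-based models, including most embedding models: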

```
w = w - η·m̂/(√v̂ + ε) - η·λ·w
```

Where λ is the weight decay coefficient.
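To show what "decoupled" changes, the sketch below takes one first step on a scalar weight in two ways: with the decay applied directly to the weight (AdamW style) and with an L2 penalty folded into the gradient (Adam with L2). The helper `adam_step` and its values are hypothetical and only cover the first step, where the moments start at zero.

```python
import math

def adam_step(w, grad, lr=0.1, weight_decay=0.01, decoupled=True,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """First update (t = 1, zero-initialised moments) for one scalar weight."""
    if not decoupled:
        grad = grad + weight_decay * w      # L2 penalty folded into the gradient
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1)                 # bias correction at t = 1
    v_hat = v / (1 - beta2)
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        w_new -= lr * weight_decay * w      # decay applied outside the adaptive scaling
    return w_new

# With a zero loss gradient, only the regularisation term moves the weight.
w_adamw = adam_step(1.0, grad=0.0, decoupled=True)   # shrinks by lr * weight_decay * w
w_l2 = adam_step(1.0, grad=0.0, decoupled=False)     # adaptive scaling turns the tiny penalty into a step of about lr
print(w_adamw, w_l2)
```

In the decoupled case the shrinkage is proportional to λ, as intended. In the coupled case the adaptive normalisation largely cancels the size of λ, which is the behaviour AdamW was designed to avoid.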

### Learning rate schedulers
Optimizers are typically paired with a learning rate schedule:

| Schedule | Behaviour | Common use |
|---|---|---|
| Constant | Fixed η throughout | Simple baselines |
| Linear warmup + decay | Ramps up then decays | Transformer fine-tuning |
| Cosine annealing | Smooth cosine decay | Long training runs |
| Reduce on plateau | Drops η when loss stalls | General purpose |
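Two of the schedules in the table can be written as plain functions of the step count. This is a minimal sketch; the function names and the example values (`base_lr=2e-4`, 100 warmup steps, 1000 total steps) are assumptions for illustration, not a specific library's API.

```python
import math

def linear_warmup_decay(step, base_lr, warmup_steps, total_steps):
    """Ramp linearly from 0 to base_lr, then decay linearly back to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

def cosine_annealing(step, base_lr, total_steps, min_lr=0.0):
    """Decay smoothly from base_lr to min_lr along half a cosine wave."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in (0, 50, 100, 550, 1000):
    print(step,
          linear_warmup_decay(step, 2e-4, 100, 1000),
          cosine_annealing(step, 2e-4, 1000))
```

The warmup phase keeps early updates small while Adam's moment estimates are still noisy; the decay phase lets the model settle as training ends.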

---

## Optimizer comparison

| Optimizer | Adaptive LR | Momentum | Best for |
|---|---|---|---|
| SGD | ✗ | Optional | Vision models with tuning |
| SGD + Momentum | ✗ | ✓ | Stable, well-understood |
| Adam | ✓ | ✓ | Most deep learning tasks |
| AdamW | ✓ | ✓ | Transformer fine-tuning (default) |
| Adafactor | ✓ | Optional | Memory-efficient (large models) |

---

## Optimizers and LoRA fine-tuning

When fine-tuning an embedding model with LoRA (the approach SIE uses for domain adaptation), AdamW with linear warmup and decay is a common recipe:

```python
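from torch.optim import AdamW  # assumes PyTorch
from transformers import get_linear_schedule_with_warmup  # assumes Hugging Face Transformers

# lora_parameters: iterable of trainable LoRA adapter parameters (defined elsewhere)
# total_steps: total number of optimizer steps planned for the run (defined elsewhere)
optimizer = AdamW(lora_parameters, lr=2e-4, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,  # ramp the learning rate up from 0 over the first 100 steps
    num_training_steps=total_steps,
)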
```

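Only the LoRA adapter parameters are updated; the base model weights are frozen. This greatly reduces memory requirements, because gradients and optimizer state (Adam keeps two moment estimates per trainable parameter) are stored only for the adapters.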
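A back-of-the-envelope count shows where the savings come from. The sketch assumes the usual LoRA layout, in which a frozen `d_out × d_in` weight matrix gets two small trainable matrices of rank `r`. The layer size (768) and rank (8) are illustrative assumptions, not SIE settings.

```python
def full_trainable_params(d_in, d_out):
    """Trainable weights when the whole matrix is fine-tuned."""
    return d_in * d_out

def lora_trainable_params(d_in, d_out, rank):
    """Trainable weights in the two adapter matrices: (rank x d_in) and (d_out x rank)."""
    return rank * (d_in + d_out)

d_in = d_out = 768   # assumed layer width
rank = 8             # assumed LoRA rank

full = full_trainable_params(d_in, d_out)
lora = lora_trainable_params(d_in, d_out, rank)

# Adam-style optimizers keep two moment estimates per trainable parameter.
adam_state_full = 2 * full
adam_state_lora = 2 * lora

print(f"full: {full} params, {adam_state_full} optimizer values")
print(f"LoRA: {lora} params, {adam_state_lora} optimizer values")
print(f"LoRA trains {lora / full:.1%} of this layer's weights")
```

For this single layer the adapters hold 48 times fewer trainable weights than the full matrix, and the optimizer state shrinks by the same factor.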

---

## Frequently asked questions

**What learning rate should I use?**
For LoRA fine-tuning of transformer embedding models with AdamW, 1e-4 to 5e-4 is a common starting range. Full fine-tuning of all weights usually needs a smaller rate, often around 1e-5 to 5e-5. Use warmup steps to reduce early instability.

**Why does Adam sometimes generalise worse than SGD?**
One common explanation is that Adam's adaptive learning rates can lead it to sharper minima that generalise worse, whereas SGD with momentum tends to find flatter minima on some tasks, notably in vision. For embedding model fine-tuning, AdamW typically generalises well, helped by its decoupled weight decay.

**Does the optimizer affect inference?**
No. The optimizer is only used during training. At inference time (when SIE encodes documents), only the forward pass runs through frozen weights.

---

## Related resources

- [What is backpropagation?](/glossary/backpropagation)
- [What is a loss function?](/glossary/loss-function)
- [What is a LoRA adapter?](/glossary/what-is-a-lora-adapter)
- [What is a neural network?](/glossary/neural-networks)
- [Browse models on SIE](/models)
