Model Training

What is an Optimizer in Machine Learning?

An optimizer is the algorithm that updates a neural network’s weights during training to minimise the loss function. It uses the gradients computed by backpropagation to determine how much and in which direction to adjust each weight. The choice of optimizer affects training speed, stability, and the quality of the final model.

Why does the optimizer matter?

Gradient descent alone (moving weights in the direction of steepest loss reduction) works in principle but is slow and unstable in practice. Modern optimizers add mechanisms like momentum, adaptive learning rates, and variance correction that make training dramatically faster and more reliable.

Every embedding model hosted on SIE was trained using an optimizer (typically Adam or AdamW). Understanding optimizers helps you reason about fine-tuning behaviour and training stability when adapting models for your domain.

How does gradient descent work?

The basic update rule:

w = w - η × ∂L/∂w

Where:

w = weight
η = learning rate (step size)
∂L/∂w = gradient of the loss with respect to the weight

The problem: a fixed learning rate is either too large (training diverges) or too small (training is painfully slow). And gradients oscillate in narrow valleys of the loss surface.

Main optimizer algorithms

SGD with Momentum

Adds a velocity term that accumulates past gradients, smoothing oscillations and speeding up convergence in consistent directions:

v = β·v - η·∇L
w = w + v

Good for fine-tuned control but requires careful learning rate tuning.

Adam (Adaptive Moment Estimation)

The most widely used optimizer. Maintains per-parameter adaptive learning rates using estimates of first (mean) and second (variance) moments of gradients:

m = β₁·m + (1-β₁)·∇L       # first moment (mean)
v = β₂·v + (1-β₂)·∇L²      # second moment (variance)
w = w - η·m̂/√(v̂ + ε)

Default parameters (β₁=0.9, β₂=0.999) work well across most tasks. Adam converges quickly and is robust to learning rate choice.

AdamW

Adam with weight decay decoupled from the gradient update, the standard fix for Adam’s tendency to under-regularise. AdamW is the default for training transformer-based models including most embedding models:

w = w - η·m̂/√(v̂ + ε) - η·λ·w

Where λ is the weight decay coefficient.

Learning rate schedulers

Optimizers are typically paired with a learning rate schedule:

Schedule	Behaviour	Common use
Constant	Fixed η throughout	Simple baselines
Linear warmup + decay	Ramps up then decays	Transformer fine-tuning
Cosine annealing	Smooth cosine decay	Long training runs
Reduce on plateau	Drops η when loss stalls	General purpose

Optimizer comparison

Optimizer	Adaptive LR	Momentum	Best for
SGD	✗	Optional	Vision models with tuning
SGD + Momentum	✗	✓	Stable, well-understood
Adam	✓	✓	Most deep learning tasks
AdamW	✓	✓	Transformer fine-tuning (default)
Adafactor	✓	Optional	Memory-efficient (large models)

Optimizers and LoRA fine-tuning

When fine-tuning an embedding model with LoRA (the approach SIE uses for domain adaptation), AdamW with linear warmup is the standard recipe:

optimizer = AdamW(lora_parameters, lr=2e-4, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,
    num_training_steps=total_steps
)

Only the LoRA adapter parameters are updated; the base model weights are frozen. This dramatically reduces memory and compute requirements.

Frequently asked questions

What learning rate should I use? For AdamW fine-tuning transformer embedding models, 1e-4 to 5e-4 is a common starting range. Always use warmup steps to prevent early instability.

Why does Adam sometimes generalise worse than SGD? Adam’s adaptive learning rates can cause it to find sharp minima that generalise poorly. SGD with momentum finds flatter minima on some tasks. For embedding model fine-tuning, AdamW typically generalises well due to weight decay.

Does the optimizer affect inference? No. The optimizer is only used during training. At inference time (when SIE encodes documents), only the forward pass runs through frozen weights.