What is an Optimizer in Machine Learning?

An optimizer is the algorithm that updates a neural network’s weights during training to minimise the loss function. It uses the gradients computed by backpropagation to determine how much and in which direction to adjust each weight. The choice of optimizer affects training speed, stability, and the quality of the final model.


Why does the optimizer matter?

Gradient descent alone (moving weights in the direction of steepest loss reduction) works in principle but is slow and unstable in practice. Modern optimizers add mechanisms like momentum, adaptive learning rates, and variance correction that make training dramatically faster and more reliable.

Every embedding model hosted on SIE was trained using an optimizer (typically Adam or AdamW). Understanding optimizers helps you reason about fine-tuning behaviour and training stability when adapting models for your domain.


How does gradient descent work?

The basic update rule:

w = w - η × ∂L/∂w

Where:

  • w = weight
  • η = learning rate (step size)
  • ∂L/∂w = gradient of the loss with respect to the weight

The problem: a single fixed learning rate is hard to choose. Too large and training diverges; too small and training is painfully slow. Gradients also oscillate across narrow valleys of the loss surface, so progress along the valley floor is slow.
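As a minimal sketch of the update rule and its sensitivity to η, the snippet below minimises the toy loss L(w) = w², whose gradient is 2w. The loss function, the starting point, and the three learning rates are assumptions chosen purely for illustration.

```python
def gradient_descent(lr, steps=50, w=1.0):
    """Minimise the toy loss L(w) = w**2 with plain gradient descent."""
    for _ in range(steps):
        grad = 2.0 * w        # dL/dw for L(w) = w**2
        w = w - lr * grad     # w = w - η × ∂L/∂w
    return w


# Too small, reasonable, and too large learning rates on the same problem.
for lr in (0.001, 0.3, 1.1):
    final_w = gradient_descent(lr)
    print(f"lr={lr}: |w| after 50 steps = {abs(final_w):.3e}")
```

On this toy problem the smallest rate has barely moved after 50 steps, the middle one lands very close to the minimum at w = 0, and the largest one grows without bound.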


Main optimizer algorithms

SGD with Momentum

Adds a velocity term that accumulates past gradients, smoothing oscillations and speeding up convergence in consistent directions:

v = β·v - η·∇L
w = w + v

Gives fine-grained control over training, but requires careful learning rate tuning.
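The effect of the velocity term can be sketched on a toy "narrow valley": a quadratic loss that is 100 times steeper along y than along x. The loss, the learning rate of 0.018, β = 0.9, and the step count are illustrative assumptions, not recommended settings. Setting β = 0 recovers plain gradient descent.

```python
def loss(x, y):
    # A narrow valley: 100x steeper along y than along x.
    return 0.5 * (x * x + 100.0 * y * y)


def grad(x, y):
    return x, 100.0 * y


def run(beta, lr=0.018, steps=300):
    """Apply v = β·v - η·∇L, w = w + v and return the final loss."""
    x, y = 1.0, 1.0
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = beta * vx - lr * gx
        vy = beta * vy - lr * gy
        x, y = x + vx, y + vy
    return loss(x, y)


gd_loss = run(beta=0.0)        # plain gradient descent
momentum_loss = run(beta=0.9)  # SGD with momentum
print(f"plain GD loss:  {gd_loss:.3e}")
print(f"momentum loss:  {momentum_loss:.3e}")
```

With the same learning rate, the momentum run ends at a much lower loss in this setup because the accumulated velocity speeds up progress along the shallow x direction.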

Adam (Adaptive Moment Estimation)

One of the most widely used optimizers. Maintains per-parameter adaptive learning rates using running estimates of the first moment (mean) and second moment (uncentered variance) of the gradients:

m = β₁·m + (1-β₁)·∇L        # first moment (mean)
v = β₂·v + (1-β₂)·(∇L)²     # second moment (uncentered variance), squared element-wise
m̂ = m / (1-β₁ᵗ)             # bias correction at step t
v̂ = v / (1-β₂ᵗ)
w = w - η·m̂/(√v̂ + ε)

The default parameters (β₁=0.9, β₂=0.999, ε=1e-8) work well across many tasks. Adam usually converges quickly and is less sensitive to the learning rate than plain SGD, though the learning rate still needs tuning.
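A minimal scalar sketch of one Adam step, including the bias-corrected estimates m̂ and v̂, might look like the following. The function name `adam_step` and the test gradients are hypothetical; real implementations apply the same arithmetic element-wise to tensors.

```python
import math


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# The first step has nearly the same size whatever the gradient's scale.
for g in (1e-3, 1.0, 1e3):
    w_new, _, _ = adam_step(w=0.0, grad=g, m=0.0, v=0.0, t=1)
    print(f"grad={g:g}: step size = {abs(w_new):.6e}")
```

Dividing by √v̂ normalises each parameter's step, which is why the step size here stays close to the learning rate across six orders of magnitude of gradient scale.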

AdamW

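Adam with weight decay decoupled from the gradient update. In plain Adam, an L2 penalty is added to the gradient and is therefore rescaled by the adaptive denominator, so the effective regularisation varies from parameter to parameter. AdamW instead applies the decay directly to the weights. It is a common default for training transformer-based models, including most embedding models: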

w = w - η·m̂/(√v̂ + ε) - η·λ·w

Where λ is the weight decay coefficient.
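The difference between decoupled weight decay and an L2 penalty folded into the gradient can be sketched for a single scalar weight. The helper names below are hypothetical, and the loss gradient is set to zero so that only the regularisation acts on the weight.

```python
import math


def adam_direction(grad, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the bias-corrected Adam direction m̂/(√v̂ + ε) and the new moments."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v


def adam_l2_step(w, loss_grad, m, v, t, lr, wd):
    # L2 penalty: the decay term goes through the adaptive rescaling.
    direction, m, v = adam_direction(loss_grad + wd * w, m, v, t)
    return w - lr * direction, m, v


def adamw_step(w, loss_grad, m, v, t, lr, wd):
    # Decoupled decay: the term η·λ·w is applied directly to the weight.
    direction, m, v = adam_direction(loss_grad, m, v, t)
    return w - lr * direction - lr * wd * w, m, v


for wd in (0.01, 0.1):
    w_l2, _, _ = adam_l2_step(1.0, 0.0, 0.0, 0.0, 1, lr=1e-3, wd=wd)
    w_dec, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, 1, lr=1e-3, wd=wd)
    print(f"λ={wd}: Adam+L2 shrink = {1.0 - w_l2:.2e}, AdamW shrink = {1.0 - w_dec:.2e}")
```

In this first step the Adam plus L2 variant shrinks the weight by about η whatever the value of λ, because the normalisation cancels the penalty's scale, while AdamW shrinks it by η·λ·w as the formula states.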

Learning rate schedulers

Optimizers are typically paired with a learning rate schedule:

| Schedule | Behaviour | Common use |
|---|---|---|
| Constant | Fixed η throughout | Simple baselines |
| Linear warmup + decay | Ramps up then decays | Transformer fine-tuning |
| Cosine annealing | Smooth cosine decay | Long training runs |
| Reduce on plateau | Drops η when loss stalls | General purpose |
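Two of these schedules can be sketched as plain functions of the step number. The function names, the 1,000-step run, and the base rate of 2e-4 are illustrative assumptions; library schedulers typically wrap the same shapes around an optimizer object.

```python
import math


def linear_warmup_decay(step, total_steps, warmup_steps, base_lr):
    """Ramp linearly from 0 to base_lr, then decay linearly back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


def cosine_annealing(step, total_steps, base_lr, min_lr=0.0):
    """Decay from base_lr to min_lr along half a cosine wave."""
    progress = min(1.0, step / max(1, total_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 550, 1000):
    linear = linear_warmup_decay(step, total_steps=1000, warmup_steps=100, base_lr=2e-4)
    cosine = cosine_annealing(step, total_steps=1000, base_lr=2e-4)
    print(f"step {step:4d}: linear={linear:.2e}  cosine={cosine:.2e}")
```

The linear schedule peaks at the end of warmup and reaches zero on the last step, while the cosine schedule starts at the base rate and flattens out as it approaches the minimum.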

Optimizer comparison

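| Optimizer | Adaptive LR | Momentum | Best for |
|---|---|---|---|
| SGD | No | Optional | Vision models with tuning |
| SGD + Momentum | No | Yes | Stable, well-understood |
| Adam | Yes | Yes | Most deep learning tasks |
| AdamW | Yes | Yes | Transformer fine-tuning (default) |
| Adafactor | Yes | Optional | Memory-efficient (large models) |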

Optimizers and LoRA fine-tuning

When fine-tuning an embedding model with LoRA (the approach SIE uses for domain adaptation), AdamW with linear warmup is a common recipe:

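from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# lora_parameters: the trainable LoRA adapter parameters
# total_steps: number of optimizer steps in the whole run
optimizer = AdamW(lora_parameters, lr=2e-4, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,  # ramp the learning rate up over the first 100 steps
    num_training_steps=total_steps,
)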

Only the LoRA adapter parameters are updated; the base model weights are frozen. This dramatically reduces memory and compute requirements.
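A rough back-of-the-envelope sketch shows where the savings come from. Adam and AdamW keep two moment estimates per trainable parameter, and a rank-r LoRA adapter on a d_out × d_in weight matrix trains r·(d_in + d_out) parameters instead of d_in·d_out. The 768 × 768 layer and rank 8 below are hypothetical values chosen for illustration.

```python
def lora_params(d_in, d_out, rank):
    # LoRA learns B (d_out x rank) and A (rank x d_in) instead of a full update.
    return rank * (d_in + d_out)


def adam_state_floats(trainable_params):
    # Adam/AdamW store a first and a second moment per trainable parameter.
    return 2 * trainable_params


d_in = d_out = 768   # hypothetical layer size
rank = 8             # hypothetical LoRA rank

full = d_in * d_out
lora = lora_params(d_in, d_out, rank)

print(f"full fine-tuning: {full:,} trainable, {adam_state_floats(full):,} optimizer floats")
print(f"LoRA (rank {rank}):    {lora:,} trainable, {adam_state_floats(lora):,} optimizer floats")
print(f"reduction factor: {full // lora}x")
```

The count covers only one weight matrix and ignores activations and the frozen weights themselves, so it illustrates the optimizer-state saving rather than total memory use.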


Frequently asked questions

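What learning rate should I use? For LoRA fine-tuning of transformer embedding models with AdamW, 1e-4 to 5e-4 is a common starting range. Full fine-tuning, where every weight is updated, usually needs a lower rate, often around 1e-5 to 5e-5. Use warmup steps to reduce early instability.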

Why does Adam sometimes generalise worse than SGD? One common explanation is that Adam's adaptive learning rates can steer it towards sharp minima that generalise poorly, whereas SGD with momentum tends to find flatter minima on some tasks. For embedding model fine-tuning, AdamW typically generalises well, helped by its decoupled weight decay.

Does the optimizer affect inference? No. The optimizer is only used during training. At inference time (when SIE encodes documents), only the forward pass runs through frozen weights.

