What is a Transformer?

A transformer is a neural network architecture built around self-attention, introduced in the 2017 paper “Attention Is All You Need.” Unlike RNNs, which process tokens one step at a time, a transformer processes an entire sequence in parallel, which makes training on long sequences far more efficient. Transformers underpin nearly all modern large language models, embedding models, and rerankers used in search and AI systems.

Why do transformers matter for inference?

Every embedding model and reranker hosted on SIE is a transformer. Understanding the architecture helps you reason about:

Why larger context windows improve document retrieval quality
How encoder-only models (BERT-style) differ from decoder-only LLMs (GPT-style)
What fine-tuning and LoRA adaptation actually change in the model
Trade-offs between model size, latency, and accuracy

How does a transformer work?

The transformer processes input tokens through a stack of identical layers, each containing two sub-components:

1. Multi-head self-attention

Self-attention allows every token to attend to every other token in the sequence simultaneously, computing a weighted average of all token representations based on relevance:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V

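Where Q (query), K (key), and V (value) are linear projections of the input. The attention score between two tokens is the dot product of one token’s query with the other token’s key, scaled by √dₖ (the square root of the key dimension, not the sequence length) and passed through softmax so that each token’s weights sum to 1.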
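The formula above can be sketched in plain Python for a single head. This is a minimal illustration rather than a production implementation: the function names and the toy Q, K, V values are assumptions chosen for readability, and real models use batched tensor operations on a GPU.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head.

    Q, K: lists of n vectors of length d_k; V: list of n vectors of length d_v.
    Returns (outputs, weights); weights[i][j] is how much token i attends to token j.
    """
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output vector is a weighted average of the value vectors.
    outputs = [
        [sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))]
        for row in weights
    ]
    return outputs, weights

# Toy input: 3 tokens, d_k = d_v = 2 (values are arbitrary, for illustration only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, so every output is a convex combination of the rows of V.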

Multi-head attention runs this process in parallel across H attention heads, each learning to attend to different aspects of the input (syntax, semantics, co-reference, etc.).
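In many implementations the heads are not separate networks: the projected vector of width d_model is sliced into H pieces, attention runs on each slice, and the results are concatenated again. The sketch below shows only that split and merge bookkeeping; `split_heads` and `merge_heads` are illustrative names, and the final output projection is omitted.

```python
def split_heads(x, num_heads):
    """Split one token vector of length d_model into num_heads slices."""
    d_model = len(x)
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    return [x[h * d_head:(h + 1) * d_head] for h in range(num_heads)]

def merge_heads(heads):
    """Concatenate per-head vectors back into one vector of length d_model."""
    return [value for head in heads for value in head]

token = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # d_model = 8
heads = split_heads(token, num_heads=4)            # 4 heads, each of width 2
restored = merge_heads(heads)                      # same values, original order
```

Because each head works on a narrower slice, the total cost of H heads is comparable to one full-width attention.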

2. Feed-forward network

After attention, each token position passes independently through the same two-layer MLP, whose hidden layer is usually several times wider than the model dimension:

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂

This adds non-linearity and capacity beyond what attention alone can represent.

Both sub-layers use residual connections (add input to output) and layer normalisation to stabilise training.
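As a rough sketch of how the feed-forward formula combines with the residual connection and layer normalisation, the code below applies LayerNorm(x + FFN(x)) to a single token vector, the post-norm ordering used in the original paper. The weights are toy values, and the learned scale and shift of layer normalisation are left out for brevity; both are assumptions of this example.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and unit variance (learned scale/shift omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, xW1 + b1)W2 + b2 for a single token vector x."""
    hidden = [max(0.0, sum(x[i] * W1[i][j] for i in range(len(x))) + b1[j])
              for j in range(len(b1))]
    return [sum(hidden[j] * W2[j][k] for j in range(len(hidden))) + b2[k]
            for k in range(len(b2))]

def ffn_sublayer(x, W1, b1, W2, b2):
    """Residual connection followed by layer normalisation: LayerNorm(x + FFN(x))."""
    out = ffn(x, W1, b1, W2, b2)
    return layer_norm([xi + oi for xi, oi in zip(x, out)])

# Toy sizes: d_model = 2, hidden width = 4.
x = [1.0, -2.0]
W1 = [[1.0, 0.0, -1.0, 0.5], [0.0, 1.0, 1.0, -0.5]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, -1.0]]
b2 = [0.0, 0.0]
y = ffn_sublayer(x, W1, b1, W2, b2)
```

Many recent models apply the normalisation before each sub-layer instead (pre-norm); the building blocks are the same.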

Encoder-only vs decoder-only vs encoder-decoder

Architecture	Models	Best for
Encoder-only	BERT, RoBERTa, BGE, E5	Embedding, classification, reranking
Decoder-only	GPT, LLaMA, Mistral	Text generation
Encoder-decoder	T5, BART, mT5	Translation, summarisation

For semantic search and the retrieval stage of RAG, encoder-only models are usually the right choice: they produce rich bidirectional representations of the full input in a single forward pass. SIE hosts encoder-only embedding and reranking models.
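At the attention level, the main difference between encoder-only and decoder-only models is the mask: encoders let every token see the whole input, while decoders restrict each token to itself and earlier positions. The following sketch builds both masks; `attention_mask` is an illustrative helper, not part of any library.

```python
def attention_mask(n, causal):
    """Return an n x n mask where mask[i][j] is True if token i may attend to token j."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]

encoder_mask = attention_mask(4, causal=False)  # bidirectional: every token sees every token
decoder_mask = attention_mask(4, causal=True)   # causal: token i sees only tokens 0..i

# Print the causal mask; "x" marks an allowed position.
for row in decoder_mask:
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

In practice, disallowed positions have their scores set to a large negative value before the softmax, so they receive effectively zero weight.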

What is self-attention and why is it powerful?

Self-attention addresses a core limitation of RNNs: the difficulty of capturing long-range dependencies efficiently. In an RNN, information from early tokens must pass through every intermediate step to reach later ones. In a transformer, every token attends directly to every other token, so the path between any two tokens is a single step regardless of how far apart they are.

This means:

“The bank by the river” and “the bank processed the loan”: the word “bank” gets different representations based on context
A legal clause 500 tokens earlier can directly influence the representation of a term at the end of the document

Positional encoding

Transformers process all tokens in parallel and have no inherent sense of order. Positional encodings add position information to each token’s embedding:

Sinusoidal (original): fixed mathematical functions of position
Learned positional embeddings: trainable position vectors (BERT)
Rotary Position Embedding (RoPE): encodes relative positions; used in modern embedding models and LLMs
ALiBi: adds a linear bias to attention scores based on distance; enables length generalisation

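Many recent long-context embedding models and LLMs use RoPE, which contributes to their ability to handle long inputs effectively. BGE-M3, which accepts inputs of up to 8,192 tokens, instead builds on XLM-RoBERTa and uses learned positional embeddings.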
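The sinusoidal scheme from the original paper is simple enough to write out directly. The sketch below computes the encoding for one position: even indices use sine, odd indices use cosine, and the wavelength grows geometrically with the index. The function name is illustrative.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Fixed positional encoding for one position.

    PE(pos, 2k)   = sin(pos / 10000^(2k / d_model))
    PE(pos, 2k+1) = cos(pos / 10000^(2k / d_model))
    """
    encoding = []
    for i in range(0, d_model, 2):          # i is the even index 2k
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        if i + 1 < d_model:
            encoding.append(math.cos(angle))
    return encoding

pe0 = sinusoidal_encoding(0, d_model=8)  # position 0: sin(0) = 0 and cos(0) = 1 alternate
pe5 = sinusoidal_encoding(5, d_model=8)
```

The resulting vector is added to the token embedding before the first layer, so identical tokens at different positions start with different representations.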

Transformer scaling and embedding models

Transformer quality scales predictably with model size, data, and compute. For embedding models:

Model size	Example	Latency	Quality
Small (~30M)	bge-small-en	Very fast	Good
Base (~110M)	bge-base-en	Fast	Better
Large (~335M)	bge-large-en	Medium	High
XL (~570M)	BGE-M3	Slower	State of the art

SIE’s GPU batching and cluster deployment make serving larger, higher-quality models at production scale practical.
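One rough way to relate the parameter counts in the table to serving cost is the memory needed for the weights alone: parameters multiplied by bytes per parameter. The sketch below assumes 16-bit weights (2 bytes each) and ignores activations, batching overhead, and framework buffers, so treat the numbers as lower bounds rather than measurements.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory for model weights only (fp16/bf16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1e9

# Approximate parameter counts taken from the table above.
models = {"small": 30e6, "base": 110e6, "large": 335e6, "xl": 570e6}
for name, params in models.items():
    print(f"{name}: ~{weight_memory_gb(params):.2f} GB of weights in fp16")
```

Passing `bytes_per_param=4` gives the corresponding figure for 32-bit weights.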

Frequently asked questions

What is the difference between a transformer and an LLM? A large language model (LLM) is typically a very large decoder-only transformer trained on massive text corpora for next-token prediction. The transformer is the architecture; LLM describes a specific scale and training approach.

Why are transformer embedding models better than older approaches? Transformers produce contextual embeddings, where the representation of each word depends on the entire surrounding context. Older methods (Word2Vec, GloVe) produce static embeddings where each word always has the same vector regardless of context.

How does LoRA work with transformers? LoRA (Low-Rank Adaptation) keeps the base weights frozen and adds a pair of small trainable low-rank matrices alongside selected weight matrices, most commonly the attention projections (Q, K, V). Only the LoRA matrices are updated during fine-tuning, which can reduce trainable parameters by 100-1000x. SIE supports hot-loading LoRA adapters without server restart.
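To make the low-rank idea concrete, the sketch below counts trainable parameters for one weight matrix and applies the adapted forward pass y = Wx + (alpha / r) · B(Ax). The layer size, rank, and function names are assumptions picked for illustration; they do not describe how SIE or any particular library implements LoRA.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora) trainable parameter counts for one weight matrix."""
    full = d_in * d_out            # fine-tuning W directly
    lora = rank * (d_in + d_out)   # A is rank x d_in, B is d_out x rank
    return full, lora

def lora_forward(x, W, A, B, alpha, rank):
    """Compute y = W x + (alpha / rank) * B (A x), with W kept frozen."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

# One 1024 x 1024 projection with rank 8:
full, lora = lora_param_counts(d_in=1024, d_out=1024, rank=8)
# full = 1,048,576 and lora = 16,384, a 64x reduction for this matrix

# With B initialised to zero, as in the LoRA paper, the adapter starts as a no-op.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]   # rank 1: shape 1 x 2
B = [[0.0], [0.0]]  # shape 2 x 1
y = lora_forward([1.0, 1.0], W, A, B, alpha=1.0, rank=1)  # equals W x = [3.0, 7.0]
```

The reduction across a whole model is larger than the per-matrix figure, because the feed-forward and embedding weights are often left without adapters.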