What is a Transformer?
A transformer is a neural network architecture built on self-attention, introduced in the 2017 paper “Attention Is All You Need.” Unlike RNNs, which process tokens one step at a time, it processes an entire sequence in parallel, which makes training on long sequences efficient. Transformers underpin nearly all modern large language models, embedding models, and rerankers used in search and AI systems.
Why do transformers matter for inference?
Every embedding model and reranker hosted on SIE is a transformer. Understanding the architecture helps you reason about:
- Why larger context windows improve document retrieval quality
- How encoder-only models (BERT-style) differ from decoder-only LLMs (GPT-style)
- What fine-tuning and LoRA adaptation actually change in the model
- Trade-offs between model size, latency, and accuracy
How does a transformer work?
The transformer processes input tokens through a stack of identical layers, each containing two sub-components:
1. Multi-head self-attention
Self-attention allows every token to attend to every other token in the sequence simultaneously, computing a weighted average of all token representations based on relevance:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
Here Q (query), K (key), and V (value) are linear projections of the input. The attention score between two tokens is the dot product of one token’s query with the other token’s key, scaled by 1/√dₖ, where dₖ is the dimension of the key vectors, and then passed through a softmax over the sequence.
Multi-head attention runs this process in parallel across H attention heads, each learning to attend to different aspects of the input (syntax, semantics, co-reference, etc.).
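The attention formula above can be sketched for a single head in plain Python. This is a minimal illustration, not a production implementation: the toy Q, K, and V values are set by hand (in a real model they come from learned projections), and the function names are chosen here for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head.

    Q, K: lists of n vectors of size d_k; V: list of n vectors of size d_v.
    Returns (outputs, weights); weights[i][j] is how much token i attends to token j.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    weights = []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    # Each output is a weighted average of the value vectors.
    outputs = [
        [sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))]
        for row in weights
    ]
    return outputs, weights

# Toy input (hypothetical values): 3 tokens, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the value vectors. A multi-head layer would run this function once per head on separately projected Q, K, and V, then concatenate the results.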
2. Feed-forward network
After attention, each token position passes independently through the same two-layer MLP:
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
This adds non-linearity and capacity beyond what attention alone can represent.
Both sub-layers use residual connections (add input to output) and layer normalisation to stabilise training.
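As a rough sketch, the feed-forward sub-layer with its residual connection and layer normalisation might look like the following. It assumes the post-norm arrangement of the original paper, LayerNorm(x + FFN(x)); many later models normalise before the sub-layer instead. The weights are hypothetical toy values, and the learned gain and bias of layer normalisation are omitted.

```python
import math

def linear(x, W, b):
    """x: vector of size n_in; W: n_in x n_out matrix; b: vector of size n_out."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to one token vector."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]  # ReLU
    return linear(hidden, W2, b2)

def layer_norm(x, eps=1e-5):
    """Normalise to zero mean and unit variance (learned gain and bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def ffn_sublayer(x, W1, b1, W2, b2):
    """Post-norm residual block: LayerNorm(x + FFN(x))."""
    y = ffn(x, W1, b1, W2, b2)
    return layer_norm([xi + yi for xi, yi in zip(x, y)])

# Hypothetical toy weights: model width 2, hidden width 4.
W1 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 1.0, -0.5, 1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
b2 = [0.1, -0.1]

x = [1.0, 2.0]
print(ffn(x, W1, b1, W2, b2))
print(ffn_sublayer(x, W1, b1, W2, b2))
```

Because the same weights are applied to every position, the function takes a single token vector; a full layer would simply map it over all positions in the sequence.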
Encoder-only vs decoder-only vs encoder-decoder
| Architecture | Models | Best for |
|---|---|---|
| Encoder-only | BERT, RoBERTa, BGE, E5 | Embedding, classification, reranking |
| Decoder-only | GPT, LLaMA, Mistral | Text generation |
| Encoder-decoder | T5, BART, mT5 | Translation, summarisation |
For the embedding and reranking stages of semantic search and RAG, encoder-only models are usually the right choice. They produce bidirectional representations, in which each token’s vector reflects context on both sides of it. SIE hosts encoder-only embedding and reranking models.
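Much of the practical difference between encoder-style and decoder-style attention comes down to the attention mask. The sketch below is illustrative only (the helper names are made up here): it prints which positions each token may attend to under a bidirectional mask and under a causal mask.

```python
def bidirectional_mask(n):
    """Encoder-style: every token may attend to every token."""
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style: token i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def show(mask):
    """Print one row per query token; 'x' marks an allowed key position."""
    for row in mask:
        print(" ".join("x" if allowed else "." for allowed in row))

tokens = ["the", "bank", "processed", "the", "loan"]
n = len(tokens)
print("encoder (bidirectional):")
show(bidirectional_mask(n))
print("decoder (causal):")
show(causal_mask(n))
```

In an actual attention layer, disallowed positions have their scores set to negative infinity before the softmax, so they receive zero weight. Under the causal mask, “bank” cannot see “loan”; under the bidirectional mask it can, which is what lets an encoder use the full context for every token.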
What is self-attention and why is it powerful?
Self-attention solves the core limitation of RNNs: the inability to efficiently capture long-range dependencies. In an RNN, information from early tokens must pass through every intermediate step to reach later ones. In a transformer, every token attends directly to every other token, so the distance between tokens doesn’t matter.
This means:
- “The bank by the river” and “the bank processed the loan”: the word “bank” gets different representations based on context
- A legal clause 500 tokens earlier can directly influence the representation of a term at the end of the document
Positional encoding
Transformers process all tokens in parallel and have no inherent sense of order. Positional encodings add position information to each token’s embedding:
- Sinusoidal (original): fixed mathematical functions of position
- Learned positional embeddings: trainable position vectors (BERT)
- Rotary Position Embedding (RoPE): encodes relative positions; used in modern embedding models and LLMs
- ALiBi: adds a linear bias to attention scores based on distance; enables length generalisation
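The sinusoidal scheme from the original paper can be sketched as follows. The function name and the small table size are chosen for illustration; the base of 10000 is the value used in the paper.

```python
import math

def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """PE[pos][2i] = sin(pos / base^(2i/d_model)); PE[pos][2i+1] = cos of the same angle."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model // 2):
            angle = pos / (base ** (2 * i / d_model))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table

pe = sinusoidal_encoding(num_positions=4, d_model=8)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Each row is added to the token embedding at that position. Early feature pairs oscillate quickly with position and later pairs slowly, so every position gets a distinct pattern.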
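Many recent long-context encoders, such as ModernBERT, use RoPE, which contributes to their ability to handle 8,192-token inputs effectively. BGE-M3 also accepts 8,192 tokens, but it builds on XLM-RoBERTa and uses learned position embeddings extended to that length.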
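To make the “relative position” property of RoPE concrete, here is a minimal sketch. It assumes the formulation that rotates adjacent feature pairs; some implementations pair feature i with feature i + d/2 instead. The vectors are arbitrary toy values.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each adjacent pair of features by an angle proportional to pos."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # earlier pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c])
    return out

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]   # toy query vector
k = [1.0, 0.4, -0.7, 0.9]   # toy key vector

# Same offset (3 positions apart) at two different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
print(near, far)
```

The two scores agree up to floating-point error: after rotation, the query-key dot product depends only on the offset between the two positions, not on where they sit in the sequence.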
Transformer scaling and embedding models
Transformer quality scales predictably with model size, data, and compute. For embedding models:
| Model size | Example | Latency | Quality |
|---|---|---|---|
| Small (~30M) | bge-small-en | Very fast | Good |
| Base (~110M) | bge-base-en | Fast | Better |
| Large (~335M) | bge-large-en | Medium | High |
| XL (~570M) | BGE-M3 | Slower | State of the art |
SIE’s GPU batching and cluster deployment make serving larger, higher-quality models at production scale practical.
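The parameter counts in the table can be sanity-checked with rough arithmetic. The sketch below assumes BERT-style configurations (layer count, width, and vocabulary size are assumptions, not figures from this page) and ignores biases, layer norms, and position embeddings, so it is an estimate only.

```python
def approx_encoder_params(num_layers, d_model, vocab_size, ffn_multiplier=4):
    """Rough parameter count for a BERT-style encoder.

    Per layer: 4 * d^2 for the Q, K, V and output projections, plus
    2 * ffn_multiplier * d^2 for the two feed-forward matrices.
    Biases, layer norms, and position embeddings are ignored.
    """
    attention = 4 * d_model * d_model
    feed_forward = 2 * ffn_multiplier * d_model * d_model
    token_embeddings = vocab_size * d_model
    return num_layers * (attention + feed_forward) + token_embeddings

# Assumed BERT-style configurations: (layers, width, vocabulary size).
configs = {
    "base-like": (12, 768, 30522),
    "large-like": (24, 1024, 30522),
}
for name, (layers, width, vocab) in configs.items():
    total = approx_encoder_params(layers, width, vocab)
    print(f"{name}: ~{total / 1e6:.0f}M parameters")
```

Under these assumptions the estimates come out at roughly 108M and 333M, close to the Base and Large rows. Most of the growth comes from the d² terms: doubling the width roughly quadruples the per-layer cost, which is why latency rises quickly with model size.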
Frequently asked questions
What is the difference between a transformer and an LLM? A large language model (LLM) is typically a very large decoder-only transformer trained on massive text corpora for next-token prediction. The transformer is the architecture; “LLM” describes a particular scale and training approach.
Why are transformer embedding models better than older approaches? Transformers produce contextual embeddings, where the representation of each word depends on the entire surrounding context. Older methods (Word2Vec, GloVe) produce static embeddings where each word always has the same vector regardless of context.
How does LoRA work with transformers? LoRA (Low-Rank Adaptation) keeps the base weights frozen and adds a pair of small trainable low-rank matrices alongside selected weight matrices, most commonly the attention projections (Q, K, V). Only the LoRA matrices are updated during fine-tuning, which typically reduces trainable parameters by 100-1000x, depending on the rank and on which layers are adapted. SIE supports hot-loading LoRA adapters without server restart.
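A minimal sketch of the LoRA idea for one weight matrix follows. It is not SIE’s adapter-loading API: the function names, the toy dimensions, and the scaling factor alpha / r follow the usual LoRA formulation and are assumptions for illustration.

```python
def matmul_vec(W, x):
    """W: n_out x n_in matrix (list of rows); x: vector of size n_in."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def lora_forward(x, W, A, B, alpha, r):
    """y = W x + (alpha / r) * B (A x); W is frozen, only A and B are trained."""
    base = matmul_vec(W, x)
    update = matmul_vec(B, matmul_vec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def lora_trainable(d_in, d_out, r):
    """Trainable parameters of one LoRA pair: A is r x d_in, B is d_out x r."""
    return r * d_in + d_out * r

d, r, alpha = 4, 2, 4
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base weights
A = [[0.1 * (i + j + 1) for j in range(d)] for i in range(r)]       # r x d, trainable
B = [[0.0] * r for _ in range(d)]                                   # d x r, starts at zero

x = [1.0, 2.0, 3.0, 4.0]
print(lora_forward(x, W, A, B, alpha, r))  # identical to W x while B is all zeros

# Hypothetical realistic sizes: one 768 x 768 projection adapted at rank 8.
print(lora_trainable(768, 768, 8), "trainable vs", 768 * 768, "in the full matrix")
```

Because B starts at zero, the adapted layer initially behaves exactly like the base layer, and training only moves it away from that point through A and B. For a single 768 x 768 matrix at rank 8 the saving is 48x; the larger overall reductions come from leaving most of the model’s matrices without adapters at all.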