---
title: Transformers
description: A comprehensive guide to Transformer neural networks, exploring the architecture that reshaped natural language processing. Learn about self-attention mechanisms, encoder-decoder structures, and how Transformers overcome traditional RNN limitations. Discover their applications in language modeling, machine translation, and emerging limitations in computational complexity and interpretability.
canonical_url: https://superlinked.com/glossary/transformers
last_updated: 2026-06-11
---

# What is a Transformer?

A transformer is a neural network architecture built around self-attention, introduced in the 2017 paper "Attention Is All You Need." Unlike RNNs, which process a sequence one step at a time, a transformer processes all positions of a sequence in parallel, which makes training much more efficient on parallel hardware such as GPUs. Transformers underpin nearly all modern large language models, embedding models, and rerankers used in search and AI systems.

---

## Why do transformers matter for inference?

Every embedding model and reranker hosted on SIE is a transformer. Understanding the architecture helps you reason about:

- Why larger context windows can improve document retrieval quality
- How encoder-only models (BERT-style) differ from decoder-only LLMs (GPT-style)
- What fine-tuning and LoRA adaptation actually change in the model
- Trade-offs between model size, latency, and accuracy

---

## How does a transformer work?

The transformer processes input tokens through a stack of identical layers, each containing two sub-components:

### 1. Multi-head self-attention
Self-attention allows every token to attend to every other token in the sequence simultaneously, computing a weighted average of all token representations based on relevance:

```
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
```

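Where Q (query), K (key), and V (value) are linear projections of the input. The attention score between two tokens is the dot product of one token's query with the other's key, scaled by 1/√dₖ (where dₖ is the key dimension) and passed through softmax so that each token's attention weights sum to 1.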
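
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the sizes, the random stand-in embeddings, and the names `softmax` and `scaled_dot_product_attention` are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (seq_len, seq_len) pairwise scores
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8           # toy sizes (assumed for illustration)
X = rng.normal(size=(seq_len, d_model))   # stand-in token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)             # one output vector per token
print(weights.sum(axis=-1))  # every row of attention weights sums to 1
```

Row `i` of `weights` shows how much token `i` draws from each other token when building its output vector.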

**Multi-head attention** runs this process in parallel across H attention heads, each learning to attend to different aspects of the input (syntax, semantics, co-reference, etc.).
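
As a rough sketch of how the heads run in parallel, the model dimension can be split into H slices, attention applied per slice, and the results concatenated and mixed by an output projection. The output projection `W_o`, the function names, and the toy sizes below are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (seq_len, d_model); every weight matrix: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (H, seq_len, seq_len)
    weights = softmax(scores, axis=-1)                    # separate weights per head
    heads = weights @ V                                   # (H, seq_len, d_head)
    # Concatenate the heads back to (seq_len, d_model), then mix them with W_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4    # toy sizes (assumed)
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

Y = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(Y.shape)  # (5, 16)
```

Each head works in a `d_model / H`-dimensional subspace, so the total cost stays close to that of a single full-width head.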

### 2. Feed-forward network
After attention, each token position passes through a small two-layer MLP independently:

```
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
```

This adds non-linearity and capacity beyond what attention alone can represent.

Both sub-layers use **residual connections** (add input to output) and **layer normalisation** to stabilise training.
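
The feed-forward sub-layer with its residual connection and layer normalisation can be sketched as follows. This assumes the "post-LN" ordering (normalise after adding the residual) and omits the learned gain and bias of layer normalisation for brevity; the helper names and sizes are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def sublayer(x, fn):
    # Residual connection (x + fn(x)) followed by layer normalisation.
    return layer_norm(x + fn(x))

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32         # toy sizes (assumed)
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

y = sublayer(x, lambda h: ffn(h, W1, b1, W2, b2))
print(y.shape)  # (4, 8): same shape in and out, so layers can be stacked
```

Because the input and output shapes match, the same wrapper can be applied to the attention sub-layer and the whole block repeated to form the layer stack.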

---

## Encoder-only vs decoder-only vs encoder-decoder

| Architecture | Models | Best for |
|---|---|---|
| Encoder-only | BERT, RoBERTa, BGE, E5 | Embedding, classification, reranking |
| Decoder-only | GPT, LLaMA, Mistral | Text generation |
| Encoder-decoder | T5, BART, mT5 | Translation, summarisation |

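**For semantic search and RAG**, encoder-only models are the usual choice for the retrieval components (embedding and reranking). They produce rich bidirectional representations of the full input, because every token can attend to tokens on both its left and its right. SIE hosts encoder-only embedding and reranking models.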
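
One concrete difference between encoder-style and decoder-style attention is the mask. The sketch below uses uniform scores so the pattern is easy to read; the function name `attention_weights` and the 4-token sequence are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(scores, causal=False):
    if causal:
        # Decoder-style: position i may only attend to positions j <= i.
        seq_len = scores.shape[-1]
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1)

scores = np.zeros((4, 4))                         # uniform scores for readability
bidirectional = attention_weights(scores)         # encoder-style
causal = attention_weights(scores, causal=True)   # decoder-style

print(bidirectional[0])  # first token attends to all four positions equally
print(causal[0])         # first token can only attend to itself
```

With the causal mask, early tokens never see what follows them, which suits generation. Without it, every token's representation reflects the whole input.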

---

## What is self-attention and why is it powerful?

Self-attention addresses a core limitation of RNNs: the difficulty of capturing long-range dependencies efficiently. In an RNN, information from early tokens must pass through every intermediate step to reach later ones. In a transformer, every token attends directly to every other token, so the path between any two tokens is a single step regardless of how far apart they are.

This means:
- "The bank by the river" and "the bank processed the loan": the word "bank" gets different representations based on context
- A legal clause 500 tokens earlier can directly influence the representation of a term at the end of the document
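
A toy check of the "bank" example: the same static input vector produces different outputs once its neighbours are mixed in by attention. The sketch assumes random stand-in word vectors, identity projections (Q = K = V = X), and no positional encoding, so it only illustrates the mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # Identity projections (Q = K = V = X) keep the sketch minimal.
    d = X.shape[-1]
    return softmax(X @ X.T / np.sqrt(d), axis=-1) @ X

rng = np.random.default_rng(0)
vocab = ["the", "bank", "by", "river", "processed", "loan"]
static = {word: rng.normal(size=8) for word in vocab}   # one fixed vector per word

def encode(sentence):
    X = np.stack([static[word] for word in sentence])
    return self_attention(X)

bank_river = encode(["the", "bank", "by", "the", "river"])[1]
bank_loan = encode(["the", "bank", "processed", "the", "loan"])[1]

# The input vector for "bank" is identical in both sentences,
# but the attention outputs differ because the context differs.
print(np.allclose(bank_river, bank_loan))
```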

---

## Positional encoding

Transformers process all tokens in parallel and have no inherent sense of order. **Positional encodings** add position information to each token's embedding:

- **Sinusoidal (original)**: fixed mathematical functions of position
- **Learned positional embeddings**: trainable position vectors (BERT)
- **Rotary Position Embedding (RoPE)**: encodes relative positions; used in modern embedding models and LLMs
- **ALiBi**: adds a linear bias to attention scores based on distance; enables length generalisation
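
As an example of the first option, the sinusoidal encoding from the original paper can be sketched as below. The function name and toy sizes are assumptions, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(seq_len)[:, None]     # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]    # even dimension indices, i.e. 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dimensions
    pe[:, 1::2] = np.cos(angles)                # odd dimensions
    return pe

seq_len, d_model = 6, 8                         # toy sizes (assumed)
pe = sinusoidal_positional_encoding(seq_len, d_model)

rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(seq_len, d_model))  # stand-in embeddings
x = token_embeddings + pe                       # position is added, not concatenated
print(x.shape)  # (6, 8)
```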

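Many recent long-context embedding models use RoPE, which contributes to their ability to handle 8,192-token inputs effectively. BGE-M3 also accepts 8,192-token inputs, but it is built on XLM-RoBERTa, which uses learned absolute positional embeddings rather than RoPE.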
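
RoPE's relative-position property can be checked numerically: after rotating a query and a key by their positions, their dot product depends only on the offset between them. This is a minimal sketch that pairs adjacent dimensions; real implementations may pair dimensions differently, and the function name `rope` is an assumption.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs of dimensions of x by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one rotation frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin      # 2-D rotation of each pair
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative offset (3) at two very different absolute positions.
score_near = rope(q, 5) @ rope(k, 2)
score_far = rope(q, 105) @ rope(k, 102)
print(np.isclose(score_near, score_far))  # True
```

Because the score depends on the offset rather than on absolute position, the same attention pattern applies wherever a pair of tokens appears in the sequence.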

---

## Transformer scaling and embedding models

Transformer quality tends to improve predictably as model size, training data, and compute grow. For embedding models:

| Model size | Example | Latency | Quality |
|---|---|---|---|
| Small (~30M) | bge-small-en | Very fast | Good |
| Base (~110M) | bge-base-en | Fast | Better |
| Large (~335M) | bge-large-en | Medium | High |
| XL (~570M) | BGE-M3 | Slower | Strong; multilingual, 8,192-token inputs |

SIE's GPU batching and cluster deployment make serving larger, higher-quality models at production scale practical.
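
To see where parameter counts like those in the table come from, the sketch below counts parameters for a BERT-style encoder. The layer layout (token, position, and segment embeddings, four attention projections, a two-layer FFN, and a pooler) and the configuration values are assumptions based on the standard BERT-base and BERT-large setups; a specific checkpoint may differ.

```python
def encoder_param_count(vocab_size, max_positions, d_model, d_ff, num_layers):
    """Approximate parameter count for a BERT-style encoder (assumed layout)."""
    layer_norm = 2 * d_model                                  # gain + bias
    # Token, position, and 2 segment embeddings, plus one layer norm.
    embeddings = (vocab_size + max_positions + 2) * d_model + layer_norm
    attention = 4 * (d_model * d_model + d_model)             # Q, K, V, output projections
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    per_layer = attention + ffn + 2 * layer_norm
    pooler = d_model * d_model + d_model
    return embeddings + num_layers * per_layer + pooler

base = encoder_param_count(30522, 512, d_model=768, d_ff=3072, num_layers=12)
large = encoder_param_count(30522, 512, d_model=1024, d_ff=4096, num_layers=24)
print(f"base:  {base / 1e6:.1f}M")
print(f"large: {large / 1e6:.1f}M")
```

Under these assumptions the totals come out at roughly 109M and 335M, in line with the Base and Large rows. Most of the growth comes from the per-layer attention and FFN weights, which scale with the square of `d_model`.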

---

## Frequently asked questions

**What is the difference between a transformer and an LLM?**
A large language model (LLM) is typically a very large decoder-only transformer trained on massive text corpora for next-token prediction. The transformer is the architecture; LLM describes a specific scale and training approach.

**Why are transformer embedding models better than older approaches?**
Transformers produce contextual embeddings, where the representation of each word depends on the entire surrounding context. Older methods (Word2Vec, GloVe) produce static embeddings where each word always has the same vector regardless of context.

**How does LoRA work with transformers?**
LoRA (Low-Rank Adaptation) adds small trainable low-rank matrices alongside selected weight matrices, most commonly the attention projections (Q, K, V), while keeping the base weights frozen. Only the LoRA matrices are updated during fine-tuning, which can reduce the number of trainable parameters by 100-1000x, depending on the rank and on which layers are adapted. SIE supports hot-loading LoRA adapters without server restart.
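
A minimal sketch of the idea for a single projection matrix, assuming rank 8, `alpha = 16`, and a 768-dimensional model. The class name `LoRALinear` is hypothetical and does not correspond to any particular library's API.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * A @ B."""

    def __init__(self, W, rank, alpha, rng):
        d_in, d_out = W.shape
        self.W = W                                       # frozen base weight
        self.A = rng.normal(size=(d_in, rank)) * 0.01    # trainable down-projection
        self.B = np.zeros((rank, d_out))                 # trainable, initialised to zero
        self.scale = alpha / rank

    def __call__(self, x):
        # Base output plus the scaled low-rank correction.
        return x @ self.W + self.scale * (x @ self.A @ self.B)

rng = np.random.default_rng(0)
d_model, rank = 768, 8                                   # assumed sizes
W_q = rng.normal(size=(d_model, d_model))
layer = LoRALinear(W_q, rank=rank, alpha=16, rng=rng)

x = rng.normal(size=(4, d_model))
full_params = W_q.size
lora_params = layer.A.size + layer.B.size
print(np.allclose(layer(x), x @ W_q))   # True: B starts at zero, so output is unchanged
print(full_params // lora_params)       # 48
```

The 48x figure applies to this one matrix. Across a whole model the reduction is larger, because the FFN weights, embeddings, and any unadapted projections contribute no trainable parameters at all.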

---

## Related resources

- [What is self-hosted inference?](/glossary/what-is-self-hosted-inference)
- [What is a LoRA adapter?](/glossary/what-is-a-lora-adapter)
- [What is semantic search?](/glossary/what-is-semantic-search)
- [Browse transformer-based models on SIE](/models)
- [What is a reranker?](/glossary/what-is-a-reranker)