---
title: RNNs
description: Explore Recurrent Neural Networks (RNNs) and their role in processing sequential data. Learn about LSTM architecture, variants like GRUs and Bidirectional LSTMs, and applications in NLP, speech recognition, and time series analysis.
canonical_url: https://superlinked.com/glossary/recurrent-neural-networks
last_updated: 2026-06-11
---

# What are Recurrent Neural Networks (RNNs)?

A Recurrent Neural Network (RNN) is a neural network architecture designed for sequential data, where the output at each step depends on both the current input and a hidden state carried forward from previous steps. LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units) are the dominant RNN variants. While largely superseded by transformers for NLP, RNNs remain relevant for time series and streaming inference tasks.

---

## Why do RNNs matter?

Understanding RNNs provides essential context for why transformers, the backbone of modern embedding models, were designed the way they were. The main limitations of RNNs (sequential computation, fading long-range memory) each map to a design decision in the transformer architecture.

RNNs are still used in:
- **Streaming inference**: processing token-by-token with minimal state (lower memory than transformers)
- **Time series forecasting**: sequential numerical data where transformers may be overkill
- **Edge deployment**: the small, fixed-size state suits memory-constrained devices, and RNN-inspired state space models (Mamba, S4) are gaining traction for efficient inference

---

## How does an RNN work?

At each time step, an RNN takes the current input `xₜ` and the previous hidden state `hₜ₋₁`, and produces a new hidden state `hₜ`:

```
hₜ = tanh(Wₓ·xₜ + Wₕ·hₜ₋₁ + b)
```

The hidden state acts as a "memory" that carries information about previous inputs forward through the sequence. The same weights (`Wₓ`, `Wₕ`) are applied at every time step; this is **weight sharing across time**.
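
The recurrence above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the toy weights, the sizes (2 input features, 3 hidden units), the zero initial state, and the function names `rnn_step` and `rnn_forward` are all assumptions made for the example.

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step: h_t = tanh(W_x·x_t + W_h·h_prev + b), using plain lists."""
    h_t = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        total += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t

def rnn_forward(inputs, W_x, W_h, b):
    """Apply the same weights at every time step, carrying the hidden state."""
    h = [0.0] * len(b)  # initial hidden state h_0 (assumed to be all zeros)
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, W_x, W_h, b)
        states.append(h)
    return states

# Toy parameters (hypothetical values chosen only for illustration).
W_x = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]]
W_h = [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1], [0.2, -0.2, 0.1]]
b = [0.0, 0.1, -0.1]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

states = rnn_forward(sequence, W_x, W_h, b)
for t, h in enumerate(states, start=1):
    print(t, h)
```

Note that `rnn_forward` has to loop: step `t` cannot start until step `t-1` has produced its hidden state.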

---

## What problems do vanilla RNNs have?

**Vanishing gradients**: when backpropagating through many time steps, the gradient is multiplied by the recurrent Jacobian at every step, so it can shrink exponentially and make long-range dependencies very hard to learn. As a rough rule of thumb, vanilla RNNs struggle with dependencies spanning more than ~10-20 steps.
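
To make the shrinkage concrete, here is a small sketch for a scalar RNN `hₜ = tanh(w·hₜ₋₁ + x)`. The recurrent weight `0.9`, the constant input `0.5`, and the function name `gradient_through_time` are assumptions picked for illustration; real networks multiply Jacobian matrices rather than scalars, but the effect is analogous.

```python
import math

def gradient_through_time(w_h, steps, x=0.5):
    """Return |d h_T / d h_0| for a scalar RNN h_t = tanh(w_h * h_{t-1} + x)."""
    h = 0.0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w_h * h + x)
        # Chain rule: d h_t / d h_{t-1} = w_h * (1 - h_t^2)
        grad *= w_h * (1.0 - h * h)
    return abs(grad)

for steps in (1, 5, 10, 20, 50):
    print(steps, gradient_through_time(w_h=0.9, steps=steps))
```

Each factor has magnitude below 1 here, so the product shrinks geometrically with the number of steps.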

**Exploding gradients**: less common, but gradients can also grow uncontrollably through the same repeated multiplication. This is usually mitigated with gradient clipping.
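
One common form of gradient clipping rescales all gradients when their combined L2 norm exceeds a threshold. The sketch below assumes gradients are given as a flat list of floats; the helper name `clip_by_global_norm` and the threshold are illustrative. Deep learning frameworks ship their own versions (for example `torch.nn.utils.clip_grad_norm_` in PyTorch).

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient values so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the original norm.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm

# Hypothetical gradient values with norm 13, clipped down to norm 1.
clipped, original_norm = clip_by_global_norm([3.0, 4.0, 12.0], max_norm=1.0)
print(original_norm, clipped)
```

Clipping changes the gradient's length but not its direction, so the update still points the same way, just with a bounded step.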

**Sequential processing**: each hidden state depends on the previous one, so inputs must be processed one step at a time and training cannot be parallelised across time steps. This makes RNNs slow to train on GPUs compared to transformers.

---

## How do LSTMs solve the vanishing gradient problem?

LSTMs replace the simple hidden state with a **cell state** (a separate memory highway that can preserve information for many steps) and three **gates** that control what information flows in, out, and is forgotten:

- **Forget gate**: decides what to erase from the cell state: `fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf)`
- **Input gate**: decides what new information to add: `iₜ = σ(Wi·[hₜ₋₁, xₜ] + bi)`
- **Output gate**: decides what to expose as the hidden state: `oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo)`

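The gates combine into an additive update of the cell state, `cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ`, where `c̃ₜ = tanh(Wc·[hₜ₋₁, xₜ] + bc)` is the candidate content, and the hidden state is `hₜ = oₜ ⊙ tanh(cₜ)`. Because the cell state is updated by addition rather than by repeated squashing, gradients can flow along it across hundreds of steps with far less decay when the forget gate stays close to 1, which enables learning of long-range dependencies.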
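
The following sketch puts the three gates together with the standard LSTM cell update `cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ` and output `hₜ = oₜ ⊙ tanh(cₜ)`. The candidate weights (stored under the key `"g"`), the `params` dictionary layout, and the toy values are assumptions for illustration. The toy setup pins the forget gate near 1 and the input gate near 0 to show the cell state surviving 100 steps.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(W, v, b):
    """Compute W·v + b for a list-of-lists matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step. `params` maps gate name -> (W, b); this layout is hypothetical."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]

    def gate(name, activation):
        W, b = params[name]
        return [activation(a) for a in affine(W, z, b)]

    f = gate("f", sigmoid)    # forget gate
    i = gate("i", sigmoid)    # input gate
    o = gate("o", sigmoid)    # output gate
    g = gate("g", math.tanh)  # candidate cell content
    # Additive cell update: keep part of the old state, add part of the candidate.
    c_t = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h_t = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_t)]
    return h_t, c_t

# Toy setup: 1 hidden unit, 1 input feature; biases force f ≈ 1 and i ≈ 0.
params = {
    "f": ([[0.0, 0.0]], [10.0]),
    "i": ([[0.0, 0.0]], [-10.0]),
    "o": ([[0.0, 0.0]], [0.0]),
    "g": ([[0.0, 1.0]], [0.0]),
}
h, c = [0.0], [1.0]
for _ in range(100):
    h, c = lstm_step([0.5], h, c, params)
print(c)  # stays close to its initial value of 1.0
```

Along the cell-state path the per-step gradient factor is the forget gate value itself, which is why keeping it near 1 preserves the signal.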

---

## RNNs vs Transformers for sequence modelling

| | RNN / LSTM | Transformer |
|---|---|---|
| Handles long sequences | Limited (LSTM/GRU better than vanilla) | ✓ (full attention, within the context window) |
| Parallelisable (training) | ✗ (sequential) | ✓ |
| Memory usage | Low (constant state) | High (O(n²) attention) |
| Streaming inference | ✓ (step-by-step, fixed-size state) | Encoders need the full sequence; decoders can stream, but their cache grows with length |
| State of the art (NLP) | ✗ | ✓ |

For embedding models and text understanding, transformers dominate. For streaming or memory-constrained deployments, RNN-inspired architectures are resurging via state space models.
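
As a rough back-of-the-envelope comparison of the memory row in the table, the sketch below counts the numbers an RNN carries between steps against the entries of one naive self-attention score matrix. The hidden size of 512 and the sequence lengths are hypothetical, and the count covers a single head in a single layer; optimised attention kernels can avoid materialising the full matrix.

```python
def rnn_state_size(hidden_size):
    """Values an RNN carries between steps: one hidden vector, independent of length."""
    return hidden_size

def attention_matrix_size(seq_len):
    """Entries in one full self-attention score matrix (one head, one layer)."""
    return seq_len * seq_len

for n in (128, 1024, 8192):
    print(n, rnn_state_size(512), attention_matrix_size(n))
```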

---

## RNNs and embedding models

Early text embedding models used bidirectional LSTMs to encode sentences. These have been largely replaced by transformer-based models (BERT, BGE, E5), which produce much higher-quality representations (especially for long documents) because attention relates every token to every other token directly.

All embedding models on SIE use transformer architectures. However, the computational concepts from RNNs (hidden states, sequence processing, memory) remain relevant to understanding how sequence models work.

---

## Frequently asked questions

**Are LSTMs still worth learning?**
Yes, for conceptual understanding and for time series / streaming tasks. For NLP text embedding, transformers have almost entirely replaced them.

**What are GRUs and how do they differ from LSTMs?**
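GRUs (Gated Recurrent Units) simplify the LSTM by merging the forget and input gates into a single update gate and folding the cell state into the hidden state, leaving two gates (update and reset). With fewer parameters, they're often slightly faster to train, with comparable performance on many tasks.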
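
A minimal sketch of one GRU step, assuming the common formulation with an update gate `z` and a reset gate `r`. The `params` layout and toy values are hypothetical, and libraries differ on whether `z` or `1 - z` weights the previous state.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(W, v, b):
    """Compute W·v + b for a list-of-lists matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def gru_step(x_t, h_prev, params):
    """One GRU step. `params` maps name -> (W, b); this layout is hypothetical."""
    concat = h_prev + x_t  # [h_{t-1}, x_t]
    W_z, b_z = params["z"]
    W_r, b_r = params["r"]
    W_h, b_h = params["h"]
    z = [sigmoid(a) for a in affine(W_z, concat, b_z)]  # update gate
    r = [sigmoid(a) for a in affine(W_r, concat, b_r)]  # reset gate
    # The reset gate scales the old state before it feeds the candidate.
    gated = [r_k * h_k for r_k, h_k in zip(r, h_prev)] + x_t
    h_cand = [math.tanh(a) for a in affine(W_h, gated, b_h)]
    # A single gate z interpolates between keeping the old state and taking the candidate.
    return [(1.0 - z_k) * h_k + z_k * c_k for z_k, h_k, c_k in zip(z, h_prev, h_cand)]

# Toy setup: 2 hidden units, 1 input; unit 0 keeps its state (z ≈ 0), unit 1 overwrites it (z ≈ 1).
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
params = {
    "z": (zeros, [-20.0, 20.0]),
    "r": (zeros, [0.0, 0.0]),
    "h": (zeros, [0.5, 0.5]),
}
h_new = gru_step([1.0], [0.3, 0.3], params)
print(h_new)
```

Where the LSTM decides "forget" and "add" separately, the update gate here makes both decisions at once, which is the merging described above.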

**What are state space models (SSMs)?**
SSMs (Mamba, S4, H3) are a new class of sequence models that combine RNN-like sequential processing with properties enabling parallel training. They're promising for long-context, memory-efficient inference and an active research area.

---

## Related resources

- [What is a transformer?](/glossary/transformers)
- [What is a neural network?](/glossary/neural-networks)
- [What is backpropagation?](/glossary/backpropagation)
- [Browse embedding models on SIE](/models)
- [What is self-hosted inference?](/glossary/what-is-self-hosted-inference)
