What is Bag of Words?
Bag of Words (BoW) is a text representation technique that converts a document into a vector of word counts or frequencies, ignoring word order and grammar. Each dimension of the vector corresponds to a word in the vocabulary, and its value reflects how often that word appears. It is one of the foundational methods in natural language processing and information retrieval.
Why does Bag of Words matter?
BoW was the dominant approach for text classification, search, and topic modelling for decades before neural embeddings. It underpins TF-IDF weighting and forms the conceptual basis for BM25, the sparse retrieval algorithm still widely used in hybrid search today. Knowing where BoW falls short also explains why dense semantic search was developed.
How does Bag of Words work?
Given a corpus of documents:
- Build a vocabulary: split each document into tokens (words) and collect every unique token across the corpus
- Vectorise each document: create a vector of length |vocabulary|, where each position holds the number of times the corresponding word occurs in that document
Example:
- Doc A: “self-hosted inference engine”
- Doc B: “inference engine deployment”
| | self-hosted | inference | engine | deployment |
|---|---|---|---|---|
| Doc A | 1 | 1 | 1 | 0 |
| Doc B | 0 | 1 | 1 | 1 |
These vectors can then be compared using cosine similarity or used as features in a classifier.
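The two steps and the cosine comparison can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names are made up for this example, and it assumes lowercasing plus whitespace tokenisation, which keeps "self-hosted" as a single token.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; real pipelines usually do more.
    return text.lower().split()


def build_vocabulary(docs):
    # Collect unique tokens in first-seen order so vector columns are stable.
    vocab, seen = [], set()
    for doc in docs:
        for token in tokenize(doc):
            if token not in seen:
                seen.add(token)
                vocab.append(token)
    return vocab


def vectorise(doc, vocab):
    # One count per vocabulary entry; absent words count as 0.
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocab]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = ["self-hosted inference engine", "inference engine deployment"]
vocab = build_vocabulary(docs)
vec_a = vectorise(docs[0], vocab)   # [1, 1, 1, 0]
vec_b = vectorise(docs[1], vocab)   # [0, 1, 1, 1]
similarity = cosine_similarity(vec_a, vec_b)  # 2 shared words / (sqrt(3) * sqrt(3))
print(vocab, vec_a, vec_b, round(similarity, 3))
```

The vectors match the table above, and the similarity works out to 2/3 because the documents share two of their three words.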
What is TF-IDF and how does it improve Bag of Words?
Raw word counts treat all words equally. TF-IDF (Term Frequency-Inverse Document Frequency) weights each word by how distinctive it is to a document:
- TF (Term Frequency): how often the word appears in this document
- IDF (Inverse Document Frequency): how rare the word is across all documents
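The weight of a word in a document is the product of the two. Common words like “the” appear in nearly every document, so their IDF is low and they are effectively down-weighted. Words that are frequent in one document but rare elsewhere get high scores. For retrieval and classification, this usually makes TF-IDF a better signal than raw counts.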
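As a rough sketch, the weighting can be computed as below. It assumes the simplest textbook variant, raw term count multiplied by log(N / document frequency); libraries typically apply smoothed or normalised variants, so their numbers will differ. The function name and the three sample documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    # Assumption: lowercase + whitespace tokenisation.
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenised:
        df.update(set(tokens))

    weights = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Unsmoothed variant: weight = tf * log(N / df).
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


docs = ["the inference engine", "the deployment guide", "the engine manual"]
weights = tf_idf(docs)
for term, weight in weights[0].items():
    print(f"{term}: {weight:.3f}")
```

In this toy corpus "the" occurs in all three documents, so its weight is log(3/3) = 0, while "inference" (one document) outweighs "engine" (two documents).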
Bag of Words vs semantic embeddings
| | Bag of Words / TF-IDF | Dense embeddings |
|---|---|---|
| Captures word order | ✗ | ✓ (contextual) |
| Handles synonyms | ✗ | ✓ |
| Matches rare or exact terms | ✓ | ✗ (often unreliable) |
| Vector size | Large (vocab size) | Smaller (typically 768-4096 dimensions) |
| Requires model | ✗ | ✓ |
| Best for | Keyword matching | Semantic search |
Dense embedding models like BGE-M3 address BoW’s main weakness (no semantic understanding), but BoW-style sparse retrieval remains useful in hybrid search precisely because it handles exact keyword matching that semantic models miss.
How does Bag of Words relate to modern hybrid search?
BM25, used in systems like Elasticsearch, Solr, and OpenSearch, is essentially an improved TF-IDF/BoW approach that adds term-frequency saturation and document length normalisation. In hybrid search, BM25 handles the sparse retrieval path while dense vectors handle the semantic path. BGE-M3 can also output sparse lexical weights that play a role similar to BM25-style term weighting, allowing you to run hybrid search from a single model.
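To make the BM25 side concrete, here is a small sketch of one common formulation of the scoring function. The parameter defaults (k1 = 1.2, b = 0.75) and the non-negative IDF form are assumptions chosen for illustration; search engines differ in details, and the function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    # Assumption: lowercase + whitespace tokenisation, non-empty corpus.
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    avg_len = sum(len(tokens) for tokens in tokenised) / n_docs

    # Document frequency per term.
    df = Counter()
    for tokens in tokenised:
        df.update(set(tokens))

    scores = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Length normalisation: longer-than-average documents are penalised.
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # The tf term saturates: repeated occurrences add less and less.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "inference engine",
    "inference engine deployment guide for large clusters",
    "deployment checklist",
]
scores = bm25_scores("inference engine", docs)
print([round(s, 3) for s in scores])
```

Both of the first two documents contain the query terms once, but the shorter one scores higher because of length normalisation; the third shares no terms with the query and scores zero.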
Frequently asked questions
Is Bag of Words still used in production? Yes, primarily via BM25 in keyword search and hybrid search pipelines. Many production retrieval systems combine BM25 (sparse) with dense embeddings for the best of both worlds.
What are the main limitations of Bag of Words? It ignores word order (“dog bites man” and “man bites dog” produce identical vectors), can’t handle synonyms, and creates very high-dimensional sparse vectors for large vocabularies.
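The first two limitations are easy to check with a few lines of Python. This is only an illustration, assuming whitespace tokenisation; the helper name and the synonym example are made up.

```python
from collections import Counter


def bag_of_words(text):
    # Assumption: lowercase + whitespace tokenisation.
    return Counter(text.lower().split())


# Word order is lost: both sentences yield exactly the same counts.
same_bag = bag_of_words("dog bites man") == bag_of_words("man bites dog")

# Synonyms do not overlap: the two phrases share no vocabulary entries.
shared_terms = set(bag_of_words("fast car")) & set(bag_of_words("quick automobile"))

print(same_bag, shared_terms)
```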
What’s the difference between Bag of Words and Word2Vec? BoW produces a sparse term-count vector for each document. Word2Vec learns a dense, continuous vector for each word, so that words used in similar contexts end up close together and some semantic meaning is captured. Word2Vec was an early step towards the modern embedding models used in semantic search today.