Unsupervised Learning

What is Dimensionality Reduction?

Dimensionality reduction is the process of representing high-dimensional data in fewer dimensions while preserving as much meaningful structure as possible. It reduces computational cost, mitigates the curse of dimensionality, and makes data easier to visualise and cluster. Common techniques include PCA (linear) and UMAP/t-SNE (non-linear).

Why does dimensionality reduction matter for search?

Dense embedding vectors typically have 768 to 4096 dimensions. Storing and searching millions of such vectors is expensive. Dimensionality reduction can:

Reduce storage cost: at the same numeric precision, a 256-dim vector needs one third of the memory of a 768-dim one
Speed up ANN search: fewer dimensions mean cheaper distance computations
Improve clustering quality: high-dimensional spaces suffer from the curse of dimensionality, where all points become roughly equidistant
Enable visualisation: reduce to 2D or 3D for exploring embedding space structure

The trade-off is accuracy: compression discards some information, so reduced vectors are usually somewhat less discriminative than full-dimensional ones. How much is lost depends on the model, the data, and the target dimension.
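The storage saving is simple arithmetic. The sketch below assumes float32 values (4 bytes each) and a hypothetical corpus of 10 million vectors, and ignores any overhead added by the index structure itself:

```python
def index_size_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw size of the stored vectors in GB (10^9 bytes), excluding index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9


# Hypothetical corpus size, chosen only for illustration
for dims in (768, 256):
    print(f"{dims} dims: {index_size_gb(10_000_000, dims):.2f} GB")
# 768 dims: 30.72 GB
# 256 dims: 10.24 GB
```

Quantising values to fewer bytes multiplies with this saving, since size scales with both dimension count and bytes per value.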

What is the curse of dimensionality?

In high-dimensional spaces, counterintuitive things happen:

The volume of space grows exponentially with dimensions, so data becomes increasingly sparse
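Distance and similarity measures (Euclidean, cosine) lose discriminative power: the gap between the nearest and farthest neighbour shrinks relative to the distances themselves, so all pairwise distances look similar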
Models need exponentially more data to cover the space adequately

This is why techniques like PCA are applied before clustering high-dimensional embeddings, and why embedding models don’t use unnecessarily large dimensions.
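The loss of contrast between distances can be observed directly. The following is a minimal sketch on synthetic Gaussian data (an assumption made for illustration; real embeddings have more structure, so the effect is usually weaker). The helper name `relative_contrast` is hypothetical:

```python
import numpy as np


def relative_contrast(dims: int, n_points: int = 500, seed: int = 0) -> float:
    """(farthest - nearest) / nearest distance from one random query to random points."""
    rng = np.random.default_rng(seed)
    points = rng.standard_normal((n_points, dims))
    query = rng.standard_normal(dims)
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())


# The contrast shrinks as the number of dimensions grows
for dims in (2, 32, 768):
    print(dims, round(relative_contrast(dims), 2))
```

A value near zero means the nearest and farthest points are almost the same distance from the query, which is what makes nearest-neighbour search and clustering harder.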

How does PCA work?

Principal Component Analysis (PCA) finds the directions (principal components) of maximum variance in the data and projects onto the top-k components:

from sklearn.decomposition import PCA

# embedding_matrix: array of shape (n_samples, n_features), e.g. (10000, 768)
pca = PCA(n_components=256)
reduced_vectors = pca.fit_transform(embedding_matrix)  # (10000, 768) → (10000, 256)

# Explained variance ratio: the share of total variance kept by the 256 components
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.2%}")

PCA is linear: it preserves global structure well but may not capture complex non-linear relationships.
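To make the mechanics concrete, here is a minimal NumPy sketch of PCA via singular value decomposition. It is an illustration on synthetic data, not a replacement for the scikit-learn implementation; the function names `pca_fit` and `pca_transform` are hypothetical:

```python
import numpy as np


def pca_fit(X: np.ndarray, k: int):
    """Return (mean, top-k components, explained variance ratio of those components)."""
    mean = X.mean(axis=0)
    Xc = X - mean  # PCA operates on centred data
    # Rows of Vt are the principal directions, ordered by decreasing singular value
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    variance = S**2 / (X.shape[0] - 1)
    ratio = variance[:k] / variance.sum()
    return mean, Vt[:k], ratio


def pca_transform(X: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project data onto the fitted components."""
    return (X - mean) @ components.T


rng = np.random.default_rng(0)
# Synthetic correlated data standing in for an embedding matrix
X = rng.standard_normal((1000, 64)) @ rng.standard_normal((64, 64))
mean, components, ratio = pca_fit(X, k=16)
Z = pca_transform(X, mean, components)
print(Z.shape, f"variance retained: {ratio.sum():.2%}")
```

The components are orthonormal, so the projection is a rotation followed by dropping the low-variance axes.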

PCA vs t-SNE vs UMAP

	PCA	t-SNE	UMAP
Type	Linear	Non-linear	Non-linear
Preserves	Global variance	Local structure	Local + some global
Speed	Fast	Slow	Medium
Scalable	✓	✗	✓
Out-of-sample	✓	✗	✓
Best for	Compression, preprocessing	2D visualisation	Visualisation + compression

For production dimensionality reduction of embedding vectors, UMAP or PCA is preferred over t-SNE (which doesn’t support new data points without refitting).
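Out-of-sample support matters because queries arrive after the index is built. A minimal sketch with scikit-learn's PCA, using random arrays as stand-ins for real document and query embeddings (the sizes are arbitrary assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
corpus = rng.standard_normal((2000, 128)).astype(np.float32)  # stand-in for document embeddings
queries = rng.standard_normal((5, 128)).astype(np.float32)    # stand-in for new query embeddings

pca = PCA(n_components=32).fit(corpus)    # fit once, on the corpus only
corpus_reduced = pca.transform(corpus)    # vectors to index
queries_reduced = pca.transform(queries)  # new points reuse the same fitted projection

print(corpus_reduced.shape, queries_reduced.shape)
```

Queries and documents must go through the same fitted model, otherwise they end up in different coordinate systems and their distances are meaningless. scikit-learn's `TSNE` offers `fit_transform` but no `transform`, which is why it cannot be used this way.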

Matryoshka embedding models and built-in dimensionality reduction

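Some modern embedding models are trained with Matryoshka Representation Learning (MRL), a technique that concentrates the most important information in the leading dimensions of the vector. With such a model you can truncate vectors to a smaller size at inference time without a separate PCA step. This only holds for models actually trained with MRL, so confirm on the model card that the model you use (BGE-M3 in the example below) supports it: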

from sie_sdk import SIEClient
from sie_sdk.types import Item
import numpy as np

client = SIEClient("http://localhost:8080")

# Full 1024-dim vectors (batch)
results = client.encode("BAAI/bge-m3", [Item(text=t) for t in texts])
full_vectors = np.stack([r["dense"] for r in results])

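# Truncate to 256 dims (assumes MRL training), then re-normalise to unit length
# so that cosine and dot-product scores remain comparable
reduced_vectors = full_vectors[:, :256]
reduced_vectors = reduced_vectors / np.linalg.norm(reduced_vectors, axis=1, keepdims=True)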

SIE supports Matryoshka-capable models, giving you built-in flexible dimensionality without post-processing.

Frequently asked questions

How many dimensions should I reduce to? It depends on the accuracy-cost trade-off. As a rough guide, 256-512 dimensions often retain 95% or more of full-dimension retrieval quality, but the actual figure depends on the model and the data. Measure it on your own dataset with recall@k.
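One way to run that test is to treat the full-dimension nearest neighbours as the reference and check how many of them survive in the reduced space. The sketch below uses synthetic vectors whose variance decays across dimensions (an assumption that loosely imitates a Matryoshka-style layout) and exact brute-force search; `top_k` and `recall_at_k` are hypothetical helper names. With labelled relevance judgements you would compare against those instead:

```python
import numpy as np


def top_k(queries: np.ndarray, corpus: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k most cosine-similar corpus rows for each query (exact search)."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return np.argsort(-(q @ c.T), axis=1)[:, :k]


def recall_at_k(reference: np.ndarray, candidate: np.ndarray) -> float:
    """Mean fraction of reference neighbours that also appear in the candidate lists."""
    hits = [len(set(r) & set(c)) / len(r) for r, c in zip(reference, candidate)]
    return float(np.mean(hits))


rng = np.random.default_rng(0)
scale = 1.0 / np.sqrt(np.arange(1, 513))  # earlier dimensions carry more variance
corpus = rng.standard_normal((2000, 512)) * scale
queries = rng.standard_normal((50, 512)) * scale

full = top_k(queries, corpus, k=10)  # reference: neighbours in the full space
for dims in (64, 128, 256):
    reduced = top_k(queries[:, :dims], corpus[:, :dims], k=10)
    print(dims, round(recall_at_k(full, reduced), 3))
```

Pick the smallest dimension whose recall is still acceptable for your application.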

Does dimensionality reduction affect retrieval accuracy? Usually yes: some information is lost, and the question is whether the loss is acceptable. Matryoshka-trained models are designed to keep this loss small.

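Can I apply PCA to vectors already stored in a vector database? Yes, but not in place. You would need to export the vectors, fit PCA on them, transform them, and re-index the reduced vectors. Every query vector must then be projected with the same fitted PCA before searching. It is usually easier to apply the reduction at encoding time and build the index from the reduced vectors.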