
What is Clustering in Machine Learning?

Clustering is an unsupervised learning technique that groups similar data points together without using predefined labels. The algorithm discovers structure in data by identifying natural groupings based on similarity. Common algorithms include K-means, hierarchical clustering, and DBSCAN. Clustering is widely used in data exploration, customer segmentation, anomaly detection, and document organisation.

Why does clustering matter?

Clustering is useful whenever you want to find structure in unlabelled data. In the context of search and retrieval, clustering is used to:

Organise document corpora: group similar documents for topic modelling or faceted navigation
Analyse embedding spaces: understand how a model has structured its representations
Speed up retrieval: cluster-based indexing (used in some ANN algorithms) partitions the vector space for faster search
Discover anomalies: data points that fit no cluster well are candidate outliers

How does K-means clustering work?

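K-means is the most widely used clustering algorithm. It alternates between assigning points to centroids and updating those centroids:

1. Choose K, the number of clusters
2. Randomly initialise K centroids
3. Assign each point to its nearest centroid
4. Recompute each centroid as the mean of its assigned points
5. Repeat steps 3-4 until the centroids stop moving (or a maximum number of iterations is reached)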

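from sklearn.cluster import KMeans

# document_vectors: array of shape (n_documents, n_dimensions), e.g. dense embeddings
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
kmeans.fit(document_vectors)
labels = kmeans.labels_  # cluster index (0-4) for each document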
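
To make the five steps concrete, here is a minimal NumPy sketch of the same loop. It is an illustration rather than how scikit-learn implements KMeans: the function name `kmeans`, the toy data, and the choice to initialise from random data points are all assumptions made for this example.

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Minimal K-means loop following the five steps listed above."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points.
        # A cluster that ends up empty keeps its previous centroid.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two synthetic 2-D blobs, for illustration only.
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
labels, centroids = kmeans(points, k=2)
```

Because the result depends on the initial centroids, library implementations usually run several initialisations and keep the best one, which is what `n_init` controls in scikit-learn.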

K-means works well in embedding space: you can cluster dense vectors produced by SIE to organise a document corpus.

How do you choose the right number of clusters?

Elbow method: plot inertia (sum of squared distances to centroids) against K. The “elbow” where inertia stops dropping sharply suggests a good K.
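
As a rough sketch of the elbow method, the loop below fits K-means for a range of K values and records the inertia of each fit. The synthetic `make_blobs` data stands in for real document vectors, and the range 1 to 8 is an arbitrary choice for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for document vectors: 300 points drawn around 4 centres.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # inertia_ is the sum of squared distances from each point to its nearest centroid.
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"K={k}: inertia={inertia:.1f}")
```

Plotting `inertias` against K (for example with matplotlib) gives the curve to inspect. On real embeddings the elbow is often much less pronounced than on clean synthetic blobs.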

Silhouette score: measures how well each point fits its assigned cluster vs neighbouring clusters. Score ranges from -1 to 1; higher is better.
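
One possible way to apply the silhouette score is to compute it for several candidate values of K and keep the highest. The sketch below uses scikit-learn's `silhouette_score` on synthetic data; the data and the candidate range are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for document vectors: 300 points drawn around 4 centres.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 9):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all points

best_k = max(scores, key=scores.get)
```

On this toy data the score would be expected to peak at the number of generated centres. The score compares every point with every other point, so on large corpora it is common to pass `sample_size` to `silhouette_score` and evaluate on a subsample.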

Domain knowledge: for known taxonomies (e.g. product categories), let business logic guide K.

K-means vs DBSCAN vs hierarchical clustering

Algorithm	Cluster shape	Handles noise	Requires K	Best for
K-means	Spherical	✗	✓	Large, balanced datasets
DBSCAN	Arbitrary	✓	✗	Irregular shapes, outlier detection
Hierarchical	Arbitrary	✗	✗ (dendrogram)	Small datasets, visual exploration
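
The "Handles noise" and "Cluster shape" columns can be illustrated with a small comparison. This is a sketch on synthetic data: the two-moons dataset, the three hand-placed outliers, and the `eps` and `min_samples` values are all assumptions chosen for the example, not recommended settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaved half-moons plus three far-away points acting as outliers.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
outliers = np.array([[3.5, 3.5], [-3.0, 3.0], [3.0, -3.0]])
X = np.vstack([X, outliers])

# DBSCAN labels points in low-density regions as noise, using the label -1.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# K-means has no noise label: every point is assigned to one of the K clusters.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

n_noise = int(np.sum(dbscan_labels == -1))
```

In this setup the three appended points should come back from DBSCAN with the label -1, while K-means folds them into whichever cluster is nearest. `eps` is sensitive to the scale of the data, so it usually needs tuning per dataset.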

For clustering in high-dimensional embedding space, a common production approach is to apply a dimensionality reduction step (e.g. PCA or UMAP) before K-means.
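
A minimal sketch of that approach with scikit-learn chains PCA and K-means in a single pipeline. The random 384-dimensional vectors are a hypothetical stand-in for real embeddings, and the target of 64 components and 5 clusters are illustrative values only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for real embeddings: 500 random 384-dimensional vectors.
rng = np.random.default_rng(42)
document_vectors = rng.normal(size=(500, 384))

pipeline = make_pipeline(
    PCA(n_components=64, random_state=42),             # reduce 384 -> 64 dimensions
    KMeans(n_clusters=5, n_init=10, random_state=42),  # cluster in the reduced space
)
labels = pipeline.fit_predict(document_vectors)
```

Keeping both steps in one pipeline means new vectors passed to `pipeline.predict` go through the same projection before being assigned to a cluster.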

Clustering in vector search and RAG

Clustering appears in vector search infrastructure in two main ways:

IVF indexing (Inverted File Index): used in libraries such as FAISS, IVF partitions vectors into clusters at index time. At query time, only the clusters nearest the query are searched, which shrinks the search space and lowers latency at some cost to recall.
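
The idea behind IVF can be sketched in a few lines of NumPy and scikit-learn. This is a toy illustration of the partition-then-probe principle, not how any particular library implements it; the names `n_lists`, `n_probe`, `inverted_lists`, and `ivf_search` are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.normal(size=(2000, 32)).astype(np.float32)  # synthetic corpus
n_lists, n_probe = 20, 5  # illustrative values

# Index time: partition the vectors with K-means, one inverted list per cluster.
kmeans = KMeans(n_clusters=n_lists, n_init=10, random_state=0).fit(vectors)
inverted_lists = {c: np.flatnonzero(kmeans.labels_ == c) for c in range(n_lists)}

def ivf_search(query, top_k=5):
    """Search only the n_probe clusters whose centroids are closest to the query."""
    centroid_dist = np.linalg.norm(kmeans.cluster_centers_ - query, axis=1)
    probed = np.argsort(centroid_dist)[:n_probe]
    candidates = np.concatenate([inverted_lists[c] for c in probed])
    # Exact distances are computed only for vectors in the probed clusters.
    dist = np.linalg.norm(vectors[candidates] - query, axis=1)
    order = np.argsort(dist)[:top_k]
    return candidates[order], len(candidates)

query = vectors[0]
neighbours, n_scanned = ivf_search(query)
```

Raising `n_probe` scans more clusters, which improves recall and increases latency; probing every cluster is equivalent to an exhaustive search.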

Document organisation for RAG: clustering your document corpus before building a RAG pipeline can help you ensure diverse retrieval and identify gaps in your knowledge base.
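
One way this might look in practice is to cluster the corpus, inspect cluster sizes, and pull one representative document per cluster. The sketch below is an assumption about how such an audit could be done, using random vectors as a hypothetical stand-in for document embeddings and an arbitrary cluster count.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

# Hypothetical stand-in for a corpus of document embeddings.
rng = np.random.default_rng(7)
document_vectors = rng.normal(size=(400, 64))

n_clusters = 8  # illustrative choice
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=7).fit(document_vectors)

# Cluster sizes: unusually small clusters may point at thinly covered topics.
cluster_sizes = np.bincount(kmeans.labels_, minlength=n_clusters)

# One representative per cluster: the document closest to each centroid.
representative_ids, _ = pairwise_distances_argmin_min(
    kmeans.cluster_centers_, document_vectors
)
```

Reading the representative documents gives a quick sense of what each cluster covers, and drawing retrieved passages from several clusters rather than one is a simple way to encourage diversity.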

SIE produces the dense vectors; your vector database builds and searches the index. See the Qdrant integration guide for a full pipeline.

Frequently asked questions

Does clustering work well with high-dimensional vectors? High-dimensional spaces suffer from the “curse of dimensionality”: distances between points become increasingly alike, so the nearest and farthest neighbours are harder to tell apart. Reducing to 64-256 dimensions with PCA before clustering often improves results. Alternatively, use UMAP for a non-linear reduction that better preserves local structure.
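
The effect can be seen with a small experiment on uniformly random points, which is a deliberately artificial setting: real embeddings have more structure, so the effect is usually milder. The function name `relative_contrast` and the chosen dimensions are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=1000):
    """(farthest - nearest) / nearest distance from one random query to random points."""
    points = rng.uniform(size=(n_points, n_dims))
    query = rng.uniform(size=n_dims)
    dist = np.linalg.norm(points - query, axis=1)
    return (dist.max() - dist.min()) / dist.min()

# Lower contrast means the nearest and farthest points are harder to tell apart.
contrast_by_dim = {d: relative_contrast(d) for d in (2, 32, 1024)}
```

The contrast should shrink as the number of dimensions grows, which is why distance-based clustering tends to benefit from a reduction step first.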

What’s the difference between clustering and classification? Classification is supervised: it assigns labels based on training examples. Clustering is unsupervised: it discovers groups without labels.

Can I cluster the output of embedding models? Yes. Dense vectors from SIE embedding models are well-suited to clustering. Because the embedding space is already structured by semantic similarity, clusters tend to correspond to meaningful topics or document types.