What is Clustering in Machine Learning?

Clustering is an unsupervised learning technique that groups similar data points together without using predefined labels. The algorithm discovers structure in data by identifying natural groupings based on similarity. Common algorithms include K-means, hierarchical clustering, and DBSCAN. Clustering is widely used in data exploration, customer segmentation, anomaly detection, and document organisation.


Why does clustering matter?

Clustering is useful whenever you want to find structure in unlabelled data. In the context of search and retrieval, clustering is used to:

  • Organise document corpora: group similar documents for topic modelling or faceted navigation
  • Analyse embedding spaces: understand how a model has structured its representations
  • Speed up retrieval: cluster-based indexing (used in some ANN algorithms) partitions the vector space for faster search
  • Discover anomalies: data points that don’t fit any cluster are likely outliers

How does K-means clustering work?

K-means is one of the most widely used clustering algorithms. It alternates between assigning points to clusters and updating the cluster centres:

  1. Choose K, the number of clusters
  2. Randomly initialise K centroids
  3. Assign each point to its nearest centroid
  4. Recompute each centroid as the mean of its assigned points
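  5. Repeat steps 3-4 until the centroids stop moving (or an iteration limit is reached)

With scikit-learn, this takes a few lines:

```python
from sklearn.cluster import KMeans

# document_vectors: array of shape (n_documents, n_dimensions), e.g. dense embeddings
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(document_vectors)
labels = kmeans.labels_  # cluster index (0-4) assigned to each document
```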
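
To make the five steps listed above concrete, here is a minimal NumPy sketch of the loop. It is an illustration only, not how scikit-learn implements K-means (scikit-learn uses k-means++ initialisation by default, among other refinements). The function name `kmeans_sketch` and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(points, k, max_iters=100, seed=0):
    """Plain K-means (Lloyd's algorithm); k is chosen by the caller (step 1)."""
    rng = np.random.default_rng(seed)
    # Step 2: initialise centroids by picking k distinct data points at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points;
        # an empty cluster keeps its previous centroid
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (hypothetical): two well-separated 2-D blobs of 50 points each
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans_sketch(points, k=2)
```

On data this cleanly separated, each blob ends up in its own cluster. On real embeddings the result depends on the initial centroids, which is why library implementations usually run several initialisations.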

K-means works well in embedding space: you can cluster dense vectors produced by SIE to organise a document corpus.


How do you choose the right number of clusters?

Elbow method: plot inertia (sum of squared distances to centroids) against K. The “elbow” where inertia stops dropping sharply suggests a good K.
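
A minimal sketch of the elbow method, assuming scikit-learn is available. The synthetic `make_blobs` data stands in for real document vectors, and the range of K values is an arbitrary choice for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for real vectors: 300 points drawn around 4 centres
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances from each point to its centroid
    inertias[k] = model.inertia_

# Print the curve; in practice you would plot K against inertia
for k, inertia in inertias.items():
    print(f"K={k}: inertia={inertia:.1f}")
```

Look for the K after which each extra cluster buys only a small further drop in inertia. On real data the elbow is often less sharp than on synthetic blobs, so treat it as a hint to combine with other evidence.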

Silhouette score: measures how well each point fits its assigned cluster compared with the nearest neighbouring cluster. Scores range from -1 to 1; higher is better.
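
The silhouette score can be compared across candidate values of K in the same way. This is a sketch on synthetic data; `best_k` is simply the K with the highest mean score in this toy setup and is not a guarantee for real corpora.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for real vectors: 300 points drawn around 4 centres
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 8):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all points

best_k = max(scores, key=scores.get)
```

On large corpora the score is expensive because it relies on pairwise distances; `silhouette_score` accepts a `sample_size` argument to estimate it from a random subset.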

Domain knowledge: for known taxonomies (e.g. product categories), let business logic guide K.


K-means vs DBSCAN vs hierarchical clustering

| Algorithm | Cluster shape | Handles noise | Requires K | Best for |
|---|---|---|---|---|
| K-means | Spherical | ✗ | ✓ | Large, balanced datasets |
| DBSCAN | Arbitrary | ✓ | ✗ | Irregular shapes, outlier detection |
| Hierarchical | Arbitrary | ✗ | ✗ (dendrogram) | Small datasets, visual exploration |
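
To illustrate the DBSCAN row, the sketch below clusters two interleaved half-moons (a non-spherical shape) with three outliers added by hand. DBSCAN takes no K; it labels points it cannot attach to any dense region as -1. The `eps` and `min_samples` values are illustrative choices for this toy data, not recommended defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons plus three far-away points acting as outliers
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
outliers = np.array([[3.0, 3.0], [-3.0, -3.0], [3.0, -3.0]])
X = np.vstack([X, outliers])

# eps: neighbourhood radius; min_samples: points needed to form a dense region
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Noise points get the label -1 and are not counted as a cluster
n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```

Choosing `eps` is the hard part on real embeddings, because a sensible neighbourhood radius depends on the distance metric and the dimensionality of the vectors.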

For clustering in high-dimensional embedding space, applying a dimensionality reduction step (e.g. PCA or UMAP) before K-means is a common production approach.
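
A minimal sketch of that approach with scikit-learn, using PCA for the reduction (UMAP lives in the separate `umap-learn` package and is not shown). The random `embeddings` array, its 384 dimensions, and the target of 64 components are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Random stand-in for real embeddings: 500 vectors with 384 dimensions
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 384))

# Reduce to 64 dimensions first, then cluster the reduced vectors
pipeline = make_pipeline(
    PCA(n_components=64, random_state=0),
    KMeans(n_clusters=5, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(embeddings)
```

Wrapping both steps in one pipeline means new vectors pass through the same fitted PCA projection before being assigned to a cluster with `pipeline.predict`.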


Clustering in vector search and RAG

Clustering appears in vector search infrastructure in two main ways:

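IVF indexing (Inverted File Index): available in libraries such as FAISS, IVF partitions vectors into clusters at index time. At query time, only the clusters whose centroids are nearest to the query are searched. This shrinks the search space and lowers latency, at the cost of occasionally missing a true nearest neighbour that sits in an unsearched cluster.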
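
The idea behind IVF can be sketched in a few lines of NumPy and scikit-learn. This is a toy illustration of the partition-then-probe principle, not how FAISS or any vector database implements it; the names `inverted_lists`, `ivf_search`, and `n_probe`, and the random corpus, are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random stand-in corpus: 2000 vectors with 32 dimensions
rng = np.random.default_rng(0)
vectors = rng.normal(size=(2000, 32)).astype(np.float32)

# Index time: partition the vectors into n_lists clusters
n_lists = 20
kmeans = KMeans(n_clusters=n_lists, n_init=10, random_state=0).fit(vectors)
inverted_lists = {c: np.where(kmeans.labels_ == c)[0] for c in range(n_lists)}

def ivf_search(query, top_k=5, n_probe=3):
    """Search only the n_probe clusters whose centroids are closest to the query."""
    centroid_dists = np.linalg.norm(kmeans.cluster_centers_ - query, axis=1)
    probed = np.argsort(centroid_dists)[:n_probe]
    # Candidate set: ids of all vectors stored in the probed clusters
    candidates = np.concatenate([inverted_lists[c] for c in probed])
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:top_k]]

# Query with a vector from the corpus; its own id should come back first
query = vectors[0]
neighbours = ivf_search(query)
```

Raising `n_probe` scans more clusters, which improves recall and increases latency; that trade-off is the main tuning knob in IVF-style indexes.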

Document organisation for RAG: clustering your document corpus before building a RAG pipeline can help you ensure diverse retrieval and identify gaps in your knowledge base.
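
One possible reading of "diverse retrieval" is to avoid returning several near-duplicate passages from the same cluster. The sketch below keeps only the best-scoring hit per cluster. The function `diversify` and the sample ids, scores, and cluster labels are hypothetical and not part of any library.

```python
def diversify(doc_ids, scores, cluster_labels, top_k=3):
    """Keep the highest-scoring document per cluster, then return the top_k of those."""
    best_per_cluster = {}
    for doc_id, score, cluster in zip(doc_ids, scores, cluster_labels):
        if cluster not in best_per_cluster or score > best_per_cluster[cluster][1]:
            best_per_cluster[cluster] = (doc_id, score)
    # Rank the per-cluster winners by score, best first
    ranked = sorted(best_per_cluster.values(), key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]

# Hypothetical retrieval results with precomputed cluster labels
doc_ids = ["a", "b", "c", "d", "e"]
scores = [0.95, 0.94, 0.90, 0.70, 0.65]
cluster_labels = [0, 0, 0, 1, 2]

selected = diversify(doc_ids, scores, cluster_labels)
print(selected)
```

Here documents "b" and "c" are dropped because they share a cluster with the higher-scoring "a". Whether that trade of relevance for coverage helps depends on the application.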

SIE produces the dense vectors; your vector database builds and queries the index. See the Qdrant integration guide for a full pipeline.


Frequently asked questions

Does clustering work well with high-dimensional vectors? High-dimensional spaces suffer from the “curse of dimensionality”: pairwise distances tend to concentrate, so the gap between “near” and “far” points shrinks and distance-based grouping becomes less reliable. Reducing to 64-256 dimensions with PCA before clustering often improves results. Alternatively, use UMAP for a non-linear reduction that better preserves local structure.

What’s the difference between clustering and classification? Classification is supervised: it assigns labels based on training examples. Clustering is unsupervised: it discovers groups without labels.

Can I cluster the output of embedding models? Yes. Dense vectors from SIE embedding models are well-suited to clustering. Because the embedding space is already structured by semantic similarity, clusters tend to correspond to meaningful topics or document types.

