Supervised Learning

What are Support Vector Machines (SVMs)?

A Support Vector Machine (SVM) is a supervised learning algorithm that finds the hyperplane that maximally separates two classes of data. It works by identifying the training examples closest to the decision boundary (the “support vectors”) and maximising the margin between them. SVMs are effective for high-dimensional data, work well with small datasets, and can handle non-linear boundaries via the kernel trick.

Why do SVMs matter?

SVMs were the dominant classification algorithm for text and image tasks before deep learning. They remain relevant as:

Linear classifiers on top of embeddings: SVMs with a linear kernel are fast and accurate when applied to dense embedding vectors from models like BGE-M3
Small dataset classification: SVMs generalise well with limited labelled examples due to their maximum-margin objective
Interpretable baselines: simpler to explain to stakeholders than neural networks

How does an SVM work?

For linearly separable data, an SVM finds the hyperplane w·x + b = 0 that maximises the margin, the distance between the hyperplane and the nearest examples of each class (the support vectors).

Maximise: 2/‖w‖     (margin width)
Subject to: yᵢ(w·xᵢ + b) ≥ 1   for all training examples

This is a constrained optimisation problem solved via quadratic programming. Only the support vectors (the points lying on the margin boundary) determine the solution: removing any other training example and retraining yields the same hyperplane.
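The formulation above can be checked on a toy dataset. The following is a minimal sketch, assuming scikit-learn is available; the four data points are made up for illustration, and a very large C is used to approximate the hard-margin problem, since SVC always solves the soft-margin version.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data: class -1 sits at x1 = 0, class +1 at x1 = 2.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])

# A very large C approximates the hard-margin problem.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane w·x + b = 0
b = clf.intercept_[0]   # offset
margin = 2.0 / np.linalg.norm(w)

# Functional margins y_i (w·x_i + b); the constraint requires each to be >= 1.
functional_margins = y * (X @ w + b)

print("w =", w, "b =", b)
print("margin width =", margin)
print("functional margins =", functional_margins)
print("support vectors:")
print(clf.support_vectors_)
```

The two classes are two units apart along the first axis, so the reported margin width should come out close to 2, up to the solver's tolerance.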

What is the kernel trick?

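For data that is not linearly separable, the kernel trick lets the SVM operate as if the inputs had been mapped into a higher-dimensional space, where a linear separator may exist, without ever computing that mapping explicitly. It works because the optimisation depends on the data only through inner products, which a kernel function K(x, z) evaluates directly in the original space.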
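To make the idea concrete, here is a small sketch using a degree-2 polynomial kernel on 2-D inputs. The function names and the example vectors are illustrative choices. It shows that evaluating the kernel in the original space gives the same number as taking an inner product in an explicit 6-dimensional feature space.

```python
import numpy as np

def poly_kernel(x, z, c=1.0):
    """Degree-2 polynomial kernel K(x, z) = (x·z + c)^2."""
    return (np.dot(x, z) + c) ** 2

def explicit_features(x, c=1.0):
    """Explicit feature map phi for 2-D inputs, chosen so phi(x)·phi(z) = K(x, z)."""
    x1, x2 = x
    return np.array([
        x1 ** 2,
        x2 ** 2,
        np.sqrt(2.0) * x1 * x2,
        np.sqrt(2.0 * c) * x1,
        np.sqrt(2.0 * c) * x2,
        c,
    ])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

kernel_value = poly_kernel(x, z)                                   # works in 2-D
feature_value = np.dot(explicit_features(x), explicit_features(z))  # works in 6-D

print("kernel value:          ", kernel_value)
print("explicit inner product:", feature_value)
```

For higher degrees or the RBF kernel, the explicit feature space grows very large or becomes infinite-dimensional, which is why computing only K(x, z) matters in practice.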

Common kernels:

Kernel	Formula	Best for
Linear	K(x, z) = x·z	High-dimensional data, text
RBF (Gaussian)	K(x, z) = exp(-γ‖x-z‖²)	General non-linear problems
Polynomial	K(x, z) = (x·z + c)ᵈ	Image classification
Sigmoid	K(x, z) = tanh(αx·z + c)	Neural network approximation

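The RBF kernel is the most common choice for non-linear SVMs. The γ hyperparameter controls how far the influence of a single training point reaches: a large γ makes that influence very local, so the model fits individual points tightly and can overfit, while a small γ gives a smoother, more nearly linear boundary.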
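The effect of γ can be seen directly from the RBF formula in the table. This is a minimal sketch with an illustrative helper function and two made-up points at unit distance.

```python
import numpy as np

def rbf_kernel(x, z, gamma):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

x = [0.0, 0.0]
z = [1.0, 0.0]  # one unit away from x

# The same pair of points looks "similar" or "unrelated" depending on gamma.
for gamma in (0.1, 1.0, 10.0):
    print(f"gamma={gamma}: K(x, z) = {rbf_kernel(x, z, gamma):.5f}")
```

As γ grows, the similarity between the two points decays towards zero, so each training point only affects predictions in its immediate neighbourhood.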

Hard margin vs soft margin SVMs

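Real-world data is rarely cleanly separable. The soft margin SVM lets some points violate the margin, or even be misclassified, by giving each training example a slack variable ξᵢ that measures the size of its violation. The hyperparameter C sets how heavily those violations are penalised: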

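High C: violations are penalised heavily, giving a narrower margin and fewer training misclassifications, at the risk of overfitting
Low C: violations are tolerated, giving a wider margin that often generalises better but can underfit if C is too small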
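The slack variables and the role of C can be sketched numerically. The example below assumes the standard soft-margin objective ½‖w‖² + C·Σξᵢ with ξᵢ = max(0, 1 − yᵢ(w·xᵢ + b)). The hyperplane and data points are hypothetical values chosen for illustration, not the result of training.

```python
import numpy as np

# Hypothetical fixed hyperplane w·x + b = 0 (chosen by hand, not learned).
w = np.array([1.0, 0.0])
b = -1.0

X = np.array([
    [3.0, 0.0],   # class +1, well outside the margin
    [1.5, 0.0],   # class +1, inside the margin but correctly classified
    [0.5, 0.0],   # class +1, on the wrong side of the hyperplane
    [-1.0, 0.0],  # class -1, well outside the margin
])
y = np.array([1, 1, 1, -1])

# Slack xi_i = max(0, 1 - y_i (w·x_i + b)); zero when the margin is respected.
slack = np.maximum(0.0, 1.0 - y * (X @ w + b))

def soft_margin_objective(C):
    """0.5 * ||w||^2 + C * sum(xi), evaluated for the fixed w and b above."""
    return 0.5 * np.dot(w, w) + C * slack.sum()

print("slack per example:", slack)
for C in (0.1, 1.0, 10.0):
    print(f"C={C}: objective = {soft_margin_objective(C)}")
```

The same violations cost far more at C = 10 than at C = 0.1, which is why a high C pushes the optimiser towards a narrower margin that avoids them.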

from sklearn.svm import SVC

# Starting values only; tune C, gamma, and the kernel via cross-validation
svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(X_train, y_train)
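One possible way to do that tuning is a cross-validated grid search. The sketch below assumes scikit-learn and uses synthetic two-moons data as a stand-in for a real dataset; the grid values are arbitrary starting points, not recommendations.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data standing in for a real dataset.
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale features first: RBF kernels are sensitive to feature ranges.
pipeline = make_pipeline(StandardScaler(), SVC())

# Illustrative grid; make_pipeline names the SVC step "svc".
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1.0, 10.0],
    "svc__gamma": ["scale", 0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("held-out accuracy:", search.score(X_test, y_test))
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from the validation fold leaks into preprocessing.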

SVMs on top of embedding vectors

A highly practical pattern: use SIE to encode documents into dense vectors, then train a linear SVM classifier on top:

import numpy as np
from sie_sdk import SIEClient
from sie_sdk.types import Item
from sklearn.svm import LinearSVC

client = SIEClient("http://localhost:8080")

# Encode with a high-quality embedding model
X_train = np.stack(
    [r["dense"] for r in client.encode("BAAI/bge-m3", [Item(text=t) for t in train_texts])]
)
X_test = np.stack(
    [r["dense"] for r in client.encode("BAAI/bge-m3", [Item(text=t) for t in test_texts])]
)

# Linear SVM on embedding vectors
clf = LinearSVC(C=1.0, max_iter=2000)
clf.fit(X_train, y_train)

Because BGE-M3 vectors already capture rich semantic structure, a simple linear SVM often performs comparably to much more complex classifiers.

SVMs vs logistic regression vs neural networks

	SVM	Logistic Regression	Neural Network
Decision boundary	Maximum margin	Log-likelihood	Learned non-linear
Kernel support	✓	✗ (linear)	✓ (implicit)
Small data	Excellent	Good	Poor
Large data	Slow for kernel SVMs (roughly O(n²) to O(n³) training)	Fast	Fast (GPU)
Interpretability	Medium	High	Low

For large-scale text classification on top of embeddings, logistic regression is usually faster and comparable in accuracy. SVMs shine when data is scarce or high-dimensional.
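A quick way to check this trade-off on your own data is to cross-validate both linear models side by side. The sketch below assumes scikit-learn and uses synthetic dense features as a stand-in for embedding vectors; they are not real embeddings, so the printed scores say nothing about any particular model or dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic dense features standing in for embedding vectors (not real embeddings).
X, y = make_classification(
    n_samples=600, n_features=256, n_informative=32, random_state=0
)

models = {
    "linear_svm": LinearSVC(C=1.0, max_iter=5000),
    "logistic_regression": LogisticRegression(C=1.0, max_iter=5000),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: mean CV accuracy = {score:.3f}")
```

Swapping in real embedding matrices for X and real labels for y keeps the comparison unchanged.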

Frequently asked questions

What are support vectors? The training examples that lie on or within the margin boundary, the points closest to the decision boundary. They are the only examples that influence the SVM’s final decision boundary; other training points could be removed without changing the model.

Can SVMs do multi-class classification? SVMs are inherently binary. Multi-class classification uses either One-vs-Rest (train N classifiers, each distinguishing one class from all others) or One-vs-One (train N(N-1)/2 classifiers, one for each pair of classes).
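The classifier counts for both strategies can be confirmed with scikit-learn's multiclass wrappers. This is a minimal sketch on a synthetic 4-class problem, so N = 4.

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic 4-class problem, so N = 4.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=10, n_classes=4, random_state=0
)

ovr = OneVsRestClassifier(LinearSVC(max_iter=5000)).fit(X, y)
ovo = OneVsOneClassifier(LinearSVC(max_iter=5000)).fit(X, y)

print("one-vs-rest classifiers:", len(ovr.estimators_))  # N
print("one-vs-one classifiers:", len(ovo.estimators_))   # N(N-1)/2
```

In scikit-learn the wrappers are often unnecessary: SVC applies One-vs-One internally and LinearSVC applies One-vs-Rest when given more than two classes.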

Are SVMs still used in production? Yes, particularly as lightweight classifiers on top of embedding features, for anomaly detection, and in domains where interpretability and small data performance matter.