What are Support Vector Machines (SVMs)?
A Support Vector Machine (SVM) is a supervised learning algorithm that finds the hyperplane that maximally separates two classes of data. It works by identifying the training examples closest to the decision boundary (the “support vectors”) and maximising the margin between them. SVMs are effective for high-dimensional data, work well with small datasets, and can handle non-linear boundaries via the kernel trick.
Why do SVMs matter?
SVMs were the dominant classification algorithm for text and image tasks before deep learning. They remain relevant as:
- Linear classifiers on top of embeddings: SVMs with a linear kernel are fast and accurate when applied to dense embedding vectors from models like BGE-M3
- Small dataset classification: SVMs generalise well with limited labelled examples due to their maximum-margin objective
- Interpretable baselines: simpler to explain to stakeholders than neural networks
How does an SVM work?
For linearly separable data, an SVM finds the hyperplane w·x + b = 0 that maximises the margin, the distance between the hyperplane and the nearest examples of each class (the support vectors).
Maximise: 2/‖w‖ (margin width)
Subject to: yᵢ(w·xᵢ + b) ≥ 1 for all training examples
This is a constrained optimisation problem solved via quadratic programming. Only the support vectors (points on the margin boundary) influence the decision boundary. Other training examples are irrelevant once training is complete.
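The formulation above can be checked numerically. The following is a minimal sketch, assuming scikit-learn is available; the toy points are hand-made for illustration, and a very large C is used to approximate the hard-margin problem. It reads off w and b, computes the margin width 2/‖w‖, and refits on the support vectors alone to show that the other points do not affect the hyperplane.

```python
import numpy as np
from sklearn.svm import SVC

# Hand-made, linearly separable toy data (illustrative only).
X = np.array([[0, 0], [1, 0], [0, 1], [-1, -1],
              [3, 3], [4, 3], [3, 4], [5, 5]], dtype=float)
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

# A very large C approximates the hard-margin SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                     # normal vector of the hyperplane
b = clf.intercept_[0]                # offset
margin = 2.0 / np.linalg.norm(w)     # margin width 2/||w||

# Each training point should satisfy y_i (w . x_i + b) >= 1,
# up to the solver's numerical tolerance.
constraints = y * (X @ w + b)

print("w =", w, "b =", b)
print("margin width:", margin)
print("support vectors:\n", clf.support_vectors_)
print("smallest constraint value:", constraints.min())

# Refit on the support vectors only: the hyperplane should be
# essentially unchanged, since the other points carry no weight.
sv = clf.support_
clf_sv = SVC(kernel="linear", C=1e6).fit(X[sv], y[sv])
w_sv = clf_sv.coef_[0]
print("w refit on support vectors only:", w_sv)
```

For this particular toy set, the geometry gives a margin of 5/√2 (about 3.54), so the printed margin should be close to that value.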
What is the kernel trick?
For non-linearly separable data, the kernel trick maps inputs into a higher-dimensional space where they become linearly separable, without explicitly computing the transformation.
Common kernels:
| Kernel | Formula | Best for |
|---|---|---|
| Linear | K(x, z) = x·z | High-dimensional data, text |
| RBF (Gaussian) | K(x, z) = exp(-γ‖x-z‖²) | General non-linear problems |
| Polynomial | K(x, z) = (x·z + c)ᵈ | Image classification |
| Sigmoid | K(x, z) = tanh(αx·z + c) | Neural network approximation |
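The RBF kernel is the most common choice for non-linear SVMs. The γ hyperparameter controls how tightly the model fits individual points: a larger γ narrows each training point's influence, which fits the data more closely but raises the risk of overfitting.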
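To make "without explicitly computing the transformation" concrete, here is a small sketch using only NumPy. It takes the polynomial kernel from the table with c = 0 and d = 2 on 2-D inputs, and compares it with the dot product under an explicit feature map. The function names `poly_kernel`, `phi` and `rbf_kernel` are illustrative, not part of any library.

```python
import numpy as np

def poly_kernel(x, z):
    """Polynomial kernel (x . z + c)^d with c = 0, d = 2."""
    return (x @ z) ** 2

def phi(x):
    """Explicit degree-2 feature map for 2-D input that matches poly_kernel."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def rbf_kernel(x, z, gamma):
    """RBF kernel exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

k_implicit = poly_kernel(x, z)   # computed in the original 2-D space
k_explicit = phi(x) @ phi(z)     # computed in the mapped 3-D space

print(k_implicit, k_explicit)    # both equal 121 (up to floating-point rounding)
print(rbf_kernel(x, z, gamma=0.5))
```

The kernel gives the same number as the mapped dot product while only ever touching the original 2-D vectors. For the RBF kernel the corresponding feature space is infinite-dimensional, so the kernel form is the only practical way to compute it.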
Hard margin vs soft margin SVMs
Real-world data is rarely cleanly separable. The soft margin SVM allows some misclassifications via a slack variable ξᵢ, controlled by hyperparameter C:
- High C: small margin, few misclassifications allowed, risk of overfitting
- Low C: large margin, more misclassifications tolerated, better generalisation
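The trade-off in the list above can be observed directly. The sketch below assumes scikit-learn and uses synthetic, deliberately overlapping blobs (the centres, spread and C values are arbitrary choices for illustration). It fits a linear soft-margin SVM at a low and a high C and reports the margin width and the number of support vectors for each.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so no hyperplane separates them perfectly.
X, y = make_blobs(n_samples=200, centers=[[0, 0], [2, 2]],
                  cluster_std=1.5, random_state=0)

results = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)   # margin width 2/||w||
    results[C] = (margin, len(clf.support_))
    print(f"C={C}: margin width={margin:.3f}, support vectors={len(clf.support_)}")
```

The low-C model should report the wider margin, since it penalises margin violations less heavily.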
from sklearn.svm import SVC
# Tune C and kernel via cross-validation
svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(X_train, y_train)
SVMs on top of embedding vectors
A highly practical pattern: use SIE to encode documents into dense vectors, then train a linear SVM classifier on top:
import numpy as np
from sie_sdk import SIEClient
from sie_sdk.types import Item
from sklearn.svm import LinearSVC
client = SIEClient("http://localhost:8080")
# Encode with a high-quality embedding model
X_train = np.stack(
    [r["dense"] for r in client.encode("BAAI/bge-m3", [Item(text=t) for t in train_texts])]
)
X_test = np.stack(
    [r["dense"] for r in client.encode("BAAI/bge-m3", [Item(text=t) for t in test_texts])]
)
# Linear SVM on embedding vectors
clf = LinearSVC(C=1.0, max_iter=2000)
clf.fit(X_train, y_train)
Because BGE-M3 vectors already capture rich semantic structure, a simple linear SVM often performs comparably to much more complex classifiers.
SVMs vs logistic regression vs neural networks
| | SVM | Logistic Regression | Neural Network |
|---|---|---|---|
| Decision boundary | Maximum margin | Log-likelihood | Learned non-linear |
| Kernel support | ✓ | ✗ (linear) | ✓ (implicit) |
| Small data | Excellent | Good | Poor |
| Large data | Slow for kernel SVMs (training scales roughly O(n²) to O(n³)) | Fast | Fast (GPU) |
| Interpretability | Medium | High | Low |
For large-scale text classification on top of embeddings, logistic regression is usually faster and comparable in accuracy. SVMs shine when data is scarce or high-dimensional.
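One way to check this comparison on your own data is a side-by-side cross-validation. The sketch below assumes scikit-learn and uses synthetic features from `make_classification` as a stand-in for embedding vectors; the dataset sizes and hyperparameters are arbitrary illustrative choices, so the numbers say nothing about real embeddings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for dense embedding vectors (not real embeddings).
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_clusters_per_class=1, random_state=0)

models = {
    "linear SVM": LinearSVC(C=1.0, max_iter=5000),
    "logistic regression": LogisticRegression(C=1.0, max_iter=5000),
}

# 5-fold cross-validated accuracy for each model on the same data.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean accuracy = {scores[name]:.3f}")
```

Swapping in real `X` and `y` arrays, for example embeddings and labels, gives a quick baseline before committing to either model.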
Frequently asked questions
What are support vectors? The training examples that lie on or within the margin boundary, the points closest to the decision boundary. They are the only examples that influence the SVM’s final decision boundary; other training points could be removed without changing the model.
Can SVMs do multi-class classification? SVMs are inherently binary. Multi-class classification uses either One-vs-Rest (train N classifiers, each distinguishing one class from all others) or One-vs-One (train N(N-1)/2 classifiers, one for each pair of classes).
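The two strategies can be sketched with scikit-learn's multi-class wrappers. This is an illustrative example on synthetic four-class data, assuming scikit-learn is installed; it counts the binary classifiers each strategy trains.

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic data with N = 4 classes (illustrative only).
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

n_ovr = len(ovr.estimators_)   # N = 4 binary classifiers
n_ovo = len(ovo.estimators_)   # N(N-1)/2 = 6 binary classifiers
print("One-vs-Rest classifiers:", n_ovr)
print("One-vs-One classifiers:", n_ovo)
```

One-vs-One trains more models, but each one sees only the examples from two classes.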
Are SVMs still used in production? Yes, particularly as lightweight classifiers on top of embedding features, for anomaly detection, and in domains where interpretability and small data performance matter.