What is Feature Scaling and Normalisation?
Feature scaling transforms numerical features to a comparable range so that no single feature dominates a model due to its magnitude. The two most common techniques are min-max scaling (normalisation), which maps values to [0, 1], and standardisation (z-score scaling), which centres data at zero with unit variance. Most distance-based and gradient-based models require feature scaling to perform well.
Why does feature scaling matter?
Consider a dataset with document length (range: 10-50,000 words) and number of images (range: 0-20). Without scaling, the document length feature dominates Euclidean distance calculations simply because its values are larger, not because it’s more informative.
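The effect can be checked with a small, dependency-free sketch. The three documents below are hypothetical; the feature ranges are the ones quoted above. On raw values, document A looks closest to B (similar length, very different image count), and after min-max scaling it is closest to C instead.

```python
import math

# Hypothetical documents described by (word count, image count).
doc_a = (12_000, 2)
doc_b = (12_500, 18)  # similar length to A, very different image count
doc_c = (15_000, 2)   # longer than A, identical image count

# Feature ranges taken from the text: 10-50,000 words and 0-20 images.
lows, highs = (10, 0), (50_000, 20)


def euclidean(p, q):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def min_max(point, lows, highs):
    """Map each coordinate to [0, 1] using the given per-feature range."""
    return tuple((v - lo) / (hi - lo) for v, lo, hi in zip(point, lows, highs))


raw_ab = euclidean(doc_a, doc_b)
raw_ac = euclidean(doc_a, doc_c)

scaled_ab = euclidean(min_max(doc_a, lows, highs), min_max(doc_b, lows, highs))
scaled_ac = euclidean(min_max(doc_a, lows, highs), min_max(doc_c, lows, highs))

print(f"raw:    d(A,B)={raw_ab:.1f}  d(A,C)={raw_ac:.1f}")
print(f"scaled: d(A,B)={scaled_ab:.3f}  d(A,C)={scaled_ac:.3f}")
```

On the raw values the image count barely contributes to the distance; once both features share a range, the 16-image gap between A and B becomes visible.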
Models particularly sensitive to feature scale:
- K-Nearest Neighbours: distance-based, heavily affected
- Support Vector Machines: margin maximisation is scale-sensitive
- Neural networks / gradient descent: features on very different scales make the loss surface poorly conditioned, which slows or destabilises learning
- PCA: maximises variance, so large-scale features dominate components
Models that don’t require scaling:
- Tree-based models (decision trees, random forests, gradient boosting): split on thresholds, scale-invariant
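The scale invariance of threshold splits can be illustrated with a toy single-feature split search. This is a simplified sketch, not how any particular library implements trees: `best_split` is a hypothetical helper that tries midpoints between sorted values and picks the one with the lowest weighted Gini impurity, and the data is made up.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, goes_left) for the lowest-impurity threshold split."""
    best_score, best_threshold = None, None
    ordered = sorted(set(values))
    for lo, hi in zip(ordered, ordered[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best_score is None or score < best_score:
            best_score, best_threshold = score, threshold
    return best_threshold, [x <= best_threshold for x in values]


# Hypothetical word counts with binary labels.
words = [120, 900, 4_000, 15_000, 32_000, 48_000]
labels = [0, 0, 0, 1, 1, 1]

# Min-max scale the same feature.
lo, hi = min(words), max(words)
words_scaled = [(w - lo) / (hi - lo) for w in words]

raw_threshold, raw_partition = best_split(words, labels)
scaled_threshold, scaled_partition = best_split(words_scaled, labels)

# The threshold value changes, but the rows sent left and right do not.
print(raw_threshold, scaled_threshold)
print(raw_partition == scaled_partition)
```

Because scaling preserves the ordering of values, the same rows end up on each side of the split; only the numeric threshold moves.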
Min-max scaling (normalisation)
Maps all values to the range [0, 1]:
x_scaled = (x - x_min) / (x_max - x_min)

    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X_train)

Best for: neural network inputs, when you know the approximate min/max of the data.
Weakness: sensitive to outliers, since a single extreme value compresses all other values into a narrow part of the range (towards 0 for a very large outlier).
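The outlier weakness is easy to see with a hand-rolled version of the formula. This is a minimal sketch using made-up numbers; `min_max_scale` is an illustrative helper (not the scikit-learn implementation) and assumes the input has at least two distinct values.

```python
def min_max_scale(values):
    """Apply (x - x_min) / (x_max - x_min) to a list of numbers."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


typical = [10, 20, 30, 40, 50]
with_outlier = typical + [1_000]  # one extreme value added

scaled_typical = min_max_scale(typical)
scaled_with_outlier = min_max_scale(with_outlier)

print(scaled_typical)
# The first five entries are the same points as before, now squeezed near 0.
print([round(v, 3) for v in scaled_with_outlier[:5]])
```

Without the outlier the five values spread evenly across [0, 1]; with it, they all land in the bottom few percent of the range.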
Standardisation (Z-score scaling)
Centres data at zero and scales to unit variance:
x_scaled = (x - mean) / std

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_train)

Best for: most general cases, especially when you don’t know the range.
Weakness: doesn’t bound values, so outliers remain as outliers.
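As a rough illustration of the formula, the sketch below standardises a made-up list of document lengths by hand. `standardise` is an illustrative helper; it uses the population standard deviation, which is also what `StandardScaler` uses.

```python
import statistics


def standardise(values):
    """Apply (x - mean) / std using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical document lengths with one very long document.
lengths = [200, 450, 800, 1_200, 50_000]
z = standardise(lengths)

print([round(v, 2) for v in z])
print(round(statistics.fmean(z), 6), round(statistics.pstdev(z), 6))
```

The result has mean 0 and standard deviation 1, but the long document still sits far from the rest, which is what "outliers remain as outliers" means in practice.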
Min-max vs standardisation: which to use?
| Scenario | Recommended |
|---|---|
| Known bounded range (e.g. pixel values 0-255) | Min-max |
| Unknown range, possible outliers | Standardisation |
| Neural network training | Either (standardisation more common) |
| SVM | Standardisation |
| PCA | Standardisation |
| Data has extreme outliers | RobustScaler (uses median/IQR) |
Feature scaling and embedding vectors
Dense embedding vectors produced by SIE’s encoding models are already unit-normalised (L2 norm = 1) for cosine similarity search. You don’t need to apply additional scaling to embedding vectors.
However, when combining embedding similarity scores with other tabular features (e.g. document recency, click-through rate) in a re-ranking model, you’ll need to scale the non-embedding features to be comparable with the similarity scores.
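A minimal sketch of that preparation step is shown below. The candidate tuples, field order, and `min_max_scale` helper are all hypothetical, and the min/max are taken from this tiny batch purely for illustration; in a real pipeline they would come from the training data.

```python
def min_max_scale(values):
    """Map a list of numbers to [0, 1]; assumes at least two distinct values."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical candidates: (cosine similarity, age in days, click-through rate).
candidates = [
    (0.82, 3, 0.041),
    (0.79, 400, 0.120),
    (0.75, 30, 0.015),
]

# Similarity scores are left as they are; only the tabular features are scaled.
similarity = [c[0] for c in candidates]
age_scaled = min_max_scale([c[1] for c in candidates])
ctr_scaled = min_max_scale([c[2] for c in candidates])

# One feature row per candidate, ready for a downstream re-ranking model.
features = [list(row) for row in zip(similarity, age_scaled, ctr_scaled)]

for row in features:
    print([round(v, 3) for v in row])
```

After scaling, document age no longer outweighs the similarity score simply because it is measured in days.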
A critical rule: fit on training data, transform test data
Always fit the scaler on training data only, then apply the same transformation to validation and test data:
    from sklearn.preprocessing import StandardScaler

    # Correct: fit on the training split, then reuse the fitted scaler
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # fit + transform
    X_test_scaled = scaler.transform(X_test)        # transform only

    # Wrong: leaks test statistics into training
    X_all_scaled = scaler.fit_transform(X_all)

Fitting on the full dataset causes data leakage and produces optimistically biased evaluation metrics.
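To make the leakage concrete, here is a small stdlib-only sketch with made-up numbers. It computes the standardisation statistics twice, once from the training split alone and once from train and test together, and shows that the training features themselves change when test values are allowed into the fit.

```python
import statistics

train = [10.0, 12.0, 11.0, 13.0]
test = [12.0, 500.0]  # the test split happens to contain an extreme value

# Correct: statistics come from the training split only.
train_mean = statistics.fmean(train)
train_std = statistics.pstdev(train)
train_scaled = [(x - train_mean) / train_std for x in train]

# Leaky: statistics are computed over train + test together.
all_mean = statistics.fmean(train + test)
all_std = statistics.pstdev(train + test)
train_scaled_leaky = [(x - all_mean) / all_std for x in train]

print(train_mean, all_mean)
print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in train_scaled_leaky])
```

In the leaky version the model is trained on inputs that already encode something about the test set, so the test score no longer measures performance on genuinely unseen data.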
Frequently asked questions
Do I need to scale features for tree-based models?
No. Decision trees, random forests, and gradient boosted trees split on thresholds, which are scale-invariant. Scaling doesn’t hurt, but it doesn’t help either.
What is RobustScaler?
RobustScaler uses the median and interquartile range instead of mean and standard deviation, making it resistant to outliers:
x_scaled = (x - median) / IQR
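The formula can be sketched by hand with the standard library. `robust_scale` is an illustrative helper rather than the scikit-learn class; the quartiles use the `"inclusive"` method of `statistics.quantiles`, which is an assumption chosen here because it interpolates linearly between data points. Other quartile conventions give slightly different numbers.

```python
import statistics


def robust_scale(values):
    """Apply (x - median) / IQR, with quartiles from linear interpolation."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


values = [10, 20, 30, 40, 50, 1_000]  # made-up data with one outlier
scaled = robust_scale(values)

print([round(v, 2) for v in scaled])
```

The five typical values keep a sensible spread around 0, while the outlier stays far away instead of dictating the scale for everything else.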
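Does feature scaling change the information content?
No. Min-max scaling, standardisation, and robust scaling are invertible linear (affine) transformations applied per feature: they shift and rescale values but preserve rank order and the ratios between differences within each feature. What changes is each feature’s weight relative to the others, which is exactly what scale-sensitive models respond to.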
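A quick check of this claim, using min-max scaling on made-up values: the ordering of the points is unchanged, and the original values can be recovered as long as the fitted minimum and maximum are kept.

```python
values = [3.0, 50.0, 7.0, 120.0]

# Forward transform: min-max scaling.
lo, hi = min(values), max(values)
scaled = [(v - lo) / (hi - lo) for v in values]

# Inverse transform: undo the scaling with the stored min and max.
restored = [s * (hi - lo) + lo for s in scaled]


def order(xs):
    """Indices of xs sorted from smallest to largest value."""
    return sorted(range(len(xs)), key=xs.__getitem__)


print(order(values) == order(scaled))
print(restored)
```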