Pre-processing

What is Feature Scaling and Normalisation?

Feature scaling transforms numerical features to a comparable range so that no single feature dominates a model purely because of its magnitude. The two most common techniques are min-max scaling (normalisation), which maps each feature to [0, 1], and standardisation (z-score scaling), which rescales each feature to zero mean and unit variance. Most distance-based and gradient-based models need scaled features to perform well.

Why does feature scaling matter?

Consider a dataset with document length (range: 10-50,000 words) and number of images (range: 0-20). Without scaling, the document length feature dominates Euclidean distance calculations simply because its values are larger, not because it’s more informative.
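To make the effect concrete, here is a minimal sketch using three made-up documents (the values are illustrative, not taken from a real dataset) and the feature ranges quoted above. On the raw values, document B looks closest to A even though its image count is very different. After min-max scaling, C becomes the nearer neighbour.

```python
import math

# Hypothetical documents as (document length in words, number of images).
doc_a = (1_000, 2)
doc_b = (1_200, 18)   # similar length to A, very different image count
doc_c = (3_000, 2)    # same image count as A, somewhat longer

# Feature ranges taken from the example above.
LENGTH_RANGE = (10, 50_000)
IMAGES_RANGE = (0, 20)


def min_max(value, lo, hi):
    """Map a value from [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)


def scale(doc):
    """Scale both features of a document to [0, 1]."""
    return (min_max(doc[0], *LENGTH_RANGE), min_max(doc[1], *IMAGES_RANGE))


raw_ab = math.dist(doc_a, doc_b)
raw_ac = math.dist(doc_a, doc_c)
scaled_ab = math.dist(scale(doc_a), scale(doc_b))
scaled_ac = math.dist(scale(doc_a), scale(doc_c))

print(f"raw:    d(A,B)={raw_ab:.1f}  d(A,C)={raw_ac:.1f}")
print(f"scaled: d(A,B)={scaled_ab:.3f}  d(A,C)={scaled_ac:.3f}")
```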

Models particularly sensitive to feature scale:

K-Nearest Neighbours: distance-based, heavily affected
Support Vector Machines: margin maximisation is scale-sensitive
Neural networks / gradient descent: features on very different scales make the loss surface poorly conditioned, so training is slow or unstable
PCA: maximises variance, so large-scale features dominate components

Models that don’t require scaling:

Tree-based models (decision trees, random forests, gradient boosting): split on thresholds, scale-invariant

Min-max scaling (normalisation)

Maps all values to the range [0, 1]:

x_scaled = (x - x_min) / (x_max - x_min)

from sklearn.preprocessing import MinMaxScaler

# X_train: array-like of shape (n_samples, n_features)
scaler = MinMaxScaler()  # default feature_range=(0, 1)
X_train_scaled = scaler.fit_transform(X_train)

Best for: neural network inputs, and cases where you know the approximate min/max of the data. Weakness: sensitive to outliers, since a single extreme value stretches the range and compresses all other values into a narrow band near 0 (or near 1, if the outlier is unusually small).
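The formula and the outlier weakness can be sketched by hand with NumPy. This is an illustrative re-implementation, not scikit-learn's code. The helper name `min_max_scale`, the sample values, and the handling of constant columns are assumptions made for the example.

```python
import numpy as np


def min_max_scale(x):
    """Apply (x - x_min) / (x_max - x_min) column-wise to a 2-D array."""
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    span = x_max - x_min
    # Assumption: a constant column (span == 0) is mapped to 0
    # instead of dividing by zero.
    span[span == 0] = 1.0
    return (x - x_min) / span


clean = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])
with_outlier = np.vstack([clean, [[10_000.0]]])  # one extreme value added

scaled_clean = min_max_scale(clean)
scaled_outlier = min_max_scale(with_outlier)

print(scaled_clean.ravel())     # evenly spread over [0, 1]
print(scaled_outlier.ravel())   # first five values squeezed close to 0
```

With the outlier present, the five ordinary values all land below 0.01, which leaves a distance-based model very little resolution to tell them apart.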

Standardisation (Z-score scaling)

Centres data at zero and scales to unit variance:

x_scaled = (x - mean) / std

from sklearn.preprocessing import StandardScaler

# X_train: array-like of shape (n_samples, n_features)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

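Best for: most general cases, especially when you don’t know the range. Weakness: doesn’t bound values, so outliers remain outliers, and because the mean and standard deviation are themselves pulled by extreme values, heavy outliers still distort the scaling.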
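As a sanity check on the formula, the sketch below standardises a small made-up feature matrix by hand. The helper name `standardise` and the data are assumptions for illustration. It uses the population standard deviation (NumPy's default, `ddof=0`), which is also what `StandardScaler` computes.

```python
import numpy as np


def standardise(x):
    """Apply (x - mean) / std column-wise to a 2-D array."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    std = x.std(axis=0)   # population standard deviation (ddof=0)
    std[std == 0] = 1.0   # assumption: constant columns are only centred
    return (x - mean) / std


# Hypothetical features: document length (words) and number of images.
X = np.array([
    [120.0, 0.0],
    [4_500.0, 3.0],
    [18_000.0, 7.0],
    [50_000.0, 20.0],
])

X_std = standardise(X)

print(X_std.mean(axis=0))                    # approximately 0 per column
print(X_std.std(axis=0))                     # approximately 1 per column
print(X_std.min(axis=0), X_std.max(axis=0))  # not confined to [0, 1]
```

Both columns end up on the same footing, but unlike min-max scaling the results include negative values and values above 1.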

Min-max vs standardisation: which to use?

Scenario	Recommended
Known bounded range (e.g. pixel values 0-255)	Min-max
Unknown range, possible outliers	Standardisation
Neural network training	Either (standardisation more common)
SVM	Standardisation
PCA	Standardisation
Data has extreme outliers	RobustScaler (uses median/IQR)

Feature scaling and embedding vectors

Dense embedding vectors produced by SIE’s encoding models are already unit-normalised (L2 norm = 1) for cosine similarity search. You don’t need to apply additional scaling to embedding vectors.
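To illustrate what unit-normalised means in practice, the sketch below L2-normalises some random vectors (stand-ins only, not real model output) and checks that cosine similarity then reduces to a plain dot product.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(3, 8))   # stand-in for un-normalised embeddings

# L2-normalise each row so that its Euclidean norm is 1.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
norms = np.linalg.norm(unit, axis=1)

# For unit vectors the denominator of cosine similarity is 1,
# so the cosine equals the dot product.
a, b = unit[0], unit[1]
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = a @ b

print(norms)
print(cosine, dot)
```

The same norm check is a quick way to confirm whether vectors from any other source are already normalised before deciding to rescale them.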

However, when combining embedding similarity scores with other tabular features (e.g. document recency, click-through rate) in a re-ranking model, you’ll need to scale the non-embedding features to be comparable with the similarity scores.
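One way this might look is sketched below. The feature names, values, and training-set bounds are all hypothetical. The tabular features are min-max scaled with bounds assumed to come from the re-ranker's training data, and the similarity scores are left untouched.

```python
import numpy as np

# Hypothetical re-ranking candidates for a single query.
similarity = np.array([0.82, 0.79, 0.65, 0.40])          # cosine similarity
recency_days = np.array([3.0, 400.0, 30.0, 1_200.0])     # days since publication
click_through = np.array([0.020, 0.004, 0.051, 0.010])   # clicks / impressions

# Assumed bounds, as if measured on the re-ranker's training set.
RECENCY_MIN, RECENCY_MAX = 0.0, 2_000.0
CTR_MIN, CTR_MAX = 0.0, 0.08


def min_max(values, lo, hi):
    """Scale to [0, 1] with training-set bounds, clipping unseen extremes."""
    return np.clip((values - lo) / (hi - lo), 0.0, 1.0)


features = np.column_stack([
    similarity,                                       # left unscaled
    min_max(recency_days, RECENCY_MIN, RECENCY_MAX),
    min_max(click_through, CTR_MIN, CTR_MAX),
])

print(features.round(3))
```

Standardisation would work equally well here. The important part is that the bounds or statistics come from training data rather than from the candidates being ranked.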

A critical rule: fit on training data, transform test data

Always fit the scaler on training data only, then apply the same transformation to validation and test data:

from sklearn.preprocessing import StandardScaler

# Correct
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit + transform
X_test_scaled = scaler.transform(X_test)          # transform only

# Wrong: leaks test statistics into training
X_all_scaled = scaler.fit_transform(X_all)

Fitting on the full dataset causes data leakage and produces optimistically biased evaluation metrics.
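The same rule applies inside cross-validation, where it is easy to break by scaling once up front. A common way to enforce it is to wrap the scaler and the model in a scikit-learn pipeline, as in this sketch. The synthetic data and the choice of a k-nearest-neighbours classifier are assumptions made only so that the example runs end to end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to make the example self-contained.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The pipeline refits the scaler on the training portion of every fold,
# so held-out rows never contribute to the mean or standard deviation.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

model.fit(X_train, y_train)                  # scaler is fitted on X_train only
test_accuracy = model.score(X_test, y_test)  # X_test is only transformed

print(cv_scores.mean(), test_accuracy)
```

Because the scaler lives inside the pipeline, calling `fit` and `score` on the whole object keeps the fit/transform split correct without any manual bookkeeping.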

Frequently asked questions

Do I need to scale features for tree-based models? No. Decision trees, random forests, and gradient boosted trees split on thresholds, which are scale-invariant. Scaling doesn’t hurt, but it doesn’t help either.
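The scale invariance of threshold splits can be checked with a toy one-feature decision stump. This is a simplified illustration written for this example, not scikit-learn's tree implementation, and the data is made up. The stump picks the threshold with the lowest weighted Gini impurity, and the resulting partition of the samples is the same before and after standardisation.

```python
import numpy as np


def gini(labels):
    """Gini impurity of a binary 0/1 label array."""
    p = labels.mean()
    return 2 * p * (1 - p)


def best_split_mask(x, y):
    """Return a boolean mask of the samples sent left by the best threshold."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_impurity, best_threshold = np.inf, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold fits between equal values
        threshold = (xs[i] + xs[i - 1]) / 2
        left, right = ys[:i], ys[i:]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if impurity < best_impurity:
            best_impurity, best_threshold = impurity, threshold
    return x <= best_threshold


# Hypothetical document lengths and binary labels.
x = np.array([9_000.0, 120.0, 50_000.0, 2_500.0, 800.0, 30_000.0])
y = np.array([1, 0, 1, 0, 0, 1])

x_standardised = (x - x.mean()) / x.std()

mask_raw = best_split_mask(x, y)
mask_scaled = best_split_mask(x_standardised, y)

print(mask_raw)
print(mask_scaled)
```

The threshold value itself changes with the scale, but the impurity depends only on which labels fall on each side, so the chosen partition does not.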

What is RobustScaler? RobustScaler uses the median and interquartile range instead of mean and standard deviation, making it resistant to outliers: x_scaled = (x - median) / IQR, where IQR is the difference between the 75th and 25th percentiles.
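The formula can be reproduced by hand with NumPy, as in this sketch. The helper name `robust_scale` and the sample values are assumptions for illustration, and the quartiles use NumPy's default linear interpolation.

```python
import numpy as np


def robust_scale(x):
    """Apply (x - median) / IQR column-wise to a 2-D array."""
    x = np.asarray(x, dtype=float)
    median = np.median(x, axis=0)
    q1, q3 = np.percentile(x, [25, 75], axis=0)
    iqr = q3 - q1
    iqr[iqr == 0] = 1.0   # assumption: avoid division by zero
    return (x - median) / iqr


# Five ordinary values plus one extreme outlier.
values = np.array([[10.0], [20.0], [30.0], [40.0], [50.0], [10_000.0]])
scaled = robust_scale(values)

print(scaled.ravel())
```

The five ordinary values keep an even spread between -1 and 0.6, while the outlier remains far away instead of squeezing everything else together.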

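Does feature scaling change the information content? Not within a feature. Min-max scaling and standardisation are affine transformations with a positive scale factor (assuming the feature is not constant): they change magnitude but preserve rank order and relative spacing, and they can be inverted. What does change is how much each feature contributes to distances and gradients relative to the others, which is the point of scaling.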
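A small check of this, assuming standardisation of a single made-up feature: the ordering of the values is unchanged, and the originals can be recovered from the stored mean and standard deviation.

```python
import numpy as np

# Hypothetical document lengths in words.
x = np.array([4_500.0, 120.0, 50_000.0, 18_000.0])

mean, std = x.mean(), x.std()
z = (x - mean) / std          # standardise

same_order = np.array_equal(np.argsort(x), np.argsort(z))
recovered = z * std + mean    # invert the transformation

print(same_order)
print(recovered)
```

scikit-learn scalers expose the same idea through their `inverse_transform` method.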