
What is Feature Scaling and Normalisation?

Feature scaling transforms numerical features to a comparable range so that no single feature dominates a model due to its magnitude. The two most common techniques are min-max scaling (normalisation), which maps values to [0, 1], and standardisation (z-score scaling), which centres each feature at zero with unit variance. Distance-based models generally need scaled features to perform well, and gradient-based models usually train faster and more stably with them.


Why does feature scaling matter?

Consider a dataset with document length (range: 10-50,000 words) and number of images (range: 0-20). Without scaling, the document length feature dominates Euclidean distance calculations simply because its values are larger, not because it’s more informative.
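The effect can be illustrated with a small sketch. The four documents below are made-up values chosen for illustration, and `nearest_to_first` is a hypothetical helper, not part of any library:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical documents: [document length in words, number of images]
docs = np.array([
    [1_000.0, 0.0],
    [1_200.0, 20.0],   # similar length to doc 0, opposite end of the image range
    [3_000.0, 0.0],    # same image count as doc 0, somewhat longer
    [40_000.0, 10.0],
])

def nearest_to_first(X):
    """Index of the row closest to row 0 by Euclidean distance (row 0 excluded)."""
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(dists)) + 1

# On raw values, word count decides the neighbour almost on its own
raw_nearest = nearest_to_first(docs)

# After standardisation, both features contribute on a comparable scale
scaled_nearest = nearest_to_first(StandardScaler().fit_transform(docs))

print(raw_nearest, scaled_nearest)
```

On the raw data the nearest neighbour of document 0 is document 1, even though the two sit at opposite ends of the image range. After standardisation the nearest neighbour becomes document 2.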

Models particularly sensitive to feature scale:

  • K-Nearest Neighbours: distance-based, heavily affected
  • Support Vector Machines: margin maximisation is scale-sensitive
  • Neural networks / gradient descent: features on very different scales make the loss surface poorly conditioned, which slows or destabilises learning
  • PCA: maximises variance, so large-scale features dominate components

Models that don’t require scaling:

  • Tree-based models (decision trees, random forests, gradient boosting): split on thresholds, scale-invariant

Min-max scaling (normalisation)

Maps all values to the range [0, 1]:

x_scaled = (x - x_min) / (x_max - x_min)

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()  # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X_train)  # learn per-feature min/max from X_train, then rescale

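Best for: neural network inputs, and cases where you know the approximate min/max of the data. Weakness: sensitive to outliers, since a single extreme value stretches the range and compresses all other values into a narrow band (towards 0 for a high outlier, towards 1 for a low one).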
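A minimal sketch of the outlier weakness, using made-up values (five typical points and one extreme one):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature: five typical values plus one high outlier
values = np.array([[10.0], [12.0], [14.0], [16.0], [18.0], [1_000.0]])

scaled = MinMaxScaler().fit_transform(values).ravel()

# How much of the [0, 1] range the five typical values occupy after scaling
typical_spread = scaled[:5].max() - scaled[:5].min()

print(scaled)
print(typical_spread)
```

Here the five typical values end up within less than 1% of the output range, because the outlier alone defines the maximum.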


Standardisation (Z-score scaling)

Centres data at zero and scales to unit variance:

x_scaled = (x - mean) / std

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)  # learn per-feature mean/std from X_train, then rescale

Best for: most general cases, especially when you don’t know the range. Weakness: the output is unbounded, and the mean and standard deviation are themselves pulled by outliers, so extreme values stay extreme after scaling.
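To make the formula concrete, the sketch below checks that StandardScaler matches a manual z-score computed with NumPy. The data is randomly generated for illustration. Note that StandardScaler uses the population standard deviation (`ddof=0`), which is also NumPy's default:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=12.0, size=(200, 3))  # hypothetical training data

scaler = StandardScaler().fit(X_train)
X_sklearn = scaler.transform(X_train)

# Manual z-score per feature (column), using the population standard deviation
X_manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(X_sklearn, X_manual))
print(X_sklearn.mean(axis=0), X_sklearn.std(axis=0))
```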


Min-max vs standardisation: which to use?

| Scenario | Recommended |
| --- | --- |
| Known bounded range (e.g. pixel values 0-255) | Min-max |
| Unknown range, possible outliers | Standardisation |
| Neural network training | Either (standardisation more common) |
| SVM | Standardisation |
| PCA | Standardisation |
| Data has extreme outliers | RobustScaler (uses median/IQR) |

Feature scaling and embedding vectors

Dense embedding vectors produced by SIE’s encoding models are already unit-normalised (L2 norm = 1) for cosine similarity search. You don’t need to apply additional scaling to embedding vectors.

However, when combining embedding similarity scores with other tabular features (e.g. document recency, click-through rate) in a re-ranking model, you’ll need to scale the non-embedding features to be comparable with the similarity scores.
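One way this might look in practice is sketched below. The similarity scores, the two tabular features and the choice of MinMaxScaler are all assumptions made for illustration; they do not describe a specific SIE API or a recommended re-ranking recipe:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical candidates for one query
similarity = np.array([0.82, 0.79, 0.40, 0.15])  # cosine similarity, already bounded in [-1, 1]
tabular = np.array([
    [3.0,   0.120],   # [days since publication, click-through rate]
    [400.0, 0.050],
    [30.0,  0.300],
    [900.0, 0.010],
])

# Scale only the tabular columns. In a real system the scaler would be
# fitted on training data and reused unchanged at serving time.
tabular_scaler = MinMaxScaler().fit(tabular)
tabular_scaled = tabular_scaler.transform(tabular)

# Feature matrix for the re-ranking model: similarity stays as-is
features = np.column_stack([similarity, tabular_scaled])
print(features)
```

Without the scaling step, the recency column (hundreds of days) would dwarf both the similarity score and the click-through rate in any scale-sensitive re-ranker.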


A critical rule: fit on training data, transform test data

Always fit the scaler on training data only, then apply the same transformation to validation and test data:

from sklearn.preprocessing import StandardScaler

# Correct
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform
X_test_scaled = scaler.transform(X_test)  # transform only

# Wrong: leaks test statistics into training
X_all_scaled = StandardScaler().fit_transform(X_all)

Fitting on the full dataset causes data leakage and produces optimistically biased evaluation metrics.
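The same rule applies inside cross-validation, where it is easy to break by accident. A common way to enforce it is to wrap the scaler and the model in a scikit-learn pipeline, so the scaler is refitted on the training portion of each fold. The sketch below uses a synthetic dataset, with logistic regression as an arbitrary example model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# The pipeline fits StandardScaler on each fold's training split only,
# then applies that fitted scaler to the fold's held-out split.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)

print(scores.mean())
```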


Frequently asked questions

Do I need to scale features for tree-based models? No. Decision trees, random forests, and gradient boosted trees split on thresholds, which are scale-invariant. Scaling doesn’t hurt, but it doesn’t help either.
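A small check of this claim on synthetic data is sketched below. The dataset and the fixed `random_state` are assumptions for illustration. In principle, floating-point edge cases could cause rare differences, so this is a demonstration rather than a proof:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Three features on very different scales (hypothetical data)
X = rng.normal(size=(300, 3)) * np.array([1.0, 100.0, 10_000.0])
y = (X[:, 0] + X[:, 1] / 100.0 > 0).astype(int)

X_train, X_test = X[:200], X[200:]
y_train = y[:200]

scaler = StandardScaler().fit(X_train)

# Same tree settings, trained once on raw and once on standardised features
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X_train), y_train)

pred_raw = tree_raw.predict(X_test)
pred_scaled = tree_scaled.predict(scaler.transform(X_test))

# Fraction of test points on which the two trees agree
agreement = float(np.mean(pred_raw == pred_scaled))
print(agreement)
```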

What is RobustScaler? RobustScaler uses the median and interquartile range instead of mean and standard deviation, making it resistant to outliers: x_scaled = (x - median) / IQR
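As a sketch, the formula can be checked against scikit-learn's RobustScaler with its default settings (25th to 75th percentile range). The values are made up, with one extreme outlier:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature: five typical values plus one high outlier
values = np.array([[10.0], [12.0], [14.0], [16.0], [18.0], [1_000.0]])

scaled = RobustScaler().fit_transform(values).ravel()

# Manual version of (x - median) / IQR
median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
manual = ((values - median) / (q3 - q1)).ravel()

# Spread of the five typical values after scaling
typical_spread = scaled[:5].max() - scaled[:5].min()

print(scaled)
print(np.allclose(scaled, manual))
```

The outlier is still far from the rest, but the typical values keep a usable spread because the median and IQR barely move when one extreme point is added.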

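Does feature scaling change the information content? No, as long as the scaler's parameters are kept. Each scaler described here applies an affine transformation with a positive scale factor to each feature, so it preserves rank order and the relative spacing of values within that feature, and it can be inverted exactly. What changes is the magnitude of each feature relative to the others, which is precisely what scale-sensitive models respond to.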
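A quick way to see this is to check that per-feature ordering is unchanged and that `inverse_transform` recovers the original values. The sketch below uses randomly generated data and a hypothetical `results` dictionary to collect the checks:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(50, 2))  # hypothetical data

results = {}
for scaler in (MinMaxScaler(), StandardScaler()):
    X_scaled = scaler.fit_transform(X)
    # Rank order within each feature (column) is unchanged
    same_order = np.array_equal(np.argsort(X, axis=0), np.argsort(X_scaled, axis=0))
    # The fitted scaler can undo the transformation
    recovered = np.allclose(scaler.inverse_transform(X_scaled), X)
    results[type(scaler).__name__] = (same_order, recovered)

print(results)
```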

