What is Gradient Boosting?

Gradient boosting is an ensemble learning technique that builds a sequence of decision trees, where each new tree learns to correct the errors of the previous ones. It minimises a differentiable loss function by fitting each tree to the negative gradient of the loss with respect to the current predictions; for squared-error loss, this negative gradient is exactly the residual. XGBoost, LightGBM, and CatBoost are the dominant implementations and regularly rank among the top performers on tabular data benchmarks.

Why does gradient boosting matter?

Gradient boosted trees are among the most widely used algorithms for structured/tabular data in production ML. They:

Often outperform random forests on tabular benchmarks when properly tuned
Handle mixed feature types, missing values, and non-linear relationships with little preprocessing
Are fast at inference (tree traversal is cheap)
Provide feature importance scores that aid interpretation

In search and RAG systems, gradient boosting is commonly used as a learning-to-rank (LTR) model that combines embedding similarity with metadata signals to re-order retrieval results.

How does gradient boosting work?

Unlike random forests, which train their trees independently of one another, gradient boosting builds trees sequentially:

Start with a simple prediction (e.g. mean of target)
Compute residuals (how wrong the current prediction is)
Fit a new tree to predict the residuals
Add the new tree to the ensemble (with a learning rate shrinkage)
Repeat for N rounds

F₀(x) = initial prediction (e.g. mean)
F₁(x) = F₀(x) + η × tree₁(x)   ← tree fits residuals of F₀
F₂(x) = F₁(x) + η × tree₂(x)   ← tree fits residuals of F₁
...
Fₙ(x) = Fₙ₋₁(x) + η × treeₙ(x)

Where η is the learning rate, controlling how much each tree contributes.
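
The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming squared-error loss (so the negative gradient is simply the residual), a single numeric feature, and depth-1 trees (stumps). The helper names `fit_stump` and `gradient_boost` and the toy data are made up for this example, and real libraries use far more sophisticated tree building.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree: one threshold, one constant value per side."""
    best = None
    # Candidate thresholds are the distinct x values except the largest,
    # so that both sides of the split are always non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        sse = (sum((r - left_value) ** 2 for r in left)
               + sum((r - right_value) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    if best is None:  # all x values identical: fall back to a constant
        constant = mean(residuals)
        return lambda x: constant
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Return a predict function F_n(x) = F_0 + eta * sum of trees."""
    f0 = mean(ys)                      # F_0: initial prediction
    trees = []
    preds = [f0] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # Shrink each tree's contribution by the learning rate.
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]

    def predict(x):
        return f0 + learning_rate * sum(tree(x) for tree in trees)
    return predict

# Toy data: a noisy step function.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

model = gradient_boost(xs, ys)
baseline = mean(ys)
baseline_mse = mean([(y - baseline) ** 2 for y in ys])
boosted_mse = mean([(y - model(x)) ** 2 for x, y in zip(xs, ys)])
print(f"baseline MSE: {baseline_mse:.4f}, boosted MSE: {boosted_mse:.4f}")
```

With a smaller learning rate, each round removes a smaller fraction of the remaining error, which is why lower values of η generally need more rounds.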

What is AdaBoost and how does it differ?

AdaBoost (Adaptive Boosting) is an earlier boosting algorithm that reweights training examples rather than fitting residuals; misclassified examples get higher weight in the next round. Gradient boosting generalised this idea to any differentiable loss function.

	AdaBoost	Gradient Boosting
Fits on	Reweighted examples	Residuals (negative gradient)
Loss function	Exponential	Any differentiable loss
Flexibility	Limited	High
Sensitivity to noise	High	Medium
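
To make the "reweighted examples" row concrete, here is a sketch of a single AdaBoost weight update for binary labels in {-1, +1}. It assumes the weak learner's predictions are already given; the function name `adaboost_reweight` and the toy labels are illustrative only.

```python
import math

def adaboost_reweight(weights, labels, predictions):
    """One AdaBoost round: compute the learner's vote and the new example weights.

    Assumes labels and predictions are in {-1, +1}, weights sum to 1,
    and the weighted error is strictly between 0 and 1.
    """
    # Weighted error: total weight of the misclassified examples.
    err = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    # Vote of this weak learner in the final ensemble.
    alpha = 0.5 * math.log((1 - err) / err)
    # y * h is +1 when correct and -1 when wrong, so wrong examples are scaled up.
    updated = [w * math.exp(-alpha * y * h)
               for w, y, h in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

labels      = [+1, +1, -1, -1, +1]
predictions = [+1, -1, -1, -1, +1]   # the weak learner gets index 1 wrong
weights     = [0.2] * 5              # start from uniform weights

alpha, new_weights = adaboost_reweight(weights, labels, predictions)
print(alpha, new_weights)
```

After the update, the single misclassified example carries half of the total weight, so the next weak learner is pushed to get it right.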

XGBoost vs LightGBM vs CatBoost

	XGBoost	LightGBM	CatBoost
Speed	Fast	Very fast	Medium
Memory	Medium	Low	Medium
Categorical features	Manual encoding (native support in recent versions)	Native support	Native support (ordered)
GPU support	✓	✓	✓
Best for	General use	Large datasets, speed	Datasets with many categoricals

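LightGBM grows trees leaf-wise (always splitting the leaf with the largest loss reduction) instead of depth-wise, and combines this with histogram-based split finding, which makes it fast on large datasets; leaf-wise growth can overfit small datasets unless the number of leaves or the depth is limited. CatBoost’s ordered boosting and ordered target statistics compute each example's statistics only from examples that precede it in a random permutation, which reduces target leakage, particularly when encoding categorical features.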
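
The idea of using only "earlier" examples to limit target leakage can be illustrated with a simplified ordered target encoding. This is a sketch of the concept, not CatBoost's actual implementation: it assumes the rows are already in a random order (CatBoost draws its own permutations), and the function name and the `prior` and `prior_weight` parameters are chosen for this example.

```python
def ordered_target_encode(categories, targets, prior, prior_weight=1.0):
    """Encode each row's category using only the targets of rows that came before it."""
    sums, counts = {}, {}
    encoded = []
    for category, target in zip(categories, targets):
        seen_sum = sums.get(category, 0.0)
        seen_count = counts.get(category, 0)
        # Smoothed mean of earlier targets for this category; the row's own
        # target is not used, so its label cannot leak into its feature.
        encoded.append((seen_sum + prior_weight * prior) / (seen_count + prior_weight))
        # Only now add this row's target to the running statistics.
        sums[category] = seen_sum + target
        counts[category] = seen_count + 1
    return encoded

categories = ["a", "b", "a", "a", "b"]
targets    = [1, 0, 1, 0, 1]

encoded = ordered_target_encode(categories, targets, prior=0.5)
print(encoded)
```

A plain target encoding would instead use the mean over all rows of a category, including the row being encoded, which is where the leakage comes from.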

Gradient boosting as a re-ranker in search pipelines

A common production pattern combines embedding retrieval with a gradient boosted re-ranker:

First stage: retrieve top-100 candidates via semantic search (SIE embeddings + vector DB)
Feature extraction: for each (query, document) pair, compute features:
- Embedding cosine similarity
- BM25 score
- Document recency
- Source authority
- Metadata match signals
Re-rank: a gradient boosted model (e.g. LambdaMART, as implemented in XGBoost or LightGBM) scores each candidate using these features
Serve: return top-k re-ranked results

Two-stage retrieve-then-re-rank architectures of this kind are used in large-scale search systems (for example at Microsoft and LinkedIn), and can be implemented efficiently with SIE providing the embedding features.
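
The feature-extraction and re-rank stages might look roughly like the sketch below. Everything here is illustrative: the `Candidate` fields, the three features, and especially `toy_score`, which is a hand-written stand-in for the predict call of a trained gradient boosted ranker. First-stage retrieval is assumed to have already produced the candidates.

```python
import math
from dataclasses import dataclass

@dataclass
class Candidate:
    doc_id: str
    embedding: list   # document embedding from the first-stage retriever
    bm25: float       # lexical match score
    age_days: float   # document recency

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def extract_features(query_embedding, candidate):
    """Build the feature vector for one (query, document) pair."""
    return [cosine(query_embedding, candidate.embedding),
            candidate.bm25,
            candidate.age_days]

def toy_score(features):
    """Hypothetical stand-in for a trained model: the sum of three tiny 'trees'."""
    cos_sim, bm25, age_days = features
    tree1 = 1.0 if cos_sim > 0.8 else 0.2
    tree2 = 0.5 if bm25 > 5.0 else 0.0
    tree3 = -0.3 if age_days > 365 else 0.0
    return tree1 + tree2 + tree3

def rerank(query_embedding, candidates, score_fn, top_k):
    """Score every candidate and return the doc_ids of the top_k, best first."""
    scored = [(score_fn(extract_features(query_embedding, c)), c.doc_id)
              for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]

query_embedding = [1.0, 0.0]
candidates = [
    Candidate("A", [0.9, 0.1], bm25=2.0, age_days=30),
    Candidate("B", [0.5, 0.5], bm25=8.0, age_days=10),
    Candidate("C", [1.0, 0.0], bm25=9.0, age_days=800),
]

top = rerank(query_embedding, candidates, toy_score, top_k=2)
print(top)
```

In a real pipeline, `score_fn` would wrap a model trained on labelled (query, document) pairs, and the feature list would be extended with signals such as source authority and metadata matches.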

Regularisation in gradient boosting

Gradient boosting can overfit, especially with many rounds. Key regularisation parameters:

Learning rate (η): smaller = less overfitting, more rounds needed
Max depth: limits tree complexity
Min samples per leaf: prevents very specific splits
Subsampling: train each tree on a random subset of rows and/or features
L1/L2 regularisation: penalise large leaf weights (e.g. in XGBoost and LightGBM)
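
To show what "penalise large leaf weights" means numerically, the sketch below computes a single leaf's weight in the closed form commonly given for XGBoost's regularised objective: the sum of gradients is soft-thresholded by the L1 term and divided by the sum of Hessians plus the L2 term. The function and argument names are illustrative, and squared-error loss is assumed, so each gradient is (prediction - target) and each Hessian is 1.

```python
def leaf_weight(gradients, hessians, l2=0.0, l1=0.0):
    """Regularised leaf value for the examples that fall into one leaf."""
    g_sum = sum(gradients)
    h_sum = sum(hessians)
    # L1 shrinks the gradient sum toward zero (soft-thresholding).
    if g_sum > l1:
        g_adjusted = g_sum - l1
    elif g_sum < -l1:
        g_adjusted = g_sum + l1
    else:
        g_adjusted = 0.0
    # L2 inflates the denominator, pulling the weight toward zero.
    return -g_adjusted / (h_sum + l2)

# Three examples in a leaf with residuals 2, 1 and 3 under squared error:
# gradient = prediction - target = -residual, Hessian = 1.
gradients = [-2.0, -1.0, -3.0]
hessians = [1.0, 1.0, 1.0]

unregularised = leaf_weight(gradients, hessians)               # mean residual
with_l2 = leaf_weight(gradients, hessians, l2=1.0)
with_l1_and_l2 = leaf_weight(gradients, hessians, l2=1.0, l1=1.0)
print(unregularised, with_l2, with_l1_and_l2)
```

Without regularisation the leaf simply predicts the mean residual; each penalty pulls that value closer to zero, which matters most for leaves containing few examples.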

Frequently asked questions

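How many boosting rounds should I use? Use early stopping: monitor validation loss and stop once it has not improved for a set number of rounds. Typical values range from 100 to 1000 rounds, depending on dataset size and learning rate (smaller learning rates need more rounds).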
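
The early-stopping rule can be sketched as follows. This is a simplified illustration: the validation losses are a made-up list rather than the output of real training, and the `patience` parameter name is an assumption of this example. XGBoost, LightGBM, and CatBoost each provide their own built-in early stopping.

```python
def best_round_with_early_stopping(val_losses, patience):
    """Return (best_round, best_loss), stopping after `patience` rounds without improvement."""
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            break  # no improvement for `patience` consecutive rounds
    return best_round, best_loss

# Validation loss per boosting round (illustrative numbers).
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.49, 0.60]

best_round, best_loss = best_round_with_early_stopping(val_losses, patience=3)
print(best_round, best_loss)
```

Note the trade-off in choosing the patience: with a patience of 3, training stops before reaching the slightly better loss that appears later in the list, while a larger patience would find it at the cost of more rounds.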

Is gradient boosting better than deep learning for tabular data? Often yes for small-to-medium structured datasets. Deep learning models (tabular transformers, MLPs) can compete on very large datasets but usually require more tuning.

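What is LambdaMART? LambdaMART is a gradient boosting algorithm specialised for learning-to-rank tasks. Ranking metrics such as NDCG and MRR are not differentiable, so instead of optimising them directly it uses pairwise gradients (lambdas) weighted by how much swapping two documents would change the target metric (typically NDCG). This optimises ranking quality more directly than regression or classification losses, making it well suited to re-ranking in search systems.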
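
Because metrics such as NDCG are not directly differentiable, LambdaMART is usually described as scaling each pairwise gradient by how much swapping the two documents would change the metric. The sketch below computes only that |ΔNDCG| term, assuming the common exponential gain (2^rel - 1) and log2 position discount; the function names are illustrative, and the full algorithm (pairwise gradients plus tree fitting) is not shown.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevance labels listed in ranked order."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def delta_ndcg(relevances, i, j):
    """Absolute change in NDCG if the documents at positions i and j swap places."""
    swapped = list(relevances)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(ndcg(swapped) - ndcg(relevances))

# Relevance labels of four documents in the model's current ranking.
ranking = [0, 2, 1, 0]

top_swap = delta_ndcg(ranking, 0, 1)      # fix a mistake at the top of the list
bottom_swap = delta_ndcg(ranking, 2, 3)   # reorder two documents near the bottom
print(top_swap, bottom_swap)
```

The swap at the top of the list changes NDCG far more than the swap near the bottom, so the corresponding pair receives a larger gradient and the model concentrates on getting the highest positions right.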