What is Gradient Boosting?
Gradient boosting is an ensemble learning technique that builds a sequence of decision trees, where each new tree learns to correct the errors of the previous ones. It minimises a differentiable loss function by fitting each tree to the negative gradient of the loss with respect to the current predictions (the pseudo-residuals, which equal the ordinary residuals for squared-error loss). XGBoost, LightGBM, and CatBoost are the dominant implementations and consistently achieve top results on tabular data benchmarks.
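As a small illustration of what "negative gradient of the loss" means, the sketch below writes it out for two common losses. The function names are made up for this example, and the squared-error loss is assumed to be scaled by 1/2 so that the negative gradient is exactly the residual.

```python
import math


def squared_error_negative_gradient(y, prediction):
    # Loss L = 0.5 * (y - F)^2, so -dL/dF = y - F: the ordinary residual.
    return y - prediction


def log_loss_negative_gradient(y, raw_score):
    # Binary log loss with p = sigmoid(F); here -dL/dF = y - p.
    probability = 1.0 / (1.0 + math.exp(-raw_score))
    return y - probability


regression_target = squared_error_negative_gradient(3.0, 2.5)
classification_target = log_loss_negative_gradient(1, 0.0)
```

In both cases the value returned is what the next tree would be trained to predict for that example.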
Why does gradient boosting matter?
Gradient boosted trees are among the most widely used algorithms for structured/tabular data in production ML. They:
- Often outperform random forests on tabular benchmarks when properly tuned
- Handle mixed feature types, missing values, and non-linear relationships naturally
- Are fast at inference (tree traversal is cheap)
- Provide interpretable feature importance
In search and RAG systems, gradient boosting is commonly used as a learning-to-rank (LTR) model that combines embedding similarity with metadata signals to re-order retrieval results.
How does gradient boosting work?
Unlike random forests, which train independent trees that can be built in parallel, gradient boosting builds trees sequentially:
- Start with a simple prediction (e.g. mean of target)
- Compute residuals (how wrong the current prediction is)
- Fit a new tree to predict the residuals
- Add the new tree to the ensemble (with a learning rate shrinkage)
- Repeat for N rounds
F₀(x) = initial prediction (e.g. mean)
F₁(x) = F₀(x) + η × tree₁(x) ← tree fits residuals of F₀
F₂(x) = F₁(x) + η × tree₂(x) ← tree fits residuals of F₁
...
Fₙ(x) = Fₙ₋₁(x) + η × treeₙ(x)
Where η is the learning rate, controlling how much each tree contributes.
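The update rule above can be sketched in plain Python. This is a minimal illustration, not how XGBoost or LightGBM are implemented: it assumes squared-error loss, a single numeric feature, and one-split trees (stumps). All function names are hypothetical.

```python
from statistics import mean


def fit_stump(xs, targets):
    """Fit a one-split regression tree (a stump) on a single numeric feature."""
    best = None
    # Try every threshold that leaves at least one point on each side.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        sse = (sum((t - left_value) ** 2 for t in left)
               + sum((t - right_value) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]  # (threshold, left_value, right_value)


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.1):
    """Squared-error gradient boosting: each stump fits the current residuals."""
    initial = mean(ys)  # F0: constant prediction
    predictions = [initial] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # F_m(x) = F_{m-1}(x) + eta * tree_m(x)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return {"initial": initial, "learning_rate": learning_rate, "stumps": stumps}


def predict(model, x):
    return model["initial"] + model["learning_rate"] * sum(
        predict_stump(stump, x) for stump in model["stumps"])


# Toy data: a step function that a single constant cannot fit.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
model = fit_boosted_stumps(xs, ys)
```

With a learning rate of 0.1, each round removes only about a tenth of the remaining error on this toy data, which is why many rounds are needed before `predict(model, 2)` gets close to 1.0.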
What is AdaBoost and how does it differ?
AdaBoost (Adaptive Boosting) is an earlier boosting algorithm that reweights training examples rather than fitting residuals; misclassified examples get higher weight in the next round. Gradient boosting generalised this idea to any differentiable loss function.
| | AdaBoost | Gradient Boosting |
|---|---|---|
| Fits on | Reweighted examples | Residuals (negative gradient) |
| Loss function | Exponential | Any differentiable loss |
| Flexibility | Limited | High |
| Sensitivity to noise | High | Medium |
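To make the reweighting idea concrete, here is a sketch of a single round of the classic binary AdaBoost weight update. It assumes the weak learner's weighted error is strictly between 0 and 1, and the function name is invented for this example.

```python
import math


def adaboost_reweight(weights, is_correct):
    """Return the learner weight (alpha) and the renormalised example weights."""
    total = sum(weights)
    error = sum(w for w, ok in zip(weights, is_correct) if not ok) / total
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified examples are scaled up, correct ones scaled down.
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, is_correct)]
    normaliser = sum(updated)
    return alpha, [w / normaliser for w in updated]


# Four equally weighted examples; the weak learner gets the last one wrong.
alpha, new_weights = adaboost_reweight([0.25, 0.25, 0.25, 0.25],
                                       [True, True, True, False])
```

After the update the single misclassified example carries half of the total weight, so the next weak learner is pushed to get it right.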
XGBoost vs LightGBM vs CatBoost
| | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Speed | Fast | Very fast | Medium |
| Memory | Medium | Low | Medium |
| Categorical features | Manual encoding (native support in newer versions) | Native support | Native support (ordered) |
| GPU support | ✓ | ✓ | ✓ |
| Best for | General use | Large datasets, speed | Datasets with many categoricals |
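LightGBM grows trees leaf-wise (always splitting the leaf with the largest loss reduction) instead of level by level, which, together with histogram-based split finding, makes it fast on large datasets. CatBoost’s ordered boosting reduces target leakage when encoding categorical features.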
Gradient boosting as a re-ranker in search pipelines
A common production pattern combines embedding retrieval with a gradient boosted re-ranker:
- First stage: retrieve top-100 candidates via semantic search (SIE embeddings + vector DB)
- Feature extraction: for each (query, document) pair, compute features:
  - Embedding cosine similarity
  - BM25 score
  - Document recency
  - Source authority
  - Metadata match signals
- Re-rank: a gradient boosted model (LambdaMART/XGBoost) scores each candidate using these features
- Serve: return top-k re-ranked results
This retrieve-then-re-rank architecture is used in large-scale search systems (for example at Microsoft and LinkedIn) and can be implemented efficiently with SIE providing the embedding features.
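The feature-extraction and re-rank stages might look roughly like the sketch below. The candidate field names (`embedding`, `bm25`, `age_days`, `authority`) are hypothetical, and `placeholder_score` is a hand-written stand-in for a trained gradient boosted model's prediction function, used only so the example runs without any ML library.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def extract_features(query_embedding, candidate):
    """Build one feature vector per (query, document) pair."""
    return [
        cosine_similarity(query_embedding, candidate["embedding"]),
        candidate["bm25"],
        candidate["age_days"],
        candidate["authority"],
    ]


def rerank(query_embedding, candidates, score_fn, top_k=10):
    """Score every first-stage candidate and return the ids of the best top_k."""
    scored = [(score_fn(extract_features(query_embedding, c)), c["id"])
              for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


def placeholder_score(features):
    # Stand-in for a trained ranker; the weights here are arbitrary.
    similarity, bm25, age_days, authority = features
    return 2.0 * similarity + 0.1 * bm25 - 0.01 * age_days + 0.5 * authority


candidates = [
    {"id": "a", "embedding": [1.0, 0.0], "bm25": 2.0, "age_days": 400, "authority": 0.2},
    {"id": "b", "embedding": [0.9, 0.1], "bm25": 5.0, "age_days": 10, "authority": 0.9},
    {"id": "c", "embedding": [0.0, 1.0], "bm25": 1.0, "age_days": 5, "authority": 0.5},
]
top = rerank([1.0, 0.0], candidates, placeholder_score, top_k=2)
```

Document "a" has the highest embedding similarity but is old and low-authority, so it drops out of the top two. That is the kind of trade-off the re-ranker is meant to learn from data rather than from hand-set weights.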
Regularisation in gradient boosting
Gradient boosting can overfit, especially with many rounds. Key regularisation parameters:
- Learning rate (η): smaller = less overfitting, more rounds needed
- Max depth: limits tree complexity
- Min samples per leaf: prevents very specific splits
- Subsampling: train each tree on a random subset of data/features
- L1/L2 regularisation: penalise large leaf weights (XGBoost)
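As a rough guide, the parameters above map onto XGBoost's scikit-learn style parameter names as in the fragment below. The values are illustrative starting points only, not recommendations. Note that `min_child_weight` is a minimum sum of instance weights (Hessians) per leaf rather than a literal sample count, so it is only an approximate counterpart to "min samples per leaf".

```python
# Illustrative values only; tune them against a validation set.
xgboost_regularisation_params = {
    "learning_rate": 0.05,      # eta: smaller values need more rounds
    "n_estimators": 1000,       # upper bound on rounds, usually cut by early stopping
    "max_depth": 6,             # limits tree complexity
    "min_child_weight": 5,      # approximate analogue of min samples per leaf
    "subsample": 0.8,           # fraction of rows sampled per tree
    "colsample_bytree": 0.8,    # fraction of features sampled per tree
    "reg_alpha": 0.0,           # L1 penalty on leaf weights
    "reg_lambda": 1.0,          # L2 penalty on leaf weights
}
```

LightGBM and CatBoost expose similar controls under different names, so a dictionary like this would need translating before use with them.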
Frequently asked questions
How many boosting rounds should I use? Use early stopping: monitor validation loss and stop when it stops improving. Typically 100-1000 rounds depending on dataset size.
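The early-stopping rule can be sketched independently of any library: given the validation loss after each round, keep the best round and stop once no improvement has been seen for `patience` rounds. The function name and the `patience` parameter are assumptions for this example; real libraries provide their own early-stopping options.

```python
def best_round_with_early_stopping(validation_losses, patience=10):
    """Return (best_round_index, best_loss), stopping after `patience` stale rounds."""
    best_round, best_loss = 0, float("inf")
    for round_index, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_round, best_loss = round_index, loss
        elif round_index - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round, best_loss


losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.6]
stopped_early = best_round_with_early_stopping(losses, patience=3)
ran_to_end = best_round_with_early_stopping(losses, patience=10)
```

With `patience=3` the loop stops before it sees the late improvement at the final round, which shows the trade-off: a small patience saves training time but can stop too soon on noisy validation curves.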
Is gradient boosting better than deep learning for tabular data? Generally yes for small-to-medium structured datasets. Deep learning models (tabular transformers, MLPs) can compete on very large datasets but usually require more tuning.
What is LambdaMART? LambdaMART is a gradient boosting algorithm specialised for learning-to-rank tasks. Instead of a regression or classification loss, it uses pairwise gradients (lambdas) weighted by how much swapping two documents would change a ranking metric such as NDCG or MRR, making it well suited to re-ranking in search systems.
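For reference, NDCG for a single ranked list can be computed as in the sketch below, which uses the common exponential gain `2^rel - 1` (other gain definitions exist). The `swap_delta_ndcg` helper is a hypothetical illustration of the "change in the metric if two documents swap places" quantity that ranking-aware boosting uses as a weight. It is not a full LambdaMART implementation.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain for relevance labels in ranked order."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG normalised by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def swap_delta_ndcg(relevances, i, j):
    """Absolute change in NDCG if the documents at positions i and j swap."""
    swapped = list(relevances)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(ndcg(swapped) - ndcg(relevances))


perfect = ndcg([3, 2, 1])
reversed_order = ndcg([1, 2, 3])
```

Swapping two documents with the same relevance label leaves NDCG unchanged, so such pairs contribute nothing, while swaps near the top of the list involving very different labels contribute the most.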