---
title: Decision Trees
description: An in-depth guide to decision forests in machine learning, covering random forests and gradient boosted trees. Learn about tree construction, ensemble methods, hyperparameter tuning, and real-world applications in finance, healthcare, and marketing. Perfect for data scientists and ML engineers working with tabular data and seeking interpretable, high-performance models.
canonical_url: https://superlinked.com/glossary/decision-trees-and-forests
last_updated: 2026-06-11
---

# What are Decision Trees and Random Forests?

A decision tree is a supervised learning model that splits data into branches based on feature thresholds, arriving at a prediction at each leaf node. A random forest is an ensemble of many decision trees trained on random subsets of data and features, whose predictions are combined (by majority vote for classification, by averaging for regression) to produce a more accurate and robust result.

---

## Why do decision trees and forests matter?

Tree-based models are the dominant approach for tabular data. They're fast to train, interpretable (for single trees), handle numeric and categorical features with little preprocessing, and don't require feature scaling. Some implementations, scikit-learn's among them, still need categorical columns encoded as numbers first. Random forests and gradient boosted trees (XGBoost, LightGBM) are consistently among the strongest performers on structured datasets.

In inference pipelines, tree models often appear as lightweight re-ranking or routing layers on top of embedding-based retrieval.

---

## How does a decision tree work?

A decision tree recursively splits the data by finding the feature and threshold that best separates the target classes:

```
Is document_length > 500 words?
├── Yes → Is contains_table = True?
│         ├── Yes → Category: Technical Document
│         └── No  → Category: Report
└── No  → Category: Short Form
```
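
As a rough illustration, a tree like the one above can be learned from data with scikit-learn. The eight training rows below are made up to follow the diagram's rules, so the learned threshold is the midpoint between observed lengths rather than exactly 500.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up rows: [document_length, contains_table]
X = np.array([
    [120, 0], [300, 1], [450, 0],      # short documents
    [700, 1], [1200, 1],               # long, with a table
    [650, 0], [900, 0], [2000, 0],     # long, without a table
])
y = ["Short Form", "Short Form", "Short Form",
     "Technical Document", "Technical Document",
     "Report", "Report", "Report"]

features = ["document_length", "contains_table"]
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Print the learned rules as indented text
print(export_text(tree, feature_names=features))
```

`export_text` prints one line per split and leaf, which makes it easy to compare the learned rules with a hand-drawn diagram.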

The splitting criterion is typically:
- **Gini impurity** (for classification): measures class mixing in a node
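- **Information gain / entropy** (for classification): measures the reduction in uncertainty about the class after a split
- **Mean squared error (MSE)** (for regression): favours the split that most reduces the variance of the target within the child nodes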

Trees grow until a stopping criterion is met: max depth, min samples per leaf, or no further improvement.
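
The split search for a single feature can be sketched in a few lines of NumPy. This is a simplified illustration, not how any particular library implements it; the function names `gini` and `best_threshold` and the toy data are assumptions made for the example.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y):
    """Try midpoints between sorted unique values of one feature and return
    the (threshold, weighted child impurity) pair with the lowest impurity."""
    values = np.unique(x)
    best = (None, gini(y))  # baseline: impurity of the unsplit node
    for lo, hi in zip(values[:-1], values[1:]):
        t = (lo + hi) / 2
        left, right = y[x <= t], y[x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data (made up): document length in words and a binary label
doc_length = np.array([120, 300, 450, 520, 800, 1500])
is_long_form = np.array([0, 0, 0, 1, 1, 1])

threshold, impurity = best_threshold(doc_length, is_long_form)
print(threshold, impurity)
```

A full tree builder would repeat this search over every feature, split on the best one, and recurse into each child until a stopping criterion applies.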

---

## What is a random forest?

A random forest builds many decision trees, each trained on:
1. A **bootstrap sample** of the training data (random rows with replacement)
2. A **random subset of features** at each split

Predictions are aggregated by majority vote (classification) or mean (regression). The randomness reduces correlation between trees, so their individual errors partly cancel when combined, which gives much lower variance than a single tree.

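```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; swap in your own feature matrix and labels
X_train, y_train = make_classification(n_samples=1000, random_state=42)

rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train, y_train)  # trains 100 trees, each capped at depth 10
```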
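
To make the two sources of randomness concrete, here is a hand-rolled sketch that bags plain decision trees. It assumes a binary problem on synthetic data and an odd number of trees (so a vote cannot tie); in practice `RandomForestClassifier` does all of this internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(25):
    # 1. Bootstrap sample: draw row indices with replacement
    rows = rng.integers(0, len(X_train), size=len(X_train))
    # 2. max_features="sqrt" makes each split consider a random feature subset
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_train[rows], y_train[rows]))

# Majority vote over the 0/1 predictions of all trees
votes = np.stack([t.predict(X_test) for t in trees])  # shape: (n_trees, n_test)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

forest_acc = (forest_pred == y_test).mean()
single_acc = (trees[0].predict(X_test) == y_test).mean()
print(f"single tree: {single_acc:.3f}, hand-rolled forest: {forest_acc:.3f}")
```

Note that scikit-learn's forest classifier combines trees by averaging their predicted class probabilities rather than counting hard votes, so its output can differ slightly from this sketch.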

---

## Decision tree vs random forest vs gradient boosting

| Property | Decision Tree | Random Forest | Gradient Boosting |
|---|---|---|---|
| Variance | High | Low | Low |
| Bias | Low | Low | Low |
| Speed (train) | Fast | Medium | Slower |
| Speed (inference) | Very fast | Fast | Fast |
| Interpretability | High | Medium | Low |
| Best for | Prototyping, rules | General tabular | High-accuracy tabular |
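
One way to check these trade-offs on a given dataset is to cross-validate all three model families side by side. The sketch below uses synthetic data and default hyperparameters, so the numbers it prints say nothing general about which model is better.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # accuracy on each of 5 folds
    results[name] = (scores.mean(), scores.std())
    print(f"{name:>18}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The spread across folds gives a rough feel for variance: a single unconstrained tree usually shows a wider spread than either ensemble.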

---

## Feature importance in tree models

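Random forests provide feature importance scores: for each feature, the reduction in impurity it contributes, averaged across all trees. This is a useful signal for feature selection in preprocessing pipelines, though impurity-based scores tend to favour high-cardinality features, so it is worth cross-checking them on held-out data.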

```python
importances = rf.feature_importances_
# Use to identify which document metadata fields are most predictive
```
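
A fuller sketch ranks features by impurity-based importance and compares that with permutation importance measured on held-out data. The data is synthetic and the metadata field names are hypothetical, chosen only to make the output readable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical metadata field names attached to synthetic columns
feature_names = ["doc_length", "num_tables", "num_links", "age_days", "num_images"]
X, y = make_classification(n_samples=800, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Impurity-based importances: derived from the training data, normalised to sum to 1
mdi = rf.feature_importances_
# Permutation importances: drop in held-out accuracy when one column is shuffled
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)

for i in np.argsort(mdi)[::-1]:
    print(f"{feature_names[i]:>12}  impurity={mdi[i]:.3f}  "
          f"permutation={perm.importances_mean[i]:.3f}")
```

Features that rank high on both measures are the safer candidates to keep; a feature that scores well only on the impurity measure deserves a closer look.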

---

## How do tree models complement embedding-based retrieval?

In search and RAG systems, tree models are often used for:

- **Query routing**: classify query intent to route to different retrieval strategies
- **Result re-ranking**: use tabular features (recency, source authority, metadata) alongside embedding similarity scores in a lightweight ranker
- **Document classification**: pre-classify documents before indexing to enable filtered retrieval

Embedding models (self-hosted via SIE) handle semantic understanding; tree models handle structured metadata signals.
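
As an illustration of the re-ranking pattern, the sketch below trains a small forest on per-candidate features and uses its predicted relevance probability to reorder retrieved results. Everything here is assumed for the example: the feature names, the synthetic relevance labels, and the five candidates.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
# Hypothetical per-(query, document) features
similarity = rng.uniform(0, 1, n)    # embedding similarity from the retriever
age_days = rng.integers(0, 365, n)   # document recency
authority = rng.uniform(0, 1, n)     # source authority score
X = np.column_stack([similarity, age_days, authority])

# Synthetic relevance labels that depend on all three signals plus noise
score = 2.0 * similarity + 1.0 * authority - age_days / 365
y = (score + rng.normal(0, 0.3, n) > 1.5).astype(int)

ranker = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=0).fit(X, y)

# Re-rank five candidates returned for one query: [similarity, age_days, authority]
candidates = np.array([
    [0.91, 300, 0.20],
    [0.85, 5, 0.90],
    [0.80, 30, 0.70],
    [0.60, 2, 0.95],
    [0.40, 100, 0.10],
])
relevance = ranker.predict_proba(candidates)[:, 1]
order = np.argsort(-relevance)  # candidate indices, most relevant first
print(order, relevance[order].round(2))
```

In a real system the labels would come from click logs or human judgments, and a dedicated learning-to-rank objective may fit better than plain classification.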

---

## Frequently asked questions

**Why does a single decision tree overfit?**
Trees grown without constraints will memorise training data (fitting each leaf to individual examples). Regularisation via max depth, min samples per leaf, and pruning controls this.
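
The effect is easy to reproduce on noisy synthetic data. In this sketch, `flip_y=0.1` randomly reassigns the labels of roughly 10% of the rows, and the specific limits (`max_depth=4`, `min_samples_leaf=10`) are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, which an unconstrained tree will memorise
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unconstrained = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
constrained = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                                     random_state=0).fit(X_train, y_train)

for name, tree in [("unconstrained", unconstrained), ("constrained", constrained)]:
    print(f"{name:>13}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
          f"train={tree.score(X_train, y_train):.3f}, "
          f"test={tree.score(X_test, y_test):.3f}")
```

The unconstrained tree fits the training set perfectly, noise included; the gap between its training and test accuracy is the overfitting.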

**How many trees should a random forest have?**
Performance typically plateaus around 100-500 trees. More trees increase compute cost without proportionate accuracy gains. Use cross-validation to find the sweet spot.
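
A simple way to look for the plateau is to cross-validate a few forest sizes. The grid of sizes and the synthetic dataset below are assumptions for the sketch; where the curve flattens depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

cv_means = {}
for n_trees in [10, 50, 100, 200]:
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    cv_means[n_trees] = cross_val_score(rf, X, y, cv=5).mean()  # mean accuracy over 5 folds
    print(f"{n_trees:>4} trees: {cv_means[n_trees]:.3f}")
```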

**What is out-of-bag (OOB) error?**
Because each tree is trained on a bootstrap sample, it sees only about 63% of the distinct training rows. The remaining ~37% (the out-of-bag samples for that tree) can be used as a validation set for free, with no separate hold-out needed.
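
The ~63% figure follows from sampling n rows with replacement: the chance that a given row is picked at least once is 1 - (1 - 1/n)^n, which approaches 1 - 1/e. The sketch below checks that number and then reads the OOB accuracy from scikit-learn, again on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Expected fraction of distinct rows in one bootstrap sample of size n
n = 1000
in_bag_fraction = 1 - (1 - 1 / n) ** n
print(f"expected in-bag fraction: {in_bag_fraction:.3f}")

X, y = make_classification(n_samples=n, n_features=20, n_informative=5, random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)

# Accuracy where each row is predicted only by trees that did not train on it
print(f"OOB accuracy: {rf.oob_score_:.3f}")
```

Setting `oob_score=True` costs little extra at training time, which makes it a convenient first estimate before running a full cross-validation.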

---

## Related resources

- [What is gradient boosting?](/glossary/gradient-boosting-and-adaptive-boosting)
- [What is feature selection?](/glossary/feature-selection)
- [What is feature engineering?](/glossary/feature-engineering)
- [What is classification?](/glossary/binary-and-multi-class-classification)
- [What is RAG?](/glossary/what-is-rag)
