What is Feature Selection?
Feature selection is the process of choosing the most informative subset of features from a dataset to use in a model, and discarding the rest. It reduces overfitting, lowers computational cost, improves model interpretability, and can improve generalisation by removing noisy or redundant signals. The three main approaches are filter methods, wrapper methods, and embedded methods.
Why does feature selection matter?
More features are not always better. Irrelevant or redundant features add noise, increase training time, and can cause a model to overfit. Feature selection focuses the model on the signals that actually matter, which can improve accuracy, speed, and explainability.
In document processing and retrieval pipelines, feature selection determines which metadata fields, text statistics, or derived features are worth including in ranking models alongside embedding similarity.
What are the three main approaches?
Filter methods
Rank features independently of any model using statistical measures:
| Method | Measures | Type |
|---|---|---|
| Correlation | Linear relationship with target | Numerical |
| Chi-squared | Dependency between categorical feature and target | Categorical |
| Mutual information | Any statistical dependency | Both |
| Variance threshold | Removes near-constant features | Both |
```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Keep the 20 features that share the most mutual information with the target.
selector = SelectKBest(score_func=mutual_info_classif, k=20)
X_selected = selector.fit_transform(X_train, y_train)
```
Filter methods are fast but don’t account for feature interactions.
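The interaction blind spot can be illustrated with a small synthetic case. The sketch below assumes a target that is the XOR of two binary features; the data and column layout are made up for illustration. Each feature is scored on its own, as a filter method would do.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Synthetic data: the target is the XOR of two binary features, so neither
# feature is informative alone but the pair determines the target exactly.
x1 = np.array([0, 0, 1, 1] * 50)
x2 = np.array([0, 1, 0, 1] * 50)
y = x1 ^ x2

# The third column encodes the interaction explicitly, for comparison.
X = np.column_stack([x1, x2, x1 ^ x2])

# discrete_features=True scores each column from its contingency table with y
# instead of using the nearest-neighbour estimator for continuous data.
scores = mutual_info_classif(X, y, discrete_features=True)
print(scores)
```

The two raw features each score approximately zero even though together they determine the target, while the explicit interaction column scores about ln 2 ≈ 0.693 nats. A filter that ranks features one at a time would discard both raw features here.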
Wrapper methods
Evaluate subsets of features by training a model and measuring performance:
- Forward selection: start empty, add features one at a time
- Backward elimination: start with all features, remove least useful
- Recursive Feature Elimination (RFE): repeatedly trains a model and prunes the weakest feature
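Wrapper methods often find better-performing subsets than filter methods because they score features in the context of the actual model, but they are computationally expensive, since every candidate subset requires retraining the model.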
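As a rough sketch, forward selection and RFE can both be run with scikit-learn. The synthetic dataset, the choice of logistic regression as the wrapped model, and the target of 3 features are assumptions made for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only: 10 features, 3 of them informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

model = LogisticRegression(max_iter=1000)

# Forward selection: start empty and greedily add the feature that most
# improves the cross-validated score, until 3 features are selected.
forward = SequentialFeatureSelector(
    model, n_features_to_select=3, direction="forward", cv=5
)
forward.fit(X, y)

# Recursive Feature Elimination: fit the model, drop the feature with the
# lowest importance (derived from coefficient magnitude here), and repeat
# until 3 features remain.
rfe = RFE(model, n_features_to_select=3, step=1)
rfe.fit(X, y)

print("forward selection keeps:", forward.get_support(indices=True))
print("RFE keeps:", rfe.get_support(indices=True))
```

Setting `direction="backward"` on `SequentialFeatureSelector` gives backward elimination instead. The two selectors need not agree, since forward selection scores subsets by cross-validation while RFE ranks features by the fitted model's coefficients.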
Embedded methods
Feature selection happens as part of model training:
- L1 regularisation (Lasso): shrinks irrelevant feature weights to zero
- Tree-based importance: random forests and gradient boosted trees rank features by their contribution to splits
- Elastic net: combines L1 and L2 regularisation
```python
from sklearn.ensemble import RandomForestClassifier

# Fit a forest, then read the impurity-based importance of each feature.
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
importances = rf.feature_importances_
```
Embedded methods are efficient and account for feature interactions within the model.
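L1-based selection can be sketched in a similar way. The example below assumes a synthetic regression problem and an arbitrary penalty strength (`alpha=1.0`); in practice the penalty would be tuned, for instance with `LassoCV`.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration: 20 features, only 5 drive the target.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0
)

# L1 penalties are scale-sensitive, so standardise the features first.
X_scaled = StandardScaler().fit_transform(X)

# Fit a Lasso inside SelectFromModel; features whose coefficient is shrunk
# to (effectively) zero are dropped by transform().
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X_scaled, y)
coef = selector.estimator_.coef_
X_selected = selector.transform(X_scaled)

print("non-zero coefficients:", int((coef != 0).sum()))
print("selected shape:", X_selected.shape)
```

A larger `alpha` zeroes out more coefficients and keeps fewer features; a smaller one keeps more.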
Filter vs wrapper vs embedded methods
| | Filter | Wrapper | Embedded |
|---|---|---|---|
| Model-dependent | ✗ | ✓ | ✓ |
| Computational cost | Low | High | Medium |
| Accounts for interactions | ✗ | ✓ | ✓ |
| Best for | Initial exploration | Small feature sets | Large datasets |
Feature selection in retrieval and re-ranking
When building a re-ranking model on top of embedding retrieval, feature selection determines which signals to include alongside cosine similarity:
Likely informative features:
- Embedding similarity score
- Document recency (days since publication)
- Source authority (domain trust score)
- Query-document metadata match (same category, same language)
- Document length (as a quality signal)
Likely uninformative:
- Raw document ID
- File creation timestamp (vs publication date)
- Formatting metadata irrelevant to content
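One way to check such expectations is to score candidate features against relevance labels. The sketch below uses entirely synthetic data: the feature names, the value ranges, and the rule that generates the labels are all assumptions made for illustration, so it shows the mechanics only and says nothing about any real corpus.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000

# Hypothetical candidate features for (query, document) pairs; all values are
# synthetic and the names are illustrative, not a fixed schema.
features = {
    "embedding_similarity": rng.uniform(0, 1, n),
    "days_since_publication": rng.integers(0, 365, n).astype(float),
    "same_language": rng.integers(0, 2, n).astype(float),
    "document_id": rng.permutation(n).astype(float),
}
X = np.column_stack(list(features.values()))

# Synthetic relevance label: driven by similarity, recency and language match,
# and by construction independent of document_id.
score = (
    3.0 * features["embedding_similarity"]
    - 1.5 * features["days_since_publication"] / 365
    + 1.0 * features["same_language"]
    + rng.normal(0, 0.3, n)
)
y = (score > np.median(score)).astype(int)

# Estimate how much each feature tells us about the relevance label.
mi = mutual_info_classif(X, y, random_state=0)
for name, value in sorted(zip(features, mi), key=lambda p: -p[1]):
    print(f"{name:>24}: {value:.3f}")
```

With real data, the labels would come from click logs or human relevance judgements, and features scoring near zero would be candidates for removal.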
Frequently asked questions
What’s the difference between feature selection and feature extraction? Feature selection picks from existing features. Feature extraction creates new representations (e.g. PCA components, embedding vectors). Embedding models perform feature extraction.
Can I use multiple selection methods together? Yes. A common pipeline is: variance threshold (remove constants) → correlation filter (remove redundant) → RFE or embedded selection (final selection). Each step removes different types of irrelevant features.
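A minimal sketch of that three-step pipeline follows. The `drop_correlated` helper is hypothetical (scikit-learn has no built-in correlation filter), and the dataset, the 0.95 threshold, and the final count of 4 features are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.linear_model import LogisticRegression


def drop_correlated(X, threshold=0.95):
    """Return indices of columns to keep, dropping the later column of any
    pair whose absolute Pearson correlation exceeds `threshold`."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return np.array(keep)


# Synthetic data plus two deliberately useless columns.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0, random_state=0
)
constant = np.ones((X.shape[0], 1))      # zero variance
duplicate = X[:, [0]] * 2.0              # perfectly correlated with column 0
X = np.hstack([X, constant, duplicate])  # 10 columns in total

# Step 1: remove constant features.
X1 = VarianceThreshold(threshold=0.0).fit_transform(X)

# Step 2: remove redundant (highly correlated) features.
X2 = X1[:, drop_correlated(X1)]

# Step 3: final model-based selection with RFE.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X2, y)
X3 = rfe.transform(X2)

print(X.shape[1], "->", X1.shape[1], "->", X2.shape[1], "->", X3.shape[1])
```

The constant column is removed in step 1 and the scaled duplicate in step 2, so the expensive RFE step only has to search the remaining columns.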
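How do you evaluate whether feature selection improved the model? Compare cross-validated accuracy, F1, or recall on the selected feature set against the full feature set, and run the selection step inside each cross-validation fold so that held-out data never influences which features are chosen. Also compare inference speed and model size.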
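One way to set up such a comparison is sketched below, assuming a synthetic dataset, a logistic regression model, and `k=10` as an arbitrary number of features to keep.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data for illustration: 50 features, 5 informative.
X, y = make_classification(
    n_samples=400, n_features=50, n_informative=5, n_redundant=0, random_state=0
)

full_model = LogisticRegression(max_iter=1000)

# Putting the selector inside the pipeline means it is re-fitted on the
# training part of every fold, so the held-out part of each fold plays no
# role in choosing the features.
selected_model = make_pipeline(
    SelectKBest(score_func=mutual_info_classif, k=10),
    LogisticRegression(max_iter=1000),
)

full_scores = cross_val_score(full_model, X, y, cv=5, scoring="f1")
selected_scores = cross_val_score(selected_model, X, y, cv=5, scoring="f1")

print(f"all features: F1 = {full_scores.mean():.3f} +/- {full_scores.std():.3f}")
print(f"10 selected:  F1 = {selected_scores.mean():.3f} +/- {selected_scores.std():.3f}")
```

Selecting features on the full dataset first and cross-validating afterwards would leak information from the held-out folds and tend to overstate the benefit of selection.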