What is Feature Selection?

Feature selection is the process of choosing the most informative subset of features from a dataset to use in a model and discarding the rest. By removing noisy or redundant signals it can reduce overfitting, lower computational cost, and make the model easier to interpret. The three main approaches are filter methods, wrapper methods, and embedded methods.

Why does feature selection matter?

More features are not always better. Irrelevant or redundant features add noise, increase training time, and can cause a model to overfit. Feature selection focuses the model on the signals that actually matter, improving accuracy, speed, and explainability.

In document processing and retrieval pipelines, feature selection determines which metadata fields, text statistics, or derived features are worth including in ranking models alongside embedding similarity.

What are the three main approaches?

Filter methods

Rank features independently of any model using statistical measures:

Method	What it measures	Feature type
Correlation	Linear relationship with target	Numerical
Chi-squared	Dependency between categorical feature and target	Categorical
Mutual information	Any statistical dependency with target	Both
Variance threshold	Variance of the feature itself, ignoring the target (removes near-constant features)	Both

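from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Score each feature by its mutual information with the target and keep the top 20.
# X_train and y_train are assumed to be defined, with at least 20 feature columns.
selector = SelectKBest(score_func=mutual_info_classif, k=20)
X_selected = selector.fit_transform(X_train, y_train)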

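Filter methods are fast, but because each feature is scored on its own they don’t account for feature interactions: a feature that is only informative in combination with another will score poorly.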
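The interaction blind spot is easiest to see on a toy problem. The sketch below is illustrative only: it builds a synthetic dataset where the target is the XOR of two binary features, so neither feature is informative alone. The data, the sample size, and the choice of a decision tree are all assumptions made for the demonstration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Two random binary features; the target is their XOR, so neither
# feature tells you anything about y when looked at in isolation.
a = rng.integers(0, 2, size=2000)
b = rng.integers(0, 2, size=2000)
X = np.column_stack([a, b])
y = a ^ b

# Univariate filter score: mutual information of each feature with y.
mi_scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)

# A model that sees both features together can recover the target.
joint_accuracy = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=5
).mean()

print("per-feature mutual information:", mi_scores)
print("accuracy using both features:", joint_accuracy)
```

A univariate filter would discard both features here, even though together they determine the target completely.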

Wrapper methods

Evaluate subsets of features by training a model and measuring performance:

Forward selection: start empty, add the most useful feature one at a time
Backward elimination: start with all features, remove the least useful one at a time
Recursive Feature Elimination (RFE): repeatedly trains a model and prunes the feature with the smallest coefficient or importance

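Wrapper methods often find better subsets than filter methods because they evaluate features in the context of the model, but they are computationally expensive, since each candidate subset requires retraining the model.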
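As a rough sketch, the three wrapper strategies above map onto scikit-learn's SequentialFeatureSelector and RFE. The synthetic dataset, the logistic regression estimator, and the target of 3 features are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 3 are informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)

model = LogisticRegression(max_iter=1000)

# Forward selection: start empty and add the feature that most improves
# the cross-validated score, until 3 features are selected.
forward = SequentialFeatureSelector(
    model, n_features_to_select=3, direction="forward", cv=5
).fit(X, y)

# Backward elimination: start with all features and drop the one whose
# removal hurts the cross-validated score least, until 3 remain.
backward = SequentialFeatureSelector(
    model, n_features_to_select=3, direction="backward", cv=5
).fit(X, y)

# RFE: fit the model, drop the feature with the smallest absolute
# coefficient, and repeat until 3 remain.
rfe = RFE(model, n_features_to_select=3, step=1).fit(X, y)

print("forward: ", forward.get_support(indices=True))
print("backward:", backward.get_support(indices=True))
print("RFE:     ", rfe.get_support(indices=True))
```

The three selectors need not agree. Forward and backward selection score each candidate subset by cross-validation, while RFE relies on the fitted model's coefficients, which is cheaper but sensitive to feature scaling.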

Embedded methods

Feature selection happens as part of model training:

L1 regularisation (Lasso): shrinks irrelevant feature weights to zero
Tree-based importance: random forests and gradient boosted trees rank features by their contribution to splits
Elastic net: combines L1 and L2 regularisation

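from sklearn.ensemble import RandomForestClassifier

# Fit a random forest; X_train and y_train are assumed to be defined already.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Impurity-based importance per feature, in column order, normalised to sum to 1.
importances = rf.feature_importances_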

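Embedded methods are efficient, since selection comes out of a single training run. Tree-based models also account for feature interactions; linear models with L1 regularisation do so only if interaction terms are added as features.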
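For the L1 case, a minimal sketch using Lasso with scikit-learn's SelectFromModel might look like the following. The synthetic regression data and the value of alpha are assumptions for illustration; in practice alpha would be tuned, for example by cross-validation.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 30 features, of which 5 actually drive the target.
X, y = make_regression(
    n_samples=300, n_features=30, n_informative=5,
    noise=10.0, random_state=0,
)

# L1 penalties are scale-sensitive, so standardise the features first.
X = StandardScaler().fit_transform(X)

# alpha controls the strength of the L1 penalty (arbitrary value here).
selector = SelectFromModel(Lasso(alpha=1.0))
X_selected = selector.fit_transform(X, y)

coefs = selector.estimator_.coef_
n_zero = int((coefs == 0).sum())
print(f"{n_zero} of {coefs.size} weights shrunk to exactly zero")
print("selected feature indices:", selector.get_support(indices=True))
```

A larger alpha zeroes out more weights, so the penalty strength doubles as a control on how many features survive.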

Filter vs wrapper vs embedded methods

	Filter	Wrapper	Embedded
Model-dependent	✗	✓	✓
Computational cost	Low	High	Medium
Accounts for interactions	✗	✓	✓
Best for	Initial exploration	Small feature sets	Large datasets

Feature selection in retrieval and re-ranking

When building a re-ranking model on top of embedding retrieval, feature selection determines which signals to include alongside cosine similarity:

Likely informative features:

Embedding similarity score
Document recency (days since publication)
Source authority (domain trust score)
Query-document metadata match (same category, same language)
Document length (as a quality signal)

Likely uninformative:

Raw document ID
File creation timestamp (as opposed to publication date)
Formatting metadata irrelevant to content
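Whether a candidate signal belongs in the first list or the second is ultimately an empirical question. The sketch below shows one way to check, using mutual information as a filter score. Everything in it is synthetic and hypothetical: the feature names, their distributions, and the relevance label, which is constructed so that similarity and recency matter and the document ID does not. On a real system the label would come from logged relevance judgements.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 1000

# Hypothetical re-ranking features; all values are synthetic.
similarity = rng.uniform(0.0, 1.0, size=n)      # embedding cosine similarity
recency_days = rng.uniform(0.0, 365.0, size=n)  # days since publication
doc_id = rng.permutation(n).astype(float)       # arbitrary identifier

# Synthetic relevance label, driven by similarity and recency but not by ID.
score = 3.0 * similarity - recency_days / 365.0 + rng.normal(0.0, 0.3, size=n)
relevant = (score > np.median(score)).astype(int)

X = np.column_stack([similarity, recency_days, doc_id])
names = ["similarity", "recency_days", "doc_id"]

# Score each candidate feature against the relevance label.
mi = mutual_info_classif(X, relevant, random_state=0)
ranking = sorted(zip(names, mi), key=lambda pair: pair[1], reverse=True)

for name, value in ranking:
    print(f"{name}: {value:.3f}")
```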

Frequently asked questions

What’s the difference between feature selection and feature extraction? Feature selection picks from existing features. Feature extraction creates new representations (e.g. PCA components, embedding vectors). Embedding models perform feature extraction.

Can I use multiple selection methods together? Yes. A common pipeline is: variance threshold (remove constant features) → correlation filter (remove redundant features) → RFE or embedded selection (final selection). Each step removes a different kind of unhelpful feature.
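One possible sketch of that three-stage pipeline is shown below. scikit-learn has no built-in correlation filter, so CorrelationFilter is a hypothetical helper written for this example; the 0.95 threshold, the synthetic data, and the choice of 4 final features are likewise assumptions.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class CorrelationFilter(TransformerMixin, BaseEstimator):
    """Hypothetical helper: drop any feature whose absolute Pearson
    correlation with an already-kept feature exceeds `threshold`."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold

    def fit(self, X, y=None):
        corr = np.abs(np.corrcoef(X, rowvar=False))
        keep = []
        for j in range(corr.shape[0]):
            if all(corr[j, k] <= self.threshold for k in keep):
                keep.append(j)
        self.keep_ = np.array(keep)
        return self

    def transform(self, X):
        return X[:, self.keep_]


# Synthetic data: 8 features, plus a constant column and a near-duplicate
# of column 0 appended to give the first two stages something to remove.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_redundant=0, random_state=0,
)
rng = np.random.default_rng(0)
near_duplicate = X[:, 0] + rng.normal(0.0, 0.01, size=len(X))
X = np.column_stack([X, np.ones(len(X)), near_duplicate])

pipeline = Pipeline([
    ("variance", VarianceThreshold(threshold=0.0)),      # drops the constant
    ("correlation", CorrelationFilter(threshold=0.95)),  # drops the duplicate
    ("rfe", RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)),
])
X_final = pipeline.fit_transform(X, y)

print("columns in:", X.shape[1], "columns out:", X_final.shape[1])
```

Ordering the cheap filters first means the expensive RFE stage only has to search over features that survived them.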

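How do you evaluate whether feature selection improved the model? Compare cross-validated accuracy, F1, or recall on the selected feature set against the full feature set. Fit the selector inside each cross-validation fold, so that held-out data does not influence which features are chosen. Also compare inference speed and model size.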
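A minimal sketch of such a comparison follows. The synthetic dataset, the f_classif scoring function, and k=10 are assumptions for illustration. Placing the selector inside a pipeline means it is refit on the training portion of every fold, which keeps the held-out data out of the selection step.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data: 50 features, of which 5 are informative.
X, y = make_classification(
    n_samples=400, n_features=50, n_informative=5,
    n_redundant=0, random_state=0,
)

# Baseline: the model trained on every feature.
full_model = LogisticRegression(max_iter=1000)

# Candidate: selection and model in one pipeline, so the selector is
# refit on the training split of each cross-validation fold.
selected_model = make_pipeline(
    SelectKBest(score_func=f_classif, k=10),
    LogisticRegression(max_iter=1000),
)

full_f1 = cross_val_score(full_model, X, y, cv=5, scoring="f1")
selected_f1 = cross_val_score(selected_model, X, y, cv=5, scoring="f1")

print(f"all 50 features: F1 = {full_f1.mean():.3f} +/- {full_f1.std():.3f}")
print(f"top 10 features: F1 = {selected_f1.mean():.3f} +/- {selected_f1.std():.3f}")
```

If the two scores are within the fold-to-fold spread of each other, the smaller feature set is usually the better choice on speed and model size alone.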