
What is Feature Selection?

Feature selection is the process of choosing the most informative subset of features in a dataset for use in a model and discarding the rest. By removing noisy or redundant signals, it can reduce overfitting, lower computational cost, and make the model easier to interpret. The three main approaches are filter methods, wrapper methods, and embedded methods.


Why does feature selection matter?

More features are not always better. Irrelevant or redundant features add noise, increase training time, and can cause a model to overfit. Feature selection focuses the model on the signals that actually matter, improving accuracy, speed, and explainability.

In document processing and retrieval pipelines, feature selection determines which metadata fields, text statistics, or derived features are worth including in ranking models alongside embedding similarity.


What are the three main approaches?

Filter methods

Rank features independently of any model using statistical measures:

Method             | Measures                                          | Type
Correlation        | Linear relationship with target                   | Numerical
Chi-squared        | Dependency between categorical feature and target | Categorical
Mutual information | Any statistical dependency                        | Both
Variance threshold | Removes near-constant features                    | Both

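from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Score each feature by mutual information with the target and keep the top 20
# (X_train and y_train are assumed to be an existing feature matrix and label vector)
selector = SelectKBest(score_func=mutual_info_classif, k=20)
X_selected = selector.fit_transform(X_train, y_train)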

Filter methods are fast, but because each feature is scored on its own, they don’t account for feature interactions.
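
The interaction blind spot can be illustrated with a small synthetic case. The sketch below assumes a target that is the XOR of two binary features, so neither feature carries any information on its own while the pair determines the target completely. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Synthetic data: every combination of two binary features, repeated 50 times.
base = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
X = np.tile(base, (50, 1))
y = X[:, 0] ^ X[:, 1]  # target is the XOR of the two features

# Score each feature on its own, as a filter method would.
individual_scores = mutual_info_classif(X, y, discrete_features=True)

# Score the two features jointly by encoding the pair as a single value.
joint = (2 * X[:, 0] + X[:, 1]).reshape(-1, 1)
joint_score = mutual_info_classif(joint, y, discrete_features=True)[0]

print("individual:", individual_scores)
print("joint:", joint_score)
```

Each feature scores zero individually, while the joint encoding scores ln 2 (about 0.69 nats), the full entropy of the target. A filter that ranks features one at a time would discard both.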

Wrapper methods

Evaluate subsets of features by training a model and measuring performance:

  • Forward selection: start empty, add features one at a time
  • Backward elimination: start with all features, remove least useful
  • Recursive Feature Elimination (RFE): repeatedly trains a model and prunes the weakest feature

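Wrapper methods often find better-performing subsets than filter methods because they evaluate features in the context of the model, but they are computationally expensive, since each iteration requires retraining the model.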
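
As a rough sketch of two of the wrapper approaches listed above, the example below runs RFE and forward selection with scikit-learn. The synthetic dataset, the logistic regression estimator, and the choice of three features are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration: 10 features, 3 of them informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
model = LogisticRegression(max_iter=1000)

# Recursive Feature Elimination: fit, drop the weakest feature, repeat.
rfe = RFE(estimator=model, n_features_to_select=3, step=1).fit(X, y)
print("RFE keeps columns:", rfe.get_support(indices=True))

# Forward selection: start empty and add the feature that most improves
# the cross-validated score, until 3 features are selected.
forward = SequentialFeatureSelector(
    model, n_features_to_select=3, direction="forward", cv=3
).fit(X, y)
print("Forward selection keeps columns:", forward.get_support(indices=True))
```

Setting `direction="backward"` on `SequentialFeatureSelector` gives backward elimination instead. Both selectors refit the model many times, which is where the cost comes from.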

Embedded methods

Feature selection happens as part of model training:

  • L1 regularisation (Lasso): shrinks irrelevant feature weights to zero
  • Tree-based importance: random forests and gradient boosted trees rank features by their contribution to splits
  • Elastic net: combines L1 and L2 regularisation
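
from sklearn.ensemble import RandomForestClassifier

# Fit a forest, then read the impurity-based importance of each feature
# (X_train and y_train are assumed to be defined already)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
importances = rf.feature_importances_  # one score per column, summing to 1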

Embedded methods are efficient and account for feature interactions within the model.
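
L1 regularisation can be sketched in the same way. The example below assumes synthetic regression data in which only the first two of eight features affect the target. The penalty strength `alpha=0.2` is an arbitrary choice for this data, not a recommended default.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic regression data: only the first two of eight features affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# The L1 penalty (alpha) pushes the weights of uninformative features to exactly zero.
lasso = Lasso(alpha=0.2).fit(X, y)
selected = np.flatnonzero(lasso.coef_)

print("coefficients:", np.round(lasso.coef_, 2))
print("selected columns:", selected)
```

The features with non-zero coefficients are the selected set. In practice `alpha` is usually tuned by cross-validation, for example with `LassoCV`.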


Filter vs wrapper vs embedded methods

                          | Filter              | Wrapper            | Embedded
Model-dependent           | No                  | Yes                | Yes
Computational cost        | Low                 | High               | Medium
Accounts for interactions | No                  | Yes                | Yes
Best for                  | Initial exploration | Small feature sets | Large datasets

Feature selection in retrieval and re-ranking

When building a re-ranking model on top of embedding retrieval, feature selection determines which signals to include alongside cosine similarity:

Likely informative features:

  • Embedding similarity score
  • Document recency (days since publication)
  • Source authority (domain trust score)
  • Query-document metadata match (same category, same language)
  • Document length (as a quality signal)

Likely uninformative:

  • Raw document ID
  • File creation timestamp (vs publication date)
  • Formatting metadata irrelevant to content
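
Whether a given signal is informative is ultimately an empirical question. The sketch below shows one way to check it, using entirely synthetic data: the feature names, the value ranges, and the rule that generates the relevance labels are all invented for illustration and do not come from a real system.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

# Hypothetical re-ranking features for (query, document) pairs; all values are synthetic.
feature_names = [
    "embedding_similarity",
    "days_since_publication",
    "source_authority",
    "raw_document_id",
]
X = np.column_stack([
    rng.uniform(0, 1, n),      # embedding similarity score
    rng.integers(0, 365, n),   # document recency in days
    rng.uniform(0, 1, n),      # source authority score
    rng.permutation(n),        # raw document ID, unrelated to relevance
])

# Assumed relevance rule, invented for this sketch:
# similarity dominates, authority and recency contribute less.
score = 0.6 * X[:, 0] + 0.25 * X[:, 2] - 0.15 * X[:, 1] / 365 + rng.normal(0, 0.05, n)
y = (score > np.median(score)).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = dict(zip(feature_names, rf.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
```

On real data the labels would come from relevance judgments or click logs. Impurity-based importances can overstate high-cardinality features such as IDs, so a cross-check with permutation importance on held-out data is worth considering.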

Frequently asked questions

What’s the difference between feature selection and feature extraction? Feature selection picks from existing features. Feature extraction creates new representations (e.g. PCA components, embedding vectors). Embedding models perform feature extraction.
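
The distinction can be seen side by side in a minimal sketch. It assumes the Iris dataset bundled with scikit-learn and uses `SelectKBest` and `PCA` as stand-ins for selection and extraction respectively.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Feature selection: keeps 2 of the 4 original columns unchanged.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
X_selected = selector.transform(X)
kept = selector.get_support(indices=True)
print("selected original columns:", kept)

# Feature extraction: builds 2 new columns, each a linear combination of all 4 originals.
pca = PCA(n_components=2).fit(X)
X_extracted = pca.transform(X)
print("PCA component weights:\n", np.round(pca.components_, 2))
```

The selected matrix is a column subset of the input, so each column keeps its original meaning. The PCA output has the same shape but its columns are new quantities.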

Can I use multiple selection methods together? Yes. A common pipeline is: variance threshold (remove constants) → correlation filter (remove redundant) → RFE or embedded selection (final selection). Each step removes different types of irrelevant features.
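
A minimal sketch of that three-step pipeline follows. scikit-learn has no built-in correlation filter, so `uncorrelated_columns` is a hypothetical helper written for this example. The synthetic data, the 0.95 threshold, and the final feature count are likewise assumptions.

```python
import numpy as np
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.linear_model import LogisticRegression


def uncorrelated_columns(X, threshold=0.95):
    """Hypothetical helper: keep a column only if its absolute correlation
    with every column kept so far is below the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return np.array(keep)


# Synthetic data: 8 independent features, plus a constant column
# and a rescaled copy of column 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = (X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=300) > 0).astype(int)
X = np.hstack([X, np.ones((300, 1)), 2.0 * X[:, [0]]])

# Step 1: variance threshold removes the constant column.
X_step1 = VarianceThreshold(threshold=0.0).fit_transform(X)

# Step 2: correlation filter removes the redundant copy.
X_step2 = X_step1[:, uncorrelated_columns(X_step1)]

# Step 3: RFE makes the final selection.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X_step2, y)
X_final = rfe.transform(X_step2)

print(X.shape[1], "->", X_step1.shape[1], "->", X_step2.shape[1], "->", X_final.shape[1])
```

The cheap steps run first so that the expensive RFE step sees fewer columns. For real use, each step should be fitted on training data only.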

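How do you evaluate whether feature selection improved the model? Compare cross-validated accuracy, F1, or recall on the selected feature set against the full feature set, fitting the selection step inside each training fold so that held-out data does not influence which features are kept. Also compare inference speed and model size.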
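
One possible way to set up that comparison is sketched below. The synthetic dataset, the logistic regression model, the F1 metric, and `k=5` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data for illustration: 30 features, 5 informative.
X, y = make_classification(
    n_samples=400, n_features=30, n_informative=5, n_redundant=0, random_state=0
)

full_model = LogisticRegression(max_iter=1000)
# Putting the selector inside the pipeline refits it on each training fold,
# so the held-out fold never influences which features are chosen.
selected_model = make_pipeline(
    SelectKBest(score_func=f_classif, k=5), LogisticRegression(max_iter=1000)
)

full_scores = cross_val_score(full_model, X, y, cv=5, scoring="f1")
selected_scores = cross_val_score(selected_model, X, y, cv=5, scoring="f1")

print(f"all features:      F1 = {full_scores.mean():.3f} +/- {full_scores.std():.3f}")
print(f"selected features: F1 = {selected_scores.mean():.3f} +/- {selected_scores.std():.3f}")
```

If the two means are within a standard deviation of each other, the smaller feature set may still be preferable for its lower inference cost.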

