
What is Feature Engineering?

Feature engineering is the process of transforming raw data into input representations that help machine learning models learn more effectively. It includes creating new features from existing ones, encoding categorical variables, combining signals, and extracting domain-specific information. Good feature engineering often improves model performance more than changing the model architecture does.

Why does feature engineering matter?

Models learn from the representations you give them. Raw data is rarely in a form that makes the underlying patterns obvious. Feature engineering bridges the gap between raw inputs and the signals a model needs to make good predictions.

In modern search and RAG pipelines, feature engineering happens at two levels: at the document level (how you prepare content before embedding) and at the retrieval level (how you combine signals like embedding similarity, metadata, and recency).

What are the main types of feature engineering?

Numerical features

Normalisation / scaling: bring features to comparable ranges (see: feature scaling)
Log transforms: reduce skewness in distributions (e.g. document length, price)
Binning: convert continuous values to categorical buckets
Polynomial features: capture non-linear relationships via powers of features (x², x³) and interaction terms (x₁·x₂)
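
The four numerical transforms above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names, the bucket edges, and the sample document lengths are all made up for the example, and production code would normally use a library such as scikit-learn instead.

```python
import math

def min_max_scale(values):
    """Rescale values linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def log_transform(values):
    """log(1 + x) compresses a long right tail; assumes x >= 0."""
    return [math.log1p(v) for v in values]

def bin_value(value, edges):
    """Return the bucket index for value.

    edges are ascending upper bounds (exclusive); anything at or above
    the last edge falls into the final bucket, index len(edges).
    """
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

def polynomial_features(x1, x2):
    """Degree-2 expansion of two features: originals, squares, interaction."""
    return [x1, x2, x1 * x1, x1 * x2, x2 * x2]

# Hypothetical document lengths in tokens, with one long outlier.
doc_lengths = [120, 450, 980, 15000]
scaled = min_max_scale(doc_lengths)
logged = log_transform(doc_lengths)
buckets = [bin_value(n, edges=[200, 1000, 5000]) for n in doc_lengths]
```

After min-max scaling, the single 15,000-token outlier pushes the other three values close to zero. The log transform keeps them distinguishable, which is the usual reason to apply it to skewed quantities before scaling.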

Categorical features

One-hot encoding: binary vector with one 1 per category
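Label encoding: integer per category (suited to ordinal categories or tree-based models; the implied ordering can mislead linear models)
Target encoding: replace each category with the mean target value observed for it (fit on training data only to avoid target leakage)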
Embeddings: learn dense representations of high-cardinality categories
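
As a rough sketch of the first three encodings, the snippet below uses only the standard library. The `smoothing` parameter in `target_encode` is an assumption added for the example: it pulls rare categories toward the global mean, which is one common way to reduce overfitting, not the only one. The `sources` and `clicked` data are invented.

```python
from collections import defaultdict

def one_hot(value, categories):
    """Binary vector with a single 1 at the position of value."""
    return [1 if value == c else 0 for c in categories]

def label_encode(values):
    """Map each distinct category to an integer (alphabetical order)."""
    mapping = {c: i for i, c in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def target_encode(categories, targets, smoothing=1.0):
    """Smoothed mean target per category.

    Each category mean is blended with the global mean, weighted by
    `smoothing` pseudo-observations. Fit this on training rows only;
    fitting on rows you later evaluate on leaks the target.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }

# Hypothetical data: document source and whether the result was clicked.
sources = ["blog", "docs", "docs", "forum"]
clicked = [0, 1, 1, 0]

codes, mapping = label_encode(sources)
encoding = target_encode(sources, clicked)
```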

Text features

TF-IDF: term frequency-inverse document frequency weighting
Dense embeddings: semantic vectors from encoder models (such as those on SIE)
Metadata extraction: extract structured information (dates, entities, section headers)
Readability scores: Flesch-Kincaid, sentence length statistics
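
To make the TF-IDF weighting concrete, here is a minimal sketch using the textbook form, tf × log(N / df), with whitespace tokenisation. Both choices are simplifying assumptions: real libraries typically use smoothed IDF variants and proper tokenisers, so their numbers will differ from this.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

docs = ["the cat sat", "the dog barked", "the cat purred"]
vectors = tf_idf(docs)
```

In this toy corpus "the" appears in every document, so its weight is zero, while "sat" (unique to one document) outweighs "cat" (shared by two). That is the behaviour the weighting is designed to produce.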

Temporal features

Cyclical encoding: encode periodic values such as day of week or month as a sin/cos pair, so the end of a cycle sits next to its start
Recency features: time since last event, rolling averages
Trend features: rate of change over time windows
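
The temporal features above might be sketched as follows. The helper names and the choice of trailing windows are assumptions for illustration only.

```python
import math
from datetime import datetime

def cyclical_encode(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def days_since(event_time, now):
    """Recency feature: elapsed time since an event, in days."""
    return (now - event_time).total_seconds() / 86400.0

def rolling_mean(values, window):
    """Trailing mean; early positions use however many values exist."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def rate_of_change(values, window):
    """Average change per step over a trailing window.

    Positions without enough history get None.
    """
    return [
        None if i < window else (values[i] - values[i - window]) / window
        for i in range(len(values))
    ]

# Hour 23 and hour 0 are adjacent on the clock; the encoding reflects that.
h23 = cyclical_encode(23, 24)
h0 = cyclical_encode(0, 24)
h12 = cyclical_encode(12, 24)
```

With a raw integer encoding, hour 23 and hour 0 look 23 units apart. On the circle, the distance between `h23` and `h0` is much smaller than between `h23` and `h12`.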

Feature engineering for document retrieval

For search and RAG systems, feature engineering at the document level affects retrieval quality significantly:

Chunking strategy: splitting documents into chunks is itself a feature engineering decision. Chunk size, overlap, and splitting on semantic boundaries (headings, paragraphs) all affect what information ends up in each vector.
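
One way to picture these choices is a greedy chunker that packs whole paragraphs into chunks and carries a few trailing words into the next chunk. This is only a sketch under simple assumptions: paragraphs are separated by blank lines, size is counted in whitespace-separated words rather than model tokens, and `max_words` and `overlap_words` are hypothetical parameter names.

```python
def chunk_paragraphs(text, max_words=200, overlap_words=20):
    """Pack paragraphs into chunks without splitting a paragraph.

    A chunk is closed when adding the next paragraph would exceed
    max_words. The next chunk starts with the last overlap_words words
    of the previous one. A chunk can exceed max_words when a single
    paragraph (plus overlap) is larger than the limit.
    """
    paragraphs = [p.split() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = []
    for words in paragraphs:
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            # Guard against overlap_words == 0: current[-0:] is the whole list.
            current = current[-overlap_words:] if overlap_words > 0 else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "a b c\n\nd e f\n\ng h i"
chunks = chunk_paragraphs(text, max_words=6, overlap_words=2)
```

Here the first two paragraphs fit into one chunk, and the second chunk begins with the overlapping words "e f" before the third paragraph.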

Metadata as retrieval features: document date, source domain, author, and section type are features that can filter or re-weight retrieval results alongside embedding similarity.
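
A minimal sketch of filtering and re-weighting might look like the following. The candidate fields (`similarity`, `age_days`, `source`), the exponential half-life decay, and the linear blend controlled by `recency_mix` are all assumptions chosen for illustration. Real systems tune or learn these weights.

```python
def recency_weight(age_days, half_life_days=30.0):
    """Exponential decay: 1.0 for a brand-new document, 0.5 at one half-life."""
    return 0.5 ** (age_days / half_life_days)

def rerank(candidates, allowed_sources=None, recency_mix=0.2,
           half_life_days=30.0):
    """Filter by source, then blend embedding similarity with recency.

    Returns document ids ordered from best to worst blended score.
    """
    scored = []
    for doc in candidates:
        if allowed_sources is not None and doc["source"] not in allowed_sources:
            continue  # metadata used as a hard filter
        score = ((1 - recency_mix) * doc["similarity"]
                 + recency_mix * recency_weight(doc["age_days"], half_life_days))
        scored.append((score, doc["id"]))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored]

# Hypothetical retrieval candidates.
candidates = [
    {"id": "a", "similarity": 0.80, "age_days": 365, "source": "docs"},
    {"id": "b", "similarity": 0.78, "age_days": 3, "source": "docs"},
    {"id": "c", "similarity": 0.90, "age_days": 10, "source": "forum"},
]
docs_only = rerank(candidates, allowed_sources={"docs"})
everything = rerank(candidates)
```

Document "a" has slightly higher similarity than "b" but is a year old, so the recency term moves "b" ahead of it.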

Query features: analysing query length, detected intent, or language enables routing to different retrieval strategies.
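
As an illustration only, a rule-based router over a few query features could look like this. The feature set, the thresholds, and the strategy names ("keyword", "semantic", "hybrid") are hypothetical and would need to be validated against real query logs.

```python
QUESTION_WORDS = {"how", "what", "why", "when", "where", "which", "who",
                  "can", "does", "is"}

def query_features(query):
    """Extract a few cheap, surface-level features from a query string."""
    tokens = query.split()
    starts_with_question_word = bool(tokens) and tokens[0].lower() in QUESTION_WORDS
    return {
        "n_tokens": len(tokens),
        "is_question": query.strip().endswith("?") or starts_with_question_word,
        "has_quoted_phrase": query.count('"') >= 2,
    }

def route(query):
    """Pick a retrieval strategy from the query features (hypothetical rules)."""
    feats = query_features(query)
    if feats["has_quoted_phrase"]:
        return "keyword"   # the user asked for an exact phrase
    if feats["n_tokens"] <= 2:
        return "keyword"   # very short queries carry little semantic context
    if feats["is_question"]:
        return "semantic"  # natural-language questions suit embeddings
    return "hybrid"
```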

When is feature engineering less important?

Deep learning models, particularly transformer-based models, learn their own representations from raw inputs, reducing (but not eliminating) the need for manual feature engineering. Embedding models like those on SIE do their own “feature engineering” internally during encoding.

That said, the preparation of inputs to the embedding model (cleaning, chunking, metadata enrichment) is still a form of feature engineering that significantly impacts downstream quality.

Frequently asked questions

How do you know which features to engineer? Start with domain knowledge: what signals would a human expert use? Then use feature importance from a tree model, correlation analysis, or ablation testing to validate.
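
For the correlation-analysis step, a rough sketch is to rank candidate features by the absolute Pearson correlation with the target. The feature names and values below are invented. Note that this only detects linear relationships, so a low score does not prove a feature is useless.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)

def rank_features(features, target):
    """Sort (name, correlation) pairs by absolute correlation, strongest first."""
    scored = [(name, pearson(column, target)) for name, column in features.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)

# Hypothetical candidate features and target.
features = {"length": [1, 2, 3, 4], "noise": [2, 1, 2, 1]}
target = [2, 4, 6, 8]
ranking = rank_features(features, target)
```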

What’s the difference between feature engineering and feature selection? Feature engineering creates new features. Feature selection chooses which existing features to include. They’re complementary: engineer first, then select.

Does feature engineering still matter with deep learning? Hand-crafting the inputs to a neural network matters less, but how you prepare data (chunking, cleaning, structuring) before it enters the model still matters significantly.