
What is Data Augmentation?

Data augmentation is the practice of artificially expanding a training dataset by creating modified versions of existing examples. For images this means flips, crops, and colour jitter; for text it means synonym replacement, back-translation, or paraphrasing. Used well, it can reduce overfitting, improve generalisation, and lower the amount of labelled data needed to train a good model.

Why does data augmentation matter?

Collecting labelled training data is expensive and time-consuming. Data augmentation multiplies the effective size of a dataset without additional labelling effort. It is particularly valuable for embedding model fine-tuning and reranker training, which need (query, positive, negative) triplets, because it can generate many training examples from a small labelled set.

How does data augmentation work for images?

Standard image augmentation techniques include:

Technique	What it does	Use case
Random crop	Crops a random region	Encourages translation invariance
Horizontal flip	Mirrors the image left to right	Left-right invariance, where mirroring does not change the label
Colour jitter	Adjusts brightness/contrast/saturation	Lighting robustness
Random rotation	Rotates by a random angle	Orientation invariance
Gaussian noise	Adds pixel noise	Noise robustness
Cutout / erasing	Masks a random region	Occlusion robustness

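from torchvision import transforms

# Each transform is re-sampled per call, so the model sees a different
# variant of the same image on every epoch.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),                     # mirror with probability 0.5 (the default)
    transforms.RandomCrop(224, padding=4),                 # pad 4 px per side, then crop 224x224; the padded input must be at least that large
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # scale brightness and contrast by a random factor in [0.8, 1.2]
])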
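To make the geometric operations in the table concrete, here is a minimal library-free sketch of flip, random crop, and cutout. It assumes an image is a nested list of grayscale values; the function names are illustrative and are not part of torchvision.

```python
import random

def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def random_crop(image, size, rng):
    """Cut a size x size window at a random position."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)    # randint is inclusive at both ends
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]

def cutout(image, size, rng, fill=0):
    """Overwrite a random size x size square with a constant value."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    out = [list(row) for row in image]     # copy so the input is left untouched
    for r in range(top, top + size):
        for c in range(left, left + size):
            out[r][c] = fill
    return out

rng = random.Random(0)                     # seeded for reproducibility
image = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]  # 4x4 grid, values 1..16
flipped = horizontal_flip(image)
cropped = random_crop(image, 3, rng)
masked = cutout(image, 2, rng)
```

In a real pipeline the same operations would run on tensors or arrays; the nested-list form is only meant to show what each transform does to the pixels.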

How does data augmentation work for text?

Text augmentation is harder than image augmentation because text is discrete: changing even a single token can alter the meaning. Common techniques:

Lexical substitution: replace words with synonyms from WordNet or an embedding model’s nearest neighbours.

Back-translation: translate to another language and back to get a paraphrase. Useful for generating query variants.

Random insertion / deletion / swap: insert, delete, or swap random tokens. Adds noise robustness.

LLM paraphrasing: use a language model to rephrase sentences while preserving meaning. This is often the highest-quality approach for generating training pairs, although the outputs should still be spot-checked for meaning drift.
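The token-level techniques above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: the SYNONYMS table is a hypothetical hand-written stand-in for WordNet or embedding nearest neighbours, tokens are split on whitespace, and the probabilities are arbitrary.

```python
import random

# Hypothetical synonym table; a real pipeline might query WordNet
# or an embedding model's nearest neighbours instead.
SYNONYMS = {"quick": ["fast", "rapid"], "buy": ["purchase"], "cheap": ["inexpensive"]}

def synonym_replace(tokens, rng, p=0.3):
    """Replace each token that has a known synonym with probability p."""
    return [
        rng.choice(SYNONYMS[t]) if t in SYNONYMS and rng.random() < p else t
        for t in tokens
    ]

def random_delete(tokens, rng, p=0.1):
    """Drop each token with probability p, but never return an empty list."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

def random_swap(tokens, rng):
    """Swap two randomly chosen positions."""
    if len(tokens) < 2:
        return list(tokens)
    i, j = rng.sample(range(len(tokens)), 2)
    out = list(tokens)
    out[i], out[j] = out[j], out[i]
    return out

rng = random.Random(42)  # seeded for reproducibility
query = "where to buy a cheap quick laptop".split()
variants = [
    synonym_replace(query, rng, p=1.0),  # p=1.0 replaces every token that has a synonym
    random_delete(query, rng),
    random_swap(query, rng),
]
```

Deletion and swapping can change what a query asks for, so it is worth reading a sample of the variants before training on them.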

Data augmentation for embedding model fine-tuning

When fine-tuning an embedding model for a specific domain (e.g. legal, medical), you need (query, positive document) pairs. If you have limited labelled data, augmentation can help:

Query augmentation: for each document, generate multiple paraphrased queries using back-translation or an LLM
Hard negative mining: retrieve semantically similar but incorrect documents as hard negatives, which are more informative than random negatives
Synthetic data generation: use a generative model to write queries for passages in your document corpus, yielding synthetic (query, document) pairs (the approach taken by GPL and InPars)
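Hard negative mining, for example, can be sketched as ranking the corpus against a query and keeping the top-scoring documents that are not labelled positive. The sketch below assumes embeddings are already computed; the toy 3-dimensional vectors, the document ids, and the function names are all illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mine_hard_negatives(query_vec, doc_vecs, positive_ids, k=2):
    """Return the ids of the k most similar documents that are not labelled positive."""
    scored = [
        (cosine(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
        if doc_id not in positive_ids
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Toy vectors standing in for real embeddings.
doc_vecs = {
    "d_pos": [0.9, 0.1, 0.0],
    "d_near": [0.8, 0.3, 0.1],
    "d_mid": [0.5, 0.5, 0.2],
    "d_far": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.1, 0.0]

negatives = mine_hard_negatives(query_vec, doc_vecs, positive_ids={"d_pos"}, k=2)
# One (query, positive, negative) triplet per mined negative.
triplets = [("q1", "d_pos", neg) for neg in negatives]
```

One caveat: the highest-ranked unlabelled documents are sometimes relevant documents that simply were never labelled, so mined negatives are often filtered or sampled from slightly lower ranks rather than taken straight from the top.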

SIE supports LoRA fine-tuning, meaning you can apply domain-adapted augmentation, train a lightweight LoRA adapter, and hot-load it at inference time without retraining the full model.

What is test-time augmentation (TTA)?

Test-time augmentation runs the model on several augmented versions of the same input at inference time and averages the predictions. Gains are usually modest, often quoted at around 1-3% accuracy depending on the task and model, and they come at the cost of higher latency because every input needs multiple forward passes. For embedding models, averaging the vectors from multiple augmented versions of a query is occasionally used to improve retrieval robustness.
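For the embedding case, the averaging step might look like the sketch below. It assumes the query variants already exist (for example from paraphrasing), and toy_embed is a hypothetical hashed bag-of-words stand-in for a real embedding model, used only so the example runs on its own.

```python
import math
import zlib

def l2_normalise(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def toy_embed(text, dim=8):
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # crc32 gives a hash that is stable across runs, unlike the built-in hash()
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return l2_normalise(vec)

def tta_embed(variants, embed=toy_embed):
    """Embed each variant, average component-wise, then re-normalise."""
    vectors = [embed(v) for v in variants]
    mean = [sum(component) / len(vectors) for component in zip(*vectors)]
    return l2_normalise(mean)

variants = [
    "cheap laptop for students",
    "inexpensive laptop for students",
    "budget student laptop",
]
query_vec = tta_embed(variants)  # one vector to use for retrieval
```

Re-normalising after averaging keeps the result on the unit sphere, which matters when retrieval uses dot product rather than cosine similarity.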

Frequently asked questions

How much data augmentation is too much? Augmentation that changes the semantic meaning of the input is harmful. For images, aggressive geometric distortion can hurt. For text, changing too many tokens produces unnatural training examples. Start conservatively and validate on held-out data.

Is data augmentation the same as synthetic data generation? Not quite. Data augmentation modifies existing examples. Synthetic data generation creates entirely new examples (e.g. using a generative model). Both expand training sets, but synthetic generation is more flexible and also riskier if quality isn’t controlled.

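Does data augmentation help with class imbalance? Yes. Augmenting only the minority class can help balance a dataset without simply duplicating existing examples. SMOTE is a related oversampling technique for tabular data that synthesises new minority-class examples by interpolating between existing ones.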
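As a minimal sketch of minority-class augmentation, the helper below adds augmented copies of under-represented classes until every class matches the largest one. The jitter function (small Gaussian noise on a feature vector), the labels, and the data are all hypothetical; for images or text you would pass in a domain-appropriate augmentation instead.

```python
import random
from collections import Counter

def balance_by_augmentation(examples, augment, rng):
    """Add augmented copies of smaller classes until all classes are the same size.

    `examples` is a list of (item, label) pairs.
    """
    by_label = {}
    for item, label in examples:
        by_label.setdefault(label, []).append(item)
    target = max(len(items) for items in by_label.values())
    balanced = list(examples)  # originals are kept as they are
    for label, items in by_label.items():
        for _ in range(target - len(items)):
            source = rng.choice(items)
            balanced.append((augment(source, rng), label))
    return balanced

def jitter(vector, rng, scale=0.05):
    """Hypothetical augmentation: add small Gaussian noise to each feature."""
    return [x + rng.gauss(0.0, scale) for x in vector]

rng = random.Random(7)  # seeded for reproducibility
data = [([0.1, 0.2], "ok")] * 6 + [([0.9, 0.8], "fraud")] * 2
balanced = balance_by_augmentation(data, jitter, rng)
counts = Counter(label for _, label in balanced)
```

Balancing should be applied to the training split only, after the train/validation split, so that augmented copies of an example never leak into the held-out data.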