What is Data Augmentation?

Data augmentation is the practice of artificially expanding a training dataset by creating modified versions of existing examples. For images this means flips, crops, and colour jitter; for text it means synonym replacement, back-translation, or paraphrasing. Used well, it reduces overfitting, improves generalisation, and lowers the amount of labelled data needed to train a good model.


Why does data augmentation matter?

Collecting labelled training data is expensive and time-consuming. Data augmentation multiplies the effective size of a dataset without additional labelling effort. It is particularly valuable for embedding model fine-tuning and reranker training, which need (query, positive, negative) triplets, because it can generate many training pairs from a limited set of examples.


How does data augmentation work for images?

Standard image augmentation techniques include:

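| Technique | What it does | Use case |
|---|---|---|
| Random crop | Crops a random region | Forces location invariance |
| Horizontal flip | Mirrors the image | Doubles dataset size |
| Colour jitter | Adjusts brightness/contrast/saturation | Lighting robustness |
| Random rotation | Rotates by a random angle | Orientation invariance |
| Gaussian noise | Adds pixel noise | Noise robustness |
| Cutout / erasing | Masks a random region | Occlusion robustness |

In torchvision, several of these can be chained into one pipeline:

    from torchvision import transforms

    # Each transform is re-sampled on every call, so the same image
    # is augmented differently each time it is loaded.
    augment = transforms.Compose([
        # Mirror left-right with probability 0.5 (the default).
        transforms.RandomHorizontalFlip(),
        # Pad 4 px per side, then crop 224x224 (input must be at least 216x216).
        transforms.RandomCrop(224, padding=4),
        # Scale brightness and contrast by a random factor in [0.8, 1.2].
        transforms.ColorJitter(brightness=0.2, contrast=0.2),
    ])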

How does data augmentation work for text?

Text augmentation is harder since discrete token changes can alter meaning. Common techniques:

Lexical substitution: replace words with synonyms from WordNet or an embedding model’s nearest neighbours.

Back-translation: translate to another language and back to get a paraphrase. Useful for generating query variants.

Random insertion / deletion / swap: insert, delete, or swap random tokens. Adds noise robustness.

LLM paraphrasing: use a language model to rephrase sentences while preserving meaning. This is often the highest-quality approach for generating training pairs, although the outputs should still be spot-checked for meaning drift.
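The token-level techniques above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function names, the default probabilities, and the `synonyms` dictionary (a stand-in for WordNet or embedding nearest neighbours) are all assumptions made for the example.

```python
import random

def random_deletion(tokens, p=0.1, rng=random):
    """Drop each token with probability p, always keeping at least one."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]

def random_swap(tokens, n_swaps=1, rng=random):
    """Swap two randomly chosen positions, n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def synonym_replace(tokens, synonyms, p=0.2, rng=random):
    """Replace a token with one of its listed synonyms with probability p.

    `synonyms` is a hypothetical dict mapping a lowercase word to a list of
    alternatives; in practice it would come from a thesaurus or a model.
    """
    out = []
    for t in tokens:
        options = synonyms.get(t.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(t)
    return out

# Usage: three noisy variants of one query.
query = "how do i reset my password".split()
variants = [
    random_deletion(query, p=0.2),
    random_swap(query),
    synonym_replace(query, {"reset": ["change", "recover"]}, p=1.0),
]
```

Keeping the probabilities low matters here: deleting or swapping many tokens quickly produces text that no longer means what the label says.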


Data augmentation for embedding model fine-tuning

When fine-tuning an embedding model for a specific domain (e.g. legal, medical), you need (query, positive document) pairs. If you have limited labelled data, augmentation can help:

  • Query augmentation: for each document, generate multiple paraphrased queries using back-translation or an LLM
  • Hard negative mining: retrieve semantically similar but incorrect documents as hard negatives, which are more informative than random negatives
  • Synthetic data generation: use a generative model to write synthetic queries or questions for documents in your corpus, yielding (query, document) training pairs (the GPL / InPars approach)
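Hard negative mining, as described in the list above, amounts to ranking non-positive documents by similarity to the query and keeping the top few. The sketch below assumes embeddings are already computed and stored as plain lists of floats; `mine_hard_negatives`, `doc_vecs`, and `positive_ids` are hypothetical names chosen for the example, and a real pipeline would normally use a vector index rather than a full scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mine_hard_negatives(query_vec, doc_vecs, positive_ids, k=2):
    """Return ids of the k most query-similar documents that are not labelled positive."""
    scored = [
        (cosine(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
        if doc_id not in positive_ids
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Usage with toy 2-d embeddings.
doc_vecs = {
    "pos": [1.0, 0.0],
    "near": [0.9, 0.1],
    "mid": [0.5, 0.5],
    "far": [0.0, 1.0],
}
negatives = mine_hard_negatives([1.0, 0.0], doc_vecs, positive_ids={"pos"}, k=2)
```

One caveat: the highest-ranked non-positives are sometimes relevant documents that were never labelled, so it is worth filtering or reviewing them before treating them as negatives.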

SIE supports LoRA fine-tuning, meaning you can apply domain-adapted augmentation, train a lightweight LoRA adapter, and hot-load it at inference time without retraining the full model.


What is test-time augmentation (TTA)?

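Test-time augmentation runs the model on several augmented versions of the same input at inference time and averages the predictions. The gain is usually modest (figures of roughly 1-3 percentage points of accuracy are commonly quoted, and it varies by task and model), and it comes at the cost of higher latency, because each input requires one forward pass per augmented version. For embedding models, averaging the vectors from multiple augmented versions of a query is occasionally used to improve retrieval robustness.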
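The averaging step can be sketched as follows. The names `tta_predict`, `predict`, and `augmentations` are hypothetical: `predict` stands in for any model call that returns a list of scores (class probabilities or an embedding vector), and each augmentation is a function from input to input.

```python
def tta_predict(predict, x, augmentations):
    """Average model outputs over the original input and its augmented views."""
    views = [x] + [aug(x) for aug in augmentations]
    outputs = [predict(v) for v in views]  # one forward pass per view
    n = len(outputs)
    # Element-wise mean across views.
    return [sum(column) / n for column in zip(*outputs)]

# Usage with a toy "model" whose score depends on the first element,
# and a flip augmentation that reverses the input.
toy_predict = lambda v: [v[0], 1 - v[0]]
flip = lambda v: v[::-1]
averaged = tta_predict(toy_predict, [0.2, 0.8], [flip])
```

The cost is visible in the code: latency grows linearly with the number of views. When the outputs are embeddings that are compared by cosine similarity, the averaged vector is typically re-normalised before use.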


Frequently asked questions

How much data augmentation is too much? Augmentation that changes the semantic meaning of the input is harmful. For images, aggressive geometric distortion can hurt. For text, changing too many tokens produces unnatural training examples. Start conservatively and validate on held-out data.

Is data augmentation the same as synthetic data generation? Not quite. Data augmentation modifies existing examples, while synthetic data generation creates entirely new examples (e.g. using a generative model). Both expand training sets; synthetic generation can cover cases the original data never contained, but it is riskier if quality isn’t controlled.

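Does data augmentation help with class imbalance? Yes. Augmenting only the minority class is a form of oversampling that adds varied examples instead of exact duplicates, which helps balance a dataset with less risk of memorising repeated samples. For tabular features, SMOTE works in a similar spirit by interpolating between neighbouring minority-class examples.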
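A minimal sketch of minority-class augmentation, assuming an `augment` function is already available (for example one of the image or text transforms discussed earlier). `balance_by_augmentation` is a hypothetical helper written for this example, not a library function.

```python
import random
from collections import Counter

def balance_by_augmentation(examples, labels, augment, rng=random):
    """Add augmented copies of under-represented classes until every class
    has as many examples as the largest one."""
    counts = Counter(labels)
    target = max(counts.values())
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    out_x, out_y = list(examples), list(labels)
    for y, items in by_class.items():
        for _ in range(target - counts[y]):
            # Augment a randomly chosen original from this class.
            out_x.append(augment(rng.choice(items)))
            out_y.append(y)
    return out_x, out_y

# Usage: class 1 has one example, class 0 has three.
xs, ys = balance_by_augmentation(
    ["a", "b", "c", "d"], [0, 0, 0, 1], augment=lambda s: s + "'"
)
```

Apply this to the training split only; augmenting validation or test data would distort the evaluation.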

