
What is Data Cleaning?

Data cleaning (also called data cleansing) is the process of identifying and correcting errors, inconsistencies, and missing values in a dataset before using it for training or analysis. It includes handling nulls, removing duplicates, fixing type mismatches, standardising formats, and filtering outliers. Unclean data is one of the most common causes of poor model performance.


Why does data cleaning matter?

“Garbage in, garbage out” is the most reliable principle in machine learning. A model trained on noisy, inconsistent, or incomplete data will learn those errors. In document retrieval and RAG pipelines, poor data cleaning means your index contains duplicates, malformed chunks, and low-quality content, all of which degrade search quality.


What are the main data cleaning tasks?

Handling missing values

Options depend on how much data is missing and why:

| Strategy | When to use |
|---|---|
| Remove rows | Values missing at random, small percentage of rows affected |
| Mean/median imputation | Numerical features; mean if roughly normal, median if skewed |
| Mode imputation | Categorical features |
| Model-based imputation | Complex dependencies between features |
| Leave as-is | Some tree-based models (e.g. XGBoost, LightGBM) handle nulls natively |
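As a minimal sketch of median and mode imputation, assuming pandas and a hypothetical toy dataset (the column names `price` and `category` are illustrative, not from any real schema), the fill values can be learned from the training split and then reused on the test split:

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
train = pd.DataFrame({
    "price": [10.0, 12.0, None, 11.0, 100.0],
    "category": ["a", "b", "a", None, "a"],
})
test = pd.DataFrame({
    "price": [None, 9.0],
    "category": [None, "b"],
})

# Learn the fill values from the training split only, so that
# no statistics from the test split leak into the cleaning step.
price_median = train["price"].median()            # median is robust to the 100.0 value
category_mode = train["category"].mode().iloc[0]  # most frequent category


def impute(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with nulls filled using training-set statistics."""
    return df.assign(
        price=df["price"].fillna(price_median),
        category=df["category"].fillna(category_mode),
    )


train_clean = impute(train)
test_clean = impute(test)
```

The median is used here rather than the mean because the toy `price` column is skewed by one large value; for a roughly normal feature the mean would be an equally reasonable choice.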

Removing duplicates

Duplicate records inflate dataset size and bias model training towards repeated examples. For document corpora used in RAG, near-duplicate chunks waste context window space and reduce answer diversity.

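# Drops exact duplicates only (first occurrence kept); near-duplicates need similarity-based matching
df = df.drop_duplicates(subset=['document_id', 'chunk_text'])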

Fixing data types

Ensure numerical columns are actually numeric, dates are parsed correctly, and categorical variables are consistently encoded.
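One way to do this in pandas is sketched below. The DataFrame and its columns (`page_count`, `published`, `source`) are hypothetical, and coercing unparseable values to nulls is an assumption that suits exploratory cleaning; a stricter pipeline might prefer to raise an error instead.

```python
import pandas as pd

# Hypothetical raw data where every column arrived as strings.
raw = pd.DataFrame({
    "page_count": ["12", "7", "n/a"],
    "published": ["2024-01-15", "2024-02-30", "2023-11-02"],  # 30 Feb is invalid
    "source": ["Web", "web ", "PDF"],
})

typed = raw.assign(
    # Unparseable numbers become NaN instead of raising.
    page_count=pd.to_numeric(raw["page_count"], errors="coerce"),
    # Invalid or malformed dates become NaT instead of raising.
    published=pd.to_datetime(raw["published"], format="%Y-%m-%d", errors="coerce"),
    # Trim and lowercase labels before encoding so "Web" and "web " collapse to one category.
    source=raw["source"].str.strip().str.lower().astype("category"),
)

# Count how many values failed to parse, so the coercion is not silent.
failed_numbers = int(typed["page_count"].isna().sum())
failed_dates = int(typed["published"].isna().sum())
```

Counting the coerced values afterwards matters: `errors="coerce"` hides bad input unless the resulting nulls are inspected.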

Standardising formats

Normalise text case, whitespace, punctuation, and encoding. For document ingestion pipelines, consistent formatting before chunking improves embedding quality.
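A minimal sketch of text normalisation using only the Python standard library is shown below. The function name `normalise_text` is hypothetical, and the choice of NFKC plus case folding is an assumption: some embedding models are case-sensitive, in which case the final step should be dropped.

```python
import re
import unicodedata


def normalise_text(text: str) -> str:
    """Normalise Unicode form, whitespace, and case for a piece of text."""
    # NFKC folds compatibility characters such as ligatures and
    # non-breaking spaces into their plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space.
    text = re.sub(r"\s+", " ", text).strip()
    # Case-fold for case-insensitive comparison; skip this for cased models.
    return text.casefold()


cleaned = normalise_text("  Data\u00a0Cleaning\n\t\ufb01rst ")
```

Applying the same function to every document before chunking keeps otherwise identical passages from being treated as different text.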

Outlier detection and handling

Statistical outliers can distort model training. Options: remove, cap (winsorise), or transform (log-scale). For retrieval systems, outlier documents (extremely long, extremely short, or off-topic) often warrant separate handling.
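The capping option can be sketched with the standard library as follows. The function name `winsorise_iqr`, the 1.5 × IQR fences, and the sample chunk lengths are all assumptions chosen for illustration; percentile-based caps are an equally common choice.

```python
import statistics


def winsorise_iqr(values, k=1.5):
    """Cap values that fall outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # Values inside the fences pass through unchanged; others are clamped.
    return [min(max(v, low), high) for v in values]


# Hypothetical chunk lengths with one extreme value.
lengths = [10, 12, 11, 13, 12, 11, 200]
capped = winsorise_iqr(lengths)
```

Capping keeps the row in the dataset while limiting its influence, whereas removal discards it entirely; which is appropriate depends on whether the extreme value is an error or a genuine observation.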


Data cleaning in document processing pipelines

For RAG and search pipelines, document-level cleaning is critical before indexing:

Text extraction quality: OCR output often contains artefacts (broken words, noise characters). Post-processing with regex or a language model can clean these up. SIE’s OCR models produce structured output that reduces downstream cleaning effort.

Chunk deduplication: overlapping chunks in large corpora create near-duplicates that waste vector DB space and return redundant results. MinHash or embedding-similarity deduplication helps.

Metadata consistency: ensure document metadata (date, source, category) is consistently formatted so it can be used for filtered retrieval.

Language detection: if your corpus is mixed-language, detect and tag languages so you can route to the appropriate embedding model.
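To illustrate the chunk deduplication step, here is a minimal sketch that computes exact Jaccard similarity over word shingles, which is the quantity MinHash approximates. It is not MinHash itself: the pairwise comparison below is quadratic in the number of chunks and only suits small corpora. The function names, the shingle size of 3, and the 0.7 threshold are assumptions for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(chunks, threshold=0.7):
    """Keep a chunk only if it is below the threshold against every kept chunk."""
    kept, kept_shingles = [], []
    for chunk in chunks:
        current = shingles(chunk)
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept.append(chunk)
            kept_shingles.append(current)
    return kept


# Hypothetical chunks: the second differs from the first only in case and punctuation.
chunks = [
    "data cleaning removes errors and inconsistencies from a dataset before training",
    "Data cleaning removes errors and inconsistencies from a dataset before training.",
    "feature engineering creates informative representations from clean data",
]
unique_chunks = drop_near_duplicates(chunks)
```

For large corpora, MinHash signatures with locality-sensitive hashing avoid comparing every pair of chunks while estimating the same similarity.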


Data cleaning vs feature engineering

| | Data cleaning | Feature engineering |
|---|---|---|
| Goal | Remove errors and inconsistencies | Create informative representations |
| Order | First | After cleaning |
| Output | Clean, consistent raw data | New or transformed features |
| Impact | Fixes broken signals | Adds new signals |

Both are required. Feature engineering on dirty data amplifies errors.


Frequently asked questions

How much time does data cleaning take? Commonly cited estimates put data cleaning and preparation at 60-80% of a data science project’s time, though the figure varies by project. The share tends to be higher still for document processing pipelines, where source data is unstructured.

Does data cleaning ever lose important information? Yes. Removing outliers or imputing values changes the data. To avoid leakage, derive cleaning parameters (imputation values, outlier thresholds) from the training data only, then apply those same transformations to the test data.

Can I automate data cleaning? Partially. Tools like Great Expectations, dbt, and pandas-profiling (since renamed ydata-profiling) automate detection and reporting. But decisions about what to do with anomalies often require domain knowledge.

