---
title: Data Cleaning
description: Master data cleaning techniques for accurate analysis. Learn data preprocessing methods, outlier detection, missing value imputation, and machine learning approaches to ensure data quality and reliability.
canonical_url: https://superlinked.com/glossary/data-cleaning
last_updated: 2026-06-11
---

# What is Data Cleaning?

Data cleaning (also called data cleansing) is the process of identifying and correcting errors, inconsistencies, and missing values in a dataset before using it for training or analysis. It includes handling nulls, removing duplicates, fixing type mismatches, standardising formats, and filtering outliers. Unclean data is one of the most common causes of poor model performance.

---

## Why does data cleaning matter?

"Garbage in, garbage out" is one of the most dependable rules of thumb in machine learning. A model trained on noisy, inconsistent, or incomplete data learns those errors along with the signal. In document retrieval and RAG pipelines, poor data cleaning leaves duplicates, malformed chunks, and low-quality content in the index, all of which degrade search quality.

---

## What are the main data cleaning tasks?

### Handling missing values
Options depend on how much data is missing and why:

| Strategy | When to use |
|---|---|
| Remove rows | Values missing completely at random, small percentage of rows |
| Mean/median imputation | Numerical features; mean for roughly symmetric distributions, median for skewed ones |
| Mode imputation | Categorical features |
| Model-based imputation | Complex dependencies between features |
| Leave as-is | The model handles nulls natively (e.g. XGBoost, LightGBM) |
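
As a minimal sketch of the median and mode strategies in the table, assuming a pandas DataFrame with hypothetical `price` and `category` columns (the data and column names are illustrative only):

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "price": [10.0, 12.0, None, 11.0, 200.0],
    "category": ["a", "b", "a", None, "a"],
})

# Record which rows were missing before filling, so the signal is not lost.
df["price_was_missing"] = df["price"].isna()

# Median imputation for a numeric column (less sensitive to the 200.0 value than the mean).
price_median = df["price"].median()
df["price_filled"] = df["price"].fillna(price_median)

# Mode imputation for a categorical column.
category_mode = df["category"].mode().iloc[0]
df["category_filled"] = df["category"].fillna(category_mode)
```

Writing the filled values to new columns keeps the original values available for inspection. Whether to keep the missing-value flag as a feature is a modelling decision.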

### Removing duplicates
Duplicate records inflate dataset size and bias model training towards repeated examples. For document corpora used in RAG, near-duplicate chunks waste context window space and reduce answer diversity.

```python
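# Assumes df is a pandas DataFrame of chunks; drops exact duplicates only, keeping the first occurrence.
df = df.drop_duplicates(subset=['document_id', 'chunk_text'])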
```

### Fixing data types
Ensure numerical columns are actually numeric, dates are parsed correctly, and categorical variables are consistently encoded.
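
One way to do this in pandas is sketched below. The column names and sample values are hypothetical, and coercing unparseable values to nulls (rather than raising) is an assumption that suits exploratory cleaning more than strict validation:

```python
import pandas as pd

# Hypothetical raw data where every column arrived as text.
raw = pd.DataFrame({
    "amount": ["10.5", "7", "n/a"],
    "created": ["2024-01-05", "2024-02-30", "2024-03-01"],
    "status": ["Open", "open ", "CLOSED"],
})

clean = pd.DataFrame({
    # errors="coerce" turns unparseable values into NaN instead of raising.
    "amount": pd.to_numeric(raw["amount"], errors="coerce"),
    # An explicit format avoids day/month guessing; invalid dates become NaT.
    "created": pd.to_datetime(raw["created"], format="%Y-%m-%d", errors="coerce"),
    # Normalise labels first so "Open" and "open " become one category.
    "status": raw["status"].str.strip().str.lower().astype("category"),
})

# Count the values that failed to parse, so they can be reviewed rather than silently lost.
failed_per_column = clean.isna().sum()
```

Coerced values show up as nulls, so this step usually runs before missing-value handling.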

### Standardising formats
Normalise text case, whitespace, punctuation, and encoding. For document ingestion pipelines, consistent formatting before chunking improves embedding quality.
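
A minimal sketch of such a normalisation step, using only the Python standard library. The function name `normalise_text` and the specific rules (NFKC folding, dropping control characters, at most one blank line) are assumptions, not a fixed recipe; lowercasing is left optional because case can carry meaning for some embedding models:

```python
import re
import unicodedata

def normalise_text(text: str, lowercase: bool = False) -> str:
    """Return text with Unicode forms, control characters, and whitespace normalised."""
    # NFKC folds compatibility forms: ligatures, full-width letters, non-breaking spaces.
    text = unicodedata.normalize("NFKC", text)
    # Drop control/format characters (Unicode category C*), keeping newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim spaces around line breaks, then allow at most one blank line.
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = text.strip()
    return text.lower() if lowercase else text

# Example input with a ligature, a non-breaking space, tabs, and a zero-width space.
raw = "  The \ufb01le\u00a0 name\t\tis\u200b here \n\n\n\nNext  "
cleaned = normalise_text(raw)
```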

### Outlier detection and handling
Statistical outliers can distort model training. Common options are to remove them, cap them (winsorise), or transform the feature (for example, log-scale). For retrieval systems, outlier documents (extremely long, extremely short, or off-topic) often warrant separate handling.
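
The capping option can be sketched as follows. This version caps values at the interquartile-range (IQR) fences, which is one common variant; classic winsorising caps at fixed percentiles instead. The function name, the multiplier `k = 1.5`, and the sample chunk lengths are all assumptions:

```python
import pandas as pd

def cap_outliers_iqr(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical chunk lengths in tokens, with one extreme value.
lengths = pd.Series([120, 130, 125, 140, 135, 5000], dtype="float64")
capped = cap_outliers_iqr(lengths)

# Flag the affected rows in case they deserve separate handling rather than capping.
was_capped = capped != lengths
```

Capping keeps the row, which matters when the other columns are still informative. Removing the row is the simpler choice when the whole record is suspect.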

---

## Data cleaning in document processing pipelines

For RAG and search pipelines, document-level cleaning is critical before indexing:

**Text extraction quality**: OCR output often contains artefacts (broken words, noise characters). Post-processing with regex or a language model can clean these up. SIE's OCR models produce structured output that reduces downstream cleaning effort.

**Chunk deduplication**: overlapping chunks in large corpora create near-duplicates that waste vector DB space and return redundant results. MinHash or embedding-similarity deduplication helps.
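
To illustrate the idea behind near-duplicate detection, the sketch below computes exact Jaccard similarity over word shingles. This is the quantity MinHash approximates; the exact version compares every pair and is only practical for small corpora. The function names, the shingle size of 3, and the 0.8 threshold are assumptions:

```python
import re

def shingles(text: str, n: int = 3) -> set:
    """Return the set of n-word shingles for a text, ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def drop_near_duplicates(chunks: list, threshold: float = 0.8) -> list:
    """Keep the first chunk of each near-duplicate group (O(n^2) comparisons)."""
    kept = []
    kept_shingles = []
    for chunk in chunks:
        current = shingles(chunk)
        if any(jaccard(current, other) >= threshold for other in kept_shingles):
            continue  # Too similar to a chunk we already kept.
        kept.append(chunk)
        kept_shingles.append(current)
    return kept
```

For large corpora, a MinHash with locality-sensitive hashing or an approximate nearest-neighbour search over embeddings avoids the pairwise comparison.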

**Metadata consistency**: ensure document metadata (date, source, category) is consistently formatted so it can be used for filtered retrieval.
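
A small sketch of metadata normalisation, assuming metadata arrives as a dict with `date`, `source`, and `category` keys. The accepted date formats and the slug-style category are hypothetical choices to adapt to your own sources:

```python
from datetime import datetime
from typing import Optional

# Assumed input date formats, tried in order; extend for your own sources.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")

def normalise_date(value: str) -> Optional[str]:
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def normalise_metadata(meta: dict) -> dict:
    """Return metadata with an ISO date and lower-cased, trimmed source and category."""
    return {
        "date": normalise_date(str(meta.get("date", ""))),
        "source": str(meta.get("source", "")).strip().lower(),
        "category": str(meta.get("category", "")).strip().lower().replace(" ", "-"),
    }
```

Returning `None` for unparseable dates keeps bad values out of range filters while leaving them easy to find and fix.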

**Language detection**: if your corpus is mixed-language, detect and tag languages so you can route to the appropriate embedding model.

---

## Data cleaning vs feature engineering

| | Data cleaning | Feature engineering |
|---|---|---|
| Goal | Remove errors and inconsistencies | Create informative representations |
| Order | First | After cleaning |
| Output | Clean, consistent raw data | New or transformed features |
| Impact | Fixes broken signals | Adds new signals |

Both are required. Feature engineering on dirty data amplifies errors.

---

## Frequently asked questions

**How much time does data cleaning take?**
Commonly cited estimates put data cleaning and preparation at 60-80% of a data science project's time, though such figures are rough and vary widely by project. The share tends to be higher for document processing pipelines, where source data is unstructured.

**Does data cleaning ever lose important information?**
Yes. Removing outliers or imputing values changes the data. To avoid leakage, compute cleaning parameters (imputation values, outlier thresholds) on the training data only, then apply the same transformations to the test data.
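
A minimal sketch of that split, assuming a single hypothetical `age` column. The median fill and the 0.99-quantile cap are illustrative choices; the point is that both statistics are computed from the training rows only:

```python
import pandas as pd

# Hypothetical train/test split.
train = pd.DataFrame({"age": [20.0, 30.0, None, 40.0]})
test = pd.DataFrame({"age": [None, 90.0]})

# Fit: cleaning parameters come from the training data only.
fill_value = train["age"].median()
upper_cap = train["age"].quantile(0.99)

def clean(frame: pd.DataFrame, fill_value: float, upper_cap: float) -> pd.DataFrame:
    """Apply previously fitted cleaning parameters without recomputing them."""
    out = frame.copy()
    out["age"] = out["age"].fillna(fill_value).clip(upper=upper_cap)
    return out

# Transform: the same parameters are applied to both splits.
train_clean = clean(train, fill_value, upper_cap)
test_clean = clean(test, fill_value, upper_cap)
```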

**Can I automate data cleaning?**
Partially. Tools like Great Expectations, dbt, and pandas-profiling (now ydata-profiling) automate detection and reporting. Deciding what to do with anomalies still often requires domain knowledge.

---

## Related resources

- [What is feature engineering?](/glossary/feature-engineering)
- [What is feature selection?](/glossary/feature-selection)
- [What is data augmentation?](/glossary/data-augmentation)
- [Regulatory Intelligence RAG example](/docs/examples/regulatory-intelligence-rag)
- [What is RAG?](/glossary/what-is-rag)
