A4.2.1 Describe the significance of data cleaning. (HL only)

• The impact of data quality on model performance

• Techniques for handling outliers, removing or consolidating duplicate data, identifying incorrect data, filtering irrelevant data, transforming improperly formatted data, and imputation, deletion or predictive modelling for missing data

• Normalization and standardization as crucial preprocessing steps

The Big Idea

In machine learning, data quality is often more important than model complexity. A sophisticated algorithm trained on poor-quality data will underperform, misclassify, or even propagate harmful biases. Data cleaning is the foundational preprocessing step that ensures the training data is accurate, consistent, and formatted in a way that algorithms can process effectively. Clean data enables better generalization, reduces overfitting, and improves the model’s predictive power and reliability.

Poor data quality can introduce noise, bias, and misleading patterns, leading to degraded performance regardless of the learning paradigm (e.g., supervised, unsupervised, or reinforcement learning). Cleaning data is not merely a technical requirement; it is a core determinant of machine learning success.

The Impact of Data Quality on Model Performance

A machine learning model’s ability to learn patterns and make accurate predictions is directly tied to the quality of the input data. Poor data leads to:

- Lower accuracy on both training and unseen test data
- Increased variance, making models more prone to overfitting
- Misleading feature importance or correlations
- Longer training times due to inefficient learning
- Biased or unethical outcomes, especially in sensitive domains (e.g., health, justice)

A widely accepted principle is:

"Garbage in, garbage out" — a model is only as good as the data it learns from.

Core Data Cleaning Techniques

1. Handling Outliers

Definition: Values that lie far outside the normal range of the data.
Detection Methods:
- Statistical (e.g., z-score, IQR method)
- Visual (e.g., box plots, scatter plots)
Treatment Options:
- Remove them if they are due to errors
- Cap them (clipping) to within bounds
- Transform them (e.g., log scale)
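
As a minimal sketch of the IQR method and clipping, assuming a small list of heights where 400 is a likely entry error (the data, the function names, and the conventional multiplier k = 1.5 are illustrative assumptions, not part of the syllabus):

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences using the IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def clip_outliers(values, k=1.5):
    """Cap values that fall outside the IQR fences (clipping)."""
    lower, upper = iqr_bounds(values, k)
    return [min(max(v, lower), upper) for v in values]

# Hypothetical sample: 400 cm is almost certainly a typo.
heights_cm = [160, 162, 165, 167, 170, 172, 175, 400]

lower, upper = iqr_bounds(heights_cm)
outliers = [v for v in heights_cm if v < lower or v > upper]
clipped = clip_outliers(heights_cm)

print(outliers)   # values flagged by the IQR rule
print(clipped)    # same list with the flagged value capped at the upper fence
```

Whether to remove, cap, or transform a flagged value remains a judgment call: a genuine extreme measurement may carry useful information, while an entry error does not.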

2. Removing or Consolidating Duplicate Data

Problem: Duplicates can overemphasize certain examples, biasing the model.
Approach:
- Identify exact duplicates using hash functions or row comparisons, and near-duplicates by normalizing values first (e.g., case, whitespace) or by fuzzy matching
- Consolidate duplicates by averaging, summing, or selecting the most recent record
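
One possible sketch of consolidation by "most recent record", assuming records are dictionaries keyed by an email address and carry an ISO-formatted `updated` date (the field names and data are hypothetical):

```python
# Hypothetical records: the first two differ only in case/whitespace of the key.
records = [
    {"email": "Ana@Example.com ", "city": "Lima", "updated": "2024-01-05"},
    {"email": "ana@example.com", "city": "Cusco", "updated": "2024-03-10"},
    {"email": "bo@example.com", "city": "Oslo", "updated": "2024-02-01"},
    {"email": "bo@example.com", "city": "Oslo", "updated": "2024-02-01"},
]

def normalize_key(record):
    """Build a comparison key; trimming and lowercasing catches simple near-duplicates."""
    return record["email"].strip().lower()

def consolidate_most_recent(rows):
    """Keep one record per key, preferring the latest 'updated' date."""
    latest = {}
    for row in rows:
        key = normalize_key(row)
        # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
        if key not in latest or row["updated"] > latest[key]["updated"]:
            latest[key] = row
    return list(latest.values())

deduplicated = consolidate_most_recent(records)
print(len(deduplicated))  # one record per distinct normalized email
```

Using a dictionary keyed on the normalized value plays the role of the hash-based lookup mentioned above; true fuzzy matching (e.g., for misspelled names) would need a similarity measure instead.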

3. Identifying and Correcting Incorrect Data

Examples: Misspelled labels, numerical typos, corrupted fields
Techniques:
- Cross-checking values against external reference datasets (not to be confused with cross-validation, the model evaluation technique)
- Domain-specific rules (e.g., age cannot be negative)
- Manual inspection or rule-based correction
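
The rule-based approach can be sketched as follows. The label set, the table of known misspellings, and the plausible age range are all hypothetical values chosen for illustration:

```python
VALID_LABELS = {"cat", "dog"}                 # hypothetical label set
LABEL_FIXES = {"dgo": "dog", "catt": "cat"}   # hypothetical known misspellings

def correct_label(label):
    """Normalize case and whitespace, then apply known spelling fixes."""
    cleaned = label.strip().lower()
    return LABEL_FIXES.get(cleaned, cleaned)

def rule_violations(row):
    """Return the domain rules that a record breaks."""
    problems = []
    if not 0 <= row["age"] <= 120:            # assumed plausible age range
        problems.append("age out of range")
    if row["label"] not in VALID_LABELS:
        problems.append("unknown label")
    return problems

rows = [
    {"age": 4, "label": " Dog"},
    {"age": -3, "label": "cat"},
    {"age": 7, "label": "dgo"},
    {"age": 5, "label": "bird"},
]

# Step 1: apply automatic corrections. Step 2: flag what the rules still reject.
corrected = [{**row, "label": correct_label(row["label"])} for row in rows]
flagged = {i: rule_violations(row)
           for i, row in enumerate(corrected) if rule_violations(row)}
print(flagged)  # records left for manual inspection
```

Records that remain flagged after automatic correction are the ones that would go to manual inspection.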

4. Filtering Irrelevant or Redundant Data

Why it matters: Irrelevant features introduce noise and slow learning.
Technique: Feature selection methods (e.g., correlation analysis, mutual information)
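
A minimal sketch of correlation-based filtering, assuming a numeric target and a cut-off of 0.3 on the absolute Pearson correlation (the feature names, data, and threshold are illustrative assumptions):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)

# Hypothetical dataset: predict exam score from candidate features.
features = {
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "shoe_size": [40, 41, 39, 40, 39, 41],
    "row_id": [7, 7, 7, 7, 7, 7],
}
target = [52, 55, 61, 64, 70, 74]

THRESHOLD = 0.3  # assumed cut-off; tune for the problem at hand
scores = {name: pearson(column, target) for name, column in features.items()}
selected = [name for name, r in scores.items() if abs(r) >= THRESHOLD]
print(selected)  # features kept for training
```

Pearson correlation only measures linear association, so a low score does not prove a feature is irrelevant; mutual information can capture non-linear relationships that this check would miss.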

5. Transforming Improperly Formatted Data

Problem: Inconsistent formats (e.g., date-time fields, units of measurement)
Solution:
- Use parsers and converters to unify formats (e.g., YYYY-MM-DD)
- Ensure consistent units (e.g., converting inches to centimeters)
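
As an illustration, the following sketch unifies dates to YYYY-MM-DD and lengths to centimeters. The list of accepted input formats is an assumption about the data source; in particular, `05/03/2024` is treated as day/month/year here, which would be wrong for US-style data:

```python
from datetime import datetime

# Assumed input formats, tried in order.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")

def to_iso_date(text):
    """Parse a date string in any assumed format and return YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unparseable: leave for manual review rather than guess

CM_PER_INCH = 2.54

def to_centimeters(value, unit):
    """Convert a length to centimeters; only 'cm' and 'in' are handled here."""
    if unit == "cm":
        return float(value)
    if unit == "in":
        return float(value) * CM_PER_INCH
    raise ValueError(f"unsupported unit: {unit}")

print(to_iso_date("05/03/2024"))   # day/month/year input
print(to_iso_date("20240305"))     # compact input
print(to_centimeters(10, "in"))
```

Returning `None` for unparseable dates keeps the failure visible instead of silently inventing a value.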

6. Dealing with Missing Data

a. Deletion

Drop rows or columns with missing values (only appropriate when data loss is minimal)

b. Imputation

Replace missing values with:
- Mean or median (for numerical features) or mode (for categorical features)
- Forward or backward fill (in time series)
- K-nearest neighbor imputation
- Regression-based prediction (predict missing value using other features)
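
The simpler imputation strategies can be sketched as follows, assuming missing values are represented as `None` (the column names and data are hypothetical):

```python
from statistics import median, mode

def fill_missing(values, fill):
    """Replace every None with a single fill value."""
    return [fill if v is None else v for v in values]

def forward_fill(values):
    """Carry the last observed value forward (for ordered/time-series data)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # a leading None stays None: nothing to carry yet
    return filled

ages = [23, None, 31, 27, None, 90]          # numerical feature
colors = ["red", "blue", None, "red"]        # categorical feature
temps = [18.5, None, None, 19.0, None]       # time series

# Statistics are computed from the observed values only.
age_median = median([v for v in ages if v is not None])
color_mode = mode([v for v in colors if v is not None])

ages_filled = fill_missing(ages, age_median)
colors_filled = fill_missing(colors, color_mode)
temps_filled = forward_fill(temps)
print(ages_filled, colors_filled, temps_filled)
```

The median is used for `ages` because the value 90 would pull the mean upward; this is a design choice for the example, not a general rule.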

c. Predictive Modelling

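Train a secondary model (e.g., a regression model for numerical features or a classifier for categorical ones) on the complete records, then use it to predict the missing values from the other features. This generalizes the regression-based imputation listed above.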
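
A minimal sketch of this idea, assuming a single predictor and a least-squares line as the secondary model (the `height_cm` and `weight_kg` fields and their values are hypothetical; a real pipeline would typically use several features and a library model):

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

rows = [
    {"height_cm": 150, "weight_kg": 50},
    {"height_cm": 160, "weight_kg": 60},
    {"height_cm": 170, "weight_kg": 70},
    {"height_cm": 180, "weight_kg": None},   # value to be predicted
]

# Train only on rows where the target feature is present.
complete = [r for r in rows if r["weight_kg"] is not None]
slope, intercept = fit_line([r["height_cm"] for r in complete],
                            [r["weight_kg"] for r in complete])

# Fill each gap with the model's prediction.
imputed = [
    {**r, "weight_kg": slope * r["height_cm"] + intercept}
    if r["weight_kg"] is None else r
    for r in rows
]
print(imputed[-1])
```

Predicted values are estimates, so it can be worth recording which entries were imputed in case later analysis needs to treat them differently.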

Normalization and Standardization

Both are crucial preprocessing steps that transform data into a consistent scale, improving model performance, especially for distance-based or gradient-based algorithms.

Normalization (Min-Max Scaling)

Formula:
$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$
Range: [0, 1]
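Use Case: Helpful when features need a bounded range or when the algorithm is sensitive to input scale (e.g., neural networks, KNN, SVM with RBF kernels). Min-max scaling is sensitive to outliers, because a single extreme value determines the range.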
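
The min-max formula can be applied directly, as in this sketch (the `incomes` data is hypothetical, and mapping a constant column to 0.0 is an assumed convention to avoid division by zero):

```python
def min_max_scale(values):
    """Rescale values to [0, 1] using x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: assumed convention
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20000, 35000, 50000, 80000]
scaled = min_max_scale(incomes)
print(scaled)  # → [0.0, 0.25, 0.5, 1.0]
```

In a training pipeline, the minimum and maximum are normally computed on the training set only and then reused for test data, so that information from the test set does not leak into preprocessing.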

Standardization (Z-Score Scaling)

Formula:
$x' = \frac{x - \mu}{\sigma}$
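Result: Mean = 0, standard deviation = 1 (values are not bounded to a fixed range)
Use Case: Gradient-based or regularized models (e.g., logistic regression, linear regression), PCA, and features that are roughly Gaussian. Standardization rescales the data but does not change the shape of its distribution, and it is typically less distorted by outliers than min-max scaling.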
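
The z-score formula can be sketched in the same way. This version assumes the population standard deviation (`pstdev`); using the sample standard deviation instead is an equally common choice, and the `scores` data is hypothetical:

```python
from statistics import fmean, pstdev

def standardize(values):
    """Rescale values using x' = (x - mu) / sigma."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: assumed convention
    return [(v - mu) / sigma for v in values]

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, population std 2
z_scores = standardize(scores)
print(z_scores)
```

After the transformation the values have mean 0 and standard deviation 1, but the largest z-score shows that they are not confined to [0, 1].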

Summary

Clean, well-prepared data is non-negotiable in any serious machine learning pipeline. Data cleaning ensures:

- Validity: Values fall within expected ranges and formats
- Consistency: Across features, records, and sources
- Completeness: Sufficient information exists for learning
- Accuracy: Data represents true measurements

Investing effort in data cleaning increases model robustness, reduces bias, and allows algorithms to extract meaningful patterns, maximizing the effectiveness of any learning system.