A4.2.1 Describe the significance of data cleaning. (HL only)

• The impact of data quality on model performance  
• Techniques for handling outliers, removing or consolidating duplicate data, identifying incorrect data, filtering irrelevant data, transforming improperly formatted data, and imputation, deletion or predictive modelling for missing data 
• Normalization and standardization as crucial preprocessing steps

The Big Idea

In machine learning, data quality is often more important than model complexity. A sophisticated algorithm trained on poor-quality data will underperform, misclassify, or even propagate harmful biases. Data cleaning is the foundational preprocessing step that ensures the training data is accurate, consistent, and formatted in a way that algorithms can process effectively. Clean data enables better generalization, reduces overfitting, and improves the model’s predictive power and reliability.

Poor data quality can introduce noise, bias, and misleading patterns, degrading performance regardless of the learning paradigm (supervised, unsupervised, or reinforcement learning). Cleaning data is not merely a technical requirement; it is a core determinant of machine learning success.


The Impact of Data Quality on Model Performance

A machine learning model’s ability to learn patterns and make accurate predictions is directly tied to the quality of the input data. Poor data leads to:

  • Lower accuracy on both training and unseen test data
  • Increased variance: the model fits noise and errors in the training data, making it more prone to overfitting
  • Misleading feature importance or correlations
  • Longer training times due to inefficient learning
  • Biased or unethical outcomes, especially in sensitive domains (e.g., health, justice)

A widely accepted principle is:

"Garbage in, garbage out" — a model is only as good as the data it learns from.


Core Data Cleaning Techniques

1. Handling Outliers

  • Definition: Values that lie far outside the normal range of the data.
  • Detection Methods:
    • Statistical (e.g., z-score, IQR method)
    • Visual (e.g., box plots, scatter plots)
  • Treatment Options:
    • Remove them if they are due to errors
    • Cap them (clipping) to within bounds
    • Transform them (e.g., log scale)
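
The IQR method and clipping described above can be sketched as follows. This is a minimal illustration using only Python's standard library; the function names, the sample heights, and the conventional multiplier k = 1.5 are assumptions chosen for the example, not part of the syllabus.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def clip_outliers(values, k=1.5):
    """Cap every value to the IQR fences (clipping rather than removal)."""
    low, high = iqr_bounds(values, k)
    return [min(max(v, low), high) for v in values]

# Hypothetical heights in cm; the last value looks like a data-entry error.
heights_cm = [158, 162, 165, 167, 170, 171, 174, 178, 181, 1750]

low, high = iqr_bounds(heights_cm)
outliers = [v for v in heights_cm if v < low or v > high]
clipped = clip_outliers(heights_cm)

print(outliers)  # values flagged as outliers
print(clipped)   # same list with the extreme value capped at the upper fence
```

Whether to remove, clip, or transform a flagged value still depends on domain knowledge: a genuine but rare measurement should usually be kept, while an obvious typo should not.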

2. Removing or Consolidating Duplicate Data

  • Problem: Duplicates can overemphasize certain examples, biasing the model.
  • Approach:
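    • Identify exact duplicates using hash functions or row comparisons, and near-duplicates by comparing normalized values (e.g., ignoring case and whitespace) or using fuzzy matching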
    • Consolidate duplicates by averaging, summing, or selecting the most recent record
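
As a rough sketch of the approach above, the following code treats two records as duplicates when their name and email match after normalization, and keeps the most recent one. The field names ("name", "email", "updated") and the sample records are hypothetical.

```python
def normalize_key(record):
    """Build a comparison key that ignores case and surrounding whitespace."""
    return (record["name"].strip().lower(), record["email"].strip().lower())

def consolidate_latest(records):
    """Keep one record per key, preferring the most recent 'updated' date."""
    latest = {}
    for rec in records:
        key = normalize_key(rec)
        # ISO dates (YYYY-MM-DD) sort correctly when compared as strings.
        if key not in latest or rec["updated"] > latest[key]["updated"]:
            latest[key] = rec
    return list(latest.values())

records = [
    {"name": "Ana Silva", "email": "ana@example.com", "updated": "2024-01-10"},
    {"name": "ana silva ", "email": "ANA@example.com", "updated": "2024-03-02"},
    {"name": "Ben Okoro", "email": "ben@example.com", "updated": "2024-02-15"},
]

unique = consolidate_latest(records)
print(len(unique))  # number of distinct people after consolidation
```

Selecting the most recent record is only one consolidation policy; averaging or summing numeric fields may be more appropriate when duplicates represent repeated measurements.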

3. Identifying and Correcting Incorrect Data

  • Examples: Misspelled labels, numerical typos, corrupted fields
  • Techniques:
    • Cross-checking values against external or reference datasets
    • Domain-specific rules (e.g., age cannot be negative)
    • Manual inspection or rule-based correction
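
Domain-specific rules can be expressed as small validity checks applied to every row. The sketch below is illustrative only: the fields, the allowed labels, and the upper age limit of 120 are assumptions made for the example.

```python
# Each rule returns True when a value is acceptable (assumed rules).
RULES = {
    "age": lambda v: isinstance(v, (int, float)) and 0 <= v <= 120,
    "label": lambda v: v in {"cat", "dog"},
}

def find_violations(rows, rules=RULES):
    """Return (row_index, field) for every value that breaks a rule."""
    violations = []
    for i, row in enumerate(rows):
        for field, is_valid in rules.items():
            if not is_valid(row.get(field)):
                violations.append((i, field))
    return violations

rows = [
    {"age": 34, "label": "cat"},
    {"age": -5, "label": "dog"},   # age cannot be negative
    {"age": 41, "label": "dgo"},   # misspelled label
]

problems = find_violations(rows)
print(problems)  # [(1, 'age'), (2, 'label')]
```

Flagging violations first, rather than correcting them automatically, leaves room for manual inspection of ambiguous cases.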

4. Filtering Irrelevant or Redundant Data

  • Why it matters: Irrelevant features introduce noise and slow learning.
  • Technique: Feature selection methods (e.g., correlation analysis, mutual information)
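
One simple form of correlation analysis is to keep only the features whose Pearson correlation with the target exceeds a threshold. The code below is a minimal sketch under that assumption; the threshold of 0.3, the feature names, and the data are hypothetical, and Pearson correlation only detects linear relationships.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (sx * sy)

def select_features(features, target, threshold=0.3):
    """Keep feature names whose |correlation| with the target >= threshold."""
    return [name for name, column in features.items()
            if abs(pearson(column, target)) >= threshold]

features = {
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "shoe_size": [40, 38, 38, 40, 40, 38],   # unrelated to the target
    "constant_flag": [1, 1, 1, 1, 1, 1],     # never varies
}
exam_score = [52, 58, 61, 70, 74, 83]

kept = select_features(features, exam_score)
print(kept)  # ['hours_studied']
```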

5. Transforming Improperly Formatted Data

  • Problem: Inconsistent formats (e.g., date-time fields, units of measurement)
  • Solution:
    • Use parsers and converters to unify formats (e.g., YYYY-MM-DD)
    • Ensure consistent units (e.g., converting inches to centimeters)
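
A small sketch of both fixes follows: dates in a few assumed input formats are converted to YYYY-MM-DD, and lengths are converted to centimeters. The list of accepted date formats (including the assumption that slashed dates are day/month/year) and the unit codes "cm" and "in" are choices made for this example.

```python
from datetime import datetime

# Input formats this sketch assumes it may encounter, tried in order.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")

def to_iso_date(text):
    """Parse a date string in any assumed format and return YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next format
    raise ValueError(f"Unrecognized date format: {text!r}")

CM_PER_INCH = 2.54

def to_centimeters(value, unit):
    """Convert a length to centimeters; 'cm' and 'in' are supported."""
    if unit == "cm":
        return float(value)
    if unit == "in":
        return value * CM_PER_INCH
    raise ValueError(f"Unsupported unit: {unit!r}")

print(to_iso_date("03/02/2024"))   # 2024-02-03 (day/month/year assumed)
print(to_centimeters(10, "in"))
```

Raising an error on unrecognized input, instead of guessing, prevents silently corrupting values such as ambiguous day/month orderings.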

6. Dealing with Missing Data

a. Deletion

  • Drop rows or columns with missing values (only appropriate when data loss is minimal)
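
Before deleting, it helps to measure how much data would be lost. The sketch below assumes missing values are represented as None and uses hypothetical rows; it drops incomplete rows and reports the fraction removed.

```python
def drop_incomplete(rows):
    """Listwise deletion: keep only rows with no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row.values())]

def fraction_lost(rows):
    """Share of rows that listwise deletion would remove."""
    return 1 - len(drop_incomplete(rows)) / len(rows)

rows = [
    {"age": 25, "income": 41000},
    {"age": None, "income": 38000},
    {"age": 31, "income": 52000},
    {"age": 47, "income": 60000},
]

complete = drop_incomplete(rows)
print(len(complete), fraction_lost(rows))  # 3 0.25
```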

b. Imputation

  • Replace missing values with:
    • Mean/median/mode (for numerical or categorical features)
    • Forward or backward fill (in time series)
    • K-nearest neighbor imputation
    • Regression-based prediction (predict missing value using other features)
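
Two of the simpler strategies above, median imputation and forward fill, can be sketched with the standard library. Missing values are assumed to be represented as None, and the function names are illustrative.

```python
import statistics

def impute_median(values):
    """Replace each missing value with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

def forward_fill(values):
    """Carry the last observed value forward (suited to time series)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None until a first value is seen
    return filled

print(impute_median([10, None, 30, 20, None]))  # [10, 20, 30, 20, 20]
print(forward_fill([None, 5, None, None, 8]))   # [None, 5, 5, 5, 8]
```

The median is often preferred over the mean when the feature contains outliers, because a single extreme value does not shift it.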

c. Predictive Modelling

  • Train a secondary model that treats the incomplete feature as its target and predicts the missing values from the other features, using the complete rows as training data
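
As a deliberately small illustration, the secondary model below is a one-variable least-squares line fitted on the complete rows and then used to fill the gaps. The column names and toy data are hypothetical; in practice the model would usually draw on several features.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def impute_with_regression(rows, predictor, target):
    """Fill missing target values using a line fitted on complete rows."""
    complete = [r for r in rows if r[target] is not None]
    slope, intercept = fit_line([r[predictor] for r in complete],
                                [r[target] for r in complete])
    return [
        {**r, target: slope * r[predictor] + intercept}
        if r[target] is None else r
        for r in rows
    ]

# Toy data: weight rises by 1 kg per cm of height.
rows = [
    {"height_cm": 150, "weight_kg": 50},
    {"height_cm": 160, "weight_kg": 60},
    {"height_cm": 170, "weight_kg": 70},
    {"height_cm": 165, "weight_kg": None},
]

filled = impute_with_regression(rows, "height_cm", "weight_kg")
print(filled[3]["weight_kg"])  # predicted weight for the incomplete row
```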

Normalization and Standardization

Both are crucial preprocessing steps that transform data into a consistent scale, improving model performance, especially for distance-based or gradient-based algorithms.

Normalization (Min-Max Scaling)

  • Formula:

    x' = \frac{x - \min(x)}{\max(x) - \min(x)}

  • Range: [0, 1]
  • Use Case: Helpful for neural networks and other algorithms sensitive to input scale (e.g., KNN, SVM with RBF kernels); because the result depends on the minimum and maximum, it is sensitive to outliers
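
A minimal sketch of min-max scaling follows. Returning zeros for a constant column is an assumption made here to avoid division by zero; other conventions exist.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([10, 20, 30, 50]))  # [0.0, 0.25, 0.5, 1.0]
```

When a model is evaluated on held-out data, the minimum and maximum should be computed on the training set only and then reused for the test set, so that no information leaks from the test data.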

Standardization (Z-Score Scaling)

  • Formula:

    x' = \frac{x - \mu}{\sigma}

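  • Result: Mean = 0, Standard deviation = 1 (the values themselves are not confined to a fixed range)
  • Use Case: Algorithms that work best with centered, comparably scaled features (e.g., regularized linear or logistic regression, SVM, PCA, and models trained by gradient descent); it does not require the data to be Gaussian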
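
Z-score scaling can be sketched in the same style. The population standard deviation is used here, which is an assumption of this example; the constant-column fallback is also a chosen convention.

```python
import statistics

def standardize(values):
    """Shift to mean 0 and scale to standard deviation 1 (z-scores)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]   # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]

# Mean is 5 and population standard deviation is 2 for this sample.
z = standardize([2, 4, 4, 4, 5, 5, 7, 9])
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```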

Summary

Clean, well-prepared data is non-negotiable in any serious machine learning pipeline. Data cleaning ensures:

  • Validity: Values fall within expected ranges and formats
  • Consistency: Across features, records, and sources
  • Completeness: Sufficient information exists for learning
  • Accuracy: Data represents true measurements

Investing effort in data cleaning increases model robustness, reduces bias, and allows algorithms to extract meaningful patterns, which maximizes the effectiveness of any learning system.