Data Integrity

This article is not assessed by the IB but may be helpful to deepen your understanding. Plus, I think it's cool.

The Big Idea

Data integrity is the principle that data must remain accurate, consistent, and trustworthy throughout its entire lifecycle—from the moment it is created, through storage and processing, to transmission and long-term archival. In other words, an information system is only as reliable as the correctness and stability of the data it manages.

Whether we consider CPU registers, network packets, database records, or logs produced by distributed systems, data integrity ensures that the data you work with is actually the data you intended to work with.

Integrity failures lead to corrupted records, incorrect outputs, security vulnerabilities, and system failures. Maintaining data integrity is therefore a foundational responsibility for every system designer, developer, and administrator.


1. What Data Integrity Means

Data integrity refers to the correctness, completeness, and internal consistency of data at all times. It has three tightly connected dimensions:

  1. Accuracy – The data correctly reflects the real-world entity or event it represents.
  2. Consistency – The data does not contradict itself across different storage locations, systems, or states.
  3. Reliability/Trustworthiness – The data has not been altered unintentionally or maliciously.

Integrity applies across all layers of computing—not just databases—and is protected by a combination of hardware mechanisms, software mechanisms, and human processes.

 

2. Types of Data Integrity

2.1 Physical Integrity

Physical integrity ensures data is preserved correctly at the hardware level. Threats include:

  • disk failures
  • bit rot
  • power loss
  • electromagnetic interference
  • physical damage, fire, water, etc.

Techniques to maintain physical integrity include:

  • RAID arrays with redundancy (mirroring or parity)
  • ECC (Error-Correcting Code) memory
  • redundant backups
  • uninterruptible power supplies (UPS)
  • hardware-level checksums

Physical integrity protects the existence of data: the bits that were written are the bits that can later be read back.
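One common way to notice silent corruption such as bit rot is to store a digest next to the data and recompute it later. The sketch below is a minimal illustration using Python's standard hashlib module; the function names and sample bytes are hypothetical, and real storage systems do this per block in firmware or in the file system.

```python
import hashlib

def digest(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()

def is_intact(data: bytes, stored_digest: str) -> bool:
    """Recompute the digest and compare it with the one saved earlier."""
    return digest(data) == stored_digest

original = b"exam results 2024"   # hypothetical file contents
saved = digest(original)          # stored when the data was written

# Simulate bit rot: flip one bit in the first byte.
rotted = bytes([original[0] ^ 0x01]) + original[1:]

print(is_intact(original, saved))  # True
print(is_intact(rotted, saved))    # False
```

A digest only detects the damage. Repairing it still needs a redundant copy, which is why checksums are paired with RAID or backups.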

2.2 Logical Integrity

Logical integrity ensures data remains valid according to rules defined by the system. These rules may be simple (e.g., “age must be ≥ 0”) or complex (“every foreign key must reference an existing primary key”).

Logical integrity is maintained through:

  • domain constraints
  • format constraints
  • referential integrity
  • transaction integrity (ACID properties)
  • algorithmic validation

Logical integrity protects the correctness of data.
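As a rough illustration of domain and format constraints, the sketch below checks a record before it is accepted. The field names and the "S followed by four digits" ID format are assumptions made up for this example, not rules from any particular system.

```python
import re

# Hypothetical format rule: the letter S followed by exactly four digits.
STUDENT_ID_PATTERN = re.compile(r"S\d{4}")

def validate_student(record: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the record is valid."""
    errors = []

    age = record.get("age")
    if not isinstance(age, int) or age < 0:
        errors.append("age must be an integer >= 0")        # domain constraint

    student_id = record.get("student_id")
    if not isinstance(student_id, str) or not STUDENT_ID_PATTERN.fullmatch(student_id):
        errors.append("student_id must look like S1234")    # format constraint

    return errors

print(validate_student({"student_id": "S1234", "age": 17}))  # []
print(validate_student({"student_id": "1234", "age": -1}))   # two violations
```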

 

3. Data Integrity Across Computing Domains

Although many students first encounter data integrity in the context of database design, it is a universal computing concept.

3.1 Integrity in Hardware and CPU Operations

At the lowest level, integrity protects the correctness of data stored in registers and memory.

Examples:

  • ECC RAM automatically detects and corrects single-bit errors.
  • Parity bits and checksums verify correctness when data moves across buses.
  • Instruction pipelines depend on that integrity: a corrupted instruction that enters the fetch–decode–execute cycle propagates through every later stage.

If memory or register integrity is compromised, the CPU may execute corrupted instructions or operate on corrupted operands.
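To make single-bit correction concrete, here is a toy Hamming(7,4) code: 4 data bits are stored with 3 parity bits, and the pattern of failed parity checks (the syndrome) gives the position of the flipped bit. This is a teaching sketch under that assumption; real ECC memory uses wider codes over whole memory words, and the function names are hypothetical.

```python
def hamming_encode(data: list[int]) -> list[int]:
    """Encode 4 data bits as a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming_correct(codeword: list[int]) -> tuple[list[int], int]:
    """Return (data bits, corrected position); position 0 means no error found."""
    c = codeword[:]  # work on a copy
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 1-based position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1          # flip it back
    return [c[2], c[4], c[5], c[6]], syndrome

stored = hamming_encode([1, 0, 1, 1])
stored[4] ^= 1                        # a single bit flips at position 5
print(hamming_correct(stored))        # ([1, 0, 1, 1], 5)
```

This code corrects any single flipped bit in the 7-bit word. Two flipped bits would be miscorrected, which is why practical ECC adds an extra parity bit to detect double errors.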

3.2 Integrity in Networks

During transmission, data can be corrupted by noise or interference, and packets can be lost, duplicated, or reordered. Network integrity is protected by:

  • checksums (TCP)
  • cyclic redundancy checks (CRC) (Ethernet frames)
  • sequence numbers (detecting lost, duplicated, or out-of-order segments so data is reassembled correctly)
  • digital signatures (making tampering detectable)
  • TLS certificates (authenticating the identity of the other party)

Integrity ensures that the packet received is the same packet that was sent.
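The CRC idea from the list above can be sketched with Python's standard zlib module. The framing here (payload followed by a 4-byte trailer) is a simplification chosen for the example, not the actual Ethernet frame layout, and the function names are hypothetical.

```python
import zlib

def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 trailer to the payload."""
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(4, "big")

def check_frame(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the trailer."""
    payload, trailer = frame[:-4], frame[-4:]
    return zlib.crc32(payload) == int.from_bytes(trailer, "big")

frame = make_frame(b"hello, network")
damaged = bytes([frame[0] ^ 0x04]) + frame[1:]   # one bit flipped in transit

print(check_frame(frame))    # True
print(check_frame(damaged))  # False
```

A CRC catches accidental errors well, but anyone who alters the payload on purpose can simply recompute the trailer. Detecting deliberate tampering needs a keyed mechanism such as a digital signature.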

3.3 Integrity in Databases

Databases enforce integrity using multiple mechanisms:

  • Entity integrity – each row has a valid, unique primary key.
  • Referential integrity – foreign keys reference existing entities.
  • Domain integrity – data values conform to types and constraints.
  • Transactional integrity – ACID rules ensure consistent state transitions.

This is the most formalized context in which students usually study integrity.
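The sketch below shows referential, domain, and transactional integrity being enforced by a database engine, using Python's built-in sqlite3 module with an in-memory database. The table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.execute("""
    CREATE TABLE student (
        id  INTEGER PRIMARY KEY,
        age INTEGER NOT NULL CHECK (age >= 0)                 -- domain integrity
    )""")
conn.execute("""
    CREATE TABLE enrolment (
        student_id INTEGER NOT NULL REFERENCES student(id),  -- referential integrity
        course     TEXT NOT NULL
    )""")
conn.execute("INSERT INTO student (id, age) VALUES (1, 17)")
conn.commit()

def run_transaction(statements: list[str]) -> bool:
    """Apply all statements atomically; roll everything back if any one is rejected."""
    try:
        with conn:  # commits on success, rolls back on an exception
            for sql in statements:
                conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

# Negative age violates the CHECK constraint.
print(run_transaction(["INSERT INTO student (id, age) VALUES (3, -5)"]))  # False

# The second statement references a student that does not exist,
# so the first insert is rolled back as well.
print(run_transaction([
    "INSERT INTO student (id, age) VALUES (2, 16)",
    "INSERT INTO enrolment (student_id, course) VALUES (99, 'CS')",
]))  # False
print(conn.execute("SELECT COUNT(*) FROM student").fetchone()[0])  # 1
```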

3.4 Integrity in Software Systems

Software maintains integrity by:

  • validating inputs
  • using immutable data structures where appropriate
  • structuring algorithms to avoid unintended side effects
  • preventing race conditions
  • using rigorous testing and code review practices

Integrity ensures software output remains predictable and correct.
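Race conditions are a good example of how software can damage its own data. In the sketch below, several threads update one shared counter; the lock makes each read-modify-write step indivisible so that no update is lost. The class and function names are hypothetical.

```python
import threading

class Counter:
    """A counter whose increments are serialized with a lock."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # Without the lock, two threads could read the same old value
        # and one of the two updates would be silently lost.
        with self._lock:
            self.value += 1

def run(thread_count: int, increments: int) -> int:
    """Start the threads, wait for all of them, and return the final count."""
    counter = Counter()

    def work() -> None:
        for _ in range(increments):
            counter.increment()

    workers = [threading.Thread(target=work) for _ in range(thread_count)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return counter.value

print(run(4, 10_000))  # 40000
```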

3.5 Integrity in Distributed Systems

Distributed systems introduce new challenges:

  • node failure
  • network partitions
  • eventual consistency (replicas may temporarily disagree)
  • replication conflicts

Integrity mechanisms include:

  • version vectors
  • consensus algorithms (e.g., Paxos, Raft)
  • write-ahead logs
  • quorum reads/writes

Integrity must be preserved despite concurrency and partial system failure.
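Version vectors, mentioned in the list above, let replicas decide whether one copy of a value supersedes another or whether the two were updated concurrently. A minimal sketch, assuming each vector is a dictionary mapping a node name to the number of updates that node has made:

```python
def compare(a: dict[str, int], b: dict[str, int]) -> str:
    """Compare two version vectors (node name -> update count)."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "conflict"      # concurrent updates: neither copy includes the other
    if a_ahead:
        return "a is newer"
    if b_ahead:
        return "b is newer"
    return "equal"

print(compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 1}))  # a is newer
print(compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 2}))  # conflict
```

A "conflict" result is the signal that the system must merge the two copies or ask the application to resolve them, rather than silently overwriting one.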

 

4. Threats to Data Integrity

Common threats include:

  • human error
  • software bugs
  • faulty hardware
  • malicious modification (cyberattacks)
  • misconfigured systems
  • concurrency conflicts
  • ransomware
  • unsynchronized caches or replicas
  • improper backup/restore workflows

Understanding these threats allows engineers to design robust mitigation strategies.

 

5. Mechanisms for Protecting Data Integrity

A system may use several or all of these:

Hardware-level protections

  • ECC memory
  • checksums
  • CRC
  • mirrored or RAID storage
  • redundant power supplies

Software-level protections

  • validation rules
  • exception handling
  • input sanitization
  • type systems
  • concurrency control
  • transactional logic

Network-level protections

  • TLS encryption with message integrity
  • TCP checksum verification
  • packet sequence enforcement

Organizational protections

  • access control
  • authentication and authorization
  • audit logs
  • backup policies
  • version control
  • change-management procedures

Together, these form a layered defense strategy.
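As one illustration of message integrity against deliberate modification, the sketch below attaches a keyed tag (an HMAC) to a message using Python's standard hmac and hashlib modules. The key and messages are placeholders; this shows the general idea only and is not how TLS itself is implemented.

```python
import hashlib
import hmac

def sign(key: bytes, message: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the message."""
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Check the tag; compare_digest avoids leaking information through timing."""
    return hmac.compare_digest(sign(key, message), tag)

key = b"shared-secret-for-demo-only"      # placeholder key
message = b"transfer 10 credits to alice"
tag = sign(key, message)

print(verify(key, message, tag))                          # True
print(verify(key, b"transfer 99 credits to alice", tag))  # False
```

Without the key, an attacker who changes the message cannot produce a matching tag, which is the property a plain checksum lacks.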

 

6. Why Data Integrity Matters

Integrity failures have severe consequences:

  • corrupted financial transactions
  • incorrect medical records leading to harmful decisions
  • compromised machine learning datasets, leading to biased or inaccurate models
  • broken authentication and access control systems
  • inconsistent replication in distributed databases
  • incorrect program execution causing system crashes

In short: every reliable computing system depends on data integrity.

 

7. Examples for Classroom Use

Example 1 – Network Packet

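A TCP segment arrives carrying the checksum 0x4A12.
If the checksum the receiver computes over the segment is 0x3F99, the two values do not match and the segment is discarded.
Integrity is preserved by rejecting the corrupted data; the sender retransmits the segment when no acknowledgement arrives.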
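The checksum in this example is a 16-bit ones'-complement sum, as described in RFC 1071. The sketch below computes it over a payload only; a real TCP checksum also covers the TCP header and a pseudo-header of IP fields, so treat this as a simplification with hypothetical function names.

```python
def internet_checksum(data: bytes) -> int:
    """Ones'-complement sum of 16-bit words, then complemented (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def segment_is_valid(data: bytes, received_checksum: int) -> bool:
    """The receiver recomputes the checksum and compares it with the one sent."""
    return internet_checksum(data) == received_checksum

payload = b"integrity!"
sent_checksum = internet_checksum(payload)

print(segment_is_valid(payload, sent_checksum))        # True
print(segment_is_valid(b"integritx!", sent_checksum))  # False: segment is discarded
```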

Example 2 – Database Insert

A student record is inserted without a valid primary key.
The database rejects the operation, preserving entity integrity.
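This rejection can be reproduced with Python's built-in sqlite3 module. The table layout and names are assumptions for the example; NOT NULL is written out explicitly because SQLite, unlike most database engines, otherwise permits NULL in a non-integer primary key column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE student (
        student_id TEXT PRIMARY KEY NOT NULL,
        name       TEXT NOT NULL
    )""")

def insert_student(student_id, name) -> bool:
    """Try to insert one row; return False if the database rejects it."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute("INSERT INTO student VALUES (?, ?)", (student_id, name))
        return True
    except sqlite3.IntegrityError:
        return False

print(insert_student("S1001", "Ada"))   # True
print(insert_student(None, "Grace"))    # False: missing primary key
print(insert_student("S1001", "Alan"))  # False: duplicate primary key
```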

Example 3 – Machine Learning

A mislabeled training sample introduces incorrect decision boundaries.
Cleaning and validating data prevents integrity loss in the model.
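Some labelling problems can be caught mechanically before training. The sketch below flags labels outside an allowed set and inputs that appear more than once with different labels. It is a simple illustration with made-up data, and it cannot catch a sample that is consistently but wrongly labelled.

```python
def find_label_problems(
    samples: list[tuple[str, str]], allowed: set[str]
) -> dict[str, list[str]]:
    """Return inputs with unknown labels and inputs seen with conflicting labels."""
    unknown = [item for item, label in samples if label not in allowed]

    labels_seen: dict[str, set[str]] = {}
    for item, label in samples:
        labels_seen.setdefault(item, set()).add(label)
    conflicting = [item for item, labels in labels_seen.items() if len(labels) > 1]

    return {"unknown_label": unknown, "conflicting_labels": conflicting}

samples = [
    ("cat photo", "cat"),
    ("dog photo", "dog"),
    ("cat photo", "dog"),    # same input, different label
    ("bird photo", "brid"),  # typo in the label
]
print(find_label_problems(samples, {"cat", "dog", "bird"}))
```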

Example 4 – File Storage

An SSD uses error-correcting codes (ECC) to detect and correct corrupted blocks, while wear-levelling spreads writes evenly across cells to slow the wear that causes such errors.

 

Conclusion

Data integrity is not a single concept limited to one system or technology. It is a cross-cutting principle that enables reliable computation at every level of a computer system—from the CPU executing instructions, to a database maintaining referential consistency, to network protocols ensuring packets are unmodified, to distributed architectures coordinating state across nodes.

Ultimately, no computational process is trustworthy without strong guarantees of data integrity.