A3.4.4 Describe the features of distributed databases. (HL only)

A3.4.4 Describe the features of distributed databases. 
• The need to maintain data consistency in a distributed database 
• The role of ACID to ensure reliable processing of transactions in distributed databases 
• Features of distributed databases: concurrency control, data consistency, data partitioning, data security, distribution transparency, fault tolerance, global query processing, location transparency, replication, scalability

 

The Big Idea

A distributed database is a collection of logically related databases that are physically distributed across multiple locations, connected via a network. Despite being stored in different sites, the system presents itself as a single unified database to the user or application.

Distributed databases are essential for modern systems that require high availability, geographic distribution, and scalability. However, they introduce new challenges, particularly in maintaining consistency, ensuring transaction reliability, and managing data distribution transparently.


Why Use a Distributed Database?

  • Geographic availability: keep data close to users in multiple regions
  • Fault tolerance: system continues to work even if one node fails
  • Load distribution: share the processing burden across multiple machines
  • Scalability: add more nodes to handle more data and users
  • Resilience: local failures don't compromise the entire system

1. ACID Transactions in Distributed Systems

To ensure reliable processing of transactions, distributed databases aim to preserve ACID properties:

ACID PropertyDefinition in Distributed Context
AtomicityA transaction must either complete on all nodes or roll back completely.
ConsistencyAll nodes must remain in a valid state before and after the transaction.
IsolationConcurrent transactions should not interfere with each other.
DurabilityOnce committed, changes are permanent, even if a node crashes.

Distributed atomicity is typically implemented via two-phase commit (2PC) or three-phase commit protocols, where coordinators and participants negotiate commit decisions across nodes.


2. Features of Distributed Databases

a. Data Consistency

Ensures that all replicas and partitions reflect the same state after updates. Techniques include:

  • Synchronous replication (strong consistency)
  • Eventual consistency (common in NoSQL systems)
  • Quorum-based protocols (used in systems like Cassandra)

b. Concurrency Control

Prevents conflicts when multiple users access or modify the same data simultaneously. Techniques:

  • Distributed locking protocols
  • Timestamp ordering
  • Optimistic concurrency control

c. Data Partitioning (Sharding)

Splits large datasets across multiple nodes:

  • Horizontal partitioning: rows are distributed (e.g., by region or customer ID)
  • Vertical partitioning: columns are distributed across nodes
    This improves performance and makes scaling possible.

d. Replication

Keeps copies of data on multiple nodes for:

  • Fault tolerance
  • Load balancing
  • Improved read latency

Replication types:

  • Master-slave (single-writer)
  • Multi-master (concurrent writes with conflict resolution)

e. Distribution Transparency

Users interact with the system as if it were a single database, unaware of:

  • Data location
  • Data fragmentation
  • Replication strategies

Subtypes include:

  • Location transparency: users don’t need to know where data is stored
  • Replication transparency: users don’t manage copies
  • Failure transparency: users don’t see node failures

f. Fault Tolerance

The ability to detect, isolate, and recover from failures without interrupting service:

  • Redundant nodes and failover strategies
  • Heartbeat mechanisms
  • Write-ahead logging and checkpointing

g. Global Query Processing

Queries may span multiple nodes and require:

  • Distributed query planners
  • Data shipping or function shipping
  • Join strategies across fragments or replicas

Goal: minimize data movement, latency, and network cost.

h. Data Security

Because data spans networks and sites, distributed systems require:

  • Encrypted communication channels
  • Access controls and authentication per site
  • Audit trails and secure replication

i. Scalability

The system must support horizontal scaling:

  • Add more nodes as data volume or user load increases
  • Avoid bottlenecks (e.g., single points of failure)
  • Maintain performance predictability as it grows

Example Use Cases

SystemWhy Distributed Database?
Global E-commerceLocal replicas for fast regional access and fault tolerance
Banking SystemsRegional partitions for compliance, with global consistency guarantees
Social NetworksHorizontal sharding of user data; real-time replication for availability
Content DeliveryReplicated media metadata across geographic nodes

Summary

A distributed database spans multiple sites, providing resilience, performance, and geographic scalability—but only if critical features like ACID compliance, concurrency control, and replication are carefully implemented.

The key design tension lies between:

  • Strong consistency vs. availability and latency
  • Simplicity vs. transparency and fault tolerance

Understanding these trade-offs is essential for engineers building globally distributed, high-throughput systems. As data volumes and access requirements grow, distributed databases provide the architectural foundation for scalable, reliable, and intelligent data infrastructure.