A3.4.4 Describe the features of distributed databases. (HL only)

A3.4.4 Describe the features of distributed databases.

• The need to maintain data consistency in a distributed database

• The role of ACID to ensure reliable processing of transactions in distributed databases

• Features of distributed databases: concurrency control, data consistency, data partitioning, data security, distribution transparency, fault tolerance, global query processing, location transparency, replication, scalability

The Big Idea

A distributed database is a collection of logically related databases that are physically distributed across multiple locations, connected via a network. Despite being stored in different sites, the system presents itself as a single unified database to the user or application.

Distributed databases are essential for modern systems that require high availability, geographic distribution, and scalability. However, they introduce new challenges, particularly in maintaining consistency, ensuring transaction reliability, and managing data distribution transparently.

Why Use a Distributed Database?

Geographic availability: keep data close to users in multiple regions
Fault tolerance: system continues to work even if one node fails
Load distribution: share the processing burden across multiple machines
Scalability: add more nodes to handle more data and users
Resilience: local failures don't compromise the entire system

1. ACID Transactions in Distributed Systems

To ensure reliable processing of transactions, distributed databases aim to preserve ACID properties:

ACID Property	Definition in Distributed Context
Atomicity	A transaction must either complete on all nodes or roll back completely.
Consistency	All nodes must remain in a valid state before and after the transaction.
Isolation	Concurrent transactions should not interfere with each other.
Durability	Once committed, changes are permanent, even if a node crashes.

Distributed atomicity is typically implemented via two-phase commit (2PC) or three-phase commit protocols, where coordinators and participants negotiate commit decisions across nodes.

2. Features of Distributed Databases

a. Data Consistency

Ensures that all replicas and partitions reflect the same state after updates. Techniques include:

Synchronous replication (strong consistency)
Eventual consistency (common in NoSQL systems)
Quorum-based protocols (used in systems like Cassandra)

b. Concurrency Control

Prevents conflicts when multiple users access or modify the same data simultaneously. Techniques:

Distributed locking protocols
Timestamp ordering
Optimistic concurrency control

c. Data Partitioning (Sharding)

Splits large datasets across multiple nodes:

Horizontal partitioning: rows are distributed (e.g., by region or customer ID)
Vertical partitioning: columns are distributed across nodes
This improves performance and makes scaling possible.

d. Replication

Keeps copies of data on multiple nodes for:

Fault tolerance
Load balancing
Improved read latency

Replication types:

Master-slave (single-writer)
Multi-master (concurrent writes with conflict resolution)

e. Distribution Transparency

Users interact with the system as if it were a single database, unaware of:

Data location
Data fragmentation
Replication strategies

Subtypes include:

Location transparency: users don’t need to know where data is stored
Replication transparency: users don’t manage copies
Failure transparency: users don’t see node failures

f. Fault Tolerance

The ability to detect, isolate, and recover from failures without interrupting service:

Redundant nodes and failover strategies
Heartbeat mechanisms
Write-ahead logging and checkpointing

g. Global Query Processing

Queries may span multiple nodes and require:

Distributed query planners
Data shipping or function shipping
Join strategies across fragments or replicas

Goal: minimize data movement, latency, and network cost.

h. Data Security

Because data spans networks and sites, distributed systems require:

Encrypted communication channels
Access controls and authentication per site
Audit trails and secure replication

i. Scalability

The system must support horizontal scaling:

Add more nodes as data volume or user load increases
Avoid bottlenecks (e.g., single points of failure)
Maintain performance predictability as it grows

Example Use Cases

System	Why Distributed Database?
Global E-commerce	Local replicas for fast regional access and fault tolerance
Banking Systems	Regional partitions for compliance, with global consistency guarantees
Social Networks	Horizontal sharding of user data; real-time replication for availability
Content Delivery	Replicated media metadata across geographic nodes

Summary

A distributed database spans multiple sites, providing resilience, performance, and geographic scalability—but only if critical features like ACID compliance, concurrency control, and replication are carefully implemented.

The key design tension lies between:

Strong consistency vs. availability and latency
Simplicity vs. transparency and fault tolerance

Understanding these trade-offs is essential for engineers building globally distributed, high-throughput systems. As data volumes and access requirements grow, distributed databases provide the architectural foundation for scalable, reliable, and intelligent data infrastructure.