A3.4.4 Describe the features of distributed databases.
• The need to maintain data consistency in a distributed database
• The role of ACID to ensure reliable processing of transactions in distributed databases
• Features of distributed databases: concurrency control, data consistency, data partitioning, data security, distribution transparency, fault tolerance, global query processing, location transparency, replication, scalability
The Big Idea
A distributed database is a collection of logically related databases that are physically distributed across multiple locations, connected via a network. Despite being stored in different sites, the system presents itself as a single unified database to the user or application.
Distributed databases are essential for modern systems that require high availability, geographic distribution, and scalability. However, they introduce new challenges, particularly in maintaining consistency, ensuring transaction reliability, and managing data distribution transparently.
Why Use a Distributed Database?
- Geographic availability: keep data close to users in multiple regions
- Fault tolerance: system continues to work even if one node fails
- Load distribution: share the processing burden across multiple machines
- Scalability: add more nodes to handle more data and users
- Resilience: local failures don't compromise the entire system
1. ACID Transactions in Distributed Systems
To ensure reliable processing of transactions, distributed databases aim to preserve ACID properties:
| ACID Property | Definition in Distributed Context |
|---|---|
| Atomicity | A transaction must either complete on all nodes or roll back completely. |
| Consistency | All nodes must remain in a valid state before and after the transaction. |
| Isolation | Concurrent transactions should not interfere with each other. |
| Durability | Once committed, changes are permanent, even if a node crashes. |
Distributed atomicity is typically implemented via two-phase commit (2PC) or three-phase commit protocols, where coordinators and participants negotiate commit decisions across nodes.
2. Features of Distributed Databases
a. Data Consistency
Ensures that all replicas and partitions reflect the same state after updates. Techniques include:
- Synchronous replication (strong consistency)
- Eventual consistency (common in NoSQL systems)
- Quorum-based protocols (used in systems like Cassandra)
b. Concurrency Control
Prevents conflicts when multiple users access or modify the same data simultaneously. Techniques:
- Distributed locking protocols
- Timestamp ordering
- Optimistic concurrency control
c. Data Partitioning (Sharding)
Splits large datasets across multiple nodes:
- Horizontal partitioning: rows are distributed (e.g., by region or customer ID)
- Vertical partitioning: columns are distributed across nodes
This improves performance and makes scaling possible.
d. Replication
Keeps copies of data on multiple nodes for:
- Fault tolerance
- Load balancing
- Improved read latency
Replication types:
- Master-slave (single-writer)
- Multi-master (concurrent writes with conflict resolution)
e. Distribution Transparency
Users interact with the system as if it were a single database, unaware of:
- Data location
- Data fragmentation
- Replication strategies
Subtypes include:
- Location transparency: users don’t need to know where data is stored
- Replication transparency: users don’t manage copies
- Failure transparency: users don’t see node failures
f. Fault Tolerance
The ability to detect, isolate, and recover from failures without interrupting service:
- Redundant nodes and failover strategies
- Heartbeat mechanisms
- Write-ahead logging and checkpointing
g. Global Query Processing
Queries may span multiple nodes and require:
- Distributed query planners
- Data shipping or function shipping
- Join strategies across fragments or replicas
Goal: minimize data movement, latency, and network cost.
h. Data Security
Because data spans networks and sites, distributed systems require:
- Encrypted communication channels
- Access controls and authentication per site
- Audit trails and secure replication
i. Scalability
The system must support horizontal scaling:
- Add more nodes as data volume or user load increases
- Avoid bottlenecks (e.g., single points of failure)
- Maintain performance predictability as it grows
Example Use Cases
| System | Why Distributed Database? |
|---|---|
| Global E-commerce | Local replicas for fast regional access and fault tolerance |
| Banking Systems | Regional partitions for compliance, with global consistency guarantees |
| Social Networks | Horizontal sharding of user data; real-time replication for availability |
| Content Delivery | Replicated media metadata across geographic nodes |
Summary
A distributed database spans multiple sites, providing resilience, performance, and geographic scalability—but only if critical features like ACID compliance, concurrency control, and replication are carefully implemented.
The key design tension lies between:
- Strong consistency vs. availability and latency
- Simplicity vs. transparency and fault tolerance
Understanding these trade-offs is essential for engineers building globally distributed, high-throughput systems. As data volumes and access requirements grow, distributed databases provide the architectural foundation for scalable, reliable, and intelligent data infrastructure.