Distributed Systems: A Practical Overview
What Makes a System "Distributed"?
A distributed system is a collection of independent computers that appears to users as a single coherent system. The key challenges arise from:
- Partial failure - Parts of the system can fail independently
- Unreliable networks - Messages can be lost, delayed, or duplicated
- No global clock - Different nodes have different views of time; logical clocks (sketched below) are the usual workaround
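Because wall clocks on different machines cannot be trusted to agree, events are often ordered with logical clocks instead. Below is a minimal sketch of a Lamport clock in Go; the type and method names are illustrative, not any particular library's API.

```go
package main

import "fmt"

// LamportClock orders events without relying on synchronized wall clocks.
type LamportClock struct {
	time uint64
}

// Tick advances the clock for a local event and returns the new timestamp.
func (c *LamportClock) Tick() uint64 {
	c.time++
	return c.time
}

// Observe merges a timestamp received in a message: the local clock jumps
// past the sender's time, preserving causal order.
func (c *LamportClock) Observe(remote uint64) uint64 {
	if remote > c.time {
		c.time = remote
	}
	c.time++
	return c.time
}

func main() {
	a, b := &LamportClock{}, &LamportClock{}
	send := a.Tick()        // A does local work, then sends a message
	recv := b.Observe(send) // B's clock jumps past A's timestamp
	fmt.Println(send, recv) // 1 2 — B's receive is ordered after A's send
}
```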
The CAP Theorem
Eric Brewer's CAP theorem (later formalized by Gilbert and Lynch) states that a distributed system can provide at most two of the following three guarantees:
- Consistency: All nodes see the same data at the same time
- Availability: Every request receives a (non-error) response, though not necessarily one reflecting the most recent write
- Partition tolerance: System continues operating despite network partitions
In practice, network partitions will happen, so you're really choosing between CP and AP behavior during a partition; the sketch after the two lists below makes the contrast concrete.
CP Systems (Consistency + Partition Tolerance)
- Examples: ZooKeeper, etcd, Consul
- Sacrifice availability during partitions
- Good for: coordination, leader election, configuration
AP Systems (Availability + Partition Tolerance)
- Examples: Cassandra, DynamoDB, CouchDB
- Sacrifice consistency during partitions
- Good for: high-throughput, always-on services
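To see the trade-off in code, consider a read issued on the minority side of a partition. The sketch below is a toy model, not any real database's client API: the CP path refuses to answer without a majority, while the AP path answers from whatever replica it can reach.

```go
package main

import (
	"errors"
	"fmt"
)

// replicaSet models a 3-node cluster; reachable counts the nodes on our
// side of a network partition. Purely illustrative, not a real client.
type replicaSet struct {
	total     int
	reachable int
	value     string // last value visible on the reachable side
}

// cpRead behaves like a CP system: it fails unless a majority is reachable.
func (r replicaSet) cpRead() (string, error) {
	if r.reachable <= r.total/2 {
		return "", errors.New("no quorum: refusing possibly-stale read")
	}
	return r.value, nil
}

// apRead behaves like an AP system: it always answers, possibly stale.
func (r replicaSet) apRead() string {
	return r.value
}

func main() {
	// Minority side of a partition: 1 of 3 replicas reachable.
	minority := replicaSet{total: 3, reachable: 1, value: "stale-v1"}

	if _, err := minority.cpRead(); err != nil {
		fmt.Println("CP:", err) // sacrifices availability
	}
	fmt.Println("AP:", minority.apRead()) // sacrifices consistency
}
```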
Consensus Algorithms
When nodes need to agree on a single value or decision, such as which node is the leader, they use consensus algorithms.
Paxos
- Original consensus algorithm by Leslie Lamport
- Notoriously difficult to understand and implement
- Foundation for many other algorithms
Raft
- Designed to be understandable
- Used in etcd, Consul, CockroachDB
- Separates leader election from log replication (election sketched below)
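Raft's election mechanics are easiest to see through its randomized timeout. The following is a heavily simplified sketch under assumed conditions: it ignores terms and log-freshness checks, and the vote-granting RPC is a stand-in.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout returns a randomized wait (e.g., 150-300 ms in the Raft
// paper); randomization makes it unlikely that two followers become
// candidates at the same moment and split the vote.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

// runElection counts votes from reachable peers; a candidate wins with a
// strict majority of the full cluster, so a minority partition can never
// elect a leader.
func runElection(clusterSize, reachablePeers int) bool {
	votes := 1 // the candidate votes for itself
	for i := 0; i < reachablePeers; i++ {
		votes++ // stand-in for a successful RequestVote RPC
	}
	return votes > clusterSize/2
}

func main() {
	fmt.Println("timeout:", electionTimeout())
	fmt.Println("3 of 5 reachable, wins:", runElection(5, 2)) // true: 3 > 2
	fmt.Println("2 of 5 reachable, wins:", runElection(5, 1)) // false: 2 <= 2
}
```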
PBFT (Practical Byzantine Fault Tolerance)
- Handles malicious nodes
- Variants are used in some permissioned blockchain systems
- Higher overhead than crash-fault-tolerant algorithms (see the sizing arithmetic below)
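That overhead shows up directly in cluster sizing: tolerating f crash faults requires 2f+1 nodes, while tolerating f Byzantine faults requires 3f+1. A quick arithmetic sketch:

```go
package main

import "fmt"

// Minimum cluster sizes for tolerating f faulty nodes: crash-fault-
// tolerant protocols (Paxos, Raft) only need a majority of honest
// responses, while PBFT needs enough honest nodes to outvote f
// equivocating nodes.
func main() {
	for f := 1; f <= 3; f++ {
		fmt.Printf("f=%d  crash-tolerant: %d nodes  byzantine: %d nodes\n",
			f, 2*f+1, 3*f+1)
	}
}
```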
Replication Strategies
Single-Leader Replication
- One node accepts writes
- Followers replicate from leader
- Simple, but the leader is a write bottleneck and a single point of failure until failover (sketched below)
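A toy sketch of the data flow, assuming the leader ships each log entry to followers synchronously (real systems typically replicate asynchronously or semi-synchronously):

```go
package main

import "fmt"

// Single-leader replication: writes go through one leader, which appends
// them to an ordered log; followers replay the log in the same order, so
// all replicas converge on the same state.
type node struct {
	name string
	log  []string
}

func main() {
	leader := &node{name: "leader"}
	followers := []*node{{name: "f1"}, {name: "f2"}}

	// Only the leader accepts writes; this single ordering point is what
	// keeps replication simple, and also what makes the leader a bottleneck.
	for _, write := range []string{"x=1", "y=2", "x=3"} {
		leader.log = append(leader.log, write)
		for _, f := range followers {
			f.log = append(f.log, write) // ship the log entry downstream
		}
	}

	fmt.Println(leader.log, followers[0].log, followers[1].log)
}
```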
Multi-Leader Replication
- Multiple nodes accept writes
- Must handle write conflicts; see the last-writer-wins sketch below
- Good for multi-datacenter deployments
Leaderless Replication
- Any node accepts writes
- Uses quorum reads and writes (sketched below)
- Examples: Dynamo-style databases
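The usual rule is to pick a read quorum R and write quorum W over N replicas with R + W > N, so every read set overlaps every write set. A minimal sketch, with illustrative version numbers:

```go
package main

import "fmt"

// Dynamo-style quorum reads: with N replicas, writes go to W of them and
// reads consult R; choosing R + W > N guarantees every read set overlaps
// every write set, so a read sees at least one up-to-date replica.
type version struct {
	value string
	seq   int // per-key version number carried with each write
}

// quorumRead returns the highest-versioned value among the replicas the
// read managed to reach (the read set).
func quorumRead(readSet []version) version {
	newest := readSet[0]
	for _, v := range readSet[1:] {
		if v.seq > newest.seq {
			newest = v
		}
	}
	return newest
}

func main() {
	// N = 3, W = 2: the last write (seq 2) landed on replicas 0 and 1.
	replicas := []version{{"v2", 2}, {"v2", 2}, {"v1", 1}}

	// R = 2: any two replicas must include one with seq 2, since 2+2 > 3.
	fmt.Println(quorumRead(replicas[1:])) // read set {replica 1, replica 2}
}
```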
Consistency Models
From strongest to weakest:
- Linearizability - Operations appear to take effect instantaneously, at some point between request and response
- Sequential consistency - Operations appear in a single sequential order consistent with each node's program order
- Causal consistency - Causally related operations appear in order; concurrent operations may be seen in different orders
- Eventual consistency - If no new writes arrive, all replicas eventually converge to the same value (a toy illustration follows)
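To make the weakest model concrete, here is a toy illustration with replicas modeled as plain maps and the replication step done by hand:

```go
package main

import "fmt"

// Eventual consistency in miniature: a write lands on one replica, a read
// from the other replica is stale until a sync propagates the update.
// Under linearizability, the stale read would not be allowed once the
// write had completed.
func main() {
	replicaA := map[string]string{"color": "blue"}
	replicaB := map[string]string{"color": "blue"}

	replicaA["color"] = "green" // write acknowledged by replica A only

	fmt.Println(replicaB["color"]) // "blue": stale but permitted

	// Anti-entropy / replication eventually copies the update over.
	replicaB["color"] = replicaA["color"]

	fmt.Println(replicaB["color"]) // "green": replicas have converged
}
```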
Partitioning (Sharding)
Distributing data across nodes:
Hash Partitioning
- Hash key to determine partition
- Even distribution, given a good hash function
- Range queries are inefficient because adjacent keys scatter across partitions (sketched below)
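A minimal sketch using FNV hashing and modulo placement; note that plain modulo remaps almost every key when the partition count changes, which is why production systems usually layer consistent hashing on top:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partition hashes the key and takes the result modulo the partition
// count, spreading keys evenly across partitions.
func partition(key string, partitions uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % partitions
}

func main() {
	for _, k := range []string{"user:1", "user:2", "user:3"} {
		fmt.Printf("%s -> partition %d\n", k, partition(k, 4))
	}
	// Adjacent keys scatter across partitions, which is exactly why
	// range queries must fan out to every partition.
}
```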
Range Partitioning
- Ranges of keys on different nodes
- Good for range queries
- Risk of hot spots, e.g., when keys are timestamps and all writes land in the newest range (sketched below)
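A sketch using sorted split points; the split points here are illustrative values, and real systems split and rebalance ranges as data grows:

```go
package main

import (
	"fmt"
	"sort"
)

// Range partitioning: sorted split points define which partition owns
// each key range, so a range scan touches only adjacent partitions.
var splitPoints = []string{"g", "n", "t"} // partition 0: < "g", 1: < "n", ...

func partitionFor(key string) int {
	// sort.SearchStrings returns the first split point >= key, which is
	// exactly the index of the owning partition.
	return sort.SearchStrings(splitPoints, key)
}

func main() {
	for _, k := range []string{"apple", "melon", "zebra"} {
		fmt.Printf("%s -> partition %d\n", k, partitionFor(k))
	}
	// Keys "m", "me", "mel" all land in the same partition, so scanning
	// the range ["m", "n") is a single-partition operation.
}
```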
Conclusion
Building distributed systems requires understanding these fundamental concepts. Start simple, add complexity only when needed, and always plan for failure.