Distributed Systems: A Practical Overview

What Makes a System "Distributed"?

A distributed system is a collection of independent computers that appears to users as a single coherent system. The key challenges arise from:

  1. Partial failure - Parts of the system can fail independently while the rest keeps running
  2. Unreliable networks - Messages can be lost, delayed, reordered, or duplicated
  3. No global clock - Different nodes have different views of time, so events are often ordered with logical clocks (see the sketch below)
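
Because no shared wall clock exists, distributed systems often order events with logical clocks instead. Below is a minimal sketch of a Lamport clock (the `LamportClock` name and the two-node usage are illustrative, not from the original text): each local event ticks the clock, and receiving a message advances it past the sender's timestamp, so a send is always ordered before the matching receive.

```python
class LamportClock:
    """Logical clock: orders events without a shared wall clock."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Any local event advances the clock by one.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Jump past both our own clock and the sender's timestamp.
        self.time = max(self.time, msg_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
stamp = a.tick()       # A performs a send, stamped 1
b.receive(stamp)       # B's clock jumps to 2: the receive is ordered after the send
print(a.time, b.time)  # 1 2
```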

The CAP Theorem

Eric Brewer's CAP theorem states that a distributed system can provide at most two of the following three guarantees:

  • Consistency: Every read sees the most recent write (all nodes agree on the current data)
  • Availability: Every request to a non-failed node receives a response
  • Partition tolerance: The system continues operating despite network partitions

In practice, network partitions are unavoidable, so the real choice is between CP and AP: how the system behaves while a partition is in progress.

CP Systems (Consistency + Partition Tolerance)

  • Examples: ZooKeeper, etcd, Consul
  • Sacrifice availability during partitions
  • Good for: coordination, leader election, configuration

AP Systems (Availability + Partition Tolerance)

  • Examples: Cassandra, DynamoDB, CouchDB
  • Sacrifice consistency during partitions
  • Good for: high-throughput, always-on services

Consensus Algorithms

When nodes need to agree on a value, such as who the current leader is or what the next entry in a shared log should be, they use consensus algorithms.

Paxos

  • Original consensus algorithm by Leslie Lamport
  • Notoriously difficult to understand and implement
  • Foundation for many other algorithms

Raft

  • Designed to be understandable
  • Used in etcd, Consul, CockroachDB
  • Separates leader election from log replication (a toy election is sketched below)
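
To make "understandable" concrete, here is a heavily simplified, single-process sketch of Raft-style leader election. It is a toy, not Raft itself: there is no log replication, no network, no persistence, and no election timeout, and the names (`Node`, `run_election`) are illustrative. It keeps two core Raft rules: terms increase monotonically, and each node grants at most one vote per term, so at most one leader can win any given term.

```python
import random

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.term = 0
        self.voted_for = {}  # term -> candidate this node voted for in that term

    def request_vote(self, term, candidate_id):
        # Adopt any newer term we see.
        if term > self.term:
            self.term = term
        # Grant at most one vote per term, first come first served.
        if term == self.term and self.voted_for.get(term) in (None, candidate_id):
            self.voted_for[term] = candidate_id
            return True
        return False

def run_election(nodes):
    # Real Raft uses randomized timeouts to decide who stands; we just pick one.
    candidate = random.choice(nodes)
    candidate.term += 1
    term = candidate.term
    candidate.voted_for[term] = candidate.node_id  # vote for self
    votes = 1
    for peer in nodes:
        if peer is not candidate and peer.request_vote(term, candidate.node_id):
            votes += 1
    # A strict majority wins; otherwise the term ends with no leader.
    return candidate if votes > len(nodes) // 2 else None

leader = run_election([Node(i) for i in range(5)])
print("leader:", leader.node_id if leader else None)
```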

PBFT (Practical Byzantine Fault Tolerance)

  • Tolerates Byzantine faults: nodes that behave arbitrarily or maliciously
  • Variants are used in permissioned blockchain systems
  • Higher overhead than crash-fault-tolerant algorithms (see the sizing arithmetic below)
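
Part of that overhead is structural. PBFT needs at least 3f + 1 replicas to tolerate f Byzantine nodes, while a crash-fault-tolerant protocol like Raft needs only 2f + 1 replicas for f crash failures. A quick arithmetic sketch (function names are illustrative):

```python
def byzantine_cluster_size(f):
    # PBFT: minimum replicas to tolerate f Byzantine (arbitrary) faults.
    return 3 * f + 1

def crash_cluster_size(f):
    # Raft/Paxos-style: minimum replicas to tolerate f crash faults.
    return 2 * f + 1

for f in (1, 2, 3):
    print(f"f={f}: Byzantine needs {byzantine_cluster_size(f)}, "
          f"crash-only needs {crash_cluster_size(f)}")
# f=1: 4 vs 3, f=2: 7 vs 5, f=3: 10 vs 7
```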

Replication Strategies

Single-Leader Replication

  • One node accepts writes
  • Followers replicate from leader
  • Simple, but the leader is a write bottleneck and a single point of failure until failover (write path sketched below)
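
A minimal sketch of the single-leader write path, with illustrative names (`Leader`, `Follower`). The leader appends each write to an ordered log and pushes it to followers; this version replicates synchronously, whereas real systems often replicate asynchronously and accept a window of potential data loss.

```python
class Follower:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Leader:
    def __init__(self, followers):
        self.data = {}
        self.log = []  # ordered write log: the source of truth for replication
        self.followers = followers

    def write(self, key, value):
        # Only the leader accepts writes; followers replay them in log order.
        self.log.append((key, value))
        self.data[key] = value
        for follower in self.followers:  # synchronous replication, for simplicity
            follower.apply(key, value)

followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("x", 1)
print(followers[0].data)  # {'x': 1}
```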

Multi-Leader Replication

  • Multiple nodes accept writes
  • Must handle write conflicts (one common strategy is sketched below)
  • Good for multi-datacenter deployments
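
A common, deliberately simple conflict-resolution strategy is last-writer-wins: stamp every write with a timestamp and keep the newest. The sketch below uses illustrative names; note that LWW silently discards the losing write, which is often an unacceptable trade-off.

```python
def lww_merge(local, remote):
    """Merge two replicas of {key: (timestamp, value)} by last-writer-wins."""
    merged = dict(local)
    for key, (ts, value) in remote.items():
        # Keep whichever write carries the higher timestamp.
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

# Two datacenters accepted conflicting writes to "x"; the later one wins.
dc1 = {"x": (100, "blue")}
dc2 = {"x": (105, "green")}
print(lww_merge(dc1, dc2))  # {'x': (105, 'green')} -- 'blue' is lost
```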

Leaderless Replication

  • Any node accepts writes
  • Uses quorum reads/writes (see the sketch below)
  • Examples: Dynamo-style databases
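
The quorum rule: with N replicas, a write waits for W acknowledgements and a read queries R replicas. If R + W > N, every read set overlaps every write set, so at least one queried replica holds the latest value. A sketch with N = 3, W = 2, R = 2 (the names and version scheme are illustrative):

```python
N, W, R = 3, 2, 2
assert R + W > N  # guarantees read and write sets overlap

replicas = [{} for _ in range(N)]  # each replica: key -> (version, value)

def write(key, value, version):
    # The write succeeds once W replicas acknowledge (here: the first W).
    for replica in replicas[:W]:
        replica[key] = (version, value)

def read(key):
    # Query R replicas (here: the last R) and take the highest version seen.
    seen = [r[key] for r in replicas[-R:] if key in r]
    return max(seen)[1] if seen else None

write("x", "v1", version=1)
# The write hit replicas {0, 1}; the read queries {1, 2}. They overlap at
# replica 1, so the read returns the latest value despite replica 2 being stale.
print(read("x"))  # 'v1'
```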

Consistency Models

From strongest to weakest:

  1. Linearizability - Each operation appears to take effect atomically at a single point between its start and completion
  2. Sequential consistency - All nodes observe operations in the same order, consistent with each node's program order, though not necessarily real time
  3. Causal consistency - Causally related operations appear in order; concurrent operations may be observed in different orders
  4. Eventual consistency - If writes stop, all replicas eventually converge to the same value (convergence sketched below)
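
Eventual consistency is easiest to see with a data structure designed to converge. A state-based G-counter (a grow-only counter CRDT) keeps one slot per node, and merging takes the elementwise maximum; because that merge is commutative, associative, and idempotent, replicas that exchange state in any order end up identical. A sketch with illustrative names:

```python
def merge(a, b):
    # Elementwise max is commutative, associative, and idempotent,
    # so replicas converge no matter how merges are ordered or repeated.
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def value(counter):
    return sum(counter.values())

# Each replica increments only its own slot while partitioned...
replica_a = {"a": 3}  # node 'a' incremented 3 times
replica_b = {"b": 2}  # node 'b' incremented 2 times

# ...then they exchange state and converge to the same total.
replica_a = merge(replica_a, replica_b)
replica_b = merge(replica_b, replica_a)
print(value(replica_a), value(replica_b))  # 5 5
```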

Partitioning (Sharding)

Distributing data across nodes:

Hash Partitioning

  • Hash the key to pick a partition (sketched below)
  • Spreads keys evenly across partitions
  • Range queries are inefficient, because adjacent keys scatter across partitions
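
The core of hash partitioning is one line: hash the key and reduce it modulo the partition count. The sketch uses `zlib.crc32` because Python's built-in `hash` is randomized per process; the key names are illustrative. Note that naive modulo remaps almost every key when the partition count changes, which is why real systems reach for consistent hashing or a fixed number of partitions.

```python
import zlib

NUM_PARTITIONS = 8

def partition_for(key):
    # Stable hash of the key, reduced modulo the partition count.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

for key in ("user:1", "user:2", "user:3"):
    print(key, "->", partition_for(key))

# Adjacent keys scatter, so a range scan over user:1..user:3
# has to query (almost) every partition.
```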

Range Partitioning

  • Contiguous key ranges assigned to different nodes
  • Good for range queries, since adjacent keys live together (lookup sketched below)
  • Risk of hot spots when one range receives most of the traffic
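
Routing a key under range partitioning is a binary search over the sorted split points. A sketch using the standard `bisect` module (the boundaries are illustrative):

```python
import bisect

# Sorted upper bounds: partition 0 holds keys < "g", partition 1 keys < "n",
# partition 2 keys < "t", and partition 3 everything else.
boundaries = ["g", "n", "t"]

def partition_for(key):
    # Index of the first boundary greater than the key.
    return bisect.bisect_right(boundaries, key)

for key in ("apple", "banana", "mango", "zebra"):
    print(key, "->", partition_for(key))
# apple -> 0, banana -> 0, mango -> 1, zebra -> 3: adjacent keys cluster,
# which makes range scans cheap but can hot-spot a popular range.
```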

Conclusion

Building distributed systems requires understanding these fundamental concepts. Start simple, add complexity only when needed, and always plan for failure.