Distributed Systems: A Practical Overview

What Makes a System "Distributed"?

A distributed system is a collection of independent computers that appears to users as a single coherent system. The key challenges arise from:

  1. Partial failure - Parts of the system can fail independently while the rest keeps running
  2. Unreliable networks - Messages can be lost, delayed, reordered, or duplicated
  3. No global clock - Different nodes have different views of time, so events are often ordered with logical clocks (see the sketch below)
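
Because no shared wall clock exists, distributed systems often order events with logical clocks instead. Below is a minimal sketch of a Lamport clock (the `LamportClock` name and the two-node usage are illustrative, not from the original text): each local event ticks the clock, and receiving a message advances it past the sender's timestamp, so a send is always ordered before the matching receive.

```python
class LamportClock:
    """Logical clock: orders events without a shared wall clock."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Any local event advances the clock by one.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Jump past both our own clock and the sender's timestamp.
        self.time = max(self.time, msg_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
stamp = a.tick()       # A performs a send, stamped 1
b.receive(stamp)       # B's clock jumps to 2: the receive is ordered after the send
print(a.time, b.time)  # 1 2
```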

The CAP Theorem

Eric Brewer's CAP theorem states that a distributed system can provide at most two of the following three guarantees:

  • Consistency: Every read sees the most recent write (all nodes agree on the current data)
  • Availability: Every request to a non-failed node receives a response
  • Partition tolerance: The system continues operating despite network partitions

In practice, network partitions are unavoidable, so the real choice is between CP and AP: how the system behaves while a partition is in progress.

CP Systems (Consistency + Partition Tolerance)

  • Examples: ZooKeeper, etcd, Consul
  • Sacrifice availability during partitions
  • Good for: coordination, leader election, configuration

AP Systems (Availability + Partition Tolerance)

  • Examples: Cassandra, DynamoDB, CouchDB
  • Sacrifice consistency during partitions
  • Good for: high-throughput, always-on services

Consensus Algorithms

When nodes need to agree on a value, such as who the current leader is or what the next entry in a shared log should be, they use consensus algorithms.

Paxos

  • Original consensus algorithm by Leslie Lamport
  • Notoriously difficult to understand and implement
  • Foundation for many other algorithms

Raft

  • Designed to be understandable
  • Used in etcd, Consul, CockroachDB
  • Separates leader election from log replication (a toy election is sketched below)
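
To make "understandable" concrete, here is a heavily simplified, single-process sketch of Raft-style leader election. It is a toy, not Raft itself: there is no log replication, no network, no persistence, and no election timeout, and the names (`Node`, `run_election`) are illustrative. It keeps two core Raft rules: terms increase monotonically, and each node grants at most one vote per term, so at most one leader can win any given term.

```python
import random

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.term = 0
        self.voted_for = {}  # term -> candidate this node voted for in that term

    def request_vote(self, term, candidate_id):
        # Adopt any newer term we see.
        if term > self.term:
            self.term = term
        # Grant at most one vote per term, first come first served.
        if term == self.term and self.voted_for.get(term) in (None, candidate_id):
            self.voted_for[term] = candidate_id
            return True
        return False

def run_election(nodes):
    # Real Raft uses randomized timeouts to decide who stands; we just pick one.
    candidate = random.choice(nodes)
    candidate.term += 1
    term = candidate.term
    candidate.voted_for[term] = candidate.node_id  # vote for self
    votes = 1
    for peer in nodes:
        if peer is not candidate and peer.request_vote(term, candidate.node_id):
            votes += 1
    # A strict majority wins; otherwise the term ends with no leader.
    return candidate if votes > len(nodes) // 2 else None

leader = run_election([Node(i) for i in range(5)])
print("leader:", leader.node_id if leader else None)
```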

PBFT (Practical Byzantine Fault Tolerance)

  • Tolerates Byzantine faults: nodes that behave arbitrarily or maliciously
  • Variants are used in permissioned blockchain systems
  • Higher overhead than crash-fault-tolerant algorithms (see the sizing arithmetic below)
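
Part of that overhead is structural. PBFT needs at least 3f + 1 replicas to tolerate f Byzantine nodes, while a crash-fault-tolerant protocol like Raft needs only 2f + 1 replicas for f crash failures. A quick arithmetic sketch (function names are illustrative):

```python
def byzantine_cluster_size(f):
    # PBFT: minimum replicas to tolerate f Byzantine (arbitrary) faults.
    return 3 * f + 1

def crash_cluster_size(f):
    # Raft/Paxos-style: minimum replicas to tolerate f crash faults.
    return 2 * f + 1

for f in (1, 2, 3):
    print(f"f={f}: Byzantine needs {byzantine_cluster_size(f)}, "
          f"crash-only needs {crash_cluster_size(f)}")
# f=1: 4 vs 3, f=2: 7 vs 5, f=3: 10 vs 7
```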

Replication Strategies

Single-Leader Replication

  • One node accepts writes
  • Followers replicate from leader
  • Simple, but the leader is a write bottleneck and a single point of failure until failover (write path sketched below)
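
A minimal sketch of the single-leader write path, with illustrative names (`Leader`, `Follower`). The leader appends each write to an ordered log and pushes it to followers; this version replicates synchronously, whereas real systems often replicate asynchronously and accept a window of potential data loss.

```python
class Follower:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Leader:
    def __init__(self, followers):
        self.data = {}
        self.log = []  # ordered write log: the source of truth for replication
        self.followers = followers

    def write(self, key, value):
        # Only the leader accepts writes; followers replay them in log order.
        self.log.append((key, value))
        self.data[key] = value
        for follower in self.followers:  # synchronous replication, for simplicity
            follower.apply(key, value)

followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("x", 1)
print(followers[0].data)  # {'x': 1}
```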

Multi-Leader Replication

  • Multiple nodes accept writes
  • Must handle write conflicts (one common strategy is sketched below)
  • Good for multi-datacenter deployments
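
A common, deliberately simple conflict-resolution strategy is last-writer-wins: stamp every write with a timestamp and keep the newest. The sketch below uses illustrative names; note that LWW silently discards the losing write, which is often an unacceptable trade-off.

```python
def lww_merge(local, remote):
    """Merge two replicas of {key: (timestamp, value)} by last-writer-wins."""
    merged = dict(local)
    for key, (ts, value) in remote.items():
        # Keep whichever write carries the higher timestamp.
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

# Two datacenters accepted conflicting writes to "x"; the later one wins.
dc1 = {"x": (100, "blue")}
dc2 = {"x": (105, "green")}
print(lww_merge(dc1, dc2))  # {'x': (105, 'green')} -- 'blue' is lost
```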

Leaderless Replication

  • Any node accepts writes
  • Uses quorum reads/writes (see the sketch below)
  • Examples: Dynamo-style databases
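
The quorum rule: with N replicas, a write waits for W acknowledgements and a read queries R replicas. If R + W > N, every read set overlaps every write set, so at least one queried replica holds the latest value. A sketch with N = 3, W = 2, R = 2 (the names and version scheme are illustrative):

```python
N, W, R = 3, 2, 2
assert R + W > N  # guarantees read and write sets overlap

replicas = [{} for _ in range(N)]  # each replica: key -> (version, value)

def write(key, value, version):
    # The write succeeds once W replicas acknowledge (here: the first W).
    for replica in replicas[:W]:
        replica[key] = (version, value)

def read(key):
    # Query R replicas (here: the last R) and take the highest version seen.
    seen = [r[key] for r in replicas[-R:] if key in r]
    return max(seen)[1] if seen else None

write("x", "v1", version=1)
# The write hit replicas {0, 1}; the read queries {1, 2}. They overlap at
# replica 1, so the read returns the latest value despite replica 2 being stale.
print(read("x"))  # 'v1'
```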

Consistency Models

From strongest to weakest:

  1. Linearizability - Each operation appears to take effect atomically at a single point between its start and completion
  2. Sequential consistency - All nodes observe operations in the same order, consistent with each node's program order, though not necessarily real time
  3. Causal consistency - Causally related operations appear in order; concurrent operations may be observed in different orders
  4. Eventual consistency - If writes stop, all replicas eventually converge to the same value (convergence sketched below)
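
Eventual consistency is easiest to see with a data structure designed to converge. A state-based G-counter (a grow-only counter CRDT) keeps one slot per node, and merging takes the elementwise maximum; because that merge is commutative, associative, and idempotent, replicas that exchange state in any order end up identical. A sketch with illustrative names:

```python
def merge(a, b):
    # Elementwise max is commutative, associative, and idempotent,
    # so replicas converge no matter how merges are ordered or repeated.
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def value(counter):
    return sum(counter.values())

# Each replica increments only its own slot while partitioned...
replica_a = {"a": 3}  # node 'a' incremented 3 times
replica_b = {"b": 2}  # node 'b' incremented 2 times

# ...then they exchange state and converge to the same total.
replica_a = merge(replica_a, replica_b)
replica_b = merge(replica_b, replica_a)
print(value(replica_a), value(replica_b))  # 5 5
```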

Partitioning (Sharding)

Distributing data across nodes:

Hash Partitioning

  • Hash the key to pick a partition (sketched below)
  • Spreads keys evenly across partitions
  • Range queries are inefficient, because adjacent keys scatter across partitions
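
The core of hash partitioning is one line: hash the key and reduce it modulo the partition count. The sketch uses `zlib.crc32` because Python's built-in `hash` is randomized per process; the key names are illustrative. Note that naive modulo remaps almost every key when the partition count changes, which is why real systems reach for consistent hashing or a fixed number of partitions.

```python
import zlib

NUM_PARTITIONS = 8

def partition_for(key):
    # Stable hash of the key, reduced modulo the partition count.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

for key in ("user:1", "user:2", "user:3"):
    print(key, "->", partition_for(key))

# Adjacent keys scatter, so a range scan over user:1..user:3
# has to query (almost) every partition.
```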

Range Partitioning

  • Contiguous key ranges assigned to different nodes
  • Good for range queries, since adjacent keys live together (lookup sketched below)
  • Risk of hot spots when one range receives most of the traffic
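
Routing a key under range partitioning is a binary search over the sorted split points. A sketch using the standard `bisect` module (the boundaries are illustrative):

```python
import bisect

# Sorted upper bounds: partition 0 holds keys < "g", partition 1 keys < "n",
# partition 2 keys < "t", and partition 3 everything else.
boundaries = ["g", "n", "t"]

def partition_for(key):
    # Index of the first boundary greater than the key.
    return bisect.bisect_right(boundaries, key)

for key in ("apple", "banana", "mango", "zebra"):
    print(key, "->", partition_for(key))
# apple -> 0, banana -> 0, mango -> 1, zebra -> 3: adjacent keys cluster,
# which makes range scans cheap but can hot-spot a popular range.
```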

Conclusion

Building distributed systems requires understanding these fundamental concepts. Start simple, add complexity only when needed, and always plan for failure.