# Distributed Systems: A Practical Overview
## What Makes a System "Distributed"?
A distributed system is a collection of independent computers that appears to its users as a single coherent system. The key challenges arise from:
1. **Partial failure** - Parts of the system can fail independently
2. **Unreliable networks** - Messages can be lost, delayed, duplicated, or reordered (see the retry sketch after this list)
3. **No global clock** - Different nodes have different views of time
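To make the network point concrete, here is a minimal sketch of a client retrying a request with a timeout. The `send_request` function and the idempotency key are hypothetical stand-ins for whatever transport and deduplication mechanism a real system uses; the point is only that a timeout tells you nothing about whether the request was applied.

```python
import time
import uuid

def send_request(payload, idempotency_key, timeout):
    """Hypothetical transport call: may raise TimeoutError or return a response."""
    raise TimeoutError  # placeholder; a real client would do network I/O here

def call_with_retries(payload, max_attempts=3, timeout=1.0):
    # Reuse one idempotency key across attempts so a duplicated delivery
    # (e.g., the response was lost but the server already applied the write)
    # is not applied twice on the server side.
    key = str(uuid.uuid4())
    for attempt in range(max_attempts):
        try:
            return send_request(payload, idempotency_key=key, timeout=timeout)
        except TimeoutError:
            # A timeout does NOT mean the request failed: it may have been
            # lost, delayed, or already applied. Back off and retry.
            time.sleep(0.1 * 2 ** attempt)
    raise RuntimeError("request may or may not have been applied")
```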
## The CAP Theorem
Eric Brewer's CAP theorem states that a distributed data store cannot simultaneously provide all three of the following guarantees:
- **Consistency**: Every read sees the most recent write (all nodes appear to hold the same data at the same time)
- **Availability**: Every request receives a (non-error) response, even if it does not reflect the most recent write
- **Partition tolerance**: The system continues operating despite network partitions
In practice, network partitions do happen, so you are really choosing how the system behaves during a partition: CP or AP.
### CP Systems (Consistency + Partition Tolerance)
- Examples: ZooKeeper, etcd, Consul
- Sacrifice availability during partitions
- Good for: coordination, leader election, configuration
### AP Systems (Availability + Partition Tolerance)
- Examples: Cassandra, DynamoDB, CouchDB
- Sacrifice consistency during partitions
- Good for: high-throughput, always-on services
## Consensus Algorithms
When nodes need to agree on a single value or ordering (a leader, a committed log entry), they use consensus algorithms.
### Paxos
- The classic consensus algorithm, introduced by Leslie Lamport
- Notoriously difficult to understand and implement correctly
- Foundation for many later algorithms
### Raft
- Designed to be understandable
- Used in etcd, Consul, and CockroachDB
- Separates leader election from log replication (a simplified election sketch follows)
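Raft's leader election hinges on randomized timeouts: a follower that hears nothing from the leader for a random interval becomes a candidate and requests votes. Below is a much-simplified, single-process sketch of that idea; it ignores log comparison, persistence, heartbeats, and RPC, so treat it as an illustration of the voting rule rather than an implementation of Raft.

```python
import random

class RaftNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers          # other RaftNode objects
        self.current_term = 0
        self.voted_for = None
        self.state = "follower"

    def election_timeout(self):
        # Randomized timeouts (e.g., 150-300 ms) make split votes unlikely:
        # usually one node times out first and wins before the others start.
        return random.uniform(0.150, 0.300)

    def on_election_timeout(self):
        # Follower heard no heartbeat: become candidate and ask for votes.
        self.state = "candidate"
        self.current_term += 1
        self.voted_for = self.node_id
        votes = 1  # vote for self
        for peer in self.peers:
            if peer.request_vote(self.current_term, self.node_id):
                votes += 1
        if votes > (len(self.peers) + 1) // 2:   # strict majority of the cluster
            self.state = "leader"

    def request_vote(self, term, candidate_id):
        # Grant at most one vote per term (real Raft also checks log freshness).
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
            self.state = "follower"
        if term == self.current_term and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False

# Three-node cluster; node 0 times out first and should win the election.
nodes = [RaftNode(i, []) for i in range(3)]
for n in nodes:
    n.peers = [p for p in nodes if p is not n]
nodes[0].on_election_timeout()
assert nodes[0].state == "leader"
```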
### PBFT (Practical Byzantine Fault Tolerance)
- Tolerates malicious (Byzantine) nodes, not just crashed ones
- Variants are used in permissioned blockchain systems
- Higher message overhead than crash-fault-tolerant algorithms such as Raft
## Replication Strategies
### Single-Leader Replication
- One node (the leader) accepts writes
- Followers replicate the leader's log
- Simple, but the leader is a write bottleneck and a single point of failure until failover (see the sketch below)
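A minimal in-memory sketch of the single-leader flow, with synchronous replication for clarity (real systems usually replicate asynchronously or semi-synchronously). The class and method names are illustrative, not from any particular database.

```python
class Follower:
    def __init__(self):
        self.log = []
        self.data = {}

    def replicate(self, entry):
        # Apply entries in exactly the order the leader assigned them.
        self.log.append(entry)
        key, value = entry
        self.data[key] = value

class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.log = []
        self.data = {}

    def write(self, key, value):
        # Only the leader accepts writes, so it alone decides the order.
        entry = (key, value)
        self.log.append(entry)
        self.data[key] = value
        for f in self.followers:        # synchronous fan-out for simplicity
            f.replicate(entry)

followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("user:1", "alice")
assert all(f.data["user:1"] == "alice" for f in followers)
```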
### Multi-Leader Replication
- Multiple nodes accept writes
- Must detect and resolve write conflicts (see the last-write-wins sketch below)
- Good for multi-datacenter deployments
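One common (if lossy) way to resolve concurrent writes from different leaders is last-write-wins, based on a timestamp attached to each write. A minimal sketch, assuming each version carries a `(timestamp, node_id, value)` triple with the node id as a deterministic tiebreaker:

```python
def resolve_lww(versions):
    """Pick the winner among conflicting versions of one key.

    The highest timestamp wins, with node_id as a tiebreaker. Note that LWW
    silently discards the losing writes, which is why many systems prefer
    explicit merge logic or application-level conflict handlers.
    """
    return max(versions, key=lambda v: (v[0], v[1]))

# Two datacenters accepted concurrent writes to the same key:
conflict = [
    (1700000010.2, "dc-east", "alice@example.com"),
    (1700000010.5, "dc-west", "alice@example.org"),
]
winner = resolve_lww(conflict)
assert winner[2] == "alice@example.org"
```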
### Leaderless Replication
- Any replica accepts writes
- Uses quorum reads and writes (see the sketch below)
- Examples: Dynamo-style databases
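Dynamo-style stores rely on overlapping quorums: with n replicas, a write must reach w of them and a read must consult r of them, chosen so that w + r > n, so every read set intersects every write set. A minimal in-memory sketch with version numbers; the class names and the single-client versioning are illustrative simplifications (real systems use vector clocks or similar).

```python
class Replica:
    def __init__(self):
        self.store = {}                 # key -> (version, value)

    def put(self, key, version, value):
        # Keep only the newest version seen so far.
        current = self.store.get(key, (0, None))
        if version > current[0]:
            self.store[key] = (version, value)

    def get(self, key):
        return self.store.get(key, (0, None))

class QuorumClient:
    def __init__(self, replicas, w, r):
        assert w + r > len(replicas), "quorums must overlap"
        self.replicas, self.w, self.r = replicas, w, r
        self.version = 0

    def write(self, key, value):
        self.version += 1
        acked = 0
        for rep in self.replicas:
            rep.put(key, self.version, value)
            acked += 1
            if acked >= self.w:
                break                   # remaining replicas catch up later (anti-entropy)

    def read(self, key):
        # Ask r replicas and keep the value with the highest version.
        results = [rep.get(key) for rep in self.replicas[: self.r]]
        return max(results)[1]

replicas = [Replica() for _ in range(3)]
client = QuorumClient(replicas, w=2, r=2)
client.write("cart", ["book"])
assert client.read("cart") == ["book"]
```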
## Consistency Models
From strongest to weakest:
1. **Linearizability** - Operations appear to take effect instantaneously, in a single global order
2. **Sequential consistency** - Operations appear in some single sequential order consistent with each node's program order
3. **Causal consistency** - Causally related operations are seen in the same order by all nodes
4. **Eventual consistency** - Given enough time without new writes, all replicas converge to the same value
## Partitioning (Sharding)
Partitioning distributes data across nodes so that no single node has to store or serve the whole dataset. Two common schemes:
### Hash Partitioning
- Hash the key to determine its partition (see the sketch below)
- Distributes keys evenly
- Range queries are inefficient because adjacent keys are scattered across partitions
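A minimal sketch of hash partitioning using a stable hash (MD5 here, so every client computes the same partition). Note that the naive `mod N` scheme shown forces most keys to move when the partition count changes, which is why many systems use consistent hashing or a fixed number of logical partitions instead; the function name is just illustrative.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    # Use a stable hash rather than Python's built-in hash(), which is
    # randomized per process and would route clients inconsistently.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# Keys spread roughly evenly, but adjacent keys land on unrelated partitions,
# so a range scan like "user:100".."user:200" must query every partition.
for key in ("user:100", "user:101", "user:102"):
    print(key, "->", partition_for(key, num_partitions=4))
```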
### Range Partitioning
- Contiguous ranges of keys are assigned to different nodes (see the sketch below)
- Good for range queries
- Risk of hot spots when the workload concentrates on one range
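A corresponding sketch for range partitioning: keep a sorted list of split points and binary-search to route a key. Choosing and rebalancing the split points is the hard part in practice; the boundaries below are purely illustrative.

```python
import bisect

# Each partition owns keys from its lower bound (inclusive) up to the next
# bound (exclusive): partition 0 < "g", 1 is ["g","n"), 2 is ["n","t"), 3 >= "t".
BOUNDS = ["g", "n", "t"]

def range_partition_for(key: str) -> int:
    return bisect.bisect_right(BOUNDS, key)

# Range queries touch only the partitions whose ranges overlap the query,
# but a workload that mostly writes "recent" keys (e.g., timestamps) would
# hammer the last partition -- the hot-spot risk mentioned above.
assert range_partition_for("alice") == 0
assert range_partition_for("mallory") == 1
assert range_partition_for("zed") == 3
```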
## Conclusion
Building distributed systems requires understanding these fundamental concepts. Start simple, add complexity only when needed, and always plan for failure.