{"_id":"69ecdecf85ead3a7057c2907","chapter_name":"Basics of System Design","chapter_no":1,"content":{"1.10_fault_tolerance_and_reliability":{"concepts":{"Bulkhead":"Isolate failures so they don't spread (like ship compartments)","Circuit Breaker":"Temporarily stop calling a failing service to prevent cascading failures","Failover":"Automatic switch to backup when primary fails","Redundancy":"Multiple copies of critical components (servers, DBs)","Replication":"Duplicate data across nodes","Retry with Exponential Backoff":"Retry failed requests with exponentially increasing delays between attempts"},"definition":"Fault tolerance is the ability of a system to continue operating correctly when one or more components fail."},"1.11_availability_basics":{"formula":"Availability = Uptime / (Uptime + Downtime), usually expressed as a percentage and committed to in an SLA.","sla_table":[{"SLA":"99% (2 nines)","annual_downtime":"~3.65 days/year","industry_usage":"Internal tools"},{"SLA":"99.9% (3 nines)","annual_downtime":"~8.76 hours/year","industry_usage":"Many web services"},{"SLA":"99.99% (4 nines)","annual_downtime":"~52 minutes/year","industry_usage":"Most production SaaS"},{"SLA":"99.999% (5 nines)","annual_downtime":"~5.26 minutes/year","industry_usage":"Critical infra (banking, telecom)"}]},"1.12_performance_metrics":{"QPS":"Queries Per Second – Number of DB queries / API calls per second","TPS":"Transactions Per Second – Number of complete transactions (read+write) per second","estimation_example":"100M DAU × 10 actions/day = 1B requests/day; 1B / 86,400 s ≈ 11,600 QPS average","percentile_latency":"P50 / P95 / P99 Latency: 50th, 95th, 99th percentile response times (e.g., P99 = 200 ms means 99% of requests finish within 200 ms)"},"1.1_what_is_system_design":{"definition":"System Design is the process of defining the architecture, components, modules, interfaces, and data flow of a system to satisfy specified requirements. It bridges the gap between a vague problem statement and a robust, scalable, production-ready solution.","key_insight":"System design is NOT about writing code. 
It's about making high-level architectural decisions. Always start with requirements, then design, then dive into details.","key_properties":["Scalable","Reliable","Maintainable","Efficient"]},"1.2_functional_vs_nonfunctional_requirements":{"common_nfrs":["Scalability","Availability (e.g., 99.99% uptime = ~52 mins/year downtime)","Latency (P99 \u003c 200ms)","Throughput (requests per second)","Durability","Security","Consistency"],"comparison":[{"aspect":"Definition","functional":"WHAT the system does","non_functional":"HOW WELL the system does it"},{"aspect":"Example (URL Shortener)","functional":"Shorten a long URL to a short one","non_functional":"Handle 10,000 requests/second"},{"aspect":"Example (Chat App)","functional":"Send and receive messages","non_functional":"Deliver messages in \u003c 100ms"},{"aspect":"Testing","functional":"Verifiable via use cases","non_functional":"Measured via benchmarks/SLAs"},{"aspect":"Priority","functional":"Core features","non_functional":"Quality attributes (NFRs)"}]},"1.3_scalability_vertical_vs_horizontal":{"comparison":[{"dimension":"Approach","horizontal":"Add more machines to the pool","vertical":"Add more power to existing machine (CPU, RAM, SSD)"},{"dimension":"Cost","horizontal":"More cost-effective; commodity hardware","vertical":"Expensive at high end, limited ceiling"},{"dimension":"Limit","horizontal":"Near-infinite (add nodes)","vertical":"Hard limit (largest machine available)"},{"dimension":"Complexity","horizontal":"Complex – needs load balancing, distributed state","vertical":"Simple – no code changes needed"},{"dimension":"Failure","horizontal":"Resilient – one node failing doesn't take down system","vertical":"Single point of failure"},{"dimension":"Best For","horizontal":"Web servers, microservices, stateless apps","vertical":"Databases (initially), simple apps"}],"definition":"Scalability is the ability of a system to handle a growing amount of work by adding resources.","tip":"Most large-scale systems 
use horizontal scaling for stateless services and vertical scaling for databases, then shard when needed."},"1.4_latency_vs_throughput":{"comparison":[{"latency":"Time for a single request to complete","metric":"Definition","throughput":"Number of requests handled per unit time"},{"latency":"Milliseconds (ms) / microseconds (μs)","metric":"Unit","throughput":"Requests/sec (RPS), Queries/sec (QPS)"},{"latency":"How fast one car travels from A to B","metric":"Analogy","throughput":"How many cars pass a point per hour"},{"latency":"Minimize","metric":"Goal","throughput":"Maximize"},{"latency":"Batching ↑ throughput but ↑ latency","metric":"Tradeoff","throughput":"Parallelism ↑ throughput, caching ↓ latency"}],"latency_numbers":{"L1_cache_reference":"~0.5 ns","SSD_random_read":"~100 μs","disk_seek":"~10 ms","main_memory_reference":"~100 ns","packet_roundtrip_cross_continent":"~150 ms","packet_roundtrip_same_region":"~0.5 ms"}},"1.5_CAP_theorem_intro":{"definition":"The CAP Theorem (Brewer's Theorem) states that a distributed system can guarantee at most two of the following three properties simultaneously.","key_note":"Network partitions are inevitable in real distributed systems, so partition tolerance is not optional; when a partition occurs, the system must choose between Consistency (CP) and Availability (AP).","properties":{"A":"Availability – Every request receives a non-error response, though not necessarily one reflecting the most recent write","C":"Consistency – Every read receives the most recent write or an error","P":"Partition Tolerance – System continues operating despite network partitions (message loss/delay)"},"types":[{"examples":["HBase","ZooKeeper","MongoDB (strong)"],"guarantees":"Data stays consistent; may reject requests during a partition","type":"CP","use_case":"Financial systems, inventory"},{"examples":["Cassandra","CouchDB","DynamoDB"],"guarantees":"Always responds; data may be stale","type":"AP","use_case":"Social feeds, DNS, shopping carts"},{"examples":["Traditional RDBMS (single node)"],"guarantees":"Works only without partitions – 
impossible in distributed systems","type":"CA","use_case":"Single-node systems only"}]},"1.6_monolith_vs_distributed":{"comparison":[{"aspect":"Structure","distributed":"Multiple independent services","monolithic":"Single deployable unit"},{"aspect":"Development","distributed":"Complex; requires orchestration","monolithic":"Simpler to start"},{"aspect":"Scaling","distributed":"Scale individual services","monolithic":"Scale entire app"},{"aspect":"Deployment","distributed":"Independent deployments","monolithic":"One deployment"},{"aspect":"Failure","distributed":"Isolated failures","monolithic":"One failure can crash all"},{"aspect":"Best For","distributed":"Large-scale / multiple teams","monolithic":"Early-stage / small teams"}],"tip":"Start with a monolith, then extract services when you hit bottlenecks. Don't over-engineer early."},"1.7_HLD_vs_LLD":{"comparison":[{"HLD":"Architecture \u0026 component interaction","LLD":"Class diagrams, DB schemas, APIs","aspect":"Focus"},{"HLD":"Architects, senior engineers","LLD":"Developers","aspect":"Audience"},{"HLD":"Bird's-eye view","LLD":"Step-by-step implementation","aspect":"Detail"},{"HLD":"Architecture diagram","LLD":"Code-level design, data models","aspect":"Deliverable"}]},"1.8_client_server_architecture":{"components":{"client":"Initiates requests (browser, mobile app, CLI)","protocol":"Usually HTTP/HTTPS, WebSocket for real-time","server":"Processes requests, returns response"},"notes":"DNS resolves domain to server IP; TCP/IP handles transport","request_flow":["1. User types URL → Browser resolves DNS","2. TCP connection established (3-way handshake)","3. HTTP request sent to server","4. Server processes, queries DB if needed","5. 
Response returned, browser renders HTML"]},"1.9_stateless_vs_stateful":{"comparison":[{"aspect":"State stored","stateful":"Server memory","stateless":"Client or external store (DB, cache)"},{"aspect":"Scalability","stateful":"Hard – needs sticky sessions or session replication","stateless":"Easy – any server handles any request"},{"aspect":"Example","stateful":"WebSocket sessions, FTP","stateless":"REST APIs, HTTP"},{"aspect":"Fault tolerance","stateful":"Low – server crash loses in-memory sessions","stateless":"High – server crash loses no session state; another server can take over"}]}},"level":"Beginner","subtitle":"Fundamentals, scalability, CAP theorem, and core architectural concepts","topics_count":12}