Distributed Systems
Replication: Maintaining multiple copies of the same data across different nodes for fault tolerance, read scalability, and lower latency.
Replication is the process of copying data across multiple nodes — sometimes within a single data center, sometimes across continents. The goal is one or more of: durability (copies survive node failures), availability (a replica can serve reads if the primary is down), read scalability (load is spread across replicas), and lower read latency (reads served from a nearby replica).
There are three main replication topologies. Single-leader (primary-replica) is the most common: all writes go to one node, which streams a log of changes to followers. Multi-leader allows writes at multiple nodes, requiring conflict resolution. Leaderless (Dynamo-style) lets clients write to any node, using quorums and read-repair to converge.
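The single-leader flow described above can be sketched in a few lines: the leader appends each write to an ordered change log and streams new entries to followers. This is a minimal in-memory sketch; the class and method names are illustrative, not from any real system.

```python
# Minimal sketch of single-leader replication: the leader appends every
# write to a log and streams unseen entries to each follower.

class Follower:
    def __init__(self):
        self.data = {}
        self.applied = 0  # index of the next log entry to apply

    def apply(self, entries):
        for key, value in entries:
            self.data[key] = value
        self.applied += len(entries)

class Leader:
    def __init__(self, followers):
        self.data = {}
        self.log = []  # ordered change log streamed to followers
        self.followers = followers

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))
        self.replicate()

    def replicate(self):
        for f in self.followers:
            f.apply(self.log[f.applied:])  # send only entries it hasn't seen

leader = Leader([Follower(), Follower()])
leader.write("x", 1)
leader.write("y", 2)
# every follower now holds the same data as the leader
```

The log is the key idea: because all followers apply the same entries in the same order, they converge to the same state without any conflict resolution.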
Replication can be synchronous (the write only succeeds after replicas acknowledge) or asynchronous (the write returns immediately and replicas catch up later). Synchronous replication gives stronger durability but higher latency and lower availability. Most production systems run a mix: synchronous to a small number of nearby replicas, asynchronous to remote ones.
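The difference in acknowledgment timing can be shown with a toy model, where replicas are dicts and "network delay" is simulated by deferring application of pending writes. All names here are illustrative.

```python
# Sketch contrasting synchronous and asynchronous replication.

class Replica:
    def __init__(self):
        self.data = {}
        self.pending = []  # writes received but not yet applied

    def receive(self, key, value):
        self.pending.append((key, value))

    def flush(self):  # simulates the replica catching up
        for key, value in self.pending:
            self.data[key] = value
        self.pending.clear()

def write_sync(replicas, key, value):
    # Synchronous: the write succeeds only after every replica applies it.
    for r in replicas:
        r.receive(key, value)
        r.flush()  # wait for the replica's acknowledgment
    return "ok"

def write_async(replicas, key, value):
    # Asynchronous: hand off the write and return immediately; if the
    # primary dies before replicas flush, this write can be lost.
    for r in replicas:
        r.receive(key, value)
    return "ok"

replicas = [Replica(), Replica()]
write_sync(replicas, "a", 1)   # all replicas have applied a=1
write_async(replicas, "b", 2)  # b=2 is still only in each pending queue
```

The asynchronous path is what creates replication lag: until each replica flushes, a read from it returns stale data, and a primary crash loses the pending write entirely.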
Use replication whenever data loss or downtime is unacceptable. Even a small system should run at least one replica for failover.
Replication multiplies storage cost, complicates writes, introduces lag (replicas may be behind), and creates split-brain risk during partitions. Asynchronous replication can lose recent writes on primary failure.
Leader election: A protocol by which a group of nodes selects one node as the coordinator or "leader" responsible for a given task.
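The core rule of one classic election protocol (the bully algorithm) fits in a single function: among the nodes currently reachable, the highest-ID node becomes leader. This is a sketch of the decision rule only; real protocols also handle message loss and concurrent elections.

```python
def elect_leader(node_ids, alive):
    """Return the highest-ID node that is alive, or None if none are."""
    candidates = [n for n in node_ids if n in alive]
    return max(candidates) if candidates else None

nodes = [1, 2, 3, 4, 5]
elect_leader(nodes, alive={1, 2, 3})  # → 3, after nodes 4 and 5 fail
```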
Consensus: The process by which a group of distributed nodes agrees on a single value or sequence of values, even in the presence of failures.
Partitioning (sharding): Splitting a large dataset across multiple machines so that each shard holds a subset of the data and handles a subset of the load.
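The simplest routing scheme for such a split is hash partitioning: derive a key's shard from a stable hash so every node routes the same key to the same place. A minimal sketch (shard count and key names are illustrative):

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    # Use a stable hash (not Python's randomized built-in hash()) so
    # routing stays consistent across processes and restarts.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

shard_for("user:42")  # deterministic shard in the range 0..3
```

Note the weakness of plain modulo: changing NUM_SHARDS remaps most keys, forcing a large data migration. Consistent hashing or fixed-range partitioning are the usual fixes.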
Eventual consistency: A consistency model where, given enough time and no new updates, all replicas of a piece of data will converge to the same value.
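One mechanism that drives this convergence is read repair: read from several replicas, pick the newest version, and write it back to any replica that was stale. In this sketch, integer version counters stand in for real timestamps or vector clocks, and replicas are plain dicts.

```python
def read_with_repair(replicas, key):
    # Each replica stores (version, value); missing keys count as version 0.
    versions = [r.get(key, (0, None)) for r in replicas]
    newest = max(versions)
    for r in replicas:
        if r.get(key, (0, None)) < newest:
            r[key] = newest  # repair the stale replica
    return newest[1]

r1 = {"x": (2, "new")}
r2 = {"x": (1, "old")}
read_with_repair([r1, r2], "x")  # → "new"; r2 is repaired as a side effect
```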
CAP theorem: A principle stating that a distributed data store can provide at most two of three guarantees: Consistency, Availability, and Partition tolerance. Since network partitions cannot be prevented, the practical choice during a partition is between consistency and availability.
Quorum: The minimum number of nodes that must agree on an operation for it to be considered successful in a distributed system.
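In leaderless replication, quorums are usually expressed as the condition W + R > N: with N replicas, writes acknowledged by W nodes, and reads contacting R nodes, every read set overlaps every write set, so a read always sees at least one up-to-date copy. The check itself is one line:

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    # With n replicas, a write quorum of w and a read quorum of r are
    # guaranteed to share at least one node exactly when w + r > n.
    return w + r > n

quorums_overlap(3, 2, 2)  # → True: a common N=3, W=2, R=2 configuration
quorums_overlap(3, 1, 1)  # → False: a read can miss the latest write
```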