Distributed Systems
Also known as: Distributed Consensus. (Raft and Paxos, often mentioned alongside it, are specific algorithms that implement consensus, not synonyms for it.)
The process by which a group of distributed nodes agree on a single value or sequence of values, even in the presence of failures.
Consensus is one of the fundamental problems in distributed systems. A consensus protocol allows a group of nodes to agree on a value such that all non-faulty nodes decide on the same value (agreement), the decided value was actually proposed by some node (validity), and every non-faulty node eventually decides (termination). The two most famous algorithms are Paxos (Lamport, published 1998) and Raft (Ongaro & Ousterhout, 2014). Raft was explicitly designed to be easier to understand than Paxos and is the dominant choice in modern systems.
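To make this concrete, here is a toy Go sketch of the Raft-flavored happy path: a leader appends an entry to its log, replicates it to followers, and treats it as committed once a majority holds it. All names here are illustrative, not from any real implementation; a real protocol also handles term checks, log matching, retries, and leader changes.

```go
package main

import "fmt"

// entry is a single replicated log record (simplified Raft-style).
type entry struct {
	term  int
	value string
}

// node holds one replica's log.
type node struct {
	id  int
	log []entry
}

// replicate mimics a leader appending an entry and waiting for a majority.
// This models only the failure-free path: every follower acknowledges.
func replicate(leader *node, followers []*node, e entry) bool {
	leader.log = append(leader.log, e)
	acks := 1 // the leader counts toward the quorum
	for _, f := range followers {
		f.log = append(f.log, e) // assume the replication RPC succeeds
		acks++
	}
	total := len(followers) + 1
	return acks >= total/2+1 // commit once a majority holds the entry
}

func main() {
	leader := &node{id: 0}
	followers := []*node{{id: 1}, {id: 2}}
	committed := replicate(leader, followers, entry{term: 1, value: "x=5"})
	fmt.Println("committed:", committed) // true: 3 of 3 acks meets the quorum of 2
}
```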
Consensus underpins much of modern infrastructure: leader election in Kubernetes (etcd), service discovery in Consul, distributed locking in ZooKeeper, transaction commit in Spanner, and replicated state machines in nearly every modern database. Whenever you need a single source of truth across multiple nodes, consensus is the building block.
Consensus requires a majority (quorum) of nodes to agree. With three nodes, you need at least two; with five, at least three. This is why consensus clusters are sized as odd numbers: a five-node cluster tolerates two failures, while a four-node cluster still tolerates only one, so the extra even-numbered node raises the quorum without adding resilience.
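The arithmetic follows directly from the majority rule; a trivial sketch (not from the source) makes the odd-versus-even point visible:

```go
package main

import "fmt"

// quorum is the smallest majority of an n-node cluster.
func quorum(n int) int { return n/2 + 1 }

// tolerable is how many node failures an n-node cluster survives
// while still being able to form a quorum.
func tolerable(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{3, 4, 5} {
		fmt.Printf("n=%d quorum=%d tolerates=%d\n", n, quorum(n), tolerable(n))
	}
	// Output:
	// n=3 quorum=2 tolerates=1
	// n=4 quorum=3 tolerates=1
	// n=5 quorum=3 tolerates=2
}
```

Note that four nodes need a quorum of three but still tolerate only one failure, the same as three nodes.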
Use consensus whenever multiple nodes must agree on a critical fact: who the leader is, what the next log entry is, or whether a transaction committed. Consensus is expensive, so keep its scope small.
Consensus is slow (multiple network round-trips), unavailable when a quorum is not reachable, and complex to implement correctly. Reuse battle-tested implementations (etcd, ZooKeeper) rather than rolling your own.
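As a sketch of what reusing a battle-tested implementation looks like, the Go client for etcd ships a concurrency package with lease-backed leader election. The endpoint and key prefix below are assumptions for illustration; adjust them to your cluster:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	// Connect to a local etcd node (endpoint is an assumption).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A session is a lease-backed handle: if this process dies,
	// its candidacy is withdrawn automatically when the lease expires.
	session, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Campaign blocks until this node wins the election under the
	// given key prefix (the prefix "/my-service/leader" is hypothetical).
	election := concurrency.NewElection(session, "/my-service/leader")
	if err := election.Campaign(context.Background(), "node-1"); err != nil {
		log.Fatal(err)
	}
	log.Println("this node is now the leader")
}
```

The hard parts (quorum, log replication, failover) live inside etcd; the application only consumes the result of consensus.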
Related terms:
Leader election: A protocol by which a group of nodes selects one node as the coordinator or "leader" responsible for a given task.
Quorum: The minimum number of nodes that must agree on an operation for it to be considered successful in a distributed system.
Replication: Maintaining multiple copies of the same data across different nodes for fault tolerance, read scalability, and lower latency.
CAP theorem: A principle stating that a distributed data store can provide at most two of: consistency, availability, and partition tolerance.
Eventual consistency: A consistency model where, given enough time and no new updates, all replicas of a piece of data converge to the same value.
Sharding (partitioning): Splitting a large dataset across multiple machines so that each shard holds a subset of the data and handles a subset of the load.