Distributed Systems
Leader Election: A protocol by which a group of nodes selects one node as the coordinator or "leader" responsible for a given task.
Leader election is the mechanism by which a distributed system designates one node as the leader for some scope of work. The leader typically handles writes, coordinates state changes, or schedules work, while followers replicate or stand by. Examples: a single MongoDB primary, the etcd cluster's leader, the active NameNode in an HDFS high-availability pair, and the leader replica of a Kafka partition.
Election is usually built on consensus: nodes propose themselves, a quorum agrees, and the winner holds the role for a bounded "term" or "lease". When the leader fails (detected by missed heartbeats), a new election kicks off and another node takes over. Lease-based election combined with fencing tokens guards against split brain, the scenario where two nodes both believe they are the leader: even if a stale leader keeps acting, downstream resources reject any request that carries an older fencing token.
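To make the fencing idea concrete, here is a minimal sketch (not from any particular system; FencedStore and its Write method are invented for illustration) of a shared resource that remembers the highest fencing token it has seen and refuses writes carrying an older one:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// FencedStore is a hypothetical shared resource that accepts writes only
// from the holder of the newest fencing token it has observed.
type FencedStore struct {
	mu           sync.Mutex
	highestToken int64
	data         map[string]string
}

func NewFencedStore() *FencedStore {
	return &FencedStore{data: make(map[string]string)}
}

var ErrStaleLeader = errors.New("rejected: fencing token is older than one already seen")

// Write applies a key/value update tagged with the caller's fencing token.
// A token lower than the highest seen means the caller's lease has been
// superseded by a newer leader, so the write is refused.
func (s *FencedStore) Write(token int64, key, value string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if token < s.highestToken {
		return ErrStaleLeader
	}
	s.highestToken = token
	s.data[key] = value
	return nil
}

func main() {
	store := NewFencedStore()

	// Leader holding token 34 writes successfully.
	fmt.Println(store.Write(34, "config", "v1")) // <nil>

	// A new leader is elected and receives token 35.
	fmt.Println(store.Write(35, "config", "v2")) // <nil>

	// The old leader wakes up after a long pause and retries with token 34.
	fmt.Println(store.Write(34, "config", "v3")) // error: stale fencing token
}
```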
Leader election simplifies many distributed problems by reducing them to single-node problems: only one node accepts writes, so you avoid concurrent-write conflicts entirely.
Use leader election whenever you need a single coordinator: primary database replicas, scheduled job runners, cluster managers, distributed lock services.
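As one concrete route, etcd's Go client ships a concurrency package with a ready-made Election helper. The sketch below assumes an etcd endpoint at localhost:2379; the key prefix and candidate name are placeholders:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	// Connect to a local etcd cluster (endpoint is a placeholder).
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A session holds a lease; if this process dies, the lease expires
	// and leadership is released automatically.
	sess, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	// All candidates campaign on the same key prefix; exactly one wins.
	election := concurrency.NewElection(sess, "/elections/job-runner")

	ctx := context.Background()
	if err := election.Campaign(ctx, "node-1"); err != nil {
		log.Fatal(err)
	}
	log.Println("elected leader; running scheduled jobs")

	// ... do leader-only work here (e.g. run the cron scheduler) ...
	time.Sleep(30 * time.Second)

	// Step down cleanly so another candidate can take over immediately.
	if err := election.Resign(ctx); err != nil {
		log.Fatal(err)
	}
}
```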
A single leader is a write bottleneck (all writes funnel through it) and an availability concern (if a new election takes seconds, the system cannot accept writes during that window). Multi-leader and leaderless designs accept extra complexity, such as conflict resolution, in exchange for higher write availability.
Consensus: The process by which a group of distributed nodes agree on a single value or sequence of values, even in the presence of failures.
Quorum: The minimum number of nodes that must agree on an operation for it to be considered successful in a distributed system.
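For the common majority quorum, the required count is floor(N/2) + 1, which also determines how many simultaneous failures a cluster can survive. A small sketch (the helper name is ours):

```go
package main

import "fmt"

// majorityQuorum returns the smallest number of nodes that constitutes
// a strict majority of a cluster of n nodes.
func majorityQuorum(n int) int {
	return n/2 + 1
}

func main() {
	for _, n := range []int{3, 5, 7} {
		// A cluster of n nodes can tolerate n - quorum failures.
		q := majorityQuorum(n)
		fmt.Printf("cluster of %d: quorum %d, tolerates %d failures\n", n, q, n-q)
	}
}
```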
Replication: Maintaining multiple copies of the same data across different nodes for fault tolerance, read scalability, and lower latency.
CAP Theorem: A principle stating that a distributed data store can provide at most two of: Consistency, Availability, and Partition tolerance.
Eventual Consistency: A consistency model where, given enough time and no new updates, all replicas of a piece of data will converge to the same value.
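As a toy illustration of how replicas can converge, the sketch below merges two copies with a last-write-wins rule keyed on a timestamp; the Record type is invented for this example, and real systems often need richer conflict handling (vector clocks, CRDTs):

```go
package main

import "fmt"

// Record is a hypothetical replicated value tagged with the time it was written.
type Record struct {
	Value     string
	Timestamp int64 // e.g. milliseconds since epoch, or a logical clock
}

// merge applies a last-write-wins rule: whichever copy carries the newer
// timestamp survives, so all replicas converge once they exchange records.
func merge(a, b Record) Record {
	if b.Timestamp > a.Timestamp {
		return b
	}
	return a
}

func main() {
	replicaA := Record{Value: "blue", Timestamp: 100}
	replicaB := Record{Value: "green", Timestamp: 105}

	// After the replicas exchange records, both hold the same value.
	fmt.Println(merge(replicaA, replicaB)) // {green 105}
	fmt.Println(merge(replicaB, replicaA)) // {green 105}
}
```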
Sharding (Partitioning): Splitting a large dataset across multiple machines so that each shard holds a subset of the data and handles a subset of the load.
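A simple placement scheme hashes the key and takes it modulo the shard count, as sketched below with Go's standard FNV hash. Note that plain modulo placement reshuffles most keys when the shard count changes, which is why many systems prefer consistent hashing:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a record key to one of numShards shards by hashing the key.
func shardFor(key string, numShards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % numShards
}

func main() {
	keys := []string{"user:1001", "user:1002", "order:77"}
	for _, k := range keys {
		fmt.Printf("%s -> shard %d\n", k, shardFor(k, 4))
	}
}
```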