Distributed Systems
Replication: Maintaining multiple copies of the same data across different nodes for fault tolerance, read scalability, and lower latency.
Replication is the process of copying data across multiple nodes — sometimes within a single data center, sometimes across continents. The goal is one or more of: durability (copies survive node failures), availability (a replica can serve reads if the primary is down), read scalability (load is spread across replicas), and lower read latency (reads served from a nearby replica).
There are three main replication topologies. Single-leader (primary-replica) is the most common: all writes go to one node, which streams a log of changes to followers. Multi-leader allows writes at multiple nodes, requiring conflict resolution. Leaderless (Dynamo-style) lets clients write to any node, using quorums and read-repair to converge.
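The single-leader flow described above can be sketched in a few lines: the leader appends each write to an ordered change log and streams new entries to followers. This is a minimal in-memory sketch; the class and method names are illustrative, not from any real system.

```python
# Minimal sketch of single-leader replication: the leader appends every
# write to a log and streams unseen entries to each follower.

class Follower:
    def __init__(self):
        self.data = {}
        self.applied = 0  # index of the next log entry to apply

    def apply(self, entries):
        for key, value in entries:
            self.data[key] = value
        self.applied += len(entries)

class Leader:
    def __init__(self, followers):
        self.data = {}
        self.log = []  # ordered change log streamed to followers
        self.followers = followers

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))
        self.replicate()

    def replicate(self):
        for f in self.followers:
            f.apply(self.log[f.applied:])  # send only entries it hasn't seen

leader = Leader([Follower(), Follower()])
leader.write("x", 1)
leader.write("y", 2)
# every follower now holds the same data as the leader
```

The log is the key idea: because all followers apply the same entries in the same order, they converge to the same state without any conflict resolution.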
Replication can be synchronous (the write only succeeds after replicas acknowledge) or asynchronous (the write returns immediately and replicas catch up later). Synchronous replication gives stronger durability but higher latency and lower availability. Most production systems run a mix: synchronous to a small number of nearby replicas, asynchronous to remote ones.
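The difference in acknowledgment timing can be shown with a toy model, where replicas are dicts and "network delay" is simulated by deferring application of pending writes. All names here are illustrative.

```python
# Sketch contrasting synchronous and asynchronous replication.

class Replica:
    def __init__(self):
        self.data = {}
        self.pending = []  # writes received but not yet applied

    def receive(self, key, value):
        self.pending.append((key, value))

    def flush(self):  # simulates the replica catching up
        for key, value in self.pending:
            self.data[key] = value
        self.pending.clear()

def write_sync(replicas, key, value):
    # Synchronous: the write succeeds only after every replica applies it.
    for r in replicas:
        r.receive(key, value)
        r.flush()  # wait for the replica's acknowledgment
    return "ok"

def write_async(replicas, key, value):
    # Asynchronous: hand off the write and return immediately; if the
    # primary dies before replicas flush, this write can be lost.
    for r in replicas:
        r.receive(key, value)
    return "ok"

replicas = [Replica(), Replica()]
write_sync(replicas, "a", 1)   # all replicas have applied a=1
write_async(replicas, "b", 2)  # b=2 is still only in each pending queue
```

The asynchronous path is what creates replication lag: until each replica flushes, a read from it returns stale data, and a primary crash loses the pending write entirely.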
Use replication whenever data loss or downtime is unacceptable. Even a small system should run at least one replica for failover.
Replication multiplies storage cost, complicates writes, introduces lag (replicas may be behind), and creates split-brain risk during partitions. Asynchronous replication can lose recent writes on primary failure.
Leader election: A protocol by which a group of nodes selects one node as the coordinator or "leader" responsible for a given task.
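The core rule of one classic election protocol (the bully algorithm) fits in a single function: among the nodes currently reachable, the highest-ID node becomes leader. This is a sketch of the decision rule only; real protocols also handle message loss and concurrent elections.

```python
def elect_leader(node_ids, alive):
    """Return the highest-ID node that is alive, or None if none are."""
    candidates = [n for n in node_ids if n in alive]
    return max(candidates) if candidates else None

nodes = [1, 2, 3, 4, 5]
elect_leader(nodes, alive={1, 2, 3})  # → 3, after nodes 4 and 5 fail
```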
Consensus: The process by which a group of distributed nodes agrees on a single value or sequence of values, even in the presence of failures.
Partitioning (sharding): Splitting a large dataset across multiple machines so that each shard holds a subset of the data and handles a subset of the load.
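The simplest routing scheme for such a split is hash partitioning: derive a key's shard from a stable hash so every node routes the same key to the same place. A minimal sketch (shard count and key names are illustrative):

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    # Use a stable hash (not Python's randomized built-in hash()) so
    # routing stays consistent across processes and restarts.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

shard_for("user:42")  # deterministic shard in the range 0..3
```

Note the weakness of plain modulo: changing NUM_SHARDS remaps most keys, forcing a large data migration. Consistent hashing or fixed-range partitioning are the usual fixes.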
Eventual consistency: A consistency model where, given enough time and no new updates, all replicas of a piece of data will converge to the same value.
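One mechanism that drives this convergence is read repair: read from several replicas, pick the newest version, and write it back to any replica that was stale. In this sketch, integer version counters stand in for real timestamps or vector clocks, and replicas are plain dicts.

```python
def read_with_repair(replicas, key):
    # Each replica stores (version, value); missing keys count as version 0.
    versions = [r.get(key, (0, None)) for r in replicas]
    newest = max(versions)
    for r in replicas:
        if r.get(key, (0, None)) < newest:
            r[key] = newest  # repair the stale replica
    return newest[1]

r1 = {"x": (2, "new")}
r2 = {"x": (1, "old")}
read_with_repair([r1, r2], "x")  # → "new"; r2 is repaired as a side effect
```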
CAP theorem: A principle stating that a distributed data store can provide at most two of three guarantees: Consistency, Availability, and Partition tolerance. Since network partitions cannot be prevented, the practical choice during a partition is between consistency and availability.
Quorum: The minimum number of nodes that must agree on an operation for it to be considered successful in a distributed system.
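In leaderless replication, quorums are usually expressed as the condition W + R > N: with N replicas, writes acknowledged by W nodes, and reads contacting R nodes, every read set overlaps every write set, so a read always sees at least one up-to-date copy. The check itself is one line:

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    # With n replicas, a write quorum of w and a read quorum of r are
    # guaranteed to share at least one node exactly when w + r > n.
    return w + r > n

quorums_overlap(3, 2, 2)  # → True: a common N=3, W=2, R=2 configuration
quorums_overlap(3, 1, 1)  # → False: a read can miss the latest write
```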