Distributed Systems
Sharding
Also known as: Partitioning, Horizontal Partitioning
Splitting a large dataset across multiple machines so that each shard holds a subset of the data and handles a subset of the load.
Sharding (also called horizontal partitioning) divides a large dataset into smaller pieces called shards, each held on a separate node. Where replication makes copies of the same data, sharding distributes different data across nodes. Together they let a system scale beyond the capacity and throughput of any single machine.
The shard key, the field used to decide which shard a row belongs to, is the most important design decision in any sharded system. Hash-based sharding (hash the key, mod by the shard count) gives even distribution but makes range queries expensive, since a range scan must fan out to every shard. Range-based sharding keeps adjacent keys on the same shard for fast range scans but risks hot shards if the workload is skewed. Geographic sharding places data in the region closest to the users who access it, cutting latency at the cost of uneven shard sizes and awkward cross-region queries.
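The two routing strategies above can be sketched in a few lines. This is an illustrative sketch, not a production router; the key format, shard count, and range boundaries are made-up assumptions:

```python
import hashlib

def hash_shard(key: str, num_shards: int) -> int:
    """Hash-based sharding: hash the key, mod by shard count.
    Even distribution, but a range query must hit every shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

def range_shard(key: str, boundaries: list[str]) -> int:
    """Range-based sharding: keys <= boundaries[i] go to shard i,
    so adjacent keys land together and range scans stay local."""
    for i, upper in enumerate(boundaries):
        if key <= upper:
            return i
    return len(boundaries)  # last shard holds keys above the top boundary

# Hypothetical user-ID keys routed across 4 shards:
print(hash_shard("user:1001", 4))
print(range_shard("user:1001", ["user:2500", "user:5000", "user:7500"]))
```

Note the trade-off baked into the code: `hash_shard` scatters neighboring keys deliberately, while `range_shard` depends on well-chosen boundaries to avoid a hot shard.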
Resharding (changing the number of shards) is one of the hardest operational problems in distributed systems. Consistent hashing reduces the data movement required when shards are added or removed, which is why it underpins systems like Cassandra and DynamoDB.
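Why consistent hashing helps with resharding can be shown with a minimal ring. This sketch omits virtual nodes (which real systems like Cassandra use to smooth the distribution); node and key names are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring. Each node owns the arc of the
    hash space below its position, so adding a node only takes keys
    from one neighbor instead of reshuffling everything."""

    def __init__(self, nodes=()):
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add(self, node: str) -> None:
        bisect.insort(self._ring, (self._hash(node), node))

    def lookup(self, key: str) -> str:
        # First node clockwise from the key's position (wrapping around).
        idx = bisect.bisect(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
before = {k: ring.lookup(k) for k in (f"key{i}" for i in range(1000))}
ring.add("node-d")
after = {k: ring.lookup(k) for k in before}
moved = sum(before[k] != after[k] for k in before)
# Only the keys on node-d's new arc move (about 1/N of them on average);
# naive hash-mod-N would move roughly three quarters of the keys here.
print(f"{moved} of 1000 keys moved, all to the new node")
```

Every key that moves must move to the newly added node; keys on the rest of the ring are untouched, which is the property that makes adding or removing shards cheap.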
Shard when a single node cannot hold the dataset, cannot serve the write throughput, or cannot meet latency requirements. Shard reluctantly — operational cost is high.
Sharding makes cross-shard transactions slow and complex (many systems forbid them outright), makes joins across shards expensive, complicates resharding, and often forces denormalization. A wrong shard key bakes in years of pain.
Horizontal Scaling: Adding more machines to a system to handle increased load, as opposed to making a single machine more powerful.
Consistent Hashing: A hashing technique that minimizes the amount of data that needs to be moved when nodes are added to or removed from a distributed system.
Replication: Maintaining multiple copies of the same data across different nodes for fault tolerance, read scalability, and lower latency.
CAP Theorem: A principle stating that a distributed data store can provide at most two of: Consistency, Availability, and Partition tolerance.
Eventual Consistency: A consistency model where, given enough time and no new updates, all replicas of a piece of data will converge to the same value.
Consensus: The process by which a group of distributed nodes agree on a single value or sequence of values, even in the presence of failures.