Scalability & Performance
Horizontal Scaling
Also known as: Scale Out, Scaling Out
Adding more machines to a system to handle increased load, as opposed to making a single machine more powerful.
Horizontal scaling — also called scaling out — adds more nodes (servers, containers, pods) to a system. Each new node takes a share of the traffic, typically routed by a load balancer. The capacity of the system grows roughly linearly with the number of nodes.
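The routing step can be sketched as a minimal round-robin balancer. This is an illustrative in-process sketch with hypothetical node names; real load balancers (nginx, HAProxy, cloud LBs) add health checks, weights, and connection draining.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Cycle through backend nodes so each takes an equal share of requests."""

    def __init__(self, nodes):
        self._cycle = cycle(list(nodes))

    def pick(self):
        # Each call returns the next node in rotation.
        return next(self._cycle)

# Hypothetical node names, for illustration only.
lb = RoundRobinBalancer(["node-a", "node-b", "node-c"])
picks = [lb.pick() for _ in range(6)]
print(picks)  # each node serves every third request
```

With three nodes, six consecutive requests land on each node exactly twice, which is the "roughly linear" capacity growth described above.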
Horizontal scaling is the dominant approach for modern web-scale systems: commodity hardware is cheap, cloud providers make provisioning near-instant, and there is no single-machine hardware ceiling, since you can keep adding nodes (though coordination overhead means capacity gains are rarely perfectly linear). Stateless services, where any node can serve any request, scale horizontally almost trivially. Stateful services (databases, caches with node affinity) scale out through sharding, replication, or partitioning.
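For stateful services, the core of sharding is a deterministic key-to-shard mapping that every node agrees on. A minimal sketch, using a stable hash (Python's built-in `hash()` is salted per process, so it is unsuitable here):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard deterministically via a stable hash."""
    digest = hashlib.sha256(key.encode()).digest()
    # Take the first 8 bytes as an integer, then reduce modulo shard count.
    return int.from_bytes(digest[:8], "big") % num_shards

# Every node computes the same answer for the same key.
shard = shard_for("user:1042", 4)
print(shard)
```

Note the design trade-off: naive modulo hashing remaps almost every key when `num_shards` changes, so production systems typically use consistent hashing or a fixed set of virtual shards reassigned among nodes.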
Vertical scaling, buying a bigger box, is the alternative. It is simpler and avoids distributed-systems complexity, but it hits hardware ceilings, has costly upgrade paths, and offers no fault tolerance: if the one machine fails, the whole system is down.
Choose horizontal scaling when traffic exceeds what a single beefy machine can handle, when you need fault tolerance (one node can die without bringing down the system), or when you want elastic capacity that grows and shrinks with demand.
Horizontal scaling pushes complexity into your application: you must handle distributed state, eventual consistency, partial failures, and idempotency. Stateful workloads (databases) are much harder to scale out than stateless ones.
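Idempotency is the piece most often retrofitted: with partial failures, clients retry, and a retried request must not repeat its side effect. A sketch of the idempotency-key pattern, with a hypothetical `PaymentService` and an in-memory key store:

```python
class PaymentService:
    """Retried requests carrying the same idempotency key return the
    stored result instead of re-executing the side effect."""

    def __init__(self):
        self._seen = {}  # idempotency_key -> cached result

    def charge(self, idempotency_key: str, amount: int) -> str:
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]  # replay: no double charge
        result = f"charged {amount}"            # stand-in for the real side effect
        self._seen[idempotency_key] = result
        return result

svc = PaymentService()
first = svc.charge("key-123", 500)
retry = svc.charge("key-123", 500)  # client retry after a timeout
assert first == retry
```

In a real horizontally scaled deployment the key store must itself be shared (e.g., a database), because the retry may land on a different node than the original request.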
Related terms:
Load balancer: A component that distributes incoming network traffic across multiple backend servers to maximize throughput, minimize response time, and avoid overloading any single server.
Sharding: Splitting a large dataset across multiple machines so that each shard holds a subset of the data and handles a subset of the load.
Replication: Maintaining multiple copies of the same data across different nodes for fault tolerance, read scalability, and lower latency.
Caching: Storing copies of frequently accessed data in fast memory so that subsequent requests can be served without recomputing or refetching.
CDN (content delivery network): A globally distributed network of edge servers that cache static content close to end users to minimize latency and origin load.
Vertical scaling: Increasing the capacity of a single machine (more CPU, memory, or disk) to handle more load.
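The caching pattern described above can be sketched as a tiny in-process cache with time-to-live expiry (an illustrative single-node version; shared caches such as Redis apply the same idea per key):

```python
import time

class TTLCache:
    """Tiny in-memory cache: entries expire ttl seconds after being set."""

    def __init__(self, ttl: float):
        self._ttl = ttl
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self._ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazily evict the stale entry
            return None
        return value

cache = TTLCache(ttl=0.05)
cache.set("user:1", {"name": "Ada"})
assert cache.get("user:1") == {"name": "Ada"}  # fresh hit
time.sleep(0.06)
assert cache.get("user:1") is None             # expired
```

One caveat for scaled-out systems: a per-node cache like this can serve different (stale) answers from different nodes, which is one of the consistency costs mentioned above.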