Scalability & Performance
Vertical Scaling
Also known as: Scale Up, Scaling Up
Increasing the capacity of a single machine — more CPU, memory, or disk — to handle more load.
Vertical scaling means making one server bigger. You replace a 4-core, 16 GB instance with a 32-core, 256 GB one. The application code typically does not change; the same monolithic process simply has more resources available.
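In practice this is usually an infrastructure operation rather than a code change. As a minimal sketch, assuming AWS and its boto3 Python SDK, with a hypothetical instance ID and target size:

```python
# Minimal sketch: resizing a cloud VM in place with boto3 (AWS).
# The instance ID and target type are hypothetical examples.
import boto3

ec2 = boto3.client("ec2")
instance_id = "i-0123456789abcdef0"  # hypothetical instance ID

# The instance must be stopped to resize; this is the upgrade
# downtime mentioned later in this entry.
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# Swap the instance type, e.g. 4 cores / 16 GB up to 32 cores / 256 GB.
ec2.modify_instance_attribute(
    InstanceId=instance_id,
    InstanceType={"Value": "r5.8xlarge"},  # hypothetical target size
)

ec2.start_instances(InstanceIds=[instance_id])
```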
Vertical scaling is appealing because it avoids distributed-systems complexity. There is no need to shard data, coordinate replicas, or handle partial failures. It is often the right first move when a system is small, the engineering team is small, and reliability requirements are modest.
The downside is a hard ceiling. Even the largest cloud instances top out at tens of terabytes of RAM and a few hundred cores. Vertical scaling also provides no fault tolerance: if the one big machine dies, the entire system goes down. And the cost curve is non-linear: doubling instance size often more than doubles the price.
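The cost point is easy to check against any provider's price list. The figures below are purely hypothetical, chosen only to show how price per core can climb as instances get bigger:

```python
# Purely hypothetical hourly prices, to illustrate the shape of the curve.
sizes = [(4, 0.20), (8, 0.45), (16, 1.00), (32, 2.30)]  # (cores, $/hour)

for cores, price in sizes:
    print(f"{cores:>2} cores: ${price:.2f}/h, ${price / cores:.3f} per core")
```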
Use vertical scaling for early-stage systems, single-node databases (within reason), and any workload that genuinely fits on one machine. Many companies can postpone horizontal scaling far longer than they expect by aggressively scaling vertically first.
The main drawbacks are the hardware ceiling, the lack of fault tolerance, steep prices at the upper end, and, in many setups, downtime during upgrades. Eventually, every successful system outgrows vertical scaling.
Horizontal Scaling
Adding more machines to a system to handle increased load, as opposed to making a single machine more powerful.
Sharding
Splitting a large dataset across multiple machines so that each shard holds a subset of the data and handles a subset of the load.
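As a minimal sketch of the routing step, with hypothetical shard names and simple modulo placement (real systems often prefer consistent hashing so that adding a shard moves less data):

```python
# Minimal sketch: hash-based routing over a fixed set of shards.
import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]  # hypothetical

def shard_for(key: str) -> str:
    """Map a key deterministically to the shard that owns it."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user:42"))    # the same key always routes to the same shard
print(shard_for("user:1337"))
```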
Caching
Storing copies of frequently accessed data in fast memory so that subsequent requests can be served without recomputing or refetching.
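A minimal in-process sketch, where fetch_user is a hypothetical stand-in for a slow database query; production caches typically add expiry (TTL) and eviction, or live in a shared store such as Redis:

```python
import time

_cache: dict[str, dict] = {}  # in-process cache; real systems add TTL/eviction

def fetch_user(user_id: str) -> dict:
    """Hypothetical stand-in for a slow database query."""
    time.sleep(0.5)
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id: str) -> dict:
    if user_id not in _cache:        # miss: pay the fetch cost once
        _cache[user_id] = fetch_user(user_id)
    return _cache[user_id]           # hit: served from memory

get_user("42")  # slow, cache miss
get_user("42")  # fast, cache hit
```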
Content Delivery Network (CDN)
A globally distributed network of edge servers that cache static content close to end users to minimize latency and origin load.
Load Balancer
A component that distributes incoming network traffic across multiple backend servers to maximize throughput, minimize response time, and avoid overload.
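The core selection logic can be sketched in a few lines; the backend addresses here are hypothetical, and real load balancers add health checks, weighting, and connection draining:

```python
# Minimal sketch: round-robin selection over a set of backends.
import itertools

BACKENDS = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]  # hypothetical
_rotation = itertools.cycle(BACKENDS)

def pick_backend() -> str:
    """Return the next backend in round-robin order."""
    return next(_rotation)

for _ in range(5):
    print(pick_backend())  # cycles through the backends in turn
```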
Latency
The time delay between a request being sent and a response being received — typically measured in milliseconds.
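Measured from the client side, latency is simply the wall-clock time wrapped around a request. A minimal sketch, using a placeholder URL:

```python
# Minimal sketch: measuring request latency in milliseconds.
import time
import urllib.request

start = time.perf_counter()
urllib.request.urlopen("https://example.com/")  # placeholder URL
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"latency: {elapsed_ms:.1f} ms")
```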