Scalability & Performance
Latency: The time delay between a request being sent and a response being received — typically measured in milliseconds.
Latency measures how long a single operation takes from start to finish. For a web request, end-to-end latency includes DNS resolution, TCP handshake, TLS handshake, request transmission, server processing, response transmission, and client rendering.
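A minimal sketch of observing those phases with only the Python standard library; the host example.com and port 443 are placeholders, and client rendering is omitted since it happens in the browser:

```python
import socket
import ssl
import time

host = "example.com"  # placeholder host
port = 443

t0 = time.perf_counter()
# DNS resolution + TCP handshake
sock = socket.create_connection((host, port), timeout=5)
t1 = time.perf_counter()

# TLS handshake (performed while wrapping the already-connected socket)
ctx = ssl.create_default_context()
tls = ctx.wrap_socket(sock, server_hostname=host)
t2 = time.perf_counter()

# Request transmission + server processing + response transmission
tls.sendall(f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n".encode())
while tls.recv(4096):
    pass
t3 = time.perf_counter()
tls.close()

print(f"DNS + TCP connect: {(t1 - t0) * 1000:.1f} ms")
print(f"TLS handshake:     {(t2 - t1) * 1000:.1f} ms")
print(f"Request/response:  {(t3 - t2) * 1000:.1f} ms")
print(f"Total:             {(t3 - t0) * 1000:.1f} ms")
```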
In distributed systems, latency is reported as percentiles — p50, p95, p99, p99.9 — not averages. The reason is that long-tail latency dominates user experience. A service with a 100 ms median but a 5-second p99 will frustrate users on roughly one in every 100 requests; at 1,000 requests per second, that is about 10 painfully slow responses every second.
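A minimal sketch of why percentiles are reported instead of the mean, using a made-up sample in which a bit over 1% of requests hit a hypothetical 5-second slow path:

```python
import random

random.seed(0)
# Hypothetical sample: 9,850 fast requests (~100 ms) and 150 slow ones (5 s).
latencies_ms = [random.gauss(100, 10) for _ in range(9_850)] + [5_000.0] * 150

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))
    return ordered[rank]

for p in (50, 95, 99, 99.9):
    print(f"p{p}: {percentile(latencies_ms, p):.0f} ms")

# The mean sits around 170 ms and completely hides the 5-second tail.
print(f"mean: {sum(latencies_ms) / len(latencies_ms):.0f} ms")
```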
Latency is bounded by physics: the speed of light in fiber puts a floor of roughly 80–150 ms on round-trip time between continents. The only ways around this are caching (avoid the trip entirely) and edge compute (move the computation closer to the user).
Always track and budget latency for any user-facing system. Define an SLO (e.g., p99 < 300 ms) and treat regressions as bugs.
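A minimal sketch of treating an SLO regression as a test failure; the 300 ms budget and the measured values are placeholders:

```python
# Hypothetical SLO gate: fail loudly when the measured p99 exceeds the budget.
SLO_P99_MS = 300.0  # latency budget from the SLO (p99 < 300 ms)

def check_latency_slo(measured_p99_ms: float) -> None:
    """Raise if the measured p99 latency violates the SLO."""
    if measured_p99_ms > SLO_P99_MS:
        raise AssertionError(
            f"p99 latency {measured_p99_ms:.0f} ms exceeds the {SLO_P99_MS:.0f} ms budget"
        )

check_latency_slo(240.0)  # within budget: passes silently
check_latency_slo(450.0)  # over budget: raises, treat it like any other bug
```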
Optimizing for latency often costs throughput, money (more replicas, more edge nodes), or simplicity. Latency at the tail (p99+) is much harder to control than the median.
Throughput: The number of operations a system can handle per unit of time, often measured in requests per second (RPS) or queries per second (QPS).
Content delivery network (CDN): A globally distributed network of edge servers that cache static content close to end users to minimize latency and origin load.
Caching: Storing copies of frequently accessed data in fast memory so that subsequent requests can be served without recomputing or refetching.
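A minimal sketch of an in-process cache with a time-to-live; the fetch_user function is a hypothetical stand-in for an expensive database or API call:

```python
import time

class TTLCache:
    """Tiny in-memory cache: entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                       # fresh hit: no recompute, no refetch
        value = compute()                         # miss or expired: do the expensive work
        self._store[key] = (now + self.ttl, value)
        return value

def fetch_user(user_id):
    """Stand-in for a slow backend call."""
    time.sleep(0.2)
    return {"id": user_id, "name": "example"}

cache = TTLCache(ttl_seconds=60)
user = cache.get_or_compute("user:42", lambda: fetch_user(42))  # slow, fills the cache
user = cache.get_or_compute("user:42", lambda: fetch_user(42))  # fast, served from memory
```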
Load balancer: A component that distributes incoming network traffic across multiple backend servers to maximize throughput, minimize response time, and avoid overload.
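A minimal sketch of the round-robin strategy many load balancers use by default; the backend addresses are placeholders, and health checking, retries, and weighting are omitted:

```python
import itertools

class RoundRobinBalancer:
    """Cycle through backends so each one receives an equal share of requests."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick_backend(self):
        return next(self._cycle)

balancer = RoundRobinBalancer(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])

for request_id in range(6):
    backend = balancer.pick_backend()
    print(f"request {request_id} -> {backend}")  # requests spread evenly across backends
```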
Horizontal scaling: Adding more machines to a system to handle increased load, as opposed to making a single machine more powerful.
Vertical scaling: Increasing the capacity of a single machine — more CPU, memory, or disk — to handle more load.