Scalability & Performance
Latency: The time delay between a request being sent and a response being received — typically measured in milliseconds.
Latency measures how long a single operation takes from start to finish. For a web request, end-to-end latency includes DNS resolution, TCP handshake, TLS handshake, request transmission, server processing, response transmission, and client rendering.
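A minimal sketch of observing those phases with only the Python standard library; the host example.com and port 443 are placeholders, and client rendering is omitted since it happens in the browser:

```python
import socket
import ssl
import time

host = "example.com"  # placeholder host
port = 443

t0 = time.perf_counter()
# DNS resolution + TCP handshake
sock = socket.create_connection((host, port), timeout=5)
t1 = time.perf_counter()

# TLS handshake (performed while wrapping the already-connected socket)
ctx = ssl.create_default_context()
tls = ctx.wrap_socket(sock, server_hostname=host)
t2 = time.perf_counter()

# Request transmission + server processing + response transmission
tls.sendall(f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n".encode())
while tls.recv(4096):
    pass
t3 = time.perf_counter()
tls.close()

print(f"DNS + TCP connect: {(t1 - t0) * 1000:.1f} ms")
print(f"TLS handshake:     {(t2 - t1) * 1000:.1f} ms")
print(f"Request/response:  {(t3 - t2) * 1000:.1f} ms")
print(f"Total:             {(t3 - t0) * 1000:.1f} ms")
```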
In distributed systems, latency is reported as percentiles — p50, p95, p99, p99.9 — not averages. The reason is that long-tail latency dominates user experience. A service with a 100 ms median but a 5-second p99 will frustrate users on roughly one in every 100 requests; at 1,000 requests per second, that is about 10 painfully slow responses every second.
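A minimal sketch of why percentiles are reported instead of the mean, using a made-up sample in which a bit over 1% of requests hit a hypothetical 5-second slow path:

```python
import random

random.seed(0)
# Hypothetical sample: 9,850 fast requests (~100 ms) and 150 slow ones (5 s).
latencies_ms = [random.gauss(100, 10) for _ in range(9_850)] + [5_000.0] * 150

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))
    return ordered[rank]

for p in (50, 95, 99, 99.9):
    print(f"p{p}: {percentile(latencies_ms, p):.0f} ms")

# The mean sits around 170 ms and completely hides the 5-second tail.
print(f"mean: {sum(latencies_ms) / len(latencies_ms):.0f} ms")
```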
Latency is bounded by physics: the speed of light in fiber puts a floor of roughly 80–150 ms on round-trip time between continents. The only ways around this are caching (avoid the trip entirely) and edge compute (move the computation closer to the user).
Always track and budget latency for any user-facing system. Define an SLO (e.g., p99 < 300 ms) and treat regressions as bugs.
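A minimal sketch of treating an SLO regression as a test failure; the 300 ms budget and the measured values are placeholders:

```python
# Hypothetical SLO gate: fail loudly when the measured p99 exceeds the budget.
SLO_P99_MS = 300.0  # latency budget from the SLO (p99 < 300 ms)

def check_latency_slo(measured_p99_ms: float) -> None:
    """Raise if the measured p99 latency violates the SLO."""
    if measured_p99_ms > SLO_P99_MS:
        raise AssertionError(
            f"p99 latency {measured_p99_ms:.0f} ms exceeds the {SLO_P99_MS:.0f} ms budget"
        )

check_latency_slo(240.0)  # within budget: passes silently
check_latency_slo(450.0)  # over budget: raises, treat it like any other bug
```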
Optimizing for latency often costs throughput, money (more replicas, more edge nodes), or simplicity. Latency at the tail (p99+) is much harder to control than the median.
Throughput: The number of operations a system can handle per unit of time, often measured in requests per second (RPS) or queries per second (QPS).
Content delivery network (CDN): A globally distributed network of edge servers that cache static content close to end users to minimize latency and origin load.
Caching: Storing copies of frequently accessed data in fast memory so that subsequent requests can be served without recomputing or refetching.
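A minimal sketch of an in-process cache with a time-to-live; the fetch_user function is a hypothetical stand-in for an expensive database or API call:

```python
import time

class TTLCache:
    """Tiny in-memory cache: entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                       # fresh hit: no recompute, no refetch
        value = compute()                         # miss or expired: do the expensive work
        self._store[key] = (now + self.ttl, value)
        return value

def fetch_user(user_id):
    """Stand-in for a slow backend call."""
    time.sleep(0.2)
    return {"id": user_id, "name": "example"}

cache = TTLCache(ttl_seconds=60)
user = cache.get_or_compute("user:42", lambda: fetch_user(42))  # slow, fills the cache
user = cache.get_or_compute("user:42", lambda: fetch_user(42))  # fast, served from memory
```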
Load balancer: A component that distributes incoming network traffic across multiple backend servers to maximize throughput, minimize response time, and avoid overload.
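A minimal sketch of the round-robin strategy many load balancers use by default; the backend addresses are placeholders, and health checking, retries, and weighting are omitted:

```python
import itertools

class RoundRobinBalancer:
    """Cycle through backends so each one receives an equal share of requests."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick_backend(self):
        return next(self._cycle)

balancer = RoundRobinBalancer(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])

for request_id in range(6):
    backend = balancer.pick_backend()
    print(f"request {request_id} -> {backend}")  # requests spread evenly across backends
```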
Horizontal scaling: Adding more machines to a system to handle increased load, as opposed to making a single machine more powerful.
Vertical scaling: Increasing the capacity of a single machine — more CPU, memory, or disk — to handle more load.