Scalability & Performance
Caching: Storing copies of frequently accessed data in fast memory so that subsequent requests can be served without recomputing or refetching.
Caching is the practice of keeping a temporary copy of data in a faster, closer storage tier so that repeated requests do not need to hit the original (slower, more expensive) source. Common cache tiers include CPU caches, in-process memory, distributed in-memory stores like Redis or Memcached, CDN edge caches, and HTTP browser caches.
Caches dramatically reduce latency and load on backend systems. A cache hit might take 1 millisecond; the equivalent database query might take 50–200 milliseconds. At scale, that gap can separate a system that costs thousands to run from one that costs millions.
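A quick back-of-the-envelope calculation shows why hit rate dominates average latency. Using illustrative figures in the ranges above (1 ms per hit, roughly 100 ms per miss):

```python
# Expected request latency under caching. The latency figures are
# illustrative, drawn from the ranges discussed above.
hit_latency_ms = 1
miss_latency_ms = 100

for hit_rate in (0.0, 0.9, 0.99):
    expected = hit_rate * hit_latency_ms + (1 - hit_rate) * miss_latency_ms
    print(f"hit rate {hit_rate:.0%}: ~{expected:.2f} ms average")
```

Moving the hit rate from 90% to 99% cuts average latency by roughly a further factor of five, which is why operators watch hit rate closely.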
The core challenge is invalidation — knowing when cached data has gone stale. Strategies include time-to-live (TTL) expiry, write-through (update cache on every write), write-behind (update cache first, persist later), and cache-aside (application reads cache, falls back to DB on miss, then populates cache).
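The cache-aside pattern with TTL expiry can be sketched as follows. This is a minimal in-process version: a dict with expiry timestamps stands in for Redis or Memcached, and the `db` dict, key names, and TTL value are all illustrative:

```python
import time

# Stand-ins for the real tiers: `db` is the slow source of truth,
# `cache` is the fast tier. Entries are (value, expires_at) pairs.
db = {"user:1": {"name": "Ada"}}
cache = {}
TTL_SECONDS = 60

def get(key):
    entry = cache.get(key)
    if entry is not None:
        value, expires_at = entry
        if time.monotonic() < expires_at:
            return value              # cache hit
        del cache[key]                # TTL expired: treat as a miss
    value = db.get(key)               # miss: fall back to the DB
    if value is not None:
        cache[key] = (value, time.monotonic() + TTL_SECONDS)  # populate
    return value
```

The first `get("user:1")` misses and reads the DB; subsequent calls within the TTL are served from the cache without touching the source of truth.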
Use caching whenever read traffic far exceeds writes, when computing a result is expensive, or when latency to the source of truth is high. Almost every read-heavy system at scale uses caching.
Caches add complexity (a second source of truth) and can serve stale data. They also introduce failure modes like cache stampedes (many clients miss the cache simultaneously and overwhelm the backend) and hot keys (one key receives a disproportionate share of traffic).
CDN (Content Delivery Network): A globally distributed network of edge servers that cache static content close to end users to minimize latency and origin load.
Latency: The time delay between a request being sent and a response being received, typically measured in milliseconds.
Load Balancer: A component that distributes incoming network traffic across multiple backend servers to maximize throughput, minimize response time, and avoid overload.
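The simplest distribution strategy is round-robin: each request goes to the next backend in turn. A minimal sketch, with illustrative server names:

```python
import itertools

# Round-robin: itertools.cycle walks the backend list forever,
# spreading requests evenly. Backend names are illustrative.
backends = ["app-1", "app-2", "app-3"]
rr = itertools.cycle(backends)

def pick_backend():
    return next(rr)

assignments = [pick_backend() for _ in range(6)]
# -> ['app-1', 'app-2', 'app-3', 'app-1', 'app-2', 'app-3']
```

Real load balancers layer health checks, connection counts, or latency measurements on top of this, but the core idea is the same: no single backend absorbs all the traffic.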
Horizontal Scaling (scaling out): Adding more machines to a system to handle increased load, as opposed to making a single machine more powerful.
Vertical Scaling (scaling up): Increasing the capacity of a single machine, with more CPU, memory, or disk, to handle more load.
Throughput: The number of operations a system can handle per unit of time, often measured in requests per second (RPS) or queries per second (QPS).