Scalability & Performance
Caching: Storing copies of frequently accessed data in fast memory so that subsequent requests can be served without recomputing or refetching.
Caching is the practice of keeping a temporary copy of data in a faster, closer storage tier so that repeated requests do not need to hit the original (slower, more expensive) source. Common cache tiers include CPU caches, in-process memory, distributed in-memory stores like Redis or Memcached, CDN edge caches, and HTTP browser caches.
Caches dramatically reduce latency and load on backend systems. A cache hit might take 1 millisecond; the equivalent database query might take 50–200 milliseconds. At scale, that gap can separate a system that costs thousands to run from one that costs millions.
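A quick back-of-the-envelope calculation shows why hit rate dominates average latency. Using illustrative figures in the ranges above (1 ms per hit, roughly 100 ms per miss):

```python
# Expected request latency under caching. The latency figures are
# illustrative, drawn from the ranges discussed above.
hit_latency_ms = 1
miss_latency_ms = 100

for hit_rate in (0.0, 0.9, 0.99):
    expected = hit_rate * hit_latency_ms + (1 - hit_rate) * miss_latency_ms
    print(f"hit rate {hit_rate:.0%}: ~{expected:.2f} ms average")
```

Moving the hit rate from 90% to 99% cuts average latency by roughly a further factor of five, which is why operators watch hit rate closely.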
The core challenge is invalidation — knowing when cached data has gone stale. Strategies include time-to-live (TTL) expiry, write-through (update cache on every write), write-behind (update cache first, persist later), and cache-aside (application reads cache, falls back to DB on miss, then populates cache).
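The cache-aside pattern with TTL expiry can be sketched as follows. This is a minimal in-process version: a dict with expiry timestamps stands in for Redis or Memcached, and the `db` dict, key names, and TTL value are all illustrative:

```python
import time

# Stand-ins for the real tiers: `db` is the slow source of truth,
# `cache` is the fast tier. Entries are (value, expires_at) pairs.
db = {"user:1": {"name": "Ada"}}
cache = {}
TTL_SECONDS = 60

def get(key):
    entry = cache.get(key)
    if entry is not None:
        value, expires_at = entry
        if time.monotonic() < expires_at:
            return value              # cache hit
        del cache[key]                # TTL expired: treat as a miss
    value = db.get(key)               # miss: fall back to the DB
    if value is not None:
        cache[key] = (value, time.monotonic() + TTL_SECONDS)  # populate
    return value
```

The first `get("user:1")` misses and reads the DB; subsequent calls within the TTL are served from the cache without touching the source of truth.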
Use caching whenever read traffic far exceeds writes, when computing a result is expensive, or when latency to the source of truth is high. Almost every read-heavy system at scale uses caching.
Caches add complexity (a second source of truth) and can serve stale data. They also introduce failure modes like cache stampedes (many clients miss the cache simultaneously and overwhelm the backend) and hot keys (one key receives a disproportionate share of traffic).
CDN (Content Delivery Network): A globally distributed network of edge servers that cache static content close to end users to minimize latency and origin load.
Latency: The time delay between a request being sent and a response being received, typically measured in milliseconds.
Load Balancer: A component that distributes incoming network traffic across multiple backend servers to maximize throughput, minimize response time, and avoid overload.
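The simplest distribution strategy is round-robin: each request goes to the next backend in turn. A minimal sketch, with illustrative server names:

```python
import itertools

# Round-robin: itertools.cycle walks the backend list forever,
# spreading requests evenly. Backend names are illustrative.
backends = ["app-1", "app-2", "app-3"]
rr = itertools.cycle(backends)

def pick_backend():
    return next(rr)

assignments = [pick_backend() for _ in range(6)]
# -> ['app-1', 'app-2', 'app-3', 'app-1', 'app-2', 'app-3']
```

Real load balancers layer health checks, connection counts, or latency measurements on top of this, but the core idea is the same: no single backend absorbs all the traffic.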
Horizontal Scaling (scaling out): Adding more machines to a system to handle increased load, as opposed to making a single machine more powerful.
Vertical Scaling (scaling up): Increasing the capacity of a single machine, with more CPU, memory, or disk, to handle more load.
Throughput: The number of operations a system can handle per unit of time, often measured in requests per second (RPS) or queries per second (QPS).