Scalability & Performance
Also known as: Throttling
A control mechanism that caps the number of requests a client can make in a given time window to protect a service from abuse and overload.
Rate limiting restricts how many requests a single client (identified by user ID, API key, IP address, or any combination) can make within a sliding or fixed time window. It is the primary defense against abusive clients, runaway scripts, and accidental request storms.
Common algorithms include the token bucket (tokens regenerate at a fixed rate and each request consumes one, allowing short bursts up to the bucket's capacity), the leaky bucket (requests queue and drain at a fixed rate), the fixed window (count requests per window, e.g. per minute, and reset the count at each window boundary), and the sliding window (a smoother variant that avoids the doubled burst a client can get by straddling a fixed-window boundary).
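A minimal in-process token bucket sketch in Python; the `TokenBucket` name and `allow` method are illustrative, not taken from any particular library:

```python
import time

class TokenBucket:
    """Illustrative token bucket: `rate` tokens regenerate per second,
    up to `capacity`; each request consumes one token."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity   # maximum burst size
        self.tokens = capacity     # start full
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Regenerate tokens for the elapsed interval, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Usage: allow a steady 5 requests/second with bursts of up to 10.
bucket = TokenBucket(rate=5, capacity=10)
if not bucket.allow():
    print("429 Too Many Requests")
```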
In a distributed system, rate-limiting state must be shared across all nodes. Redis is a popular backend because its single-threaded execution model makes commands such as INCR and EXPIRE atomic. Some systems push rate limiting to the edge (CDN, API gateway) so that abusive traffic is rejected before it ever reaches the origin.
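A sketch of a shared fixed-window counter on Redis using the redis-py client; the key scheme and default limits are assumptions, and production systems often prefer a Lua script or a sliding-window variant:

```python
import time
import redis  # assumes the redis-py client is installed

r = redis.Redis()  # assumes a reachable Redis instance

def allow(client_id: str, limit: int = 100, window: int = 60) -> bool:
    """Fixed-window counter: at most `limit` requests per `window` seconds.
    The pipeline wraps INCR and EXPIRE in a MULTI/EXEC transaction, so the
    pair executes atomically on the server."""
    # Hypothetical key scheme: one counter per client per window number.
    key = f"rl:{client_id}:{int(time.time()) // window}"
    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, window)  # cleanup; the key changes each window anyway
    count, _ = pipe.execute()
    return count <= limit
```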
Apply rate limiting to every public API, every login endpoint, every expensive operation, and any endpoint that triggers downstream side effects (email, SMS, payment).
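One way to apply limits uniformly across such endpoints is a decorator that wraps each handler; this is a hypothetical sketch assuming the `TokenBucket` class from the earlier example is in scope:

```python
from functools import wraps

def rate_limited(rate: float, capacity: float):
    """Attach an independent TokenBucket (defined above) to a handler."""
    bucket = TokenBucket(rate=rate, capacity=capacity)

    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            if not bucket.allow():
                # Reject instead of calling the handler at all.
                return {"status": 429, "error": "rate limit exceeded"}
            return handler(*args, **kwargs)
        return wrapper
    return decorator

# Login triggers expensive checks and lockout logic, so it gets a tight limit.
@rate_limited(rate=1, capacity=5)
def login(username: str, password: str):
    ...
```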
Aggressive rate limiting frustrates legitimate users; lax rate limiting leaves you exposed. Distributed rate limiting requires coordination, adding latency. Per-user limits can be circumvented by clients using many accounts or IPs.
Related concepts:
API gateway: a single entry point that routes external requests to internal services, handling concerns like authentication, rate limiting, and request transformation in one place.
Circuit breaker: a pattern that stops calls to a failing downstream service for a cool-off period to prevent cascading failures and give the service time to recover.
Idempotency: a property of operations such that performing them multiple times has the same effect as performing them once — essential for safe retries.
Caching: storing copies of frequently accessed data in fast memory so that subsequent requests can be served without recomputing or refetching.
CDN (content delivery network): a globally distributed network of edge servers that cache static content close to end users to minimize latency and origin load.
Load balancer: a component that distributes incoming network traffic across multiple backend servers to maximize throughput, minimize response time, and avoid overload.