Reliability & Resilience
Also known as: Exponential Backoff
A reliability pattern that re-attempts failed operations after progressively longer delays, optionally with jitter, to ride out transient failures.
Retrying failed operations is the simplest reliability pattern and one of the easiest to get wrong. The naive approach — retry immediately on failure — turns a hiccup into a flood, since every client hits the recovering service simultaneously. Exponential backoff multiplies the delay with each retry (1s, 2s, 4s, 8s, …), spreading out the load. Adding jitter (random variation) prevents synchronized retry waves from many clients.
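The delay schedule above can be sketched in a few lines. This is a minimal illustration using the "full jitter" strategy (a random delay between zero and the exponential cap); the function name `backoff_delay` and the default base and cap values are illustrative choices, not prescribed by this pattern.

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter.

    The uncapped delay doubles each attempt (base * 2**attempt);
    the cap keeps late retries bounded, and the random draw spreads
    clients out so they don't retry in lockstep.
    """
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```

With the defaults, attempt 0 waits up to 1s, attempt 3 up to 8s, and attempt 10 is clamped to the 30s cap. Other jitter strategies exist (e.g. "equal jitter", which keeps half the delay deterministic); full jitter is simply the easiest to state.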
Not every error is retryable. 5xx errors and timeouts often warrant a retry; 4xx errors usually do not (the request itself is wrong). Idempotent operations are always safe to retry; non-idempotent ones need an idempotency key first. Cap the total retry budget — both per request and per time window — to avoid runaway behavior.
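A retry loop that respects those rules might look like the following sketch. The `HTTPError` class, the `call_with_retries` helper, and the choice of which 5xx codes count as retryable are assumptions for illustration; real clients would also classify timeouts and connection errors as retryable.

```python
import random
import time

RETRYABLE_STATUS = {500, 502, 503, 504}  # transient server-side failures

class HTTPError(Exception):
    """Illustrative error carrying an HTTP status code."""
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status

def call_with_retries(op, max_attempts: int = 4, base: float = 1.0, cap: float = 30.0):
    """Run `op`, retrying only retryable errors, with a hard attempt cap.

    4xx errors are raised immediately: the request itself is wrong, so
    repeating it cannot help. The attempt cap is the per-request retry
    budget mentioned above.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except HTTPError as e:
            if e.status not in RETRYABLE_STATUS or attempt == max_attempts - 1:
                raise  # not retryable, or budget exhausted
            time.sleep(random.uniform(0.0, min(cap, base * 2 ** attempt)))
```

A per-time-window budget (e.g. "no more than 10% of traffic may be retries") would sit outside this function, shared across requests.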
At the system level, retries often pair with circuit breakers (stop retrying when the downstream is clearly down) and dead-letter queues (after N failed retries, move the message aside for human inspection).
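A circuit breaker can be sketched as a small state machine wrapped around the call. The class below is a minimal illustration: the name `CircuitBreaker`, the consecutive-failure threshold, and the fixed cool-off period are all simplifying assumptions; production implementations usually track failure rates and add an explicit half-open trial state.

```python
import time

class CircuitBreaker:
    """Fail fast after `threshold` consecutive failures.

    Closed: calls pass through. Open: calls are rejected immediately
    for `cooldown` seconds, sparing the struggling downstream. After
    the cooldown, one trial call is let through; success closes the
    circuit, failure re-opens it.
    """
    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, op):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cooldown elapsed: allow a trial call
        try:
            result = op()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success resets the count and closes the circuit
        return result
```

Note the division of labor: the retry loop handles brief blips, the breaker handles sustained outages, and the dead-letter queue catches whatever survives both.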
Apply retry-with-backoff-and-jitter to every network call that might transiently fail: HTTP, database connections, queue operations, external APIs.
Aggressive retries can overwhelm a struggling downstream. Long retry chains increase user-perceived latency. Without idempotency, retries can duplicate side effects.
Idempotency: A property of operations such that performing them multiple times has the same effect as performing them once — essential for safe retries.
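To make a non-idempotent operation safe to retry, a client attaches a unique key and the server deduplicates on it. The sketch below uses an in-memory dict and a hypothetical `charge` function purely for illustration; a real service would store keys and results durably.

```python
import uuid

processed: dict[str, str] = {}  # idempotency key -> stored result

def charge(amount: float, idempotency_key: str) -> str:
    """Perform the side effect at most once per key.

    A retried duplicate finds its key already recorded and gets the
    original result back instead of charging the customer twice.
    """
    if idempotency_key in processed:
        return processed[idempotency_key]
    result = f"charged {amount}"  # the side effect happens only here
    processed[idempotency_key] = result
    return result

key = str(uuid.uuid4())  # the client generates one key per logical request
assert charge(10.0, key) == charge(10.0, key)  # the retry is now harmless
```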
Circuit Breaker: A pattern that stops calls to a failing downstream service for a cool-off period to prevent cascading failures and give the service time to recover.
Rate Limiting: A control mechanism that caps the number of requests a client can make in a given time window to protect a service from abuse and overload.
Message Queue: A buffer that holds messages between producers and consumers, enabling asynchronous processing and decoupling of services.
SLI / SLO / SLA: Service Level Indicator (the metric), Service Level Objective (the target), and Service Level Agreement (the contract with consequences).
Graceful Degradation: Designing a system so that when a component fails, the rest of the system continues to operate with reduced functionality rather than failing completely.