Reliability & Resilience
A pattern that stops calls to a failing downstream service for a cool-off period to prevent cascading failures and give the service time to recover.
A circuit breaker wraps calls to a downstream service. When failures exceed a threshold (e.g., 50% of calls in the last 10 seconds), the breaker "trips" into the open state, and subsequent calls fail immediately without contacting the downstream service. After a cool-off period, the breaker enters a half-open state and lets a limited number of calls through to test recovery. If they succeed, the breaker closes and normal traffic resumes; if they fail, the breaker opens again and the cool-off period restarts.
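The closed, open, and half-open states described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the names CircuitBreaker, CircuitOpenError, failure_threshold, and cooldown_seconds are hypothetical, the breaker trips on consecutive failures rather than a failure rate over a time window, and it is not thread-safe.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # normal operation, calls pass through
    OPEN = "open"            # failing fast, downstream is not contacted
    HALF_OPEN = "half_open"  # cool-off elapsed, a trial call is allowed


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the downstream."""


class CircuitBreaker:
    # Assumption: trip after N consecutive failures (simpler than a rate window).
    def __init__(self, failure_threshold=3, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                raise CircuitOpenError("circuit is open; failing fast")
            # Cool-off has elapsed: let this call through as a trial.
            self.state = State.HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many failures while closed, opens the breaker.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()

    def _on_success(self):
        # Any success resets the failure count and closes the breaker.
        self.failures = 0
        self.state = State.CLOSED
```

A caller would wrap each downstream request, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a placeholder for the real client call, and catch CircuitOpenError to return a fallback response. Libraries typically add what this sketch leaves out: sliding-window failure rates, a cap on concurrent trial calls in the half-open state, and locking.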
The pattern prevents two failure modes. First, cascading failure: a slow downstream chokes the upstream service's thread pool, which then chokes its upstream, until the whole system is gridlocked. Second, retry storms: every client retrying a failing service drives load even higher and prevents recovery.
Libraries such as Netflix Hystrix (now in maintenance mode), Resilience4j, and Polly implement circuit breakers inside the application, while service meshes such as Istio and Linkerd apply them at the sidecar proxy, which makes circuit breaking a configuration concern rather than application code.
Wrap every cross-service call with a circuit breaker, especially in microservice architectures where one slow service can drag down many others.
Circuit breakers introduce additional behavior to test and tune. Fallback responses may confuse users. A breaker that trips too eagerly rejects calls the downstream service could have handled; one that trips too late defeats the purpose.
A reliability pattern that re-attempts failed operations after progressively longer delays, optionally with jitter, to ride out transient failures.
A control mechanism that caps the number of requests a client can make in a given time window to protect a service from abuse and overload.
Designing a system so that when a component fails, the rest of the system continues to operate with reduced functionality rather than failing completely.
A property of operations such that performing them multiple times has the same effect as performing them once — essential for safe retries.
Service Level Indicator (the metric), Service Level Objective (the target), and Service Level Agreement (the contract with consequences).