Reliability & Resilience
Designing a system so that when a component fails, the rest of the system continues to operate with reduced functionality rather than failing completely.
Graceful degradation is the practice of designing services to fail partially rather than completely. When a recommendation service is down, the homepage still loads — just without the personalized carousel. When the comments service is slow, the article still renders — comments show a "loading…" placeholder or are hidden entirely. The user sees a worse experience but not a broken one.
Implementing graceful degradation typically combines several techniques: timeouts on every dependency, circuit breakers that trip on excess failures, fallback values (cached, default, or empty), feature flags that can disable expensive features under load, and prioritization of critical paths over nice-to-haves.
The alternative — a single dependency outage takes down the whole product — is almost always worse for users and worse for the business. Designing for partial failure is what separates merely working systems from genuinely robust ones.
Build graceful degradation into every system that has independent components. The richer the product surface, the more critical it becomes.
Graceful degradation requires more code paths, more fallback content, more testing, and more thought during design. It is easy to skip and very expensive to retrofit.
A pattern that stops calls to a failing downstream service for a cool-off period to prevent cascading failures and give the service time to recover.
A reliability pattern that re-attempts failed operations after progressively longer delays, optionally with jitter, to ride out transient failures.
A property of operations such that performing them multiple times has the same effect as performing them once — essential for safe retries.
Service Level Indicator (the metric), Service Level Objective (the target), and Service Level Agreement (the contract with consequences).