Reliability & Resilience
A property of operations such that performing them multiple times has the same effect as performing them once — essential for safe retries.
An operation is idempotent if calling it any number of times has the same effect on the system as calling it once. Reading a value (GET) is naturally idempotent because it changes nothing. Setting a value (PUT user.name = "Alice") is idempotent because applying it twice still results in the same name. Incrementing a counter (POST /counter/inc) is not idempotent: two calls produce two increments.
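The difference between setting and incrementing can be shown with a small sketch. The `user` and `counter` dictionaries and both function names are hypothetical stand-ins for server-side state, not part of any real API.

```python
# Hypothetical in-memory state standing in for a server's data store.
user = {"name": "Bob"}
counter = {"value": 0}


def put_name(name: str) -> None:
    """Idempotent: the final state depends only on the argument."""
    user["name"] = name


def increment() -> None:
    """Not idempotent: every call changes the state again."""
    counter["value"] += 1


put_name("Alice")
put_name("Alice")  # a retry leaves the state unchanged

increment()
increment()  # a retry adds a second increment

print(user["name"], counter["value"])
```

After one call and one retry of each, the name is still "Alice" but the counter has moved twice.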
Idempotency matters because networks fail. A client sends a request, gets no response, and must decide: did the server receive it or not? With an idempotent operation, the safe answer is always "retry". With a non-idempotent operation, retrying risks double-charging, double-sending, or double-creating.
The standard technique for making non-idempotent operations safe is the idempotency key. The client generates a unique key per logical operation and includes it in the request; the server records the key alongside the result. If the same key arrives again, the server returns the stored result instead of re-executing the operation. Stripe's API and a number of AWS APIs use this pattern, as do many payment APIs.
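A minimal server-side sketch of the idempotency-key flow described above might look like the following. The `IdempotentHandler` class, the `charge` function, and the in-memory dictionary are all assumptions made for illustration. The sketch is single-threaded; a real service would likely need a durable store and an atomic "insert key if absent" step so that two concurrent requests with the same key cannot both execute.

```python
import uuid
from typing import Any, Callable, Dict, List


class IdempotentHandler:
    """Wraps an operation so repeated keys return the stored result."""

    def __init__(self, operation: Callable[[int], Any]) -> None:
        self._operation = operation
        self._results: Dict[str, Any] = {}  # key -> result of first execution

    def handle(self, key: str, amount: int) -> Any:
        # Same key seen before: return the recorded result, do not re-execute.
        if key in self._results:
            return self._results[key]
        result = self._operation(amount)
        self._results[key] = result
        return result


# Hypothetical non-idempotent operation: each call records a new charge.
charges: List[int] = []


def charge(amount: int) -> Dict[str, int]:
    charges.append(amount)
    return {"charge_id": len(charges), "amount": amount}


handler = IdempotentHandler(charge)

# The client generates one key per logical operation and reuses it on retry.
key = str(uuid.uuid4())
first = handler.handle(key, 500)
retry = handler.handle(key, 500)  # e.g. sent again after a timeout
```

Here `retry` is the same response as `first`, and `charges` holds a single entry, so the customer is charged once even though the request arrived twice.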
Make every state-changing operation idempotent — payments, user creation, order placement, message processing. Network retries are inevitable; idempotency makes them safe.
Idempotency keys require server-side state (typically with an expiration). Some operations are genuinely hard to make idempotent without restructuring the data model.
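The server-side state with expiration mentioned above could be sketched as a small time-limited store. `ExpiringKeyStore`, its methods, and the 60-second window are hypothetical choices for this example; the clock is injectable only so the expiry can be demonstrated without waiting.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class ExpiringKeyStore:
    """Keeps idempotency results for a limited time, then forgets them."""

    def __init__(
        self,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (stored_at, result)

    def put(self, key: str, result: Any) -> None:
        self._entries[key] = (self._clock(), result)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, result = entry
        # Drop the entry once it is older than the retention window.
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]
            return None
        return result


# Fake clock (an assumption for the demo) so time can be advanced manually.
now = [0.0]
store = ExpiringKeyStore(ttl_seconds=60, clock=lambda: now[0])

store.put("key-1", "charged")
within_window = store.get("key-1")  # still remembered

now[0] = 61.0
after_expiry = store.get("key-1")  # forgotten after the window
```

The trade-off is visible in the last line: a retry that arrives after the key has expired is treated as a new request, so the retention window should be longer than the longest period over which clients are expected to retry.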
A reliability pattern that re-attempts failed operations after progressively longer delays, optionally with jitter, to ride out transient failures.
A buffer that holds messages between producers and consumers, enabling asynchronous processing and decoupling of services.
An HTTP callback that one system sends to another to notify it of an event, enabling push-style integrations between services.
A pattern that stops calls to a failing downstream service for a cool-off period to prevent cascading failures and give the service time to recover.
Service Level Indicator (the metric), Service Level Objective (the target), and Service Level Agreement (the contract with consequences).
Designing a system so that when a component fails, the rest of the system continues to operate with reduced functionality rather than failing completely.