Reliability & Resilience
Service Level Indicator (the metric), Service Level Objective (the target), and Service Level Agreement (the contract with consequences).
These three terms — popularized by Google's SRE practice — form a hierarchy of service quality. An SLI (Service Level Indicator) is a measurable metric: request success rate, p99 latency, time to first byte. An SLO (Service Level Objective) is the target you commit to internally: "99.9% of requests succeed", "p99 latency under 300 ms over a 28-day window". An SLA (Service Level Agreement) is the external, contractual version, typically with financial penalties for breach.
The SLO is the most important of the three for engineering practice. It defines an error budget: the small fraction of failures you can tolerate before missing the target. A 99.9% success target, for example, leaves a budget of 0.1% of requests. If you are well within budget, you can ship faster and take more risks. If you are burning through the budget faster than the window allows, slow down, harden the system, and prioritize reliability work.
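The error-budget arithmetic can be sketched in a few lines of Python. This is a minimal illustration for a request-based SLO, not a standard API: the function name `error_budget_report` and its return fields are assumptions made for this example.

```python
def error_budget_report(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarise error-budget usage for a request-based SLO (illustrative helper)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the share of requests that may fail without missing the target.
    allowed_failures = (1.0 - slo_target) * total_requests
    # The SLI here is the observed success rate over the same window.
    sli = (total_requests - failed_requests) / total_requests

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": failed_requests / allowed_failures,  # 1.0 means fully spent
        "slo_met": sli >= slo_target,
    }


if __name__ == "__main__":
    # 250 failures out of 1,000,000 requests against a 99.9% target.
    print(error_budget_report(0.999, 1_000_000, 250))
```

With a 99.9% target and one million requests in the window, the budget is roughly 1,000 failed requests, so 250 failures consume about a quarter of it. The same arithmetic applies to time-based targets: 0.1% of a 28-day window is about 40 minutes.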
SLOs work best when they are user-centered (measure what users care about, not internal proxies), few in number (one or two per service), and tied to operational decisions (alerts, postmortems, release decisions).
Define SLOs for every user-facing service. The conversation about what number to pick is the most valuable part of the exercise.
SLOs require honest measurement and the discipline to act on the budget. Set them too high and you cripple velocity; too low and you erode user trust.
The time delay between a request being sent and a response being received — typically measured in milliseconds.
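Latency targets are usually stated as percentiles, such as p99, rather than averages. The sketch below times a call in milliseconds and computes a percentile with the nearest-rank method. The helper names `timed_call_ms` and `percentile` are assumptions for this example, and monitoring systems may use other percentile definitions (interpolation or histogram buckets, for instance).

```python
import time
from typing import Callable, Sequence


def timed_call_ms(func: Callable[[], object]) -> float:
    """Return how long func() took, in milliseconds."""
    start = time.perf_counter()
    func()
    return (time.perf_counter() - start) * 1000.0


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be an integer between 1 and 100")
    ordered = sorted(samples)
    # ceil(p * n / 100) computed with integer arithmetic to avoid float rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


if __name__ == "__main__":
    # Time 200 calls of a small stand-in workload and report the p99.
    latencies = [timed_call_ms(lambda: sum(range(1000))) for _ in range(200)]
    print(f"p99 latency: {percentile(latencies, 99):.3f} ms")
```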
The number of operations a system can handle per unit of time, often measured in requests per second (RPS) or queries per second (QPS).
A pattern that stops calls to a failing downstream service for a cool-off period to prevent cascading failures and give the service time to recover.
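A minimal sketch of this pattern follows, assuming a consecutive-failure threshold and a single trial call once the cool-off has elapsed. The class name, parameters, and the injectable `clock` are choices made for this example; it is not thread-safe and omits refinements that production libraries typically provide.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,  # injectable so tests can fake time
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.cooldown_seconds:
                # Still cooling off: fail fast without touching the downstream service.
                raise CircuitOpenError("circuit open; skipping downstream call")
            # Cool-off elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each downstream request, for example `breaker.call(lambda: fetch_profile(user_id))` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as a signal to fail fast or fall back.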
A property of operations such that performing them multiple times has the same effect as performing them once — essential for safe retries.
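One common way to get this property for an operation that is not naturally repeatable is an idempotency key: the client sends the same key on every retry, and the server replays the stored result instead of redoing the work. The sketch below uses an in-memory dictionary and a hypothetical `PaymentService`; a real service would need durable storage and protection against concurrent requests with the same key.

```python
from typing import Dict


class PaymentService:
    """Illustrative service that deduplicates charges by idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}  # idempotency key -> stored result
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A repeated key means this is a retry: replay the original result.
        if idempotency_key in self._results:
            return self._results[idempotency_key]

        # First time this key is seen: perform the side effect exactly once.
        self.charges_made += 1
        result = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


if __name__ == "__main__":
    service = PaymentService()
    first = service.charge("order-42", 1999)
    retry = service.charge("order-42", 1999)  # e.g. the client timed out and retried
    print(first == retry, service.charges_made)
```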
A reliability pattern that re-attempts failed operations after progressively longer delays, optionally with jitter, to ride out transient failures.
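As a sketch, the delay can double on each attempt up to a cap, with the actual sleep drawn at random between zero and that cap (often called "full jitter"). The function name, defaults, and the injectable `sleep` and `rng` parameters are assumptions for this example. Real code should restrict `retry_on` to errors known to be transient and retry only idempotent operations.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,  # injectable for tests
    rng: Callable[[], float] = random.random,     # returns a float in [0, 1)
) -> T:
    """Call func, retrying on failure with capped exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The upper bound doubles each attempt: base, 2*base, 4*base, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random fraction of the cap so clients do not retry in lockstep.
            sleep(rng() * cap)
    raise AssertionError("unreachable")
```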
Designing a system so that when a component fails, the rest of the system continues to operate with reduced functionality rather than failing completely.
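A small sketch of the idea: if a personalized recommendations backend fails, the page still renders with a generic list and records that it is degraded. The function and parameter names here are hypothetical, and catching every `Exception` is a simplification for illustration.

```python
from typing import Callable, List, Sequence


def get_recommendations(
    user_id: str,
    fetch_personalized: Callable[[str], List[str]],
    fallback_items: Sequence[str],
) -> dict:
    """Return personalized items, or a generic fallback if the backend fails."""
    try:
        return {"items": fetch_personalized(user_id), "degraded": False}
    except Exception:
        # Reduced functionality: serve a static list instead of failing the whole page.
        return {"items": list(fallback_items), "degraded": True}


if __name__ == "__main__":
    def broken_backend(user_id: str) -> List[str]:
        raise TimeoutError("recommendation service unavailable")

    print(get_recommendations("u1", broken_backend, ["bestseller-1", "bestseller-2"]))
```

Returning a `degraded` flag lets callers and dashboards see how often the fallback path is used, which can feed back into SLO measurement.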