SystemCity

© 2026 SystemCity. All rights reserved.

Reliability & Resilience

SLA, SLO, SLI

Service Level Indicator (the metric), Service Level Objective (the target), and Service Level Agreement (the contract with consequences).

In depth

These three terms, popularized by Google's SRE practice, form a hierarchy of service quality. An SLI (Service Level Indicator) is a measurable metric: request success rate, p99 latency, time to first byte. An SLO (Service Level Objective) is the target you commit to internally for an SLI over a stated window: "99.9% of requests succeed over a 28-day window", "p99 latency under 300 ms over a 28-day window". An SLA (Service Level Agreement) is the external, contractual version, typically with financial penalties for breach, and it is usually set looser than the internal SLO so that the team misses its own target before it breaches the contract.
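The relationship between an SLI and an SLO can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `WindowCounts` type, the function names, and the request counts are all hypothetical, and treating an empty window as meeting the objective is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowCounts:
    """Request counts observed over one SLO window (hypothetical shape)."""
    total: int
    successful: int


def availability_sli(counts: WindowCounts) -> float:
    """SLI: the fraction of requests that succeeded in the window."""
    if counts.total == 0:
        # Assumption: a window with no traffic counts as fully successful.
        return 1.0
    return counts.successful / counts.total


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO check: the measured SLI is at or above the target."""
    return sli >= slo_target


# Made-up numbers for illustration: 800 failures out of one million requests.
counts = WindowCounts(total=1_000_000, successful=999_200)
sli = availability_sli(counts)
print(f"SLI = {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

The SLI is only the measurement; the SLO is the comparison against a chosen target. An SLA would add a contractual consequence on top of a check like `meets_slo`.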

The SLO is the most important of the three for engineering practice. It defines an error budget: the small fraction of failures you are allowed before missing the target. For a 99.9% success target, the budget is 0.1% of the requests in the window. If you are well within budget, you can ship faster and take more risks. If you are burning the budget faster than the window allows, slow down, harden the system, and focus on reliability work.
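As a rough sketch of the error budget arithmetic, the snippet below computes how many failures a request-based SLO tolerates and how much of that budget is left. The function names, the traffic figures, and the simple "ship or freeze" rule are hypothetical; real policies are usually more nuanced.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates in the window."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target or an empty window leaves no budget to spend.
        return 0.0
    return 1.0 - failed_requests / budget


def release_decision(remaining: float) -> str:
    """Hypothetical policy: keep shipping while any budget remains."""
    return "ship" if remaining > 0 else "freeze and do reliability work"


# Made-up numbers: 99.9% target, one million requests, 800 failures so far.
remaining = budget_remaining(0.999, 1_000_000, 800)
print(f"budget: {error_budget(0.999, 1_000_000):.0f} failed requests")
print(f"remaining: {remaining:.0%} -> {release_decision(remaining)}")
```

With these numbers the budget is about 1,000 failed requests, of which 800 are spent, so roughly a fifth of the budget is left for the rest of the window.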

SLOs work best when they are user-centered (measure what users care about, not internal proxies), few in number (one or two per service), and tied to operational decisions (alerts, postmortems, release decisions).

When to use

Define SLOs for every user-facing service. The conversation about what number to pick is the most valuable part of the exercise.

Tradeoffs

SLOs require honest measurement and the discipline to act on the budget. Set them too high and you cripple velocity; too low and you erode user trust.
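To make the cost of a higher target concrete, here is a small sketch that converts an availability target into allowed downtime over a 28-day window. It assumes a time-based availability SLO (fraction of the window the service is up) rather than a request-based one; `allowed_downtime` and the list of targets are illustrative choices.

```python
from datetime import timedelta

# Window length taken from the 28-day example; other windows work the same way.
WINDOW = timedelta(days=28)


def allowed_downtime(slo_target: float, window: timedelta = WINDOW) -> timedelta:
    """Total downtime a time-based availability target tolerates per window."""
    return window * (1.0 - slo_target)


for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime(target).total_seconds() / 60
    print(f"{target:.2%} target -> {minutes:.1f} minutes of downtime per 28 days")
```

Each extra nine cuts the allowed downtime by a factor of ten: about 40 minutes per 28 days at 99.9%, and about 4 minutes at 99.99%. That shrinking budget is what constrains release velocity as the target rises.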

Related terms

Latency

The time delay between a request being sent and a response being received — typically measured in milliseconds.

Throughput

The number of operations a system can handle per unit of time, often measured in requests per second (RPS) or queries per second (QPS).

Circuit Breaker

A pattern that stops calls to a failing downstream service for a cool-off period to prevent cascading failures and give the service time to recover.

Idempotency

A property of operations such that performing them multiple times has the same effect as performing them once — essential for safe retries.

Retry & Backoff

A reliability pattern that re-attempts failed operations after progressively longer delays, optionally with jitter, to ride out transient failures.

Graceful Degradation

Designing a system so that when a component fails, the rest of the system continues to operate with reduced functionality rather than failing completely.