Reliability & Resilience

Retry & Backoff

Also known as: Exponential Backoff

A reliability pattern that re-attempts failed operations after progressively longer delays, optionally with jitter, to ride out transient failures.

In depth

Retrying failed operations is the simplest reliability pattern and one of the easiest to get wrong. The naive approach — retry immediately on failure — turns a hiccup into a flood, since every client hits the recovering service simultaneously. Exponential backoff multiplies the delay with each retry (1s, 2s, 4s, 8s, …), spreading out the load. Adding jitter (random variation) prevents synchronized retry waves from many clients.
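
A minimal sketch of the pattern in Python (the names TransientError and retry_with_backoff are illustrative, not from any particular library):

    import random
    import time

    class TransientError(Exception):
        """Stand-in for failures worth retrying (timeouts, 5xx, ...)."""

    def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=30.0):
        """Call operation(); on transient failure, wait exponentially
        longer between attempts, with full jitter."""
        for attempt in range(max_attempts):
            try:
                return operation()
            except TransientError:
                if attempt == max_attempts - 1:
                    raise  # retry budget exhausted: surface the failure
                # Exponential backoff: base * 2^attempt, capped at max_delay.
                delay = min(base_delay * 2 ** attempt, max_delay)
                # Full jitter: sleep a random amount in [0, delay) so clients
                # that failed together do not retry in lockstep.
                time.sleep(random.uniform(0, delay))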

Not every error is retryable. 5xx errors and timeouts often warrant a retry; 4xx errors usually do not (the request itself is wrong). Idempotent operations are always safe to retry; non-idempotent ones need an idempotency key first. Cap the total retry budget — both per request and per time window — to avoid runaway behavior.
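
A sketch of that classification for an HTTP API, plus an idempotency key (the Idempotency-Key header is a common convention, used by payment APIs such as Stripe, not a universal standard):

    import uuid

    # Timeouts and most 5xx responses signal transient trouble; 429 is the
    # one 4xx usually worth retrying, after backing off. Other 4xx responses
    # mean the request itself is wrong, so retrying just repeats the mistake.
    RETRYABLE_STATUSES = {429, 500, 502, 503, 504}

    def is_retryable(status_code):
        return status_code in RETRYABLE_STATUSES

    # For a non-idempotent operation ("charge this card"), generate the key
    # once, before the first attempt, and resend the same key on every retry
    # so the server can deduplicate.
    headers = {"Idempotency-Key": str(uuid.uuid4())}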

At the system level, retries often pair with circuit breakers (stop retrying when the downstream is clearly down) and dead-letter queues (after N failed retries, move the message aside for human inspection).
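
A sketch of the dead-letter half of that pairing, assuming a broker client with a publish method and a message that carries its own retry count (both invented here for illustration):

    from dataclasses import dataclass

    MAX_RETRIES = 5

    @dataclass
    class QueueMessage:
        body: bytes
        retry_count: int = 0

    def handle(message, process, queue, dead_letter_queue):
        """Try to process a message; re-enqueue on failure, and park it in
        the dead-letter queue once its retry budget is spent."""
        try:
            process(message.body)
        except Exception:
            if message.retry_count >= MAX_RETRIES:
                dead_letter_queue.publish(message)  # set aside for a human
            else:
                message.retry_count += 1
                queue.publish(message)  # broker redelivers after a delay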

When to use

Apply retry-with-backoff-and-jitter to every network call that might transiently fail: HTTP requests, database connections, queue operations, calls to external APIs.
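
A sketch of what that looks like at a call site, reusing the retry_with_backoff helper and TransientError from the sketch above (the endpoint is a placeholder):

    import requests

    def fetch_profile(user_id):
        try:
            resp = requests.get(f"https://api.example.com/users/{user_id}",
                                timeout=2)
        except (requests.Timeout, requests.ConnectionError) as exc:
            raise TransientError() from exc         # network blip: retry
        if resp.status_code >= 500:
            raise TransientError(resp.status_code)  # server trouble: retry
        resp.raise_for_status()  # 4xx: the request is wrong, fail fast
        return resp.json()

    profile = retry_with_backoff(lambda: fetch_profile("42"))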

Tradeoffs

Aggressive retries can overwhelm a struggling downstream. Long retry chains increase user-perceived latency. Without idempotency, retries can duplicate side effects.

Related terms

Idempotency

A property of operations such that performing them multiple times has the same effect as performing them once — essential for safe retries.

Circuit Breaker

A pattern that stops calls to a failing downstream service for a cool-off period to prevent cascading failures and give the service time to recover.

Rate Limiting

A control mechanism that caps the number of requests a client can make in a given time window to protect a service from abuse and overload.

Message Queue

A buffer that holds messages between producers and consumers, enabling asynchronous processing and decoupling of services.

SLA, SLO, SLI

Service Level Indicator (the metric), Service Level Objective (the target), and Service Level Agreement (the contract with consequences).

Graceful Degradation

Designing a system so that when a component fails, the rest of the system continues to operate with reduced functionality rather than failing completely.

Practice this concept

Medium · Infrastructure

Design a Webhook Notification Service

Hard · Finance

Design a Scheduled Digital Transaction System