SystemCity

AI-powered system design tutor. Learn architecture, ace interviews, build real systems.

© 2026 SystemCity. All rights reserved.



Scalability & Performance

Throughput

The number of operations a system can handle per unit of time, often measured in requests per second (RPS) or queries per second (QPS).

In depth

Throughput measures the rate at which a system processes work. Common units include requests per second, transactions per second, messages per second, or bytes per second. Throughput and latency are related but distinct: a system can have high throughput and high latency simultaneously (think of a long assembly line that produces a finished product every second but takes an hour for any single item to traverse).
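The assembly-line relationship above is captured by Little's Law, which ties concurrency, throughput, and latency together. A minimal sketch (the function name and numbers are illustrative, not from any particular system):

```python
def throughput_rps(concurrency: int, latency_s: float) -> float:
    """Steady-state throughput implied by Little's Law: L = lambda * W,
    so throughput (lambda) = items in flight (L) / time per item (W)."""
    return concurrency / latency_s

# An assembly line holding 3600 items, each taking an hour end to end,
# still finishes one item per second -- high throughput, high latency.
print(throughput_rps(concurrency=3600, latency_s=3600.0))  # 1.0
```

The same formula works in reverse: given a target throughput and a measured latency, it tells you how much concurrency the system must sustain.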

Throughput is bounded by the slowest stage in the pipeline. A web service might be CPU-bound, network-bound, database-bound, or I/O-bound, and the bottleneck shifts as you scale. Profiling and load testing are essential for identifying which stage is the current limit.

For system design interviews, you should be able to do back-of-envelope calculations: a single modern server can typically handle a few thousand to tens of thousands of requests per second for simple workloads, and far fewer for complex ones.
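One such back-of-envelope calculation is sizing a fleet from peak traffic. This sketch assumes hypothetical numbers (50k peak QPS, 5k QPS per server) and a 50% headroom target so spikes don't saturate any one machine:

```python
import math

def servers_needed(peak_qps: float, per_server_qps: float,
                   headroom: float = 0.5) -> int:
    """Estimate server count, targeting each server at a fraction
    (`headroom`) of its measured capacity."""
    return math.ceil(peak_qps / (per_server_qps * headroom))

# 50,000 peak QPS / (5,000 QPS * 0.5 utilization target) = 20 servers
print(servers_needed(50_000, 5_000))  # 20
```

In an interview, stating the assumptions (per-server capacity, utilization target) matters as much as the arithmetic itself.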

When to use

Throughput is the primary scaling metric for batch systems, message queues, ingestion pipelines, and write-heavy databases.

Tradeoffs

Maximizing throughput often hurts latency: batching and pipelining trade individual response time for aggregate work done. Find the right point on this curve for your workload.
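The batching tradeoff can be made concrete with a toy cost model. Assuming an illustrative fixed per-batch overhead of 10 ms and 1 ms of work per item (both numbers invented for the example), larger batches raise throughput while worsening worst-case latency:

```python
def batch_metrics(batch_size: int, per_item_s: float,
                  overhead_s: float) -> tuple[float, float]:
    """Throughput (items/s) and worst-case latency (s) for one batch,
    under a simple fixed-overhead-plus-linear-work cost model."""
    batch_time = overhead_s + batch_size * per_item_s
    throughput = batch_size / batch_time
    worst_latency = batch_time  # first item queued waits for the whole batch
    return throughput, worst_latency

# Assumed costs: 10 ms overhead per batch, 1 ms per item.
for n in (1, 10, 100):
    tput, lat = batch_metrics(n, per_item_s=0.001, overhead_s=0.010)
    print(f"batch={n:3d}  throughput={tput:6.1f}/s  worst latency={lat*1000:.0f} ms")
```

Running this shows throughput climbing roughly tenfold from batch size 1 to 100 while worst-case latency climbs from about 11 ms to 110 ms, which is exactly the curve the paragraph above asks you to position yourself on.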

Related terms

Latency

The time delay between a request being sent and a response being received — typically measured in milliseconds.

Load Balancer

A component that distributes incoming network traffic across multiple backend servers to maximize throughput, minimize response time, and avoid overload.

Caching

Storing copies of frequently accessed data in fast memory so that subsequent requests can be served without recomputing or refetching.

CDN (Content Delivery Network)

A globally distributed network of edge servers that cache static content close to end users to minimize latency and origin load.

Horizontal Scaling

Adding more machines to a system to handle increased load, as opposed to making a single machine more powerful.

Vertical Scaling

Increasing the capacity of a single machine — more CPU, memory, or disk — to handle more load.