Scalability & Performance
A load balancer is a component that distributes incoming network traffic across multiple backend servers to maximize throughput, minimize response time, and avoid overloading any single server.
A load balancer sits in front of a pool of backend servers and routes each incoming request to one of them according to an algorithm. Common algorithms include round-robin, least-connections, weighted round-robin, IP hash (for sticky sessions), and consistent hashing.
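Three of those algorithms can be sketched in a few lines each. This is an illustrative sketch, not production code; the backend addresses and the in-memory connection counter are made up for the example.

```python
import hashlib
from itertools import cycle

backends = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]  # hypothetical pool

# Round-robin: hand out backends in a fixed rotation.
_rotation = cycle(backends)

def round_robin():
    return next(_rotation)

# Least-connections: pick the backend with the fewest in-flight requests.
# A real balancer tracks this per connection; here it's a plain counter.
active = {b: 0 for b in backends}

def least_connections():
    return min(active, key=active.get)

# IP hash: the same client IP always maps to the same backend,
# which is what makes sessions "sticky".
def ip_hash(client_ip):
    digest = hashlib.sha256(client_ip.encode()).hexdigest()
    return backends[int(digest, 16) % len(backends)]
```

Note the tradeoff visible even in the sketch: round-robin needs no state about the backends, least-connections needs live connection counts, and IP hash is deterministic per client rather than balanced per request.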
Load balancers operate at different OSI layers. Layer 4 (transport) load balancers route based on IP and port — fast but unaware of HTTP semantics. Layer 7 (application) load balancers like NGINX, HAProxy, and AWS ALB inspect HTTP headers, paths, and cookies, allowing path-based routing, header-based routing, and request-level features like rate limiting, TLS termination, and request rewriting.
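The Layer 7 features above all come from the balancer parsing the request. A minimal sketch of path-based routing, with made-up route prefixes and pool addresses:

```python
# Layer 7 path-based routing: inspect the HTTP path and choose a
# backend pool. Prefixes and addresses are illustrative.
ROUTES = {
    "/api/": ["10.0.1.1:9000", "10.0.1.2:9000"],
    "/static/": ["10.0.2.1:8080"],
}
DEFAULT_POOL = ["10.0.0.1:8080"]

def route(path):
    # First matching prefix wins; fall through to the default pool.
    for prefix, pool in ROUTES.items():
        if path.startswith(prefix):
            return pool
    return DEFAULT_POOL
```

A Layer 4 balancer cannot do this at all: it forwards bytes by IP and port and never sees the path, which is exactly why it is faster and why it is blind to HTTP semantics.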
A properly designed load balancer enables horizontal scaling, performs health checks to remove unhealthy nodes from rotation, and provides a single virtual IP / DNS name that hides the topology of the backend.
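A health check can be as simple as a periodic TCP connect; a node that stops accepting connections is dropped from rotation. A minimal sketch (a real balancer would also use HTTP-level checks, retries, and re-admission):

```python
import socket

def is_healthy(host, port, timeout=1.0):
    """TCP health check: the node is 'up' if it accepts a connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def healthy_pool(backends):
    # Filter failing nodes out of rotation; traffic only reaches survivors.
    return [b for b in backends if is_healthy(*b)]
```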
Use a load balancer anytime you have more than one backend server. The first scaling step in almost every architecture is to put a load balancer in front of two or more app servers.
A single load balancer is a single point of failure — production systems run them in active/active or active/passive HA pairs. Sticky sessions improve cache locality but reduce flexibility and can create hot servers.
Horizontal scaling is adding more machines to a system to handle increased load, as opposed to making a single machine more powerful.
A reverse proxy is a server that sits in front of one or more backend servers and forwards client requests to them, often handling TLS, caching, compression, and load balancing.
Consistent hashing is a hashing technique that minimizes the amount of data that needs to be moved when nodes are added to or removed from a distributed system.
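The standard construction is a hash ring with virtual nodes. This is an illustrative sketch (replica count and hash choice are arbitrary): removing a node only remaps the keys that fell in that node's arcs, leaving everything else in place.

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas
        self._ring = []  # sorted list of (point, node)
        for n in nodes:
            self.add(n)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        # Each node gets `replicas` points on the ring to smooth the load.
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def get(self, key):
        # A key belongs to the first node point clockwise from its hash.
        if not self._ring:
            raise KeyError("empty ring")
        points = [p for p, _ in self._ring]
        idx = bisect.bisect(points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

Contrast with naive `hash(key) % N`: when N changes, almost every key maps somewhere new; with the ring, only the departed node's share moves.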
Caching is storing copies of frequently accessed data in fast memory so that subsequent requests can be served without recomputing or refetching.
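The common application-side pattern is cache-aside: check the cache first, fall back to the slow source on a miss, then populate the cache. A minimal sketch with a hypothetical slow fetch standing in for a database query:

```python
cache = {}
calls = {"count": 0}  # instrumentation so the saving is visible

def expensive_fetch(key):
    # Stand-in for a database query or heavy computation.
    calls["count"] += 1
    return key.upper()

def get(key):
    # Cache-aside: read through the cache, populate on a miss.
    if key not in cache:
        cache[key] = expensive_fetch(key)
    return cache[key]
```

Real caches add what this sketch omits: eviction (e.g. LRU), expiry, and invalidation when the underlying data changes.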
A CDN (content delivery network) is a globally distributed network of edge servers that cache static content close to end users to minimize latency and origin load.
Vertical scaling is increasing the capacity of a single machine, with more CPU, memory, or disk, to handle more load.