Scaling Queue Workers Horizontally
A complete guide to understanding and engineering queue workers. The aim throughout this blog is to build a language- and framework-agnostic understanding of the topic.

Tools For The Job
Although the concept is language agnostic, we still need a concrete stack to test our understanding. For the purposes of this blog we will use the following tools:
- Python (For Writing Producer and Consumer Logic)
- Redis (To Act as Message Broker)
- BullMQ (For Queue and Worker Implementation)
Introduction
Before we get to the testing part, let's build a conceptual understanding of the model.
1. Architectural Foundations
At its core, a queue worker system is a specialised architectural component designed to decouple the generation of work from its processing. Let's compare this with a real-world example.
Imagine you own a car agency where you have specialised workers, like mechanics, ready to work. People walk in with various needs—some want to buy a car, while others need a complicated repair. But when these customers walk through the door, they have absolutely no idea where to go or who to talk to.
To solve this, you might think, "I'll just add a simple reception area to point the customers to the right place." But very soon, things start falling apart. First, your mechanics get extremely frustrated. They are highly skilled workers with their own tools and environment, but now they are getting interrupted constantly. They cannot work properly because a lot of their time is wasted dealing directly with customers. It's a clear mixing of roles — a "separation of concerns" problem. The receptionist can only give directions, leaving the mechanics completely overloaded with talking to people instead of doing their actual jobs.
To make matters worse, the customers are also getting angry. Every visit takes way too much time. When they want to buy a car, there is no one available to give them simple attention, show them the cars, or nicely explain which model is best for their use case.
So, what can you do?

You introduce a Manager to handle the front end of your business.
This Manager has two big jobs. First, if a customer comes in with a quick, simple request—like wanting to look at cars or needing advice—the Manager takes care of it right away on their own. They make the process lightning fast and keep the customer smiling.
Second, if a customer needs serious work done, the Manager doesn't try to fix the car themselves. Instead, they take the car from the customer, promise them it will be handled perfectly, and then pass the heavy lifting off to the specialised mechanics in the back. The mechanics can now take their time and use their specialised tools in peace, without ever having to talk to a customer. When the jobs are finally finished, the Manager steps back in, tells the customer the good news, and hands them back their finished car.
Comparing this analogy with our original topic, the incoming customers are the incoming requests that our server needs to process. We should not make them wait, as they can overload our system. So we send some of these requests to separate workers that have the proper tools and environment for the job. But sending these requests straight to the workers can overload the workers as well, because we might not have as many workers as incoming requests. So we introduce a queue system (broker) to manage the incoming requests and feed them to the workers in a controlled manner. In our case the server is the manager, guiding each request through processing. The server might handle some requests right away, like making database calls or serving static files, but for heavy lifting like sending emails or processing images and videos, it delegates the task to the workers.
2. Individual Components
In terms of Computer Science, the actual components of this decoupling trinity are:
- Producer: The component responsible for generating the jobs.
- Message Broker: Acts as a durable intermediary, managing logical queues and ensuring message persistence.
- Worker: The component responsible for processing the jobs.
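To make the trinity concrete, here is a minimal sketch in Python. In a real deployment the broker would be Redis (driven through redis-py or BullMQ); here a plain `collections.deque` stands in for it, and the job shape and function names are illustrative, so the flow of producer, broker, and worker is easy to follow.

```python
from collections import deque

# The "broker": a FIFO queue standing in for a Redis list.
broker = deque()

def producer(job_id: int, payload: str) -> None:
    """Generate a job and enqueue it on the broker."""
    broker.append({"id": job_id, "payload": payload})

def worker(results: list) -> None:
    """Drain the queue: fetch a job, process it, repeat."""
    while broker:
        job = broker.popleft()                  # fetch from the broker
        results.append(job["payload"].upper())  # the "heavy lifting"

# The producer keeps enqueuing regardless of worker availability.
for i, task in enumerate(["resize image", "send email"]):
    producer(i, task)

done = []
worker(done)
print(done)  # ['RESIZE IMAGE', 'SEND EMAIL']
```

Note how neither side knows about the other: the producer only talks to the broker, and the worker only pulls from it, which is exactly the decoupling the trinity buys us.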
This decoupling allows producers to continue sending messages even if consumers are offline or operating under heavy backpressure. This is where the concept of horizontal scaling comes into play.
Note: "Backpressure" here means consumers signalling that they cannot keep up, so the producers or the broker slow the flow instead of drowning the workers. The term also appears in queueing theory, a discipline within the mathematical theory of probability, where the backpressure routing algorithm directs traffic around a queueing network to achieve maximum network throughput (established using concepts of Lyapunov drift). Backpressure routing considers the situation where each job can visit multiple service nodes in the network; it is an extension of max-weight scheduling, where each job visits only a single service node.
3. How Servers Hand-Off Work
When our server (the manager) takes a complex job from a user, it needs to hand it off to the workers (mechanics) through the broker. But how do we ensure the job doesn't get lost in transit?
We have three primary delivery models:
- At-Most-Once: The server sends the job a single time and never retries. If the worker crashes before finishing, the job is lost forever. It offers high speed but zero safety.
- At-Least-Once: The server waits for a thumbs-up (an acknowledgment) from the worker. If it doesn't hear back within a certain timeframe, it assumes the job failed and sends it again. This is safer, but it means a worker might end up repeating a job if only the acknowledgment got lost.
- Exactly-Once: The holy grail of delivery where the system guarantees the job is completed once—no more, no less. (This is difficult and requires strict coordination and tracking).
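The at-least-once model can be sketched with an in-memory broker that tracks unacknowledged deliveries. Everything here (the deque, the ack-timeout value, the function names) is a hypothetical stand-in for the bookkeeping Redis or BullMQ would do for you; the point is to show how a lost acknowledgment leads to a duplicate delivery.

```python
from collections import deque

queue = deque([{"id": 1, "attempts": 0}])
in_flight = {}      # job id -> (job, ack deadline)
completed = []
ACK_TIMEOUT = 5.0   # seconds a job may stay unacknowledged

def deliver(now: float) -> dict:
    """Hand the next job to a worker and start its ack timer."""
    job = queue.popleft()
    job["attempts"] += 1
    in_flight[job["id"]] = (job, now + ACK_TIMEOUT)
    return job

def ack(job_id: int) -> None:
    """Worker confirms completion; the broker forgets the job."""
    del in_flight[job_id]

def requeue_expired(now: float) -> None:
    """Jobs whose ack timer expired are assumed lost and re-queued."""
    for job_id, (job, deadline) in list(in_flight.items()):
        if now >= deadline:
            del in_flight[job_id]
            queue.append(job)

job = deliver(now=0.0)        # first delivery
completed.append(job["id"])   # worker finishes the job...
# ...but its acknowledgment never reaches the broker.

requeue_expired(now=10.0)     # broker times out and re-queues the job
job = deliver(now=10.0)       # second delivery of the SAME job
completed.append(job["id"])
ack(job["id"])                # this time the ack gets through

print(job["attempts"], completed)  # 2 [1, 1]
```

The job ends up processed twice, which is precisely why at-least-once delivery forces us to think about idempotency, covered below.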
At the other end, a well-engineered worker runs in a constant loop: checking the broker for new tasks, executing the logic, and reporting back. To keep things running smoothly, workers should be stateless. They shouldn't permanently remember the cars (data) they are working on. They just fetch what they need from a database, process it, and move on to the next.
Note: In distributed systems, designing how servers talk and ensuring data stays consistent and available is heavily constrained by the CAP theorem. By nature, queue systems prioritize availability: they keep accepting messages even if parts of the system temporarily lose connection to each other (a network partition), at the cost of strict consistency.
4. What Happens When Things Break
In the real world, things fail. A worker's tool might break, or the database might be too busy. To survive this, we build resilience into our workers.
- Idempotency (Handling Duplicate Tasks): Remember the "At-Least-Once" delivery? If we accidentally send the same repair job twice, our worker must be smart enough to recognize the duplicate so the action isn't performed twice. In server terms, we use unique keys (like an order number) to check the database first: has this job been done already? If it has, skip it.
- Dead Letter Queues (DLQ): What happens if a job is fundamentally broken and causes the worker to crash every time it tries to process it? Without intervention, the broker will keep retrying it forever, blocking other tasks. Instead, we set a retry limit. After failing a few times, the bad job gets tossed into a "Dead Letter Queue." This is basically a "Needs Investigation" bin where developers can manually figure out what went wrong without stopping the rest of the queue.
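Both resilience patterns fit in a few lines. This is a sketch under simplifying assumptions: the set of processed IDs stands in for a database or Redis lookup, the "poison" payload simulates a job that always crashes, and `MAX_RETRIES` is an arbitrary limit you would tune.

```python
from collections import deque

MAX_RETRIES = 3
queue = deque()
dead_letter_queue = []   # the "Needs Investigation" bin
processed_ids = set()    # idempotency ledger (a DB table or Redis set in practice)
results = []

def handle(job: dict) -> None:
    """Process one job idempotently."""
    if job["id"] in processed_ids:     # duplicate from at-least-once delivery
        return                         # already done: skip it
    if job["payload"] == "poison":     # simulate a fundamentally broken job
        raise ValueError("cannot process")
    results.append(job["payload"])
    processed_ids.add(job["id"])

def run_worker() -> None:
    while queue:
        job = queue.popleft()
        try:
            handle(job)
        except Exception:
            job["failures"] = job.get("failures", 0) + 1
            if job["failures"] >= MAX_RETRIES:
                dead_letter_queue.append(job)   # stop retrying, park it
            else:
                queue.append(job)               # retry later

queue.extend([
    {"id": "order-42", "payload": "charge card"},
    {"id": "order-42", "payload": "charge card"},   # duplicate delivery
    {"id": "order-99", "payload": "poison"},        # always crashes
])
run_worker()
print(results, len(dead_letter_queue))  # ['charge card'] 1
```

The duplicate is silently skipped, and the poison job lands in the DLQ after three failures instead of blocking the queue forever.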
5. When to Hire More Workers
How do we actually scale up? How do we know when our system is overwhelmed and we need to spin up more workers?
The traditional way is to check if your server's CPU or memory is maxed out. But by the time memory is full, users are already experiencing delays (this is reactive scaling). A much smarter approach is to watch the Queue Length—the number of jobs waiting in line. If the pile of jobs is growing, you proactively spin up more workers before the system gets crushed.
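The queue-length rule above reduces to a one-line calculation. The per-worker throughput and the drain-time target here are made-up numbers you would replace with your own measurements and SLO.

```python
import math

JOBS_PER_WORKER_PER_MIN = 30   # assumed: measured throughput of one worker
TARGET_DRAIN_MINUTES = 2       # assumed SLO: clear the backlog within 2 minutes

def desired_workers(queue_length: int) -> int:
    """How many workers does the current backlog call for?"""
    if queue_length == 0:
        return 1   # keep a minimum of one warm worker
    return math.ceil(queue_length / (JOBS_PER_WORKER_PER_MIN * TARGET_DRAIN_MINUTES))

print(desired_workers(0))    # 1
print(desired_workers(45))   # 1  (45 / 60 rounds up to 1)
print(desired_workers(150))  # 3  (150 / 60 = 2.5, rounds up to 3)
```

Because the input is the backlog rather than CPU or memory, the scale-up decision fires while jobs are merely queuing, before users feel the delay.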
To keep a close eye on everything, you look for these golden signals:
- Queue Lag: How many jobs are waiting to be picked up?
- Throughput: How many jobs are actually getting finished every second?
- Error Rate: What percentage of jobs are failing?
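The three golden signals fall out of counters most brokers already expose. The counter names and the one-minute window here are illustrative assumptions, not any particular broker's API.

```python
def golden_signals(waiting: int, completed_last_min: int, failed_last_min: int) -> dict:
    """Derive the three golden signals from raw counters."""
    attempted = completed_last_min + failed_last_min
    return {
        "queue_lag": waiting,                              # jobs still in line
        "throughput_per_sec": completed_last_min / 60,     # jobs finished per second
        "error_rate": failed_last_min / attempted if attempted else 0.0,
    }

signals = golden_signals(waiting=120, completed_last_min=300, failed_last_min=20)
print(signals)  # 120 waiting, 5.0 jobs/sec, 6.25% failing
```

Watching all three together matters: a growing lag with healthy throughput means you need more workers, while a growing lag with a rising error rate usually means something is broken, and adding workers will only fail faster.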
Note: Calculating the exact number of workers needed and predicting how long tasks will wait is the subject of a branch of probability called queueing theory. For advanced distributed networks, things get even more complex—you might need leader election algorithms just to decide which specific worker is allowed to coordinate important tasks. A firm grasp of the fundamentals will make the need for these complexities much easier to recognize.