A technical guide to message queues: the producer-consumer pattern, message acknowledgement, at-least-once vs exactly-once delivery, dead letter queues, when queues solve real problems vs when they add unnecessary complexity, and how to choose between Redis, SQS, and Kafka.
Every application eventually encounters a class of work that shouldn't happen synchronously in an HTTP request: sending emails, generating reports, calling slow external APIs, processing uploaded files, triggering notifications to thousands of users. A message queue is the standard tool for handling this work reliably. Understanding how queues work — the acknowledgement model, delivery guarantees, failure modes — is what separates a reliable async system from one that silently drops jobs or processes them twice.
Without a queue, the naive approach to async work is: do it in the HTTP handler, or fire-and-forget (spawn a Promise and hope it completes). Both are problematic.
Synchronous in the HTTP handler: The user waits for the slow work to finish. A 3-second image resize or a 500ms Mailgun API call adds directly to your response time. More critically, if the slow work fails (external API is down), the whole request fails. Retry logic in HTTP handlers is messy.
Fire-and-forget: Spawn a Promise or background task that isn't tracked anywhere. The work might complete; it might fail silently; the process might restart mid-job and the work is lost with no record it was ever started. This is fine for truly non-critical work with no reliability requirements, but a disaster for anything that matters.
A queue decouples the work from the request. The HTTP handler enqueues a job (fast — typically a Redis write, taking <1ms) and returns. A separate worker process picks up the job and does the slow work. If the worker fails, the job stays in the queue and another worker picks it up. The user's request was not affected by the worker's failure.
This is the producer-consumer pattern. Producers create jobs and add them to the queue. Consumers (workers) take jobs from the queue and execute them.
Acknowledgement is the mechanism that ensures jobs aren't lost when workers crash. It's the most important thing to understand about queue reliability.
Without acknowledgement (wrong):
Queue: [job_A, job_B, job_C]
Worker picks up job_A → queue removes job_A
Worker crashes while processing job_A
job_A is gone forever
With acknowledgement (correct):
Queue: [job_A, job_B, job_C]
Worker picks up job_A → queue moves job_A to "processing" state (still in queue)
Worker processes job_A successfully → worker sends ACK → queue deletes job_A
Queue: [job_B, job_C]
-- OR --
Worker picks up job_A → queue moves job_A to "processing" state
Worker crashes → lock expires after timeout
Queue: [job_A, job_B, job_C] ← job_A is back
Another worker picks up job_A and processes it
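The state machine above can be sketched in a few lines of plain JavaScript. This is a toy in-memory model purely for illustration — real queues keep this state in Redis or managed storage so it survives process crashes, and all names here (`AckQueue`, `reserve`, `recoverStalled`) are hypothetical:

```javascript
// Toy queue with an acknowledgement model: jobs move waiting -> processing,
// and are only deleted on ack. Unacked jobs whose lock expires are requeued.
class AckQueue {
  constructor(lockTimeoutMs) {
    this.lockTimeoutMs = lockTimeoutMs;
    this.waiting = [];           // jobs ready to be picked up
    this.processing = new Map(); // jobId -> { job, lockedAt }
  }

  enqueue(job) {
    this.waiting.push(job);
  }

  // Worker picks up a job: it moves to "processing" but is NOT deleted
  reserve(now) {
    const job = this.waiting.shift();
    if (!job) return null;
    this.processing.set(job.id, { job, lockedAt: now });
    return job;
  }

  // Successful processing: only now is the job removed for good
  ack(jobId) {
    this.processing.delete(jobId);
  }

  // A crashed worker never acks; expired locks send jobs back to waiting
  recoverStalled(now) {
    for (const [id, entry] of this.processing) {
      if (now - entry.lockedAt >= this.lockTimeoutMs) {
        this.processing.delete(id);
        this.waiting.unshift(entry.job); // back to the front of the queue
      }
    }
  }
}
```

The crucial property: there is no moment at which the job exists only in a worker's memory. It is either in `waiting` or in `processing`, and a crash at any point leaves it recoverable.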
In BullMQ (Node.js/Redis), this is handled automatically. Jobs are moved to an "active" set when picked up. They're only removed from active and moved to "completed" when job.moveToCompleted() is called (implicitly, when your job processor function resolves). If the worker crashes or the lock expires, BullMQ's stalled job recovery moves the job back to "waiting."
In AWS SQS, the equivalent is the visibility timeout — when a consumer receives a message, it becomes invisible to other consumers for the timeout duration. The consumer must explicitly delete the message after processing. If it doesn't delete within the timeout, the message reappears.
FIFO (first in, first out): Jobs are processed in the order they were enqueued. Standard for most use cases where ordering matters or where fairness is the goal. SQS Standard queues are not strictly FIFO (they have at-least-once delivery with possible ordering variation); SQS FIFO queues guarantee ordered delivery at the cost of lower throughput.
Priority queues: Jobs have a numeric priority; higher-priority jobs are processed before lower-priority ones regardless of enqueue time. BullMQ supports priority natively. Useful when mixing job types with different urgency — a user-triggered action (priority: high) and a nightly batch report (priority: low) can share a worker pool, with the high-priority work always getting processed first.
Delayed jobs: A variant where jobs are enqueued with a delay — "process this in 30 minutes." Common for: retrying after an expected delay, sending a follow-up email 24 hours after signup, scheduling reminders. BullMQ supports { delay: 1000 * 60 * 30 } (milliseconds). Redis sorted sets handle the scheduled state — delayed jobs sit in a sorted set ordered by their run-at timestamp, and a scheduler process promotes them to the waiting queue when their time comes.
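The sorted-set mechanic can be modeled in miniature (a toy sketch — BullMQ actually stores delayed jobs in a Redis ZSET scored by timestamp; the class and method names here are hypothetical):

```javascript
// Toy delayed-job scheduler: delayed jobs sit in a list sorted by runAt,
// and a periodic promotion step moves all due jobs to the waiting queue.
class DelayedScheduler {
  constructor() {
    this.delayed = []; // kept sorted ascending by runAt (the "sorted set")
    this.waiting = [];
  }

  addDelayed(job, runAt) {
    this.delayed.push({ job, runAt });
    this.delayed.sort((a, b) => a.runAt - b.runAt);
  }

  // Called periodically by a scheduler process: promote everything due
  promoteDue(now) {
    while (this.delayed.length > 0 && this.delayed[0].runAt <= now) {
      this.waiting.push(this.delayed.shift().job);
    }
  }
}
```

Because the set is ordered by run-at time, the scheduler only ever has to look at the head of the set to know whether anything is due.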
What happens to a job that keeps failing? Without a dead letter queue, it retries indefinitely — potentially blocking other jobs or consuming worker capacity. With a DLQ, jobs that exceed their maximum retry count are moved to a separate queue (the dead letter queue) where they stop being retried and can be inspected, debugged, or manually re-processed.
In BullMQ:
// Assumes: import { Queue, Worker } from 'bullmq'; and a shared Redis `connection` object
const queue = new Queue('email', { connection });

const worker = new Worker('email', async (job) => {
  await sendEmail(job.data);
}, {
  connection,
  settings: {
    backoffStrategy: (attemptsMade) => {
      return Math.min(Math.pow(2, attemptsMade) * 1000, 30000); // exponential backoff, capped at 30s
    }
  }
});
// Jobs that fail after maxAttempts go to the failed state (BullMQ's DLQ equivalent)
// Add this when adding to queue:
queue.add('send-welcome', { userId, email }, {
  attempts: 5,
  backoff: { type: 'exponential', delay: 2000 }
});

In SQS, you configure a DLQ as a separate SQS queue and set maxReceiveCount on the source queue. After a message is received (and not deleted) maxReceiveCount times, SQS automatically moves it to the DLQ.
A DLQ is not optional for production queue systems. Jobs fail for reasons you can't predict in advance — downstream APIs returning unexpected responses, data that violates assumptions your processor makes, infrastructure events. Without a DLQ, failed jobs either retry forever or disappear silently. Neither is acceptable.
Every message queue has a delivery guarantee, and none of them is exactly-once by default.
At-least-once delivery: The queue guarantees every message will be delivered to a consumer at least once. In practice, most messages are delivered exactly once. But in failure scenarios (worker crashes after processing but before ACK, duplicate delivery due to network issues), a message may be delivered multiple times. Your consumers must be idempotent.
At-most-once delivery: Messages may be lost (if the consumer fails before processing) but are never delivered twice. Used when duplicate delivery is more harmful than occasional loss — uncommon for application workloads.
Exactly-once delivery: Every message is delivered exactly once, no more, no less. The dirty secret: true exactly-once delivery is theoretically impossible in distributed systems without coordination that destroys throughput. What providers call "exactly-once" is actually "effectively-once" — they use deduplication mechanisms that reduce duplicates to near-zero under normal conditions, with caveats.
SQS FIFO queues offer exactly-once processing within a deduplication window (5 minutes) using a client-provided deduplication ID. Within that window, messages with the same deduplication ID are deduplicated. Outside the window, if the same message is sent again with the same ID, it's treated as a new message.
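The window semantics are easy to model directly. This is a toy model of the behavior, not the SQS implementation; the 5-minute default matches SQS FIFO's documented deduplication interval, and the class name is hypothetical:

```javascript
// Toy model of SQS FIFO deduplication: a message whose dedup ID was seen
// within the window is dropped; outside the window it is accepted again.
const DEDUP_WINDOW_MS = 5 * 60 * 1000; // SQS FIFO's documented 5-minute window

class DedupWindow {
  constructor(windowMs = DEDUP_WINDOW_MS) {
    this.windowMs = windowMs;
    this.seen = new Map(); // dedupId -> timestamp of last accepted message
  }

  // Returns true if the message is accepted, false if deduplicated
  accept(dedupId, now) {
    const last = this.seen.get(dedupId);
    if (last !== undefined && now - last < this.windowMs) {
      return false; // duplicate within the window
    }
    this.seen.set(dedupId, now);
    return true;
  }
}
```

Note the failure mode this implies: a duplicate sent six minutes after the original sails straight through, which is exactly why the window alone can't replace idempotent processors.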
The practical conclusion: design your job processors to be idempotent (running them twice produces the same outcome as running once). Don't rely on the queue to prevent duplicates.
// Idempotent job processor
async function processPaymentJob(job) {
  const { paymentIntentId } = job.data;

  // Check if already processed
  const existing = await db.payments.findUnique({
    where: { stripePaymentIntentId: paymentIntentId }
  });
  if (existing) {
    return { alreadyProcessed: true };
  }

  // Process and record atomically — a unique constraint on
  // stripePaymentIntentId backstops this check under concurrent delivery
  await db.$transaction([
    db.payments.create({ data: { stripePaymentIntentId: paymentIntentId /* , ...other fields */ } }),
    db.orders.update({ where: { /* ... */ }, data: { status: 'paid' } })
  ]);
}

| Tool | Runtime | Backing store | Key strengths |
|---|---|---|---|
| BullMQ | Node.js | Redis | Full-featured, dashboard (Bull Board), scheduling, priority, delayed jobs |
| Sidekiq | Ruby | Redis | Mature, fast, excellent monitoring, Rails integration |
| Celery | Python | Redis or RabbitMQ | Mature, flexible, broad ecosystem |
| AWS SQS | Any | Managed | No ops, scales infinitely, integrates with Lambda |
| AWS SQS + Lambda | Any | Managed | Serverless workers, auto-scaling out of the box |
| RabbitMQ | Any | Self-hosted | Complex routing rules, multiple exchange types |
| Temporal | Any | PostgreSQL/Cassandra | Durable workflows, not just simple jobs |
For most Node.js applications, BullMQ (the successor to Bull) is the practical default. Redis is already in most stacks for caching; BullMQ uses Redis efficiently and has a solid dashboard. For applications already on AWS that need to scale workers independently, SQS + Lambda or SQS + ECS workers eliminates queue infrastructure entirely.
Queues add complexity: a Redis instance (or SQS), worker processes to deploy and monitor, job state to inspect when things go wrong. Don't add this infrastructure until you have a concrete reason.
These are the concrete reasons that do justify a queue:
Email and notification sending. Email providers (Mailgun, Sendgrid, Resend) have API latency and rate limits. Sending email synchronously in a request adds 200-500ms and introduces a failure dependency. Enqueue the send; the user's request completes instantly.
Heavy computation. PDF generation, video transcription, image processing, ML inference. These operations take seconds to minutes. They belong in a worker, not a request handler.
External API calls with rate limits. If you're calling an API that has a 100 requests/minute limit, you need rate-limited queue processing, not unbounded concurrent HTTP handlers.
Fan-out notifications. "User posted a comment → notify 500 followers." This is 500 database writes or push notification calls. Don't do it synchronously in the request that triggers it.
Payment processing side effects. After a payment succeeds, you might need to: provision access, send a receipt, update CRM, notify the finance team, generate an invoice. These can be separate jobs triggered by the payment event, each retrying independently if they fail.
Anything that must be retried on failure. If the consequence of failure is that the work doesn't happen (user doesn't get their password reset email, order isn't fulfilled), you need retry logic. Queues provide this natively.
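The rate-limit case deserves a closer look. In BullMQ, the production-grade answer is the Worker `limiter` option (assumed shape: `new Worker(name, fn, { connection, limiter: { max: 100, duration: 60000 } })` — check your version's docs). The underlying idea can be sketched as a fixed-window limiter; this toy class is for illustration only:

```javascript
// Toy fixed-window rate limiter for the "100 requests/minute API" case:
// at most `max` acquisitions per `windowMs`, then callers must wait.
class FixedWindowLimiter {
  constructor(max, windowMs) {
    this.max = max;
    this.windowMs = windowMs;
    this.windowStart = 0;
    this.count = 0;
  }

  // Returns true if a job may run now, false if it must wait
  tryAcquire(now) {
    if (now - this.windowStart >= this.windowMs) {
      this.windowStart = now; // window elapsed: start a fresh one
      this.count = 0;
    }
    if (this.count < this.max) {
      this.count++;
      return true;
    }
    return false;
  }
}
```

The point of putting this in the queue layer rather than the HTTP layer: jobs that can't run yet simply wait in the queue instead of failing, and the limit is enforced across all workers rather than per-process.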
These terms get conflated. They're related but solve different problems.
Message queue (BullMQ, SQS): Work distribution. A job is picked up by one worker, processed once (or retried until successful), then removed. The queue is consumed — each job disappears after processing. Good for: background jobs, task distribution.
Pub/sub (Redis pub/sub, Google Cloud Pub/Sub, SNS): Fan-out notifications. A message published to a topic is delivered to all subscribers. Multiple consumers each receive a copy. The message is not "consumed" — it's broadcast. Good for: real-time events where multiple independent components need to react.
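The broadcast semantics are simple enough to show in miniature (a toy sketch with hypothetical names — real pub/sub systems add delivery guarantees, persistence, and network transport on top of this core idea):

```javascript
// Pub/sub in miniature: a published message is delivered to EVERY
// subscriber, unlike a queue where exactly one worker takes each job.
class Topic {
  constructor() {
    this.subscribers = [];
  }

  subscribe(handler) {
    this.subscribers.push(handler);
  }

  publish(message) {
    // each subscriber receives its own copy of the message
    for (const handler of this.subscribers) handler(message);
  }
}
```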
Event streaming (Kafka, Kinesis): An append-only log of events. Consumers read the log at their own pace and maintain their own position (offset). New consumers can replay the entire history. Events are retained for a configured period (days, weeks, forever). Good for: audit logs, event sourcing, analytics pipelines, systems that need to replay history.
The key distinction between a queue and Kafka: in a queue, a job is "owned" by one consumer and disappears after processing. In Kafka, every consumer group reads the full stream independently — adding a new service that needs order events means it reads from offset 0 (or from "now") without affecting other consumers or the message producers.
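That offset model can be sketched in a few lines (a toy model of Kafka's consumer-group semantics, not its implementation — partitions, rebalancing, and retention are all omitted, and the names are hypothetical):

```javascript
// Toy event log: each consumer group tracks its own read offset, so a new
// group can replay from offset 0 without affecting producers or other groups.
class EventLog {
  constructor() {
    this.events = [];         // append-only log
    this.offsets = new Map(); // consumerGroup -> next offset to read
  }

  append(event) {
    this.events.push(event);
  }

  // Reads everything from the group's current offset and advances it
  poll(group) {
    const from = this.offsets.get(group) ?? 0;
    const batch = this.events.slice(from);
    this.offsets.set(group, this.events.length);
    return batch;
  }
}
```

Contrast this with the `ack`-and-delete queue model: here nothing is ever removed by consumption, which is precisely what makes replay and late-joining consumers possible.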
A queue's depth (number of jobs waiting) is a key operational metric. A queue that's growing without bound means workers aren't keeping up — you need more workers, faster workers, or fewer jobs being enqueued.
BullMQ exposes queue metrics:
const counts = await queue.getJobCounts('waiting', 'active', 'failed', 'delayed');
// { waiting: 143, active: 8, failed: 2, delayed: 0 }

Alert when these counts deviate from your baseline: a steadily growing waiting count means workers aren't keeping up, and a climbing failed count means jobs are landing in the failed state and need inspection.
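A minimal health check over those counts might look like this. The threshold values are illustrative placeholders, not recommendations — pick numbers from your own traffic baseline:

```javascript
// Classify queue health from BullMQ-style job counts.
// maxWaiting/maxFailed are hypothetical defaults — tune to your baseline.
function queueHealth(counts, { maxWaiting = 1000, maxFailed = 10 } = {}) {
  const alerts = [];
  if (counts.waiting > maxWaiting) {
    alerts.push(`backlog: ${counts.waiting} jobs waiting (workers not keeping up)`);
  }
  if (counts.failed > maxFailed) {
    alerts.push(`failures: ${counts.failed} jobs in the failed state (inspect the DLQ)`);
  }
  return { healthy: alerts.length === 0, alerts };
}
```

Run this on a timer against `queue.getJobCounts(...)` and feed the result to whatever alerting you already have; the value is in watching the trend, not any single reading.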
For autoscaling: if you're running workers on Kubernetes, you can use KEDA (Kubernetes Event-Driven Autoscaling) to scale worker deployments based on queue depth. On AWS with SQS + ECS, Application Auto Scaling supports scaling ECS services based on SQS queue depth via CloudWatch metrics.
Queues, background jobs, async processing patterns — getting these right early prevents the reliability problems that emerge at scale. Hunchbite helps technical leads and engineering teams design backend architecture that's robust under real production conditions.
If this guide resonated with your situation, let's talk. We offer a free 30-minute discovery call — no pitch, just honest advice on your specific project.