Deep Dive into Amazon SQS: A “Queuing Design” Guide Learned by Comparing Pub/Sub Services (SNS, GCP Pub/Sub, Azure Service Bus)
Introduction (Key Takeaways)
- In this article, we’ll focus on Amazon Simple Queue Service (Amazon SQS) as the main character, while also comparing it with:
- AWS SNS / EventBridge
- Google Cloud Pub/Sub
- Azure Service Bus / Storage Queues
to organize what it means to design messaging and queuing for loosely coupled distributed systems.
- Amazon SQS is a fully managed message queue service that makes it easy to implement:
- Microservice-to-microservice communication
- Background jobs / batch processing
- Spike absorption (buffering)
- In GCP you have Cloud Pub/Sub, and in Azure you have Service Bus / Storage Queues playing similar roles. In any major cloud, the shared idea is:
A “message channel” that loosely connects the sender and receiver.
- The key design points are:
- Where to draw the line between synchronous APIs and asynchronous messages
- Choosing between FIFO (ordered) and Standard (high-throughput)
- Designing visibility timeout, retry behavior, and dead-letter queues (DLQ)
- How to define message schema and versioning
- Learning messaging patterns that work across multiple clouds
- This article is intended for:
- Backend developers / microservice developers
- Engineers designing batch processing and data pipelines
- SRE / infrastructure engineers (operations design)
- Technical leaders designing architectures with GCP / Azure in mind as well
- By the end, you should be able to explain in your own words decisions like:
“This part is a synchronous API, this goes through an SQS queue, and this should be broadcast in a Pub/Sub-like way.”
1. What Is Amazon SQS? A Message Queue that “Loosely Connects” Applications
1.1 Service Overview
Amazon SQS (Simple Queue Service) is a fully managed message queuing service provided by AWS.
In a nutshell:
“The producer (sender) puts messages into a queue,
and the consumer (receiver) later pulls them out and processes them.”
Key characteristics:
- Serverless and fully managed
- AWS handles building queue servers, redundancy, patching, and capacity management.
- Highly scalable
- Automatically scales as message volume and traffic increase.
- At-least-once delivery
- You must design under the assumption that the same message might be delivered more than once.
- Two queue types
- Standard queue: high throughput, ordering is best-effort
- FIFO queue: ordered and deduplicated, but with lower throughput
Unlike running your own message queue (like RabbitMQ or ActiveMQ) on-premises,
with SQS you can simply “create a queue and start using it right away”—that’s a big part of its appeal.
1.2 Similar Services on Other Clouds
Conceptually similar services include:
- Google Cloud Pub/Sub
- High-throughput publish/subscribe service; supports both push and pull.
- Azure Service Bus
- Feature-rich messaging (queues + topics + subscriptions).
- Azure Storage Queues
- Simpler and cheaper queue service that is part of Azure Storage.
GCP Pub/Sub and Azure Service Bus are strong at broadcast (one message to many subscribers),
while SQS is basically built on the model that “one message in a queue is processed by one consumer”
(although you can use it in a Pub/Sub style when combined with SNS or EventBridge).
2. Standard Queue vs FIFO Queue: What’s the Difference?
2.1 Standard Queue
A Standard Queue is the default SQS queue type with the following characteristics:
- Virtually unlimited throughput
- Message order is best-effort
- Messages often arrive in the order sent, but strict ordering is not guaranteed.
- Messages may be delivered more than once (at-least-once delivery)
In exchange, Standard queues are well suited for high-traffic workloads.
2.2 FIFO Queue (First-In-First-Out)
A FIFO Queue provides:
- Guaranteed order of messages, and
- Deduplication (de-dup)
Key points:
- Ordering is guaranteed per message group,
i.e., messages with the same MessageGroupId are processed in arrival order.
- With a MessageDeduplicationId, you can ensure that
“the same message sent twice is processed only once.”
- Throughput is lower than Standard queues and must be considered in design.
For workloads where order must not be broken (e.g., updating a single user’s account balance),
or where double-processing is unacceptable, FIFO queues are an option.
For most other cases, however, Standard queues are simpler and scale better,
so starting with Standard queues is generally recommended.
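As a minimal sketch of how the two FIFO parameters fit together, the helper below builds the arguments for a SendMessage call. The function name, the `account-…` group naming, and the choice of a SHA-256 content hash as the deduplication ID are assumptions made for illustration, not something SQS prescribes.

```python
import hashlib
import json


def build_fifo_send_params(queue_url: str, account_id: str, payload: dict) -> dict:
    """Build keyword arguments for SendMessage on a FIFO queue (illustrative helper)."""
    # Serialize deterministically so identical payloads give identical bodies.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        # All messages for one account share a group, so they keep their order.
        "MessageGroupId": f"account-{account_id}",
        # Same body -> same ID, so a retried send is treated as a duplicate.
        "MessageDeduplicationId": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }
```

With boto3, the result could be passed as `sqs.send_message(**params)`, where `sqs` is a `boto3.client("sqs")` and the queue name ends in `.fifo`. Hashing the body only makes sense if two identical bodies really are the same logical message; otherwise an explicit business key is the safer deduplication ID.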
3. SQS Basics: Sending, Receiving, and Visibility Timeout
3.1 Message Lifecycle
The basic flow of SQS is very simple:
- The producer sends a message to the queue with SendMessage.
- The consumer receives the message with ReceiveMessage.
- When processing is complete, the consumer calls DeleteMessage to remove the message from the queue.
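The receive-and-delete half of this flow can be sketched as a single polling step. The code assumes `sqs` behaves like a `boto3.client("sqs")`; the function name `poll_once` and the `handler` callback are hypothetical names chosen for this example.

```python
from typing import Callable


def poll_once(sqs, queue_url: str, handler: Callable[[str], None]) -> int:
    """Receive one batch, process each message, delete the ones that succeed.

    Returns the number of messages deleted.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # a single call returns at most 10 messages
        WaitTimeSeconds=20,      # long polling: wait up to 20 s for messages
    )
    deleted = 0
    for message in response.get("Messages", []):
        try:
            handler(message["Body"])
        except Exception:
            # Do not delete: the message will be delivered again later.
            continue
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        deleted += 1
    return deleted
```

A worker would typically call `poll_once` in a loop. Note that deletion happens only after the handler returns, so a crash in the middle leaves the message in the queue.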
The key concept here is “visibility timeout.”
3.2 What Is Visibility Timeout?
Right after a message is retrieved with ReceiveMessage,
that message becomes hidden from other consumers for a certain period of time.
That period is called the visibility timeout.
- Default is 30 seconds (configurable per queue).
- If processing finishes during this period, the consumer calls DeleteMessage.
- If processing fails and DeleteMessage is not called,
the message becomes visible on the queue again after the timeout expires,
and another consumer may then pick it up.
This mechanism tolerates consumers that crash mid-processing:
- The message reappears for reprocessing after the timeout, and
- Messages are less likely to be lost,
providing a kind of “built-in auto-retry-like behavior.”
However, because the same message may be processed twice, you must design for:
- “Process each order_id only once” (e.g., track processed IDs in a DB), and
- “Make the processing idempotent (the same input can be safely processed multiple times).”
This is critical to get right in your application.
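To make the visibility timeout behavior concrete, here is a toy in-memory model. It is not the SQS API: `ToyQueue` and its methods are invented for this illustration, and time is passed in explicitly so the behavior is easy to follow.

```python
class ToyQueue:
    """In-memory model of visibility timeout (a teaching aid, not the SQS API)."""

    def __init__(self, visibility_timeout: float = 30.0):
        self.visibility_timeout = visibility_timeout
        self._visible_at = {}  # message_id -> time at which it is visible again
        self._bodies = {}      # message_id -> body

    def send(self, message_id: str, body: str, now: float) -> None:
        self._bodies[message_id] = body
        self._visible_at[message_id] = now

    def receive(self, now: float):
        """Return (message_id, body) for one visible message, or None."""
        for message_id, visible_at in self._visible_at.items():
            if visible_at <= now:
                # Hide the message from other consumers until the timeout expires.
                self._visible_at[message_id] = now + self.visibility_timeout
                return message_id, self._bodies[message_id]
        return None

    def delete(self, message_id: str) -> None:
        self._visible_at.pop(message_id, None)
        self._bodies.pop(message_id, None)
```

If a consumer receives a message at time 0 and never deletes it, a second `receive` at time 10 finds nothing, while a `receive` at time 30 gets the same message again. That second delivery is exactly why the processing has to be idempotent.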
4. Dead-Letter Queues (DLQ): A Parking Lot for Failed Messages
4.1 Why Do We Need a DLQ?
Occasionally, you get problematic messages like these:
- Messages that fail every time you try to process them (e.g., due to bugs or unexpected data format)
- Messages that cannot be processed for external reasons (e.g., a long-term outage of a downstream system)
- Invalid or corrupted messages
If you leave such messages sitting in the main queue:
- They may be retried over and over in a loop, and
- They can delay processing of healthy messages.
This is where a Dead-Letter Queue (DLQ) comes in.
4.2 How a DLQ Works
For a given source queue in SQS, you can configure:
- The maximum receive count (e.g., 5), and
- A dead-letter queue to send failed messages to.
When:
- A message on the source queue is received (via ReceiveMessage) more times than the maximum receive count allows
- But is never deleted (i.e., it keeps failing),
it will be automatically moved to the DLQ.
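In API terms, these two settings live in the source queue's RedrivePolicy attribute. The sketch below only builds that attribute; the helper name and the example ARN are assumptions for illustration.

```python
import json


def redrive_policy_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Build the Attributes argument that attaches a DLQ to a source queue."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the policy itself as a JSON string inside the attribute map.
    return {"RedrivePolicy": json.dumps(policy)}
```

With boto3 this could be applied as `sqs.set_queue_attributes(QueueUrl=source_queue_url, Attributes=redrive_policy_attributes(dlq_arn))`, where `source_queue_url` and `dlq_arn` refer to queues you have already created.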
Messages accumulated in the DLQ can be:
- Investigated and reprocessed in a separate batch,
- Manually corrected and re-submitted, etc.,
enabling human-in-the-loop handling of problematic messages.
GCP Pub/Sub has dead-letter topics, and Azure Service Bus has dead-letter queues as well.
All of them share the same idea: “Provide a safe parking lot for messages that can’t be handled automatically.”
5. Relationship Between SQS and SNS/EventBridge: Queues vs Pub/Sub
5.1 SQS: Point-to-Point (1:1) Messaging
SQS is essentially:
- One queue,
- Multiple consumers can poll it, but each message is handed to only one consumer at a time.
This is the Point-to-Point messaging model.
5.2 SNS: Pub/Sub (1-to-Many) Notification Service
Amazon SNS (Simple Notification Service) is a publish/subscribe notification service.
- When a message is published to a topic,
- It is delivered to multiple subscribers simultaneously (SQS queues, Lambda, HTTP endpoints, email, etc.).
By combining SNS and SQS, you get:
- SNS topics (the “hub” for events), and
- SQS queues (the “mailboxes” for each system).
This lets multiple systems asynchronously process the same event.
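One practical detail of the SNS + SQS combination: unless raw message delivery is enabled on the subscription, the SQS message body is an SNS envelope (a JSON document) and the published payload sits in its "Message" field. A small sketch of unwrapping it, with a hypothetical helper name:

```python
import json


def unwrap_sns_envelope(sqs_body: str) -> str:
    """Return the originally published message from an SQS message body.

    If the body is an SNS notification envelope, return its "Message" field;
    otherwise return the body unchanged (e.g. raw message delivery is enabled).
    """
    try:
        envelope = json.loads(sqs_body)
    except json.JSONDecodeError:
        return sqs_body
    if (
        isinstance(envelope, dict)
        and envelope.get("Type") == "Notification"
        and "Message" in envelope
    ):
        return envelope["Message"]
    return sqs_body
```

Each subscribing system can call this at the top of its consumer, so the rest of the code does not need to know whether the message came through SNS or was sent to the queue directly.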
5.3 EventBridge: An Evolved “Event Bus”
AWS EventBridge is an “event bus” that provides more advanced routing and filtering.
- It receives events from numerous AWS services and SaaS providers,
- And routes them based on rules to:
- SQS / SNS
- Lambda / Step Functions
- EC2 / ECS / API Gateway, etc.
GCP has Cloud Pub/Sub, and Azure has Event Grid + Service Bus/Queues,
all of which share the idea of “gather events in one place and route them to appropriate targets.”
6. Practical Usage in Microservices and Batch Processing
6.1 Web Frontend + Background Processing
A classic pattern is API Gateway + Lambda / ECS + SQS:
- The API receives user requests (e.g., order confirmation).
- It immediately returns a response (“Your order has been received”).
- The heavy lifting (billing, inventory check, sending emails, etc.) is placed as messages onto an SQS queue.
- Background workers (Lambda / ECS) process these messages asynchronously.
This pattern allows:
- Short response time for users, and
- Handling spikes by buffering in the queue and processing gradually,
achieving a good balance between response speed and scalability.
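The "accept now, process later" step might look like the sketch below. It assumes `sqs` behaves like a `boto3.client("sqs")`; the function name, the `request_id` field, and the response shape (an API Gateway style dict) are illustrative choices.

```python
import json
import uuid


def accept_order(sqs, queue_url: str, order: dict) -> dict:
    """Enqueue the heavy work and return to the caller immediately."""
    request_id = str(uuid.uuid4())
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"request_id": request_id, "order": order}),
    )
    # 202 Accepted: the request was received, processing happens later.
    return {
        "statusCode": 202,
        "body": json.dumps(
            {"message": "Your order has been received", "request_id": request_id}
        ),
    }
```

Returning the `request_id` gives the client something to poll or correlate with later, since the real result is not known yet when the response is sent.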
On GCP, you can do the same with Cloud Tasks or Cloud Pub/Sub,
and on Azure, with Storage Queues / Service Bus + Functions.
6.2 Buffer for Data Pipelines
Example pattern:
- A file is uploaded to S3.
- A Lambda function is triggered and sends file metadata to SQS.
- Multiple workers (ECS/EKS batch tasks) pull messages from SQS and process the files.
This smooths out file processing and makes reprocessing easier.
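For the second step, the Lambda function mostly has to translate the S3 event into small messages. A sketch of that translation, assuming the standard S3 event notification layout; the function name and the chosen output fields are hypothetical:

```python
import json
from urllib.parse import unquote_plus


def s3_event_to_messages(event: dict) -> list:
    """Turn an S3 event notification into one JSON message body per object."""
    bodies = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bodies.append(json.dumps({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "key": unquote_plus(s3["object"]["key"]),
            "size": s3["object"].get("size"),
        }))
    return bodies
```

Inside the Lambda handler, each body would then be sent with `sqs.send_message(QueueUrl=..., MessageBody=body)`. Only metadata travels through the queue; the workers fetch the file itself from S3.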
In big data workflows, you often see GCP Pub/Sub → Dataflow,
or Azure Event Hubs / Service Bus → Data Factory.
The underlying idea is the same.
6.3 Retry and Backoff
With SQS visibility timeouts and DLQs, you can implement a stepwise retry strategy, such as:
- 1st attempt: immediate retry (short visibility timeout).
- 2nd–3rd attempts: apply backoff (consumer waits before re-sending / uses a different queue for delayed retries).
- 4th attempt onward: send to DLQ and involve a human.
Defining such policies upfront makes system behavior much clearer during failures.
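One possible way to implement the backoff step, sketched here as an alternative to a separate delay queue, is to lengthen the visibility timeout of a failed message based on how many times it has been received. The function names and the 30-second base / 900-second cap are assumptions; `sqs` is assumed to behave like a `boto3.client("sqs")`, and the message must have been received with the ApproximateReceiveCount attribute requested.

```python
def retry_visibility_timeout(
    receive_count: int, base_seconds: int = 30, cap_seconds: int = 900
) -> int:
    """Exponential backoff: base, 2*base, 4*base, ... capped at cap_seconds."""
    if receive_count < 1:
        raise ValueError("receive_count starts at 1")
    return min(base_seconds * 2 ** (receive_count - 1), cap_seconds)


def postpone_retry(sqs, queue_url: str, message: dict) -> int:
    """After a failed attempt, delay the next delivery of `message`.

    `message` is one entry from a ReceiveMessage response that includes
    the ApproximateReceiveCount system attribute.
    """
    receive_count = int(message["Attributes"]["ApproximateReceiveCount"])
    timeout = retry_visibility_timeout(receive_count)
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=message["ReceiptHandle"],
        VisibilityTimeout=timeout,
    )
    return timeout
```

The final step of the policy (handing over to a human) needs no consumer code in this sketch: once the receive count passes the queue's maximum receive count, SQS moves the message to the DLQ.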
7. Designing Messages: Schema, Size, and Idempotency
7.1 Designing Message Schemas
The SQS message body is free-form text (typically JSON).
The key design concern is:
- What to include in the message.
Common approaches:
- ID-only pattern
- Example: {"order_id": "12345"}
- All details are retrieved from another data store (RDS / DynamoDB, etc.).
- Easier to keep message size small.
- Embed-all-details pattern
- Example: embed order data, user data, product data, etc., as a full JSON document.
- Fewer DB lookups, but message size easily grows.
There’s no single right answer; you need to consider update frequency, data consistency, and limits (SQS max 256 KB).
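Whichever pattern you pick, it can help to funnel every message through one small encoder that adds a version field and checks the size. This is a sketch under assumptions: the `schema_version` field name, the helper name, and the default limit (taken from the figure quoted above, which should be checked against the current SQS quota) are all illustrative.

```python
import json

# Limit quoted in the text above; treat it as an assumption and verify the current quota.
DEFAULT_MAX_BYTES = 256 * 1024


def encode_message(
    payload: dict, schema_version: str = "v1", max_bytes: int = DEFAULT_MAX_BYTES
) -> str:
    """Serialize a payload with a schema_version field and enforce a size limit."""
    body = json.dumps({"schema_version": schema_version, **payload})
    size = len(body.encode("utf-8"))
    if size > max_bytes:
        raise ValueError(f"message is {size} bytes, limit is {max_bytes}")
    return body
```

With the ID-only pattern, `encode_message({"order_id": "12345"})` stays tiny. With the embed-all-details pattern, the size check is what tells you early that a payload has outgrown the queue and should move to a data store.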
7.2 Idempotency
SQS, GCP Pub/Sub, and Azure Service Bus all require you to design so that:
- The same message can be processed more than once without causing problems.
For example:
- Record in a DB whether an order_id has already been processed.
- Use external APIs (e.g., payment) that support resending requests with the same request ID without double charging.
- Use UPSERT instead of plain INSERT to avoid errors when inserting the same row twice.
These patterns allow you to safely scale a system that relies heavily on queues.
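The first pattern (recording processed IDs) can be sketched with nothing but SQLite from the standard library. The table name, function names, and the use of SQLite are assumptions for the sake of a runnable example; in production the same idea would sit on RDS, DynamoDB, or similar.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the table that remembers processed order IDs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_orders (order_id TEXT PRIMARY KEY)"
    )
    return conn


def process_once(conn: sqlite3.Connection, order_id: str, handler) -> bool:
    """Run handler(order_id) only if this order_id has not been recorded yet.

    Returns True if the handler ran, False if the message was a duplicate.
    """
    with conn:  # commit on success, roll back if the handler raises
        cursor = conn.execute(
            "INSERT OR IGNORE INTO processed_orders (order_id) VALUES (?)",
            (order_id,),
        )
        if cursor.rowcount == 0:
            return False  # already processed: skip the duplicate
        handler(order_id)
        return True
```

Because the insert and the handler share one transaction, a handler failure rolls the marker back and the redelivered message gets another chance. Side effects outside the database (emails, external API calls) are not covered by that rollback and still need their own idempotency keys.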
8. Comparing GCP Pub/Sub and Azure Service Bus
8.1 Cloud Pub/Sub (GCP)
- Fundamentally a Pub/Sub model.
- Publish to a topic → multiple subscriptions receive messages via pull or push.
- High throughput; suitable for event-driven and streaming workloads.
- Offers dead-letter topics, ordering keys, and exactly-once delivery for pull subscriptions.
Because its core is Pub/Sub,
Cloud Pub/Sub is often a better natural fit than bare SQS
when multiple consumers need to process the same message.
8.2 Azure Service Bus / Storage Queues
- Azure Service Bus
- Queues + topics + subscriptions.
- Feature-rich (sessions, transactions, scheduled delivery, etc.).
- Very much in the “enterprise messaging” space.
- Azure Storage Queues
- Simpler and cheaper queue service.
- Similar feel to SQS; good for simple background jobs and buffering.
Feature-wise, SQS is closer to Storage Queues in simplicity,
but when combined with SNS / EventBridge / Lambda / ECS/EKS,
you can build architectures similar to Service Bus with topics and subscriptions.
9. Practical Design Checklist (What to Decide in the First 1–2 Weeks)
- Where to switch from synchronous API to SQS
- Keep user-facing parts fast; move heavyweight processing to async.
- Queue type selection: Standard vs FIFO
- How critical is ordering?
- Is it really unacceptable if processing happens twice?
- Message structure (schema)
- ID-only, or include details?
- How to handle versioning (e.g., "schema_version": "v1")
- Relationship between visibility timeout and processing time
- Base it on typical processing time + margin.
- DLQ and max receive count
- After how many failures should a human step in?
- Retry strategy
- How to implement backoff and re-queuing at the consumer side.
- Metrics and alerts
- Queue length (the ApproximateNumberOfMessages queue attribute; ApproximateNumberOfMessagesVisible in CloudWatch)
- Number of messages in the DLQ
- End-to-end latency (from producer to consumer completion)
- Multi-cloud and future expansion
- If you later migrate to Pub/Sub or Service Bus,
can you reuse your message structure and patterns as-is?
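For the "Metrics and alerts" item in this checklist, queue depth can also be read directly from the queue attributes. The sketch assumes `sqs` behaves like a `boto3.client("sqs")`; the function names and the alert threshold are hypothetical.

```python
def queue_depth(sqs, queue_url: str) -> dict:
    """Return approximate counts of waiting and in-flight messages."""
    response = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )
    attributes = response["Attributes"]  # values are returned as strings
    return {
        "visible": int(attributes["ApproximateNumberOfMessages"]),
        "in_flight": int(attributes["ApproximateNumberOfMessagesNotVisible"]),
    }


def dlq_needs_attention(dlq_depth: dict, threshold: int = 0) -> bool:
    """True when the DLQ holds more messages than the allowed threshold."""
    return dlq_depth["visible"] > threshold
```

Calling `queue_depth` on the DLQ and alerting when `dlq_needs_attention` returns True is a simple starting point; for dashboards and alarms, the equivalent CloudWatch metrics are usually the better fit.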
10. Who Benefits and How? (Concrete Value by Role)
10.1 Backend Developers
- When you think, “I want fast responses, but the backend processes are heavy…”,
you’ll naturally consider: “Let’s push everything beyond this point into an SQS queue and let a worker handle it.”
- As your service grows and traffic increases, SQS acts as a buffer,
so your architecture tends to age more gracefully.
10.2 Microservices / SRE / Platform Engineers
- Instead of tightly coupling services with synchronous HTTP calls, you can
loosely couple them using events and queues, enabling:
- Architectures where outages in one service don’t cascade across the system,
- “Valves” that help your system survive slowdowns.
- If you understand SQS / SNS / EventBridge, GCP Pub/Sub, and Azure Service Bus side by side,
you gain the ability to apply the same messaging patterns even when you change clouds.
10.3 Batch Processing and Data Platform Engineers
- With SQS queuing a large number of jobs or data tasks, you can control:
- “Number of messages processed per minute,” and
- “Number of concurrent workers,”
making it easier to tune throughput.
- This lets you smoothly regulate load on downstream DBs and external APIs.
10.4 Tech Leads, Architects, and CTOs
- You can conceptualize your system in three layers:
- Core synchronous API layer (API Gateway / Load Balancer),
- Asynchronous messaging layer (SQS / SNS / EventBridge / Pub/Sub / Service Bus),
- Batch / worker / data processing layer.
This helps you design architectures that are resilient to future load growth and service decomposition.
11. Three Things You Can Start Doing Today
- Identify one process that could be asynchronous in your existing system.
- Examples: sending emails, generating PDFs, report aggregation, image conversion, etc.
- Draw a simple diagram that splits that process into SQS + worker
- Producer → SQS → Consumer (Lambda / ECS).
- Build a small PoC
- If you have an AWS account, create a Standard queue and:
- A Lambda function that sends messages, and
- A Lambda function that processes messages (triggered via SQS event source mapping).
You’ll quickly get an intuitive feel for how it behaves.
If you try the same with GCP Pub/Sub + Cloud Functions or Azure Storage Queue + Functions,
you’ll see firsthand that “the design mindset for messaging is common across clouds.”
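For the PoC, the processing Lambda could start from a sketch like the one below. It assumes the event source mapping has partial batch responses (ReportBatchItemFailures) enabled, so that only the failed messages are retried; the `process` function is a hypothetical stand-in for your real work.

```python
import json


def process(payload: dict) -> None:
    """Placeholder for the real work (hypothetical)."""
    if "order_id" not in payload:
        raise ValueError("missing order_id")


def lambda_handler(event: dict, context) -> dict:
    """Process each SQS record and report which ones failed."""
    failures = []
    for record in event.get("Records", []):
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message is retried; the others count as done.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Without partial batch responses enabled, Lambda ignores this return value, and the usual behavior applies: raising an exception retries the whole batch, while returning normally deletes it.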
12. Wrap-Up: SQS as a Key Layer for Smoothing “Time and Load”
Amazon SQS is:
- A message queue that loosely connects senders and receivers,
- A buffer that absorbs spikes and supports retries and error handling during failures, and
- In microservice and serverless architectures, a component that softens the “time axis”.
GCP’s Cloud Pub/Sub and Azure’s Service Bus / Storage Queues
are peers that help realize the same “message-connected world.”
What matters is:
- Don’t try to do everything with synchronous HTTP.
- Be willing to delegate responsibility to the messaging layer.
- Design with tolerance for duplicates, reordering, and failed messages.
By gradually introducing SQS (or Pub/Sub / Service Bus)
starting with small features,
your system will slowly evolve into one that is
“more robust, more scalable, and easier to change later.”
Without rushing, review each process one by one, asking:
“Should this be synchronous or asynchronous?” “Should this go through a queue?”
And step by step, let’s grow a messaging design that fits your service perfectly.
