Vendor Evaluation·8 min read·2025-08-24·Colm Byrne, Technical Product Manager

SQS DLQ 14-Day Retention Cap: What Happens When You're On Vacation During a Webhook Incident

SQS is excellent queue infrastructure. But when a webhook incident happens while you're unavailable, the 14-day retention clock on dead-letter queues runs whether you're watching or not.

The engineering team that chose Amazon SQS for their event processing pipeline made a defensible decision. SQS has handled production messaging workloads at scale since 2006. It is AWS-native, deeply integrated with Lambda, ECS, and EC2. The managed infrastructure means you are not running a broker. The durability guarantees are real. At meaningful scale, SQS is often the correct answer. The AWS SQS documentation covers the full DLQ configuration.

But SQS is a queue, not an evidence layer. Understanding that distinction — and where it creates operational gaps — is worth doing before an incident tests your assumptions.

What SQS Does Genuinely Well

Amazon SQS is battle-tested infrastructure. The list of what it does well is long and the claims are substantiated by years of production use across the industry.

Standard queues support nearly unlimited throughput; there is no rate limit a typical application will hit. Messages are stored redundantly across multiple AWS availability zones, providing durability against single-AZ failures. When your application consumer processes a message and deletes it, the message is gone. When your consumer fails and does not delete it, the message becomes visible again after the visibility timeout and another consumer picks it up. That failure model is predictable and well-understood.
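The visibility-timeout cycle can be sketched as a toy model. This illustrates the semantics only; it is not the SQS API, and the class and method names are invented for the example:

```python
class VisibilityQueue:
    """Toy model of the standard-queue visibility-timeout cycle.
    Illustrative only; real SQS is a distributed service, not a dict."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # message id -> (body, invisible_until)

    def send(self, msg_id, body):
        self._messages[msg_id] = (body, 0.0)

    def receive(self, now):
        for msg_id, (body, invisible_until) in self._messages.items():
            if now >= invisible_until:
                # Receiving hides the message for the timeout window;
                # if the consumer never deletes it, it reappears.
                self._messages[msg_id] = (body, now + self.visibility_timeout)
                return msg_id, body
        return None

    def delete(self, msg_id):
        self._messages.pop(msg_id, None)
```

A consumer that crashes before calling `delete` simply never acknowledges the message, and the next `receive` after the timeout hands it to another consumer.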

FIFO queues extend the model to ordered processing and exactly-once semantics within the deduplication window. For payment processing, inventory updates, and any workflow where message ordering or idempotency is a correctness requirement, FIFO queues provide guarantees that standard queues deliberately do not.

The integration surface is comprehensive. Lambda event source mappings, SNS subscriptions, EventBridge pipes, Kinesis Data Streams — the AWS ecosystem is built to funnel events through SQS without custom plumbing. For teams already operating in AWS, the integration cost to add SQS to a new pipeline is low.

Dead-letter queues complete the model. When a message fails processing repeatedly — exceeding the configured receive count — SQS routes it to the DLQ automatically. The DLQ holds the failed message for subsequent investigation and reprocessing. For a greenfield system with an operations team that monitors the DLQ actively, this is a functional failure-capture mechanism. For a comparison with RabbitMQ's approach, see our post on RabbitMQ dead-letter exchange handling.
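Wiring a DLQ to a source queue is a small piece of configuration: a redrive policy naming the DLQ's ARN and the receive-count threshold. A minimal sketch of building those attributes (the ARN and count are illustrative; pass the result to SQS `SetQueueAttributes`):

```python
import json

def dlq_redrive_attributes(dlq_arn, max_receive_count=5):
    """Queue attributes that attach a DLQ via a redrive policy.
    After max_receive_count failed receives, SQS moves the message
    to the queue identified by dlq_arn."""
    return {
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": str(max_receive_count),
        })
    }
```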


The Structural Constraints Every Team Discovers

SQS is honest about its operational model. The constraints are documented, not hidden. But teams regularly discover them in practice at moments that make the documentation feel abstract.

The at-least-once delivery guarantee on Standard queues means that messages can be delivered more than once. AWS documentation states this directly and advises that applications should be designed to handle duplicate message delivery. The prescription — "design your application to be idempotent" — is correct. Implementing idempotency requires explicit engineering work: a deduplication key, a storage layer to track processed IDs, a check on every message before processing. For teams coming to SQS from a background in exactly-once systems, this represents work that exists outside the queue itself.
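One slice of that work, choosing the deduplication key, might look like this sketch. The Stripe-style event id and the fallback to a content hash are illustrative choices, not a prescription:

```python
import hashlib

def dedup_key(provider_event_id, raw_body):
    """Choose a deduplication key for an inbound event: prefer the
    provider's event id (e.g. a Stripe 'evt_...' id) when one exists,
    otherwise fall back to a content hash of the raw payload."""
    if provider_event_id:
        return provider_event_id
    return hashlib.sha256(raw_body).hexdigest()
```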

The FIFO deduplication window is five minutes. Within a five-minute window, SQS will reject messages with the same deduplication ID. After five minutes, the window expires. If a duplicate arrives six minutes later, FIFO will not catch it. Your idempotency logic needs to handle that case independently.
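The window behavior is easy to demonstrate with a sketch. This is an illustration of the semantics, not the FIFO implementation; timestamps are plain seconds to keep the example self-contained:

```python
class DedupWindow:
    """Sketch of FIFO-style deduplication with a fixed window."""

    WINDOW_SECONDS = 300  # five minutes, matching the FIFO dedup interval

    def __init__(self):
        self._first_seen = {}  # dedup_id -> timestamp of first acceptance

    def accept(self, dedup_id, now):
        seen_at = self._first_seen.get(dedup_id)
        if seen_at is not None and now - seen_at < self.WINDOW_SECONDS:
            return False  # duplicate inside the window: rejected
        # Outside the window the queue no longer remembers the id,
        # so the "duplicate" is accepted as a new message.
        self._first_seen[dedup_id] = now
        return True
```

The third assertion in the test below is the case the paragraph warns about: a duplicate arriving six minutes later is accepted as new.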

Messages on Standard queues may occasionally arrive out of order. AWS documentation notes this as a structural characteristic of the distributed storage model, not a bug condition. For workflows where ordering matters, this drives teams toward FIFO queues, which carry higher per-request cost and lower throughput limits.

The tooling layer is absent by default. SQS has no built-in UI for browsing messages, searching by content, or inspecting the full history of what arrived. The AWS Console provides a basic "Poll for messages" interface that can sample messages from the queue, but does not provide searchable history, payload inspection at scale, or a production-grade operations interface. Teams that need those capabilities build them. The custom DLQ monitoring dashboard, the alerting on DLQ depth, the operational runbook for reprocessing — all of this is DIY.


The Vacation Problem

The 14-day maximum retention period is an AWS-set limit that applies to both standard queues and dead-letter queues. You configure retention between 1 minute and 14 days. The default is 4 days. At the maximum, messages persist for 14 days before SQS deletes them permanently.
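Retention is set per queue via the `MessageRetentionPeriod` attribute, in seconds. A minimal sketch of building and validating that attribute (the helper name is invented; the bounds match the documented 60-second-to-14-day range):

```python
def retention_attributes(days):
    """Queue attributes for message retention. SQS accepts 60 seconds
    to 1,209,600 seconds (14 days); the default is 4 days."""
    seconds = days * 86400
    if not 60 <= seconds <= 1209600:
        raise ValueError("retention must be between 60 seconds and 14 days")
    return {"MessageRetentionPeriod": str(seconds)}
```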

For most steady-state operations, this limit is invisible. Messages are processed, deleted, and never approach the retention boundary. The limit only matters when processing is disrupted — when messages age in the DLQ because no one has investigated them yet.

A thread from an AWS community forum captures the failure mode precisely. A developer asked, in effect: what happens if I'm on vacation when a webhook incident starts and I can't investigate within 14 days? The question is not hypothetical. Real teams have real vacation schedules, real on-call rotations with gaps, and real incident timelines that outlast the bandwidth available to investigate them. A webhook incident that starts on a Friday before a two-week company holiday is, structurally, a situation where the 14-day retention clock runs before the investigation can begin.

When the DLQ messages expire, the evidence is gone. The raw payloads — the webhook bodies that arrived during the incident window — are no longer available. You can know that failures occurred (because you have CloudWatch metrics showing DLQ depth rising and falling), but you cannot answer "what was in those webhooks" from the SQS records. The payloads are deleted.

For incident forensics — the investigation that asks "what did our upstream send us during the outage, and was the data malformed, or was the failure on our side?" — the answer requires the payload. And the SQS model does not guarantee that payload is available when your investigation begins. This is the same forensics gap described in our post on silent webhook failures — you know something went wrong, but the evidence has expired.


The DIY Tax Accumulates

Each structural constraint in SQS creates a corresponding engineering task. The tasks are individually manageable. The total weight of them, across a team building webhook processing infrastructure from scratch, is significant.

Idempotency requires: choosing a deduplication key, storing processed IDs in a durable store (DynamoDB, Redis, Postgres), checking on every message before processing, handling the race condition where two consumers pick up the same message simultaneously, and deciding what "processed" means for messages that partially succeed.
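The claim-before-process step can be sketched as follows. The in-memory set stands in for a conditional write in a durable store (in production, something like a DynamoDB `put_item` with an `attribute_not_exists` condition, or an equivalent unique insert in Redis or Postgres); all names here are illustrative:

```python
class IdempotencyStore:
    """In-memory stand-in for a conditional-write store. A real
    deployment needs a durable store with an atomic insert."""

    def __init__(self):
        self._claimed = set()

    def claim(self, dedup_key):
        # First caller wins; a second consumer holding the same
        # message sees False and skips. A truly atomic conditional
        # write is what resolves the two-consumer race in production.
        if dedup_key in self._claimed:
            return False
        self._claimed.add(dedup_key)
        return True

def handle(message, store, process):
    """Process a message at most once per deduplication key."""
    if not store.claim(message["id"]):
        return "duplicate-skipped"
    process(message)
    return "processed"
```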

DLQ workflow requires: configuring the DLQ, setting the receive count threshold, monitoring DLQ depth with CloudWatch alarms, building or integrating an interface for inspecting dead messages, writing a reprocessing job that replays DLQ messages to the original queue or processes them directly, and testing that the reprocessing job handles idempotency correctly.
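The depth alarm is the smallest of those pieces. A sketch of the parameters for a CloudWatch alarm that fires when the DLQ is non-empty; the metric and dimension names follow CloudWatch's SQS metrics, but treat the exact shape (intended for `put_metric_alarm`) as illustrative:

```python
def dlq_depth_alarm(queue_name, threshold=1):
    """Parameters for a CloudWatch alarm on DLQ depth."""
    return {
        "AlarmName": f"{queue_name}-dlq-not-empty",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }
```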

Retention archiving requires: deciding before 14 days elapses that you want to keep the payload, building a Lambda function or worker that reads the DLQ before expiry and archives to S3 or another store, managing the archive with its own retention policy, and building a search mechanism on top of the archive if you need to find specific payloads later.
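The core of that archiver, copying DLQ messages to durable storage before expiry, can be sketched as a pure function. The message dicts mimic boto3 `receive_message` entries, and `put_object` is any `(key, body)` callable (in production, a thin wrapper around `s3_client.put_object`); all names are illustrative:

```python
import json

def archive_dlq_batch(messages, put_object, received_date):
    """Copy DLQ messages into an archive before retention expires.
    Returns the receipt handles so the caller can delete the
    archived messages from the DLQ afterward."""
    receipt_handles = []
    for msg in messages:
        key = f"dlq-archive/{received_date}/{msg['MessageId']}.json"
        put_object(key, json.dumps({
            "message_id": msg["MessageId"],
            "body": msg["Body"],
            "attributes": msg.get("Attributes", {}),
        }))
        receipt_handles.append(msg["ReceiptHandle"])
    return receipt_handles
```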

Each of these is a solved engineering problem. None of them is unsolvable. But the accumulation of these tasks means that a team using SQS for webhook processing needs to build, test, and operate a layer of tooling before they have something equivalent to a purpose-built webhook evidence layer. That tooling has maintenance cost over time.


When HookTunnel Fits Alongside SQS

SQS and a purpose-built webhook capture layer are not mutually exclusive. The architecture that many teams arrive at naturally combines them: SQS handles scale and processing durability behind the application; a dedicated capture layer sits at the HTTP edge where webhooks arrive.

HookTunnel operates at the edge. Every inbound HTTP request to a HookTunnel hook URL is captured in full — method, headers, body, timestamp, source IP — and stored in searchable history. The capture happens at the ingress point, before the message enters any queue. The full HTTP payload is stored independently of what happens downstream.

The Free plan retains 24 hours of history. Pro at $19/month extends that to 30 days. Enterprise plans extend further. The retention is defined in terms of time, not by a message processing lifecycle, which means the payload from an incident last Tuesday is available today regardless of whether the SQS message was processed, moved to a DLQ, or has already expired.

Replay in Pro sends the original captured payload — exact headers, exact body — to any URL you specify at replay time. When you have fixed your handler and want to verify it against the actual payloads from the incident window, you replay from the HookTunnel history to your production endpoint or your staging environment. You do not need to reconstruct the payload from CloudWatch logs or hope that a DLQ archive script ran before the 14-day window closed.

The architecture becomes: webhooks arrive at the HookTunnel hook URL, are captured in full, and are forwarded to your application. Your application pushes messages to SQS for processing at scale with full durability and idempotency guarantees. SQS handles the volume and the consumer orchestration. HookTunnel holds the original HTTP evidence for as long as the plan retains it, available for forensics and replay whenever you need it, including after a two-week vacation. For the SQS redrive workflow that follows incident investigation, see our post on SQS DLQ redrive operational overhead.

HookTunnel makes no uptime or delivery guarantees in its Terms. It is not a queue, not a retry engine, not a replacement for SQS's processing durability. The value it adds is at the evidence layer: the captured payload that exists independently of the processing pipeline, available when the investigation begins regardless of when that is.


The Right Tool at the Right Layer

SQS is the correct choice at scale. The durability, throughput, and AWS integration are not matched by purpose-built tools that are trying to solve a different problem. For teams processing high-volume webhook events through Lambda consumers with DLQ-based error handling and idempotency at the handler layer, SQS is the backbone of a sound architecture.

The gap is at the HTTP edge, where the original webhook arrives. SQS captures what your application enqueued. It does not capture what your provider sent over HTTP. Those are different records. The headers, the raw body, the timestamp of the original inbound request — that information exists at the HTTP layer and is not preserved by the queue.
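Capturing that record is conceptually simple: store the full request at ingress, before anything is enqueued. A minimal sketch, assuming a generic append-only sink (a real capture layer makes a durable write here; the field names are illustrative):

```python
import datetime

def capture_request(method, headers, body, source_ip, sink):
    """Record the raw inbound HTTP request before any queue sees it."""
    record = {
        "method": method,
        "headers": dict(headers),
        "body": body.decode("utf-8", errors="replace"),
        "source_ip": source_ip,
        "received_at": datetime.datetime.now(
            datetime.timezone.utc
        ).isoformat(),
    }
    sink.append(record)
    return record
```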

When the incident starts and the question is "what did Stripe actually send us at 11:42 PM on that Tuesday" — that question is not answerable from the SQS records unless something captured the raw HTTP request at ingress and stored it independently. See the webhook debugging checklist for a systematic approach to working through this evidence gap.

That verdict stands. But at the HTTP edge, where webhooks arrive, a replay-capable evidence layer often costs less to operate than the DIY DLQ workflow it augments — and it answers the forensic questions that a queue, by design, was never built to answer.

Stop guessing. Start proving.

Generate a webhook URL in one click. No signup required.

Get started free →

Frequently Asked Questions

What is the SQS DLQ 14-day retention limit and why does it matter?
AWS SQS sets a maximum retention period of 14 days for both standard queues and dead-letter queues (the default is 4 days). When messages expire, they are permanently deleted — including the original payload. This only creates a problem when processing is disrupted and messages age in the DLQ without investigation. An incident that starts while your team is unavailable — vacation, a long holiday weekend — can exhaust the retention window before the investigation begins.
What happens to webhook evidence when DLQ messages expire before investigation?
The raw payloads are gone. You can know failures occurred from CloudWatch metrics showing DLQ depth rising and falling, but you cannot answer "what did our upstream send us during the outage?" from SQS records. Forensic investigation — determining whether data was malformed or the failure was on your side — requires the payload. After 14 days, SQS cannot provide that. Recovery requires asking the provider to re-deliver (if supported) or accepting the gap.
What is 'the vacation problem' with SQS DLQs?
The vacation problem is a real scenario: a webhook incident starts on a Friday before a two-week company holiday. The DLQ accumulates failed messages. The 14-day retention clock runs whether anyone is watching. When engineers return, some or all of the failed messages have expired. The forensic evidence is gone. The solution is either building a DLQ archiving pipeline (a Lambda that reads messages to S3 before expiry) or keeping original HTTP payloads in a separate capture layer with longer, time-based retention independent of queue lifecycle.
How does an HTTP-edge capture layer complement SQS for webhook processing?
The architecture combines them: webhooks arrive at a capture URL, are stored in full (headers, body, timestamp), then forwarded to your application which pushes to SQS for processing at scale. SQS handles consumer orchestration and processing durability. The capture layer holds the original HTTP evidence independently of the queue lifecycle — available for forensics and replay whenever the investigation begins, including after a two-week vacation. SQS captures what your application enqueued; the capture layer captures what the provider sent over HTTP. Those are different records.
How do I get started with HookTunnel?
Go to hooktunnel.com and click Generate Webhook URL — no signup required. You get a permanent webhook URL instantly. Free tier gives you one hook forever. Pro plan ($19/mo flat) adds 30-day request history and one-click replay to any endpoint.