Why Every Team Building Webhooks on SQS Reinvents Idempotency — and What the Alternatives Look Like
AWS docs say it plainly: design your SQS application to be idempotent. What they don't say is how many teams discover this requirement after their first duplicate charge or double-processed order.
The SQS Standard queue delivery model is at-least-once: your consumer will receive each message at least once, and in some cases more than once. The documentation does not bury this. It states directly that "Amazon SQS stores copies of your messages on multiple servers for redundancy and high availability. On rare occasions, one of the servers that stores a copy of a message might be unavailable when you receive or delete a message. If this occurs, the copy of the message is not deleted from that server. If so, you could receive that message copy again." The prescribed response: "Design your application to be idempotent."
The advice is correct. The problem is not with the advice — it is with how consistently teams discover they needed to act on it only after their first duplicate payment charge, double-processed inventory update, or duplicated Slack notification that confused their customers.
SQS is genuinely excellent infrastructure, and that is worth stating plainly before addressing the idempotency requirement. Standard queues deliver at massive throughput with high availability across AWS availability zones. The integration surface with Lambda, EventBridge, and ECS is mature and deeply supported. For high-volume event processing at AWS-native scale, SQS is often the correct answer. The at-least-once delivery model is not a design flaw — it is the explicit tradeoff that enables the throughput and availability guarantees. You get scale; you handle duplicates.
The question this post addresses is what "handle duplicates" actually requires in practice, and why the implementation is more work than the documentation phrase "design your application to be idempotent" suggests.
What idempotency actually requires in practice
Idempotency, in the context of an SQS message handler, means that receiving and processing the same message twice produces exactly the same state as processing it once. This is a correctness guarantee, not an optimization. For a read-only handler — one that fetches data and returns it — idempotency is automatic. For a handler that creates a record, charges a card, sends an email, or updates inventory, idempotency requires explicit engineering.
The implementation requires several distinct components working together.
A stable deduplication key. Every message needs a key that uniquely identifies the operation it represents, so that if the same message arrives twice, you can recognize it as a duplicate. For webhook events from providers like Stripe or GitHub, this is usually the provider's event ID — a string like evt_1Abc2DefGhiJKl that the provider guarantees is unique per event. For providers that do not include stable event IDs, you need to derive a key from a stable hash of the payload contents, which requires careful normalization to ensure the same payload always produces the same hash regardless of JSON key ordering or whitespace differences.
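A minimal sketch of key derivation, assuming the provider puts its event ID in an `id` field (check your provider's schema), with a canonicalized-hash fallback for payloads without one:

```python
import hashlib
import json

def derive_dedup_key(payload: dict) -> str:
    """Derive a stable deduplication key from a webhook payload.

    Prefer the provider's event ID when present; otherwise fall back to
    a hash of the canonicalized payload. The "id" field name here is an
    assumption about the provider's schema.
    """
    if "id" in payload:
        return str(payload["id"])
    # Canonicalize: sorted keys and fixed separators so that key order
    # and whitespace differences do not change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

The canonicalization step is the part teams forget: two JSON documents with identical content but different key order must hash to the same value, or the fallback key is not stable.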
A durable key store. The deduplication key must be persisted somewhere durable — not in memory, which is lost on restart, and not in a cache with insufficient retention. Common choices are a dedicated database table with a unique index on the key, a Redis SET with a TTL long enough to cover the message visibility window plus expected retry delays, or a DynamoDB table with a conditional write. Each storage choice has its own operational considerations: the database table requires query performance tuning as it grows, Redis keys require TTL calibration to avoid either premature expiry or indefinite accumulation, and DynamoDB conditional writes require handling the condition-check-failure exception as a success path.
A check before every side effect. Every handler that modifies external state must check the key store before acting. The check is not optional and cannot be added later as an optimization. It is a correctness requirement. This means every handler that sends an email, charges a card, creates an order, or updates a count must include the lookup and the early-return path before reaching its side-effectful code.
Transaction-safe write semantics. The check-and-act sequence — look up the key, if not found then act and record the key — must be atomic. If two instances of your consumer receive the same message simultaneously (which is structurally possible under SQS's delivery model), both might pass the idempotency check before either has recorded the key, resulting in both executing the side effect. Preventing this requires either a database transaction that combines the check and the insert, a Redis set-if-not-exists (SETNX) operation, or equivalent atomic semantics in whatever store you are using.
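The database-transaction variant can be sketched with an insert-first pattern: claim the key atomically via a unique index, and only run the side effect if the claim succeeded. In-memory SQLite stands in here for a shared durable store; the shape is the same for Postgres or MySQL.

```python
import sqlite3

# In-memory SQLite stands in for a shared durable store. The unique
# primary key makes "claim the key" atomic: if two consumers race,
# exactly one INSERT succeeds and the other raises IntegrityError.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed (dedup_key TEXT PRIMARY KEY)")

def try_claim(key: str) -> bool:
    """Atomically claim the key. Returns True exactly once per key."""
    try:
        with conn:  # transaction: commits on success, rolls back on error
            conn.execute("INSERT INTO processed (dedup_key) VALUES (?)", (key,))
        return True
    except sqlite3.IntegrityError:
        return False  # another consumer (or attempt) already claimed it

def handle(message_key: str, side_effect) -> bool:
    if not try_claim(message_key):
        return False  # duplicate: skip the side effect
    side_effect()
    return True
```

Note that claiming before acting resolves the race but surfaces the next question in the list: a crash between the claim and the side effect leaves a claimed key and no completed action.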
A definition of "processed" for partial successes. What happens if the handler sends the email but then fails before recording the idempotency key? Or records the key before the email succeeds? The answer depends on which direction you consider the safe failure — are you more concerned about sending the email twice, or about recording that you sent it when you did not? This is a business logic question disguised as an infrastructure question, and it has a different correct answer for different types of operations.
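One way to make that business decision explicit in code is a three-state record: unseen, pending, done. The sketch below uses a plain dict standing in for a durable store; the handling of a stale "pending" state is deliberately left as a policy choice, because that is exactly the question the paragraph above describes.

```python
# A minimal sketch of a three-state idempotency record. A dict stands
# in for a durable store. The policy for a stale "pending" record is
# the business decision: retry (risk acting twice) or hold for review
# (risk never acting).
records: dict[str, str] = {}

def process(key: str, side_effect) -> str:
    status = records.get(key)
    if status == "done":
        return "skipped"          # fully processed earlier
    if status == "pending":
        # A previous attempt claimed the key but crashed before
        # finishing. Whether to retry here depends on the operation.
        return "needs-review"
    records[key] = "pending"      # claim before acting
    side_effect()                 # e.g. send the email, charge the card
    records[key] = "done"         # mark complete only after success
    return "processed"
```

For an email, retrying a stale "pending" record is usually acceptable; for a card charge, it usually is not. The state machine does not answer the question, but it forces the answer to be written down.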
The reinvention pattern
The sequence by which teams discover the idempotency requirement is consistent enough that it is recognizable as a pattern. For Stripe specifically, see how Stripe duplicate webhook events manifest and how idempotency fails under concurrent load.
The pipeline is built. The SQS queue is configured. The Lambda consumer is deployed. Testing looks clean — messages flow through, handlers process them, everything works. The deployment ships.
The first duplicate happens. A Stripe event is delivered twice — not because of a bug in the pipeline, but because Stripe's retry logic delivered a second copy, or because the SQS visibility timeout was shorter than the handler's execution time, or because a Lambda instance was terminated mid-execution and the message was made visible again. A payment is charged twice, or an order is created twice, or a webhook notification fires twice and a customer notices.
The team investigates, confirms the at-least-once delivery behavior, reads the documentation recommendation, and builds an idempotency layer. The build takes several days of focused engineering: designing the key store, implementing the check-and-act pattern, writing tests for the concurrent access case, and deploying the change.
The new version ships. The duplicate problem is addressed. Two weeks later, a different handler that was built without the idempotency layer produces a different duplicate incident. The idempotency layer needs to be applied retroactively to every handler in the pipeline. This is a non-trivial audit.
The pattern is not unique to inexperienced teams. Experienced teams build idempotency correctly the first time and then discover that a new handler was added by a teammate who was not aware of the convention. The convention is not enforced by the infrastructure — it is enforced by discipline, code review, and shared understanding. All three can fail.
Common idempotency implementations and their limits
Redis TTL key. The most common implementation — and the one most likely to let duplicates through due to TTL miscalibration. Store the message deduplication ID in Redis with a TTL. If SETNX returns false, the message was already processed; skip it. The implementation is simple and performant. The limit: the TTL must be at least as long as the latest possible duplicate delivery. SQS Standard queues can deliver duplicates well outside the original visibility timeout in edge cases. A TTL set too short allows duplicates through; a TTL set too long accumulates keys that consume memory. Calibrating the TTL requires understanding the tail behavior of SQS delivery, which is not fully documented.
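The pattern itself is small. With redis-py the real call is `Redis.set(key, value, nx=True, ex=ttl)`, which returns `True` on a successful claim and `None` when the key already exists. The sketch below uses a tiny in-memory stand-in for that one call so it runs without a server; the TTL value is an illustrative guess, not a documented SQS bound.

```python
import time

class FakeRedis:
    """In-memory stand-in for redis-py, supporting only
    set(..., nx=True, ex=ttl). With a real server, replace this whole
    class with redis.Redis()."""
    def __init__(self):
        self._store = {}  # key -> (value, expiry timestamp)

    def set(self, key, value, nx=False, ex=None):
        now = time.time()
        if nx and key in self._store and self._store[key][1] > now:
            return None  # key exists and has not expired: SETNX fails
        self._store[key] = (value, now + (ex or float("inf")))
        return True

# The TTL must cover the latest plausible duplicate delivery.
# Six hours is an assumed value for illustration only.
DEDUP_TTL_SECONDS = 6 * 60 * 60

r = FakeRedis()

def already_processed(dedup_key: str) -> bool:
    # A truthy result means we claimed the key; None means a duplicate.
    claimed = r.set(f"dedup:{dedup_key}", "1", nx=True, ex=DEDUP_TTL_SECONDS)
    return claimed is None
```

The single atomic `SET NX EX` call is what makes this safe under concurrent consumers; a separate `GET` followed by `SET` reintroduces the race.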
Database unique constraint. Insert the deduplication key into a dedicated table with a unique index. Catch the unique constraint violation and treat it as a success path. This is simple, durable, and avoids TTL calibration. The limit: the table grows unboundedly without a cleanup job. A table accumulating deduplication keys for high-volume webhook traffic will grow by millions of rows per month. Without periodic archiving or deletion, query performance degrades over time. The cleanup job is additional operational work.
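The cleanup job itself can be a scheduled delete bounded by a retention window. A sketch on SQLite, with a 30-day retention window as an assumed value (it should exceed the longest retry schedule of any provider you accept):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed (dedup_key TEXT PRIMARY KEY, seen_at REAL NOT NULL)"
)

RETENTION_SECONDS = 30 * 24 * 60 * 60  # 30 days: an assumed window

def record(key: str, seen_at: float) -> None:
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO processed VALUES (?, ?)", (key, seen_at)
        )

def purge_expired(now: float) -> int:
    """Delete keys older than the retention window; returns rows removed.
    In production this runs on a schedule (cron, an EventBridge rule)."""
    with conn:
        cur = conn.execute(
            "DELETE FROM processed WHERE seen_at < ?",
            (now - RETENTION_SECONDS,),
        )
    return cur.rowcount
```

Setting the retention window shorter than the provider's longest retry schedule reintroduces duplicates through the back door: a late retry arrives after its key was purged.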
FIFO queue deduplication window. Switch to a FIFO queue, which rejects messages with the same deduplication ID within its five-minute deduplication window. This is the most structurally clean solution for duplicates that arrive inside that window. The limit: duplicates that arrive after five minutes are not deduplicated. For providers with retry schedules that extend hours or days — Stripe retries over 72 hours — the five-minute window does not provide complete deduplication. Application-level idempotency is still required for out-of-window duplicates.
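If you do adopt a FIFO queue, supplying an explicit `MessageDeduplicationId` from the provider's event ID makes the window work in your favor. A sketch of building the `send_message` parameters, assuming the event ID lives in an `id` field and using a placeholder message group:

```python
import hashlib
import json

def fifo_send_params(queue_url: str, event: dict) -> dict:
    """Build send_message kwargs for an SQS FIFO queue.

    MessageDeduplicationId drives the five-minute deduplication window;
    using the provider's event ID (assumed to live in event["id"]) makes
    retries of the same event collapse within that window. MessageGroupId
    controls ordering scope; "webhooks" here is a placeholder.
    """
    body = json.dumps(event, sort_keys=True)
    dedup_id = event.get("id") or hashlib.sha256(body.encode()).hexdigest()
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageDeduplicationId": dedup_id,
        "MessageGroupId": "webhooks",
    }

# With boto3 this would be passed through as:
#   sqs.send_message(**fifo_send_params(url, event))
```

The content-hash fallback mirrors what SQS does itself when `ContentBasedDeduplication` is enabled, but an explicit provider event ID is preferable: it stays stable even if you enrich or reformat the body before enqueueing.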
When to absorb the idempotency cost
For teams processing high-volume, side-effectful webhook events at AWS scale — payment processing, order management, subscription lifecycle — the idempotency implementation cost is the correct price to pay. The SQS throughput and durability guarantees are worth more than the engineering cost of the idempotency layer for these workloads.
Teams that have accepted the idempotency tax upfront, built it correctly into every handler, and are running at scale on SQS Standard have made a defensible architectural choice. The infrastructure is reliable. The duplicate handling is robust. The system behaves correctly under the at-least-once delivery model because it was designed to.
When to sidestep it
For teams using inbound webhooks primarily for visibility, debugging, and incident recovery — not high-volume production event processing — the SQS model adds complexity without proportionate benefit.
The core problem at the HTTP edge of webhook ingestion is: capture the request, store it, and make it replayable. That problem does not require queue semantics, visibility timeouts, idempotency layers, or deduplication windows.
HookTunnel captures the inbound HTTP request at the edge — method, headers, body, timestamp — and stores it before any forwarding attempt. The capture step does not introduce duplicate delivery risk in the way SQS does. Replay on Pro sends the original captured payload to any endpoint you choose, on demand, without re-queuing. There is no idempotency layer to build for the capture operation because the capture model does not have at-least-once delivery semantics. You can also read about the SQS 14-day retention limit that affects DLQ-based recovery strategies.
HookTunnel's Terms of Service do not include uptime or delivery guarantees. For production event processing that needs durability at SQS scale, HookTunnel is not a substitute. The architectures are complementary: SQS handles processing durability downstream; HookTunnel handles capture and forensics at the HTTP ingress layer.
Pro at $19 per month includes 30 days of history and replay to any endpoint. Free accounts retain 24 hours.
Idempotency is not optional with SQS Standard
For teams evaluating whether to build on SQS or use a purpose-built tool, see our webhook vendor evaluation checklist and review SQS pricing to model the full cost including the engineering overhead.
The AWS documentation is honest about what SQS Standard requires. The requirement is not a footnote — it is a first-order design constraint that applies to every handler that modifies external state.
The question for any team adopting SQS for webhook processing is not whether to implement idempotency. It is whether they are prepared to implement it correctly — from the first handler, in every handler, with the concurrent-access case handled, with the partial-success case resolved, with a convention that propagates consistently as the handler surface grows.
That work is worth doing for the teams that genuinely need what SQS provides. For teams that primarily need capture, history, and replay at the HTTP edge, there is a category of tooling that sidesteps the requirement entirely — because it was never designed around queue semantics in the first place.
Stop guessing. Start proving.
Generate a webhook URL in one click. No signup required.
Get started free →