How Svix's Retry Schedule Handles Webhook Delivery Failures at Scale
Retries are where most webhook systems quietly fail. Svix's retry schedule was designed for the real world: customer servers down, rate-limited, flaky.
Sending one webhook is easy. Curl a URL, check the response code, call it done. You could write it in an afternoon.
Sending a hundred thousand webhooks reliably — to endpoints that go down during deployments, that rate-limit your deliveries, that return 500s when a customer's database is struggling, that time out because a handler is stuck in a blocking call — that is a different problem. The gap between "we send webhooks" and "our webhooks are reliably delivered" is where most webhook implementations quietly collapse.
Svix built a retry schedule designed for the real world. It is worth understanding in detail, not because Svix is the only tool with retry logic, but because the specific engineering choices they made reflect hard-won lessons about what actually breaks at scale.
Svix's retry schedule: how it actually works
Svix attempts delivery up to 11 times over the course of five days — a retry window long enough to survive multi-day infrastructure incidents. The Svix documentation covers the full retry configuration. See also G2 reviews of Svix for customer feedback on how the retry behavior performs in practice.
The shape of the schedule is deliberate: retries are dense early and sparse later. Most delivery failures are transient. A customer's server is in the middle of a deployment. A handler is crashing on a particular payload shape, and the fix has only just shipped. A load balancer is flapping. These failures resolve within minutes or hours, and the dense early retries catch them. Persistent failures, such as a customer's endpoint that has been returning 500s for two days, should not be hammered indefinitely. The long tail of the backoff curve gives persistent failures time to surface to a human without wasting capacity on a target that is clearly not recovering.
The retry window of five days is significant. Most teams building webhook retry logic in-house set windows measured in hours — 24 at most — before they start pruning the queue. Five days means a customer can have a multi-day infrastructure incident, fix it, and recover their missed events without calling support.
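To make "dense early, sparse late" concrete, here is a minimal sketch of an exponential schedule. The 5-second base gap and the tripling factor are illustrative assumptions, not Svix's published intervals, and `retry_offsets` is a hypothetical helper. Only the 11-attempt count comes from the text above, and with these made-up numbers the schedule finishes well inside five days.

```python
from datetime import timedelta
from itertools import accumulate


def retry_offsets(attempts=11, base_seconds=5, factor=3):
    """Return when each attempt fires, measured from the first attempt.

    Attempt 1 is immediate. Each later gap is `factor` times the previous
    gap. All three defaults are hypothetical values chosen for illustration.
    """
    gaps = [base_seconds * factor**i for i in range(attempts - 1)]
    return [timedelta(0)] + [timedelta(seconds=s) for s in accumulate(gaps)]


offsets = retry_offsets()
# Count how many attempts land in the first hour, where transient
# failures (deployments, flapping load balancers) usually clear up.
in_first_hour = sum(1 for t in offsets if t <= timedelta(hours=1))
print(f"{len(offsets)} attempts, last at {offsets[-1]}, "
      f"{in_first_hour} within the first hour")
```

Most of the attempts cluster in the first hour, while the last few gaps stretch to hours and then more than a day. That is the property the paragraph above describes, whatever the real intervals are.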
What counts as a failure is also worth noting. Svix retries on non-2xx responses and on connection timeouts. A 400 is retried, because Svix cannot know whether the customer's endpoint rejected the payload intentionally or due to a transient validation error. This is a reasonable default, though it can produce surprising behavior if a customer's endpoint is actively rejecting payloads it will never accept.
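The rule as stated above fits in a few lines. This is a sketch of the described behavior, not Svix's code; `should_retry` and the convention of passing `None` for a timeout are assumptions made for the example.

```python
from typing import Optional


def should_retry(status: Optional[int]) -> bool:
    """Decide whether a delivery attempt counts as failed.

    `status` is the HTTP status code from the endpoint, or None when the
    connection timed out before any response arrived (a convention assumed
    for this sketch).
    """
    if status is None:
        return True  # timeout: no response at all
    return not (200 <= status < 300)  # anything outside 2xx is a failure


for status in (200, 204, 400, 500, None):
    print(status, should_retry(status))
```

Note that 400 falls on the retry side. An endpoint that will never accept a given payload consumes every remaining attempt before the failure is surfaced.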
What happens at final failure is the other critical detail. When all 11 attempts are exhausted without a successful delivery, Svix does not silently discard the event. The payload is retained for 90 days. The failure is surfaced in the customer portal, where the end customer can see the event, see the delivery history, and trigger a manual replay. The customer fixes their endpoint, logs into the portal, finds the failed event from three days ago, and resends it without ever contacting your support team. This is the operational detail that separates serious webhook infrastructure from a retry loop bolted onto a message queue.
Hookdeck: comparable engineering, opposite direction
Hookdeck takes the reliability problem as seriously as Svix does: comparable engineering quality, opposite job description. For teams evaluating which tool belongs in their inbound pipeline, see the webhook vendor evaluation checklist.
Svix is outbound webhook infrastructure. You are the provider sending events to your customers' endpoints. Svix makes sure those events get delivered reliably.
Hookdeck is inbound webhook infrastructure. You are the consumer receiving events from providers like Stripe, GitHub, and Shopify. Hookdeck makes sure those events get delivered reliably to your handlers.
Hookdeck's delivery guarantee is at-least-once with up to 50 automatic retries. The retry logic includes configurable delays and linear backoff. Hookdeck is SOC 2 compliant, which matters for teams handling payment events or regulated data that need documentation of their event processing controls. The product also covers event routing (sending different event types to different handler URLs), transformations, and filtering.
The practical experience for a team using both: Svix handles delivery to your customers, Hookdeck handles delivery from your providers. The tools are not in competition. They are solving the same reliability problem at different points in the pipeline. Teams that run both are not doing something redundant — they have identified that they have two distinct reliability problems and deployed appropriate solutions for each.
Where running both Hookdeck and Svix creates tension is cost and complexity. Hookdeck's team tier starts at $39/month. Svix's managed pricing scales with event volume. If you are a small team with straightforward webhook handling on both sides, deploying both tools is overhead. The market has not yet produced a clean single-product answer for the team that needs both.
HookTunnel: no retry schedule — different model entirely
HookTunnel does not have an automated retry schedule, and it is worth being honest about what that means and why it is a deliberate choice rather than a missing feature. See HookTunnel's webhook inspection features and the flat $19/month Pro plan for what the capture-first model includes.
Svix and Hookdeck are fundamentally delivery systems. Their job is to get an event from point A to point B, and retry is how they guarantee the delivery completes. The retry schedule is central to their value proposition.
HookTunnel is a capture system. Its job is to accept the inbound HTTP request at the boundary and hold it reliably for the length of your history window, regardless of what happens next. When Stripe fires a webhook to your HookTunnel URL, HookTunnel captures the full HTTP payload: method, headers, body, timestamp, everything. It returns 200 to Stripe immediately. From Stripe's perspective, the event was delivered successfully.
What happens downstream — whether your handler processes it correctly, whether your database was up, whether your handler even ran — is a separate concern. HookTunnel has the payload.
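The capture-first idea can be sketched as "store the raw request, then acknowledge". Everything named here (`CapturedRequest`, `capture`, the in-memory list standing in for durable storage, the sample path and header value) is hypothetical and is not HookTunnel's implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CapturedRequest:
    method: str
    path: str
    headers: dict
    body: bytes  # raw bytes, untouched, so the request can be replayed as sent
    received_at: datetime


def capture(store: list, method: str, path: str, headers: dict, body: bytes) -> int:
    """Record the request exactly as received, then return the status to send.

    No handler is invoked here: the 200 acknowledges receipt only and does
    not depend on downstream processing succeeding.
    """
    store.append(
        CapturedRequest(method, path, dict(headers), bytes(body),
                        datetime.now(timezone.utc))
    )
    return 200


store = []  # stand-in for durable storage
status = capture(store, "POST", "/hooks/stripe",
                 {"Content-Type": "application/json",
                  "Stripe-Signature": "t=0,v1=example"},
                 b'{"id": "evt_example"}')
print(status, len(store))
```

Keeping the body as raw bytes rather than parsed JSON is the important choice in this sketch: a later replay can then send exactly what the provider sent, including whatever the signature header was computed over.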
This approach sidesteps the retry problem entirely for the inbound capture layer. But it creates a different question: how do you replay events when your handler fails?
HookTunnel's answer is Pro replay: a feature that lets you replay any captured event to any target URL, using the original HTTP request. Fix the bug in your handler, deploy, then replay the events that failed during the bad window. You choose which events. You choose the target. You can replay to a local development URL to verify the fix before you replay to production.
There is no automated retry schedule deciding when and how many times to attempt delivery. The tradeoff is explicit: you get finer control over replay decisions, but you are making those decisions rather than a scheduler. For teams doing incident recovery — where the answer to "should I replay this event?" requires knowing whether a previous attempt partially committed — that explicit control is often the right tradeoff.
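Replaying "the original HTTP request" to a chosen target can be sketched with the standard library. This illustrates the idea only; it is not HookTunnel's replay API. The `event` dictionary layout, the function names, and the set of dropped headers are assumptions for the example.

```python
import urllib.request

# Headers tied to the original connection. urllib supplies its own values
# for these when the request is sent to the new target.
CONNECTION_HEADERS = {"host", "content-length", "connection"}


def build_replay(event: dict, target_url: str) -> urllib.request.Request:
    """Rebuild a captured request, unchanged except for its destination."""
    headers = {k: v for k, v in event["headers"].items()
               if k.lower() not in CONNECTION_HEADERS}
    return urllib.request.Request(target_url, data=event["body"],
                                  headers=headers, method=event["method"])


def replay(event: dict, target_url: str, timeout: float = 10.0) -> int:
    """Send the rebuilt request and return the response status.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses, so a failed
    replay surfaces as an exception rather than a return value.
    """
    with urllib.request.urlopen(build_replay(event, target_url),
                                timeout=timeout) as resp:
        return resp.status


event = {
    "method": "POST",
    "headers": {"Host": "hooks.example.test",
                "Content-Type": "application/json",
                "Stripe-Signature": "t=0,v1=example"},
    "body": b'{"id": "evt_example"}',
}
# Point the same event at a local URL first, then at production once verified.
req = build_replay(event, "http://localhost:3000/webhooks")
print(req.get_method(), req.full_url)
```

Because the target is an argument rather than a property of the event, the same captured request can be checked against a local handler before it is sent to production. Which events to replay remains a human decision.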
The inbound/outbound mental model is worth putting on paper clearly:
[Stripe / GitHub / Twilio / Shopify]
|
| webhook events firing toward you
|
↓
[HookTunnel captures here] ← inbound capture, Pro replay
|
| forwarded to your handler
|
↓
[Your service handler]
|
| events fired toward your customers
|
↓
[Svix delivers here] ← outbound delivery, retry schedule
|
| retried to customer endpoints
|
↓
[Your customers' endpoints]
Hookdeck sits at the same layer as HookTunnel but adds delivery guarantees and routing between capture and your handler. HookTunnel sits at the boundary and prioritizes capture fidelity and replay control. Different tools for different risk profiles on the inbound side.
Pricing: HookTunnel is free for one hook with 24-hour history. Pro is $19/month for 10 hooks, 30-day history, and Pro replay. Hookdeck's team tier is $39/month. Svix pricing scales with event volume and feature tier.
The mental model that matters
The confusion that creates noise in this comparison is treating inbound and outbound webhook reliability as the same problem. They have the same vocabulary — retries, delivery guarantees, replay, history — but the failure modes, the responsible parties, and the right solutions are different.
Outbound (you sending to customers): You control the sending side. You do not control the receiving side. Your job is to keep trying until the customer's endpoint accepts the event or to surface the failure clearly when it cannot be delivered. Svix and Convoy address this.
Inbound (providers sending to you): You control the receiving side. You do not control the sending side. Stripe will fire its webhook and retry on its own schedule — you cannot ask Stripe to retry differently. Your job is to capture the payload reliably at the moment of receipt so you have it regardless of what your handler does with it. HookTunnel and Hookdeck address this, with different philosophies about automation versus control.
Svix's retry schedule is excellent engineering for the outbound problem. The five-day window, the 11-attempt exponential backoff, the customer portal surfacing failures — all of it reflects real operational experience with webhook delivery at scale.
For the inbound problem — the 3 AM Stripe webhook that failed because your handler panicked on an unexpected payload shape — there is now a different tool worth knowing about.
It does not retry. It does not need to. It already has your event.