Engineering·11 min read·2026-04-03·HookTunnel Team

Retries Are Not Reliability: Why Webhook Retry Mechanisms Leave You Exposed

Stripe retries your webhook 9 times over 24 hours. Your handler has a bug in tier calculation for annual plans. It returns 200 but writes the wrong tier. Every retry hits the same bug. The retry mechanism has no way to know the write was wrong. Retries solve transport. They do not solve correctness.

The assumption is so widespread it barely registers as an assumption: "Our webhook provider retries failed deliveries, so we're covered."

Stripe retries up to 9 times over 24 hours. Twilio retries up to 15 times. GitHub retries up to 3 times. Every major provider has a retry policy. The documentation says so. The feature comparison chart shows a checkmark next to "automatic retries." You move on to the next item on the evaluation checklist.

But retries solve one problem. Just one. And the problems that actually burn you in production are the ones retries can't touch.

What retries actually solve

Retries solve transient transport failures. The event was sent, the network was temporarily unreachable, the connection timed out, or the server returned a temporary error (503, 502, 429). The provider's retry scheduler waits, then sends the same event again. If the receiver comes back up and the code is correct, the retry succeeds.

This is a real problem. Networks are unreliable. Servers restart. Database connections drop and reconnect. A retry mechanism ensures that a 30-second network blip doesn't permanently lose an event. Without retries, you'd lose events every time your server restarted or your load balancer hiccupped.

Retries are a transport-layer mechanism. They ensure the bytes arrive at the destination. They work well for that purpose.
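Mechanically, a provider's retry scheduler is little more than a backoff loop around an HTTP POST. A minimal sketch (the attempt count, delays, and status handling here are illustrative, not any specific provider's policy):

```javascript
// Illustrative retry loop: resend on timeout, 5xx, or 429; give up after maxAttempts.
async function deliverWithRetries(sendOnce, maxAttempts = 5, baseDelayMs = 1000) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const status = await sendOnce();
      // Any 2xx counts as success -- the scheduler never looks deeper than this.
      if (status >= 200 && status < 300) return { delivered: true, attempt };
      // Non-retryable client error (e.g. 400): stop immediately.
      if (status < 500 && status !== 429) return { delivered: false, attempt };
    } catch (err) {
      // Network error or timeout: fall through to the backoff below.
    }
    if (attempt < maxAttempts) {
      const delay = baseDelayMs * 2 ** (attempt - 1); // exponential backoff
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  return { delivered: false, attempt: maxAttempts };
}
```

Notice what the loop inspects: the status code, and nothing else. Everything the rest of this post describes lives outside that check.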

But transport is not the failure mode that causes incidents.

What retries do not solve

Here are five failure modes that retries are structurally incapable of addressing.

1. Handler bugs

Stripe sends customer.subscription.updated. Your handler has a bug in the tier calculation for annual plans — it reads the interval field but applies monthly pricing logic to annual subscriptions. The handler processes the event, writes the wrong tier to the database, and returns 200.

Stripe marks the event as delivered. Even if Stripe retried, the retry would send the same event to the same handler with the same bug. The retry mechanism has no visibility into what your handler did with the event. It doesn't know the write was wrong. It doesn't know the tier calculation is incorrect. It sees 200 and moves on.

// This handler has a bug. Retries will execute the same bug.
async function handleSubscriptionUpdate(event) {
  const subscription = event.data.object;
  const priceId = subscription.items.data[0].price.id;

  // Bug: this lookup table only has monthly price IDs
  // Annual price IDs return undefined, which falls through to 'free'
  const tier = PRICE_TO_TIER[priceId] || 'free';

  await db('subscriptions').update({ tier }).where({
    stripe_customer_id: subscription.customer,
  });

  return { statusCode: 200 };
}

The customer paid for Pro annual. The database says free. Stripe says delivered. The retry mechanism sees 200 and never fires. The customer contacts support 3 days later wondering why they can't access Pro features.

Retries don't solve handler bugs. Retries resend. They don't re-evaluate.

2. Silent failures

Your handler returns 200 but doesn't process the event. This happens more often than most teams realize, and it's covered in detail in silent webhook failure. The patterns include: returning 200 before async processing completes, catching exceptions after the response is sent, queue drops between receive and process.

The relevant point for retries: a silent failure returns 200. The provider sees 200 and does not retry. The retry mechanism is designed to act on non-2xx responses and timeouts. A silent failure produces neither. From the retry mechanism's perspective, the delivery succeeded.

app.post('/webhooks/stripe', async (req, res) => {
  // Return 200 immediately (Stripe best practice)
  res.status(200).send('OK');

  // Process asynchronously
  try {
    await processEvent(req.body);
  } catch (err) {
    // This error happens AFTER the 200.
    // Stripe already moved on. No retry will come.
    logger.warn('Event processing failed', { error: err.message });
  }
});

The retry mechanism's trigger is the HTTP response code. If the response code says success, the mechanism has nothing to act on. Silent failures are invisible to retries by definition.
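One way to keep these failures visible to the retry mechanism is to finish processing before responding, so a thrown error surfaces as a 500 and the provider's retry actually fires. A sketch of that inversion, written as a handler wrapper so the behavior is easy to see (processEvent is an assumed application function that throws on failure):

```javascript
// Sketch: respond only after processing completes, so a failure becomes
// a non-2xx response that the provider's retry mechanism can act on.
function syncWebhookHandler(processEvent) {
  return async (req, res) => {
    try {
      await processEvent(req.body); // do the work first
      res.status(200).send('OK');   // 200 now means "processed", not just "received"
    } catch (err) {
      // Non-2xx puts the event back in the provider's retry queue.
      res.status(500).send('Processing failed');
    }
  };
}
```

The tradeoff: this only works when processing fits inside the provider's timeout window. If it doesn't, you're back to acknowledging before processing, and you need outcome tracking outside the HTTP exchange.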

3. Idempotency failures

Retries can cause problems even when they fire correctly.

Stripe sends an event. Your handler processes it and returns 200. But there's a network issue between your handler's response and Stripe's receipt of that response. Stripe doesn't see the 200, so it retries the event. Your handler receives the event again.

If your handler is idempotent — it checks whether the event was already processed before processing again — the retry is harmless. But many handlers aren't fully idempotent. They check for duplicate event IDs at the HTTP layer but not at the database layer. Or they're idempotent for the primary write but not for side effects.

async function handlePayment(event) {
  // Idempotency check at the event level
  const existing = await db('processed_events')
    .where({ event_id: event.id })
    .first();

  if (existing) {
    return { statusCode: 200, body: 'Already processed' };
  }

  // Primary write
  await db('payments').insert({
    amount: event.data.object.amount,
    customer_id: event.data.object.customer,
  });

  // Side effect: send confirmation email
  await emailService.sendPaymentConfirmation(event.data.object.customer);

  // Record processing
  await db('processed_events').insert({ event_id: event.id });

  return { statusCode: 200 };
}

This handler has a race condition. If two retries arrive within milliseconds of each other, both pass the idempotency check before either records the event ID. Both insert the payment. The customer gets charged twice. The retry mechanism caused the duplicate — it was trying to help.

The idempotency problem is well-known, but the point here is that retries don't solve it. Retries create the conditions that require idempotency. Retries and idempotency are co-dependent: you need both, and the idempotency has to be bulletproof, yet in practice many teams reinvent it with subtle bugs every time.
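The race above closes if the handler claims the event ID atomically before processing, instead of checking and recording in separate steps. A sketch of the claim-first pattern, with an in-memory Set standing in for a database table that has a UNIQUE constraint on event_id (in production the claim must be a single atomic statement, e.g. INSERT ... ON CONFLICT DO NOTHING):

```javascript
// Sketch: claim the event ID *before* processing. Duplicate deliveries
// lose the claim and short-circuit. JavaScript's single-threaded event
// loop makes this Set check atomic here; a real database needs a unique
// constraint to get the same guarantee under concurrency.
const claimedEvents = new Set();

function claimEvent(eventId) {
  if (claimedEvents.has(eventId)) return false; // duplicate delivery
  claimedEvents.add(eventId);
  return true; // first delivery wins the claim
}

async function handlePaymentIdempotent(event, recordPayment) {
  if (!claimEvent(event.id)) {
    return { statusCode: 200, body: 'Already processed' };
  }
  // Primary write and side effects run at most once per event ID.
  // (Caveat: a crash after the claim but before the write leaves a
  // claimed-but-unprocessed event -- another reason outcome tracking
  // has to live outside the handler.)
  await recordPayment(event);
  return { statusCode: 200, body: 'Processed' };
}
```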

4. Stale data

Events don't exist in isolation. They exist in a sequence. A customer creates a subscription, then upgrades it, then cancels it. Three events arrive in that order. Your handler processes them sequentially.

Now add retries. The subscription.created event fails on first delivery (network blip) and is queued for retry. The subscription.updated event arrives and is processed successfully. Then the subscription.created retry arrives — with stale data.

Timeline:
  T+0:  subscription.created  → network timeout, queued for retry
  T+1:  subscription.updated  → processed, tier set to 'pro'
  T+5:  subscription.created  → retry arrives, tier set to 'free' (stale!)

The retry of subscription.created overwrites the subscription.updated result. The customer's tier reverts from Pro to Free. Both events were "successfully processed." Both returned 200. The retry mechanism worked exactly as designed. The data is wrong.

Retries don't understand event ordering. They don't know that a later event has already been processed. They send the bytes and check the response code. Temporal semantics — "this event is stale because a newer event already changed this state" — are outside the retry mechanism's model.
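The handler can defend itself with a freshness guard: track the newest event timestamp applied per subscription and drop anything older. A sketch, with a Map standing in for a last_event_at column (event.created is Stripe's Unix timestamp on the event object; the rest of the shape is illustrative):

```javascript
// Sketch: drop stale retries by comparing the event's timestamp against
// the newest event already applied for the same subscription.
const lastApplied = new Map();

function applyIfFresh(subscriptionId, event, applyFn) {
  const last = lastApplied.get(subscriptionId) ?? -Infinity;
  if (event.created <= last) {
    // A newer event already changed this state; applying now would
    // overwrite current data with stale data.
    return { applied: false, reason: 'stale' };
  }
  lastApplied.set(subscriptionId, event.created);
  applyFn(event);
  return { applied: true };
}
```

With this guard, the timeline above ends differently: the subscription.created retry at T+5 carries an older timestamp than the already-applied subscription.updated, so it's dropped and the Pro tier survives.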

5. Partial system failures

Your handler receives a webhook and needs to update three systems: the subscription database, the feature flag service, and the billing ledger. The subscription database update succeeds. The feature flag service is temporarily down. The billing ledger is fine.

Your handler catches the feature flag error, logs it, and returns 200 because the primary write (subscription update) succeeded. From the handler's perspective, the event was mostly processed. From the retry mechanism's perspective, it was delivered.

But the feature flag was never set. The customer has the correct tier in the database and no access to Pro features. The retry, if it fired, would succeed on the subscription update (it's already there), succeed on the billing ledger (it's already there), and maybe succeed on the feature flag service (if it's back up). But the retry didn't fire because the handler returned 200.

Partial system failures produce partial processing. Retries are all-or-nothing — the event is either retried or it isn't. There's no mechanism to retry only the failed part of a multi-step handler.
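One mitigation is to make the handler itself step-aware: run each downstream update as a named step, remember which steps completed, and return 500 if any step failed so the provider's retry fires; on the retry, completed steps are skipped. A sketch under those assumptions (step names and the in-memory completed set are illustrative; in production the completed set would be persisted per event):

```javascript
// Sketch: per-step outcome tracking for a multi-system handler. Each step
// must be safe to skip once completed; a failed step forces a non-2xx so
// the event comes back for another attempt.
async function runSteps(event, steps, completed /* Set of finished step names */) {
  const results = {};
  for (const [name, fn] of Object.entries(steps)) {
    if (completed.has(name)) {
      results[name] = 'skipped'; // already done on a previous attempt
      continue;
    }
    try {
      await fn(event);
      completed.add(name);
      results[name] = 'ok';
    } catch (err) {
      results[name] = 'failed';
    }
  }
  const allDone = Object.values(results).every((r) => r !== 'failed');
  return { statusCode: allDone ? 200 : 500, results };
}
```

This still leans on the provider's retry as the re-delivery mechanism, which is exactly the limitation: if the failed step stays down past the retry schedule, the partial state persists with no record of which step is missing.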

The retry assumption

All five of these failure modes share a common thread. They expose the assumption built into every retry mechanism:

"If the delivery succeeded (2xx), the system worked correctly."

This assumption is false. A 2xx response proves the handler received the bytes and returned a success code. It does not prove the handler processed the event correctly, completely, or at all. For a detailed analysis of this gap, see why delivered doesn't mean applied.

The retry mechanism operates on HTTP semantics. The failure modes above operate on application semantics. These are different layers with different failure modes. The retry mechanism handles its layer well. It has no ability to address the layer above it.

The hierarchy of webhook reliability

There's a useful hierarchy for thinking about what different capabilities actually provide.

Level 1: Transport. The event was sent and received. Retries ensure this happens despite transient network failures. This is what providers call "at-least-once delivery." It means the bytes arrived. It does not mean they were processed.

Level 2: Inspection. The event was received and you can see the request and response. Request loggers provide this. You can debug after the fact by looking at payloads, headers, and status codes. But inspection is reactive — you're looking after something went wrong.

Level 3: Outcome verification. The event was received and your application confirmed the business outcome was committed. Outcome receipts provide this. You know not just that the handler returned 200, but that the database write committed and the side effects fired. See how webhook platforms cannot stop at HTTP request logging for why this level matters.

Level 4: Anomaly detection. Patterns of failures are surfaced before they become incidents. Missing receipts accumulate. SLA breaches cluster around specific handlers or event types. The system doesn't just record what happened — it flags what shouldn't be happening.

Level 5: Controlled operations. When something goes wrong, recovery is filtered, previewed, approved, and verified. Not "retry everything" but "replay these 40 specific events after fixing the handler, preview the impact, approve the batch, and verify every outcome." For how controlled replay works in practice, see how HookTunnel provides webhook reliability.

Retries operate at Level 1. Most webhook tools operate at Level 2. The failure modes that actually cause incidents — the handler bug, the silent failure, the stale data — require Levels 3 through 5.

Why the distinction matters for architecture decisions

Understanding that retries operate at the transport layer changes how you architect webhook handlers.

If you believe retries are your safety net, you write handlers that assume any failure will be retried. You return 200 aggressively to prevent retry storms. You build minimal idempotency. You don't invest in outcome tracking because the provider "handles reliability."

If you understand that retries are only transport, you design differently:

You track outcomes independently. After the database write commits, you send an outcome receipt to confirm the business logic succeeded. Events without receipts within the SLA window are flagged as Applied Unknown — not "delivered," not "failed," but "we don't have proof it worked." The SLA timer starts on delivery and creates a bounded window for the receipt to arrive.
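In code, the receipt is sent after the business write commits, and it carries what was actually written, not just a status code. A minimal sketch; the receipt payload shape and the sendReceipt transport here are illustrative assumptions, not HookTunnel's actual API:

```javascript
// Sketch: commit the business outcome first, then emit a receipt that
// states what was written. An event with no receipt inside the SLA
// window gets flagged Applied Unknown instead of counted as delivered.
async function handleWithReceipt(event, writeTier, sendReceipt) {
  const tier = await writeTier(event);  // business write commits here
  await sendReceipt({                   // then prove it happened
    eventId: event.id,
    outcome: 'applied',
    detail: { tier },                   // what was actually written
    appliedAt: new Date().toISOString(),
  });
  return { statusCode: 200 };
}
```

The detail field is what makes reconciliation possible: if the receipt says tier free while the provider's event says pro_annual, the mismatch is detectable even though every HTTP exchange succeeded.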

You build handler-level observability. Not just "did the handler respond?" but "did the handler do the right thing?" This means tracking which events produced receipts, which produced errors, which fell into the gap between 200 and confirmed outcome.

You design for replay, not retry. When a handler bug is discovered, you fix the bug, then replay the affected events through the fixed handler. The replay is filtered to only the events that need it. The replay is previewed before execution. The replay is audited. This is a fundamentally different operation from "send it again and hope."

Retry:
  Provider → same event → same handler (maybe same bug)
  Automatic. No filter. No preview. No audit.

Replay:
  Operator → filtered events → fixed handler
  Manual trigger. Filtered. Previewed. Audited. Verified.

The shift from retry to replay is the shift from transport to operations. Retries are something the provider does to you. Replay is something you do deliberately, with visibility and control.

The Stripe tier calculation bug, revisited

Let's revisit the annual pricing bug from the beginning and trace how it plays out under different reliability models.

Retries only (Level 1). The handler returns 200 with the wrong tier. Stripe doesn't retry. The bug persists until a customer complains. Discovery: days to weeks. Recovery: manual database fix for each affected customer. Audit trail: none.

Retries + inspection (Level 2). Same as above, but you can look at the request payload after the customer complains. You can see the event, see that the handler returned 200, and infer what went wrong. Discovery: still days to weeks (you're looking after the complaint). Recovery: still manual.

Retries + outcome verification (Level 3). The handler returns 200 and sends an outcome receipt with the tier it wrote. HookTunnel records Applied Confirmed with tier: free. The customer's Stripe subscription says pro_annual. A reconciliation query finds the mismatch: Stripe says Pro, receipt says Free. Discovery: within the SLA window — minutes, not days. Recovery: still requires a handler fix, but you know exactly which events are affected.

Retries + outcome verification + controlled replay (Level 5). Same as Level 3 for discovery. After fixing the handler, you filter the affected events (Applied Confirmed where receipt tier doesn't match Stripe tier), preview the replay, approve it, and execute. Each replayed event goes through the fixed handler. New outcome receipts confirm the correct tier. Recovery: controlled, verified, audited. The full lifecycle — initial delivery, incorrect outcome, handler fix, filtered replay, corrected outcome — is recorded in the event lineage.

The retry mechanism didn't help at any level. The handler returned 200. Stripe saw success. The bug was invisible to the transport layer. Detection and recovery happened at the outcome and operations layers — the layers that retries don't touch.

What retries are good for

This isn't an argument against retries. Retries are necessary. Without them, every network blip would lose events permanently. Every server restart would create gaps. Transient failures are real and retries handle them well.

The argument is against the belief that retries are sufficient. They're the floor, not the ceiling. They handle the easiest failure mode (transient transport) and are structurally blind to the harder ones (handler bugs, silent failures, stale data, partial processing).

If you're evaluating webhook infrastructure and the vendor's reliability story begins and ends with "we retry up to N times with exponential backoff," ask what happens when the handler returns 200 and the business logic fails. Ask how you detect silent failures. Ask how you recover from handler bugs. Ask whether replay is filtered and previewed or just "send it again."

The answers will tell you whether you're looking at a transport tool or a reliability platform. Retries are transport. Reliability is everything above it. For a structured approach to evaluating webhook tools across all five levels, see replay safely after code changes and enterprise webhook reliability.

Stop guessing. Start proving.

Generate a webhook URL in one click. No signup required.

Get started free →

Frequently Asked Questions

Why aren't webhook retries enough for reliability?
Retries solve transient transport failures — network blips, temporary 503s. They do not detect handler bugs, silent failures (handler returns 200 but doesn't process), idempotency gaps, stale data, or ordering violations. Retries resend the same event to the same code — if the code is wrong, the retry fails the same way.
What problems do retries actually solve?
Retries solve one specific problem: the event was sent but the receiver was temporarily unreachable. Network timeout, temporary server error, connection reset. If the receiver comes back up and the code is correct, the retry succeeds. Retries are a transport-layer mechanism — they ensure delivery, not correctness.
What is the difference between retries and controlled replay?
Retries are automated, stateless, and undifferentiated — the provider resends the event without knowing whether the previous attempt succeeded at the application level. Controlled replay is operator-initiated, filtered, previewed, and audited. You choose which events to replay, preview the impact, approve the batch, and verify outcomes.
How does HookTunnel handle failures that retries can't fix?
HookTunnel uses outcome receipts to detect silent failures (handler returned 200 but the application didn't confirm processing), anomaly detection to surface unusual patterns, and controlled replay to recover specific events after the underlying issue is fixed. Retries handle transport. HookTunnel handles operations.