Engineering·5 min read·2025-03-31·Colm Byrne, Technical Product Manager

Webhook Outage Recovery Playbook: Replay Without Making Things Worse

2 hours and 37 minutes of failed webhook deliveries. Some may have been retried and partially processed. You don't know which ones. Replay everything and risk duplicates. Replay nothing and leave revenue gaps. Or follow this playbook.

The incident is over. The database migration that caused the outage finished at 2:43 AM. Your handler is responding again. You have 2 hours and 37 minutes of failed webhook deliveries — 168 events that either timed out or returned 503. Some of them may have been retried by Stripe and partially processed. Some may have gotten through before the 503s started. You don't know which ones.

Your options are: replay everything, replay nothing, or figure out which ones to replay.

Replaying everything is the intuitive first move. It feels thorough. But duplicates are often worse than gaps. An order created twice, a payment processed twice, an account upgraded twice — these require manual reconciliation, refunds, support tickets, customer trust damage. Depending on your system's idempotency guarantees, replaying blindly may create a cascade of data integrity problems that takes days to clean up.

Replaying nothing leaves revenue gaps. Customers who upgraded during the outage window don't have access. Payments that processed during the window haven't been reconciled. The gaps will surface in customer support tickets, in your own QA, in a Stripe reconciliation discrepancy.

This is the hardest moment of any webhook incident. The technical emergency is over, but the recovery work is just beginning, and doing it wrong makes things materially worse.

This playbook walks through the six steps to recover cleanly.

Why duplicates are worse than gaps

Before the steps, it's worth being explicit about why "replay everything" is the wrong default. See Stripe webhook best practices for Stripe's own guidance on idempotent event handling. For the incident response context that leads to replay decisions, see the incident response best practices guide.

A gap is a missing action. A duplicate is a wrong action that was taken. Missing actions can be taken retroactively. Wrong actions that were taken require correction — and correction often requires coordination with external systems (Stripe, your customer's accounting system, your own database) where you can't simply undo a row.

Stripe reconciliation is a good example. If you replay a customer.subscription.updated event for a customer whose subscription was already updated, you now have a subscription record that was updated twice with the same data. That's benign. But if you replay a payment_intent.succeeded event for a payment that was already acknowledged, your system may attempt to fulfill the order twice, trigger duplicate email confirmations, or double-count the revenue in your analytics.
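The gap-vs-duplicate distinction is ultimately an idempotency question. As a minimal sketch (the `processed` set and `handle_event` function are illustrative, not HookTunnel's or Stripe's API), deduplicating on Stripe's event ID is what makes a replayed delivery benign:

```python
# Sketch: idempotent event handling keyed on Stripe's event ID.
# `processed` stands in for a persistent store (e.g. a unique-indexed
# database column); in-memory here only to keep the example runnable.
processed = set()

def handle_event(event: dict) -> str:
    event_id = event["id"]
    if event_id in processed:
        # Duplicate delivery (retry or replay): acknowledge, do nothing.
        return "skipped"
    # ... apply the side effect here (fulfill order, upgrade account) ...
    processed.add(event_id)
    return "applied"
```

With a guard like this in place, "replay everything" degrades from dangerous to merely wasteful.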

The correct framing is not "replay vs. no replay" — it's "replay with confidence about what's safe to replay." The playbook below gets you there.

Step 1: Identify the window

The first thing you need is a precise boundary for the outage. Not "it was down for about 2.5 hours" — you need the exact timestamps.

In HookTunnel, the event stream shows every delivery attempt with its timestamp and delivery status. Filter the event log for the time range around the outage. You're looking for:

  • The first event that returned a non-2xx or timed out — this is your outage start
  • The last event that returned a non-2xx — this is your outage end
  • The first event after the outage end that returned 2xx — confirmation the handler recovered

Event timeline:
2:03:14 AM  evt_1P...  customer.subscription.updated  → 200 OK  (last successful before outage)
2:06:47 AM  evt_1P...  invoice.payment_succeeded      → 503     (first failure — outage start)
...
4:43:18 AM  evt_1P...  customer.subscription.updated  → 503     (last failure)
4:43:52 AM  evt_1P...  customer.subscription.updated  → 200 OK  (recovery confirmed)

The window is 2:06 AM to 4:43 AM. Every event that arrived in that window is a recovery candidate.

Write down the precise window boundaries. You'll use them in the API calls below.
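The boundary scan can be expressed as a small helper. This is a sketch over `(timestamp, status_code)` pairs sorted by time; the log-entry shape is an assumption, with `None` standing in for a timeout:

```python
def outage_window(events):
    """Given (timestamp, status_code) pairs sorted by time, return
    (first_failure_ts, last_failure_ts), or None if nothing failed.
    A status_code of None stands in for a connection timeout."""
    failures = [ts for ts, code in events if code is None or code >= 300]
    if not failures:
        return None
    return failures[0], failures[-1]
```

The first and last failures bound the recovery-candidate window; everything between them needs triage in Step 2.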

Step 2: Assess what was delivered vs. applied

Not every event in the outage window has the same status. Some may have succeeded before the outage fully took hold. Stripe's retry mechanism may have succeeded for some events on a later attempt. A few may have gotten through to your handler and been processed correctly. The Applied Confirmed events are the only ones you can skip with confidence — every other status requires investigation. For understanding the three states, see why delivered doesn't mean applied and webhook revenue leakage.

In HookTunnel, filter the event stream for the outage window and group by delivery status:

  • Applied Confirmed: receipt received, database write confirmed. These are done. Do not replay.
  • Applied Unknown: delivered (2xx) but no receipt within SLA. The handler returned 200 but you have no evidence of the database write. These are high-priority candidates for investigation and possible replay.
  • Failed / Timeout: the handler returned non-2xx or the connection timed out. The event was not processed. These are definite replay candidates.
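The triage above amounts to bucketing events by status. A sketch, with illustrative field names (`id`, `status`) rather than HookTunnel's actual schema:

```python
def triage(events):
    """Bucket outage-window events into the three states above."""
    skip, investigate, replay = [], [], []
    for e in events:
        if e["status"] == "applied_confirmed":
            skip.append(e["id"])          # proven applied: never replay
        elif e["status"] == "applied_unknown":
            investigate.append(e["id"])   # 2xx but no receipt: verify first
        else:
            replay.append(e["id"])        # failed or timed out: safe to replay
    return skip, investigate, replay
```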

The critical insight is that Applied Confirmed events are proven safe to skip. If you have outcome receipts wired up and your handler sent a receipt after each successful database commit, you can trust the Applied Confirmed status. These events are done; replaying them would deliver an event that has already been applied.

If you don't have outcome receipts wired up, every event in the outage window is Applied Unknown — which means you can't distinguish between events that committed and events that didn't. This is the situation where "replay everything" is tempting but dangerous. The playbook still applies, but Step 2 becomes a manual cross-reference between your database and the event IDs, which is slow and error-prone.

This is also the argument for wiring up outcome receipts before you need them. The 30 minutes of setup work pays for itself the first time you need to run this playbook.

Step 3: Dry-run to preview the batch

Before executing any replay, run a dry-run. A dry-run calculates the batch — which events would be replayed, which would be skipped, what the risk score is — without actually delivering anything.

# Dry-run: preview the replay batch without executing
curl -X POST https://hooks.hooktunnel.com/api/v1/replay/batch \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "hook_id": "your-hook-id",
    "window_start": "2026-02-19T02:06:00Z",
    "window_end": "2026-02-19T04:44:00Z",
    "skip_applied_confirmed": true,
    "dry_run": true
  }'

Response:

{
  "dry_run": true,
  "total_events_in_window": 168,
  "would_skip_applied_confirmed": 42,
  "would_skip_applied_unknown": 0,
  "would_replay": 126,
  "risk_score": 0.23,
  "risk_factors": [
    {
      "factor": "events_without_receipts",
      "count": 12,
      "description": "12 events returned 200 but have no receipt. Replay may duplicate if handler processed them."
    }
  ],
  "estimated_duration_seconds": 385
}

The dry-run tells you:

  • How many events are in the recovery window
  • How many will be skipped because they're Applied Confirmed
  • How many will be replayed
  • The risk score and specific risk factors
  • How long the controlled replay will take

The risk factors are where to focus. If the response says "12 events returned 200 but have no receipt," those are the ones that might duplicate. Before proceeding, you have three options:

  1. Accept the risk (if your handler is idempotent, duplicates don't matter)
  2. Manually verify the 12 events (check your database for the specific customer IDs)
  3. Exclude those 12 from the replay batch (use skip_applied_unknown: true to be conservative)

The dry-run gives you the information to make this decision deliberately rather than reactively.
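If you script the decision, the logic is simple. A sketch that consumes a dry-run response of the shape shown above (the 0.5 threshold and the return strings are arbitrary choices, not part of the API):

```python
def review_dry_run(resp: dict, max_risk: float = 0.5) -> str:
    """Turn a dry-run response into a next action: abort, verify, or proceed."""
    unreceipted = sum(
        f["count"]
        for f in resp.get("risk_factors", [])
        if f["factor"] == "events_without_receipts"
    )
    if resp["risk_score"] > max_risk:
        return "abort: manual review"
    if unreceipted:
        return f"verify {unreceipted} unreceipted events first"
    return "proceed"
```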

Step 4: Execute the guardrailed replay

Once you're satisfied with the dry-run results, execute the actual replay. The guardrails work automatically during execution. The skip_applied_confirmed guardrail is the primary duplicate prevention mechanism — without it, replay is dangerous. For the retry storm context that makes outage recovery so complex, read webhook retry storms. The HookTunnel Pro plan includes guardrailed batch replay as a core feature.

# Guardrailed batch replay — skips Applied Confirmed, stops on receipt arrival
curl -X POST https://hooks.hooktunnel.com/api/v1/replay/batch \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "hook_id": "your-hook-id",
    "window_start": "2026-02-19T02:06:00Z",
    "window_end": "2026-02-19T04:44:00Z",
    "skip_applied_confirmed": true,
    "stop_on_receipt": true,
    "backoff_ms": 2000,
    "max_concurrent": 1
  }'

The guardrail behaviors:

skip_applied_confirmed: true — Any event in the batch that receives an Applied Confirmed receipt (either pre-existing or received mid-batch while another event is being replayed) is skipped. This is the primary duplicate prevention mechanism.

stop_on_receipt: true — If a receipt arrives for an event while the batch is executing — meaning your handler just processed it and sent the confirmation — the replay for that specific event stops. The event is already applied. No need to replay it again. Execution continues for the remaining events in the batch.

backoff_ms: 2000 — Wait 2 seconds between each event delivery. This gives your recovering system room to breathe rather than flooding it with a burst of replayed events. Your system just came through an outage; it doesn't need 126 events delivered as fast as possible.

max_concurrent: 1 — Serial delivery. One event at a time, in order. This is the safe default. If your handler is stateless and idempotent, you can increase this to 2-3 for faster recovery.

The response streams progress in real time:

Starting guardrailed batch replay: 126 events
[1/126] evt_1P... customer.subscription.updated → 200 OK (receipt: applied_confirmed)
[2/126] evt_1P... invoice.payment_succeeded → 200 OK (receipt: applied_confirmed)
[3/126] evt_1P... customer.subscription.updated → SKIPPED (applied_confirmed pre-existing)
...
[67/126] evt_1P... payment_intent.succeeded → 200 OK (receipt: applied_confirmed)
[68/126] evt_1P... customer.subscription.updated → STOPPED (receipt arrived mid-batch)
...
[126/126] Complete. 119 delivered, 7 skipped (applied_confirmed).

The 7 skipped events were either pre-existing Applied Confirmed or became Applied Confirmed mid-batch (meaning a late receipt arrived). Both are safe skips.
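Under the hood, a serial guardrailed replay is a loop that re-checks status immediately before each send. A simplified sketch, with caller-supplied `deliver` and `get_status` functions standing in for the real delivery and receipt-lookup machinery:

```python
import time

def replay_batch(events, deliver, get_status, backoff_ms=2000):
    """Serial guardrailed replay sketch. Re-checking status right before
    each send is what implements skip_applied_confirmed: a receipt that
    arrives mid-batch turns that event into a skip, not a duplicate."""
    delivered, skipped = 0, 0
    for e in events:
        if get_status(e["id"]) == "applied_confirmed":
            skipped += 1  # pre-existing receipt, or one that arrived mid-batch
            continue
        deliver(e)
        delivered += 1
        time.sleep(backoff_ms / 1000)  # give the recovering handler room
    return delivered, skipped
```

Serial delivery plus a per-event status check is slower than a parallel blast, but it is what makes a late-arriving receipt a skip instead of a duplicate.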

Step 5: Verify with the reconciliation dashboard

Executing the replay is not the end. You need to verify that the recovery actually closed all the gaps.

Open the reconciliation dashboard in HookTunnel and filter for the outage window. The dashboard compares two dimensions:

  • Paid (Stripe events received): every payment_intent.succeeded, invoice.payment_succeeded, and customer.subscription.updated event that arrived during the window
  • Applied (receipt confirmed): every event in that same window that has Applied Confirmed status

The gaps are the events where Paid is true but Applied Confirmed is false. Before the replay, you had 126 gaps. After the guardrailed replay, you should have 0 gaps.
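Gap detection itself is a set difference. A sketch over lists of event IDs, order-preserving so the report reads in timeline order:

```python
def reconciliation_gaps(paid_event_ids, applied_event_ids):
    """Gaps = events Stripe says were paid but that never received
    an Applied Confirmed receipt."""
    applied = set(applied_event_ids)
    return [eid for eid in paid_event_ids if eid not in applied]
```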

If gaps remain, they show up in the reconciliation view with the specific event IDs, event types, and timestamps. Each gap has actions: Replay Again, Trace (view full delivery history for this event), or Resolve with Note (mark as investigated and close with an audit note explaining why it's acceptable to leave this gap).

The goal before you close the incident is reconciliation: zero unresolved gaps for the outage window, with every resolved gap having either an Applied Confirmed status or an audit note explaining the resolution.

Step 6: Document and communicate

Once reconciliation is clean, you have everything you need to close the incident properly. The evidence package — timestamps, event counts, audit log — is what turns a vague manager update into a provable statement. See the webhook incident runbook for the full manager message template.

The HookTunnel incident summary for the window includes:

  • Exact outage start and end timestamps
  • Total events affected: 168
  • Events already applied before the replay (pre-existing Applied Confirmed): 42
  • Events replayed: 126
  • Events skipped mid-batch (guardrail triggered): 7
  • Events confirmed applied after replay: 119
  • Remaining gaps: 0

This is your evidence package. It goes into the postmortem, it goes to your manager, and if any customers ask what happened, you can show them the exact timestamp when their event was recovered.

For the manager message:

Outage window: 2:06 AM – 4:43 AM (2h 37min)
Events affected: 168

Recovery:
- 42 events already applied before the replay
- 126 events replayed at 4:51 AM using guardrailed replay
- 7 events skipped (already applied — guardrail triggered)
- All 168 events reconciled by 5:03 AM

No duplicates created. Zero open gaps.

That message, backed by the HookTunnel reconciliation evidence, closes the customer impact question definitively.

What the alternative looks like

Without a structured replay tool, the recovery process is manual. After the handler comes back up, you:

  1. Pull the Stripe event log for the outage window
  2. Cross-reference each event ID against your database to determine if it was processed
  3. Manually replay events through the Stripe dashboard (one at a time, or via Stripe's "Resend" button)
  4. Re-check the database after each replay to confirm it applied
  5. Track your progress in a spreadsheet

For 168 events, this takes 3-4 hours of focused work. It's error-prone — you'll mistype an event ID, skip a row in the spreadsheet, lose track of which events you already verified. And you're doing this at 4 AM after a 3-hour incident.

More importantly, the manual approach has no guardrails. Every event you replay is a potential duplicate. You're relying on your own memory and judgment to avoid replaying events that were already processed.

The structured playbook — identify window, assess status, dry-run, guardrailed replay, verify reconciliation — takes 20-30 minutes and produces evidence. The manual approach takes 3-4 hours and produces uncertainty.

The difference between these two recoveries isn't luck or skill. It's infrastructure. Having HookTunnel's event storage, outcome receipts, and guardrailed replay in place before the incident means the incident recovery is a procedure, not a crisis.

The best time to set this up is before your next outage. The second best time is right now.

Stop guessing. Start proving.

Generate a webhook URL in one click. No signup required.

Get started free →