Engineering·5 min read·2025-04-07·Colm Byrne, Technical Product Manager

From "Something Broke" to "Here's the Evidence": Running a Webhook Incident


It is 11:43 AM on a Tuesday. A Slack message from customer support: "We're getting reports from enterprise customers that their integrations aren't working."

That sentence, in various forms, is how most webhook incidents start — from a customer report, not from monitoring. The incident is already 40 minutes old by the time someone sees the message and alerts engineering. You have no investigation. No timeline. No context beyond a vague report from a customer who probably does not have the technical language to describe what failed.

What follows determines whether this is a 90-minute incident with a clean postmortem or a 6-hour scramble that ends with a half-coherent story pieced together from Slack history and memory.

This is the runbook for the 90-minute version.


Phase 1: Detection

The fact that this incident started with a customer report is itself a signal that your detection is not working. By the time a customer notices their integration is broken and the report reaches engineering, the incident has been running for 30-60 minutes minimum. For enterprise customers with SLAs, that gap is expensive. See incident response best practices for the industry standard on detection systems. Stripe webhook documentation covers the provider-side retry behavior that complicates incident scope assessment.

Detection should happen before the customer report. There are three signals that HookTunnel surfaces automatically:

SLA alerts for Applied Unknown. Every delivered event that does not produce a receipt within your configured SLA window is flagged. If your SLA is 60 seconds and a processing failure started 10 minutes ago, you have 10 minutes of unresolved Applied Unknown events generating alerts — well before the customer emails.

Error rate spike on outcome receipts. A sudden drop in receipt arrival rate — even if delivery continues successfully — is an early signal that your application's processing layer is broken. Deliveries are succeeding. Outcomes are not being confirmed. The gap between Delivered and Applied is widening.

Anomaly score crossing threshold. HookTunnel scores each hook's event stream continuously. When the anomaly score crosses your configured threshold, an alert fires. Anomaly scoring catches patterns that SLA alerts miss: gradual degradation, a specific event type failing while others succeed, a provider sending unusual volume.
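The first of these signals, the SLA check, is just a predicate over delivered events. The sketch below illustrates the idea locally; the Event shape and field names are hypothetical stand-ins, not HookTunnel's actual data model:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass
class Event:
    id: str
    delivered_at: datetime
    receipt_at: Optional[datetime]  # None means the outcome is still Applied Unknown

def overdue_events(events: list[Event], sla_seconds: int, now: datetime) -> list[Event]:
    """Return delivered events whose receipt has not arrived within the SLA window."""
    cutoff = now - timedelta(seconds=sla_seconds)
    return [e for e in events
            if e.receipt_at is None and e.delivered_at <= cutoff]

now = datetime(2025, 4, 8, 11, 43, tzinfo=timezone.utc)
events = [
    Event("evt_1", now - timedelta(minutes=10), None),  # no receipt, well past SLA: flagged
    Event("evt_2", now - timedelta(seconds=30), None),  # still inside the 60s window
    Event("evt_3", now - timedelta(minutes=10), now),   # receipt arrived: not flagged
]
print([e.id for e in overdue_events(events, sla_seconds=60, now=now)])  # -> ['evt_1']
```

Running this check on a schedule is what turns a silent processing failure into an alert minutes after it starts, rather than a customer email an hour later.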

If you had these three signals configured, the scenario in the opening paragraph changes. You would have received an alert 35 minutes before the customer email. The investigation would have started before the report arrived.

For the purposes of this runbook, we will start where most teams actually start: the customer report has arrived and the incident is real.


Phase 2: Triage

First question: what is the scope? The webhook outage recovery playbook covers what to do after triage when you need to replay events safely.

Open HookTunnel. Filter to the relevant time window — if the customer report says "integrations aren't working," start with the last 2 hours. You need to answer four questions before you can make a severity judgment:

How many events are affected? The difference between "3 events failed" and "300 events failed" is the difference between a Notable and an Incident severity. Do not start investigating root cause until you know the scale.

Which hooks are affected? Is this all hooks from one provider? All hooks of one event type? All hooks for one tenant? Or is it randomized across all traffic? Scoped failures point toward configuration or business logic issues. Random failures point toward infrastructure.

Which customers are affected? For enterprise customers with SLAs, you need this information within the first 15 minutes of the investigation. If the affected set includes any enterprise accounts, that changes your communication cadence.

Is it ongoing? Are events still failing right now, or did the failure window end? If it is still happening, your first priority is containment — understanding what is actively breaking before you start root cause. If it stopped, you have more time to investigate carefully.

With these four answers, you can set severity: Noise (cosmetic, no customer impact), Notable (some customers affected, no SLA breach), Incident (enterprise customers affected or SLA breach), Critical (systemic failure, revenue impact, provider-level problem).
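Those four answers collapse into a mechanical severity decision. The function below is an illustrative sketch of the classification described in this section, not a HookTunnel feature:

```python
def classify_severity(customer_impact: bool,
                      enterprise_or_sla_breach: bool,
                      systemic: bool) -> str:
    """Map the triage answers onto the four severity levels."""
    if systemic:                   # provider-level failure, revenue impact
        return "Critical"
    if enterprise_or_sla_breach:   # any enterprise account affected, or an SLA breach
        return "Incident"
    if customer_impact:            # some customers affected, no SLA breach
        return "Notable"
    return "Noise"                 # cosmetic, no customer impact

print(classify_severity(customer_impact=True,
                        enterprise_or_sla_breach=True,
                        systemic=False))  # -> Incident
```

The point of writing it down this way is that severity stops being a debate: once the four triage questions are answered, the classification follows.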


Phase 3: Investigation

Create a HookTunnel investigation. This sounds like an administrative step but it is the most operationally important thing you can do in the first 10 minutes of an incident.

An investigation is a container. It holds the failing events, your notes, the timeline you are building, the hypotheses you are testing, and eventually the evidence that closes the incident. By creating it now, you ensure that every observation is recorded in one place — not scattered across Slack threads, not in someone's head, not lost when the incident ends and everyone moves on.

Attach the failing events to the investigation. HookTunnel's event selection allows bulk-attach by filter: select all events with Applied Unknown status in the last 2 hours, attach to investigation. Done.

Add a note with your initial hypothesis. Even if the hypothesis is wrong, writing it down forces clarity and creates a record of your reasoning that will be useful in the postmortem. "Initial hypothesis: DB connection pool exhaustion during deploy at 11:20 AM. Checking deploy logs."
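The investigation-as-container idea can be modeled as a small data structure in which events and timestamped notes accumulate in one place. The class below is a local illustration of the concept only; it is not HookTunnel's investigation API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Investigation:
    title: str
    event_ids: list[str] = field(default_factory=list)
    notes: list[tuple[datetime, str]] = field(default_factory=list)

    def attach(self, event_ids: list[str]) -> None:
        """Bulk-attach failing events, e.g. everything Applied Unknown in the window."""
        self.event_ids.extend(event_ids)

    def note(self, text: str) -> None:
        """Record an observation or hypothesis with a timestamp for the postmortem."""
        self.notes.append((datetime.now(timezone.utc), text))

inv = Investigation("Enterprise integrations failing")
inv.attach(["evt_1", "evt_2", "evt_3"])
inv.note("Initial hypothesis: DB connection pool exhaustion during deploy at 11:20 AM.")
print(len(inv.event_ids), len(inv.notes))  # -> 3 1
```

Every observation lands in the container the moment it is made, which is exactly what makes the postmortem timeline free.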

The anomaly score on the investigation shows the cluster severity from 0-10. A score of 8 means the pattern of failures is statistically unusual compared to your baseline. A score of 2 means it is within normal variation and may not warrant a full incident response. The cause confidence percentage — built from the event patterns — gives you a starting point for root cause hypotheses.

Share the investigation link in the incident Slack channel. Now everyone working the incident has the same context. No one needs to ask "what have you found so far?" — they can read the investigation notes. When you hand off to another engineer, you send them the link. When the incident ends, the investigation is the record.


Phase 4: Root Cause

Open the event detail modal for a representative failing event. There are four signals here:

What did the handler return? HTTP status, response body, timing. If it returned 500, the failure is in your handler. If it returned 200 but no receipt arrived, the failure is in the processing logic after the handler's return. If the response time is 9.9 seconds (near timeout), the failure is likely a database slow query or a downstream API that is hanging.

What is the reason code? HookTunnel's reason code taxonomy gives you a starting label. RCPT_AUTH_INVALID means the receipt signature does not match — the receipt secret may have been rotated. RCPT_EVENT_NOT_FOUND means the receipt references a request log ID that does not exist — a DB write may have failed before the receipt was sent. RCPT_MISSING_SLA means the SLA window elapsed without a receipt — the processing is slow or hanging.

When did the failure pattern start? The timestamp cluster in the investigation tells you. If failures started at exactly 11:20 AM and your deploy was at 11:20 AM, the deploy is the probable cause. If failures started gradually over 45 minutes, it is more likely a resource exhaustion pattern.

Is it widespread? HookTunnel's cross-customer pattern detection checks whether the same failure pattern is appearing across multiple tenants. If 3 or more distinct users are experiencing the same failure profile within a 24-hour window, HookTunnel marks it as a widespread issue. Widespread failures point toward provider-level problems, infrastructure issues, or a code change that affects all tenants rather than a tenant-specific configuration.
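The widespread check is a distinct-tenant count over a shared failure profile. A minimal sketch, assuming each failing event carries a tenant ID, a reason code, and a timestamp (the field names here are hypothetical):

```python
from datetime import datetime, timedelta, timezone

def is_widespread(failures: list[dict], reason_code: str,
                  window_hours: int = 24, min_tenants: int = 3) -> bool:
    """True if 3+ distinct tenants hit the same failure profile within the window."""
    if not failures:
        return False
    latest = max(f["at"] for f in failures)
    cutoff = latest - timedelta(hours=window_hours)
    tenants = {f["tenant"] for f in failures
               if f["reason"] == reason_code and f["at"] >= cutoff}
    return len(tenants) >= min_tenants

now = datetime(2025, 4, 8, 12, 0, tzinfo=timezone.utc)
failures = [
    {"tenant": "t1", "reason": "RCPT_MISSING_SLA", "at": now},
    {"tenant": "t2", "reason": "RCPT_MISSING_SLA", "at": now - timedelta(hours=2)},
    {"tenant": "t3", "reason": "RCPT_MISSING_SLA", "at": now - timedelta(hours=30)},
]
# t3 falls outside the 24-hour window, so only two tenants count.
print(is_widespread(failures, "RCPT_MISSING_SLA"))  # -> False
```

A third tenant inside the window would flip the result, which is the moment the hypothesis shifts from "this customer's configuration" to "our code or our infrastructure."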

Cross-reference with your application logs for the timestamp window. The event detail modal shows you the exact timestamp of each delivery. Search your application logs for errors in that window. You should be looking for correlation, not causation yet — the goal at this phase is to narrow the hypothesis, not to prove it.
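The reason code taxonomy described above also gives you a mechanical first pass before you open any logs: take the dominant code in the failure cluster and start from its hypothesis. A sketch, where the mapping strings paraphrase this section and are purely illustrative:

```python
from collections import Counter

# First-pass hypotheses for the reason codes described above (illustrative wording).
HYPOTHESES = {
    "RCPT_AUTH_INVALID": "receipt secret may have been rotated",
    "RCPT_EVENT_NOT_FOUND": "DB write may have failed before the receipt was sent",
    "RCPT_MISSING_SLA": "processing is slow or hanging",
}

def starting_hypothesis(reason_codes: list[str]) -> str:
    """Pick the most common reason code in the cluster and return its hypothesis."""
    code, _count = Counter(reason_codes).most_common(1)[0]
    return HYPOTHESES.get(code, "unrecognized reason code; inspect events manually")

cluster = ["RCPT_MISSING_SLA", "RCPT_MISSING_SLA", "RCPT_EVENT_NOT_FOUND"]
print(starting_hypothesis(cluster))  # -> processing is slow or hanging
```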


Phase 5: Fix and Verify

Deploy the fix. This phase is not unique to webhook incidents — it is standard software deployment. What is different with HookTunnel is the verification step. For the recovery phase, see the webhook outage recovery playbook for the full guardrailed replay procedure.

After deployment, watch the outcome receipt rate for new events. The visual shift from Applied Unknown to Applied Confirmed is real-time verification that your fix worked. You do not need to inspect individual events and infer success — you can watch the confirmation rate return to baseline.

If the fix is correct: receipt rate returns to normal within 1-2 SLA windows. New events coming in are transitioning to Applied Confirmed. The failure pattern stops.

If the fix is partial: receipt rate improves but does not return to baseline. Some event types are still failing. The anomaly score comes down but does not reach normal. This is a signal to look more carefully at the subset that is still failing.

If the fix is wrong: receipt rate does not change. Applied Unknown continues to accumulate. The investigation anomaly score stays elevated. Go back to Phase 4.
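The three outcomes above reduce to comparing the post-fix receipt rate against the baseline and the during-incident rate. A sketch under assumed numbers; the 5% tolerance is illustrative, not a HookTunnel default:

```python
def verify_fix(incident_rate: float, post_fix_rate: float,
               baseline_rate: float, tolerance: float = 0.05) -> str:
    """Classify the fix by where the receipt rate landed relative to baseline."""
    if post_fix_rate >= baseline_rate * (1 - tolerance):
        return "correct"   # back to baseline within 1-2 SLA windows
    if post_fix_rate > incident_rate * (1 + tolerance):
        return "partial"   # improved, but a subset is still failing
    return "wrong"         # no change; back to Phase 4

print(verify_fix(incident_rate=0.20, post_fix_rate=0.97, baseline_rate=0.99))  # -> correct
print(verify_fix(incident_rate=0.20, post_fix_rate=0.60, baseline_rate=0.99))  # -> partial
print(verify_fix(incident_rate=0.20, post_fix_rate=0.21, baseline_rate=0.99))  # -> wrong
```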

The verification is not a formality. "It looks like it's working" is not verification. "Receipt rate has returned to baseline over the last 15 minutes and the anomaly score has dropped from 7 to 1" is verification. That distinction matters for the manager message.


Phase 6: Recovery

You have fixed the forward path. New events are processing correctly. But the events that failed during the incident window are still in Applied Unknown state. Customers are still not receiving their outcomes. Recovery means replaying those events safely.

Guardrailed replay starts with a dry run. Select all Applied Unknown events from the incident window. Run dry-run. HookTunnel shows you:

  • How many events would be replayed
  • How many would be skipped because a receipt has already arrived (these arrived after the incident ended and resolved themselves)
  • How many are in scope for replay
  • The risk assessment: any events that might produce duplicate outcomes if replayed

Review the dry run. If the scope looks right and the risk assessment is acceptable, execute.
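Conceptually, the dry run partitions the incident-window events into skip, replay, and risk buckets before anything executes. A local sketch of that partition, with an assumed `idempotent` flag standing in for the duplicate-outcome risk assessment:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class UnknownEvent:
    id: str
    receipt_at: Optional[datetime]  # a receipt may have arrived after the incident ended
    idempotent: bool                # safe to replay without producing a duplicate outcome?

def dry_run(events: list[UnknownEvent]) -> dict:
    """Partition candidate events the way a dry run would, before executing."""
    skipped = [e.id for e in events if e.receipt_at is not None]  # self-resolved
    pending = [e for e in events if e.receipt_at is None]
    risky = [e.id for e in pending if not e.idempotent]           # review before replay
    in_scope = [e.id for e in pending if e.idempotent]
    return {"skipped": skipped, "risky": risky, "in_scope": in_scope}

now = datetime(2025, 4, 8, 13, 5, tzinfo=timezone.utc)
events = [
    UnknownEvent("evt_1", None, True),
    UnknownEvent("evt_2", now, True),    # receipt arrived late: skip
    UnknownEvent("evt_3", None, False),  # would duplicate an outcome: review first
]
report = dry_run(events)
print(report)  # -> {'skipped': ['evt_2'], 'risky': ['evt_3'], 'in_scope': ['evt_1']}
```

The skip bucket is what prevents the duplicate-activation failure mode described in the no-tooling walkthrough later in this post.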

Watch the Applied Confirmed count rise as receipts arrive from the replayed events. The reconciliation dashboard shows the gap closing. Each row that transitions from the "Paid and Delivered" bucket to the "Paid and Applied" bucket represents a customer whose outcome is now confirmed.

When the gap reaches zero, the recovery is complete. Not "probably complete." Not "we believe everyone has been recovered." Complete, with receipts as evidence.

The batch replay audit log records every event that was replayed, every event that was skipped and why, the timestamp of each receipt, and the final state of each event. This log is not for you — it is for finance and for enterprise customers who ask for an impact report.


Phase 7: Communication

The manager message is the final deliverable of the incident. It needs to answer four questions: What happened? Who was affected? What did you do about it? What prevents recurrence? The webhook revenue leakage post covers how to frame the business impact of webhook processing failures for finance and leadership.

The mistake most teams make is writing this from memory. By the time the incident ends, memory is unreliable — the timeline is compressed, the cause is oversimplified, and the impact is approximate. The manager message becomes a story, not a report.

With tooling, the manager message is a summary of evidence that already exists.

What happened: "At 11:20 AM, a deploy introduced a regression in the subscription handler that caused DB writes to fail silently. The handler continued returning 200, but outcome receipts were not arriving. HookTunnel detected the Applied Unknown accumulation and raised an alert at 11:23 AM. Customer reports began arriving at 11:43 AM."

This paragraph comes directly from the investigation timeline. The 11:23 AM timestamp is in the alert log. The 11:43 AM timestamp is in Slack. The cause is from the root cause notes in the investigation.

Who was affected: "43 events across 12 customers were affected. Two are on Enterprise plans with SLAs. No SLA breach occurred — the longest gap was 38 minutes for customer ID CUST-9842. Their outcomes were recovered via guardrailed replay at 1:15 PM."

These numbers come from the reconciliation dashboard. The SLA calculation comes from the event timestamps. The customer IDs come from the investigation's affected-event list.

What you did: "We reverted the deploy at 12:47 PM and confirmed the fix via receipt rate normalization. Batch replay of 43 affected events was executed at 1:10 PM. Dry run ran first with no high-risk events identified. All 43 events reached Applied Confirmed status by 1:22 PM. Reconciliation shows zero gap."

These numbers come from the replay job audit log and the reconciliation dashboard. Every statement is provable.

What prevents recurrence: "We are adding outcome receipt rate as a tier-1 alert signal. We are adding a pre-deploy check that verifies receipt rate baseline before enabling traffic on new deployments. Investigation link: [link]"


The Same Incident, Without Tooling

For contrast: the same incident without HookTunnel.

11:43 AM — customer report arrives. Engineer starts investigating. There is no event history organized in one place. They check Stripe logs (shows delivered), check application logs (shows 200 responses), check the database (looks like the subscription records are there — actually they are in a broken state but it is not immediately obvious).

12:30 PM — second hour of investigation. Another engineer joins. They have to be briefed via Slack. Context is reconstructed from memory. The database state issue is finally identified.

1:15 PM — fix is deployed. The engineer needs to determine which customers were affected. This requires a manual SQL query joining Stripe webhook logs to application records, which requires knowing the schema of both, which requires looking at source code.

2:00 PM — manual DB updates are applied for the 12 affected customers. No dry run was done. Two customers receive duplicate subscription activations because the manual update was run against records that had already self-resolved.

3:30 PM — postmortem document is started. The timeline is assembled from Slack messages. The impact count is approximate. The root cause is described as "a regression in the subscription handler" without specifics because the specific code path was identified in conversation and not written down.

4:00 PM — manager message goes out. "We experienced a webhook processing issue that affected some customers between 11:20 AM and approximately 1:15 PM. We have investigated and believe all affected customers have been recovered."

That message contains one measurable claim ("11:20 AM") and three hedged ones ("approximately," "some," "believe"). It is the best the team can do with the information they have. It does not close an enterprise customer's support ticket. It does not satisfy a finance audit. It is a story assembled from memory, not a report built from evidence.


Building the Habit Before the Incident

The runbook described here is not complicated. It is a sequence of steps that anyone can learn in an afternoon. The difficulty is not the steps — it is doing them under pressure, on an unfamiliar incident, when the instinct is to skip to root cause without completing triage.

HookTunnel's Incident Lab lets you run through the complete sequence before the real thing happens. A simulated failure is introduced. You go through detection, triage, investigation, fix, replay, reconciliation, and manager message with a guided checklist. The first time you do this, it takes an hour. The second time — when it is a real incident at 11:43 AM on a Tuesday — you have done it before. The steps are familiar. The tooling is familiar. The pressure is lower because you know where you are going.

The difference between a 90-minute incident and a 6-hour one is almost never technical knowledge. It is almost always process: knowing what to do next, having the evidence organized before you need it, and being able to communicate with precision when precision matters.

Stop guessing. Start proving.

Generate a webhook URL in one click. No signup required.

Get started free →