Investigations: Stop Rebuilding Context Every Time Webhooks Break
It is 2:06pm on a Thursday. Your support channel pings. Three enterprise customers say their integrations broke around noon. You check Slack. The message from your monitoring tool came in at 12:03pm — twenty-two webhook failures in a six-minute window. You click through to your log viewer. You start grepping. You copy event IDs into a Notion doc. You DM the engineer who was deploying at 11:50am. You paste timestamps into a shared doc and write: "suspect deploy regression."
By 4pm, the immediate problem is resolved. But here is what is not:
- The context for how you diagnosed it lives in three Slack threads and a half-finished Notion page
- The twelve events you identified as the core failure cluster are not linked to anything
- Your anomaly score is in your head: "this looked bad, maybe 8 out of 10"
- Tomorrow morning, a different engineer will read the incident thread and have to reconstruct everything you assembled between 2 and 4pm
The incident resolved. The context evaporated. And the next time it happens — slightly differently, with a different set of events, on a different day — you start from scratch again.
The problem is not the tools. It is that nothing holds the investigation together.
Most teams have good log storage. They have Slack. They have Notion or Confluence. What they do not have is a place where the webhook events, the timeline, the notes, the cause hypothesis, and the resolution are all attached to each other and accessible the next time someone needs them.
When you debug a webhook incident, you are doing investigation work: forming hypotheses, gathering evidence, testing explanations, arriving at a cause, closing the loop. That is a structured process. But nothing in your current toolchain treats it as one. Logs are logs. Notes are notes. Events are events. You are the glue holding them together, and when you are unavailable, the investigation is unavailable.
HookTunnel investigations give that process a first-class home.
What an investigation is
An investigation in HookTunnel is a named, tracked object that groups webhook events with the context needed to understand and resolve them.
Every investigation has:
- A name and description you set
- A severity level: Noise, Notable, Incident, or Critical
- A status: Open or Resolved
- An anomaly score (0-10) computed from the event cluster you attach
- A cause confidence percentage updated as you add notes and evidence
- A notes timeline where you record observations, hypotheses, and findings
- Attached events — the specific webhook events that are part of this incident
- Auto-linked related events found by pattern (CallSid, customerId, payload fingerprint)
- A resolution record when you close it
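The fields above can be pictured as a simple record. This is an illustrative sketch only; the field names and types are assumptions, not HookTunnel's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Note:
    timestamp: str
    text: str

@dataclass
class Investigation:
    # Hypothetical shape mirroring the list above; not HookTunnel's real schema.
    name: str
    description: str
    severity: str                # "Noise" | "Notable" | "Incident" | "Critical"
    status: str = "Open"         # "Open" | "Resolved"
    anomaly_score: float = 0.0   # 0-10, computed from the attached event cluster
    cause_confidence: int = 0    # 0-100%, set manually as evidence accumulates
    notes: list[Note] = field(default_factory=list)
    event_ids: list[str] = field(default_factory=list)
    auto_linked_event_ids: list[str] = field(default_factory=list)
    resolution: Optional[str] = None

inv = Investigation(
    name="Stripe payment failures — noon window",
    description="Suspect deploy regression",
    severity="Incident",
)
inv.event_ids.extend(["evt_001", "evt_002"])
```

Everything except the anomaly score is user-editable; the score is derived from the attached events, which is why the two are worth keeping as separate fields.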
The investigation panel is one click away from any event
You do not have to go to a separate investigation tool. When you are looking at a webhook event in HookTunnel — in the event detail modal — there is an Investigate button. Click it to escalate that event to a new investigation, or attach it to an existing open investigation. The panel surfaces inline: you see the anomaly score for the current cluster, the notes timeline, the attached events, all without leaving your debugging context.
You do not have to decide upfront whether something is worth investigating. You can escalate any single event to an investigation in one click, and add events to it as you discover they are related.
Anomaly scoring: is this actually bad?
Not every webhook failure is an incident. Some failures are noise — transient network issues, provider hiccups, expected retries. Others are genuine anomalies that indicate a broken handler or a provider incident.
HookTunnel's anomaly scorer evaluates the cluster of events attached to an investigation and produces a score from 0 to 10 based on:
- Failure rate in the event cluster vs. your baseline failure rate
- Temporal density (failures concentrated in a narrow time window)
- Provider error codes (some codes indicate transient noise, others indicate systemic issues)
- Cross-event patterns (same error code across multiple events, same source IP, same event type)
- Deviation from expected delivery latency
A score of 2-3 means you are probably looking at noise. A score of 7-9 means something systemic is happening. A score of 10 means the pattern is highly anomalous and should be treated as a critical incident until proven otherwise.
The anomaly score is not static. It updates as you attach more events. If you add twenty events from the same failure window and all of them have the same error code, the score rises. If you add events and they are spread across different providers with different error codes, the score adjusts.
Cause confidence
Alongside the anomaly score, investigations track cause confidence: a percentage from 0-100% representing how confident you are in the identified cause. You update it manually as you add notes. Starting an investigation at 0% (we don't know yet) and moving it to 85% (we're fairly certain it was the 11:50 deploy) creates a legible record of how the diagnosis progressed.
When you close an investigation, the final cause confidence is recorded with the resolution. Over time, you can look back at closed investigations and see how quickly your team identifies causes for different severity levels.
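One way to picture that legible record: each note snapshots the confidence at the moment it was written. The helper below is hypothetical, not a HookTunnel API.

```python
timeline = []

def add_note(text, cause_confidence):
    # Hypothetical helper: each note captures the confidence at that point in time.
    timeline.append({"text": text, "cause_confidence": cause_confidence})

add_note("Failures started 12:01. Cause unknown.", 0)
add_note("Deploy at 11:50 touched webhook config. Suspect regression.", 20)
add_note("Confirmed: staging signing secret pulled into prod.", 95)
```

Reading the confidence values in order is the diagnosis history: 0% to 20% to 95%, with the reasoning attached at each step.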
Auto-linking by payload patterns
Investigations become more useful when related events are automatically surfaced.
When you create an investigation and attach events, HookTunnel analyzes the attached event set and searches for other events that share:
- A `CallSid` field (Twilio voice events that are part of the same call)
- A `customerId`, `user_id`, or `account_id` field matching the affected customers
- A `charge_id` or `payment_intent_id` field from the affected transactions
- Identical error codes or response status patterns from the same provider
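A toy matcher over those fields might look like this. The field list mirrors the article; the matching logic and data shapes are assumptions for illustration.

```python
# Linking fields from the article; the matching approach itself is an assumption.
LINK_FIELDS = ("CallSid", "customerId", "user_id", "account_id",
               "charge_id", "payment_intent_id")

def auto_link(attached_events, all_events):
    """Surface events that share a linking-field value with the attached set."""
    wanted = {
        (f, e["payload"][f])
        for e in attached_events
        for f in LINK_FIELDS
        if f in e["payload"]
    }
    attached_ids = {e["id"] for e in attached_events}
    return [
        e for e in all_events
        if e["id"] not in attached_ids
        and any((f, e["payload"].get(f)) in wanted for f in LINK_FIELDS)
    ]
```

The key design point is that the match is on (field, value) pairs rather than raw values, so a `customerId` of `"42"` never accidentally links against a `charge_id` of `"42"`.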
These related events are surfaced in the "Auto-Linked Events" section of the investigation panel. You can review them and decide whether to officially attach them to the investigation or leave them as context.
This auto-linking is what makes investigations useful for incidents that span multiple event types. A payment failure might generate a payment_intent.payment_failed event and a customer.subscription.updated event. Both end up in the investigation automatically.
Severity tiers and the investigations list
Not every investigation gets equal attention. The severity system gives you a way to triage:
- Noise: expected failures, not worth immediate action
- Notable: worth watching, may escalate if it continues
- Incident: active problem affecting customers, requires resolution
- Critical: significant customer impact, requires immediate action
The investigations list view shows all open and recently resolved investigations, filterable by severity, status, provider, and time range. When something happens and you need to see what is currently active — before opening a Slack thread, before pinging the on-call engineer — you open the investigations list.
If there is already an open Critical investigation for the provider you are looking at, you do not create a new one. You attach your events to the existing investigation and add a note. One investigation, all the context, one link to share.
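The triage order the list view implies — severity first, then age — can be sketched like this. The function and field names are hypothetical, not HookTunnel's API.

```python
# Hypothetical severity ranking matching the tiers described above.
SEVERITY_ORDER = {"Critical": 0, "Incident": 1, "Notable": 2, "Noise": 3}

def triage(investigations, provider=None):
    """Illustrative filter/sort for an investigations list view."""
    open_invs = [i for i in investigations if i["status"] == "Open"]
    if provider is not None:
        open_invs = [i for i in open_invs if i["provider"] == provider]
    return sorted(open_invs, key=lambda i: (SEVERITY_ORDER[i["severity"]], i["created_at"]))
```

Filtering by provider before creating anything new is exactly the "is there already an open Critical investigation for this provider?" check described above.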
Alex's Thursday
Alex is a senior SRE. At 2:06pm, the support channel pings: three enterprise customers, integrations broken, started around noon. Alex opens HookTunnel.
There are 12 failing events in a cluster from 12:01pm to 12:07pm. All Stripe. All payment_intent.payment_failed. Alex clicks the first event, hits Investigate, names it "Stripe payment failures — noon window — suspect deploy regression." Sets severity to Incident. Attaches all 12 events.
Anomaly score: 8.4. The score reflects that 12 events in 6 minutes is a tight cluster, and that the failure rate for these event types in this window is 6x their baseline.
Alex adds a note: "We shipped at 11:50am. Deploy touched the webhook handler config. Suspect this is the cause. Investigating now." Cause confidence: 20%.
Alex confirms with the engineer who deployed: the deploy changed the Stripe signing secret in staging, and a misconfigured environment variable pulled the staging secret into production. Alex adds a note with the finding, bumps cause confidence to 95%.
The fix is deployed at 3:12pm. Alex marks the investigation Resolved, records the resolution: "Environment variable misconfiguration during deploy — staging signing secret pulled into prod. Fixed in deploy abc123." Final cause confidence: 98%.
Two weeks later, a junior engineer is reviewing incident patterns. They open the closed investigations list. They see Alex's Thursday investigation. Everything is there: the event cluster, the anomaly score, the note timeline, the cause, the resolution. They do not have to find Alex. They do not have to find the Slack thread. They read the investigation and understand what happened in two minutes.
That is the value that evaporates when investigations live in Slack: not the resolution, but the transferable knowledge.
How investigations complement the rest of HookTunnel
Cross-Customer Pattern Detection can tell you whether the failure pattern in your investigation is limited to your hooks or affecting multiple users of the same provider. If 5 other HookTunnel users are also seeing Stripe payment_intent.payment_failed failures in the same window, HookTunnel flags it as a possible provider incident. That data surfaces in your investigation panel automatically.
Outcome Receipts let you see, for each event in your investigation, whether it was actually applied by your application. An event that was delivered (your handler returned 200) but never confirmed applied (no receipt) is a gap — the webhook arrived, your handler said OK, but the database write may not have committed. Investigations show you that gap inline, so you know whether resolution requires a replay or just a handler fix.
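Finding that gap is a set difference: events the handler acknowledged minus events with a confirmation receipt. A minimal sketch, with hypothetical data shapes (the real receipt data comes from HookTunnel's Outcome Receipts feature):

```python
def receipt_gaps(events, receipts):
    """Events the handler acknowledged (HTTP 200) but never confirmed applied.
    Data shapes are illustrative assumptions."""
    confirmed = {r["event_id"] for r in receipts}
    return [e["id"] for e in events
            if e["response_status"] == 200 and e["id"] not in confirmed]
```

An event returned by this check is the dangerous kind: delivery looked healthy, so no alert fired, but the downstream write may never have committed.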
Reconciliation gives you the revenue impact of an investigation. If the failing events in your investigation represent Stripe payments that were never applied, the reconciliation view shows you the dollar amount of the gap and the specific orders affected. When you brief your CTO on the Thursday incident, you have a number.
Comparison with alternatives
| Capability | HookTunnel Investigations | Datadog | Webhook.site | Manual Slack/Notion |
|---|---|---|---|---|
| Events attached to incident | Yes | No (log-based, not event-linked) | No | Manual |
| Anomaly scoring | Yes | Yes (log anomaly) | No | No |
| Cause confidence tracking | Yes | No | No | No |
| Notes timeline on incident | Yes | Yes (on monitors) | No | Manual |
| Auto-link by payload field | Yes | No | No | No |
| Cross-customer pattern overlay | Yes | No | No | No |
| Receipt state on attached events | Yes | No | No | No |
| Revenue impact from reconciliation | Yes | No | No | No |
The distinction is specificity. Datadog is a general-purpose observability platform built around logs and metrics, and webhook debugging is only one of many workloads it can be bent toward. HookTunnel investigations are built specifically for webhook incidents, which means they understand the payload structure, the delivery state, the receipt status, and the provider context that makes webhook debugging different from application log debugging.
Frequently asked questions
How many events can I attach to one investigation?
Up to 500 events per investigation. For incidents involving more than 500 events, attach a representative sample and describe the broader scope in the investigation notes.
Can I have multiple open investigations at once?
Yes. The investigations list shows all open investigations sorted by severity and creation time. There is no limit on concurrent open investigations.
Do investigations persist after events expire from my retention window?
Yes. The investigation metadata — name, notes, anomaly score, resolution — persists beyond your retention window. The attached event payloads follow normal retention rules and may no longer be accessible, but the event IDs and metadata remain linked to the investigation permanently.
Can I link an investigation to a GitHub issue or a Jira ticket?
You can add any URL as a note in the notes timeline. There is no native integration with issue trackers at this time, but a note with the ticket link creates a bidirectional reference if you also paste the investigation URL in your ticket.
Who can see investigations?
Investigations are visible to all members of your HookTunnel account. They are not visible to other accounts. Row-level security in the database enforces this boundary.
Can I export an investigation for a post-incident report?
Yes. Any investigation can be exported as a structured JSON document including the notes timeline, attached event metadata, anomaly scores, and resolution record. The JSON export can be attached to a post-incident report or imported into your documentation system.
How does cause confidence differ from anomaly score?
Anomaly score is computed by the system from the event data — you do not set it. Cause confidence is set by you as you investigate — it reflects how confident you are that you have identified the cause. They measure different things: anomaly score measures how unusual the event pattern is, cause confidence measures how well you understand why it is unusual.
Stop losing incident context in Slack
Create your first investigation from any failing event. Context stays, even after the incident closes.
Get started free →