From Webhook Observability to Webhook Operations: Why Seeing the Problem Is Not Enough
Your observability dashboard shows a latency spike at 2:47 PM. Error rate went from 0.1% to 2.3%. You can see the problem. But you cannot identify the 23 affected events, safely replay the 19 whose state is unchanged, resolve the 4 whose state drifted, and document the recovery. That is the gap between observability and operations.
You have structured logs. You have a dashboard. You have error rate metrics and latency percentiles. You can see when things go wrong with your webhook integrations.
None of that helps you fix it.
At 2:47 PM on a Wednesday, your observability dashboard lights up. The Stripe webhook error rate jumped from 0.1% to 2.3%. Latency P99 went from 200ms to 1,400ms. The spike started 6 minutes ago. You see the problem clearly — your monitoring is working.
Now what?
Which of the 1,200 events in the last 6 minutes were affected? Were they all failures, or did some succeed? Can you identify the specific 23 events that returned errors? Of those 23, which ones are safe to replay given the current state of your application? Were there events that returned 200 but processed incorrectly — events your error rate metric does not count because the handler returned success?
Your observability dashboard does not answer any of these questions. It was built to show trends, not to enable recovery. This is the gap between observability and operations, and it is the gap where webhook incidents become expensive.
The maturity model
Webhook infrastructure matures in four levels. Most teams plateau at Level 2 because the jump to Level 3 requires capabilities that observability tools do not provide.
Level 0: No visibility
The handler exists. It receives events. There are no logs, no metrics, no way to know what happened after the event was received. Failures are discovered when customers report them.
This is more common than it should be. The handler was written during a two-day integration sprint. It works. Nobody added monitoring because there were more urgent features to ship. Months later, when something breaks, the only investigation tool is reading application logs on the server.
Few teams stay at Level 0 for long. The first incident pushes them to Level 1.
Level 1: Request logging
The handler logs incoming requests — timestamp, event type, HTTP status, maybe the request body. You can see what came in, when it arrived, and what response your handler returned.
This is meaningful progress. You can now answer: "Did we receive the event?" and "What did our handler respond?" These are the first two questions in any webhook investigation.
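A Level 1 layer can be as small as an append-only log with two lookups, one per question. This is an illustrative sketch, not a prescribed design; the class and field names are invented here:

```python
import time

class RequestLog:
    """Level 1: append-only log of incoming webhook deliveries.

    Records delivery facts only: what arrived, when, and what we
    responded. It cannot say whether the event produced the correct
    outcome in the application.
    """

    def __init__(self):
        self._entries = []

    def record(self, event_id, event_type, http_status, body):
        self._entries.append({
            "event_id": event_id,
            "event_type": event_type,
            "http_status": http_status,
            "body": body,
            "received_at": time.time(),
        })

    def received(self, event_id):
        # Answers: "Did we receive the event?"
        return any(e["event_id"] == event_id for e in self._entries)

    def response_for(self, event_id):
        # Answers: "What did our handler respond?"
        for e in self._entries:
            if e["event_id"] == event_id:
                return e["http_status"]
        return None
```

Note what is missing: nothing here records whether the handler's side effect was correct, which is exactly the question Level 1 cannot answer.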
Level 1 cannot answer: "Did the event produce the correct outcome?" and "Which events are in a failed state?" The logs show delivery. They do not show application outcomes. For a deeper analysis of this gap, see how webhook platforms cannot stop at HTTP request logging.
The limitation of Level 1 is that it is retrospective and passive. You can look at the logs after someone reports a problem. You cannot use the logs to detect a problem proactively, to identify which events need attention, or to take action on them.
Level 2: Observability
Level 2 adds aggregation, trending, and alerting on top of the raw logs. You can see error rates over time. You can see latency distributions. You can set alerts for anomalous patterns. You can build dashboards that show the health of your webhook integrations at a glance.
This is where most teams plateau. Level 2 is sufficient for detecting problems. Grafana, Datadog, or a custom dashboard backed by your structured logs gives you visibility into trends and anomalies. When something goes wrong, you know about it quickly.
But Level 2 is a read-only layer. It tells you what happened. It does not help you do anything about it. The gap between "we detected a problem" and "we resolved the problem" is filled with ad-hoc work:
- Querying the database directly to identify affected events
- Writing one-off scripts to extract event IDs that match the failure pattern
- Manually replaying events by calling the handler endpoint with curl
- Checking the provider's dashboard to see if they will retry
- Updating a Slack thread with the status of the recovery
- Hoping you got all the affected events, with no way to verify
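The one-off replay step in that list usually amounts to a loop like the following. This is a hypothetical sketch: `send` stands in for the HTTP POST to the handler endpoint, and real ad-hoc scripts tend to have even less structure than this.

```python
# Hypothetical ad-hoc recovery script. There is no dry run, no
# receipt-aware skip logic, and no audit trail: an event that was
# already fixed out of band gets blindly re-processed.
def naive_replay(failed_events, send):
    """Re-POST stored payloads; `send(payload)` returns an HTTP status."""
    results = {}
    for event in failed_events:
        results[event["id"]] = send(event["payload"])
    return results
```

Everything Level 3 adds (preview, skip logic, approval, audit) is what this loop lacks.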
Level 2 is necessary. It is not sufficient.
Level 3: Operations
Level 3 adds four capabilities that transform webhook monitoring from passive observation into active control:
Inspect. Not just logs — full HTTP capture with outcome status. Every webhook event is stored with the complete request (headers, body, query parameters), the complete response (status, headers, body), timing information, and the outcome: did the event produce the correct side effect? The outcome is not inferred from the HTTP status. It is tracked independently, through receipts or status reconciliation.
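One way to model independent outcome tracking is a captured-event record whose processing status only moves when a receipt arrives. A minimal sketch, with invented field names; real schemas would carry more:

```python
from dataclasses import dataclass

@dataclass
class CapturedEvent:
    """Full-fidelity capture: HTTP facts plus an independent outcome.

    processing_status is updated by receipts, never inferred from the
    HTTP status code: a 200 response with no receipt stays 'pending'.
    """
    event_id: str
    provider: str
    event_type: str
    request_headers: dict
    request_body: str
    response_status: int
    response_body: str
    duration_ms: float
    processing_status: str = "pending"  # pending | succeeded | failed

    def record_receipt(self, ok: bool):
        # The only way the outcome changes: an explicit receipt.
        self.processing_status = "succeeded" if ok else "failed"
```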
Replay. Not just retry — controlled re-delivery with safety. Replay is a workflow, not a button. It includes filtering (which events to replay), dry-run preview (what would happen), receipt-aware skip logic (do not re-process events that already succeeded), batch risk assessment (flag conflicts before execution), operator approval (human reviews before committing), and audit trail (who replayed what, when, why).
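The dry-run step can be sketched as a classifier over failed events. The two predicates are assumptions standing in for receipt lookups and state-drift checks; a real system would also weigh payload conflicts and operator policy:

```python
def dry_run(events, state_changed_after_failure, has_receipt):
    """Classify failed events before any re-delivery happens.

    Nothing is executed here: the output is a plan for the operator
    to review and approve.
    """
    plan = {"replay": [], "skip": [], "review": []}
    for e in events:
        if has_receipt(e):
            # Already succeeded out of band: never re-process.
            plan["skip"].append(e)
        elif state_changed_after_failure(e):
            # State drifted since the failure: a human decides.
            plan["review"].append(e)
        else:
            # No conflicts: safe to re-deliver.
            plan["replay"].append(e)
    return plan
```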
Verify. Not just health checks — proof-backed canaries. Synthetic webhook events traverse the full pipeline on a schedule: ingress, processing, outcome verification. Status is derived from the last successful proof, not from component health checks. If the canary fails, the pipeline is not working, regardless of what health checks say.
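Deriving status from the last successful proof, rather than from component health, can be sketched as follows. The class and threshold are illustrative assumptions:

```python
import time

class CanaryStatus:
    """Pipeline status derived from the last successful end-to-end
    proof, not from component health checks."""

    def __init__(self, max_proof_age_s):
        self.max_proof_age_s = max_proof_age_s
        self.last_proof_at = None

    def record_proof(self, verified_outcome, at=None):
        # Only a verified outcome counts as proof: a 200 from the
        # handler alone does not move the clock.
        if verified_outcome:
            self.last_proof_at = at if at is not None else time.time()

    def healthy(self, now=None):
        now = now if now is not None else time.time()
        return (self.last_proof_at is not None
                and now - self.last_proof_at <= self.max_proof_age_s)
```

The inversion is the point: absence of a recent proof means unhealthy, even if every component health check passes.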
Recover. Not just alerts — structured workflows. When an incident is detected, the recovery path is: identify affected events (inspect), determine which can be safely re-delivered (replay dry-run), execute with approval (replay), verify the pipeline is correct (verify), and document everything (audit trail). Each step is a capability of the operations layer, not an ad-hoc action.
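Composed together, the recovery path reads as a short, fully audited pipeline. In this sketch each callable is a stand-in for the corresponding operations-layer capability; the audit strings are illustrative:

```python
def recover(inspect, dry_run, approve, replay, verify, audit):
    """Structured recovery: every step is a capability of the
    operations layer, and every step leaves an audit entry."""
    affected = inspect()                       # identify affected events
    audit(f"inspect: {len(affected)} affected")
    plan = dry_run(affected)                   # preview, nothing executed
    audit(f"dry-run: {len(plan['replay'])} safe, "
          f"{len(plan['skip'])} receipt-skipped, "
          f"{len(plan['review'])} need review")
    if approve(plan):                          # operator approval gate
        replayed = replay(plan["replay"])      # controlled re-delivery
        audit(f"replay: {replayed} re-delivered")
    audit(f"verify: canary proof {'ok' if verify() else 'FAILED'}")
```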
The Wednesday incident with observability (Level 2)
Return to the Wednesday 2:47 PM spike. Here is how it plays out at Level 2.
2:47 PM. Alert fires. Stripe webhook error rate exceeded threshold.
2:49 PM. Engineer opens the dashboard. Confirms the spike. Error rate is 2.3%. Latency P99 is elevated.
2:52 PM. Engineer queries application logs to find the error. The handler is returning 500 for events with a specific subscription plan ID. The plan ID was recently added by a Stripe configuration change and the handler's switch statement does not recognize it.
2:55 PM. Engineer identifies the bug. Starts a fix.
3:10 PM. Fix is deployed. New events process correctly.
3:12 PM. Engineer needs to identify which events failed during the 23-minute window. There is no built-in way to do this. The dashboard shows the error rate trend. It does not list the specific events.
3:20 PM. Engineer queries the database directly. Finds 23 events that returned 500 during the window. Extracts the event IDs.
3:30 PM. Engineer needs to replay the 23 events. There is no replay mechanism. The engineer writes a script that reads the stored payloads and POSTs them to the handler endpoint.
3:45 PM. The script runs. 19 events process correctly. 4 events fail because the customers' subscription state changed during the 23-minute window (customers who contacted support and got manual fixes).
3:50 PM. Engineer manually investigates the 4 remaining events. Determines that 3 were already fixed by support. 1 needs a manual database update.
4:00 PM. Engineer applies the manual fix. Writes a summary in Slack.
4:05 PM. Engineer realizes there is no way to verify that the pipeline is now fully correct — only that the specific fix for the new plan ID works. There could be other unrecognized plan IDs. The health check passes. The error rate is back to normal. But normal is not the same as correct.
Total recovery time: 78 minutes. Total engineering time: 78 minutes (the engineer was fully occupied the entire time). Documentation: a Slack thread that will be buried by next week.
The Wednesday incident with operations (Level 3)
Same incident. Different capabilities.
2:47 PM. Alert fires. Stripe webhook error rate exceeded threshold.
2:49 PM. Engineer opens the operations dashboard. Sees the spike. Clicks into the affected time window.
2:50 PM. The inspect capability shows all events in the window with their full HTTP capture and outcome status. 23 events returned 500. The error messages are visible inline. The common pattern is immediately apparent: an unrecognized plan ID.
2:52 PM. Engineer identifies the bug from the captured request/response data — no separate log query needed. The failing events are tagged with processing status "failed."
2:55 PM. Fix is deployed.
2:57 PM. Engineer initiates a replay for the 23 failed events. The replay system runs a dry-run preview:
Dry-run results:
23 events with processing_status = 'failed'
19 events: no conflicts — customer state unchanged since failure
3 events: WARNING — subscription modified by support ticket after failure
1 event: WARNING — customer cancelled since failure time
Risk assessment: 19 LOW, 3 MEDIUM, 1 HIGH
2:59 PM. Engineer approves replay for the 19 low-risk events. Reviews the 3 medium-risk events individually — all 3 were fixed by support and have outcome receipts. The replay system skips them automatically. The 1 high-risk event is flagged for manual review.
3:01 PM. Replay executes for 19 events. All 19 succeed. The 1 remaining event is resolved manually.
3:05 PM. The scheduled canary probe fires. It sends a synthetic event with the previously-unrecognized plan ID through the full pipeline. The event is processed correctly. The outcome is verified. The pipeline is confirmed working — not just for events in general, but specifically for the failure case that caused the incident.
3:06 PM. Recovery documented automatically in the audit trail:
Incident: Unrecognized Stripe plan ID in handler switch statement
Duration: 2:47 PM - 2:55 PM (8 minutes of failed processing)
Affected events: 23
Recovery: 19 replayed (job rj_wed_001, approved by engineer@company.com),
3 skipped (receipt-confirmed, support-resolved),
1 manually resolved
Verification: Canary proof succeeded at 3:05 PM with target plan ID
Total recovery time: 19 minutes. Total engineering time: 19 minutes. Documentation: complete audit trail, permanent, searchable. The difference is not speed alone — it is confidence. At the end of the Level 3 recovery, the engineer knows which events were affected, how each was resolved, and that the pipeline is verified for the specific failure case. At the end of the Level 2 recovery, the engineer hopes they got everything.
Why observability plateaus
Teams plateau at Level 2 because the jump to Level 3 requires building capabilities that are fundamentally different from monitoring.
Monitoring is a read path. You instrument your application, collect metrics, aggregate them, and display them. The tools are mature: Prometheus, Grafana, Datadog, New Relic. Adding monitoring to a webhook handler is straightforward.
Operations is a write path. You need to take action on specific events, modify processing state, execute controlled re-delivery, and verify outcomes. This requires:
- Event storage with full fidelity. Not just metrics — the actual HTTP requests and responses, stored and queryable.
- Stateful processing tracking. Not just "success/failure" — the current processing state of every event, including partial states and receipts.
- Replay infrastructure. Not just a script — a system that can filter, preview, execute, and audit controlled re-delivery.
- Canary probe infrastructure. Not just a health check — synthetic events that traverse the full pipeline and verify outcomes.
- Workflow orchestration. Not just alerts — structured recovery paths with approval gates and audit trails.
These are not extensions of observability. They are a different category of infrastructure. You cannot add replay to Grafana. You cannot add canary probes to Datadog. You can integrate with them — use the metrics, send alerts through their channels — but the operational capabilities are a separate layer.
This is why most teams stay at Level 2. Building the operational layer is a significant investment, and the value is only apparent after an incident where Level 2 was not enough. The first time you spend 78 minutes on manual recovery for 23 events, it feels manageable. The tenth time, with 200 events during a provider outage, it does not.
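Of those requirements, stateful processing tracking is the least familiar. One way to sketch it, with invented state names, is an explicit state machine whose transitions encode the replay rules:

```python
# Sketch of stateful processing tracking: explicit states with legal
# transitions, including partial outcomes. State names are illustrative.
TRANSITIONS = {
    "received":   {"processing"},
    "processing": {"succeeded", "failed", "partial"},
    "partial":    {"processing", "succeeded", "failed"},  # re-delivery allowed
    "failed":     {"processing"},                         # replay allowed
    "succeeded":  set(),  # terminal: receipt-confirmed, never replayed
}

def advance(state, new_state):
    """Move an event to a new processing state, or refuse."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```

The empty set on "succeeded" is the receipt-aware skip logic expressed as data: a receipt-confirmed event cannot re-enter processing.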
The agent delivery amplifier
The observability-to-operations gap gets wider when webhook payloads drive automated systems rather than simple database writes.
If your webhook handler triggers an agent workflow — an AI agent that processes the payload, makes decisions, and takes actions — the failure modes multiply. The agent may succeed, partially succeed, or fail in ways that are not visible from the HTTP response. A 200 OK from the handler means the agent was dispatched. It does not mean the agent completed its work. It does not mean the agent's actions were correct.
For teams building agent delivery systems, the operational gap is not just "did the webhook process?" but "did the agent do the right thing?" Observability shows you the webhook was delivered. Operations lets you inspect what the agent did, verify the outcome, and replay with safety if the agent needs to re-process.
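The dispatch-versus-completion distinction can be made concrete in a few lines. This is a hedged sketch with invented names; the point is only that the HTTP status and the agent receipt are separate inputs:

```python
def handler_outcome(http_status, agent_receipt):
    """Outcome of an agent-dispatching webhook handler.

    A 200 means only that the agent was dispatched. Completion is a
    separate, receipt-driven fact.
    """
    if http_status != 200:
        return "delivery_failed"
    if agent_receipt is None:
        return "dispatched"  # in flight: not yet an outcome
    return "completed" if agent_receipt["ok"] else "agent_failed"
```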
What HookTunnel provides
HookTunnel is built as a Level 3 operations layer. Not because Level 2 is wrong — you need observability. But observability alone leaves you with manual recovery scripts, ad-hoc replays, health-check-based status, and Slack threads as incident documentation.
Inspect: Full HTTP capture — request headers, body, response status, response body, timing — with independent outcome tracking. Every event has a processing status that is updated by receipts, not inferred from the HTTP status code. You can query events by provider, type, status, time window, and outcome.
Replay: Controlled re-delivery with filtering, dry-run preview, receipt-aware skip logic, batch risk assessment, operator approval, and audit trail. Every replay is a documented operation, not a curl command in a terminal.
Verify: Scheduled canary probes that send synthetic events through the real pipeline and verify outcomes. Status is derived from the last successful proof. When a canary fails, you know the pipeline is broken — not that a health check endpoint is unreachable.
Recover: Structured workflows that combine inspect, replay, and verify into a recovery path with documentation. The audit trail is automatic, permanent, and searchable.
The progression from observability to operations is not about replacing your existing tools. It is about adding capabilities that your existing tools cannot provide. Your APM shows the error rate spike. Your log aggregator shows the individual errors. HookTunnel shows the affected events, lets you replay them safely, verifies the fix with proof-backed canaries, and documents the recovery.
Observability is the foundation. Operations is what you build on it when "I can see the problem" stops being enough and you need "I can fix the problem, verify the fix, and prove it to the auditor."
Stop guessing. Start proving.
Generate a webhook URL in one click. No signup required.
Get started free →