How HookTunnel Provides Webhook Reliability
HookTunnel is not a webhook inbox. It is a reliability control plane built on six concrete mechanisms: delivery inspection, outcome verification, canary probes, replay with safety, anomaly detection, and proof-backed status. Here is how each one works and what it changes for operators.
Most webhook tools answer one question: what arrived? HookTunnel answers a different question: is the system working?
The difference is not a matter of degree. It is a difference in what you can prove. A tool that shows you HTTP request logs proves that traffic reached your endpoint. A reliability control plane proves that your webhook pipeline is processing events correctly right now, and gives you the mechanisms to fix it when it is not.
HookTunnel provides six concrete mechanisms. Each one addresses a specific gap in webhook operations. None of them alone is sufficient. Together, they change what operators can trust about their webhook infrastructure.
Mechanism 1: Delivery inspection
This is the foundation. Every webhook that passes through HookTunnel is captured with full HTTP detail.
What is captured:
- HTTP method, URL, and protocol version
- All request headers, including provider signatures and content types
- The complete request body, byte-for-byte
- The response status code, headers, and body
- Round-trip latency from request received to response completed
- Timestamp with millisecond precision
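Taken together, a captured delivery might look like the record below. The field names and shape are illustrative assumptions for this article, not HookTunnel's actual schema:

```javascript
// Hypothetical shape of one captured delivery record. Field names are
// illustrative, not HookTunnel's documented schema.
const capturedDelivery = {
  method: 'POST',
  url: '/hooks/ht_a1b2c3',
  protocol: 'HTTP/1.1',
  request_headers: {
    'stripe-signature': 't=1712000000,v1=5f3c...',
    'content-type': 'application/json',
  },
  // The raw body, byte-for-byte, not a parsed or re-serialized copy.
  request_body: '{"id":"evt_123","type":"invoice.paid"}',
  response_status: 200,
  response_headers: { 'content-type': 'application/json' },
  response_body: '{"received":true}',
  latency_ms: 42, // request received -> response completed
  received_at: '2024-04-01T12:00:00.123Z', // millisecond precision
};
```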
What this enables:
When a Stripe webhook fails, you do not need to reconstruct the event from Stripe's dashboard. The full payload is in HookTunnel, including the Stripe-Signature header, the raw JSON body, and the exact timestamp. When a Twilio webhook sends a form-encoded body with an unexpected field, you see the raw body, not a sanitized version. When a GitHub webhook arrives with a new X-GitHub-Hook-Installation-Target-ID header that your handler does not recognize, the header is captured even though your application code ignored it.
Inspection is the debugging primitive. Without it, you are reconstructing evidence from fragments — provider dashboards, server logs, database queries. With it, the raw HTTP transaction is preserved exactly as it happened.
The honest limitation: inspection shows you what happened at the transport layer. It does not show you what happened inside your handler after the request was received. That requires the next mechanism.
Mechanism 2: Outcome verification
Outcome verification closes the gap between "the webhook was delivered" and "the webhook was processed." This is the most important capability HookTunnel provides, and the one that differentiates a reliability control plane from a request logger.
How it works:
After your webhook handler processes the event and commits the database write, your application sends a signed receipt to HookTunnel:
```javascript
// Your handler processes the event. `db` is a Knex-style client; `event`
// is the parsed webhook payload and `customerId` was resolved earlier.
let orderId;
await db.transaction(async (trx) => {
  const [row] = await trx('orders')
    .insert({
      stripe_invoice_id: event.data.object.id,
      customer_id: customerId,
      amount: event.data.object.amount_paid,
      status: 'paid',
    })
    .returning('id');
  orderId = row.id;
});

// Transaction committed. Send the receipt.
await fetch(process.env.HOOKTUNNEL_RECEIPT_URL, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.RECEIPT_SECRET}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    event_id: event.id,
    receipt_id: crypto.randomUUID(),
    status: 'processed',
    processed_at: new Date().toISOString(),
    side_effect_refs: { order_id: orderId },
  }),
});
```
The receipt is sent after the commit, not before. This is the guarantee. If the transaction rolls back, no receipt is sent. If the handler throws after the 200 response but before the database write, no receipt is sent. The receipt is evidence that the side effect committed.
The three states:
For every webhook event, HookTunnel tracks three possible outcome states:
| State | Evidence | Meaning |
|---|---|---|
| Applied Confirmed | Signed receipt received within SLA window | The side effect committed. Proof exists. |
| Applied Unknown | SLA window expired without a receipt | The event may or may not have been processed. Investigate. |
| Applied Failed | Receipt with failure status received | The handler explicitly reported a processing failure. |
Applied Unknown is the critical state. It does not mean failure. It means you do not know. That is the correct thing to report when you have no evidence either way. It is the state that triggers investigation, not the state that triggers blind retry.
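The state logic can be sketched as a pure function. The names and the implicit fourth "pending" state (a delivery still inside its SLA window) are assumptions drawn from the table above, not HookTunnel's API:

```javascript
// Sketch of the outcome-state classification described above. Names and
// shapes are illustrative assumptions, not HookTunnel's API.
function outcomeState(event, nowMs, slaMs) {
  if (event.receipt && event.receipt.status === 'processed') return 'applied_confirmed';
  if (event.receipt && event.receipt.status === 'failed') return 'applied_failed';
  // No receipt yet: report unknown only once the SLA window has expired.
  if (nowMs - event.deliveredAtMs > slaMs) return 'applied_unknown';
  return 'pending'; // still inside the SLA window, no verdict yet
}

// A delivery with no receipt, 10 minutes old, against a 5-minute SLA:
const delivered = { deliveredAtMs: 0, receipt: null };
console.log(outcomeState(delivered, 10 * 60 * 1000, 5 * 60 * 1000)); // 'applied_unknown'
```

Note that the absence of a receipt inside the window produces no verdict at all; only the expired window produces `applied_unknown`.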
Why HMAC signing matters:
The receipt is signed with a shared secret (HMAC-SHA256) that HookTunnel verifies before recording. This prevents a third party from spoofing receipts. It also prevents a buggy handler from accidentally sending a receipt for an event it did not process — the receipt must include the correct event ID and a valid signature.
HookTunnel supports secret rotation with a 24-hour grace period. During rotation, both the current and previous secrets are accepted. This means you can rotate credentials without a coordination window between your application and HookTunnel.
What changes for operators:
Without outcome verification, you know what arrived. With it, you know what worked. The shift is from "Stripe says delivered" to "my application confirms applied." When the two disagree — delivered but not confirmed — you have an actionable signal within the SLA window, not a customer complaint days later.
For the detailed technical analysis of why 200 OK does not prove processing, see webhook platforms cannot stop at HTTP request logging.
Mechanism 3: Canary probes
Outcome verification tells you whether individual events were processed. Canary probes tell you whether the pipeline is working right now.
How canary probes work:
HookTunnel sends a synthetic webhook payload to a designated hook endpoint on a configurable schedule — every 5 minutes by default. The canary payload is a real HTTP request with a known body and a marker that identifies it as a canary. HookTunnel then verifies that the payload was captured by checking the request log.
The verification is end-to-end: the canary tests ingress (did the HTTP request reach the endpoint), storage (was the payload persisted to the database), and retrieval (can the payload be read back). If any layer in the pipeline is broken — a misconfigured load balancer, a full database disk, a crashed ingress process — the canary fails.
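The probe logic can be sketched as follows, with the transport injected so the same check runs against any ingress and log API; the function names and return shape are illustrative assumptions:

```javascript
// End-to-end canary sketch. `send` posts the payload through the real
// ingress path; `readBack` fetches the stored record by marker. Both are
// injected; names and shapes are assumptions for illustration.
async function runCanary(send, readBack) {
  const marker = `canary-${Date.now()}-${Math.random().toString(16).slice(2)}`;
  // Known-size payload: large enough to catch middleware that truncates
  // bigger bodies while letting small ones through.
  const payload = JSON.stringify({ canary: true, marker, padding: 'x'.repeat(2048) });

  const accepted = await send(payload);            // ingress layer
  if (!accepted) return { ok: false, failedAt: 'ingress' };

  const stored = await readBack(marker);           // storage + retrieval layers
  if (stored === null) return { ok: false, failedAt: 'retrieval' };
  if (stored !== payload) return { ok: false, failedAt: 'storage' }; // silent truncation
  return { ok: true };
}
```

Comparing the stored body byte-for-byte against what was sent is what distinguishes this from a health check: total loss, partial truncation, and unreadable storage all fail at a named layer.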
What canaries catch that health checks miss:
A health check endpoint returns 200 if the application process is running and can reach the database. It does not test whether a webhook payload sent to the ingress endpoint would be correctly stored and retrievable.
Consider a specific failure: a Vercel deployment introduces a new middleware that accidentally strips the request body from POST requests to certain routes. The health check endpoint is a GET request — it is unaffected. The middleware bug only affects POST requests with bodies larger than 1KB. Small webhook payloads pass through. Large ones (Stripe events with expanded objects, GitHub push events with multiple commits) arrive with empty bodies.
The health check returns 200. Small webhooks work. Large webhooks are silently truncated. The canary probe, which sends a payload of known size, detects the truncation immediately and marks the pipeline as degraded.
The schedule and what it means:
A canary every 5 minutes means the maximum detection time for a pipeline failure is 5 minutes. Compare this to customer-reported failures, which have detection times measured in hours or days. The canary does not prevent the failure. It bounds the detection window.
Mechanism 4: Replay with safety
Inspection shows you what went wrong. Outcome verification tells you which events were affected. Replay lets you fix them.
But replay without safety creates new problems. Blind re-delivery of webhook events risks duplicate processing (replaying an event that was actually applied), out-of-order application (replaying events in the wrong sequence), and cascading failures (replaying into a system that is still broken).
HookTunnel's replay model:
Replay is operator-initiated, not automatic. The operator selects specific events from the request log, reviews the payloads, and chooses a target endpoint. The target can be the original endpoint or a different one — a staging environment, a fixed version of the handler, a local development server via tunnel.
Each replayed event carries lineage metadata: the original delivery ID, the original timestamp, the replay timestamp, the operator who initiated the replay, and an optional audit note. The downstream handler can use this metadata to distinguish replays from original deliveries, which matters for idempotency logic that relies on delivery timestamps.
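As an illustration, lineage might travel as headers like the ones below; the header names are assumptions for this sketch, not HookTunnel's documented format:

```javascript
// Illustrative lineage metadata on a replayed event. Header names are
// assumptions, not HookTunnel's documented wire format.
const replayHeaders = {
  'x-replay-of': 'del_9f2c1a',                       // original delivery ID
  'x-original-timestamp': '2024-04-01T12:00:00.123Z',
  'x-replay-timestamp': '2024-04-03T09:15:00.000Z',
  'x-replay-operator': 'alice@example.com',
  'x-replay-note': 'Re-delivering after handler fix',
};

// A handler can key idempotency on the ORIGINAL delivery, so a replay
// dedupes against the first attempt instead of looking like a new event.
function idempotencyKey(headers, deliveryId) {
  return headers['x-replay-of'] ?? deliveryId;
}
```

With this key, a replayed event and its original delivery resolve to the same idempotency record even though their delivery timestamps differ.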
Filtering for safety:
When replaying a batch of events, HookTunnel filters out events that are already in Applied Confirmed state. If an event was delivered, confirmed as applied via receipt, and then selected for replay, HookTunnel warns the operator that this event has already been confirmed and excludes it from the batch by default.
The operator can override this with an explicit audit note explaining why a confirmed event is being replayed. This is the right design: the default prevents duplicate processing, the override allows it when the operator has a specific reason (for example, replaying to a different target for testing purposes).
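The filtering and override rule can be sketched as a planning function; the shapes and names are assumptions for illustration:

```javascript
// Sketch of confirmed-event filtering for a replay batch. Confirmed events
// are excluded by default; overriding requires an explicit audit note.
// Shapes and names are assumptions, not HookTunnel's API.
function planReplay(events, { overrideConfirmed = false, auditNote = '' } = {}) {
  if (overrideConfirmed && !auditNote.trim()) {
    throw new Error('Overriding confirmed events requires an audit note');
  }
  const replay = [];
  const excluded = [];
  for (const ev of events) {
    if (ev.state === 'applied_confirmed' && !overrideConfirmed) {
      excluded.push({ event: ev, reason: 'already confirmed via receipt' });
    } else {
      replay.push(ev);
    }
  }
  // Nothing is sent here: the plan doubles as a dry-run preview.
  return { replay, excluded };
}
```

Making the override impossible without an audit note encodes the design described above: the safe path is the default, and the unsafe path leaves a paper trail.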
Batch replay with dry-run:
For larger incidents where dozens or hundreds of events need replay, HookTunnel provides batch replay with a dry-run preview. The dry-run shows which events would be replayed, which would be filtered out, and a risk assessment based on the event types and target endpoint. No events are sent during the dry-run. The operator reviews the preview and confirms before any payloads leave the system.
The dry-run ensures the operator understands the scope before committing.
Mechanism 5: Anomaly detection
Individual failures are caught by outcome verification. Systemic degradation is caught by anomaly detection.
What anomaly detection monitors:
- Latency distribution shifts. P50 response time increases from 40ms to 150ms over 24 hours — early signal of a connection pool filling or upstream degradation.
- Error rate changes. A sustained shift from 0% to 0.5% errors, invisible request-by-request but obvious in aggregate.
- SLA violation patterns. A specific hook showing Applied Unknown outcomes at a higher rate than its baseline.
- Response body structural changes. Different JSON structure, headers, or content types appearing from a downstream service.
A single 120ms response in a service that usually responds in 40ms is noise. Ten consecutive 120ms responses over the last hour is a signal. Anomaly detection compares the current window to the historical baseline and flags deviations — detection that scales with volume rather than requiring human attention on every request.
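A minimal version of that baseline comparison, using a median shift with a minimum sample count; the statistic and thresholds are illustrative choices, not HookTunnel's actual detector:

```javascript
// Median of a numeric array (p50).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Flag a latency anomaly when the current window's p50 exceeds the
// baseline p50 by more than `ratio`, AND the window has enough samples
// to rule out a single slow request. Thresholds are illustrative.
function latencyAnomaly(baselineMs, windowMs, { ratio = 2.0, minSamples = 10 } = {}) {
  if (windowMs.length < minSamples) return false; // one slow response is noise
  return median(windowMs) > median(baselineMs) * ratio;
}
```

Against a 40ms baseline, a single 120ms response returns `false` while ten consecutive 120ms responses return `true`, matching the noise-versus-signal distinction above.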
Mechanism 6: Proof-backed status
The last mechanism ties the others together. Platform status in HookTunnel is derived from the last successful canary probe, not from component health checks.
What proof-backed status means:
The platform status page does not say "API: healthy, Database: healthy, Ingress: healthy." It says "Last successful end-to-end canary: 3 minutes ago." That statement is backed by evidence: a specific canary probe, at a specific time, that sent a specific payload through the entire pipeline and verified it was stored and retrievable.
If the canary has not succeeded within the expected window, the status degrades to a state that reflects the uncertainty. Not "unhealthy" — that would imply a specific component is down. The status reflects what is actually known: "The last successful end-to-end test was X minutes ago, which exceeds the expected Y-minute interval."
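Deriving status from canary age rather than component checks can be sketched like this; the function and state names are assumptions for illustration:

```javascript
// Sketch: status derived from the age of the last successful canary,
// not from component health checks. Names are illustrative assumptions.
function proofBackedStatus(lastCanarySuccessMs, nowMs, expectedIntervalMs) {
  const ageMin = Math.round((nowMs - lastCanarySuccessMs) / 60000);
  if (nowMs - lastCanarySuccessMs <= expectedIntervalMs) {
    return {
      status: 'operational',
      detail: `Last successful end-to-end canary: ${ageMin} minutes ago`,
    };
  }
  // Not "unhealthy": no claim about which component failed, only that the
  // pipeline has not been proven to work within the expected window.
  return {
    status: 'degraded',
    detail: `Last successful end-to-end canary was ${ageMin} minutes ago, ` +
      `exceeding the expected ${Math.round(expectedIntervalMs / 60000)}-minute interval`,
  };
}
```

The output is always a statement about evidence and its age, never an inference about a specific component.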
How this differs from health checks:
Health checks answer: "Are the components running?" Proof-backed status answers: "Did the pipeline work recently?"
These are different questions with different implications.
A system where every health check passes but a networking misconfiguration prevents webhook payloads from reaching the database reports "healthy" on health checks and "degraded" on proof-backed status. The proof-backed status is correct. The health checks are misleading.
A system where the database health check briefly fails during a connection pool reset but the pipeline is processing events normally reports "unhealthy" on health checks and "operational" on proof-backed status (because the last canary succeeded within the window). The proof-backed status is correct. The health check is triggering on a transient that did not affect pipeline correctness.
For the philosophical argument — why trust requires evidence from end-to-end tests, not pings to individual components — see trust requires proof not pings.
How the mechanisms compose
Each mechanism handles one part of the operational loop. None is sufficient alone.
Normal operation: Canary probes confirm the pipeline works every 5 minutes. Delivery inspection captures every event. Outcome receipts confirm processing. Anomaly detection monitors for drift. Status reflects the last successful canary.
Incident: The canary fails or receipts stop arriving. The operator is alerted within 5 minutes. Delivery inspection provides raw HTTP evidence. Outcome states show the blast radius — which events are Applied Confirmed, Applied Unknown, or Applied Failed.
Remediation: The handler is fixed. Replay selects affected events, filters out those already confirmed, and re-delivers with lineage tracking. Each replayed event's outcome is tracked via receipts. The canary resumes succeeding. Status returns to operational.
Inspection without verification means you can see failures but cannot prove they were fixed. Verification without replay means you know what failed but cannot re-deliver. Replay without filtering means you can re-deliver but risk duplicates. The mechanisms depend on each other.
What HookTunnel is not
Specificity about what a tool does requires honesty about what it does not do.
HookTunnel does not guarantee delivery. It is not a queue or a message broker. If your downstream handler is unreachable, HookTunnel captures the request for later replay — but it does not automatically retry. Automatic retry with backoff is what tools like Hookdeck and SQS provide.
HookTunnel does not transform payloads. There is no routing rule engine, no filter configuration, no payload mapping. The webhook arrives and is captured as-is.
HookTunnel does not replace your monitoring stack. It provides webhook-specific operational truth and integrates with Datadog, Grafana, or PagerDuty — it does not replace them.
These boundaries are deliberate. The narrowness is the point. The six mechanisms above are the complete surface, not a subset of a larger platform.
The shift
The webhook tooling category has been defined by visibility: see what arrived, browse the payloads, search the history. That is genuinely valuable and was the right first problem to solve.
But visibility is not reliability. Seeing that a webhook arrived is not the same as knowing the system processed it correctly. Browsing request history is not the same as having operational control over what happens when processing fails.
The shift is from "I hope it worked" to "I can prove it worked." Delivery inspection provides the evidence. Outcome verification provides the proof. Canary probes provide continuous confidence. Replay provides remediation. Anomaly detection provides early warning. Proof-backed status provides operational truth.
That is what a reliability control plane does. That is what HookTunnel provides.
For the argument that HTTP request logging alone is insufficient, see webhook platforms cannot stop at HTTP request logging. For why agent systems specifically need a delivery plane beneath them, see agent systems need a reliable delivery plane. For the enterprise reliability case — audit trails, controlled recovery, compliance evidence — see enterprise webhook reliability. For why retries alone do not constitute reliability, see retries are not reliability.
Stop guessing. Start proving.
Generate a webhook URL in one click. No signup required.
Get started free →