Trust Requires Proof, Not Pings: Why Health Checks Lie About Webhook Reliability
Your status page shows green. Your /health endpoint returns 200. But your webhook handler has been silently dropping events for 3 hours because a dependency's TLS certificate expired on a path the health check never touches. This is the difference between component liveness and system truth.
Your status page is green. It has been green for 47 days. The /health endpoint returns 200 in 3 milliseconds. Your uptime monitor shows 99.99%. Everything is fine.
Except your webhook handler stopped processing Stripe events 3 hours ago. A downstream dependency renewed its TLS certificate, and the new certificate uses a root CA that your HTTP client's trust store doesn't include. The TLS handshake fails on every outbound call to the dependency. But the health check endpoint doesn't call that dependency — it returns 200 from an in-memory check. The process is alive. The pipeline is broken.
Fourteen customers upgraded to Pro during those 3 hours. Stripe fired customer.subscription.updated for each one. Your handler received the event, attempted the downstream call, failed the TLS handshake, caught the error, logged a warning, and returned 200 to avoid triggering Stripe's retry logic. The subscription records were never updated. Your dashboard shows green.
This is the gap between component liveness and system truth. It is the gap that burns platform buyers who rely on status pages to make trust decisions. And it is the gap that most monitoring architectures are structurally incapable of closing.
What a health check actually proves
A health check that pings /health and gets 200 proves exactly one thing: the process is running and can respond to HTTP requests on that path.
It does not prove:
- Webhooks are being delivered. The ingress path and the health check path share a process, but they may not share dependencies. A failure in the forwarding layer doesn't affect the health endpoint.
- The database is accepting writes. Many health checks don't test database connectivity at all. The ones that do usually run a SELECT 1, which proves the connection is alive but not that writes succeed, that the schema is correct, or that the table has space.
- Downstream systems are reachable. If your handler calls a payment service, a feature flag service, or an internal API after receiving the webhook, none of those dependencies are tested by the health check.
- The event pipeline is end-to-end functional. The health check tests a single component. The pipeline is a chain of components. A chain fails when any link fails, but a health check only tests one link.
The health check is a component liveness test. It answers: "Is this process alive?" That's a useful signal. It is not the signal that matters for platform trust.
The TLS certificate scenario in detail
Let's walk through the TLS failure scenario step by step, because it illustrates why health checks are structurally blind to real pipeline failures.
Your webhook handler receives events from Stripe and processes them by calling an internal service — say, a subscription management API that updates feature flags.
Stripe → Your Webhook Handler → Subscription Service (internal)
                 ↓
            PostgreSQL
The subscription service renewed its TLS certificate. The new cert chains to a different root CA. Your webhook handler's HTTP client doesn't trust that root CA. The TLS handshake fails.
But your health check endpoint doesn't call the subscription service. It looks like this:
app.get('/health', (req, res) => {
  res.status(200).json({ status: 'ok', uptime: process.uptime() });
});
Or maybe it's slightly more sophisticated:
app.get('/health', async (req, res) => {
  let dbAlive = true;
  try {
    await db.query('SELECT 1'); // proves the connection is alive, nothing more
  } catch {
    dbAlive = false;
  }
  res.status(dbAlive ? 200 : 503).json({
    status: dbAlive ? 'ok' : 'degraded',
    database: dbAlive ? 'connected' : 'disconnected',
  });
});
Neither version calls the subscription service. Neither version tests the TLS path that the webhook handler uses. The health check is green. The pipeline is broken.
This is not a contrived example. TLS certificate rotations, DNS changes, firewall rule updates, and dependency version mismatches all affect specific code paths without affecting the health endpoint. The health check tests the shortest, simplest path through your system. The webhook pipeline traverses the longest, most complex path. Those are different paths.
The taxonomy of liveness vs. truth
There's a useful way to think about what different monitoring signals actually tell you.
Component liveness answers: "Is this process running?" A health check endpoint, a TCP port check, a Kubernetes liveness probe — these all test component liveness. The process is alive. It can accept connections. It responds to pings.
Component readiness answers: "Is this process ready to serve traffic?" A readiness probe that checks database connectivity, queue connectivity, or configuration validity. The process is alive and its immediate dependencies are reachable.
System truth answers: "Is the pipeline working end-to-end right now?" This requires exercising the actual pipeline — not pinging a component, but sending a real event through the real path and verifying the real outcome.
Most monitoring stops at component readiness. Status pages show component health. Uptime monitors check endpoint availability. None of them answer the question that matters: "If Stripe sends a webhook right now, will it result in the correct database write?"
That question can only be answered by doing it and checking.
The canary probe pattern
A canary probe is a synthetic event that traverses the entire pipeline and verifies the outcome. It's not a health check. It's a proof of work.
The pattern works like this:
1. Send a synthetic webhook through the ingress layer — the same path that real webhooks traverse. Not a test endpoint. Not a mock. The actual ingress path.
2. The event traverses the full pipeline — ingress, storage, forwarding to the target, processing by the handler.
3. Verify the outcome — confirm that the event was stored, forwarded, and (if outcome receipts are configured) that the handler reported a successful application.
4. Record the result — timestamp, latency, success/failure, failure reason if applicable.
5. Derive status from the result — the platform's status is the result of the last canary, not the result of a component ping.
The canary proves the pipeline worked at the time the canary ran. It doesn't predict the future — nothing does. But it gives you a bounded-time guarantee: "The pipeline was working N minutes ago."
If the canary runs every 5 minutes and the last canary succeeded, the pipeline was working within the last 5 minutes. If the last canary failed, you know before your customers do. The failure happened on a synthetic event, not on a real customer's payment.
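A minimal sketch of such a runner, assuming the pipeline steps are supplied as async functions that throw on failure. The step names and the shape of the result object are illustrative, not any platform's actual API:

```javascript
// Run a canary: execute each pipeline step in order, stop at the first
// failure, and record which step failed, why, and how long the probe took.
async function runCanary(steps) {
  const startedAt = Date.now();
  for (const [name, step] of Object.entries(steps)) {
    try {
      await step(); // e.g. send synthetic webhook, verify storage, verify forwarding
    } catch (err) {
      return {
        ok: false,
        failedStep: name,
        reason: err.message,
        latencyMs: Date.now() - startedAt,
        at: new Date(startedAt).toISOString(),
      };
    }
  }
  return { ok: true, latencyMs: Date.now() - startedAt, at: new Date(startedAt).toISOString() };
}
```

The caller wires in real implementations, e.g. runCanary({ ingress: sendSyntheticWebhook, storage: verifyStored, forwarding: verifyForwarded, outcome: verifyReceipt }) with hypothetical verification functions. The key design choice is that the result names the failing step, which is what makes the later diagnosis specific.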
Why synthetic probes catch what health checks miss
The TLS certificate scenario above is invisible to health checks because the health check doesn't exercise the failing path. A canary probe exercises the entire path, so it fails when any component in the chain fails.
Here are specific failure modes that canary probes catch and health checks miss:
Database schema drift. A migration added a required column but the handler code hasn't been deployed yet. Health check returns 200 — the process is alive. The handler tries to INSERT without the new column and gets a schema error. Canary probe fails on the outcome verification step.
Queue backpressure. Your handler pushes events to a queue for async processing. The queue is at capacity. The push call silently drops the message or blocks indefinitely. Health check returns 200. Canary probe sends an event, waits for the outcome receipt, never receives it — the SLA window expires and the canary is marked failed.
Rate limit exhaustion. Your handler calls an external API with a rate limit. You've hit the limit. The handler catches the 429 error and silently drops the event. Health check returns 200 — the rate limit applies to the external API call, which the health check path never makes. Canary probe detects that the outcome never arrived.
DNS resolution failure. An internal DNS record changed, and your handler's HTTP client is holding a stale cached entry. The health check path makes no external calls, so it's unaffected; the handler's outbound calls fail with ENOTFOUND. Canary probe fails because the forwarding step times out.
Memory pressure. The process is alive but garbage collection pauses are causing timeouts on database writes. The health check completes in 3ms because it doesn't allocate or write. The handler's database write takes 12 seconds and times out. Canary probe catches the timeout.
Each of these failures has the same signature: the process is alive, the health check is green, and the pipeline is broken. The canary probe catches all of them because it doesn't test the process — it tests the pipeline.
Status derived from proof
The difference between a status page backed by health checks and a status page backed by canary probes is the difference between "healthy" and "working."
A health-check-backed status page says: "All components are responding to pings." You look at the green dots and feel confident. The confidence is based on component liveness — the weakest possible signal.
A proof-backed status page says: "The last end-to-end canary probe succeeded at 14:32 UTC. The probe sent a synthetic webhook, verified storage, confirmed forwarding, and received an outcome receipt in 847ms." The confidence is based on evidence. The pipeline demonstrably worked 8 minutes ago.
When the canary fails, the status page degrades immediately. Not "we noticed a component is down" but "the pipeline stopped producing proofs." The failure reason is specific: "Canary probe failed at forwarding step — target returned 503." Or "Canary probe failed at outcome verification — no receipt within SLA window."
This is the status page that platform buyers actually need. Not "are your servers running?" but "if I send you a webhook right now, will it work?" The canary probe is the closest you can get to answering that question without actually sending a customer event.
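One way to encode this is to derive status from the last canary result plus a freshness window: a stale proof degrades status even though nothing has visibly failed. A sketch with illustrative field names, not any platform's implementation:

```javascript
// Derive a status-page state from the most recent canary result rather
// than from component pings. intervalMs is the canary schedule interval.
function deriveStatus(lastCanary, intervalMs, now = Date.now()) {
  if (!lastCanary) return { state: 'unknown', detail: 'no canary has run yet' };
  const ageMs = now - lastCanary.ranAt;
  if (ageMs > 2 * intervalMs) {
    // A missing proof is itself a degradation: the pipeline cannot be shown to work.
    return { state: 'degraded', detail: `no proof in ${Math.round(ageMs / 60000)} minutes` };
  }
  if (!lastCanary.ok) {
    return {
      state: 'degraded',
      detail: `canary failed at ${lastCanary.failedStep}: ${lastCanary.reason}`,
    };
  }
  return {
    state: 'operational',
    detail: `last proof ${Math.round(ageMs / 60000)} minutes ago (${lastCanary.latencyMs}ms)`,
  };
}
```

Note the asymmetry: a failed proof and a missing proof both degrade status, because in neither case can the page honestly claim the pipeline works.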
The trust hierarchy
There's a hierarchy of trust signals, ordered from weakest to strongest:
1. Process liveness — the process responds to pings. (Health check)
2. Component readiness — the process and its immediate dependencies are reachable. (Readiness probe)
3. Synthetic delivery — a test event was accepted and stored. (Synthetic ping)
4. End-to-end proof — a synthetic event traversed the full pipeline and the outcome was verified. (Canary probe)
5. Continuous proof — canary probes run on a schedule and status is derived from the last N results. (Proof-backed status)
Most platforms operate at level 1 or 2. They show a status page based on health checks and readiness probes. When the status page is green and the pipeline is broken, the gap between level 2 and level 4 is the gap that burns customers.
HookTunnel operates at level 5. Scheduled canary probes run through the full ingress-to-outcome pipeline. The platform status page shows the result of the last proof, not the result of a component ping. When the proof fails, the status degrades immediately — before customers notice, before support tickets arrive, before the postmortem starts. For the technical details, see how HookTunnel provides webhook reliability. For the broader argument that "healthy" and "working" are different claims, see healthy is not working.
What happens when the canary fails
The canary failure is the valuable event. Not the canary success — success just confirms the pipeline is still working. The failure is where the trust model earns its value.
When a canary probe fails, three things happen:
Immediate status degradation. The status page reflects the failure. No waiting for a human to acknowledge. No waiting for multiple failures to confirm. One failed canary is sufficient to degrade status, because one failed canary means the pipeline demonstrably did not work.
Specific failure diagnosis. The canary probe records which step failed: ingress, storage, forwarding, or outcome verification. This is not "the system is degraded" — it's "the forwarding step failed with a connection timeout to target X." The on-call engineer knows where to look before they open a terminal.
Bounded blast radius. Because the canary is synthetic, the failure happened on a test event, not on a customer's payment. The canary caught the failure before real events hit the broken path. If canaries run every 5 minutes, the maximum undetected failure window is 5 minutes — compared to the 3-hour window in the TLS certificate scenario, where the health check never detected the failure at all.
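The scheduling loop that produces this bound can be very small. A sketch, assuming runCanary resolves to a result object with an ok flag and publishStatus updates the status page (both hypothetical hooks supplied by the caller):

```javascript
// Start a canary loop: probe, publish status, sleep, repeat. One failed
// probe degrades status immediately; there is no confirmation threshold,
// because a single failed canary already proves the pipeline did not work.
function startCanaryLoop(runCanary, publishStatus, intervalMs) {
  let timer;
  async function tick() {
    // Treat a probe that throws the same as a probe that reports failure.
    const result = await runCanary().catch((err) => ({ ok: false, reason: err.message }));
    publishStatus(result.ok ? 'operational' : 'degraded', result);
    timer = setTimeout(tick, intervalMs);
  }
  tick();
  return () => clearTimeout(timer); // stop handle
}
```

Because the loop publishes after every probe, a single failure surfaces within one interval, which is what bounds the undetected-failure window to roughly the probe interval plus the probe's own latency.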
The honest status page
There's a philosophical point here about what status pages owe their users.
A status page backed by health checks is technically accurate: all components are responding. But it's misleading. Users interpret "all green" as "everything works." The status page doesn't claim that — it claims component liveness. But the user reads system truth.
A proof-backed status page closes that gap. When it says the pipeline is working, it means the pipeline demonstrably worked within the last N minutes. When it can't prove the pipeline is working, it says so. The status page is honest because it's backed by evidence, not inference.
This matters for enterprise webhook reliability specifically because enterprise buyers make purchasing decisions based on status page history. A status page that shows 99.99% uptime based on health checks is a weaker signal than a status page that shows 99.9% uptime based on canary probes. The second number is lower, but it's real. The first number is higher, but it's measuring the wrong thing.
Trust requires proof. Not pings. Not component liveness. Not "the process is running." Proof that the pipeline worked, end to end, recently enough to matter.
That's the difference between "healthy" and "working."
Stop guessing. Start proving.
Generate a webhook URL in one click. No signup required.
Get started free →