Healthy Is Not Working: Why Green Dashboards Hide Broken Webhook Pipelines
Friday 4pm. Status page: green. Health checks: all passing. Monday 9am: 847 events were silently misprocessed because a config change routed webhook traffic to a canary deployment running an older handler version. The canary returned 200 for everything. No alerts fired.
Friday, 4:47 PM. Your infrastructure team pushes a load balancer configuration change. The change is routine — updating routing weights for a canary deployment. The change is reviewed, approved, and applied. Health checks pass on all targets. The deployment dashboard shows green across every service. The team goes home for the weekend.
Monday, 9:14 AM. A customer reports that their Stripe webhook events from Saturday are not reflected in their account. You check Stripe's dashboard: 847 events sent, all showing delivery status "succeeded." Your monitoring shows no errors, no latency spikes, no anomalies.
You dig deeper. The load balancer config change on Friday shifted 30% of webhook traffic to the canary deployment. The canary deployment is running an older handler version — one from before the refactor that fixed the subscription status mapping. The older handler processes events but maps subscription tiers incorrectly. It returns 200 for every event. It writes to the database. The writes are wrong.
The health check on the canary passes. The health check on the primary passes. The database connection is alive. The process is running. The response times are normal. Every signal your monitoring tracks says the system is healthy.
The system is healthy. The system is not working.
What health checks actually prove
A standard health check endpoint does three things:
- Confirms the process is running and can respond to HTTP requests
- Optionally checks that the database connection pool has available connections
- Optionally checks that dependent services are reachable
This is what Kubernetes uses for liveness and readiness probes. It is what load balancers use for target health. It is what status pages aggregate into the green/yellow/red indicators that operators watch.
Here is what a typical health check looks like:
```javascript
app.get('/health', async (req, res) => {
  try {
    await db.query('SELECT 1');
    res.status(200).json({
      status: 'healthy',
      uptime: process.uptime(),
      timestamp: new Date().toISOString()
    });
  } catch (err) {
    res.status(503).json({ status: 'unhealthy', error: err.message });
  }
});
```
This endpoint will return "healthy" as long as the process is running and can execute a trivial query against PostgreSQL. It will return "healthy" while:
- The webhook handler has a logic bug that silently drops every third event
- A queue between ingress and processing is full and events are being discarded
- The handler is writing to the wrong database table due to an environment variable misconfiguration
- A schema migration changed a column name, so the handler's INSERT fails on the column mismatch, and a try/catch swallows the error
- The handler is returning 200 before async processing completes, and the async processing is failing 100% of the time
None of these failures are visible to the health check. The process is running. The database is reachable. The health check passes. The status page says "All systems operational."
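The last failure mode in the list above (returning 200 before async processing completes) is worth seeing in code. This is a minimal sketch, not any specific codebase: the handler shape and the `processEvent` stage are illustrative, and the framework plumbing is omitted.

```javascript
// Sketch of the "ack before processing" anti-pattern.
// processEvent is a hypothetical stage that fails downstream.

async function processEvent(event) {
  // Imagine a schema mismatch introduced by a migration.
  throw new Error('column "subscription_tier" does not exist');
}

function handleWebhook(event) {
  // The handler acknowledges first: from the provider's point of view,
  // delivery "succeeded" no matter what happens next.
  const response = { status: 200, body: { received: true } };

  // Processing happens after the ack. The rejection is caught and logged,
  // so nothing crashes, no health check fails, and no alert fires.
  processEvent(event).catch((err) => {
    console.error('event failed after ack:', err.message);
  });

  return response;
}

const res = handleWebhook({ id: 'evt_123', type: 'invoice.paid' });
console.log(res.status); // 200, even though processing failed
```

The provider, the load balancer, and the health check all see success; only the error log knows otherwise.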
The healthy/working gap
The gap between "healthy" and "working" is the gap between component liveness and system correctness.
Component liveness answers: "Is this thing running?" System correctness answers: "Does this thing produce the right outcomes?"
For webhook infrastructure, the right outcome is specific and verifiable: a webhook event enters the system, traverses the processing pipeline, and produces a committed side effect in durable storage. Health checks test none of this. They test whether the front door is unlocked, not whether the factory behind it is producing the right product.
This distinction matters because the failure modes that cause the most damage are precisely the ones that health checks cannot detect. A process crash is loud — the health check fails, the load balancer removes the target, the alert fires, someone investigates. A logic bug that produces incorrect outcomes is silent — the health check passes, the load balancer continues routing traffic, no alert fires, and events accumulate in a broken state until someone notices.
The Friday load balancer scenario is real because it exploits this gap. The canary deployment is healthy — it runs, it responds, it connects to the database. It is not working — it processes events with an older handler version that produces incorrect outcomes. The health check cannot distinguish between these two states because it does not test the processing logic.
Four ways "healthy" hides "broken"
The load balancer scenario is one pattern. There are others, and they are common enough that most teams will encounter at least one.
Handler version mismatch. Blue-green deployments, canary releases, and rolling updates all create windows where multiple handler versions process events simultaneously. If the new version changes how events are mapped, processed, or stored, events that hit the old version may be processed with stale logic. Health checks on both versions pass. The events appear to succeed. The outcomes are inconsistent.
Configuration drift. Your handler reads configuration from environment variables, a config file, or a remote config service. A config change in one environment — staging values in production, a feature flag in the wrong state, a connection string pointing to a read replica instead of the primary — causes the handler to behave incorrectly. The handler runs, responds, and returns 200. Health checks pass. The pipeline produces wrong results.
Downstream dependency degradation. Your handler processes an event and writes to a downstream service — a queue, an external API, a cache. The downstream service is degraded: it accepts writes but does not commit them, or it commits them with high latency that causes your handler to time out after returning 200. The handler is healthy. The downstream is technically reachable. The end-to-end pipeline is broken.
Silent exception swallowing. Your handler has a try/catch around the processing logic. The catch block logs the error and returns 200 to prevent the provider from retrying (a common pattern to avoid retry storms). The error is logged but not alerted on. The health check does not inspect error logs. Events fail processing, the handler returns success, and the monitoring shows nothing because the health check tests connectivity, not correctness.
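A minimal sketch of this swallowing pattern, with illustrative names (the `process` callback stands in for whatever your handler's processing stage is):

```javascript
// Silent exception swallowing: returning 200 on failure avoids provider
// retry storms, but without an alert on the error path the failure is
// invisible to monitoring.

function handleEvent(event, process) {
  try {
    process(event);
    return { status: 200 };
  } catch (err) {
    // Logged but not alerted on: the health check never reads this log.
    console.error('processing failed:', err.message);
    return { status: 200 }; // provider sees success; the event is lost
  }
}

const broken = () => { throw new Error('bad subscription mapping'); };
console.log(handleEvent({ id: 'evt_1' }, broken).status); // 200
```

If you use this pattern, the catch block needs its own signal (a metric, a dead-letter record, an alert), because the response code no longer carries one.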
Each of these is a scenario where "healthy" is factually correct and completely misleading. The components are alive. The system does not work.
The time gap problem
Even if your health check could somehow test the full processing pipeline — and a sufficiently sophisticated check could — there is a second problem: time.
Health checks run on intervals. Common intervals are 10 seconds, 30 seconds, or 60 seconds. Between checks, the system could transition from working to broken and back to working, and the health check would never detect it.
But the more insidious time gap is between the health check and the last time you actually proved the full pipeline worked.
Your health check runs every 30 seconds. It confirms the process is running and the database is reachable. The last health check was 12 seconds ago. When was the last time a real webhook event successfully traversed the full pipeline — ingress, processing, outcome commitment?
If your traffic is steady, the answer might be "a few seconds ago." If your traffic is bursty, the answer might be "three hours ago." If it is a weekend or a holiday, the answer might be "eighteen hours ago."
During that gap, your status page says "healthy" based on the most recent health check. But the health check is testing the wrong thing. It is testing component liveness, not pipeline correctness. The last proof of pipeline correctness was however long ago the last real event was successfully processed.
For most teams, this distinction never surfaces because traffic is frequent enough that the gap is small. But the failure that matters — the silent one, the configuration drift, the handler version mismatch — does not announce itself. It starts during the gap and continues until someone notices. The longer the gap between proofs, the more events are affected before detection.
What proof-backed status looks like
The alternative to health-check-based status is proof-backed status. Instead of asking "is the component alive?", you ask "when did I last prove the full pipeline works?"
A canary probe is a synthetic webhook event that traverses the entire pipeline on a schedule:
- Ingress. The canary is sent to the real webhook ingress endpoint — the same endpoint that receives real provider events. It goes through the same load balancer, the same routing, the same TLS termination.
- Processing. The canary is processed by the real handler. It carries a canary marker (a specific header or payload field) so the handler can route it to a test path that does not create real side effects, but it exercises the same parsing, validation, and routing logic as real events.
- Outcome verification. After processing, the canary system checks that the expected outcome occurred. Did the handler write the expected record? Did the expected side effect commit? This is the step that health checks skip entirely.
- Status derivation. The status of the system is not "the last health check passed." The status is "the last canary proof succeeded at [timestamp]." If the last proof was 2 minutes ago and succeeded, the pipeline was working 2 minutes ago. If the last proof was 6 hours ago, you have a 6-hour gap of unknown pipeline correctness.
The critical difference is what the probe tests. A health check tests: "Can the process respond to an HTTP request?" A canary probe tests: "Can a webhook event enter the system and produce a correct, committed outcome?"
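The four steps above can be sketched as a single probe cycle. This is a hedged outline, not a real HookTunnel API: the `send` and `verify` functions are injected, where in practice `send` would POST to the real ingress URL with the canary marker and `verify` would query durable storage for the committed outcome.

```javascript
// One canary probe cycle: ingress, processing, outcome verification,
// status derivation. All names here are illustrative assumptions.

async function runCanary(send, verify) {
  const canaryId = `canary_${Date.now()}`;
  const event = { id: canaryId, type: 'canary.probe', canary: true };

  // 1. Ingress: deliver through the real endpoint (same LB, routing, TLS).
  await send(event);

  // 2. Processing happens inside the real handler, which detects the
  //    canary marker and routes it to a path with no real side effects.

  // 3. Outcome verification: the step health checks skip entirely.
  const outcome = await verify(canaryId);
  if (!outcome || !outcome.committed) {
    return { ok: false, canaryId };
  }

  // 4. Status derivation: the result is a proof timestamp, not a liveness bit.
  return { ok: true, canaryId, provenAt: new Date().toISOString() };
}
```

Run on a schedule, each successful cycle advances the "last proven" timestamp; each failed cycle is an immediate, actionable signal that the pipeline stopped producing correct outcomes.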
The Friday scenario with canary probes
Return to the Friday load balancer change. At 4:47 PM, the routing weights shift. At 4:50 PM, the next scheduled canary fires. It enters through the real ingress endpoint. With the new weights, there is a 30% chance it hits the canary deployment.
If it hits the canary deployment: the handler processes the canary event with the older handler version. The outcome verification step checks for the expected result. The older handler produces an incorrect mapping. The verification fails. The canary is marked as failed. The status transitions from "working" to "degraded." An alert fires at 4:50 PM — three minutes after the change, not 63 hours later on Monday morning.
If it hits the primary deployment: the canary succeeds. The next canary fires 5 minutes later. With a 30% routing weight, the probe is likely to hit the canary deployment within a few cycles; the chance of missing it five times in a row is 0.7^5, about 17%, and it shrinks with every cycle. The failure is detected within minutes.
The difference between "detected at 4:50 PM Friday" and "detected at 9:14 AM Monday" is 847 incorrectly processed events.
What your status page should show
A webhook status page that is backed by proofs rather than health checks shows fundamentally different information.
Instead of: "All systems operational" (based on: health check passed 12 seconds ago)
Show: "Pipeline verified at 2026-04-03T16:50:22Z" (based on: canary probe traversed ingress, processing, and outcome verification successfully 2 minutes ago)
Instead of: Component-level green/yellow/red indicators (API: green, Database: green, Queue: green)
Show: Pipeline-level proof status with the age of the last proof. If the last proof is older than the expected interval, the status is "unknown" — not "healthy," because you do not know.
Instead of: Uptime percentage calculated from health check availability
Show: Proof coverage — what percentage of the last 24 hours is covered by successful canary proofs, and what are the gaps.
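The "unknown, not healthy" rule above is small enough to sketch. The interval and slack factor here are illustrative assumptions, not a prescribed policy:

```javascript
// Proof-backed status derivation: status comes from the age and result of
// the last canary proof, never from a liveness check. Thresholds assumed.

const PROBE_INTERVAL_MS = 5 * 60 * 1000; // canary fires every 5 minutes

function deriveStatus(lastProof, now = Date.now()) {
  // lastProof: { at: epoch ms, ok: boolean } or null if no proof yet
  if (!lastProof) return 'unknown';
  if (!lastProof.ok) return 'degraded';

  const age = now - lastProof.at;
  // A proof older than the expected interval (plus slack) means we do not
  // know the pipeline works right now, so say "unknown", not "healthy".
  if (age > PROBE_INTERVAL_MS * 2) return 'unknown';
  return 'working';
}

console.log(deriveStatus({ at: Date.now() - 60_000, ok: true }));    // "working"
console.log(deriveStatus({ at: Date.now() - 3_600_000, ok: true })); // "unknown"
```

The important design choice is that "unknown" is a first-class status: a stale proof degrades to uncertainty instead of silently coasting on the last green result.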
This is not a cosmetic difference. It changes the operational posture from "we assume things work until something breaks" to "we continuously prove things work and know immediately when they stop."
The operational cost of false confidence
The real cost of health-check-based status is not the monitoring infrastructure. It is the false confidence that lets teams delay investigation.
When the status page says "All systems operational," operators trust it. They do not investigate vague customer reports because the dashboard says everything is fine. They do not dig into minor anomalies because the health checks are green. They go home on Friday knowing the status page will alert them if something breaks.
The status page will alert them if a process crashes or a database becomes unreachable. It will not alert them if the pipeline silently produces incorrect outcomes. And that is the failure mode that causes the most customer impact — not the loud crash that triggers alerts, but the quiet corruption that accumulates until someone external notices.
For teams operating webhook infrastructure at scale, this is not an edge case. It is the primary failure mode. The loud failures are handled — auto-scaling, restarts, failover, retry. The silent failures are the ones that cost money, cost customers, and cost trust.
HookTunnel's canary probes run on a schedule, traverse the full pipeline through the real ingress endpoint, and verify that the expected outcome was committed. Status is derived from the last successful proof. When the proof fails, the status changes and an alert fires — not when a health check detects a crashed process, but when the pipeline stops producing correct outcomes.
The question is not whether your status page is green. The question is: when did you last prove that green means working? If you do not have canary probes, you do not know. And the gap between trust and proof is where silent failures live.
Stop guessing. Start proving.
Generate a webhook URL in one click. No signup required.
Get started free →