Reliability

Your Webhook Provider Is Making Failures Worse

Here is something no webhook tool tells you: when your downstream target starts failing, blind retries do not help it recover. They slow recovery down.

Every retry is a new HTTP connection. A new TLS handshake. A new request the failing service has to parse and reject. If your webhook provider retries at standard intervals — every 30 seconds, every minute, every five minutes — and you are running high-volume traffic, you are sending hundreds of requests per hour to a service that cannot currently handle any of them. A database connection pool that is exhausted does not recover faster because more requests are piling up on the socket backlog. A service restart that needs 90 seconds to warm up its cache does not appreciate 60 retries landing during those 90 seconds.

The circuit breaker pattern is one of the most well-understood reliability primitives in distributed systems engineering. It is in every SRE handbook. It appears in every microservices architecture guide. It is the standard answer to "how do you stop a failure cascade."

No webhook tool ships it. HookTunnel does.


What Is Happening Without a Circuit Breaker

Your Stripe webhook handler goes down at 11:47am. Maybe it is a database connection issue. Maybe a bad deploy. Maybe a dependency your handler calls is unavailable.

Stripe begins retrying. Your other webhook provider begins retrying. A forwarding tool without circuit breaking would also begin retrying. Every tool is doing what it is supposed to do: persistently attempt delivery.

The problem is that "persistently attempt delivery" and "let the failing service recover" are in direct conflict. Recovery requires headroom — CPU that is not answering malformed requests, database connections that are not being held open waiting for timeouts, memory that is not being consumed by in-flight request objects that will never complete.

A circuit breaker solves this by treating failure as information. When enough failures accumulate, the circuit opens. Delivery stops. Events queue safely. The failing target gets silence — exactly what it needs to recover. And when the circuit probes the target again and gets a success, it closes. Delivery resumes. No manual intervention required.

This is not experimental. It is how Netflix, Amazon, and every other company that runs high-volume distributed systems protects their infrastructure. It is now in your webhook dashboard.


The Three States, Explained

Closed

This is the normal operating state. Every inbound webhook event is delivered to the target URL. Delivery outcomes — HTTP status codes, response times, connection errors — are tracked continuously.

A failure counter increments on every non-2xx response and every connection error. When the counter crosses the configured threshold (for example, 10 failures within 60 seconds), the circuit opens.

The exact thresholds are configurable per tier. Higher tiers get tighter thresholds and more granular tuning. The Free tier uses conservative defaults that protect against obvious storms without overreacting to transient blips.
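To make the threshold mechanics concrete, here is a minimal sketch of a sliding-window failure counter. The class name, threshold, and window values are illustrative (they mirror the "10 failures within 60 seconds" example above), not HookTunnel's actual implementation or defaults:

```python
from collections import deque


class FailureWindow:
    """Counts delivery failures inside a sliding time window.

    Illustrative sketch only -- threshold and window mirror the example
    above (10 failures within 60 seconds), not HookTunnel's defaults.
    """

    def __init__(self, threshold: int = 10, window_seconds: float = 60.0):
        self.threshold = threshold
        self.window = window_seconds
        self._failures: deque = deque()  # timestamps of recent failures

    def record_failure(self, now: float) -> bool:
        """Record one non-2xx response or connection error.

        Returns True when the circuit should open.
        """
        self._failures.append(now)
        # Drop failures that have aged out of the window.
        while self._failures and now - self._failures[0] > self.window:
            self._failures.popleft()
        return len(self._failures) >= self.threshold


w = FailureWindow()
# Nine failures in quick succession: still below threshold.
assert not any(w.record_failure(float(t)) for t in range(9))
# The tenth failure inside the window opens the circuit.
assert w.record_failure(9.5)
```

Note that successes do not need to reset the counter here: old failures simply age out of the window, which is what makes the breaker tolerant of occasional blips under otherwise healthy traffic.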

Open

The circuit is open. Delivery to this target has stopped.

Inbound events are not dropped. They are queued. The queue is durable: neither a Vercel deployment restart nor a Railway restart will lose your events. The target URL is marked as unavailable in the dashboard with a timestamp showing when the circuit opened.

Your team gets this information immediately. No polling. No alert fatigue from a monitoring system you configured six months ago and have not tuned since. The circuit breaker dashboard shows you exactly which targets are open, how long they have been open, and how many events are queued waiting for delivery.

While the circuit is open, a probe timer runs. Every configured probe interval — typically 30 seconds to 2 minutes depending on tier — the circuit sends one request to the target. A single request. Not a flood. One carefully chosen probe with a representative payload.

If the probe fails, the circuit stays open. The probe timer resets. No additional events are delivered.

If the probe succeeds, the circuit moves to half-open.
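The open-state probe cycle described above can be sketched in a few lines. This is a simplified illustration with a fake clock for demonstration; the function name and fixed interval are assumptions, and the real cadence is tier-dependent (30 seconds to 2 minutes):

```python
import time
from typing import Callable


def probe_loop(send_probe: Callable[[], bool],
               interval_seconds: float,
               sleep: Callable[[float], None] = time.sleep) -> None:
    """Open-state probe cycle: one request per interval, never a flood.

    `send_probe` returns True on a 2xx response. Illustrative sketch --
    the real probe interval is tier-dependent.
    """
    while not send_probe():       # probe failed: circuit stays open
        sleep(interval_seconds)   # reset the timer, try again later
    # Probe succeeded: the caller transitions the circuit to half-open.


# Simulated run: two failed probes, then success (fake clock, no real I/O).
waits = []
outcomes = iter([False, False, True])
probe_loop(lambda: next(outcomes), 30.0, sleep=waits.append)
assert waits == [30.0, 30.0]   # slept twice, then exited on success
```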

Half-Open

The circuit is probing. The target returned a 2xx to the probe request, which is a signal — not a guarantee — that it is recovering.

In the half-open state, delivery resumes at reduced volume. A percentage of the queued events are delivered, and their outcomes are watched closely. If the success rate holds, the circuit closes fully and normal delivery resumes. If failures reappear, the circuit opens again immediately, and the probe interval increases (exponential backoff on probe frequency, not on delivery — a subtle but important distinction).

The half-open state exists because "it answered one request" and "it is fully recovered" are not the same thing. A service that just restarted might handle the first 5 requests fine and then OOM again on the 6th because its cache has not warmed up. Half-open gives you progressive confidence, not a binary flip.
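Putting the three states together, a minimal state machine might look like the following. All names and thresholds here are illustrative, not HookTunnel's tier defaults, and real implementations also track timing, probe scheduling, and durable queues:

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitBreaker:
    """Minimal three-state sketch. Thresholds are illustrative only."""

    def __init__(self, failure_threshold: int = 10,
                 half_open_successes: int = 5):
        self.state = State.CLOSED
        self.failure_threshold = failure_threshold
        self.half_open_successes = half_open_successes
        self._failures = 0
        self._successes = 0

    def on_delivery(self, ok: bool) -> None:
        """Record one delivery outcome and apply state transitions."""
        if self.state is State.CLOSED:
            self._failures = 0 if ok else self._failures + 1
            if self._failures >= self.failure_threshold:
                self.state = State.OPEN          # too many failures
        elif self.state is State.HALF_OPEN:
            if ok:
                self._successes += 1
                if self._successes >= self.half_open_successes:
                    self.state = State.CLOSED    # confidence reached
                    self._failures = 0
            else:
                self.state = State.OPEN          # re-open immediately
                self._successes = 0

    def on_probe_success(self) -> None:
        """A probe returned 2xx while open: move to half-open."""
        if self.state is State.OPEN:
            self.state = State.HALF_OPEN
            self._successes = 0
```

The key asymmetry is in `on_delivery`: in the half-open state a single failure re-opens the circuit, while closing requires several consecutive successes. That asymmetry is the "progressive confidence, not a binary flip" described above.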


The Dashboard Experience

What the SRE Sees at 11:52am

Five minutes after the handler goes down, the SRE on call opens the Circuits page. The circuit for api.internal.company.com/webhooks/payments is red. State: Open. Opened at 11:47am. 47 events queued.

They do not need to look at logs to understand what happened. The failure timeline is right there: request rate, error rate, the exact moment the threshold was crossed. They can see the probe history: two probes sent since the circuit opened, both returned connection refused.

They page the backend team. The backend team identifies a misconfigured environment variable in the latest deploy. They roll back. 90 seconds later, the probe returns 200. The circuit moves to half-open. The dashboard updates in real time.

The SRE watches the half-open delivery play out. Five events. Ten events. All successful. The circuit closes. The 47 queued events begin flowing. Normal operation resumes.

Total time from open to closed: 8 minutes. Events lost: zero. Duplicates created: zero. Manual intervention beyond identifying the root cause: none.

The Force-Close

Sometimes you do not want to wait for the probe cycle. You deployed the fix. You are confident the target is healthy. You want to close the circuit now and start delivering the queued backlog.

The dashboard has a Force Close button. It requires confirmation. It creates an audit log entry with a timestamp. It immediately transitions the circuit from open to half-open, skipping the probe wait. Half-open then runs its normal progressive confidence check.

Force-close is not "bypass the circuit breaker." It is "I have information the circuit does not have yet — apply it." The half-open validation still runs. If the target is not actually healthy, the circuit re-opens within seconds of the force-close.
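A sketch of the force-close semantics described above. The dict shape, field names, and function are hypothetical stand-ins for illustration; the important properties are that the transition targets half-open (never closed) and that an audit entry is written first:

```python
import time


def force_close(circuit: dict, operator: str, audit_log: list) -> None:
    """Force-close sketch: open -> half-open, skipping the probe wait.

    Hypothetical helper. The half-open confidence check still has to
    pass before the circuit fully closes.
    """
    if circuit["state"] != "open":
        raise ValueError("force-close only applies to an open circuit")
    # Audit entry is written before the transition takes effect.
    audit_log.append({
        "action": "force_close",
        "operator": operator,
        "timestamp": time.time(),
        "hook": circuit["hook"],
    })
    circuit["state"] = "half-open"  # validation still runs from here


circuit = {"state": "open", "hook": "payments"}
log = []
force_close(circuit, "sre-oncall", log)
assert circuit["state"] == "half-open"
assert log[0]["action"] == "force_close"
```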


Why No Other Webhook Tool Has This

The circuit breaker pattern requires state. You need to track failure counts per target, maintain circuit state across requests, run probe timers on a background schedule, and queue events in a durable store while circuits are open.

Most webhook tools are stateless forwarders. They receive a webhook, attempt delivery, log the outcome, and schedule retries using a job queue. The retry queue is the entire extent of their failure handling. Adding circuit breaking on top of a stateless forwarder requires rearchitecting the delivery layer.

HookTunnel's delivery infrastructure was built with per-target state tracking from the beginning. The circuit breaker is not bolted on. It is the delivery layer.


Comparison

| Capability | HookTunnel | ngrok | Webhook.site | Hookdeck | Svix |
|---|---|---|---|---|---|
| Circuit breaker (closed/open/half-open) | Yes | No | No | No | No |
| Dashboard circuit state visibility | Yes | No | No | No | No |
| Durable event queue while open | Yes | No | No | No | No |
| Automatic probe and close | Yes | No | No | No | No |
| Manual force-close with audit log | Yes | No | No | No | No |
| Per-target failure rate tracking | Yes | No | No | Partial | No |

Hookdeck tracks delivery attempts and provides retry configuration. It does not have circuit state — it retries indefinitely, which is exactly the retry storm problem the circuit breaker solves.


Scenarios Where This Matters

Database connection storm. Your payment handler connects to Postgres. Postgres hits connection limit. Every webhook delivery creates a new failed connection attempt, which further exhausts the connection pool. With a circuit breaker, delivery stops after 10 failures. The connection pool drains. Postgres recovers. Delivery resumes. Without a circuit breaker, you are hammering a starved connection pool until something crashes entirely.

Third-party dependency brownout. Your handler calls a fraud-scoring API. The fraud-scoring API enters a brownout — returning 503s with retry-after headers. Your handler returns 500. Your webhook provider retries. You are now creating load on both your handler and the fraud-scoring API during its recovery window. Circuit open: both services get the silence they need.

Post-deploy warm-up window. A new deploy of a handler that initialises a large ML model on first request. The first several requests time out while the model loads. The circuit opens. The model finishes loading. A probe succeeds. The circuit half-opens and resumes delivery at a reduced rate while the handler finishes warming up. Zero events dropped.


FAQ

Does the circuit breaker affect inbound webhook acceptance?

No. Webhooks from your providers are always accepted and acknowledged. The circuit breaker operates on the delivery side — after HookTunnel has received and stored the event. Providers get their 200. Your target gets protected silence while the circuit is open.

What happens to queued events if I update my target URL?

Queued events are associated with the hook, not the circuit state. If you update the target URL while a circuit is open, queued events will attempt delivery to the new URL when the circuit closes. This is intentional — it allows you to point to a healthy replacement target as a recovery action.

Can I configure different thresholds per hook?

Threshold configuration is per-tier. Within your tier, thresholds are consistent across hooks. Per-hook threshold tuning is on the roadmap for Team and Enterprise tiers, where individual hooks may have very different traffic patterns and SLA requirements.

How does the circuit breaker interact with replay?

Replay checks circuit state before executing. If the circuit for the target is open, replay is blocked with a Risky rating in the Replay Prediction score. If the circuit is half-open, replay proceeds at reduced volume, contributing to the half-open confidence check. This prevents the common mistake of replaying a backlog into a recovering target before it is actually ready.
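The replay gate described above reduces to a simple check on circuit state. The function name and rating strings below are hypothetical, echoing the Replay Prediction wording in the text rather than a documented API:

```python
def replay_allowed(circuit_state: str):
    """Gate a replay on circuit state (hypothetical helper).

    Returns (allowed, rating). Rating strings echo the Replay
    Prediction wording above and are not a documented API.
    """
    if circuit_state == "open":
        return False, "Risky: target circuit is open"
    if circuit_state == "half-open":
        return True, "Reduced volume: contributes to confidence check"
    return True, "OK"


assert replay_allowed("open") == (False, "Risky: target circuit is open")
assert replay_allowed("half-open")[0]   # allowed, but throttled
assert replay_allowed("closed") == (True, "OK")
```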

Is circuit state persisted across restarts?

Yes. Circuit state is stored in the database. A deployment restart does not reset open circuits. Probe timers resume on startup. Queued events are not lost.

Stop hammering broken endpoints

Circuit breaker detection, state management, and force-close controls — all in the dashboard, no configuration required.

Get started free →