Engineering·5 min read·2025-03-27·Colm Byrne, Technical Product Manager

Webhook Retry Storms: How Broken Targets Make Outages Worse

A 15-minute DB migration window stretches to 40 minutes of downtime and becomes a 4-hour outage because Stripe, Twilio, and GitHub all pile on retries at once. The circuit breaker pattern stops this from happening — here's how it works at the ingress layer.

2:17 AM. A database migration runs long. You knew the window was supposed to be 15 minutes — it turned into 40. During those 40 minutes, your webhook target started returning 503s.

Stripe began retrying on its exponential backoff schedule. Then Twilio began retrying. Then GitHub. Then your payment processor, your CRM sync, your fraud detection service. Each provider has its own retry schedule, its own backoff curve, its own idea of how hard to hammer your endpoint.

By 3 AM, the migration finishes and your target comes back up. You have 12 providers, 50 customer events each, and up to 5 retry attempts queued per event. That's up to 3,000 inbound requests hitting your target in the first 90 seconds after recovery — before your connection pool has stabilized, before your caches are warm, before your DB read replicas have caught up.

The retries are making your outage worse. Your target comes up, immediately buckles under the retry storm, and goes back down. You've turned a 40-minute maintenance window into a 4-hour incident.

This is the webhook retry storm problem. It's predictable, it's common, and there's a structural fix. For what happens after the storm passes and you need to replay events safely, see the webhook outage recovery playbook.

How provider retry policies work

Every major webhook provider has a retry policy, but no two are the same. A Stripe outage window can generate active retries for the next 24 hours — long after your target has recovered. See Stripe webhook documentation for the full retry schedule, and the HTTP reference for status code behavior. The webhook debugging checklist covers how to identify retry-induced issues in your event logs.

Stripe retries failed webhook deliveries over a 24-hour window. The schedule uses exponential backoff: retries land at roughly 5 minutes, 10 minutes, 25 minutes, 1 hour, 2 hours, 5 hours, 10 hours, and 24 hours after the first failure. Either a non-2xx response or a connection timeout triggers a retry. This means a 30-minute outage window can have active retries piling up for the next 24 hours.

Twilio retries failed requests up to 15 times with its own backoff curve. Twilio's retries are more aggressive in the first few minutes — the goal is fast recovery for voice and SMS, not graceful degradation.

GitHub webhooks attempt delivery up to 3 times for failed deliveries, with shorter windows. Lower retry count, but GitHub sends webhooks for every push, every PR, every issue comment — the event volume can be high.
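Using the approximate Stripe schedule above, a couple of lines show why a brief outage leaves retries in flight for a full day. The offsets are the article's rough figures, not Stripe's authoritative schedule:

```javascript
// Approximate retry offsets (minutes after the first failure), taken
// from the rough schedule described above — illustrative, not authoritative.
const stripeRetryOffsetsMin = [5, 10, 25, 60, 120, 300, 600, 1440];

// Given when a delivery first failed, list when its retries will fire.
function pendingRetryTimes(firstFailure, offsetsMin) {
  return offsetsMin.map(
    (m) => new Date(firstFailure.getTime() + m * 60_000)
  );
}

const failedAt = new Date('2025-03-27T02:17:00Z');
const retries = pendingRetryTimes(failedAt, stripeRetryOffsetsMin);

// The final retry lands a full 24 hours after the outage began.
console.log(retries[retries.length - 1].toISOString()); // 2025-03-28T02:17:00.000Z
```

Even if the target recovers at 2:57 AM, every delivery that failed during the window still has most of this schedule ahead of it.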

The compounding problem is not about any single provider. It's about all of them at once. Here's the math:

12 providers
× 50 events per provider per customer
× 100 customers
× 3-5 retry attempts queued

= 180,000 to 300,000 pending retries
hitting your target in the first few minutes after recovery

Your target was handling maybe 2,000 requests per minute before the outage. It now has 200,000 retry attempts trying to drain in the first few minutes. The target never had a chance.
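The back-of-envelope math above checks out in a couple of lines. The per-provider figures are the article's illustrative numbers, not measured traffic:

```javascript
// Illustrative figures from the scenario above — not real traffic data.
const providers = 12;
const eventsPerProviderPerCustomer = 50;
const customers = 100;

// Total pending retries for a given number of queued attempts per event.
const pendingRetries = (attempts) =>
  providers * eventsPerProviderPerCustomer * customers * attempts;

console.log(pendingRetries(3)); // 180000 — low end, 3 retries queued per event
console.log(pendingRetries(5)); // 300000 — high end, 5 retries queued per event
```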

The retry storm math

The compounding happens because providers are independent. Each provider's retry scheduler doesn't know about the other providers. Each one saw 503s and queued retries according to its own schedule. When the target comes back, all of those independent queues flush simultaneously.

Let's make this concrete with a smaller example:

Stripe:    500 events × 4 pending retries = 2,000 requests
Twilio:    200 events × 6 pending retries = 1,200 requests
GitHub:    1,000 events × 2 pending retries = 2,000 requests
Others:    800 events × 3 pending retries = 2,400 requests

Total: 7,600 requests queuing up in the first 90 seconds

A system that normally handles 50 requests per second is hit with 84 requests per second the moment it recovers — before it's stable, before connection pools are full, before caches are warm.

The shape of this curve is what kills you. If those 7,600 requests drained over 30 minutes instead of 90 seconds, your system would handle it fine. The problem is the instantaneous spike.
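A quick sketch makes the curve-shape argument concrete, using the smaller example's numbers. The same backlog is harmless at a 30-minute drain and fatal at a 90-second one:

```javascript
// Pending retries from the smaller example above (illustrative numbers).
const pending =
  500 * 4 +   // Stripe
  200 * 6 +   // Twilio
  1000 * 2 +  // GitHub
  800 * 3;    // Others

const spikeRps = pending / 90;          // queue flushes in 90 seconds
const spreadRps = pending / (30 * 60);  // same queue spread over 30 minutes

console.log(pending);                // 7600 total pending requests
console.log(Math.round(spikeRps));   // 84 rps — well above the normal 50 rps
console.log(spreadRps.toFixed(1));   // 4.2 rps — trivially absorbed
```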

The circuit breaker pattern

The circuit breaker pattern comes from electrical engineering. An electrical circuit breaker trips when current exceeds a threshold — it cuts the circuit before the wire melts. The breaker doesn't care why the current spiked. It trips on the measurement, not the cause. The circuit breaker is the structural fix that prevents a brief outage from becoming a multi-hour incident. The webhook-driven architecture post covers circuit breakers as one of five reliability principles.

A software circuit breaker does the same thing for service calls. There are three states:

Closed — Normal operation. Requests flow through. The circuit breaker counts failures. Below the failure threshold, it stays closed.

Open — The failure threshold was crossed. The breaker has tripped. Requests are rejected immediately without attempting the call. Events accumulate in a queue. The upstream caller (the webhook provider's retry) never reaches the downstream target.

Half-open — After a configurable timeout, the circuit enters half-open state. One probe request is allowed through. If the probe succeeds, the circuit closes and queued events start flowing. If the probe fails, the circuit trips open again and the timeout resets.

Here's the state machine in Node.js:

class CircuitBreaker {
  constructor(options = {}) {
    this.state = 'closed'; // closed | open | half-open
    this.failureCount = 0;
    this.successCount = 0;
    this.failureThreshold = options.failureThreshold ?? 5;
    this.halfOpenProbeCount = options.halfOpenProbeCount ?? 1;
    this.resetTimeoutMs = options.resetTimeoutMs ?? 60_000;
    this.lastFailureTime = null;
    this.probeInFlight = false; // allow only one probe while half-open
  }

  async call(fn) {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailureTime > this.resetTimeoutMs) {
        this.state = 'half-open';
      } else {
        throw new Error('CIRCUIT_OPEN');
      }
    }

    // Half-open admits a single probe; everything else is rejected.
    if (this.state === 'half-open') {
      if (this.probeInFlight) {
        throw new Error('CIRCUIT_OPEN');
      }
      this.probeInFlight = true;
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    } finally {
      this.probeInFlight = false;
    }
  }

  onSuccess() {
    if (this.state === 'half-open') {
      this.successCount++;
      if (this.successCount >= this.halfOpenProbeCount) {
        this.state = 'closed';
        this.failureCount = 0;
        this.successCount = 0;
      }
    } else {
      this.failureCount = 0;
    }
  }

  onFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    // A failed probe re-opens the circuit and restarts the reset timeout.
    if (this.failureCount >= this.failureThreshold || this.state === 'half-open') {
      this.state = 'open';
      this.successCount = 0;
    }
  }
}

This is the standard pattern. The failure threshold is configurable. The reset timeout is configurable. The key insight is that during open state, the downstream system sees zero requests — it's fully shielded while recovering.
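To make the transitions concrete, here's a self-contained demo. It re-declares a minimal breaker (named MiniBreaker to avoid confusion) with the same closed/open/half-open semantics, with thresholds and timeouts shortened for demonstration:

```javascript
// Condensed breaker with the same semantics as the class above,
// re-declared here so the demo runs on its own.
class MiniBreaker {
  constructor({ failureThreshold = 5, resetTimeoutMs = 60_000 } = {}) {
    this.state = 'closed';
    this.failureCount = 0;
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.lastFailureTime = null;
  }
  async call(fn) {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailureTime > this.resetTimeoutMs) this.state = 'half-open';
      else throw new Error('CIRCUIT_OPEN');
    }
    try {
      const result = await fn();
      this.state = 'closed';
      this.failureCount = 0;
      return result;
    } catch (err) {
      this.failureCount += 1;
      this.lastFailureTime = Date.now();
      if (this.failureCount >= this.failureThreshold || this.state === 'half-open') {
        this.state = 'open';
      }
      throw err;
    }
  }
}

const breaker = new MiniBreaker({ failureThreshold: 3, resetTimeoutMs: 50 });
const down = async () => { throw new Error('503'); };

(async () => {
  for (let i = 0; i < 3; i++) { try { await breaker.call(down); } catch {} }
  console.log(breaker.state); // 'open': threshold reached, downstream is shielded

  try { await breaker.call(down); } catch (err) {
    console.log(err.message); // 'CIRCUIT_OPEN': rejected without calling the target
  }

  await new Promise((r) => setTimeout(r, 60)); // wait out the reset timeout
  console.log(await breaker.call(async () => 'ok')); // probe succeeds: 'ok'
  console.log(breaker.state); // 'closed': circuit recovered
})();
```

Note the property that matters for retry storms: while the circuit is open, the rejected calls never touch the downstream function at all.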

How HookTunnel implements this at the ingress layer

The important distinction is where the circuit breaker lives. In most implementations, the circuit breaker is built into your application — your service wraps its calls to downstream dependencies with circuit breaker logic.

HookTunnel runs the circuit breaker at the ingress layer, before events reach your target at all.

When Stripe sends a webhook to your HookTunnel URL, HookTunnel is the entity forwarding that event to your local handler. HookTunnel tracks delivery outcomes for each target. When failures exceed the threshold, HookTunnel trips the circuit — it stops forwarding events to your target and queues them internally.

From the provider's perspective, HookTunnel still accepted the event with a 200 OK (the event is safely stored). The provider has no reason to retry. The retry storm never starts.

Your target, meanwhile, sees zero webhook traffic. It can recover without load.

The thresholds are per-target and tier-aware. A circuit configured to trip after 5 consecutive failures with a 60-second reset window will:

  1. Accept events from providers and store them during the open window
  2. Attempt one probe delivery when 60 seconds passes
  3. On probe success, replay the queued events with backoff between each delivery
  4. Guardrail the replay: any event that was already applied (receipt confirmed) is skipped

The replay guardrail matters here. During the outage window, some events may have been partially processed. Your handler may have gotten some requests through before the 503s started. The guardrailed replay skips anything with an Applied Confirmed status, so you don't replay events that already committed.
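A guardrailed replay loop can be sketched in a few lines. The names here (the status value, the deliver callback) are hypothetical illustrations, not HookTunnel's actual internals:

```javascript
// Sketch of a guardrailed, paced replay loop (hypothetical names).
async function replayQueue(events, deliver, delayMs = 2000) {
  let delivered = 0;
  for (const event of events) {
    // Guardrail: skip anything already confirmed as applied.
    if (event.status === 'applied_confirmed') continue;
    await deliver(event);
    delivered++;
    // Pace deliveries so the recovering target never sees a spike.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return delivered;
}
```

The pacing could ramp up as the target stays healthy; the key property is that the replay rate is controlled by the forwarder, not by a dozen independent provider schedules.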

Manual override: force-close from the dashboard

The circuit has its own recovery logic, but sometimes you know something it doesn't. You've deployed the fix, you've tested the handler, you've confirmed the database is stable. You're confident the target is ready, and you want to start draining the queue now rather than waiting for the next half-open probe. Explore HookTunnel's webhook inspection features to see the circuit breaker dashboard in action.

HookTunnel's circuit breaker dashboard lets you force-close the circuit. The force-close action:

  1. Sets circuit state to closed
  2. Immediately begins replaying queued events
  3. Logs the force-close to the audit trail with your user ID and timestamp

The audit trail matters for postmortems, and for enterprise customers who ask for impact reports. You want a record of when the circuit tripped, when it would have auto-recovered, and whether a human intervened and why.

The alternative: manual disable and re-enable in every provider's dashboard

Without a circuit breaker at the ingress layer, your mitigation options during a retry storm are limited.

You can try to disable the webhook endpoint in each provider's dashboard. Log into Stripe, find the webhook endpoint, disable it. Log into Twilio, disable. Log into GitHub, disable. Repeat for every provider that was sending events. Each provider has a different UI, different access controls, different confirmation steps.

Then your target recovers. You re-enable in each provider. But disabling doesn't clear the retry queue — Stripe's retry scheduler was still tracking which events failed. Re-enabling may or may not drain those retries depending on the provider. Some providers restart the retry sequence from scratch on re-enable.

And this all assumes you can log into every provider's dashboard at 2 AM while also debugging the outage. In practice, the person who knows the Stripe dashboard credentials is not the same person who's awake at 2 AM fixing the database migration.

The comparison is stark:

| Scenario | HookTunnel circuit breaker | Manual disable/enable | No protection |
|---|---|---|---|
| Retry storm prevented | Yes | Partially | No |
| Target shielded during recovery | Yes | Only if you got to every dashboard | No |
| Events preserved | Yes | Maybe — depends on provider | Some lost on timeout |
| Requires 2 AM dashboard access | No | Yes, in multiple dashboards | N/A |
| Replay is guardrailed (skips duplicates) | Yes | No | No |
| Force-close when ready | Yes | N/A | N/A |
| Audit trail | Yes | No | No |
| Works while you sleep | Yes | No | No |

What the incident looks like with circuit breakers

With a circuit breaker at the ingress, the 2:17 AM scenario plays out differently.

2:17 AM — Migration starts running long. Your target starts returning 503s.

2:19 AM — After 5 consecutive failures, HookTunnel trips the circuit open for your target URL. Incoming events from all providers are accepted and queued at HookTunnel.

2:17 AM to 2:57 AM — 168 events arrive from various providers. All are stored. None are forwarded to your target. Providers receive 200 OK from HookTunnel, so no retry storm starts.

2:20 AM onward — Each time the 60-second reset timeout expires, HookTunnel sends one probe request. Your target is still down and returns a 503. The circuit stays open and the timeout resets.

2:57 AM — Migration completes. Your target comes back up.

2:58 AM — The next probe request succeeds. The circuit closes. HookTunnel begins replaying the 168 queued events with pacing between deliveries, roughly one event every few seconds. Your target drains the queue over 8 minutes.

3:06 AM — All 168 events delivered. Reconciliation dashboard shows zero gap.

4:00 AM — You wake up for an early start, check the dashboard, see the incident timeline, and close the postmortem with a clean audit trail.

The migration window was 40 minutes of downtime. Without circuit breakers, it became 4 hours of incident response. With circuit breakers at the ingress layer, it was 40 minutes of downtime and 8 minutes of controlled replay — all automated.

That's the difference. Not a clever pattern. Not a difficult implementation. Just having the circuit breaker in the right place.

Stop guessing. Start proving.

Generate a webhook URL in one click. No signup required.

Get started free →