Engineering·5 min read·2025-04-24·Colm Byrne, Technical Product Manager

Building a Webhook-Driven Architecture That Survives Real Failures

Most webhook integrations are built in an afternoon. They work fine for the first 500 customers. Then you hit a DB connection storm, or a provider starts sending 10x the expected volume, or a deploy goes wrong during peak hours — and the architecture that "worked fine" turns into a 3 AM incident. See the webhook incident runbook for how to run those incidents systematically, and webhook retry storms for how volume spikes compound outages.

The core problem is not that webhook handlers are hard to write. The problem is that the easy path — receive event, process synchronously, return 200 — works right up until it doesn't, and when it breaks, it breaks in ways that are nearly impossible to diagnose after the fact. Events are gone. State is inconsistent. Customers are affected.

This post covers five architecture principles that make webhook-driven systems genuinely reliable. Each principle is independent. You can adopt them one at a time and get real improvements at each step. You do not need to rearchitect everything at once.


Principle 1: Capture Before You Forward

The first and most important rule: every inbound event must be persisted before any delivery attempt. Not queued in memory. Not logged to stdout. Written to a durable store — a database table, an S3 object, a message broker with persistence enabled. The Stripe webhook best practices guide covers the rationale from the provider's side, and Stripe webhook documentation covers the full delivery model.

Here is what this protects against. Your webhook handler receives an event. While your handler is processing it — doing a DB lookup, making a downstream HTTP call, running business logic — your process crashes. Without capture-first, that event is gone. The provider may retry, or may not. If they do retry, you may have already partially processed the first attempt. If they do not, you have a permanent gap in your event history with no trace that the event ever arrived.

With capture-first, the event is in your store before any processing begins. If the process crashes, the event is still there. You can replay it. You can audit it. You can see exactly what arrived and when.

This principle is not expensive to implement. In Postgres it looks like this:

CREATE TABLE inbound_events (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  provider    TEXT NOT NULL,
  event_type  TEXT NOT NULL,
  payload     JSONB NOT NULL,
  headers     JSONB NOT NULL,
  received_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  processed   BOOLEAN NOT NULL DEFAULT FALSE
);

Your handler inserts into this table as the first operation, before doing anything else. If the insert fails, return 500 and let the provider retry. If the insert succeeds, proceed. If anything after the insert fails, the event is still there and can be reprocessed.
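The insert-first gate can be sketched as a small function with the durable write injected. This is a minimal sketch, not library code; captureFirst and the insert callback are illustrative names:

```typescript
// Capture-first gate: acknowledge only after the durable write succeeds.
// `insert` is any async function that persists the event (e.g. a db.query
// wrapper around the INSERT above); the names here are illustrative.
type CaptureResult = { status: 200 | 500 };

async function captureFirst(
  insert: (event: unknown) => Promise<void>,
  event: unknown
): Promise<CaptureResult> {
  try {
    // Durable write happens before any processing
    await insert(event);
    // Safe to acknowledge: the event can always be replayed from the store
    return { status: 200 };
  } catch {
    // Let the provider retry; nothing was processed, nothing is lost
    return { status: 500 };
  }
}
```

The point of the shape: the only two outcomes are "stored and acknowledged" or "not stored and not acknowledged." There is no path where the provider believes you have the event and you do not.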

HookTunnel makes this the default behavior. Every event that arrives at your hook URL is stored regardless of what happens to delivery. The store is the source of truth. Delivery is derived from it.


Principle 2: Acknowledge Fast, Process Async

Return 200 within 5 seconds. Period.

This is not just a guideline. Providers enforce hard ceilings above it: Stripe times out at 30 seconds, Twilio at 15 seconds for voice status callbacks, GitHub at 10 seconds. If your handler exceeds the timeout, the provider marks the delivery as failed and may retry — meaning your handler runs again with the same event, potentially on top of a partially completed first attempt. Staying well under the tightest ceiling leaves headroom for network latency and connection-pool waits.

The pattern is: receive, store, return 200, process asynchronously.

// Handler: acknowledge fast
// Note: Stripe signature verification needs the raw request body, so
// mount this route with express.raw({ type: 'application/json' })
app.post('/webhooks/stripe', async (req, res) => {
  // Step 1: Verify signature (fast, in-memory)
  let event;
  try {
    event = stripe.webhooks.constructEvent(
      req.body,
      req.headers['stripe-signature'],
      process.env.STRIPE_WEBHOOK_SECRET
    );
  } catch (err) {
    // Invalid signature: reject without storing or processing
    return res.status(400).json({ error: 'invalid signature' });
  }

  try {
    // Step 2: Capture to DB (fast, single INSERT)
    await db.query(
      `INSERT INTO inbound_events (provider, event_type, payload, headers)
       VALUES ($1, $2, $3, $4)`,
      ['stripe', event.type, event, req.headers]
    );

    // Step 3: Enqueue for async processing
    await jobQueue.add('process-stripe-event', { eventId: event.id });
  } catch (err) {
    // Capture failed: return 500 so the provider retries
    return res.status(500).json({ error: 'capture failed' });
  }

  // Step 4: Return 200 immediately
  res.status(200).json({ received: true });
});

// Worker: process async
jobQueue.process('process-stripe-event', async (job) => {
  const event = await db.query(
    "SELECT * FROM inbound_events WHERE provider = $1 AND payload->>'id' = $2",
    ['stripe', job.data.eventId]
  );
  if (event.rows.length === 0) return; // not captured; nothing to process
  await processStripeEvent(event.rows[0]);
});

A note on queue choice. Redis-backed queues like BullMQ are popular and fast, but their durability is only as good as your Redis persistence settings: with default snapshotting, jobs accepted since the last snapshot can disappear if Redis crashes. BullMQ's removeOnComplete: false keeps completed jobs around for auditing, but it does not make the queue durable. Postgres-backed job queues (pg-boss, graphile-worker) are slower but give you ACID guarantees — a job cannot disappear without a trace. For webhook processing where event loss is a business problem, Postgres-backed queues are worth the overhead.
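The core claim pattern that Postgres-backed queues build on is SELECT ... FOR UPDATE SKIP LOCKED. A hedged sketch, with the query function injected so the logic runs against any client — table and column names match the inbound_events schema above, but this is the underlying pattern, not pg-boss's actual API:

```typescript
// Claim one unprocessed event. In production this runs inside a
// transaction on a single client; SKIP LOCKED lets concurrent workers
// claim different rows instead of blocking on a row another worker
// already locked.
type QueryFn = (sql: string, params?: unknown[]) => Promise<{ rows: any[] }>;

async function claimNextEvent(query: QueryFn): Promise<any | null> {
  const { rows } = await query(
    `SELECT * FROM inbound_events
     WHERE processed = FALSE
     ORDER BY received_at
     FOR UPDATE SKIP LOCKED
     LIMIT 1`
  );
  if (rows.length === 0) return null;

  // Mark the claimed row so it is not picked up again after commit
  await query(
    `UPDATE inbound_events SET processed = TRUE WHERE id = $1`,
    [rows[0].id]
  );
  return rows[0];
}
```

Because the row lock and the processed flag live in the same transaction as your business logic, a crashed worker simply releases its lock and the event becomes claimable again.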


Principle 3: Idempotency at Every Layer

Webhook providers retry. Networks are unreliable. Queues can deliver messages more than once. Your handler will receive the same event multiple times. You need idempotency at every layer of your stack. The DB unique constraint is not optional — it is the only safe guarantee against duplicate processing. See the webhook handler async pattern for complete implementation patterns.

There are three layers, and you need all three.

Layer 1: Application-level check. Before processing, check if you have already processed this event ID:

const existing = await db.query(
  'SELECT id FROM processed_events WHERE event_id = $1',
  [event.id]
);
if (existing.rows.length > 0) {
  return { status: 'already_processed' };
}
await processEvent(event);
await db.query(
  'INSERT INTO processed_events (event_id) VALUES ($1)',
  [event.id]
);

This works most of the time. But there is a race condition. Two concurrent executions of the same event can both pass the check before either completes the insert. Under low concurrency this is rare. Under high load it happens regularly.

Layer 2: DB unique constraint. The only safe floor:

CREATE TABLE processed_events (
  event_id TEXT PRIMARY KEY,
  processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

With this constraint, the second concurrent insert will fail with a unique violation. Your application code catches that exception and treats it as "already processed." The constraint is not optional — it is the guarantee. The application-level check is a performance optimization (avoid the failed insert), not the safety mechanism.
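Catching that violation is a few lines. A sketch, assuming node-postgres-style errors — 23505 is Postgres's SQLSTATE for unique_violation, and insertProcessed stands in for the INSERT into processed_events:

```typescript
// Treat a unique-constraint violation as "already processed" rather than
// a failure. `insertProcessed` is any async function performing the
// INSERT; any error shape carrying a Postgres `code` field works.
async function markProcessedOnce(
  insertProcessed: (eventId: string) => Promise<void>,
  eventId: string
): Promise<"processed" | "already_processed"> {
  try {
    await insertProcessed(eventId);
    return "processed";
  } catch (err: any) {
    if (err?.code === "23505") {
      // Another execution won the race; the constraint did its job
      return "already_processed";
    }
    throw err; // real failures still propagate
  }
}
```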

Layer 3: Idempotent business logic. Your handler should use INSERT ... ON CONFLICT DO UPDATE (upsert) rather than plain INSERT wherever possible:

INSERT INTO subscriptions (user_id, plan, status, updated_at)
VALUES ($1, $2, 'active', NOW())
ON CONFLICT (user_id)
DO UPDATE SET plan = EXCLUDED.plan, status = EXCLUDED.status, updated_at = NOW();

Running this twice with the same inputs produces the same state. A plain INSERT run twice throws an error. A plain UPDATE, replayed after other events have already moved the row, can overwrite newer state with stale values. Upsert is the only correct default.


Principle 4: Outcome Receipts

Principles 1 through 3 get your events stored and processed reliably. Principle 4 closes the observability gap. Without it, you know an event was delivered (your handler returned 200). You do not know whether the outcome was applied — and that gap is where webhook revenue leakage silently accumulates. See why delivered doesn't mean applied for the full explanation.

The distinction matters because your handler can return 200 from a try/catch block that silently swallowed an exception. It can return 200 before an async write completes. It can return 200 when the operation technically succeeded but wrote the wrong thing. A 200 from your handler is evidence that your handler ran. It is not evidence that the outcome was committed.

Outcome receipts fix this. After your DB write commits, your application sends a signed receipt to HookTunnel:

async function processStripeEvent(event: InboundEvent): Promise<void> {
  // Business logic
  await db.transaction(async (trx) => {
    await trx.query(
      `INSERT INTO subscriptions (user_id, plan, status)
       VALUES ($1, $2, 'active')
       ON CONFLICT (user_id) DO UPDATE SET plan = EXCLUDED.plan, status = 'active'`,
      [event.payload.data.object.metadata.user_id, event.payload.data.object.plan.id]
    );
  });

  // Receipt: only sent after transaction commits
  await sendOutcomeReceipt({
    hookId: event.hook_id,
    requestLogId: event.request_log_id,
    status: 'applied',
    secretKey: process.env.HOOKTUNNEL_RECEIPT_SECRET,
  });
}

async function sendOutcomeReceipt(params: ReceiptParams): Promise<void> {
  const body = JSON.stringify({
    request_log_id: params.requestLogId,
    status: params.status,
    timestamp: new Date().toISOString(),
  });

  const signature = createHmac('sha256', params.secretKey)
    .update(body)
    .digest('hex');

  await fetch(`https://hooks.hooktunnel.com/api/v1/receipts`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-HookTunnel-Signature': signature,
    },
    body,
  });
}

With receipts in place, your monitoring shifts from "did my handler respond?" to "are my outcomes being confirmed?" HookTunnel tracks three states: Paid (confirmed by Stripe), Delivered (your handler returned 2xx), Applied (receipt confirmed DB commit). Any event that has been Delivered but not Applied within your SLA window generates an alert.

The SLA timer is the backstop. You configure it per hook (60 seconds is a reasonable default). If no receipt arrives within the window after a successful delivery, the event is flagged as Applied Unknown. You know something may have gone wrong. You can investigate before the customer reports it.
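The classification itself is simple enough to sketch. Field names and the 60-second default here are illustrative, not HookTunnel's actual schema:

```typescript
// Classify an event's outcome from its delivery time and (optional)
// receipt time. If no receipt has arrived and the SLA window has
// elapsed, the event is flagged for investigation.
function outcomeStatus(
  deliveredAt: Date,
  receiptAt: Date | null,
  now: Date,
  slaMs = 60_000
): "applied" | "pending" | "applied_unknown" {
  if (receiptAt) return "applied"; // receipt confirms the DB commit
  return now.getTime() - deliveredAt.getTime() > slaMs
    ? "applied_unknown" // delivered, never confirmed: alert
    : "pending";        // still inside the SLA window
}
```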


Principle 5: Circuit Breakers at the Ingress

When a target handler starts failing, the default behavior of most retry systems is to keep trying. Exponential backoff slows the rate, but if the target is genuinely broken — not transiently unavailable, but actually broken — continued retries accomplish nothing except adding load to an already stressed system. Circuit breakers at the ingress layer prevent those retry storms from turning a brief outage into a multi-hour incident. The full mechanics are covered in webhook retry storms, and HookTunnel's webhook inspection features show circuit breaker dashboards in action.

Circuit breakers change the behavior. When a target crosses a failure threshold (configurable: percentage of failures in a rolling window, or consecutive failures), the circuit opens. Subsequent events are queued at the ingress rather than forwarded. The target gets no traffic. When the circuit enters half-open state after a cooldown period, a single probe request is sent. If it succeeds, the circuit closes and queued events are replayed. If it fails, the circuit remains open.

// Illustrative defaults; tune per target
const FAILURE_THRESHOLD = 5;
const COOLDOWN_MS = 60_000;

interface ForwardResult {
  status: 'delivered' | 'queued' | 'failed';
  statusCode?: number;
  reason?: string;
  error?: string;
}

interface CircuitState {
  status: 'closed' | 'open' | 'half-open';
  failureCount: number;
  lastFailureAt: Date | null;
  cooldownUntil: Date | null;
}

async function forwardWithCircuitBreaker(
  targetUrl: string,
  payload: unknown,
  circuitState: CircuitState
): Promise<ForwardResult> {
  if (circuitState.status === 'open') {
    if (new Date() < circuitState.cooldownUntil!) {
      return { status: 'queued', reason: 'circuit_open' };
    }
    // Transition to half-open
    circuitState.status = 'half-open';
  }

  try {
    const response = await fetch(targetUrl, {
      method: 'POST',
      body: JSON.stringify(payload),
      signal: AbortSignal.timeout(10_000),
    });

    if (response.ok) {
      // Reset on success
      circuitState.status = 'closed';
      circuitState.failureCount = 0;
      return { status: 'delivered', statusCode: response.status };
    }

    throw new Error(`Target returned ${response.status}`);
  } catch (err) {
    circuitState.failureCount++;

    if (circuitState.failureCount >= FAILURE_THRESHOLD) {
      circuitState.status = 'open';
      circuitState.lastFailureAt = new Date();
      circuitState.cooldownUntil = new Date(Date.now() + COOLDOWN_MS);
    }

    return { status: 'failed', error: (err as Error).message };
  }
}

The circuit breaker also protects the recovery itself. If your target went down during a DB migration and comes back up 10 minutes later, without a circuit breaker your ingress has been queuing retries the entire time and releases them in a burst that can overwhelm the recovering service. With a circuit breaker, traffic held during the outage is released at a controlled rate as the circuit transitions through half-open.
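That controlled release can be sketched as a paced drain. The send callback and interval are illustrative; a real implementation would also watch the circuit state while draining:

```typescript
// Replay held events one at a time with a pacing gap, rather than
// releasing the whole backlog in a single burst. `send` is any async
// delivery function; `intervalMs` is an illustrative pacing parameter.
async function drainPaced<T>(
  queued: T[],
  send: (item: T) => Promise<void>,
  intervalMs = 250
): Promise<void> {
  for (const item of queued) {
    await send(item); // one event at a time, in arrival order
    await new Promise((r) => setTimeout(r, intervalMs)); // pacing gap
  }
}
```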

HookTunnel implements circuit breakers per target with configurable thresholds per tier. Free tier uses conservative defaults. Team and Enterprise tiers allow per-hook configuration. The dashboard shows current circuit state for all targets with a force-close option for when you know a target has recovered and do not want to wait for the cooldown.


Putting It Together

These five principles are not a monolith. You can implement them in order, getting measurable value at each step:

  1. Add capture-first storage — you now have a complete event history and replay capability
  2. Add async processing — you eliminate delivery timeouts and protect against handler crashes
  3. Add idempotency constraints — you can safely replay without producing duplicate state
  4. Add outcome receipts — you go from delivery confirmation to outcome confirmation
  5. Add circuit breakers — you prevent compounding failures during outages

The dependency graph is directional. Outcome receipts require capture-first (you need a request log ID to reference in the receipt). Guardrailed replay requires idempotency (you need to know which events are safe to replay). Circuit breakers work at any point — they are an ingress concern, independent of what happens downstream.

Most webhook integrations start at step zero. They work until they encounter a failure mode that the synchronous, stateless handler pattern cannot survive. When that failure mode arrives — and it will — the cost of rearchitecting under pressure is ten times higher than building it right in the first place.

HookTunnel implements Principles 1, 4, and 5 out of the box. You bring the queue and the idempotency keys.

Stop guessing. Start proving.

Generate a webhook URL in one click. No signup required.

Get started free →