Why Your Webhook Handler Should Never Do Heavy Work Synchronously
Your webhook handler has somewhere between 5 and 30 seconds, depending on the provider, before the provider gives up and retries. When it retries, your handler runs again — double provisioning, double emails, duplicate orders. Here is the correct async pattern, and why returning 200 still does not tell you the job finished.
A Stripe payment_intent.succeeded event arrives at your handler at 14:03:07. The handler runs synchronously: it fetches the customer record from the database, inserts the order, sends a confirmation email via SendGrid, provisions the user's account access, logs the event to your analytics table, and then returns 200.
That all takes about 36 seconds.
At 14:03:37, Stripe marks the delivery as timed out: its 30-second limit elapsed with no response from your handler. Stripe queues a retry.
At 14:03:43, your handler finally returns 200. Stripe records this, but the retry is already in the queue.
At 14:04:01, the retry arrives. Your handler runs again. It inserts a second order. It sends a second email. It tries to provision access again (which may fail with a unique constraint, or may silently create a duplicate).
The customer gets two confirmation emails. Your database gets two order rows. Your support queue gets two tickets.
This is not a hypothetical. It happens on every webhook-heavy system that does not handle the timeout problem correctly. The webhook debugging checklist covers how to diagnose these retry-induced duplicates after the fact. See Stripe webhook documentation for Stripe's retry schedule and timeout behavior.
Provider Timeout Limits
Each major provider has a timeout after which it considers the delivery failed and schedules a retry. See Stripe webhook documentation for the full retry behavior spec.
| Provider | Timeout | Retry policy |
|----------|---------|--------------|
| Stripe | 30 seconds | 8 retries over 3 days |
| Twilio | 15 seconds | 1 retry (4 hours later) |
| GitHub | 10 seconds | 3 retries |
| Shopify | 5 seconds | 19 retries over 48 hours |
| PagerDuty | 16 seconds | Varies by plan |
Fifteen seconds sounds like a lot until you factor in:

- Database query latency under load
- Email provider API calls
- Third-party enrichment lookups
- Cold starts in serverless environments
- Connection pool waits when traffic spikes
Real processing work — anything involving a database write, an external API call, or a file operation — should not happen inside the 200 response window. The webhook handler should acknowledge receipt and get out.
The Acknowledge-Fast Pattern
The pattern is simple in concept: return 200 immediately after validating the event signature, then do the work asynchronously in a separate process. This is the single most important change you can make to prevent provider timeouts. For the complete architectural picture including capture-first storage, see webhook-driven architecture.
The synchronous (broken) version:
```javascript
app.post('/webhooks/stripe', express.raw({ type: 'application/json' }), async (req, res) => {
  // Signature verification — correct
  const sig = req.headers['stripe-signature'];
  let event;
  try {
    event = stripe.webhooks.constructEvent(req.body, sig, process.env.STRIPE_WEBHOOK_SECRET);
  } catch (err) {
    return res.status(400).send(`Webhook error: ${err.message}`);
  }

  // Everything below this line is the problem
  if (event.type === 'payment_intent.succeeded') {
    const paymentIntent = event.data.object;

    // DB query — could take 100ms or 5s depending on load
    const customer = await db.customers.findByStripeId(paymentIntent.customer);

    // Another DB write — under a transaction
    const order = await db.orders.createFromPaymentIntent(paymentIntent, customer.id);

    // External API call — latency you do not control
    await sendgrid.send({
      to: customer.email,
      subject: 'Order confirmed',
      html: renderOrderEmail(order),
    });

    // Another external API call
    await provisioningService.grantAccess(customer.id, order.planId);
  }

  // This 200 might arrive 20 seconds later
  res.status(200).send('ok');
});
```
The timeout risk is clear: every await is a potential delay, and the delays compound. If the database is slow that day, if SendGrid has elevated latency, if your provisioning service is cold — any of these can push total time past the provider's limit.
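To make the compounding concrete, here is a back-of-the-envelope sketch. The latency figures are illustrative, not measurements from any real system:

```javascript
// Hypothetical latencies for one bad day — each one plausible on its own
const latenciesMs = {
  customerLookup: 800,   // DB query under load
  orderInsert: 1200,     // transactional write
  sendEmail: 6000,       // email provider with elevated latency
  grantAccess: 9000,     // cold provisioning service
};

const totalMs = Object.values(latenciesMs).reduce((sum, ms) => sum + ms, 0);
console.log(totalMs); // 17000 — already past Twilio's 15s and GitHub's 10s limits
```

No single step is outrageous, but the total blows through most providers' limits before the 200 ever goes out.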
Three Queue Options
The right approach depends on your infrastructure constraints. Here are three options in order of increasing durability.
Option 1: In-Process Queue (Simple, No Infrastructure)
For low-volume applications or prototypes, an in-process job queue requires nothing beyond your application server. The tradeoff: if the process crashes, the job is lost.
```javascript
import { EventEmitter } from 'events';

const jobQueue = new EventEmitter();

jobQueue.on('payment_intent.succeeded', async (paymentIntent) => {
  try {
    const customer = await db.customers.findByStripeId(paymentIntent.customer);
    const order = await db.orders.createFromPaymentIntent(paymentIntent, customer.id);
    await sendgrid.send({ to: customer.email, subject: 'Order confirmed', html: renderOrderEmail(order) });
    await provisioningService.grantAccess(customer.id, order.planId);
  } catch (err) {
    logger.error({ err, paymentIntentId: paymentIntent.id, msg: 'Failed to process payment' });
    // No retry logic — event is lost if this throws
  }
});

app.post('/webhooks/stripe', express.raw({ type: 'application/json' }), (req, res) => {
  let event;
  try {
    event = stripe.webhooks.constructEvent(req.body, req.headers['stripe-signature'], process.env.STRIPE_WEBHOOK_SECRET);
  } catch (err) {
    return res.status(400).send(`Webhook error: ${err.message}`);
  }

  // Acknowledge immediately — under 1ms
  res.status(200).send('ok');

  // Queue the work — also under 1ms (emitting an event is synchronous)
  jobQueue.emit(event.type, event.data.object);
});
```
This removes the timeout risk. The response is now consistently under 5ms.
The durability problem: a process restart between receiving the event and processing it drops the job. Stripe may or may not retry depending on whether it received the 200 before the crash. If it did receive the 200, it considers the event delivered. Your job is gone.
Use this only if you can tolerate occasional event loss, or if Stripe's retry behavior covers your durability requirement.
Option 2: Redis with BullMQ (Production)
BullMQ is the standard durable job queue for Node.js. Jobs are persisted in Redis and survive process crashes.
```javascript
import { Queue, Worker } from 'bullmq';

const paymentQueue = new Queue('payment-processing', {
  connection: { host: process.env.REDIS_HOST, port: 6379 },
});

// In the webhook handler
app.post('/webhooks/stripe', express.raw({ type: 'application/json' }), async (req, res) => {
  let event;
  try {
    event = stripe.webhooks.constructEvent(req.body, req.headers['stripe-signature'], process.env.STRIPE_WEBHOOK_SECRET);
  } catch (err) {
    return res.status(400).send(`Webhook error: ${err.message}`);
  }

  if (event.type === 'payment_intent.succeeded') {
    // This is fast — just a Redis write. Enqueue before acknowledging, so a
    // failed enqueue can still surface as a non-200 and trigger a Stripe retry.
    await paymentQueue.add('process-payment', event.data.object, {
      jobId: event.id, // idempotency: same Stripe event ID = same job ID, won't duplicate
      attempts: 3,
      backoff: { type: 'exponential', delay: 5000 },
    });
  }

  res.status(200).send('ok');
});

// In a separate worker process (or the same process on a worker thread)
const worker = new Worker('payment-processing', async (job) => {
  const paymentIntent = job.data;
  const customer = await db.customers.findByStripeId(paymentIntent.customer);
  const order = await db.orders.createFromPaymentIntent(paymentIntent, customer.id);
  await sendgrid.send({ to: customer.email, subject: 'Order confirmed', html: renderOrderEmail(order) });
  await provisioningService.grantAccess(customer.id, order.planId);

  // Send outcome receipt after the confirmed DB write
  await sendOutcomeReceipt({ stripeEventId: paymentIntent.id, orderId: order.id, status: 'applied_confirmed' });
}, { connection: { host: process.env.REDIS_HOST, port: 6379 } });
```
Using event.id as the jobId gives you application-level idempotency for the queue: if Stripe retries the same event, BullMQ sees the same jobId and skips the duplicate enqueue.
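Conceptually, jobId deduplication behaves like a keyed set. BullMQ implements this in Redis; the Map-based version below only illustrates the semantics, not how BullMQ actually stores jobs:

```javascript
// Illustration only: an enqueue is a no-op when the same jobId was already seen
const enqueued = new Map();

function addJobOnce(jobId, data) {
  if (enqueued.has(jobId)) return false; // duplicate delivery — skipped
  enqueued.set(jobId, data);
  return true;
}
```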
Note that this requires Redis infrastructure. If Redis is unavailable when the webhook arrives, the await paymentQueue.add(...) call will fail. Handle that case explicitly — either with a try/catch that falls back to in-process processing, or by returning 500 so Stripe retries later when Redis is healthy.
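One way to handle it explicitly is a small wrapper. This is a sketch: `enqueueOrSignalRetry` is a name invented here, and `queue` can be any object exposing an async `add(name, data, opts)` method, BullMQ's `Queue` included:

```javascript
// If the durable enqueue fails (e.g. Redis is down), report the failure so
// the caller can respond 500 and let Stripe retry the delivery later.
async function enqueueOrSignalRetry(queue, name, data, opts) {
  try {
    await queue.add(name, data, opts);
    return { ok: true };
  } catch (err) {
    return { ok: false, error: err };
  }
}

// In the handler (sketch):
//   const result = await enqueueOrSignalRetry(paymentQueue, 'process-payment', event.data.object, { jobId: event.id });
//   if (!result.ok) return res.status(500).send('queue unavailable');
//   res.status(200).send('ok');
```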
Option 3: Database Queue (Simplest Durable)
If you already have a database but not Redis, you can use a simple jobs table as a queue. This is less performant than Redis but requires no new infrastructure.
```javascript
// Migration: CREATE TABLE webhook_jobs (id, stripe_event_id UNIQUE, event_type,
// payload, status, attempts, created_at, processed_at) — the UNIQUE constraint
// on stripe_event_id is what makes ON CONFLICT DO NOTHING work below.

app.post('/webhooks/stripe', express.raw({ type: 'application/json' }), async (req, res) => {
  let event;
  try {
    event = stripe.webhooks.constructEvent(req.body, req.headers['stripe-signature'], process.env.STRIPE_WEBHOOK_SECRET);
  } catch (err) {
    return res.status(400).send(`Webhook error: ${err.message}`);
  }

  // Idempotent insert — ON CONFLICT DO NOTHING means duplicate Stripe events are
  // ignored. Insert before acknowledging, so a failed insert surfaces as a
  // non-200 and Stripe retries.
  await db.query(`
    INSERT INTO webhook_jobs (stripe_event_id, event_type, payload, status)
    VALUES ($1, $2, $3, 'pending')
    ON CONFLICT (stripe_event_id) DO NOTHING
  `, [event.id, event.type, JSON.stringify(event.data.object)]);

  res.status(200).send('ok');
});

// Separate worker — poll every 5 seconds
async function processWebhookJobs() {
  const job = await db.query(`
    UPDATE webhook_jobs
    SET status = 'processing', attempts = attempts + 1
    WHERE id = (
      SELECT id FROM webhook_jobs
      WHERE status = 'pending'
      ORDER BY created_at
      FOR UPDATE SKIP LOCKED
      LIMIT 1
    )
    RETURNING *
  `);
  if (job.rows.length === 0) return;

  const { id, event_type, payload } = job.rows[0];
  const data = JSON.parse(payload);
  try {
    if (event_type === 'payment_intent.succeeded') {
      await processPayment(data);
      await sendOutcomeReceipt({ stripeEventId: data.id, status: 'applied_confirmed' });
    }
    await db.query(`UPDATE webhook_jobs SET status = 'completed', processed_at = NOW() WHERE id = $1`, [id]);
  } catch (err) {
    // Marked 'failed' with no automatic retry — a periodic sweep could reset
    // failed jobs to 'pending' while attempts stays under a cap
    await db.query(`UPDATE webhook_jobs SET status = 'failed' WHERE id = $1`, [id]);
    logger.error({ err, jobId: id, eventType: event_type });
  }
}
```
The FOR UPDATE SKIP LOCKED pattern is important: it lets multiple worker processes safely pull jobs from the queue without the same job being processed twice.
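The polling driver itself can be a few lines. This sketch assumes `processWebhookJobs` is adapted to return `true` when it claimed a job and `false` when the queue was empty (a hypothetical tweak, not shown above):

```javascript
// Process jobs until the queue is empty, so a backlog drains faster than
// one job per polling tick; returns how many jobs were processed.
async function drainQueue(processOne) {
  let processed = 0;
  while (await processOne()) processed++;
  return processed;
}

// Poll every 5 seconds:
//   setInterval(() => drainQueue(processWebhookJobs).catch(console.error), 5000);
```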
The Hidden Problem with Async: Did the Job Actually Run?
You implemented one of the three queue options above. The webhook handler returns 200 in under 5ms. Provider timeouts are not a concern. You feel good.
Here is the problem that async processing introduces: you have decoupled the proof of receipt from the proof of completion.
Your webhook handler returned 200. That proves the event arrived and was queued. It says nothing about whether the queue job ran, whether it ran successfully, or whether the database write committed. These are three separate facts.
In the synchronous version, a 200 meant "everything ran." It was slow and caused timeouts, but it was at least semantically honest. In the async version, a 200 means "I have the event and will try to process it." You have no built-in mechanism to know the outcome.
This matters because:
- Workers crash after dequeuing but before completing
- Database writes fail inside the worker with no observable side effect to the caller
- Jobs get stuck in "processing" status and never complete
- Redis connection errors silently drop jobs after the 200 is sent
Six hours later, the customer is still waiting. You look at Stripe — delivered. You look at your queue — the job completed. You look at your worker logs — no errors. You look at the database — no row.
The worker completed successfully, but the database write was rolled back due to a constraint violation that you caught and logged at WARN level in a separate log group that nobody checks.
Outcome Receipts as the Final Piece
The async pattern without receipts gets you: "The event arrived and the job started." What you actually need is: "The job finished and the data committed." Without receipts, you cannot distinguish between a job that completed successfully and one that silently failed. See why delivered doesn't mean applied and the HookTunnel flat $19/month Pro plan to add outcome receipts to your webhook pipeline.
An outcome receipt is a signed HTTP callback that your async worker sends to the webhook infrastructure after the database write commits and the transaction closes. Not after the function returns. Not after the job status updates to "completed." After the actual durable state change.
The implementation in the worker is straightforward:
```javascript
import crypto from 'crypto';

async function processPayment(paymentIntent) {
  const client = await pool.connect();
  let order;
  try {
    await client.query('BEGIN');
    const customer = await client.query('SELECT * FROM customers WHERE stripe_customer_id = $1', [paymentIntent.customer]);
    order = await client.query(
      'INSERT INTO orders (customer_id, amount, stripe_payment_intent_id) VALUES ($1, $2, $3) RETURNING id',
      [customer.rows[0].id, paymentIntent.amount, paymentIntent.id]
    );
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    await sendOutcomeReceipt({
      stripeEventId: paymentIntent.id,
      status: 'applied_failed',
      reason: err.message,
    });
    throw err;
  } finally {
    client.release();
  }

  // Receipt goes AFTER COMMIT — not before, not during. It also sits outside
  // the try block, so a failed receipt send cannot trigger a ROLLBACK of a
  // transaction that already committed.
  await sendOutcomeReceipt({
    stripeEventId: paymentIntent.id,
    orderId: order.rows[0].id,
    status: 'applied_confirmed',
  });
}

async function sendOutcomeReceipt(payload) {
  const timestamp = Date.now();
  const body = JSON.stringify({ ...payload, timestamp });
  const hmac = crypto.createHmac('sha256', process.env.RECEIPT_SIGNING_SECRET);
  hmac.update(body);
  const signature = hmac.digest('hex');

  await fetch(process.env.HOOKTUNNEL_RECEIPT_URL, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-HookTunnel-Signature': signature,
    },
    body,
  });
}
```
The receipt is signed with an HMAC using a secret shared with HookTunnel. The timestamp is included in the signed payload to prevent replay attacks. HookTunnel verifies the signature before accepting the receipt.
Now you have three facts with timestamps:
- The event was received by your handler (Delivered)
- The database transaction committed (Applied — from the receipt)
- The gap between them, if any
When the customer calls, you can look at the event in HookTunnel and see Applied Confirmed at 14:03:51, 44 seconds after the event arrived. Or you see Delivered but no Applied state, and you know the async processing failed and can check the worker logs for that specific event ID.
The acknowledge-fast pattern solves timeouts. Outcome receipts solve observability. You need both.
Stop guessing. Start proving.
Generate a webhook URL in one click. No signup required.
Get started free →