Silent Webhook Failures: When 200 OK Means Nothing at All

It's Sunday afternoon. You shipped a refactor Thursday. Everything looks fine: zero errors in your monitoring, P99 latency normal, error rate 0%. But 20 customers upgraded to Pro over the weekend, and 6 of them can't access Pro features.

Your webhook handler returned 200 for all 20 events. Stripe says delivered for all 20. Your database has 14 updated subscription records, not 20.

You open your logs. Nothing. No exceptions, no warnings, no timeouts. The handler processed 20 events and returned 200 for all of them. Your monitoring saw nothing unusual. Stripe saw nothing unusual.

Six customers are on hold, waiting for access they paid for. You have no idea which events are the broken ones without manually cross-referencing Stripe event IDs against your subscription table.

This is a silent webhook failure. It's the hardest kind to debug because every system that's supposed to catch it says everything is fine. The delivery layer worked perfectly. The problem is downstream of delivery, invisible to everything that monitors delivery. If you're using RabbitMQ downstream, see also how RabbitMQ's ack model can catch failures that the HTTP layer misses.

Why 200 OK proves nothing

This feels counterintuitive. If your handler returned 200, didn't it process the event?

Not necessarily. The 200 response is about receipt, not completion. The standard webhook pattern — the one Stripe webhook documentation recommends — is to return 200 immediately and process asynchronously:

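app.post('/webhooks/stripe', async (req, res) => {
  // Verify the signature first (req.body must be the raw request body).
  // A forged or malformed request gets a 400, not an acknowledgment.
  let event;
  try {
    event = stripe.webhooks.constructEvent(req.body, req.headers['stripe-signature'], process.env.STRIPE_WEBHOOK_SECRET);
  } catch (err) {
    return res.status(400).send('Invalid signature');
  }

  // Return 200 to acknowledge receipt before doing any real work
  // (prevents Stripe from retrying while you process)
  res.status(200).send('OK');

  // THEN process the event asynchronously
  await processSubscriptionUpdate(event);
});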

This is the correct pattern for preventing duplicate retries. But it means the 200 response and the database write are completely decoupled. The 200 tells Stripe "I received the event." It says nothing about whether the database write succeeded.

If processSubscriptionUpdate throws after the 200 is sent, Stripe sees a successful delivery. Your monitoring sees a successful response. The database write never happened. Silent failure.

There are four specific ways this happens in production code, and each one is subtler than it looks.

Silent failure mode 1: DB connection pool exhausted

Your application maintains a connection pool to PostgreSQL. The pool has a maximum size — say, 10 connections. Under normal load, the pool is sized correctly.

But Sunday afternoon, a batch job is running. It's holding all 10 connections for long-running aggregation queries. Your handler gets a webhook event. It tries to acquire a connection from the pool. No connections are available.

What happens next depends on your database client's configuration. With some defaults, the pool client queues the request and waits. The wait has a timeout — maybe 30 seconds, maybe 5 minutes. Your handler returned 200 ten seconds ago. Stripe is happy. The processing is sitting in a queue, waiting for a connection.

If the connection becomes available within the timeout, the write succeeds. If not, the timeout triggers an error.

// This fails silently when the pool is exhausted
app.post('/webhooks/stripe', async (req, res) => {
  res.status(200).send('OK');

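  // (newTier and customerId come from the verified event; extraction omitted here)
  let client;
  try {
    // This call may time out waiting for a pool connection
    client = await pool.connect();
    await client.query(
      'UPDATE subscriptions SET tier = $1 WHERE stripe_customer_id = $2',
      [newTier, customerId]
    );
  } catch (err) {
    // This catch block runs AFTER the 200 was sent.
    // Nobody hears this error.
    console.error('Subscription update failed:', err.message);
    // No retry. No alert. No delivery failure from Stripe's perspective.
  } finally {
    // Always return the connection, even when the query fails,
    // so a failed write does not shrink the pool further.
    if (client) client.release();
  }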
});

The console.error is there. If you know to look in the logs at the exact right time with the exact right query, you'll find it. But it generated no alert, no metric spike, no delivery failure. Stripe's dashboard shows 200 OK.
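One way to shorten the window in which the write sits in the pool's queue is to put your own deadline on connection acquisition. The following is a minimal sketch, not a feature of any particular pool library: `acquireWithDeadline` is a hypothetical helper, and it assumes only that `pool.connect()` returns a promise for a client with a `release()` method, as in the handler above. Many pool clients also expose their own acquire-timeout setting, which may be the simpler option.

```javascript
// Hypothetical helper: give up on pool.connect() after `ms` milliseconds.
async function acquireWithDeadline(pool, ms) {
  let timedOut = false;
  let timer;
  const pending = pool.connect();

  // If the connection arrives after we gave up, hand it straight back
  // to the pool so the late client is not leaked.
  pending.then(
    (client) => { if (timedOut) client.release(); },
    () => {} // a connect error is surfaced through the race below
  );

  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => {
      timedOut = true;
      reject(new Error(`pool acquire timed out after ${ms}ms`));
    }, ms);
  });

  try {
    // Whichever settles first wins.
    return await Promise.race([pending, deadline]);
  } finally {
    clearTimeout(timer);
  }
}
```

A short deadline does not make the write succeed, but it turns a long, quiet wait into a prompt error that you can count and alert on.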

Silent failure mode 2: Unhandled promise rejection

Asynchronous code that throws after the response is sent is particularly invisible.

app.post('/webhooks/stripe', async (req, res) => {
  res.status(200).send('OK');

  // Fire and forget — no await, no catch
  updateSubscriptionInDatabase(customerId, newTier);
});

async function updateSubscriptionInDatabase(customerId, tier) {
  const customer = await db.customers.findOne({ stripeId: customerId });

  if (!customer) {
    // This throws, but nobody is awaiting this function.
    // It becomes an unhandled promise rejection.
    throw new Error(`Customer not found: ${customerId}`);
  }

  await db.subscriptions.update({ customerId: customer.id }, { tier });
}

In Node.js, an unhandled promise rejection emits the 'unhandledRejection' event on process. If you have registered a handler with process.on('unhandledRejection', ...), it logs the error somewhere. If you haven't, Node.js (since v15) terminates the process by default. That is a visible failure, but the process restart may be fast enough that monitoring misses the blip.

Either way, Stripe saw 200 OK. The customer's subscription was not updated. There's no retry queued.

The specific bug here is a customer ID mismatch — maybe the refactor you shipped Thursday changed how you store the Stripe customer ID. The handler finds no customer record, throws, and the unhandled rejection disappears into the void.
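If work really must run without being awaited, the rejection can at least be routed somewhere deliberate. The sketch below is one possible shape, with `fireAndForget` and its `onError` callback as hypothetical names rather than anything from Express or Stripe:

```javascript
// Hypothetical helper: start post-response work without awaiting it,
// but guarantee that a failure reaches `onError` instead of becoming
// an unhandled promise rejection.
function fireAndForget(work, onError) {
  // Promise.resolve().then(work) also turns a synchronous throw
  // inside `work` into a rejection, so both cases hit the same catch.
  return Promise.resolve()
    .then(work)
    .catch((err) => {
      onError(err);
    });
}
```

In the handler above, the call would become something like `fireAndForget(() => updateSubscriptionInDatabase(customerId, newTier), reportFailure)`, where `reportFailure` is whatever raises an error-level log line or a metric in your stack. This makes the failure loud; it does not retry it.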

Silent failure mode 3: Queue drop

The recommended pattern for webhook handlers is to return 200 quickly and push the actual work to a queue. This is correct. It prevents Stripe from retrying while you process. But it introduces a new failure point: the queue itself.

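app.post('/webhooks/stripe', async (req, res) => {
  // Verify and parse the event before queueing it
  const event = stripe.webhooks.constructEvent(req.body, req.headers['stripe-signature'], process.env.STRIPE_WEBHOOK_SECRET);

  // Push to Redis queue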
  await redisQueue.push('stripe-events', {
    type: event.type,
    data: event.data,
    eventId: event.id,
  });

  res.status(200).send('OK');
});

// Separate worker process:
async function processStripeEventWorker() {
  while (true) {
    const job = await redisQueue.pop('stripe-events');
    await processSubscriptionUpdate(job);
  }
}

Your Redis push succeeds. Redis accepted the job. The 200 is sent. 400 milliseconds later, your worker process crashes on an unrelated error — a memory leak, a configuration issue, a dependency that returned null where the worker expected an object.

The job is in the queue. The worker is down. The job stays in the queue until the worker restarts — which might be seconds (if you have a process supervisor) or minutes (if you're restarting manually) or hours (if the crash happened at 3 AM and nobody is paged).

But from Stripe's perspective: 200 OK, delivery confirmed. No retry queued. The event is no longer Stripe's problem.

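The "at-least-once delivery" guarantee that queues advertise depends on acknowledgment semantics, not just on network reliability. It holds only if the queue keeps a job until the worker acknowledges it, and the worker acknowledges only after the work is done. A plain pop removes the job the moment the worker receives it, so a crash mid-processing loses it. Redis without persistence also loses queued jobs if Redis itself restarts. Queue libraries such as Bull can recover an in-flight job, but only if the crash happens before the job is marked complete; a worker that acknowledges first and processes second has the same gap as a plain pop.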

The failure mode is real in production. The refactor you shipped Thursday changed a data shape. The worker expects event.data.object.id and the field moved to event.data.object.stripe_id. The worker crashes on every event. Every crash logs an error — but in the worker's log, not the handler's log. If nobody is monitoring the worker's crash rate independently, the silent failures accumulate.
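The acknowledgment ordering is easier to see in code. What follows is an in-memory illustration only: `AckQueue` and `runOne` are hypothetical names, not a Redis or Bull API, and a real queue would persist both lists. The point it shows is that a job stays recoverable as long as the acknowledgment comes after the handler finishes.

```javascript
// In-memory illustration of acknowledge-after-processing.
class AckQueue {
  constructor() {
    this.pending = [];         // jobs waiting for a worker
    this.inFlight = new Map(); // jobs handed to a worker, not yet acknowledged
    this.nextId = 1;
  }

  push(payload) {
    const job = { id: this.nextId++, payload };
    this.pending.push(job);
    return job.id;
  }

  // Hand the oldest job to a worker, keeping a copy until it is acknowledged.
  reserve() {
    const job = this.pending.shift();
    if (job) this.inFlight.set(job.id, job);
    return job;
  }

  // Called by the worker only after the work has finished.
  ack(jobId) {
    this.inFlight.delete(jobId);
  }

  // Called by a supervisor when a worker dies:
  // unacknowledged jobs go back to the front of the pending list.
  requeueInFlight() {
    const jobs = [...this.inFlight.values()];
    this.pending.unshift(...jobs);
    this.inFlight.clear();
    return jobs.length;
  }
}

async function runOne(queue, handler) {
  const job = queue.reserve();
  if (!job) return false;
  await handler(job.payload); // if this throws, ack below is never reached
  queue.ack(job.id);
  return true;
}
```

A worker that crashes inside `handler` leaves the job in `inFlight`, where a supervisor can find and requeue it. It still needs independent monitoring: a worker that crashes on every event will requeue the same job forever.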

Silent failure mode 4: Transaction rollback

Three database writes in a transaction. The first two succeed. The third fails. The transaction is rolled back. The first two writes are undone.

async function processSubscriptionUpdate(event) {
  const { customerId, newTier, newPriceId } = extractEventData(event);

  const trx = await db.transaction();

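  try {
    // Read the current tier first so the history row can record it
    const current = await trx('subscriptions')
      .where({ stripe_customer_id: customerId })
      .first('tier');

    // Write 1: Update subscription tier
    await trx('subscriptions')
      .where({ stripe_customer_id: customerId })
      .update({ tier: newTier, updated_at: new Date() });

    // Write 2: Log the tier change
    await trx('subscription_history').insert({
      customer_id: customerId,
      old_tier: current ? current.tier : null,
      new_tier: newTier,
      changed_at: new Date(),
    });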

    // Write 3: Update feature flags (added in the Thursday refactor)
    await trx('feature_flags')
      .where({ customer_id: customerId })
      .update({ pro_features: newTier === 'pro' });

    await trx.commit();
  } catch (err) {
    await trx.rollback();
    // The error is caught. The rollback is logged internally.
    // But this function was called fire-and-forget.
    // The caller already sent 200 OK.
    logger.warn('Subscription update transaction rolled back', { customerId, err: err.message });
  }
}

The Thursday refactor added the feature flags write. The feature_flags table was added in a migration that ran in staging but not yet in production. Production doesn't have the table. The trx('feature_flags') call throws a "relation does not exist" error. The transaction rolls back. The subscription tier is not updated.

The logger.warn fires. It's in the logs. But it's a warning, not an error. Your alert threshold is error level. No page fires.

Stripe saw 200 OK twelve seconds ago. No retry will come.

Six customers upgraded to Pro, paid Stripe, triggered this handler, got back 200 OK from your endpoint, and have no Pro features because the transaction rolled back silently.
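A missing table is the kind of problem that can be caught at startup rather than on the first paying customer. As a rough sketch, assuming `db` is a Knex instance (its `schema.hasTable` method resolves to a boolean), a pre-flight check might look like this. The helper name `findMissingTables` is hypothetical:

```javascript
// Returns the names of required tables that do not exist in the database.
// Assumes `db.schema.hasTable(name)` resolves to true or false, as in Knex.
async function findMissingTables(db, requiredTables) {
  const checks = await Promise.all(
    requiredTables.map(async (name) => ({
      name,
      exists: await db.schema.hasTable(name),
    }))
  );
  return checks.filter((check) => !check.exists).map((check) => check.name);
}
```

Calling it at boot with `['subscriptions', 'subscription_history', 'feature_flags']` and refusing to start when the result is non-empty would have turned this incident into a failed deploy on Thursday instead of six silent rollbacks over the weekend.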

The only fix: outcome receipts

Every one of these failure modes has the same shape: the 200 response is decoupled from the actual work. The delivery layer reports success. The application layer fails silently.

You can add better error handling to each of these cases — uncaught exception handlers, worker health checks, transaction logging. And you should. But the fundamental problem is that your monitoring doesn't know whether the database write committed. It knows whether the event was received. Those are different things.

Outcome receipts close this gap.

The pattern works like this: after your handler successfully commits the database write, it sends a signed callback to HookTunnel. The callback says "event ID X was applied at time T, transaction committed." HookTunnel records this as Applied Confirmed for that event.

If HookTunnel doesn't receive the receipt within the SLA window (configurable, default 60 seconds), the event is flagged Applied Unknown. Not failure — unknown. You don't know if it applied or not. That's the correct state to report when you have no evidence either way.

app.post('/webhooks/stripe', async (req, res) => {
  let event;
  try {
    event = stripe.webhooks.constructEvent(
      req.body,
      req.headers['stripe-signature'],
      process.env.STRIPE_WEBHOOK_SECRET
    );
  } catch (err) {
    // Reject requests that fail signature verification.
    return res.status(400).send('Invalid signature');
  }

  res.status(200).send('OK');

  const receiptBase = {
    eventId: event.id,
    hookTunnelUrl: process.env.HOOKTUNNEL_RECEIPT_URL,
    secret: process.env.HOOKTUNNEL_RECEIPT_SECRET,
  };

  let trx;
  try {
    const { customerId, newTier } = extractEventData(event);
    trx = await db.transaction();

    await trx('subscriptions')
      .where({ stripe_customer_id: customerId })
      .update({ tier: newTier });

    await trx('feature_flags')
      .where({ customer_id: customerId })
      .update({ pro_features: newTier === 'pro' });

    await trx.commit();
  } catch (err) {
    // Write failed. Roll back, then send a failure receipt so HookTunnel knows immediately.
    if (trx) await trx.rollback().catch(() => {});
    await sendOutcomeReceipt({
      ...receiptBase,
      status: 'applied_failed',
      errorCode: classifyError(err),
    }).catch((receiptErr) => console.error('Failure receipt not sent:', receiptErr.message));
    return;
  }

  // Commit succeeded. Send the outcome receipt outside the try block,
  // so a receipt error is never misreported as a failed write.
  await sendOutcomeReceipt({ ...receiptBase, status: 'applied_confirmed' })
    .catch((receiptErr) => console.error('Confirmation receipt not sent:', receiptErr.message));
});

The receipt is HMAC-signed with a secret shared between your application and HookTunnel. HookTunnel verifies the signature before recording the receipt — it can't be spoofed by a third party.
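To make the signing step concrete, here is a generic HMAC sketch using Node's built-in crypto module. The wire format is an assumption made for illustration (HMAC-SHA256 over a timestamp and the JSON body, hex-encoded); HookTunnel's actual receipt format is not described here, and `signReceipt` and `verifyReceipt` are hypothetical names.

```javascript
const crypto = require('node:crypto');

// Assumed format: HMAC-SHA256 over "<timestamp>.<JSON body>", hex-encoded.
function signReceipt(receipt, secret, timestamp = Date.now()) {
  const body = JSON.stringify(receipt);
  const signature = crypto
    .createHmac('sha256', secret)
    .update(`${timestamp}.${body}`)
    .digest('hex');
  return { body, timestamp, signature };
}

// Recompute the HMAC on the receiving side and compare in constant time.
function verifyReceipt({ body, timestamp, signature }, secret) {
  const expected = crypto
    .createHmac('sha256', secret)
    .update(`${timestamp}.${body}`)
    .digest();
  const given = Buffer.from(String(signature), 'hex');
  // timingSafeEqual throws on a length mismatch, so check the length first.
  return given.length === expected.length && crypto.timingSafeEqual(given, expected);
}
```

Including the timestamp in the signed payload lets the receiver reject stale receipts as well as forged ones, which is a common design choice for signed callbacks.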

Now the monitoring picture changes:

Applied Confirmed: the transaction committed. The customer has Pro features. You have cryptographic proof.
Applied Unknown: the SLA window passed without a receipt. Investigate. Something in the failure modes above may have fired.
Applied Failed: the handler explicitly reported failure. Investigate and replay.

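The silent failures in modes 1-4 above all end up as Applied Failed or Applied Unknown rather than Applied Confirmed: Applied Failed when the handler catches the error and reports it, Applied Unknown when nothing is reported at all. The silence becomes visible. The SLA window is your detection time: if the receipt doesn't arrive within 60 seconds of delivery, you know to look.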
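The three states reduce to a small decision on the monitoring side. This is a sketch of that logic under stated assumptions, not HookTunnel's implementation: the function name, the record shapes, and the extra `pending` state (for events still inside the SLA window) are all hypothetical.

```javascript
const DEFAULT_SLA_MS = 60_000; // the 60-second default window described above

// `delivery` is { deliveredAt: <epoch ms> }.
// `receipt` is { status } if one has arrived, otherwise undefined.
function applicationStatus(delivery, receipt, now = Date.now(), slaMs = DEFAULT_SLA_MS) {
  // A receipt is evidence either way: 'applied_confirmed' or 'applied_failed'.
  if (receipt) return receipt.status;
  // No receipt yet: unknown only once the SLA window has passed.
  return now - delivery.deliveredAt > slaMs ? 'applied_unknown' : 'pending';
}
```

Note that absence of a receipt never maps to failure. Without evidence, the most the monitor can say is that it does not know.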

From "everything looks fine" to "I have proof"

The emotional arc of a silent webhook failure has a particular texture. Everything looks fine in your dashboards. Stripe says delivered. Your handler returned 200. Your error rate is 0%. And somewhere between 6 and 20 customers have a problem.

The investigation is painful because you have to prove a negative — you have to find the evidence that something didn't happen. You cross-reference Stripe event IDs against database records manually. You check timestamps and look for gaps. You eventually find the 6 missing records and trace backward through logs to find the warn-level message that nobody was paged about. This is exactly the forensics problem our webhook debugging checklist is designed to help you work through systematically.
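The manual cross-reference itself is a set difference. A minimal sketch, assuming you have exported the delivered events (each with an event ID and a customer ID) and queried the customer IDs whose subscription rows show the new tier; `findUnappliedEvents` and both input shapes are hypothetical:

```javascript
// `deliveredEvents`: [{ id, customerId }] taken from the provider's delivery log.
// `appliedCustomerIds`: customer IDs whose subscription rows show the new tier.
function findUnappliedEvents(deliveredEvents, appliedCustomerIds) {
  const applied = new Set(appliedCustomerIds);
  // Keep only the delivered events with no matching database record.
  return deliveredEvents.filter((event) => !applied.has(event.customerId));
}
```

This finds which events are missing, but not why. The cause still has to be traced through the logs.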

With outcome receipts, the detection is immediate. The event appears as Applied Unknown after the SLA window expires. You see the exact event IDs, the exact timestamps, the exact delivery status. You can replay the specific events against your fixed handler and verify that they move to Applied Confirmed.

You go from "I have to prove that 6 things didn't happen" to "here are the 6 events that didn't apply, here is the evidence, here is the replay result." The postmortem writes itself.

The proof matters for more than debugging. It matters for customer support (you can tell the 6 affected customers exactly when their accounts were recovered and show the evidence). It matters for compliance audits (you have an immutable log of every event, its delivery status, and its application status). It matters for your own confidence that the system is working — not "it seems fine," but "I have receipts." Silent failures that go undetected are a primary driver of webhook revenue leakage — the customers who paid but didn't get what they paid for.

That's the shift. Not just monitoring delivery. Monitoring outcomes.
