Webhook Debugging Checklist: Fix Missing Webhook Events

Jamie has been staring at the same three browser tabs for two hours. Stripe's event log says the payment_intent.succeeded webhook was delivered. The request log says 200 OK, response time 43ms. No exceptions in Datadog. But the order table is empty, and the customer just filed a support ticket: "I paid but my order never arrived."

This scenario plays out constantly. The provider says it worked. The application says it worked. The database disagrees. The gap between those three statements is where most webhook debugging goes wrong — because engineers start at the wrong layer. If you want to understand silent failures at a deeper level, the webhook revenue leakage post covers the business impact of these gaps in detail.

Here is the checklist, ordered by likelihood. Start at the top and work down. Each layer has a distinct failure signature. For a broader guide to the full debugging workflow, see the webhook debugging guide.

Before You Dig In: Confirm the Event Exists

Before assuming something is broken, confirm the event actually fired and reached your server.

Check the provider's event log first. In Stripe's dashboard, go to Developers > Webhooks > your endpoint. Look at the event delivery history. You want to see the event ID, the timestamp, and the response code your server returned. If the response is blank or timed out, your server either never received the request or did not answer in time. If it shows 200, you received it and told Stripe you were fine.

Test with a known-good payload. Use Stripe's "Send test webhook" feature or route a test event through Webhook.site before touching your application. This tells you whether the problem is in your endpoint registration, your firewall, or your handler code.

If the event shows up in Webhook.site but not in your logs, the problem is network-level: a firewall rule, a load balancer config, or a wrong endpoint URL in Stripe (staging URL in production, http vs https, trailing slash mismatch).
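URL mismatches like these are easy to miss by eye. Here is a minimal sketch of a comparison helper built on Node's URL class. The function name diffEndpointUrl and the example URLs are hypothetical, not part of any provider SDK:

```javascript
// Hypothetical helper: compare the endpoint URL registered with the provider
// against the URL your server actually serves, and list every difference.
function diffEndpointUrl(registered, expected) {
  const a = new URL(registered);
  const b = new URL(expected);
  const problems = [];

  if (a.protocol !== b.protocol) {
    problems.push(`protocol: ${a.protocol} vs ${b.protocol}`);
  }
  if (a.host !== b.host) {
    problems.push(`host: ${a.host} vs ${b.host}`);
  }
  // Compare paths with trailing slashes stripped first, so a slash-only
  // difference is reported separately from a genuinely different path.
  const strip = (p) => p.replace(/\/+$/, '');
  if (strip(a.pathname) !== strip(b.pathname)) {
    problems.push(`path: ${a.pathname} vs ${b.pathname}`);
  } else if (a.pathname !== b.pathname) {
    problems.push(`trailing slash: ${a.pathname} vs ${b.pathname}`);
  }
  return problems;
}

// Example URLs are made up for illustration.
console.log(
  diffEndpointUrl(
    'http://staging.example.com/webhooks/stripe/',
    'https://api.example.com/webhooks/stripe'
  )
);
```

An empty array means the two URLs agree on protocol, host, and path.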

If the event shows up in your logs with a 200 response, keep reading.

Layer 1: Signature Verification Failure

This is the most common silent failure. It is silent because many implementations catch the verification error and return 200 anyway to avoid retries. See Stripe's signature verification docs for the complete spec, and HMAC RFC 2104 for the underlying cryptographic standard.

Stripe signs webhooks with HMAC-SHA256. The signature is computed over a timestamp plus the raw request body: the literal bytes as they arrived on the wire. If anything alters that body before verification, the signature check fails. The full guide on webhook signature verification covers implementation for Stripe, Twilio, and GitHub.

The raw body problem. Express's express.json() middleware parses the body and replaces req.body with a JavaScript object. Stringifying that object back does not produce the original bytes. Any whitespace difference, key ordering change, or Unicode normalization will break the signature.
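To see why re-serializing breaks verification, here is a sketch that signs a payload in the style Stripe documents (HMAC-SHA256 over the timestamp, a dot, and the payload), using only Node's crypto module. The secret, timestamp, and payload are made-up values, and the sign and verify helpers are hypothetical. In production, rely on stripe.webhooks.constructEvent() rather than hand-rolled verification.

```javascript
const crypto = require('node:crypto');

// Made-up values for illustration only.
const secret = 'whsec_test_secret';
const timestamp = 1700000000;
// Raw bytes as they might arrive on the wire: note the spacing.
const rawBody = Buffer.from('{"id": "evt_123",  "amount": 1000}', 'utf8');

// Sign "<timestamp>.<payload>" with HMAC-SHA256 and hex-encode the result.
function sign(payload, ts, key) {
  return crypto
    .createHmac('sha256', key)
    .update(`${ts}.`)
    .update(payload)
    .digest('hex');
}

// Constant-time comparison of the expected and received signatures.
function verify(payload, ts, key, receivedHex) {
  const expected = Buffer.from(sign(payload, ts, key), 'hex');
  const received = Buffer.from(receivedHex, 'hex');
  return (
    expected.length === received.length &&
    crypto.timingSafeEqual(expected, received)
  );
}

const signature = sign(rawBody, timestamp, secret);

// What a handler has after express.json(): an object, re-serialized.
const reserialized = JSON.stringify(JSON.parse(rawBody.toString('utf8')));

console.log(verify(rawBody, timestamp, secret, signature)); // true
console.log(verify(reserialized, timestamp, secret, signature)); // false
```

The two payloads describe the same JSON value, but the re-serialized string has lost the original spacing, so its HMAC no longer matches.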

The correct pattern is to capture the raw body before JSON parsing:

// CORRECT: capture raw body for signature verification.
// Register this before any global express.json() so the body is still a Buffer.
app.use('/webhooks/stripe', express.raw({ type: 'application/json' }));

// WRONG: this runs json() first, then tries to verify. The signature check will fail.
app.use(express.json());
app.post('/webhooks/stripe', (req, res) => {
  const sig = req.headers['stripe-signature'];
  const secret = process.env.STRIPE_WEBHOOK_SECRET;
  // req.body is already a parsed object here, so constructEvent throws
  const event = stripe.webhooks.constructEvent(req.body, sig, secret);
});

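The express.raw() middleware gives you a Buffer in req.body. Pass that Buffer directly to stripe.webhooks.constructEvent(), which accepts either a Buffer or the unmodified string. Do not JSON.parse it and re-serialize it, and do not let any earlier middleware rewrite or re-encode the body. Logging the Buffer does not change it, but always verify the original bytes, never a parsed or re-encoded copy.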

The failure mode. When the signature check fails and your code catches the error:

try {
  const event = stripe.webhooks.constructEvent(req.body, sig, secret);
} catch (err) {
  // Many codebases log here and return 200 to "avoid retries"
  console.error('Signature verification failed', err.message);
  return res.status(200).send('ok'); // <-- THIS IS THE BUG
}

Returning 200 on a verification failure tells Stripe "got it, processed it." The event is marked delivered. Your handler did nothing. The database is empty.

Return 400 on signature failure. Stripe will then retry the event, which is the correct behavior: the failed delivery stays visible in the dashboard until you fix it. Fix the raw body issue, not the response code.
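A framework-free sketch of that rule follows. The function respondToWebhook and its verify argument are hypothetical stand-ins. In an Express app, verify would wrap stripe.webhooks.constructEvent(), which throws on a bad signature, and the returned status would go to res.status().

```javascript
// Hypothetical decision function: maps a verification attempt to an HTTP
// response. `verify` must throw on a bad signature and return the parsed
// event otherwise.
function respondToWebhook(rawBody, signatureHeader, verify) {
  let event;
  try {
    event = verify(rawBody, signatureHeader);
  } catch (err) {
    // A non-2xx status tells the provider the event was NOT accepted, so it
    // stays visible as a failed delivery and gets retried.
    return {
      status: 400,
      body: `Signature verification failed: ${err.message}`,
      event: null,
    };
  }
  return { status: 200, body: 'ok', event };
}

// Stub verifier for demonstration: accepts only the header "valid".
const stubVerify = (body, header) => {
  if (header !== 'valid') throw new Error('bad signature');
  return JSON.parse(body.toString('utf8'));
};

const payload = Buffer.from('{"id":"evt_1"}');
console.log(respondToWebhook(payload, 'valid', stubVerify).status); // 200
console.log(respondToWebhook(payload, 'forged', stubVerify).status); // 400
```

Keeping the decision in a small pure function also makes the 400 path easy to unit test, which is where this bug usually hides.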

Layer 2: Handler Throwing but Caught

Your handler returned 200. That means the code ran to the point of calling res.send(). But the processing logic may still have thrown, either in a branch that was caught before the response or in work that kept running after it.

The swallowed error pattern.

app.post('/webhooks/stripe', (req, res) => {
  const event = stripe.webhooks.constructEvent(req.body, req.headers['stripe-signature'], process.env.STRIPE_WEBHOOK_SECRET);

  // Respond immediately — this is good for avoiding timeouts
  res.status(200).send('ok');

  // But if processPayment throws, nothing catches it
  processPayment(event.data.object); // async function, not awaited, not caught
});

If processPayment is an async function and you do not await it (and you should not, if you want to avoid timeouts), any error it throws becomes an unhandled promise rejection. In Node.js before version 15, an unhandled rejection only printed a warning and the process kept running, so the failure was easy to miss. From version 15 onward the default is to crash the process, but only if you have not registered a process.on('unhandledRejection') handler, and many production codebases do register one that logs and continues.

The fix is explicit error handling in your async processor:

async function processPayment(paymentIntent) {
  try {
    await db.orders.create({ ... });
    await sendConfirmationEmail({ ... });
  } catch (err) {
    // Log with enough context to diagnose later
    logger.error({
      err,
      paymentIntentId: paymentIntent.id,
      msg: 'Failed to process payment_intent.succeeded'
    });
    // Optionally: push to a dead letter queue, alert on-call
    await deadLetterQueue.push({ event: 'payment_intent.succeeded', data: paymentIntent, error: err.message });
  }
}
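As a second line of defense, the caller can attach a .catch() to the un-awaited promise, so a rejection can never go unhandled even if the processor's own try/catch is later removed. This is a sketch with hypothetical names (ackThenProcess, onFailure) and a stand-in response object, not a drop-in Express handler:

```javascript
// Hypothetical wrapper: acknowledge first, then run the processor without
// blocking the response, and always route a failure to onFailure.
function ackThenProcess(res, event, processor, onFailure) {
  res.status(200).send('ok');

  // Promise.resolve().then(...) also converts a synchronous throw inside
  // processor into a rejection, so both kinds of failure reach onFailure.
  return Promise.resolve()
    .then(() => processor(event))
    .catch((err) => onFailure(err, event));
}

// Stand-in for an Express response, for demonstration only.
const fakeRes = {
  statusCode: null,
  status(code) {
    this.statusCode = code;
    return this;
  },
  send() {
    return this;
  },
};

const failures = [];
const done = ackThenProcess(
  fakeRes,
  { id: 'evt_1' },
  async () => {
    throw new Error('db unavailable');
  },
  (err, event) => failures.push({ id: event.id, message: err.message })
);

done.then(() => console.log(fakeRes.statusCode, failures));
```

In a real handler, onFailure is where the structured log line and the dead letter queue push would go.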

Check your error tracking tool (Sentry, Datadog, etc.) for unhandled promise rejections. They often have a different fingerprint from handled errors and show up in a different view.

Layer 3: Async Processing Failure

You acknowledged the event immediately with 200 and pushed the work to a queue. The 200 was genuine: you did receive the event. But the queue job failed after the response. The async processing layer is where many silent failures live. For the full pattern on structuring async handlers, see webhook handler async patterns.

This layer is harder to debug because the failure happens asynchronously, potentially minutes after the webhook arrived. The logs for the HTTP request show success. The failure is in a completely different process.

Failure modes here:

Worker crashed after dequeuing the job but before completing it
Queue was full and the push operation failed silently (check your queue library's error handling on enqueue)
Job was processed but hit a timeout and was retried, and the retry logic has a bug
Worker connected to the wrong database (wrong environment variable in the worker process)

How to detect this layer. If you have queue metrics, check job completion rate vs enqueue rate. If they do not match, jobs are being lost. Check your worker logs specifically — not the web process logs.
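If your queue exposes per-job records rather than only aggregate counters, the same comparison can be done per event. A minimal sketch, assuming each job record carries an enqueue time and an optional completion time in milliseconds. The record shape and the name findStuckJobs are hypothetical, not from any particular queue library:

```javascript
// Hypothetical reconciliation: return the IDs of jobs that were enqueued
// more than maxAgeMs ago and still have no completion time.
function findStuckJobs(jobs, now, maxAgeMs) {
  return jobs
    .filter((job) => job.completedAt == null && now - job.enqueuedAt > maxAgeMs)
    .map((job) => job.id);
}

// Made-up job records for illustration.
const now = 1_000_000;
const jobs = [
  { id: 'job_a', enqueuedAt: now - 90_000, completedAt: now - 80_000 },
  { id: 'job_b', enqueuedAt: now - 90_000, completedAt: null },
  { id: 'job_c', enqueuedAt: now - 5_000, completedAt: null },
];

console.log(findStuckJobs(jobs, now, 60_000)); // [ 'job_b' ]
```

Here job_c is still within its allowed age, so only job_b is reported as lost or stuck.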

The deeper problem: even if you fix the queue failure, how do you verify the job completed? The webhook handler returned 200 three hours ago. The customer is on the phone. Did the job run? Did it fail? Did it run and fail silently?

This is the gap that outcome receipts address. If the worker sends a signed receipt after the DB write commits, you have a timestamp and a confirmation that the work actually finished. If the receipt never arrives, you know the async processing failed somewhere.
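To make the idea concrete, here is a sketch of what producing and checking such a receipt could look like. The receipt fields, the shared secret, and the function names are all assumptions made for illustration. This is not HookTunnel's actual receipt format.

```javascript
const crypto = require('node:crypto');

// Hypothetical receipt format: a JSON body plus an HMAC-SHA256 signature.
// Call buildReceipt only AFTER the database transaction has committed.
function buildReceipt(eventId, committedAt, secret) {
  const body = JSON.stringify({ eventId, committedAt, status: 'applied' });
  const signature = crypto
    .createHmac('sha256', secret)
    .update(body)
    .digest('hex');
  return { body, signature };
}

// The receiving side recomputes the HMAC over the exact body string it got
// and compares in constant time.
function verifyReceipt(receipt, secret) {
  const expected = crypto
    .createHmac('sha256', secret)
    .update(receipt.body)
    .digest();
  const received = Buffer.from(receipt.signature, 'hex');
  return (
    expected.length === received.length &&
    crypto.timingSafeEqual(expected, received)
  );
}

// Made-up values for illustration.
const receipt = buildReceipt('evt_123', '2024-01-01T00:00:00Z', 'receipt_secret');
const tampered = { ...receipt, body: receipt.body.replace('evt_123', 'evt_999') };

console.log(verifyReceipt(receipt, 'receipt_secret')); // true
console.log(verifyReceipt(tampered, 'receipt_secret')); // false
```

The important design choice is the call site, not the crypto: the receipt is only meaningful if it is built after COMMIT returns.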

Layer 4: DB Write Silently Failing

The queue job ran. No exceptions. The worker completed. But the row is not in the database. A silent DB write failure is the hardest layer to diagnose after the fact — and the one outcome receipts are specifically designed to surface.

It is rarer than the earlier layers. Here are the specific patterns:

Connection pool exhaustion. Your application has a pool of, say, 10 database connections. Under load, all 10 are in use. A new query waits for a connection. If the wait timeout is short, the query fails with a timeout error. If you caught that error and logged it at DEBUG level, you may never see it.

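const { Pool } = require('pg');

// pg pool with a short acquire timeout
const pool = new Pool({
  max: 10,
  connectionTimeoutMillis: 2000, // pool.connect() and pool.query() reject after 2s if no connection frees up
});

// This listener only receives errors from idle clients (for example, a dropped backend connection).
// Pool-exhaustion timeouts reject the promise at the call site instead. Are you checking both?
pool.on('error', (err) => {
  console.error('Unexpected error on idle client', err);
});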

Check your connection pool metrics. If utilization regularly sits above 80%, or callers are queuing for connections, you are at risk of timing out queries under load.
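If you have no pool dashboard yet, a snapshot can be computed from the counters that pg's Pool exposes (totalCount, idleCount, waitingCount). The sketch below takes any object with those three fields, so it runs here against a plain stand-in. The name poolSnapshot and the explicit max argument are assumptions made for illustration.

```javascript
// Works with any object exposing the counters that pg's Pool provides:
// totalCount (open clients), idleCount (open but unused), and
// waitingCount (callers queued for a client). `max` is passed explicitly.
function poolSnapshot(pool, max) {
  const inUse = pool.totalCount - pool.idleCount;
  return {
    inUse,
    waiting: pool.waitingCount,
    utilization: inUse / max,
    // Any queued caller means new queries are already waiting on the pool.
    saturated: pool.waitingCount > 0 || inUse >= max,
  };
}

// Plain object standing in for a real Pool in this demonstration.
const snapshot = poolSnapshot({ totalCount: 10, idleCount: 1, waitingCount: 3 }, 10);
console.log(snapshot);
```

Logging a snapshot like this on an interval, or whenever a query times out, gives you the utilization history the check above depends on.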

Transaction rollback. Your handler wraps the write in a transaction. An error occurs inside the transaction (constraint violation, foreign key mismatch, trigger rejection). The transaction rolls back. If the error handling does not distinguish "transaction rolled back" from "operation completed," the caller may believe the write succeeded.

// This pattern can lose the rollback
const client = await pool.connect();
try {
  await client.query('BEGIN');
  await client.query('INSERT INTO orders ...');
  await client.query('INSERT INTO order_items ...');  // FK constraint fails
  await client.query('COMMIT');
} catch (err) {
  await client.query('ROLLBACK');
  // If you swallow err here and return normally, the caller thinks it worked
  logger.warn('Order creation failed, rolled back'); // WARN-level log with no error details, easy to miss
} finally {
  client.release();
}
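One way to avoid losing the rollback is to rethrow after ROLLBACK, so the job is marked failed instead of completed. A sketch under stated assumptions: the column lists are hypothetical (the originals are elided above), and client is anything with an async query(sql, params) method, such as a client checked out from a pg Pool. A fake client stands in here so the example runs on its own.

```javascript
// Sketch: roll back, then rethrow so the caller cannot mistake a rollback
// for success. Table columns are hypothetical.
async function createOrder(client, order) {
  try {
    await client.query('BEGIN');
    await client.query('INSERT INTO orders (id) VALUES ($1)', [order.id]);
    await client.query(
      'INSERT INTO order_items (order_id, sku) VALUES ($1, $2)',
      [order.id, order.sku]
    );
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err; // propagate: the job should be marked failed and retried
  }
}

// Fake client that fails on the second INSERT, for demonstration only.
const statements = [];
const fakeClient = {
  async query(sql) {
    statements.push(sql.split(' ')[0]);
    if (sql.startsWith('INSERT INTO order_items')) {
      throw new Error('FK violation');
    }
  },
};

const outcome = createOrder(fakeClient, { id: 'ord_1', sku: 'sku_1' }).then(
  () => 'committed',
  (err) => `failed: ${err.message}`
);
outcome.then((result) => console.log(result, statements));
```

With a real pool, the caller would still release the client in a finally block, as in the snippet above.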

Constraint violation on duplicate. If you have a unique constraint on something like stripe_payment_intent_id and the event was retried, the second insert fails with a constraint error. If you catch that error and treat it as "already processed" without verifying the original write actually committed, you may incorrectly conclude success.
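A sketch of the safer handling: treat a duplicate as "already processed" only after confirming the original row is readable. Postgres reports unique violations with SQLSTATE 23505, which pg surfaces as err.code. The table and column names, the function name recordOrderOnce, and the in-memory fake client are assumptions made for illustration. The sketch also assumes the insert runs outside an explicit transaction, since a failed statement aborts the surrounding transaction in Postgres.

```javascript
// Postgres reports unique-constraint violations with SQLSTATE 23505.
const UNIQUE_VIOLATION = '23505';

// Sketch: on a duplicate, confirm the original row is really there before
// reporting success. Table and column names are hypothetical.
async function recordOrderOnce(client, paymentIntentId) {
  try {
    await client.query(
      'INSERT INTO orders (stripe_payment_intent_id) VALUES ($1)',
      [paymentIntentId]
    );
    return 'inserted';
  } catch (err) {
    if (err.code !== UNIQUE_VIOLATION) throw err;
    const existing = await client.query(
      'SELECT 1 FROM orders WHERE stripe_payment_intent_id = $1',
      [paymentIntentId]
    );
    if (existing.rows.length === 0) {
      throw new Error(`Duplicate reported but no row found for ${paymentIntentId}`);
    }
    return 'already-processed';
  }
}

// In-memory fake client, for demonstration only.
const fakeDb = {
  rows: new Set(['pi_1']),
  async query(sql, [id]) {
    if (sql.startsWith('INSERT')) {
      if (this.rows.has(id)) {
        throw Object.assign(new Error('duplicate key'), { code: UNIQUE_VIOLATION });
      }
      this.rows.add(id);
      return { rows: [] };
    }
    return { rows: this.rows.has(id) ? [{ ok: 1 }] : [] };
  },
};

const results = Promise.all([
  recordOrderOnce(fakeDb, 'pi_1'),
  recordOrderOnce(fakeDb, 'pi_2'),
]);
results.then((r) => console.log(r));
```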

The Missing Layer: How Do You Know Which Layer Failed?

Without outcome receipts, you cannot triangulate which layer caused a specific event to fail. Learn about why delivered doesn't mean applied for the full explanation.

The checklist above gives you four layers to check. But in practice, when Jamie is two hours in and the customer is waiting, you need to answer quickly: which layer did this specific event fail at?

This is the diagnostic value of outcome receipts.

The receipt is a signed HTTP callback that your application sends back to the webhook infrastructure after the database write commits: not after the handler returns 200, not after the queue job starts, but after the data is durably written. If the receipt arrives, you know the write committed. If it does not arrive but the delivery record shows a 200, you know the failure was at Layer 2, 3, or 4 (or at Layer 1, if your handler still returns 200 on signature failures).

Combined with the delivery record (the 200 response from your handler), you can now triangulate:

Event shows in provider log, no delivery record: Layer 0 — network, firewall, wrong endpoint
Delivery record exists, no receipt: Layer 2, 3, or 4 — handler ran, processing failed
Signature verification error in logs: Layer 1 — raw body issue
Delivery record exists, receipt exists: genuinely processed and confirmed
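The same triangulation can be written as a small decision function. This is a sketch: the flag names and classifyEvent are hypothetical, and each flag is something you would look up by hand for a single event ID.

```javascript
// Hypothetical classifier for the triangulation above.
function classifyEvent({ inProviderLog, deliveryRecorded, signatureError, receiptVerified }) {
  if (!inProviderLog) return 'Event never fired: check the provider side';
  if (!deliveryRecorded) return 'Layer 0: network, firewall, or wrong endpoint';
  if (signatureError) return 'Layer 1: signature verification (raw body)';
  if (!receiptVerified) return 'Layer 2, 3, or 4: handler ran, processing failed';
  return 'Applied: processed and confirmed';
}

// Example: delivered with a 200, no signature error, but no receipt.
console.log(
  classifyEvent({
    inProviderLog: true,
    deliveryRecorded: true,
    signatureError: false,
    receiptVerified: false,
  })
);
```

The checks run in the order the event travels, so the first failing check names the earliest layer that could be responsible.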

HookTunnel shows this as three explicit states: Delivered (you returned 200), Applied (the receipt arrived and was verified), and the gap between them. When Jamie pulls up the event in HookTunnel, the state is Delivered, not Applied. That immediately scopes the investigation to Layers 2-4 and, provided the handler returns 400 on signature failure, rules out the signature issue.

Two hours of debugging collapses to five minutes of reading the state.

Summary Checklist

In order:

Confirm the event exists in the provider's dashboard
Check the response code your server returned (not "is there a 200 in your logs" — check the provider's record)
Verify raw body middleware is in place before JSON parsing
Confirm signature verification returns 400 on failure, not 200
Check for unhandled promise rejections in your error tracker
Check queue job completion metrics against enqueue metrics
Check worker logs specifically (not web process logs)
Check connection pool utilization and timeout errors
Check for transaction rollback warnings at non-ERROR log levels
Check for constraint violations being caught and silently discarded

If you have outcome receipts instrumented, step 2 tells you whether the gap is delivery vs application and you can skip directly to the relevant layers.
