The Only Webhook Replay That Knows When to Stop

Every webhook replay tool does the same thing: resend the HTTP request. Point at an event. Click replay. The tool sends the same payload to the same URL again.

This is fine as a primitive. It is catastrophically insufficient as a recovery workflow.

The question "should I replay this event?" is not "did the HTTP request land?" It is "did my application already process this?" No conventional replay tool knows the answer to that question. They all assume you know. They all put the burden of preventing duplicates entirely on you.

Guardrailed replay is different. It knows the outcome state of every event before executing. It skips confirmed events automatically. It drops an event from a running batch the moment that event's receipt arrives. It requires an audit note before letting you override any guardrail. It is the only replay system that uses outcome state, because it is the only replay system built on top of receipt infrastructure.

The Horror Story That Every Engineering Team Has

It was a Tuesday afternoon. Your webhook handler went down for 90 minutes — a bad deploy, a misconfigured dependency, the usual. One hundred events backed up during the outage. Your handler came back up. Stripe retried some automatically. You decided to replay the rest from your webhook tool.

What you did not know, and could not know because no tool told you, was that thirty of those hundred events had already been processed. Stripe's automatic retries had delivered them once your handler came back up. The retries landed, returned 200, updated the database. Those thirty customers had correct state.

You replayed all 100.

Thirty duplicate orders. Thirty customers charged twice for the same subscription upgrade. Thirty simultaneous database writes fighting each other for the same rows. Three of them landed in a race condition and created corrupted subscription records.

The refund processing took four hours. The database repair took another two. Customer support handled forty inbound tickets. The incident postmortem identified "replay without idempotency checks" as the root cause. The action item was "add idempotency keys to our handler." You did. The duplicate problem came back eighteen months later, when a new engineer who did not know about the idempotency requirement wrote a handler that did not implement it.

This is not an unusual story. It is a standard story. The solution the industry proposes — idempotency keys in your application code — is correct but incomplete. It prevents the database-level duplicate. It does not prevent the operational confusion, the paralysis before replay, or the anxiety of not knowing which events are safe to resend.
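
For reference, the idempotency-key fix from that postmortem usually looks something like the sketch below. It assumes a SQLite database and the provider's event ID as the key. The table names and `handle_event` are illustrative, not taken from any real handler.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE upgrades (customer_id TEXT, plan TEXT)")
    return conn


def handle_event(conn: sqlite3.Connection, event: dict) -> bool:
    """Apply the event once. Returns False if it was already processed."""
    try:
        # One transaction: the idempotency key and the business write
        # commit together or roll back together.
        with conn:
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["id"],),
            )
            conn.execute(
                "INSERT INTO upgrades (customer_id, plan) VALUES (?, ?)",
                (event["customer_id"], event["plan"]),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key rejected a second copy of the same event ID.
        return False
```

This catches the duplicate write. It does nothing to tell you, before you click replay, which events already went through.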

Why Most Teams Do Not Replay At All

Ask any engineering team with more than six months of production webhook experience how they recover from outages. The answer is almost always: manually.

They look at the events that came in during the bad window. They write a script to check which customer IDs have correct state in the database. They replay only the ones that do not. They test the script on staging. They run it at 2am when traffic is low. They monitor the database writes in real time with a second terminal window open. They fix the two that went wrong.

This is the process. It is not good, but it is what happens when replay carries more risk than benefit. When every replay tool is "resend the HTTP request with no knowledge of whether it succeeded the first time," the only safe option is to bring human judgment into every decision.

Human judgment at 2am, after an incident, under pressure, is not a good place to put safety-critical decisions.

Guardrailed replay replaces that process with a system that carries the same knowledge you are trying to reconstruct manually — but gets it from receipt data, not from database queries you run yourself.

The Three Rules of Guardrailed Replay

Rule 1: Never Replay Applied Confirmed

If a receipt has arrived for an event — if your application code has sent a cryptographically verified confirmation that the database write committed — that event is excluded from replay automatically. Not optionally. Not with a warning you can click through. Excluded.

The guardrail is enforced at the replay engine level, not the UI level. You cannot accidentally replay an Applied Confirmed event through the dashboard. You cannot replay it via the API without sending an explicit override. The override requires an audit note that is permanently recorded. This is not a soft guard. It is a hard constraint on the replay execution path.

Why? Because a confirmed event that is replayed is a duplicate by definition. Your application code said it committed the write. If you replay the event, you are asking it to commit the write again. The idempotency key may catch it. The database constraint may catch it. Or it may not — depending on how your handler is written and how long ago the first write occurred. Guardrailed replay does not bet on your idempotency implementation being perfect.
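
A minimal sketch of what a hard constraint on the execution path could look like. The type names, state strings, and `authorize_replay` are assumptions made for illustration. They are not HookTunnel's actual API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    APPLIED_CONFIRMED = "applied_confirmed"
    APPLIED_UNKNOWN = "applied_unknown"
    APPLIED_FAILED = "applied_failed"


@dataclass(frozen=True)
class Event:
    event_id: str
    outcome: Outcome


class GuardrailViolation(Exception):
    """Raised when a replay is blocked by Rule 1."""


def authorize_replay(event: Event, override_note: Optional[str] = None) -> str:
    """Return "eligible" or "override", or raise if the replay is blocked.

    Only Rule 1 is modeled here: a confirmed event needs an explicit,
    non-empty audit note before it may be replayed.
    """
    if event.outcome is not Outcome.APPLIED_CONFIRMED:
        return "eligible"
    if override_note is None or not override_note.strip():
        raise GuardrailViolation(
            f"{event.event_id} is Applied Confirmed; an audit note is required"
        )
    return "override"
```

The check sits in the function every replay must pass through, not in a button handler, so the dashboard and the API hit the same wall.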

Rule 2: Stop Mid-Batch When a Receipt Arrives

Batch replay is a time-sequential operation. You select 80 events. Guardrailed replay starts sending them. Your handler processes them. Receipts arrive. The replay engine watches the receipt stream in real time.

If a receipt arrives for an event that is still in the queue — an event you are about to replay — the engine removes it from the batch automatically. The event was already processed, probably by a retry that came in while the batch was running. No duplicate sent.

This is the feature that makes batch replay safe in a messy recovery scenario. Outage recovery is not clean. Events arrive from multiple directions — Stripe retries, your previous replay attempt that partially executed, your handler processing some events before it went down. The receipt stream captures all of it. Mid-batch stoppage uses that stream to make real-time decisions about what to send next.
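
The mid-batch check can be sketched as a last-moment lookup before each send. This is an illustration under assumptions: `confirmed` stands in for the live receipt stream, and `run_batch` and `send` are hypothetical names.

```python
from collections import deque
from typing import Callable, Iterable, List, Set, Tuple


def run_batch(
    event_ids: Iterable[str],
    confirmed: Set[str],
    send: Callable[[str], None],
) -> Tuple[List[str], List[str]]:
    """Replay events in order, skipping any that get confirmed while queued.

    `confirmed` is expected to be updated by something else (a receipt
    listener) while the batch runs, so it is re-read before every send.
    """
    queue = deque(event_ids)
    sent: List[str] = []
    skipped: List[str] = []
    while queue:
        event_id = queue.popleft()
        # Check at the last possible moment, not once at batch start.
        if event_id in confirmed:
            skipped.append(event_id)
            continue
        send(event_id)
        sent.append(event_id)
    return sent, skipped
```

The design choice that matters is when the check happens. A filter applied once at selection time would miss every receipt that arrives after you press the button.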

Rule 3: Override Requires an Audit Note

There are legitimate reasons to replay a confirmed event. Your handler has a bug that wrote incorrect data — confirmed, but wrong. Your audit process requires re-running the business logic for compliance review. You are doing a data migration.

These are valid. Guardrailed replay does not prevent them. It requires you to say, in writing, why you are doing it. The note is mandatory. The note is permanent. It is attached to the replay event in the audit log along with your identity, the timestamp, and the outcome of the replay.

When someone asks "why was this replayed?" six months later, the answer is in the log. This is not punitive — it is the kind of documentation that makes post-incident review possible and that finance and compliance teams require.
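
The audit entry described above could be modeled roughly like this. The field names and the `OverrideAuditEntry` class are assumptions. The source only says the log holds the note, your identity, the timestamp, and the replay outcome.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: an entry cannot be edited after creation
class OverrideAuditEntry:
    event_id: str
    actor: str            # who performed the override
    note: str             # why, in writing
    replay_outcome: str   # e.g. "applied_confirmed" or "applied_failed"
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def __post_init__(self) -> None:
        # The note is mandatory: refuse to build an entry without one.
        if not self.note.strip():
            raise ValueError("an override requires a non-empty audit note")
```

A frozen record in an append-only log is one simple way to make "permanent" mean something. Real storage would also need write-once guarantees at the database level.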

Saturday On-Call

It is 11:42am on a Saturday. You are the engineer on call. Your phone went off twenty minutes ago: webhook handler is returning 500s. You got the handler back up at 11:38am. The handler was down for two hours and six minutes.

In your HookTunnel dashboard, you can see 156 events that arrived during the outage window. Without guardrails, you are looking at a paralysis problem. You do not know which of these events your handler managed to process, on a first delivery or on a Stripe retry. You do not know which retries landed in the two-hour window while you were asleep. You do not know which ones are safe to replay. You know that if you get it wrong, you create duplicates. On a Saturday. With no one else awake.

With guardrailed replay:

You open the replay view for the two-hour window. HookTunnel shows you the breakdown before you execute anything:

43 events: Applied Confirmed (receipts arrived, excluded from replay automatically)
89 events: Applied Unknown (delivered, no receipt, eligible for replay)
24 events: Applied Failed (your handler sent explicit failure receipts before it went down)

Dry run first. You click "Preview Replay" on the 89 unknown events. The dry run shows you: 89 events, estimated duration 4 minutes at current handler throughput, risk assessment Low (no confirmed events in set, no events with prior duplicate flags). The 24 failed events are listed separately — they need individual review because explicit failure receipts indicate your handler ran but the write did not commit.
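
The breakdown and the dry run are, at their core, a count by outcome state plus a duration estimate. A minimal sketch using the numbers from this scenario follows. The state strings, `preview_replay`, and the 0.4 events per second throughput are assumptions chosen so the estimate lands near four minutes.

```python
from collections import Counter
from typing import Dict, List


def preview_replay(states: List[str], throughput_per_sec: float) -> Dict[str, float]:
    """Summarize a replay without sending anything."""
    counts = Counter(states)
    eligible = counts["applied_unknown"]
    return {
        "excluded_confirmed": counts["applied_confirmed"],  # Rule 1
        "eligible": eligible,
        "needs_review": counts["applied_failed"],
        "estimated_seconds": eligible / throughput_per_sec,
    }


# The 156 events from the Saturday outage window.
states = (
    ["applied_confirmed"] * 43
    + ["applied_unknown"] * 89
    + ["applied_failed"] * 24
)
preview = preview_replay(states, throughput_per_sec=0.4)
```

Nothing in a preview touches your handler. It reads state, counts, and divides.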

You execute the 89. The dashboard shows the replay running in real time. Receipts start arriving. The confirmed count climbs: 50, 61, 73, 81. Three events stop mid-flight — they received receipts while in the queue, removed automatically. 86 events executed, 3 skipped by mid-batch guardrail.

Of the 86 executed: 78 succeed (receipts arrive within 10 seconds). 8 fail with a new error code: DB_CONSTRAINT_VIOLATION. Different bug. Something your fix did not address. Those 8 events are now in Applied Failed with explicit failure receipts and specific error codes. You know exactly which events, which customers, which error. You create a ticket, update the on-call handoff, and send the 8 affected customers a proactive email before they notice.

The 24 events with prior failure receipts go into your gap review queue. You resolve 21 manually after confirming the database state is correct for those customers. You replay the remaining 3.

Done by 12:43pm. One hour, including the incident response and the gap review. On a Saturday. Without waking anyone up. Without writing a single database query by hand.

The emotional arc of this scenario is the point. You went from "I don't know what to replay and I'm afraid to try" to "I know exactly what to replay, and the system is watching every event as it executes and making real-time decisions for me." That change in cognitive state — from anxiety to confidence — is what good operations infrastructure produces.

Why No One Else Can Build This

Guardrailed replay requires receipt infrastructure. Receipt infrastructure requires your application code to send signed confirmations after commits. That requires a protocol, a verification layer, and integration with the developer's handler.
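
To make "signed confirmation" concrete, here is one way a receipt could be signed and verified. The source does not specify the scheme, so HMAC-SHA256 over a JSON body, the field names, and both function names are assumptions, not HookTunnel's documented protocol.

```python
import hashlib
import hmac
import json


def sign_receipt(secret: bytes, event_id: str, status: str) -> dict:
    """Build a receipt after the database commit and sign its body."""
    # Canonical JSON so signer and verifier hash identical bytes.
    body = json.dumps(
        {"event_id": event_id, "status": status},
        sort_keys=True,
        separators=(",", ":"),
    )
    signature = hmac.new(secret, body.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"body": body, "signature": signature}


def verify_receipt(secret: bytes, receipt: dict) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = hmac.new(
        secret, receipt["body"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, receipt["signature"])
```

The handler would call something like `sign_receipt` only after its transaction commits. A receipt sent before the commit would confirm a write that might still roll back.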

Every other webhook tool on the market stops at delivery. They capture the request. They show you the response. They let you resend. They have no concept of application-layer outcome state because they have no mechanism to capture it.

This is not a feature gap they can close by adding a checkbox. It is an architectural gap. To have guardrailed replay, you need receipts. To have receipts, you need a verification protocol that your application participates in. That protocol is the product, and it is what makes everything in HookTunnel's outcome layer possible — the reconciliation, the SLA timers, the batch safety, the audit trail.

No competitor has it. Which means no competitor can have guardrails.

FAQ

What happens if my handler does not have receipt code?

Events land in Applied Unknown. Guardrailed replay treats Applied Unknown events as eligible for replay — they are not confirmed, so the guardrail does not block them. You lose the hard safety guarantee (you could still replay something that actually succeeded), but you gain the exponential backoff, the dry-run preview, the mid-batch monitoring, and the audit trail. Adding receipt code to your handler is the path to the hard guarantee.

Can I disable the guardrails temporarily?

No. You can override individual guardrail decisions with an audit note, but you cannot disable the system globally. The guardrails are not a feature flag — they are the architecture of how replay executes. This is intentional. "Disable safety checks temporarily" is how duplicates happen.

What is the batch size limit?

500 events per batch. Larger recovery operations are split into sequential batches automatically. Between batches, the system re-evaluates outcome state — events that received confirmations during the previous batch are removed from the next batch.
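
A sketch of how splitting with re-evaluation between batches might work. `plan_batches` and `is_confirmed` are hypothetical names. Only the 500-event limit comes from the answer above.

```python
from typing import Callable, Iterator, List

BATCH_LIMIT = 500


def plan_batches(
    event_ids: List[str],
    is_confirmed: Callable[[str], bool],
    limit: int = BATCH_LIMIT,
) -> Iterator[List[str]]:
    """Yield batches lazily, re-checking outcome state before each one."""
    remaining = list(event_ids)
    while remaining:
        # Drop anything confirmed since the previous batch was handed out.
        remaining = [e for e in remaining if not is_confirmed(e)]
        batch, remaining = remaining[:limit], remaining[limit:]
        if batch:
            yield batch
```

Because this is a generator, the filter runs only when the caller asks for the next batch. That is what lets receipts from batch one shrink batch two.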

Does guardrailed replay work with non-Stripe webhooks?

Yes. Guardrailed replay works on any webhook event in HookTunnel — GitHub, Twilio, your own internal events. The guardrail logic is receipt-state based, not event-type based. If an event has a confirmed receipt, it is protected.

What is the exponential backoff for failed events?

Failed events scheduled for replay use exponential backoff with jitter. The default policy is: 30s, 2m, 8m, 30m, 2h, 8h, 24h. The maximum retry count is configurable per hook. Each attempt is recorded. If the final attempt fails, the event moves to a terminal Applied Failed state and appears in your reconciliation gap view.
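
The default schedule grows by roughly a factor of four per step and tops out at 24 hours. A minimal sketch of picking the next delay follows. The plus or minus 10 percent uniform jitter and the `next_delay` function are assumptions, since the jitter strategy is not specified here.

```python
import random
from typing import Optional

# 30s, 2m, 8m, 30m, 2h, 8h, 24h expressed in seconds.
DEFAULT_SCHEDULE_SECONDS = [30, 120, 480, 1800, 7200, 28800, 86400]


def next_delay(
    attempt: int,
    jitter_fraction: float = 0.1,
    rng: Optional[random.Random] = None,
) -> Optional[float]:
    """Delay before retry number `attempt` (1-based).

    Returns None once the schedule is exhausted, which is where an event
    would move to its terminal failed state.
    """
    if attempt < 1 or attempt > len(DEFAULT_SCHEDULE_SECONDS):
        return None
    rng = rng or random.Random()
    base = DEFAULT_SCHEDULE_SECONDS[attempt - 1]
    # Spread retries out so a recovered handler is not hit all at once.
    return base * (1 + rng.uniform(-jitter_fraction, jitter_fraction))
```

Jitter matters most right after an outage, when many failed events share nearly the same first-failure time and would otherwise retry in lockstep.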