Replaying Into a Recovering System Is the Riskiest Thing You Do
The outage is over. Your handler is back. You have 200 events that did not deliver during the downtime. You open your webhook tool, select the failed events, and click Execute.
This is the moment that feels like relief. The incident is resolved. You are cleaning up. You are in control.
What you cannot see is whether the system on the other side of that replay is actually ready. Recovery is not binary. A service that just restarted is not the same as a service that has been running stably for three hours. A database connection pool that just recovered from exhaustion is not at the same risk level as one that has been healthy all day. A service that passed one health check probe is not the same as a service that has been returning 2xx for the last 300 consecutive requests.
When you replay 200 events into a half-recovered system, some of them land correctly. Some of them hit the service during a brief secondary failure and return 500 — but you already acknowledged them as replayed, so there is no automatic retry. Some of them create duplicate processing because a previous delivery attempt actually succeeded but the acknowledgement was lost during the outage, and now the same event is being processed a second time with no idempotency check.
You end up with partial data, unclear customer state, and a cleanup task that is harder than the original outage. You would have been better off waiting 15 more minutes before replaying.
Replay Prediction gives you the information you need to make that decision, automatically, before you click Execute.
The Story That Motivated This Feature
An engineering team at a SaaS company had a 45-minute outage on their payment processing service. During the outage, 200 webhook events backed up — subscription upgrades, payment confirmations, provisioning events. When the service came back, the team replayed the full batch immediately.
The service had recovered from the primary failure — a database connection pool issue — but had not fully drained its retry queue from the outage. Connection pool utilization was at 85% of capacity when the replay began. The first 60 events processed normally. Then the connection pool hit its limit again under the combined pressure of the replay and the lingering retry queue. Another brief failure.
Of the 200 replayed events, 80 were processed twice (they had partially landed during the original outage but the team did not know because no receipt infrastructure was in place). 120 were not processed at all — they hit the service during the secondary failure window and returned 503, but the replay tool marked them as "sent" and never retried them.
The cleanup took four hours. Customer support handled 60 inbound tickets. The incident review identified "replaying too early into a recovering system" as the root cause.
Here is what Replay Prediction would have shown at the time:
- Circuit breaker state: half-open (probing, not confirmed healthy)
- Error rate: 12% in the last 5 minutes (well above baseline)
- Target latency: p99 at 4.2 seconds (3x normal)
- Historical replay success rate for this target: 91%
- Delivery confidence: moderate (events are 47 minutes old, payloads are stable)
Combined score: Risky. The tool would have shown this rating with an explanation: "Target is in half-open circuit state with elevated error rate. Recommend waiting for circuit to close and error rate to return to baseline before replaying."
The team would have waited 15 minutes. By then, the connection pool had fully drained. The circuit would have closed. The error rate would have returned to near zero. A replay at that point would have been scored Safe. All 200 events would have delivered correctly.
The Five Factors
Factor 1: Circuit Breaker State
The circuit breaker has more information about your target's recent behavior than any other signal. If the circuit is open, the target has been failing hard enough and long enough to cross the automatic threshold. Replaying into an open circuit is not a replay — it is queuing, because delivery is blocked. The circuit must close before any replay events can land.
If the circuit is half-open, the target passed one probe request. One. That is weak evidence of recovery. Replay into a half-open target is possible, but it compounds the risk: you are sending many more events than the probe into a system that has only proven it can handle one.
If the circuit is closed and has been closed for at least a configurable stability window (default: 5 minutes), this factor contributes positively to the prediction. Closed for 5 minutes with no failure events is meaningful evidence that the target is stable, not just alive.
Factor 2: Recent Error Rate
The percentage of non-2xx responses in the last 5 minutes compared to the 1-hour rolling baseline. Even if the circuit is closed, a recent error rate above the baseline indicates residual instability. A service that is returning 3% errors right now versus its 0.1% baseline might have recovered from the primary failure but still be struggling with edge cases.
The comparison to the rolling baseline matters. A target that normally returns 1% errors at 2am when it is doing maintenance operations might have a 3% error rate right now simply because it is 2am and maintenance is running. Replay into that target is not risky — it is normal. The prediction compares current behavior to historical normal, not to an absolute threshold.
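A minimal sketch of baseline-relative scoring. The key property is that the score depends on the ratio to the target's own baseline, not on an absolute error percentage; the ratio thresholds and score values here are assumptions for illustration.

```python
def error_rate_factor(recent_rate: float, baseline_rate: float) -> float:
    """Score the last-5-minute error rate against the rolling baseline.

    Rates are fractions (0.03 == 3%). Thresholds are illustrative.
    """
    baseline = max(baseline_rate, 0.001)  # floor avoids division by zero on clean targets
    ratio = recent_rate / baseline
    if ratio <= 1.5:
        return 1.0   # within normal variation for this target
    if ratio <= 5.0:
        return 0.5   # elevated: residual instability likely
    return 0.0       # far above baseline: recovery is not trustworthy yet
```

Note how 3% errors against a 0.1% baseline (a 30x ratio) scores as alarming, while the same 3% against a 1% maintenance-window baseline does not.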
Factor 3: Target Latency Trend
Current p99 response latency compared to the 24-hour baseline. A recovering service often shows elevated latency before its error rate fully normalizes — in-flight requests that were waiting during the outage are still being processed, caches that were cleared during the restart are being rebuilt, connection pools are warming up.
Elevated latency is not a firm block on replay. It is a signal that the service is not yet operating at full capacity. Replay at elevated latency means each event takes longer to process, which extends the total replay window and increases the window during which a secondary failure could affect in-flight events.
If latency is above 2x the 24-hour baseline, this factor contributes negatively to the prediction. If latency is within normal range, it contributes positively.
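The 2x-baseline rule above could be sketched as follows. The 2x cutoff is from the description; the intermediate "warming up" band and the score values are illustrative assumptions.

```python
def latency_factor(current_p99_ms: float, baseline_p99_ms: float) -> float:
    """Score current p99 latency against the 24-hour baseline (illustrative)."""
    ratio = current_p99_ms / max(baseline_p99_ms, 1.0)
    if ratio <= 1.25:
        return 1.0   # within normal range: positive contribution
    if ratio <= 2.0:
        return 0.6   # elevated but under the 2x line: caches warming, pools refilling
    return 0.2       # above 2x baseline: negative contribution
```

In the opening story, a p99 of 4.2 seconds against a 1.4-second baseline (3x normal) would land squarely in the negative band.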
Factor 4: Historical Replay Success Rate
Every replay that has been executed through HookTunnel contributes to a historical success rate for the target URL. This rate tracks: of all replay attempts to this target over the last 30 days, what percentage completed with a 2xx response?
This factor exists because some targets are structurally harder to replay into than others. A target that processes events with a synchronous database write has a different replay risk profile than a target that enqueues events for async processing. A target with aggressive idempotency enforcement will handle replay cleanly. A target without idempotency checks may produce duplicate side effects even when replay "succeeds" from an HTTP response code perspective.
Historical replay success rate is an empirical measure of how this specific target has behaved during past replays. It is the most grounded signal in the prediction — it is not a model of what should happen, it is a record of what has actually happened.
Cold start default: if no replay history exists for a target, this factor defaults to a conservative 75% assumed success rate, which contributes a small negative influence to the prediction score. Conservative cold-start defaults mean the first replay into any target is appropriately cautious.
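The cold-start behavior is simple enough to show directly. The 75% default is from the description above; the function shape is an illustrative sketch.

```python
COLD_START_RATE = 0.75   # conservative assumed rate when no history exists

def historical_factor(successes: int, attempts: int) -> float:
    """30-day replay success rate for a target, with the cold-start default."""
    if attempts == 0:
        return COLD_START_RATE   # first replay into this target stays cautious
    return successes / attempts
```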
Factor 5: Delivery Confidence
Two components:
Event age: How long ago was the event originally received? Very recent events (under 1 hour) have maximum confidence — the payload is fresh, the business context is current, the target's handlers have not had time to produce stale state from other sources. Events older than 24 hours have reduced confidence, because the business state they represent may have changed through other means. If you are replaying a subscription.upgraded event from 48 hours ago, your target may have already processed that subscription change through a manual admin action or a different webhook path.
Payload freshness: Certain payload fields signal that the event was designed with eventual delivery in mind (idempotency keys, explicit timestamps, replay-safe field structures). Other payload structures suggest the event was designed to be processed immediately and may not be safe to replay into a system with hours of intervening state changes.
Delivery confidence does not block replay. It informs the total score and appears in the explanation so you understand the full picture.
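Combining the two components might look something like this. The age thresholds (under 1 hour, over 24 hours) come from the description; the payload field names checked, the score values, and the equal weighting of the two components are illustrative assumptions.

```python
from datetime import timedelta

def delivery_confidence(age: timedelta, payload: dict) -> float:
    """Blend event age and payload freshness into one confidence score."""
    # Event age component: fresh events score highest, stale events lowest.
    if age < timedelta(hours=1):
        age_score = 1.0
    elif age < timedelta(hours=24):
        age_score = 0.7
    else:
        age_score = 0.4   # business state may have changed through other means
    # Payload freshness component: replay-safe markers raise confidence.
    # These field names are hypothetical examples of such markers.
    markers = ("idempotency_key", "event_timestamp")
    payload_score = 1.0 if any(k in payload for k in markers) else 0.6
    return (age_score + payload_score) / 2
```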
The Combined Score
The five factors are combined into a weighted score. The weights are:
- Circuit breaker state: 30%
- Recent error rate: 25%
- Target latency trend: 20%
- Historical replay success rate: 15%
- Delivery confidence: 10%
The combined score maps to four ratings:
Safe — All factors are in normal range. Proceed with confidence.
Moderate — One or two factors are slightly elevated. Replay is likely fine but watch for anomalies in the first 20-30 events before committing the full batch.
Risky — Multiple factors are elevated, or one factor is significantly elevated (for example, half-open circuit state or error rate above 5x baseline). Wait before replaying, or replay a small test batch first.
Unlikely — The circuit is open, the target is actively failing. Replay is blocked at the UI level until the circuit closes or the circuit is force-closed with an explicit override.
Each rating is accompanied by a plain-language explanation: which factors contributed negatively, what they mean, and what action is recommended. The explanation is not a raw dump of numbers — it is a sentence or two that an engineer can read in 5 seconds and understand immediately.
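Putting the pieces together: the weights below are the ones listed above, and the open-circuit hard block matches the Unlikely rating. The rating thresholds and the convention that each factor is a score in [0, 1] are illustrative assumptions.

```python
WEIGHTS = {
    "circuit": 0.30,
    "error_rate": 0.25,
    "latency": 0.20,
    "history": 0.15,
    "confidence": 0.10,
}

def combined_score(factors: dict[str, float]) -> float:
    """Weighted sum of the five factor scores, each in [0, 1]."""
    return sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)

def rating(factors: dict[str, float]) -> str:
    """Map the combined score to one of the four ratings (thresholds assumed)."""
    if factors["circuit"] == 0.0:
        return "Unlikely"        # open circuit: replay is blocked outright
    score = combined_score(factors)
    if score >= 0.85:
        return "Safe"
    if score >= 0.65:
        return "Moderate"
    return "Risky"
```

Plugging in rough scores for the opening story (half-open circuit, 12% error rate, 3x latency, 91% history, moderate confidence) produces a low combined score and a Risky rating, consistent with the verdict the team would have seen.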
No Setup Required
Replay Prediction runs automatically using metrics that HookTunnel is already collecting for every hook. Circuit state, error rates, latency, replay history — all of these are produced as a side effect of normal operation.
You do not configure a prediction model. You do not tell HookTunnel anything about your target's expected behavior. The prediction is based on what has actually been observed.
The first time you replay through HookTunnel, some factors will have limited history. Cold-start defaults kick in for those factors. As your hooks accumulate history, the prediction becomes more calibrated to your actual environment. By the time you need the prediction most — during a real outage recovery — it has weeks or months of context about your targets' normal behavior.
Batch Replay Aggregate Score
When you use Batch Replay to replay a group of events, the prediction score is computed as an aggregate across all events in the batch. The aggregate considers:
- The individual event scores weighted by event age
- The target URL distribution across the batch (different events may target different URLs)
- The total volume of the replay and its expected duration at current target latency
The aggregate score is shown as a risk distribution: what percentage of events in the batch are Safe, Moderate, Risky, or Unlikely. You can filter the batch to replay only the Safe subset first, watch the results, and then proceed with the Moderate events once you have confirmed stability.
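The "filter to the Safe subset first" workflow can be sketched in a few lines. These function names and the percentage presentation are illustrative; the batch dialog is described above only at the level of the risk distribution it displays.

```python
from collections import Counter

RATINGS = ("Safe", "Moderate", "Risky", "Unlikely")

def batch_distribution(event_ratings: list[str]) -> dict[str, float]:
    """Percentage of events in each rating bucket, as the batch dialog shows."""
    counts = Counter(event_ratings)
    total = len(event_ratings)
    return {r: round(100 * counts.get(r, 0) / total, 1) for r in RATINGS}

def safe_subset(events: list[tuple[str, str]]) -> list[str]:
    """Filter (event_id, rating) pairs down to the Safe events for a first pass."""
    return [event_id for event_id, r in events if r == "Safe"]
```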
Comparison
| Capability | HookTunnel | ngrok | Webhook.site | Hookdeck | Svix |
|---|---|---|---|---|---|
| Pre-replay risk scoring | Yes | No | No | No | No |
| Circuit breaker state as replay input | Yes | No | No | No | No |
| Historical replay success rate tracking | Yes | No | No | No | No |
| Plain-language replay risk explanation | Yes | No | No | No | No |
| Batch replay aggregate risk score | Yes | No | No | No | No |
| Replay blocked when circuit is open | Yes | No | No | No | No |
No other webhook tool models replay risk. All competitors expose a replay button and leave the risk assessment entirely to the operator.
FAQ
Can I override the Risky or Unlikely rating and replay anyway?
Yes. Risky allows immediate replay with an acknowledged warning. Unlikely (open circuit) requires either waiting for the circuit to close or using the Force Close action on the circuit breaker, which itself requires confirmation and creates an audit log entry. There is no silent override — every override is logged.
How often is the risk score updated?
The factors that change most rapidly — error rate and latency — are updated with each incoming event delivery. Circuit state updates on state transitions. Historical replay success rate updates after each completed replay. When you open the replay confirmation dialog, the score is computed from the freshest available data at that moment.
Does the prediction account for my application's idempotency handling?
Indirectly, through the historical replay success rate. If your application handles replay cleanly — because it has strong idempotency enforcement — replays to that target will consistently return 2xx and not produce duplicate side effects. That pattern is reflected in the historical success rate over time. If your application has idempotency gaps, some replays will produce anomalous behavior that gets captured in the success rate and in Outcome Receipt state.
What if I have no replay history at all?
Cold-start defaults give you a prediction that is appropriately conservative but not blocking. You can still replay — the prediction will show Moderate for most normal-circuit-state scenarios with no history, prompting you to proceed cautiously and watch the first batch of results closely.
How does this interact with Guardrailed Replay?
Replay Prediction is the pre-replay decision layer. Guardrailed Replay is the in-flight safety layer. Prediction tells you whether to start the replay and at what confidence level. Guardrailed Replay monitors the replay as it executes, skipping events that have confirmed receipts and stopping if a receipt arrives mid-batch that changes the risk picture. They are complementary — Prediction is the before, Guardrailed Replay is the during.
Know the risk before you click Execute
Replay Prediction scores every replay against five live factors — so you are not guessing when it matters most.
Get started free →