How Temporal's Retry Defaults Protect Your Webhook Handlers from Transient Failures
Most retry defaults are too conservative. Temporal's default is unlimited retries with exponential backoff. Configuring it *down* from unlimited is the deliberate choice. That's a different design philosophy.
What is your webhook handler's retry policy right now?
Most teams answer that question by describing their provider's policy. Stripe retries failed deliveries over 72 hours with an exponential schedule — about 6 times total. Shopify retries up to 19 times over 48 hours. GitHub sends three attempts. Those numbers describe what happens when the webhook cannot reach your endpoint at all.
They say nothing about what happens after your handler receives the payload and crashes.
Your application received the request. It returned 200. The provider's delivery is logged as successful. And then — inside your handler — the database write failed. The downstream API call timed out. The message queue rejected the publish. The provider does not know and does not care. From the outside, the delivery succeeded. From the inside, the processing failed silently.
That second failure mode — post-receipt processing failure — is what Temporal's retry policy addresses. And it addresses it with a default that should make most framework authors pause: unlimited retries, exponential backoff, until you configure it down.
Temporal's Retry Policy, Precisely
Temporal's default of unlimited retries is a design statement: failure is expected, retries should be automatic. The Temporal documentation covers the full retry configuration, and Temporal on GitHub shows the engineering behind the framework.
Temporal's activity retry policy has four primary knobs:
initialInterval — How long to wait before the first retry. Default: 1 second.
backoffCoefficient — The multiplier applied to the interval on each retry. Default: 2.0 (exponential doubling).
maximumInterval — The cap on the retry interval so it does not grow unboundedly. Default: 100x the initial interval (100 seconds at the defaults).
maximumAttempts — How many total attempts before the activity is considered failed. Default: 0, which means unlimited.
The design philosophy encoded in these defaults: transient failures should be retried indefinitely. If your database is down, Temporal will keep retrying — 1s, 2s, 4s, 8s, 16s, up to 100s, then 100s on each subsequent attempt — until the database comes back up and the activity succeeds. Your workflow continues. Your application code does not handle this. The framework does.
```typescript
import * as wf from '@temporalio/workflow';
import type * as myActivities from './activities';

// Explicit retry policy — but these are already the defaults
const activities = wf.proxyActivities<typeof myActivities>({
  retry: {
    initialInterval: '1s',
    backoffCoefficient: 2,
    maximumInterval: '100s',
    maximumAttempts: 0, // unlimited — the default
    nonRetryableErrorTypes: ['BusinessLogicError', 'ValidationError'],
  },
  scheduleToCloseTimeout: '30m',
});
```
```typescript
import { ApplicationFailure } from '@temporalio/common';
import { webhookSchema } from './schemas';

// Non-retryable errors are thrown with a specific type
export async function parseWebhookPayload(raw: unknown) {
  try {
    return webhookSchema.parse(raw);
  } catch (err) {
    // Schema validation failures should not be retried
    throw ApplicationFailure.create({
      type: 'ValidationError',
      message: `Invalid payload: ${(err as Error).message}`,
      nonRetryable: true,
    });
  }
}
```
The nonRetryableErrorTypes list is important. You do not want unlimited retries on a malformed payload — the payload will be malformed on attempt 1 and attempt 1,000. Business logic failures, validation errors, and explicit application rejections should be marked non-retryable. Everything else — network timeouts, temporary database unavailability, downstream service 503s — gets retried.
The distinction between workflow-level retries and activity-level retries matters. Activities are the units of work that get retried individually. A workflow can have its own retry policy, but typically you rely on activity retries — each step in the workflow can fail and retry independently without re-running the steps that already succeeded.
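Temporal's persistence machinery is far more involved, but the step-isolation idea can be sketched in a few lines of plain Python. This is an illustration of the concept, not Temporal code: completed steps are recorded in a durable result store, so a resumed or retried workflow skips them and re-runs only the step that failed.

```python
import time

def run_workflow(steps, results, initial=1.0, coeff=2.0, max_interval=100.0):
    """Run each named step at most once; retry only steps that have not
    yet succeeded. `results` stands in for durable state (Temporal keeps
    this in workflow event history; here it is just a dict)."""
    for name, step in steps:
        if name in results:
            continue  # step already succeeded on a previous run: skip it
        attempt = 0
        while True:
            try:
                results[name] = step()
                break
            except Exception:
                attempt += 1
                # exponential backoff between attempts, capped at max_interval
                time.sleep(min(initial * coeff ** (attempt - 1), max_interval))
    return results
```

Because `results` survives across runs, calling `run_workflow` again after a crash does not re-execute steps that already completed, which is the property the paragraph above describes.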
What Temporal Protects
A concrete failure scenario:
- Shopify fires orders/paid at your webhook endpoint.
- Your endpoint starts a Temporal workflow and returns 200.
- The workflow's first activity, persistOrderToDatabase, runs. The database is experiencing elevated latency. The query times out.
- Temporal retries persistOrderToDatabase after 1 second. Still timing out.
- Temporal retries after 2 seconds. 4 seconds. 8 seconds. The database recovers after 23 seconds of elevated latency.
- On the 6th attempt (about 30 seconds later), persistOrderToDatabase succeeds.
- The workflow continues to the next activity: notifyFulfillmentService.
From Shopify's perspective, the webhook was delivered successfully. From your application's perspective, the database write succeeded. From a developer's perspective, no code was written to handle the retry. Temporal handled it.
That is the promise. For infrastructure failures measured in seconds or minutes, Temporal's unlimited retries with exponential backoff mean your processing is resilient by default.
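The timing in that scenario follows directly from the default policy. A small sketch (just the arithmetic, not Temporal code) reproduces the delay schedule and the roughly 30-second window before the 6th attempt:

```python
def retry_schedule(retries, initial=1.0, coeff=2.0, max_interval=100.0):
    """Delay in seconds before each retry under Temporal's default policy:
    min(initial * coeff**(n-1), max_interval) for retry number n."""
    return [min(initial * coeff ** n, max_interval) for n in range(retries)]

delays = retry_schedule(8)
# Delays double until the 100s cap: [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 100.0]
print(delays)
# Total wait before the 6th attempt (5 retries): 1 + 2 + 4 + 8 + 16 = 31 seconds
print(sum(delays[:5]))
```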
Celery: The Python Alternative
Celery is the standard for background task processing in Python stacks: mature, battle-tested, and cleanly integrated with Redis, RabbitMQ, and a range of message brokers. The tradeoff against Temporal is abstraction level: Celery tasks handle their own retries explicitly, while Temporal activities do not know retries exist. For Python teams, it is often the natural choice.
Celery's retry default is more conservative: max_retries=3 (configurable, including None for infinite). Exponential backoff is not the default. You either opt in with retry_backoff=True alongside autoretry_for, or implement it manually with retry(countdown=2**self.request.retries) in your task. It is a pattern you apply, not something you get for free.
```python
from celery import shared_task

@shared_task(
    bind=True,
    max_retries=None,  # None for unlimited — you opt in to infinite
    default_retry_delay=1,
)
def process_webhook(self, payload: dict):
    try:
        persist_to_database(payload)
        notify_fulfillment(payload)
    except TemporaryDatabaseError as exc:
        # Exponential backoff implemented manually
        retry_delay = 2 ** self.request.retries
        raise self.retry(exc=exc, countdown=min(retry_delay, 100))
    except ValidationError:
        # Non-retryable — do not retry, just fail
        raise
```
This works well and is straightforward for any experienced Python developer. The retry logic is visible and explicit, which some teams prefer to Temporal's framework-managed approach.
The difference in abstraction level is significant. Celery's task is a function that handles its own retries. Temporal's activity is a function that does not know retries exist — the framework handles it. For complex workflows with multiple steps, Temporal's model means each step's retry behavior is isolated: a failure in step 3 does not re-run steps 1 and 2. With Celery, the entire task is the retry unit — if you want step-level idempotency, you implement it yourself.
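One common way to get step-level idempotency in a Celery task is to record a durable marker per (event, step), so a retry of the whole task skips the steps that already succeeded. A minimal sketch, with an in-memory set standing in for Redis or a database table; all names here are illustrative, not Celery APIs:

```python
calls = []

def persist_to_database(payload):  # stub standing in for the real step
    calls.append("persist")

def notify_fulfillment(payload):   # stub standing in for the real step
    calls.append("notify")

marker_store = set()  # stands in for Redis or a DB table

def run_once(event_id, step_name, fn, payload):
    """Run fn at most once per (event, step); retries skip completed steps."""
    key = f"{event_id}:{step_name}"
    if key in marker_store:
        return  # this step already succeeded on an earlier attempt
    fn(payload)
    marker_store.add(key)  # record success before moving to the next step

def process_webhook(event_id, payload):
    run_once(event_id, "persist", persist_to_database, payload)
    run_once(event_id, "notify", notify_fulfillment, payload)

process_webhook("evt_1", {})  # first attempt: both steps run
process_webhook("evt_1", {})  # a task retry: both steps are skipped
```

In production the markers must live in durable storage and should be written transactionally with the step's own side effect; this sketch only shows the shape of the pattern Temporal gives you without extra code.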
Celery is the right choice for Python teams who want a proven, well-understood task queue with full control over retry behavior and no new architectural paradigm to learn. Temporal is the right choice when you need step-level durability guarantees, cross-language support, or workflow executions that span more than a few minutes.
For webhook processing specifically: if your handlers are simple (one or two operations) and your stack is Python, Celery is a serious option. If your webhook handlers are complex multi-step processes or run on multiple language stacks, Temporal's approach is more powerful.
The Failure Moment Neither Tool Covers
Temporal's unlimited retries protect the processing that happens after your handler receives the webhook. Celery's retries protect the same moment. Both tools assume the payload has been successfully received before any retry logic can engage; the moment before receipt is the failure mode neither framework was designed for. See HookTunnel's webhook inspection features for boundary capture, the webhook debugging checklist for the full diagnostic workflow, and the flat $19/month Pro plan for what Pro replay provides at this layer.
But there is a prior failure: your handler is down when the webhook arrives.
Your Temporal workers are restarting after a deploy. Your Celery workers are being replaced by a rolling update. Your serverless function is cold-starting and the first request times out. Your entire application is down for a database migration.
In any of these cases, the webhook arrives at a moment when nothing is running to receive it. The failed delivery (a 503, a connection refused, a timeout) puts the event in the provider's retry queue. And then your retry window depends entirely on the provider: Stripe gives you 72 hours with about 6 retries, Shopify gives you 48 hours, GitHub gives you three attempts.
If your application has been down for longer than the provider's retry window — or if the provider's retries exhaust before you are back up — the payload is gone. Temporal never saw it. Celery never queued it. There is no durable execution for an event that was never received.
HookTunnel operates at this boundary. It is an HTTPS endpoint that receives the inbound webhook, stores the raw payload immediately, and keeps it for the history window — 24 hours on the Free tier, 30 days on Pro. It does this before any worker processing, before any task queue, before any durable execution framework.
When your workers come back up, the event is in HookTunnel's history. Pro users can replay it — re-send the original HTTP payload to your webhook endpoint. From there, your Temporal workflow or Celery task receives it normally, and all of your retry logic and durability guarantees apply.
The two layers, in sequence:
```
Provider fires webhook
        ↓
HookTunnel receives + stores payload (boundary capture)
        ↓
HookTunnel forwards to your endpoint
        ↓   [your workers may be down here — HookTunnel has the payload]
Your endpoint receives (when healthy)
        ↓
Temporal workflow starts / Celery task enqueues
        ↓
Activity retries handle transient processing failures
```
HookTunnel covers the left side of that diagram. Temporal and Celery cover the right side. For true end-to-end reliability, you need both layers.
The Full Failure Scenario, Mapped
What Temporal's unlimited retries protect:
- Your database is temporarily unavailable after the webhook arrives
- A downstream API returns 503 while your handler is processing
- A network partition causes intermittent timeouts during a multi-step workflow
- A worker crash mid-activity — Temporal replays, re-runs only the failed step
What HookTunnel protects:
- Your workers are down during a deploy when the webhook arrives
- Your serverless function cold-starts and the request times out at the provider
- A webhook arrives during a database migration window
- The provider fires a retry 4 hours after the original and your history has already been inspected and closed
These failure modes do not overlap. They are different moments in the same event lifecycle. Temporal's retry guarantees are meaningless for events that never reached the execution layer. HookTunnel's boundary capture is not relevant to processing failures that happen after receipt.
The teams who have encountered both failure modes — and most teams operating at scale have — end up combining both approaches. Not because any one tool is incomplete, but because these are genuinely different parts of the problem.
A Direct Assessment
Temporal's retry defaults are a design statement: failure is expected, retries should be automatic, and developers should write business logic not retry logic. That is a compelling philosophy. The implementation is serious — the team thought carefully about what should be retryable, what the default interval and coefficient should be, and what "unlimited" should mean in practice.
For teams ready to invest in the execution model, Temporal is one of the most reliable webhook processing backends available.
Celery is a legitimate alternative for Python shops that want simpler infrastructure and explicit control. It is not the safer choice — it is the simpler choice, with different tradeoffs.
And for the moments before your durable execution layer even starts — before Temporal has a workflow to retry, before Celery has a task to queue — there is a boundary layer worth knowing. It is new, it is inexpensive, and it covers the part of the problem that neither retry framework was designed for.
Not mutually exclusive. The retry policy matters after receipt. The boundary capture matters before it.
Stop guessing. Start proving.
Generate a webhook URL in one click. No signup required.
Get started free →