How Kafka's Exactly-Once Semantics Transformed Event Processing Reliability
Exactly-once delivery is the most abused phrase in distributed systems. Kafka Streams' implementation is one of the honest ones. Here's what it actually does — and what it doesn't.
Exactly-once delivery is the most abused phrase in distributed systems. Most implementations that claim it are actually providing at-least-once delivery with deduplication bolted on, or at-most-once with an acknowledgment that sometimes gets lost, or something in between that the vendor's marketing team decided to call exactly-once because it sounds better than "usually once."
Kafka's exactly-once semantics, shipped in version 0.11 and matured through Kafka Streams, is one of the honest implementations. The Confluent team did careful, documented work: idempotent producers with sequence number-based deduplication, a transactions API for atomic multi-partition operations, and exactly-once processing mode in Kafka Streams that composes these guarantees into a coherent application model. See the Apache Kafka documentation for the technical specification, and the Confluent blog for implementation deep-dives.
Let's talk about what it actually guarantees — and what it deliberately does not.
The Mechanics of Kafka EOS
Kafka's exactly-once semantics rests on two primitives: idempotent producers and the transactions API — both introduced in version 0.11 and composed into a coherent model by Kafka Streams.
Idempotent producers. Each producer is assigned a producer ID (PID) by the broker. Every message batch sent by the producer carries a monotonically increasing sequence number. The broker tracks the last sequence number received from each PID per partition. If a batch arrives with a sequence number the broker has already seen — because the producer retried after a network error — the broker discards the duplicate. The producer does not know whether the original delivery succeeded or failed; it just retries. The broker does the deduplication.
This eliminates duplicates from producer retries. It does not eliminate duplicates from application-level reprocessing, consumer restarts, or rebalances.
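The broker-side bookkeeping described above can be sketched in a few lines. This is an illustration of the idea, not Kafka's actual implementation; the class and field names are invented for the example.

```python
# Sketch (not Kafka source): broker-side dedup by (producer ID, sequence number)
# per partition. The broker remembers the last sequence it appended for each
# PID and drops any batch whose sequence it has already seen.

class BrokerPartition:
    def __init__(self):
        self.log = []        # appended batches
        self.last_seq = {}   # producer ID -> last sequence number appended

    def append(self, pid, seq, batch):
        """Append a batch unless this (pid, seq) was already appended."""
        if self.last_seq.get(pid, -1) >= seq:
            return False     # duplicate from a producer retry: discard
        self.log.append(batch)
        self.last_seq[pid] = seq
        return True

p = BrokerPartition()
assert p.append(pid=7, seq=0, batch=["a"]) is True
assert p.append(pid=7, seq=1, batch=["b"]) is True
assert p.append(pid=7, seq=1, batch=["b"]) is False  # retry of seq 1: dropped
assert len(p.log) == 2
```

Note that the retry is invisible to the producer: it sends the same batch again, and the broker's sequence check makes the second append a no-op.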
Transactions API. The transactions API allows a producer to atomically commit a batch of messages across multiple partitions. Combined with the consumer coordination protocol, it enables the critical pattern: read from a source partition, process, write results to output partitions, commit the source offset — atomically. Either all of that happens, or none of it does.
// Producer transaction for exactly-once processing
producer.initTransactions();
try {
    producer.beginTransaction();
    // Write results to output topic
    producer.send(new ProducerRecord<>("order-events-processed", orderId, result));
    // Commit input offset atomically with the output write
    Map<TopicPartition, OffsetAndMetadata> offsets = ...;
    producer.sendOffsetsToTransaction(offsets, consumerGroupMetadata);
    producer.commitTransaction();
} catch (ProducerFencedException e) {
    // Another producer instance has taken over — stop processing
    producer.close();
} catch (Exception e) {
    producer.abortTransaction();
}
Kafka Streams exactly-once processing mode. Kafka Streams composes these two primitives. When you enable processing.guarantee=exactly_once_v2 (the improved implementation available since Kafka 2.5), Kafka Streams handles the transaction management for you. Your stream processor reads input records, produces output records, and commits offsets — all atomically. If the processor fails between a write and a commit, the transaction is aborted. On restart, the processor re-reads and re-processes from the last committed offset.
StreamsConfig config = new StreamsConfig(Map.of(
    StreamsConfig.APPLICATION_ID_CONFIG, "webhook-processor",
    StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092",
    // v2 uses a single transaction per task per commit interval —
    // significantly better performance than EXACTLY_ONCE (v1)
    StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2
));
The result: the effects of processing each input record are committed to Kafka exactly once, even across failures, restarts, and rebalances. For the Kafka Streams topology — the read-process-write cycle — this is a genuine exactly-once guarantee.
What Exactly-Once Does Not Cover
The exactly-once guarantee applies to the Kafka Streams processing topology. It does not apply to side effects outside Kafka.
If your stream processor writes to a PostgreSQL database, Kafka's exactly-once guarantee cannot include that write. The database knows nothing about Kafka transactions. If your processor crashes after writing to the database but before committing the Kafka transaction, the transaction is aborted and Kafka Streams re-processes the record. Your database now has the record written twice — or has a write that was never committed in Kafka's view.
This is not a flaw in Kafka's implementation. It is an honest scope statement. Exactly-once is guaranteed for the Kafka topic I/O. For external systems, you need idempotent application code.
// You are responsible for idempotency in external writes
public void processWebhookEvent(WebhookEvent event) {
    // Kafka guarantees exactly-once processing of this event within the topology.
    // We are responsible for exactly-once semantics in our database write.
    orderRepository.upsert(
        event.orderId(),
        event.payload(),
        event.timestamp()
    );
    // upsert = idempotent by design: same orderId, same result
    // INSERT ... ON CONFLICT (order_id) DO UPDATE SET ...
}
Idempotent external writes are the developer's job. The pattern is well-understood — use upsert semantics, use natural business keys as idempotency keys, make state transitions safe to apply twice. But it requires explicit design. Kafka's exactly-once does not eliminate the need for it; it eliminates duplicates within the Kafka topology so that your external write logic only needs to handle the cases it generates, not cases created by Kafka's own processing.
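A minimal sketch of the third pattern, state transitions that are safe to apply twice. The transition table, the statuses, and the in-memory store are illustrative stand-ins for real business logic and a real database.

```python
# Sketch: an idempotent state transition keyed by a natural business key.
# Applying the same event twice is a harmless no-op; out-of-order or
# invalid transitions are rejected rather than applied blindly.

VALID = {("pending", "paid"), ("paid", "shipped")}  # allowed transitions

orders = {}  # order_id -> status, standing in for a database table

def apply_event(order_id, new_status):
    """Idempotent transition: re-applying the same event changes nothing."""
    current = orders.get(order_id, "pending")
    if current == new_status:
        return "duplicate"   # already applied: safe no-op
    if (current, new_status) not in VALID:
        return "rejected"    # invalid or out-of-order transition
    orders[order_id] = new_status
    return "applied"

assert apply_event("o1", "paid") == "applied"
assert apply_event("o1", "paid") == "duplicate"  # same event twice: harmless
assert orders["o1"] == "paid"
```

The business key (`order_id`) plus the target state is the idempotency key; no counter is incremented and no row is appended, so a replay cannot corrupt state.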
SQS FIFO: The Managed Alternative
Amazon SQS FIFO queues provide exactly-once processing within a 5-minute deduplication window. You send a message with a MessageDeduplicationId, and within 5 minutes, any message with the same ID is deduplicated at the broker level. Consumers receive each unique message exactly once during the dedup window.
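The window behavior can be simulated in a few lines. This is a model of the semantics, not SQS itself; the class and the in-memory store are invented for illustration.

```python
# Sketch: how a 5-minute deduplication window behaves. A message with a
# previously seen dedup ID is dropped inside the window, but the same ID
# sent after the window expires is treated as a brand-new message.

WINDOW_SECONDS = 5 * 60

class FifoQueue:
    def __init__(self):
        self.seen = {}        # dedup ID -> time last accepted
        self.delivered = []

    def send(self, dedup_id, body, now):
        """Accept the message unless its dedup ID is inside the window."""
        last = self.seen.get(dedup_id)
        if last is not None and now - last < WINDOW_SECONDS:
            return False      # duplicate within the window: dropped
        self.seen[dedup_id] = now
        self.delivered.append(body)
        return True

q = FifoQueue()
assert q.send("evt_1", "payment", now=0) is True
assert q.send("evt_1", "payment", now=60) is False   # within 5 minutes: dropped
assert q.send("evt_1", "payment", now=400) is True   # window expired: re-processed
assert len(q.delivered) == 2
```

The third call is the failure mode discussed later in this article: a provider re-fire hours after the original sails straight through the expired window.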
SQS FIFO is significantly simpler to operate. There is no Kafka cluster to manage, no replication to configure, no consumer group coordination to understand. You pay per message, set up a queue in the AWS console, and start processing. For teams already in the AWS ecosystem, the operational overhead is minimal.
The tradeoffs are real:
Throughput. SQS FIFO queues have a throughput ceiling of 3,000 messages per second per queue (with batching). Kafka partitions can handle hundreds of thousands of messages per second. For high-volume event processing, the ceiling matters.
Dedup window. The 5-minute deduplication window means that if you receive a duplicate message more than 5 minutes after the original, SQS will process it again. Kafka's idempotent producer deduplication has no time window — it is based on sequence numbers tracked per producer ID per partition.
Stream processing. Kafka Streams provides a rich stream processing API — windowed aggregations, join operations, stateful processing with local stores. SQS is a queue, not a stream processor. For complex event processing topologies, you would combine SQS with a separate compute layer (Lambda, ECS tasks). Kafka Streams is the compute layer.
Ordering. SQS FIFO provides per-message-group ordering. Kafka provides ordering per partition. For webhook events where ordering matters (status transitions, sequence-dependent operations), both can handle it — the configuration looks different.
# SQS FIFO — deduplication is message-level, handled by the broker
sqs.send_message(
    QueueUrl=FIFO_QUEUE_URL,
    MessageBody=json.dumps(payload),
    MessageGroupId=f"order-{payload['order_id']}",
    MessageDeduplicationId=payload['event_id'],  # provider-supplied event ID
)
For teams that need managed simplicity, reasonable throughput, and existing AWS investment — SQS FIFO is a serious option. For teams that need high throughput, stream processing, partition ordering, and no time-bounded dedup window — Kafka EOS is the better fit.
Where Both Tools Face the Same Problem
Neither Kafka EOS nor SQS FIFO can prevent duplicate events that originate before the message queue — at the HTTP boundary itself. This is the gap both tools share, and understanding it is critical for webhook system design. See also: webhook retry storms and Stripe duplicate webhook events for concrete examples.
Stripe fires payment.succeeded at your webhook endpoint. Your endpoint is down — a deploy is in progress, a worker is restarting, a database migration is locking the application. Stripe logs a failed delivery and schedules a retry in 1 hour.
Sixty minutes later, Stripe fires the retry. Your application is back up. The event is received, parsed, and published to your Kafka topic (or SQS FIFO queue). Your Kafka Streams processor picks it up and processes it exactly once. Perfect.
Two days later, a different Stripe engineer manually re-fires the event to check that your integration is working after a reported issue. Stripe fires payment.succeeded again. Your endpoint receives it. Your application parses it and publishes it to Kafka. This is a new message with a new sequence number — from Kafka's perspective, it is a distinct event. Your Kafka Streams processor processes it exactly once.
You have now processed the same logical payment event twice. Kafka's EOS guarantee was upheld — each Kafka message was processed exactly once. The duplicate originated before Kafka was involved, at the HTTP boundary.
The same problem exists with SQS FIFO. If the duplicate arrives more than 5 minutes after the original — which it almost certainly will — the MessageDeduplicationId is expired and SQS treats it as a new message.
Deduplication at the HTTP Boundary
HookTunnel operates at the point where the webhook arrives — before it enters your message queue, before your stream processor, before any EOS guarantees engage. This is the layer that Kafka and SQS cannot reach on their own. See HookTunnel features for how boundary deduplication works.
For Shopify, this is straightforward. Shopify includes a stable event ID in the X-Shopify-Webhook-Id header. HookTunnel stores this ID per hook and deduplicates at receipt: if an event with the same ID has already been captured, the duplicate is flagged. Your Kafka topic never sees it.
POST /h/wh_abc123 HTTP/1.1
X-Shopify-Webhook-Id: a7b8c9d0-1234-5678-90ab-cdef01234567
X-Shopify-Topic: orders/paid

{"id": 820982911946154500, ...}
For Stripe, the idempotency key is inside the event payload — data.object.id combined with the event type gives you a business-level dedup key. HookTunnel captures the full payload history per hook, giving you the forensic record you need to implement dedup before publishing to Kafka. You can inspect HookTunnel's history, identify the duplicate (same Stripe event ID appearing twice), and discard it before publishing.
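A sketch of that dedup step, run before anything is published to Kafka. The `publish` function, the in-memory seen-ID store, and the topic name are stand-ins; a production version would use a persistent store with a TTL and a real Kafka client.

```python
# Sketch: boundary dedup keyed by the Stripe event ID, applied before the
# event ever reaches the Kafka topic. The business-level dedup key combines
# the event type with data.object.id, as described above.

seen_event_ids = set()   # stand-in for a persistent store with a TTL
published = []           # stand-in for the Kafka topic

def publish(topic, key, payload):
    published.append((topic, key, payload))

def handle_stripe_webhook(event):
    event_id = event["id"]
    if event_id in seen_event_ids:
        return "duplicate"   # provider re-fire: never reaches Kafka
    seen_event_ids.add(event_id)
    dedup_key = f'{event["type"]}:{event["data"]["object"]["id"]}'
    publish("stripe-events", dedup_key, event)
    return "published"

evt = {"id": "evt_123", "type": "payment_intent.succeeded",
       "data": {"object": {"id": "pi_456"}}}
assert handle_stripe_webhook(evt) == "published"
assert handle_stripe_webhook(evt) == "duplicate"  # manual re-fire: dropped here
assert len(published) == 1
```

Because the check happens at the HTTP boundary, the downstream Kafka Streams topology only ever sees one copy of the logical event, regardless of how many times the provider delivers it.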
The full reliable stack for webhook event processing:
Provider fires webhook
↓
HookTunnel receives + stores (boundary dedup, capture, forensics)
↓
HookTunnel forwards to your endpoint
↓
Your endpoint publishes to Kafka topic
↓
Kafka Streams processes (EOS: exactly-once within the topology)
↓
Idempotent external writes (database upsert, etc.)
Each layer handles what it is designed for:
- HookTunnel: boundary dedup, payload capture, replay for missed deliveries
- Kafka: partition ordering, high-throughput ingestion, exactly-once stream processing
- Application code: idempotent external writes, business dedup logic
No layer is redundant. Boundary dedup (HookTunnel) eliminates provider-level duplicates. Kafka EOS eliminates processing-layer duplicates caused by Kafka's own internal retries. Application-level idempotency handles the external systems Kafka cannot control.
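The layering above can be reduced to a toy composition showing how each layer catches a different duplicate class. Every name here is an invented stand-in for the real component described in the text.

```python
# Sketch: three dedup layers composed in order. Each one handles the
# duplicate class the others cannot see: provider re-fires, producer
# retries, and re-applied external writes.

boundary_seen = set()   # layer 1: boundary dedup by provider event ID
broker_last_seq = {}    # layer 2: broker dedup by (PID, sequence number)
db = {}                 # layer 3: idempotent external store (upsert by key)

def pipeline(provider_event_id, pid, seq, order_id, payload):
    if provider_event_id in boundary_seen:
        return "dropped-at-boundary"        # provider-level duplicate
    boundary_seen.add(provider_event_id)
    if broker_last_seq.get(pid, -1) >= seq:
        return "dropped-by-broker"          # producer-retry duplicate
    broker_last_seq[pid] = seq
    db[order_id] = payload                  # upsert: safe to apply twice
    return "written"

assert pipeline("evt_1", pid=1, seq=0, order_id="o1", payload="paid") == "written"
assert pipeline("evt_1", pid=1, seq=1, order_id="o1", payload="paid") == "dropped-at-boundary"
assert len(db) == 1
```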
A Direct Assessment
Kafka's exactly-once semantics is some of the most careful distributed systems engineering in any open-source project. The idempotent producer design — sequence numbers, epoch-based producer fencing, broker-side dedup — is elegant and technically thorough. The exactly_once_v2 processing mode in Kafka Streams represents years of iteration on the original EOS implementation, with dramatically better performance and cleaner semantics.
If you are building high-throughput event processing and need partition ordering, exactly-once guarantees in the stream processing layer, and a rich stateful processing API — Kafka belongs in your stack. There is no managed alternative that matches it at scale.
SQS FIFO is the right answer for teams that need exactly-once processing with managed infrastructure and do not need Kafka's throughput or stream processing capabilities. Both are legitimate and mature.
The duplicate problem often starts at the HTTP boundary, before either system sees anything. A provider re-fires an event. A retry schedule overlaps with a manual re-test. A webhook is delivered twice because of a provider-side bug. These events arrive at your endpoint as distinct HTTP requests — with distinct sequence numbers, with distinct message IDs — and both Kafka and SQS will process them.
There is a new layer that closes that gap at the boundary. It costs $0 to start, $19/month for production-grade history and replay. It does not replace Kafka. It does not try to. It handles the part of the problem that exists before any message queue is involved.
Worth knowing it exists.
Stop guessing. Start proving.
Generate a webhook URL in one click. No signup required.
Get started free →