Five lessons from running event-driven systems in production

Event-driven architecture is sold as the cure for tight coupling. It is — but it trades synchronous bugs you can see for asynchronous ones you can’t. Here are five lessons I keep relearning.

The shape of a resilient consumer — success commits, repeated failures go to a dead-letter topic, and a fixed bug lets you replay:

flowchart LR
    P[Producer] --> T{{Kafka topic}}
    T --> C[Consumer]
    C -->|success| D[(Datastore)]
    C -->|fails x N| DLQ[[Dead-letter topic]]
    DLQ -. replay after fix .-> C

1. Consumers must be idempotent

At-least-once delivery is the default almost everywhere. Your consumer will see the same message twice. Design every handler so reprocessing is a no-op:

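def handle(event):
    # Check and mark inside one transaction, so a concurrent redelivery
    # cannot slip in between the check and the write.
    with transaction():
        if already_processed(event.id):
            return  # duplicate delivery: nothing to do
        apply(event)
        mark_processed(event.id)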

2. Order is a feature you pay for

Kafka guarantees order within a partition, not across them. If two events must be processed in order, they must share a partition key. Pick the key carefully — usually the entity id (account, order, user).
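
A small sketch of why the key matters. This is not Kafka's actual partitioner; `partition_for` is a hypothetical helper that uses CRC32 purely as a stand-in for "some stable hash of the key, modulo the partition count". The partition count of 6 and the event names are also made up.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32, for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Events keyed by entity id, in the order the producer emitted them.
events = [
    ("account-42", "opened"),
    ("account-7", "opened"),
    ("account-42", "deposited"),
    ("account-42", "closed"),
]

# Simulate six partitions, each an append-only log.
partitions = {}
for key, action in events:
    partitions.setdefault(partition_for(key, 6), []).append((key, action))

# Every event for account-42 sits in one partition, in emission order.
p = partition_for("account-42", 6)
account_42_history = [action for key, action in partitions[p] if key == "account-42"]
```

Because the same key always hashes to the same partition, `account_42_history` comes out as opened, deposited, closed. Key by something unrelated to the entity (a random id, say) and those three events can scatter across partitions, with no ordering guarantee between them.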

3. Schemas are contracts

A producer that changes a field can silently break every consumer of the topic. Put a schema registry in front of the topic and make compatibility checks part of CI.

| Change | Safe? |
| --- | --- |
| Add optional field | ✅ |
| Remove a field | ❌ |
| Rename a field | ❌ |
| Widen an enum | ⚠️ |
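
As a rough idea of what a CI check could look like, here is a toy comparison of two schema versions. It is not a schema registry client and does not implement any real registry's compatibility rules; `Field` and `breaking_changes` are hypothetical names, and the rules simply mirror the table above (a rename shows up as a removal plus an addition). Enum widening is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    # Hypothetical, simplified field description.
    type: str
    required: bool = True


def breaking_changes(old: dict[str, Field], new: dict[str, Field]) -> list[str]:
    """List the changes between two schema versions that this toy check rejects."""
    problems = []
    for name, field in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name].type != field.type:
            problems.append(
                f"changed type of {name}: {field.type} -> {new[name].type}"
            )
    for name, field in new.items():
        if name not in old and field.required:
            problems.append(f"added required field: {name}")
    return problems


v1 = {"id": Field("string"), "amount": Field("int")}
# Adding an optional field passes.
v2 = {**v1, "note": Field("string", required=False)}
# Renaming amount to total is flagged twice: once as removal, once as addition.
v3 = {"id": Field("string"), "total": Field("int")}
```

In CI you would fail the build whenever `breaking_changes(deployed, proposed)` is non-empty. A real registry gives you this check, plus versioning and per-topic compatibility modes, so treat the sketch as an explanation of the idea rather than a replacement.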

4. Dead letters are not optional

A poison message will block a partition forever if you let it. Route failures to a dead-letter topic after N retries, alert on it, and keep the original payload so you can replay once the bug is fixed.
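
The retry-then-park logic can be sketched without any broker at all. Everything here is hypothetical: `consume`, `DeadLetter`, and the in-memory list standing in for the dead-letter topic. A real consumer would also back off between attempts and publish to an actual topic.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DeadLetter:
    payload: bytes  # the original message, untouched, so it can be replayed
    error: str      # why the last attempt failed
    attempts: int


def consume(
    payload: bytes,
    handler: Callable[[bytes], None],
    dead_letters: list[DeadLetter],
    max_attempts: int = 3,
) -> bool:
    """Try the handler up to max_attempts times, then park the message."""
    last_error = ""
    for _ in range(max_attempts):
        try:
            handler(payload)
            return True
        except Exception as exc:  # broad on purpose: any failure counts
            last_error = repr(exc)
    # Give up on this message so the rest of the partition can move on.
    dead_letters.append(DeadLetter(payload, last_error, max_attempts))
    return False
```

Replay is then just feeding `dead_letter.payload` back through `consume` once the handler is fixed, which is why keeping the payload byte-for-byte matters. The length of `dead_letters` (or the lag on the real topic) is the number to alert on.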

5. You can’t debug what you can’t trace

In a synchronous system a stack trace tells the whole story. In an event-driven one, a single user action fans out across services. Propagate a correlation id on every event and log it everywhere — it’s the thread that stitches the story back together.
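
One way to do this in Python, as a sketch: keep the id in a `ContextVar`, stamp it on every log record with a `logging.Filter`, and copy it onto every outgoing event. The event envelope (`headers` / `payload`), the `"-"` placeholder, and the function names are assumptions for illustration, not a standard.

```python
import logging
import uuid
from contextvars import ContextVar

# "-" means "no correlation id in scope yet".
correlation_id = ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Attach the current correlation id to every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


log = logging.getLogger("orders")
stream = logging.StreamHandler()
stream.setFormatter(logging.Formatter("%(correlation_id)s %(message)s"))
stream.addFilter(CorrelationFilter())
log.addHandler(stream)
log.setLevel(logging.INFO)


def new_event(payload: dict) -> dict:
    """Wrap a payload, reusing the id in scope or minting one at the edge."""
    cid = correlation_id.get()
    if cid == "-":
        cid = str(uuid.uuid4())
    return {"headers": {"correlation_id": cid}, "payload": payload}


def handle(event: dict) -> dict:
    """Process an event and emit a follow-up that carries the same id."""
    token = correlation_id.set(event["headers"]["correlation_id"])
    try:
        log.info("handling %s", event["payload"])
        return new_event({"derived_from": event["payload"]})
    finally:
        correlation_id.reset(token)  # do not leak the id into the next message
```

The first service mints the id, and every downstream handler reuses whatever arrived in the headers, so a single search on that id pulls the whole fan-out back into one timeline.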

Eventual consistency is fine. Eventual observability is not.

Closing

Event-driven systems reward teams that treat reliability as a first-class feature: idempotent consumers, explicit ordering, versioned schemas, dead-letter handling, and tracing from day one. Skip those and you pay for the decoupling you bought with the debuggability you lost.
